Mining apps for anomalies

A. Zeller    Saarland University, Saarbrücken, Germany

Abstract

Does a program do what it is supposed to do? One answer to this question could come from app mining – that is, from the knowledge encoded into the hundreds of thousands of apps available in app stores. App mining can help to determine what would be normal and abnormal behavior, and thus guide programmers and users toward better security and usability.

Keywords

App mining; CHABADA; Clusters; Malicious behavior; MUDFLOW; User interface analysis; Behavior patterns

The Million-Dollar Question

So you have some program. It can be yours, or it can come from a third party. You want to use it. But before you do so, you may ask the million-dollar question: does the program do what it is supposed to do? And will it continue to do so in the future? To answer this question, we first need to know what it is the program should do, or, conversely, what it should not do. While it is easy to state a couple of desired and undesired properties, there is a huge gray area that is surprisingly hard to define.

Take a simple mobile game like Flappy Bird, for instance. Your aim is to move a little bird up and down such that it does not hit an obstacle. As a developer, you don’t want the game to crash, that’s for sure; so “no crashes” would definitely be on the list of undesired properties—together with “not spying,” “not deleting all my files,” and other generic properties we usually take for granted.

But suppose you want to teach a computer to test the game—to check whether the desired properties are all there. You’d have to specify the basic gameplay, let the computer control the bird, and check whether the game correctly ends if, and only if, the bird has crashed into an obstacle. How would you specify all this? And is it not as complex as writing the program in the first place?

We humans need no such detailed specification of how to play the game. That’s because we humans rely on our expectations, which in turn are based on experience—in this case, our experience with similar games: if you ever played a Jump-And-Run game, you’ll quickly get the hang of Flappy Bird. But not only will you be able to play the game, you will also be able to check it against its description. All this is based on your experience. The question is: can we teach a computer how to do this—check a program against expectations? To do this, we must find a way to have a computer learn such expectations, or in other words, to learn what program behavior is normal in a given context. And this is where app stores and app mining come into play.

App Mining

The key idea of app mining is to leverage the knowledge encoded into the hundreds of thousands of apps available in app stores—specifically, to determine what would be normal (and thus expected) behavior, to detect what would be abnormal (possibly unexpected) behavior, and thus to guide programmers and users toward better security and usability.

From a researcher’s perspective, app stores are just collections of programs—but here, an app is a bit more than program code alone. Apps in app stores have three exciting features:

• First, apps come with all sorts of metadata, such as names, descriptions, categories, downloads, reviews, ratings, and user interfaces. All of these can be associated with program features, so you can, for instance, associate program behavior with descriptions. (You can also mine and associate just the metadata, finding that bad reviews correlate with low download numbers; in my opinion, though, this would be product mining, not app mining.)

• Second, apps are pretty much uniform, as they cater to one (mobile) platform only. They use the same binary format and the same libraries, which, on top of that, follow fairly recent designs. All this makes apps easy to analyze, execute, and test—and consequently, easy to compare.

• Third, apps are redundant. There are plenty of apps that all address similar problems—in similar or dissimilar ways. This is in sharp contrast to open source programs, where each solution would typically be implemented exactly once and then reused. This redundancy in apps allows us to learn common patterns of how problems are addressed—and, in turn, to detect anomalies.

All of this offers plenty of research opportunities; all you need is the data and a means to dig through it. The data, though, is not that easy to get. You cannot simply download a huge app collection from some research server—that would be a huge copyright violation. Instead, you have to download your own collection from the app store of your choice, one app after another, together with the associated metadata.

Depending on your investigation, you may need several thousand apps. Since the offering in the app store does not necessarily match what’s on users’ devices, you should focus on frequently downloaded apps from all app categories, from gaming to productivity. As we usually assume the large majority of these apps to be benign, these apps and their metadata then form your source of knowledge—the very knowledge a computer can and should use when it comes to identifying “normal” behavior.

Detecting Abnormal Behavior

One of the most important applications of app mining is to identify malicious behavior—that is, behavior that is directed against the user’s interests. But how do we know what the user’s interests are? And if we don’t know, how can we tell whether some behavior is malicious or not? By mining a set of benign apps, we can at least tell whether some behavior is normal or abnormal. If it’s normal, then it may well be expected and accepted; if it’s abnormal, though, then it may require further scrutiny.

The problem with “normal” behavior is that it varies according to the app’s purpose. If an app sends out text messages, for instance, that would normally be a sign of malicious behavior—unless it is a messaging application, where sending text messages is one of the advertised features. If an app continuously monitors your position, this might be malicious behavior—unless it is a tracking app that again advertises this as a feature. As a consequence, simply checking for a set of predefined “undesired” features is not enough—if the features are clearly advertised, then it is reasonable to assume the user tolerates, or even wants, these features, because otherwise she would not have chosen the app.

To determine what is normal, we thus must assess program behavior together with its description. If the behavior is advertised (or can be implied by the kind of application), then it’s fine; if not, it may come as a surprise to the user, and thus should be flagged. This is the idea we followed in our first app mining work, the CHABADA tool.

CHABADA stands for “Checking App Behavior Against Descriptions of Apps”; it is a general tool to detect mismatches between the behavior of an app and its description. CHABADA works in two stages:

1. CHABADA starts with a (large) set of apps to be analyzed. It first applies tried-and-proven natural language processing techniques (stemming, LDA (Latent Dirichlet Allocation), topic analysis) to abstract the app descriptions into topics. It builds clusters of those apps whose topics have the most in common. Thus, all apps whose descriptions refer to messaging end up in a “Messaging” cluster. You may also get “Games” clusters, “Office” clusters, or “Travel” clusters, as well as a number of clusters featuring doubtful apps; all these will reproduce realities from the app store. (The first sketch after this list illustrates this stage.)

2. Within each cluster, CHABADA will now search for outliers regarding app behavior. As a proxy for behavior, CHABADA simply uses the set of API calls contained in each app; these are easy to extract using simple static analysis tools. To identify outliers, CHABADA uses tried-and-proven outlier analysis techniques, which provide a ranking of the apps in a cluster, depending on how far their API usage is from the norm. Those apps that are ranked highest are the most likely outliers. (The second and third sketches after this list show the API extraction and the outlier ranking.)
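To make the first stage concrete, here is a minimal sketch in Python. It uses scikit-learn as a stand-in for CHABADA’s exact NLP pipeline; the app names, descriptions, topic counts, and cluster counts are invented, toy-sized values for illustration only.

```python
# A minimal sketch of the first stage, assuming scikit-learn as a
# stand-in for the exact NLP pipeline; the apps and parameters are toy
# examples, not data from any real study.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

descriptions = {
    "ChatNow":   "send and receive text messages with your friends",
    "SpeedText": "fast free text messaging with group chats",
    "TravelPal": "book flights and hotels and plan your trip",
    "CityTrips": "plan city trips with flights, hotels, and maps",
}   # a real study would use thousands of store descriptions

# Bag-of-words over the descriptions; a real pipeline would also stem
# the words before counting.
counts = CountVectorizer(stop_words="english").fit_transform(descriptions.values())

# LDA abstracts each description into a distribution over topics ...
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# ... and clustering the topic distributions groups apps whose topics
# have the most in common ("Messaging", "Travel", ...).
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(topics)
for app, cluster in zip(descriptions, clusters):
    print(f"{app}: cluster {cluster}")
```

In a real study, the number of topics and clusters would of course be tuned on thousands of descriptions rather than fixed by hand.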
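The per-app API sets that the second stage needs can be extracted statically. The sketch below uses the androguard library as one such simple static analysis tool (androguard 3.x assumed, with a placeholder APK path); it is an illustrative choice, not necessarily the tool behind CHABADA.

```python
# A sketch of extracting an app's API calls by static analysis, here
# with the androguard library (androguard 3.x assumed); "app.apk" is a
# placeholder path.
from androguard.misc import AnalyzeAPK

apk, dex, analysis = AnalyzeAPK("app.apk")

# External methods are calls that leave the app's own code, that is,
# the Android framework and library APIs the app uses.
api_calls = {
    f"{m.get_method().get_class_name()}->{m.get_method().get_name()}"
    for m in analysis.get_methods()
    if m.is_external()
}
print(sorted(api_calls)[:20])
```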
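Finally, here is a minimal sketch of the outlier ranking within one cluster. The API sets are invented, and a One-Class SVM stands in for the “tried-and-proven outlier analysis techniques”; any detector that yields a distance from the norm would fit the same mold.

```python
# A minimal sketch of the second stage: rank the apps of one cluster by
# how far their API usage deviates from the cluster norm. The API sets
# are invented; a One-Class SVM stands in for the outlier detector.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import OneClassSVM

# One row per app in the cluster, one binary feature per sensitive API.
api_usage = {
    "TravelPal": {"getLastKnownLocation": 1, "openConnection": 1},
    "CityTrips": {"getLastKnownLocation": 1, "openConnection": 1},
    "TripNotes": {"getLastKnownLocation": 1},
    "ShadyMaps": {"getDeviceId": 1, "sendTextMessage": 1, "openConnection": 1},
}

features = DictVectorizer().fit_transform(api_usage.values())

# Fit the detector on the cluster itself: most apps are assumed benign,
# so apps far from the learned boundary get the lowest scores.
detector = OneClassSVM(nu=0.1, kernel="rbf", gamma="scale").fit(features)
scores = detector.decision_function(features)

# Lowest score first: the most likely outliers top the ranking.
for app, score in sorted(zip(api_usage, scores), key=lambda pair: pair[1]):
    print(f"{score:+.3f}  {app}")
```

In this toy cluster, the invented “ShadyMaps” app should end up at the top of the ranking, as its API usage has little in common with that of its neighbors.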

What do these rankings give you? We identified “Travel” applications that happily shared your device identifier and account information with the world (which is rare for “Travel” applications). Plenty of apps tracked all sorts of user information without mentioning this in their description. (We had a cluster named “Adware” which contained apps focusing on collecting data.) But outliers can also arise because they are the one good app amongst several dubious ones. In our “Poker” cluster, only one app would not track user data—and it was promptly flagged as an outlier.

The real power of such approaches, however, comes when they are applied to detect malicious apps. Applied to a set of 22,500 apps, CHABADA can detect 74% of novel malware, with a false positive rate below 10%. Our recent MUDFLOW prototype, which learns normal data flows from apps, can even detect more than 90% of novel malware leaking sensitive data. Remember that these recognition rates come from learning from benign samples only. Thus, CHABADA and MUDFLOW can be applied to detect malware even if it is the very first of its kind—simply because it shows unusual behavior compared to the many benign apps found in app stores.

A Treasure Trove of Data …

At this point, API usage and static data barely scratch the surface of the many facets of behavior that can be extracted from apps. To whet your appetite, here are a number of ideas that app stores make possible—all based on the idea of mining what is normal across apps:

1. Future techniques will tie program analysis to user interface analysis, for instance, to detect whether the user agreed to specific terms and conditions before starting whatever questionable behavior. (And whether the terms and conditions were actually legible on the screen!)

2. Mining user interaction may reveal behavior patterns we could reuse in various contexts. For instance, we could learn from one app that to check out, one typically has to add items to a shopping cart first—and reapply this pattern when we want to automatically explore another shopping app.

3. Violating behavior patterns may also imply usability issues. If a button named “Login” does nothing, for instance, it behaves very differently from the “Login” buttons in other apps—and would hopefully be flagged as an anomaly. (If it takes control of your device, this would hopefully be detected as an even larger anomaly!)

4. Given good test generators, one can systematically explore the dynamic behavior, and gain information on concrete text and resources accessed. For instance, an app that shows a map would typically send the location to a known maps service—but not necessarily to some obscure server we know nothing about.

The fun with apps is that they offer so many different data sources that can all be associated with each other—and there are so many instances of apps that one can indeed learn what makes for normal and expected behavior. And still, we are just at the beginning.

… but Also Obstacles

App mining is different. There is exciting data available, but there is also data that is normally not available, particularly compared with mining source code repositories. Here are a few obstacles you should be aware of:

1. Getting apps is not hard, but not easy either. Besides the official stores, there is no publicly available repository of apps where you could simply download thousands of apps—simply because this would be a gross violation of copyright. Even researchers cannot share their app collections, for the exact same reason. You will have to download your own collection, and this takes time and effort. (Note that collections of malicious apps can be easily shared—but that’s because it is unlikely that someone would enforce copyright.)

2. For apps, there’s no easily accessible source code, version, or bug information. If you monitor a store for a sufficient time, you may be able to access and compare releases, but that’s it. The vendors maintain their own code, version control, and bug databases, and they normally would not grant you access to these. And the few apps that are available as open source would be neither popular nor representative. Fortunately, app byte code is not too hard to analyze.

3. Metadata is only a very weak indicator of program quality. Lots of one-star reviews may refer to a recent price increase, which is independent of the app itself; or may come from fans collectively criticizing an app for political reasons; or be related to the app actually being nonfunctional. On the other hand, lots of reviews talking about crashes or malicious behavior might give clear signs.

4. Never underestimate developers. Vendors typically have a pretty clear picture of what their users do—by collecting and analyzing lots of usage and installation data, which you don’t have access to. If you think you can mine metadata to predict release dates, reviews, or sentiments: talk to vendors first and check your proposal against the realities of app development.

In practice, overcoming these obstacles is not too hard: get or create a set of scripts that download a representative set of apps and their metadata; use a suitable tool chain for analyzing app code; and talk to app vendors and developers to understand their practice and identify their needs. Then get the data—and enjoy the ride!
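As a starting point, here is a deliberately generic sketch of such a download script. The store URL, endpoints, and package names are hypothetical placeholders; real stores have their own (often authenticated) protocols and terms of service, so treat this as structure rather than a recipe.

```python
# A generic sketch of a collection script; STORE_URL, its endpoints,
# and the package IDs are all hypothetical placeholders.
import pathlib
import urllib.request

STORE_URL = "https://app-store.example"   # placeholder, not a real store
PACKAGES = ["com.example.chatnow", "com.example.travelpal"]  # invented IDs

out = pathlib.Path("collection")
out.mkdir(exist_ok=True)

for pkg in PACKAGES:
    # Fetch the metadata (description, category, ratings, ...) ...
    with urllib.request.urlopen(f"{STORE_URL}/details?id={pkg}") as response:
        (out / f"{pkg}.json").write_bytes(response.read())
    # ... and the app binary itself, one app after another.
    with urllib.request.urlopen(f"{STORE_URL}/app?id={pkg}") as response:
        (out / f"{pkg}.apk").write_bytes(response.read())
```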

Executive Summary

App mining leverages common knowledge in thousands of apps to automatically learn what is “normal” behavior—and in contrast, automatically identify “abnormal” behavior. This classification can guide programmers and users toward quality, productivity, and security. As an emerging field of research, app mining opens lots of opportunities for research that serves users, developers, and vendors alike. Enjoy app mining!

Further Reading

[1] Gorla A., Tavecchia I., Gross F., Zeller A. Checking app behavior against app descriptions. In: Proceedings of the international conference on software engineering; 2014.

[2] Kuznetsov K., Gorla A., Tavecchia I., Gross F., Zeller A. Mining Android apps for anomalies. In: Bird C., Menzies T., Zimmermann T., eds. Art and science of analyzing software data. Elsevier; 2015.

[3] Avdiienko V., Kuznetsov K., Gorla A., Zeller A., Arzt S., Rasthofer S., et al. Mining apps for abnormal usage of sensitive data. In: Proceedings of the international conference on software engineering; 2015.

[4] Saarland University app mining project page. https://www.st.cs.uni-saarland.de/appmining/.
