Android malware classification with API call-grams

Sara Veterini
Sapienza, Università di Roma
Master in Engineering in Computer Science
Android Malware Classification with API call-grams

15/01/2018

Does Android malware exist?
Yes, and it is growing. Symantec observed 18.4 million detections in 2016, double the number observed in 2015.

Why does Android malware exist?
• Smartphones contain sensitive data that can be monetized
• It is easy to develop: banking trojans are sold on the black market for $200
• It is easy to distribute: third-party stores

What is Google doing?
• Google Bouncer, an antivirus system that scans both new and existing apps on the Google Play Store
• Google Play Protect, a feature integrated into the Google Play application that scans your device
But...

What can we do?
• Detection: understanding whether the sample is malware or not
• Classification: understanding which family the malware belongs to
Both detection and classification can be performed with machine learning techniques, which learn autonomously with the help of a knowledge base of many analyzed samples.

State of the art - Features

Distribution of works by feature used

Thesis contributions
• Malware classification system with machine learning algorithms
• Static features only: API call-grams (apigrams), which are ordered sequences of API calls extracted from the call graph of the application

What is an apigram?
• Example: [figure not recovered]
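As a stand-in for the missing illustration, here is a minimal sketch of what an apigram could look like. It assumes an apigram is a directed path of n nodes in the call graph; the slides do not say exactly how the graph is walked, and the graph, the method names, and the `apigrams` helper below are all hypothetical.

```python
def apigrams(call_graph, n=3):
    """Return every directed path of n nodes in call_graph as a tuple."""
    found = set()

    def walk(path):
        if len(path) == n:
            found.add(tuple(path))
            return
        for callee in call_graph.get(path[-1], ()):
            if callee not in path:  # assumption: do not revisit a node (no cycles)
                walk(path + [callee])

    for start in call_graph:
        walk([start])
    return found


# Hypothetical call graph: each key calls the methods in its list.
toy_graph = {
    "getDeviceId": ["openConnection"],
    "openConnection": ["getOutputStream"],
    "getOutputStream": ["write"],
    "write": [],
}

grams = apigrams(toy_graph)
for gram in sorted(grams):
    print(" -> ".join(gram))
```

With this toy graph the function yields two apigrams of length 3, one starting at `getDeviceId` and one starting at `openConnection`.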

Proposed approach
• Extract the call graph of the apk with the Androguard library
• Create apigrams of length 3 from the graph
• Discard apigrams occurring in just one apk: they are not representative
• Build a matrix of size N × M:
– each row of the matrix is a binary vector representing an apk a
– each element eag corresponds to a particular apigram ag
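The last two steps above can be sketched as follows. This is an illustration, not the thesis code: it assumes the apigrams of each apk have already been extracted into a set, that a matrix element is 1 when the apigram occurs in the apk and 0 otherwise, and the `build_matrix` helper and sample names are hypothetical.

```python
from collections import Counter


def build_matrix(apk_apigrams):
    """apk_apigrams maps an apk name to the set of apigrams found in it."""
    # Count in how many apks each apigram occurs.
    apk_count = Counter(g for grams in apk_apigrams.values() for g in grams)
    # Keep only apigrams shared by at least two apks, in a stable column order.
    columns = sorted(g for g, count in apk_count.items() if count >= 2)
    # One binary row per apk: 1 if the apigram occurs in it, 0 otherwise.
    rows = {
        apk: [1 if g in grams else 0 for g in columns]
        for apk, grams in apk_apigrams.items()
    }
    return columns, rows


samples = {
    "a.apk": {("A", "B", "C"), ("B", "C", "D")},
    "b.apk": {("A", "B", "C"), ("X", "Y", "Z")},
    "c.apk": {("A", "B", "C"), ("B", "C", "D")},
}
columns, rows = build_matrix(samples)
```

Here `("X", "Y", "Z")` occurs only in `b.apk`, so it is discarded and the matrix has three rows and two columns.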

Proposed approach
• Apigrams are extracted with two levels of abstraction
• 1st level, complete apigrams, containing:
– Activity name where the method is called
– Method name
– Method descriptor
• 2nd level, abstracted apigrams:
– Activity name
– Method name
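A small sketch of the two levels, assuming each call in a complete apigram is stored as an (activity, method, descriptor) triple. The representation, the `abstract` helper, and the example names are hypothetical; the slides only list which fields each level keeps.

```python
def abstract(apigram):
    """Drop the descriptor from every call, keeping (activity, method)."""
    return tuple((activity, method) for activity, method, _descriptor in apigram)


# Two complete apigrams that differ only in the descriptor of the last call.
complete_a = (
    ("MainActivity", "getDeviceId", "()Ljava/lang/String;"),
    ("MainActivity", "openConnection", "()Ljava/net/URLConnection;"),
    ("MainActivity", "write", "([B)V"),
)
complete_b = complete_a[:2] + (("MainActivity", "write", "(I)V"),)

abstracted_a = abstract(complete_a)
abstracted_b = abstract(complete_b)
```

The two complete apigrams are distinct features, while their abstracted forms are identical, so abstraction merges calls to overloads of the same method into one feature.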

Proposed approach
• Before building the matrix, feature selection with the Chi-Square algorithm is applied
• Objective: reduce the number of features (they can number in the millions) and keep the best ones
• Classification is performed with Random Forest and Decision Tree algorithms
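To illustrate the feature selection step, the sketch below scores a binary apigram feature against the family labels with the standard Pearson chi-square statistic and keeps the k best. Which chi-square variant and which implementation the thesis used is not stated on the slides, so the `chi_square` and `select_k_best` helpers and the toy data are assumptions.

```python
from collections import Counter


def chi_square(feature, labels):
    """Pearson chi-square statistic between a binary column and class labels."""
    n = len(labels)
    class_totals = Counter(labels)
    value_totals = Counter(feature)
    observed = Counter(zip(feature, labels))
    score = 0.0
    for value, value_total in value_totals.items():
        for cls, class_total in class_totals.items():
            # Count expected if the feature were independent of the family.
            expected = value_total * class_total / n
            score += (observed[(value, cls)] - expected) ** 2 / expected
    return score


def select_k_best(columns, labels, k):
    """columns maps a feature name to its binary column; return the k best names."""
    ranked = sorted(
        columns, key=lambda name: chi_square(columns[name], labels), reverse=True
    )
    return ranked[:k]


labels = ["fam1", "fam1", "fam2", "fam2"]
columns = {
    "g_discriminative": [1, 1, 0, 0],  # present only in fam1
    "g_uninformative": [1, 0, 1, 0],   # spread evenly over both families
}
best = select_k_best(columns, labels, k=1)
```

An apigram that appears in only one family gets a high score, while one spread evenly across families scores zero and is dropped first.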

Datasets
• Tests were made on two datasets.
• Drebin: contains 5,560 malware samples collected in the period from 2010 to 2012. Many approaches have been tested on it.
• AndroZoo: a regularly updated dataset that currently contains 5,704,998 samples. For each application, it gives the type and the family, collected from VirusTotal reports with an automatic tool.

Tests
• Drebin experiments:
– First: we classified the 9 biggest families with 100 samples each, obtaining a perfectly balanced set, with both complete and abstracted apigrams.
– Second: we classified all samples of those 9 families, obtaining an unbalanced set.
• AndroZoo experiments (with both complete and abstracted apigrams):
– First: we classified 9 families of type “trojan”, 100 samples each; most of them are the same as the 9 families from Drebin.
– Second: we classified 9 families of different types, 100 samples each, to see the differences between the two classifications.
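The balanced sets described above could be built along these lines. The slides do not say how the 100 samples per family were chosen, so random sampling with a fixed seed is an assumption, and the `balanced_subset` helper and the synthetic data are hypothetical.

```python
import random
from collections import defaultdict


def balanced_subset(samples, n_families=9, per_family=100, seed=0):
    """samples is a list of (apk_id, family) pairs."""
    by_family = defaultdict(list)
    for apk_id, family in samples:
        by_family[family].append(apk_id)
    # Rank families by size, largest first; ties are broken by name.
    biggest = sorted(by_family, key=lambda f: (-len(by_family[f]), f))[:n_families]
    rng = random.Random(seed)
    # rng.sample raises ValueError if a family has fewer than per_family apks.
    return {family: rng.sample(by_family[family], per_family) for family in biggest}


# Synthetic data: 11 families holding 100, 101, ..., 110 samples.
families = [f"fam{k:02d}" for k in range(11)]
samples = [
    (f"{family}_{i}", family)
    for k, family in enumerate(families)
    for i in range(100 + k)
]
subset = balanced_subset(samples)
```

The result keeps the nine largest families with exactly 100 apks each; the unbalanced variant would instead keep every apk of those nine families.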

Results – Drebin
[Figure not recovered; labels: Features selected, Accuracy]

Results – AndroZoo – 9 trojan families
[Figure not recovered; labels: Features selected, Accuracy]

Results – AndroZoo – 9 different families
[Figure not recovered; labels: Features selected, Accuracy]

Conclusions and future work
• Accuracy on the tests with trojans only falls in the range 82-87%
• Accuracy on families of different types reaches 94-95%
• One direction for future work is to repeat the tests I ran on trojans with other types of malware
• In parallel, it would be interesting to define more levels of abstraction for apigrams, to see which information must be kept in the string to reach the highest accuracy
