Software Analytics: Data Analytics for Software Engineering and Security Tao Xie Department of Computer Science University of Illinois at Urbana-Champaign, USA taoxie@illinois.edu In Collaboration with Microsoft Research and NC State University
New Era…Software itself is changing... Software Services
How people use software is changing… Individual → Social; Isolated → Collaborative; Not much data/content generation → Huge amount of data/artifacts generated anywhere, anytime
How software is built & operated is changing… Code centric → Data pervasive; Long product cycle → Continuous release; Experience & gut-feeling → Informed decision making; In-lab testing → Debugging in the large; Centralized development → Distributed development; …
Software Analytics Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services. Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011 http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Software Analytics Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services. http://research.microsoft.com/en-us/groups/sa/ http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Data sources: Runtime traces, program logs, system events, perf counters, …; usage logs, user surveys, online forum posts, blogs & Twitter, …; source code, bug history, check-in history, test cases, eye tracking, MRI/EMG, …
Target audience – software practitioners: Developer, Tester, Program Manager, Usability engineer, Designer, Support engineer, Management personnel, Operation engineer
Output – insightful information • Conveys meaningful and useful understanding or knowledge towards completing the target task • Not easily attainable via directly investigating raw data without aid of analytics technologies • Example – It is easy to count the number of re-opened bugs, but how to find out the primary reasons for these re-opened bugs?
Output – actionable information • “So what” -- enables software practitioners to come up with concrete solutions towards completing the target task • Example – Why were bugs re-opened? • A list of bug groups, each sharing the same reason for re-opening
Research topics & technology pillars • Vertical: Software Users, Software Development Process, Software System • Horizontal: Information Visualization, Data Analysis Algorithms, Large-scale Computing
Outline • Overview of Software Analytics • Software Engineering Tasks – XIAO: Scalable code clone analysis – SAS: Incident management of online services • Mobile App Security Tasks – WHYPER: NLP on app descriptions – AppContext: Machine learning to classify malware
XIAO: Scalable code clone analysis (2012) http://research.microsoft.com/jump/175199
XIAO: Code Clone Analysis • Motivation – Copy-and-paste is a common developer behavior – A real tool widely adopted internally and externally • XIAO enables code clone analysis in the following way – High tunability – High scalability – High compatibility – High explorability
High tunability – what you tune is what you get • Intuitive similarity metric: effective control of the degree of syntactical differences between two code snippets
Snippet 1:
    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }
Snippet 2:
    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
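The slide does not spell out XIAO's similarity metric, so the following is only a minimal sketch of the general idea of a tunable threshold. It assumes a Dice coefficient over the multiset of whitespace-normalized statements, which tolerates reordering. The names `normalize`, `similarity`, and `is_clone` are hypothetical and not part of XIAO.

```python
from collections import Counter
from typing import List


def normalize(statement: str) -> str:
    """Drop all whitespace so that `a ++;` and `a++;` compare equal."""
    return "".join(statement.split())


def similarity(body_a: List[str], body_b: List[str]) -> float:
    """Dice coefficient over the multisets of normalized statements.

    Order-insensitive on purpose: reordered statements still count as matches.
    This is an assumed stand-in metric, not the one XIAO uses.
    """
    a = Counter(normalize(s) for s in body_a)
    b = Counter(normalize(s) for s in body_b)
    shared = sum((a & b).values())  # statements present in both bodies
    total = sum(a.values()) + sum(b.values())
    return 2 * shared / total if total else 1.0


def is_clone(body_a: List[str], body_b: List[str], threshold: float) -> bool:
    """The threshold is the knob: raise it to demand closer matches."""
    return similarity(body_a, body_b) >= threshold


# Loop bodies of the two snippets shown on the slide.
SNIPPET_1 = ["a ++;", "b ++;", "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"]
SNIPPET_2 = ["c = foo(a, b);", "a ++;", "b ++;", "d = bar(a, b, c);",
             "e = a + d;", "e ++;"]

score = similarity(SNIPPET_1, SNIPPET_2)  # 4 shared statements out of 5 + 6
```

Under this assumed metric the two loop bodies share four of their eleven statements in total, so they are reported as clones at a threshold of 0.7 but not at 0.8.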
High explorability
1. Clone navigation based on source tree hierarchy 2. Pivoting of folder level statistics 3. Folder level statistics 4. Clone function list in selected folder 5. Clone function filters 6. Sorting by bug or refactoring potential 7. Tagging
1. Block correspondence 2. Block types 3. Block navigation 4. Copying 5. Bug filing 6. Tagging
Scenarios & Solutions
Quality gates at milestones • Architecture refactoring • Code clone clean up • Bug fixing
Post-release maintenance • Security bug investigation • Bug investigation for sustained engineering
Development and testing • Checking for similar issues before check-in • Reference info for code review • Supporting tool for bug triage
Solutions • Online code clone search • Offline code clone analysis
Benefiting developer community • Available in Visual Studio 2012 RC • Searching similar snippets for fixing bug once • Finding refactoring opportunity
More secure Microsoft products • Code Clone Search service integrated into workflow of Microsoft Security Response Center • Over 590 million lines of code indexed across multiple products • Real security issues proactively identified and addressed
Example – MS Security Bulletin MS12-034 • Combined Security Update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012 • Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document • Insufficient bounds check within the font parsing subsystem of win32k.sys • Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer • Microsoft TechNet blog about this bulletin: “However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
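The slides do not describe how the Code Clone Search service is built. As a rough illustration of the workflow in the quote (take one vulnerable snippet, find every copy), here is a minimal sketch that assumes a simple fingerprint index over windows of normalized lines. `CloneIndex`, the file names, and the C fragments are hypothetical placeholders, and a real system would need fuzzier matching than exact window hashes.

```python
import hashlib
from collections import defaultdict


def normalize(line):
    """Remove all whitespace so formatting differences do not matter."""
    return "".join(line.split())


def fingerprints(lines, k=3):
    """Yield a hash for every window of k consecutive non-empty lines."""
    norm = [normalize(line) for line in lines if line.strip()]
    for i in range(len(norm) - k + 1):
        window = "\n".join(norm[i:i + k])
        yield hashlib.sha1(window.encode("utf-8")).hexdigest()


class CloneIndex:
    """Inverted index: window fingerprint -> names of files containing it."""

    def __init__(self, k=3):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, name, source):
        for fp in fingerprints(source.splitlines(), self.k):
            self.postings[fp].add(name)

    def search(self, snippet):
        """Return {file name: number of snippet windows found in that file}."""
        hits = defaultdict(int)
        for fp in set(fingerprints(snippet.splitlines(), self.k)):
            for name in self.postings.get(fp, ()):
                hits[name] += 1
        return dict(hits)


# Hypothetical code: a snippet with a missing bounds check and three "products".
VULNERABLE = "len = read_u16(buf);\nptr = table + len;\nvalue = *ptr;\n"

index = CloneIndex(k=3)
index.add("component_a.c", "int parse(char *buf) {\n" + VULNERABLE + "return value;\n}\n")
index.add("component_b.c",
          "void draw() {\n  len = read_u16( buf );\n  ptr = table+len;\n  value = *ptr;\n}\n")
index.add("component_c.c", "int main() {\nreturn 0;\n}\n")

hits = index.search(VULNERABLE)
```

Here the query finds the copy in both `component_a.c` and the reformatted `component_b.c`, while `component_c.c` is not reported.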
SAS Incident management of online services http://research.microsoft.com/apps/pubs/?id=202451
Motivation Incident Management (IcM) is a critical task to assure service quality • Online services are increasingly popular & important • High service quality is the key
Incident Management: Workflow Detect a service issue → Alert On-Call Engineers (OCEs) → Investigate the problem → Restore the service → Fix root cause via postmortem analysis
Incident Management: Characteristics • Shrink-Wrapped Software Debugging: Root Cause and Fix; Debugger; Controlled Environment • Online Service Incident Management: Workaround; No Debugger; Live Data
Incident Management: Challenges • Large volume and noisy data • Highly complex problem space • No knowledge of entire system • Knowledge not well organized
SAS: Incident management of online services SAS, developed and deployed to effectively reduce MTTR (Mean Time To Restore) via automatically analyzing monitoring data • Design Principles of SAS – Automating Analysis – Handling Heterogeneity – Accumulating Knowledge – Supporting human-in-the-loop (HITL)
Techniques Overview • System metrics – Identifying Incident Beacons • Transaction logs – Mining Suspicious Execution Patterns • Historical incidents – Mining Historical Workaround Solutions
Industry Impact of SAS
Deployment • SAS deployed to worldwide datacenters for Service X (serving hundreds of millions of users) since June 2011 • OCEs now heavily depend on SAS
Usage • SAS helped successfully diagnose ~76% of the service incidents assisted with SAS
Outline • Overview of Software Analytics • Software Engineering Tasks – XIAO: Scalable code clone analysis – SAS: Incident management of online services • Mobile App Security Tasks – WHYPER: NLP on app descriptions – AppContext: Machine learning to classify malware
“Conceptual” Model • APP DEVELOPERS: App Functional Requirements, App Security Requirements, App Code • APP USERS: User Functional Requirements, User Security Requirements • Informal artifacts: app description, etc.; permission list, etc.
Requirements: App Description App Code App Permissions
App Security Requirements: Permission List
Example Android App: Angry Birds
WHYPER: Text Analytics for Mobile Security • Focus on permissions and app descriptions – Permissions (protecting user-understandable resources) should be discussed • What do users expect (w.r.t. app functionalities)? – GPS Tracker: record and send location – Phone-Call Recorder: record audio during phone call • App Description Sentence, Permission, Linkage Pandita et al. WHYPER: Towards Automating Risk Assessment of Mobile Applications. USENIX Security 2013 http://web.engr.illinois.edu/~taoxie/publications/usenixsec13-whyper.pdf
WHYPER Overview (diagram: Application Market, WHYPER, Developers, Users) • Enhance user experience while installing apps • Enforce functionality disclosure on developers • Complement program analysis to ensure justifications Pandita et al. WHYPER: Towards Automating Risk Assessment of Mobile Applications. USENIX Security 2013 http://web.engr.illinois.edu/~taoxie/publications/usenixsec13-whyper.pdf
Natural Language Processing on App Description • “Also you can share the yoga exercise to your friends via Email and SMS.” – Implication of using the contact permission – Permission sentences • Confounding effects: – Certain keywords such as “contact” have a confounding meaning – E.g., “... displays user contacts, ...” vs “... contact me at abc@xyz.com”. • Semantic inference: – Sentences describe a sensitive action w/o referring to keywords – E.g., “share yoga exercises with your friends via Email and SMS” • NLP + Semantic Graphs/Ontologies Derived from Android API Documents
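To make the two problems on this slide concrete, here is a minimal sketch of the naive keyword-matching baseline they argue against. It is not WHYPER itself: the keyword table and the function name `keyword_flags` are hypothetical, and mapping "the contact permission" to Android's `READ_CONTACTS` is an assumption.

```python
import re

# Hypothetical keyword table for one permission (assumed to be READ_CONTACTS).
PERMISSION_KEYWORDS = {"READ_CONTACTS": ["contact", "contacts"]}


def keyword_flags(sentence, permission):
    """Naive baseline: flag a sentence if it contains any permission keyword."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return any(keyword in words for keyword in PERMISSION_KEYWORDS[permission])


# Sentences quoted on the slide.
uses_contacts = "... displays user contacts, ..."
author_email = "... contact me at abc@xyz.com"
implicit_use = "share yoga exercises with your friends via Email and SMS"

true_positive = keyword_flags(uses_contacts, "READ_CONTACTS")
false_positive = keyword_flags(author_email, "READ_CONTACTS")  # confounding effect
false_negative = not keyword_flags(implicit_use, "READ_CONTACTS")  # needs inference
```

The baseline flags the author's e-mail sentence even though it says nothing about the address book, and it misses the sharing sentence because no keyword appears. These are the two gaps that the NLP and semantic-graph approach on the slide is meant to close.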
Challenges • Synonym analysis – Example non-permission sentence: “You can now turn recordings into ringtones.” – Functionality that allows users to create ringtones from previously recorded sounds but NOT requiring permission to record audio – False positive due to using synonym: (turn, start) • Limitations of semantic graphs – Example permission sentence: “blow into the mic to extinguish the flame like a real candle” – False negative due to failing to associate “blow into” with “record” • Automatic mining from user comments and forums
Not All Malware Developers Are “Dumb” or “Lazy”
Example Malicious App
Not All Malware Developers Are “Dumb” or “Lazy” Benign? Malicious?
Our Insight Different goals of benign apps vs. malware. • Benign apps – Meet requirements from users (as delivering utility) • Malware – Trigger malicious behaviors frequently (as maximizing profits) – Evade detection (as prolonging lifetime)
Differentiating characteristics Mobile malware (vs. benign apps) – Frequently enough to meet the need: frequent occurrences of imperceptible system events; • E.g., many malware families trigger malicious behaviors via background events. – Not too frequently for users to notice anomaly: indicative states of external environments • E.g., Send premium SMS every 12 hours Balance!!!
Example of malicious app
Call-graph nodes: ActionReceiver.OnReceive(), MainService.OnCreate(), MainService.b(), DummyMainMethod(), SplashActivity.OnCreate(), SendTextActivity$4.onClick(), SendTextActivity$5.run(), ContextWrapper.StartService(), SmsManager.sendTextMessage()
Code in ActionReceiver.OnReceive():
    Date date = new Date();
    if (date.getHours() >= 23 || date.getHours() < 5) {
        ContextWrapper.StartService(MainService);
        …
Second code fragment:
    long last = db.query("LastConnectTime");
    long current = System.currentTimeMillis();
    if (current - last > 43200000) {
        SmsManager.sendTextMessage();
        db.save("LastConnectTime", current);
        …
The app will send an SMS when
• user clicks a button in the app (SendTextActivity$4.onClick → SmsManager.sendTextMessage)
• phone signal strength changes (frequent; Android.intent.action.SIG_STR) and current time is within 11 PM-5 AM (not too frequent, user not around): if (date.getHours() >= 23 || date.getHours() < 5) {
• user enters the app (frequent) and (current time - time when last msg sent) > 12 hours (not too frequent): if (current - last > 43200000) {
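The two guards in this example can be restated as small, testable predicates. This is only a sketch for checking the arithmetic: the function names are hypothetical, and the first guard is read as "11 PM to 5 AM", as the slide's caption states.

```python
from datetime import datetime

# 12 hours in milliseconds; equals the literal 43200000 used in the app's code.
TWELVE_HOURS_MS = 12 * 60 * 60 * 1000


def night_gate(now):
    """Signal-strength path: proceed only between 11 PM and 5 AM."""
    return now.hour >= 23 or now.hour < 5


def interval_gate(current_ms, last_ms):
    """App-entry path: proceed only if more than 12 hours have passed."""
    return current_ms - last_ms > TWELVE_HOURS_MS


late_night = night_gate(datetime(2015, 1, 1, 23, 30))
midday = night_gate(datetime(2015, 1, 1, 12, 0))
too_soon = interval_gate(current_ms=TWELVE_HOURS_MS, last_ms=0)
long_enough = interval_gate(current_ms=TWELVE_HOURS_MS + 1, last_ms=0)
```

Both guards pair a frequent trigger (a system event or app entry) with a condition that keeps the visible effect rare, which is the balance the preceding slides describe.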
AppContext • Capture differentiating characteristics with contexts of security-sensitive behavior. • Leverage contexts in machine learning (classification) to differentiate malware and benign apps. Yang et al. AppContext: Differentiating Malicious and Benign Mobile App Behavior Under Contexts. ICSE 2015. http://taoxie.cs.illinois.edu/publications/icse15-appcontext.pdf
Techniques • Abstraction for expressing context of security-sensitive behaviors, e.g., a permission-protected API method. – To precisely capture the differentiating characteristics • Inter-component analysis for extracting contexts – To identify entry point for activation events – To connect control flows for context factors
Context of security-sensitive behavior • Activation events: – E.g., signal strength changes • Context factors: – Environmental attributes affecting the security-sensitive behavior’s invocation (or not) – E.g., current system time
AppContext - Workflow CG: Call Graph; ECG: Extended CG; RICFG: Reduced ICFG
Context-based Security-Behavior Classification Transforming → Labelling → Training → Classifying Step 1. Transform contexts for each app’s security behavior as features • Context1: (Event: Signal strength changes), (Factor: Calendar) • Context2: (Event: Entering app), (Factor: Database, SystemTime) • Context3: (Event: Clicking a button)
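The slide does not say how contexts are encoded as features. One plausible reading of the transforming step is a binary vector with one position per known activation event and per known context factor, sketched below. That encoding, the dictionary layout, and the names `build_vocabulary` and `to_vector` are assumptions.

```python
# The three contexts listed on the slide, written as plain dictionaries.
CONTEXTS = [
    {"event": "Signal strength changes", "factors": ["Calendar"]},
    {"event": "Entering app", "factors": ["Database", "SystemTime"]},
    {"event": "Clicking a button", "factors": []},
]


def build_vocabulary(contexts):
    """Fix a stable feature order: all events first, then all factors."""
    events = sorted({c["event"] for c in contexts})
    factors = sorted({f for c in contexts for f in c["factors"]})
    return [("event", e) for e in events] + [("factor", f) for f in factors]


def to_vector(context, vocabulary):
    """Encode one context as 0/1 flags over the vocabulary (assumed encoding)."""
    present = {("event", context["event"])}
    present |= {("factor", f) for f in context["factors"]}
    return [1 if feature in present else 0 for feature in vocabulary]


vocabulary = build_vocabulary(CONTEXTS)
vectors = [to_vector(c, vocabulary) for c in CONTEXTS]
```

With this ordering, each context becomes a six-element vector (three event positions followed by three factor positions) that a classifier can consume directly.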
Context-based Security-Behavior Classification (Cont.) Transforming → Labelling → Training → Classifying • Systematically label security-sensitive method calls as malicious based on the existing malware signatures • Support Vector Machine (SVM) – SVM is resilient to over-fitting – SVM can handle high-dimensional data such as our context factor data (dimension reduction may be another option).
Evaluation Subjects: 846 Android apps • 633 benign apps: randomly selected from popular apps on Google Play. • 202 malicious apps: collected from three different malware datasets (Genome, VirusShare, Contagio). • 11 open source apps: randomly selected from F-Droid.
Research Questions • RQ1: How effective is AppContext in identifying malware? • RQ2: How do activation events and context factors in our context definition contribute to the effectiveness of malware identification? • RQ3: How accurate is our static analysis in inferring contexts?
Evaluation Complete Context has higher precision (87.7%) and recall (95.0%)
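For readers unfamiliar with the two measures, the sketch below shows how precision and recall are computed from classification counts. The counts are made up purely for illustration and are not the study's data.

```python
def precision(true_pos, false_pos):
    """Share of calls flagged as malicious that really are malicious."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Share of truly malicious calls that the classifier flagged."""
    return true_pos / (true_pos + false_neg)


# Illustrative counts only (hypothetical, not taken from the evaluation).
tp, fp, fn = 8, 2, 2

p = precision(tp, fp)
r = recall(tp, fn)
```

High precision means few benign method calls are mislabelled as malicious. High recall means few malicious calls slip through.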
Evaluation Activation events effectively help identify malicious method calls without context factors
Evaluation Context factors effectively help identify malicious behaviors triggered by UI events or malicious behaviors with no activation events
Limitations • False negatives – Malicious behaviors triggered by UI events and without context factors. • UI events have less indication of the maliciousness of a security-sensitive method call • False positives – Reflective method calls, dynamic code loading in benign apps. – Uncommon security-sensitive method calls used in benign apps.
Conclusion • Vertical: Software Users, Software Development Process, Software System • Horizontal: Information Visualization, Data Analysis Algorithms, Large-scale Computing
Q & A http://taoxie.cs.illinois.edu/ Contact: taoxie@illinois.edu
