Does static analysis need machine learning? Anti-Talk Victoria Khanieva PVS-Studio
Speaker 2 Victoria Khanieva • C++ developer at PVS-Studio • Supported the MISRA standard • Wrote articles on checks of open-source projects khanieva@viva64.com www.viva64.com
 Introduction to static analysis  Existing solutions and approaches they implement  Problems and pitfalls when creating an analyzer:  When learning «manually»  When learning on a real large code base  Most promising approaches Agenda 3
About the analysis 4
 Code review Types of code analysis 5
 Code review  Dynamic analysis Types of code analysis 6
 Code review  Dynamic analysis  Static analysis Types of code analysis 7
 A way to reveal errors and flaws in the source code of programs.  Detect errors in programs  Get tips on code formatting  Count metrics  …. Static analysis 8
void createCube(float halfExtentsX, float halfExtentsY, float halfExtentsZ, ....){ .... m_model->addVertex(halfExtentsX, halfExtentsY, halfExtentsY, ....); .... } Diagnostics 9
void createCube(float halfExtentsX, float halfExtentsY, float halfExtentsZ, ....){ .... m_model->addVertex(halfExtentsX, halfExtentsY, halfExtentsY, ....); .... } Diagnostics 10 V751 Parameter 'halfExtentsZ' is not used inside function body. TinyRenderer.cpp 375
You'd think… 11
When ML is useful 12  Useful: Scanning photos and videos
When ML is useful 13  Useful: Scanning photos and videos  Not useful: Calculator
When ML is useful 14  Useful: Scanning photos and videos  Not useful: Calculator
Possible result 15
Existing solutions 16
Existing solutions 17
Existing solutions 18
 Java, JS, TS, Python, C, C++  Code review and audit  You can check out demos on an open-source project  Related posts DeepCode 19 Link
DeepCode 20
 Java, C, C++, Objective-C  By Facebook  Open-source code  You can try Infer on your projects  Based on Hoare logic, separation logic, bi-abduction, and abstract interpretation theory Infer 21 Link
 Handles Infer results  Suggests possible edits SapFix 22
 Platform for analyzing code quality  System for suggesting edits  Uses NLP to search for dependencies between functions and methods Embold 23
 Open-source  Related posts  Repository with dataset for learning  Code-style detection  Platform for collecting metrics and statistics Source{d} 24 Link
Fixing code style in Source{d} 25 Based on the article “STYLE-ANALYZER: fixing code style inconsistencies with interpretable unsupervised algorithms” Link
 By Mozilla+Ubisoft  Searches for suspicious commits  Based on the publication: “CLEVER: Combining Code Metrics with Clone Detection for Just-In-Time Fault Prevention and Resolution in Large Industrial Projects” Clever-Commit 26 Link
 Java  By Amazon  Recommendations on best practices from the documentation and code base CodeGuru 27
28
 Analyze code to search for errors  Analyze code to search for deviations from best practices  Analyze artifacts’ code  Collect metrics and data on code  Suggest code-style fixes Main directions 29
 Selected base of open-source repositories  Dataset selected manually  Own project base Ways to learn 30
Problems and pitfalls 31 * from the point of view of a classic static analyzer developer
We need to find: if (A == A) What it may look like: • if (X && A == A) • if (A + 1 == A + 1) • if (A[i] == A[i]) • if ((A) == (A)) • … «Manual» dataset selection 32
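The variants listed above differ in text but not in structure. As a minimal sketch (the `Expr` type and the helper names are hypothetical, not taken from any real analyzer), a classic check might compare the two operands of `==` as trees, where parentheses no longer exist:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, minimal expression node; a real analyzer's syntax tree is far richer.
struct Expr {
    std::string op;                           // "id", "num", "+", "[]", "==", ...
    std::string value;                        // identifier or literal text for leaves
    std::vector<std::shared_ptr<Expr>> kids;  // operands
};

using ExprPtr = std::shared_ptr<Expr>;

ExprPtr leaf(const std::string& op, const std::string& value) {
    return std::make_shared<Expr>(Expr{op, value, {}});
}

ExprPtr node(const std::string& op, std::vector<ExprPtr> kids) {
    return std::make_shared<Expr>(Expr{op, "", std::move(kids)});
}

// Structural comparison: same operator, same leaf text, same operands in order.
bool sameExpr(const ExprPtr& a, const ExprPtr& b) {
    if (a->op != b->op || a->value != b->value || a->kids.size() != b->kids.size())
        return false;
    for (std::size_t i = 0; i < a->kids.size(); ++i)
        if (!sameExpr(a->kids[i], b->kids[i])) return false;
    return true;
}

// Reports whether an "==" node compares an expression with itself.
bool isSelfComparison(const ExprPtr& e) {
    return e->op == "==" && e->kids.size() == 2 && sameExpr(e->kids[0], e->kids[1]);
}
```

Under these assumptions, `A == A`, `(A) == (A)` and `A + 1 == A + 1` all reduce to the same check, so no separate training sample is needed for each spelling. The sketch ignores operands with side effects, which a real diagnostic would have to treat separately.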
Example from DeepCode 33
«Manual» learning 34
We need to find: int y = x / 0; What it may look like: template <class T> class numeric_limits { .... } namespace boost { .... } namespace boost { namespace hash_detail { template <class T> void dsizet(size_t x) { size_t length = x / (limits<int>::digits - 31); } } } In practice 35
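In the fragment above the zero is not written in the text; it only appears after the divisor is evaluated. A rough sketch of that evaluation step follows, assuming hypothetical helpers `foldSub` and `divisorIsKnownZero` that are not part of any real analyzer API:

```cpp
#include <cassert>
#include <limits>
#include <optional>

// The constant the divisor on the slide depends on: 31 when int is 32 bits wide.
constexpr int kDigits = std::numeric_limits<int>::digits;

// Folds "lhs - rhs" only when both operands are known constants;
// std::nullopt stands for "value not known at analysis time".
std::optional<long> foldSub(std::optional<long> lhs, std::optional<long> rhs) {
    if (!lhs || !rhs) return std::nullopt;
    return *lhs - *rhs;
}

// True only when the divisor is a known constant equal to zero.
bool divisorIsKnownZero(std::optional<long> divisor) {
    return divisor.has_value() && *divisor == 0;
}
```

On a platform where `int` is 32 bits wide, `divisorIsKnownZero(foldSub(kDigits, 31))` holds, which is the `x / 0` pattern from the slide hidden behind a template and two namespaces.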
@Override public String getText(Mode mode) { StringBuilder sb = new StringBuilder(); .... if (filter.getMessage() .toLowerCase(Locale.ENGLISH) .startsWith("Each ")) { sb.append(" has base power and toughness "); } else { sb.append(" have base power and toughness "); } .... return sb.toString(); } Data flow analysis 36
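The condition in the fragment above can never be true: the message is lowercased first and then compared with a prefix that starts with an uppercase letter. A small C++ sketch of the fact a data flow analysis has to track is given below; the function name is hypothetical and the check is deliberately limited to ASCII:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Returns true if a lowercased string could start with `prefix`.
// A prefix that contains an uppercase letter can never match lowercased text.
bool canMatchLowercased(const std::string& prefix) {
    for (unsigned char ch : prefix)
        if (std::isupper(ch)) return false;
    return true;
}
```

For the prefix `"Each "` the function returns false, so the `then` branch on the slide is unreachable and the text always uses "have".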
Data flow analysis 37 uint32_t* BnNew() { uint32_t* result = new uint32_t[kBigIntSize]; memset(result, 0, kBigIntSize * sizeof(uint32_t)); return result; } std::string AndroidRSAPublicKey(crypto::RSAPrivateKey* key) { .... uint32_t* n = BnNew(); .... RSAPublicKey pkey; .... if (pkey.n0inv == 0) return kDummyRSAPublicKey; // <= .... }
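The marked return appears to leave the function without releasing the buffer that `n` points to; seeing this requires knowing that `BnNew` returns memory from `new[]`. One possible way to remove the leak, sketched here with an assumed array size and hypothetical names (`BnNewOwned`, `EncodeKey`, `CountingDeleter`), is to hand ownership to `std::unique_ptr`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

constexpr std::size_t kBigIntSize = 64;  // assumed value; the slide does not show it

int g_released = 0;  // counts deallocations, only to make the effect observable

struct CountingDeleter {
    void operator()(std::uint32_t* p) const {
        delete[] p;
        ++g_released;
    }
};

using BigInt = std::unique_ptr<std::uint32_t[], CountingDeleter>;

BigInt BnNewOwned() {
    // Value-initialization zeroes the array, so no memset is needed.
    return BigInt(new std::uint32_t[kBigIntSize]());
}

// Mirrors the early return on the slide: the buffer is released on every path.
bool EncodeKey(bool earlyExit) {
    BigInt n = BnNewOwned();
    if (earlyExit) return false;  // no leak: n's destructor runs here
    return n[0] == 0;
}
```

The counting deleter exists only for demonstration; in ordinary code `std::unique_ptr<std::uint32_t[]>` with the default deleter would be enough.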
 «So many projects on GitHub! The analyzer will learn from their repositories and commits» turns into collecting and labeling commits.  If a manually collected training set is unreliable, what can we expect from an automatically collected one? Learning on many projects 38
 Check out the commit with the word «fix»: Learning on many projects 39
 The analyzer has to keep up with the language it checks  Most projects use outdated standards  Most projects don’t use new language constructs Outdated code 40
New construct: std::vector<int> numbers; .... for (int num : numbers) foo(num); New error pattern: for (int num : numbers) numbers.push_back(num * 2); Example 41
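The error pattern above is a problem because `push_back` may reallocate the vector and invalidate the iterators that the range-based `for` uses internally. A minimal sketch of one safe alternative, using a hypothetical helper name, iterates by index over the elements that were present on entry:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends num * 2 for every element that was in the vector on entry.
void appendDoubled(std::vector<int>& numbers) {
    const std::size_t originalSize = numbers.size();
    numbers.reserve(originalSize * 2);  // one allocation up front
    for (std::size_t i = 0; i < originalSize; ++i)
        numbers.push_back(numbers[i] * 2);
}
```

Indexing stays valid across reallocation, and fixing the loop bound before the loop keeps it from running over the newly appended elements.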
Documentation 42
 Code example: char check(const uint8 *hash_stage2) { .... return memcmp(hash_stage2, hash_stage2_reassured, SHA1_HASH_SIZE); }  Hypothetically, the analyzer suggests the following fix: int check(const uint8 *hash_stage2) { .... return memcmp(hash_stage2, hash_stage2_reassured, SHA1_HASH_SIZE); } Why documentation matters 43
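The fix above follows from the documentation of `memcmp`, which only specifies the sign of the returned `int`, not its magnitude. The sketch below assumes an 8-bit `char` and uses 256 as a hypothetical non-zero result; the function names are illustrative only:

```cpp
#include <cassert>

// A memcmp-style result narrowed to char, as in the first version on the slide.
char asChar(int comparisonResult) {
    return static_cast<char>(comparisonResult);
}

// "Buffers differ" as seen through each return type.
bool differsAsChar(int comparisonResult) { return asChar(comparisonResult) != 0; }
bool differsAsInt(int comparisonResult) { return comparisonResult != 0; }
```

With these assumptions, a result of 256 means "different" as an `int` but becomes 0, that is "equal", once narrowed to `char`. Nothing in the code alone reveals this; the rule comes from the function's documented contract.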
Classic approach: documentation 44
Code example: ObjectOutputStream out = new ObjectOutputStream(....); SerializedObject obj = new SerializedObject(); obj.state = 100; out.writeObject(obj); obj.state = 200; out.writeObject(obj); out.close(); Why documentation matters 45
The analyzer suggests: ObjectOutputStream out = new ObjectOutputStream(....); SerializedObject obj = new SerializedObject(); obj.state = 100; out.writeObject(obj); obj = new SerializedObject(); // Add this line obj.state = 200; out.writeObject(obj); out.close(); Why documentation matters 46
What happens without the edit: ObjectOutputStream out = new ObjectOutputStream(....); SerializedObject obj = new SerializedObject(); obj.state = 100; out.writeObject(obj); // stores the object with the state = 100 obj.state = 200; out.writeObject(obj); // stores the object with the state = 100 out.close(); Why documentation matters 47
Unambiguous behavior 48
Unambiguous behavior 49
Unambiguous behavior 50
std::vector<int> numbers; .... for (int num : numbers) { if (num < 5) { numbers.push_back(0); break; // or, for example, return } } False positives 51
 The reason for getting a warning may be unclear. The reason for NOT getting a warning may be unclear as well.  How to fix?  Additional training (will it help?)  A mechanism to hide warnings (not universal) False positives 52
In case of successful analyzer learning 53
 Code style by specific symbols  Collecting additional metrics and information Promising directions 54
 Best-practices for a specific framework/code base/platform Promising directions 55
Download a PVS-Studio one-month trial version and check your projects using classic static analysis: https://pvs-studio.com/en/pvs-studio/download/ 56
Q&A viva64.com 57 khanieva@viva64.com
