Does static analysis need machine learning?

Anti-Talk
Victoria Khanieva, PVS-Studio

Speaker
Victoria Khanieva
• C++ developer at PVS-Studio
• Supported the MISRA standard
• Wrote articles on checks of open-source projects
khanieva@viva64.com
www.viva64.com

Agenda
• Introduction to static analysis
• Existing solutions and the approaches they implement
• Problems and pitfalls when creating an analyzer:
  • When learning «manually»
  • When learning on a real large code base
• Most promising approaches

Types of code analysis
• Code review
• Dynamic analysis
• Static analysis

Static analysis
A way to reveal errors and flaws in the source code of programs.
• Detect errors in programs
• Get tips on code formatting
• Count metrics
• …

Diagnostics
    void createCube(float halfExtentsX, float halfExtentsY, float halfExtentsZ, ....)
    {
      ....
      m_model->addVertex(halfExtentsX, halfExtentsY, halfExtentsY, ....);
      ....
    }
V751 Parameter 'halfExtentsZ' is not used inside function body. TinyRenderer.cpp 375

When ML is useful
• Useful: scanning photos and videos
• Not useful: a calculator

DeepCode
• Java, JS, TS, Python, C, C++
• Code review and audit
• You can check out demos on an open-source project
• Related posts

Infer
• Java, C, C++, Objective-C
• By Facebook
• Open-source code
• You can try Infer on your projects
• Based on Hoare logic, separation logic, bi-abduction, and abstract interpretation theory

SapFix
• Handles Infer results
• Suggests possible edits

Embold
• Platform for analyzing code quality
• Edit suggestion system
• Searches for dependencies between functions and methods using NLP

Source{d}
• Open-source
• Related posts
• Repository with a dataset for learning
• Code-style detection
• Platform for collecting metrics and statistics

Fixing code style in Source{d}
Based on the article “STYLE-ANALYZER: fixing code style inconsistencies with interpretable unsupervised algorithms”

Clever-Commit
• By Mozilla and Ubisoft
• Searches for suspicious commits
• Based on the publication “CLEVER: Combining Code Metrics with Clone Detection for Just-In-Time Fault Prevention and Resolution in Large Industrial Projects”

CodeGuru
• Java
• By Amazon
• Recommendations on best practices drawn from the documentation and the code base

Main directions
• Analyze code to search for errors
• Analyze code to search for deviations from best practices
• Analyze artifacts’ code
• Collect metrics and data on code
• Suggest code-style fixes

Ways to learn
• A selected base of open-source repositories
• A manually selected dataset
• Your own project base

Problems and pitfalls*
* from the point of view of a classic static analyzer developer

«Manual» dataset selection
We need to find: if (A == A)
What it may look like:
• if (X && A == A)
• if (A + 1 == A + 1)
• if (A[i] == A[i])
• if ((A) == (A))
• …
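
A classic analyzer does not enumerate these textual variants one by one; it typically compares the two operands of `==` as trees. The following is only a minimal sketch of that idea. The `Expr` type and the helpers `leaf`, `node`, `sameTree`, and `hasSelfComparison` are hypothetical names invented for this illustration, not part of any real analyzer, and the sketch ignores side effects (two identical function calls need not return the same value).

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, minimal expression tree; a real analyzer's AST is far richer.
struct Expr {
    std::string kind;                         // e.g. "var", "num", "+", "[]", "==", "&&"
    std::string text;                         // identifier or literal for leaves
    std::vector<std::shared_ptr<Expr>> kids;  // operands; parentheses are not stored
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr leaf(const std::string& kind, const std::string& text) {
    return std::make_shared<Expr>(Expr{kind, text, {}});
}

ExprPtr node(const std::string& kind, ExprPtr lhs, ExprPtr rhs) {
    return std::make_shared<Expr>(Expr{kind, "", {lhs, rhs}});
}

// Two subtrees are the same if kind, text, and all children match.
bool sameTree(const Expr& a, const Expr& b) {
    if (a.kind != b.kind || a.text != b.text || a.kids.size() != b.kids.size())
        return false;
    for (std::size_t i = 0; i < a.kids.size(); ++i)
        if (!sameTree(*a.kids[i], *b.kids[i]))
            return false;
    return true;
}

// True if any "==" node in the tree compares a subexpression with itself.
bool hasSelfComparison(const Expr& e) {
    if (e.kind == "==" && e.kids.size() == 2 && sameTree(*e.kids[0], *e.kids[1]))
        return true;
    for (const auto& kid : e.kids)
        if (hasSelfComparison(*kid))
            return true;
    return false;
}
```

With this representation, `X && A == A`, `A + 1 == A + 1`, and `(A) == (A)` are all caught by the same rule, because the parser has already removed the surface differences.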

In practice
We need to find: int y = x / 0;
What it may look like:
    template <class T>
    class numeric_limits { .... }

    namespace boost { .... }

    namespace boost {
      namespace hash_detail {
        template <class T>
        void dsizet(size_t x) {
          size_t length = x / (limits<int>::digits - 31);
        }
      }
    }
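
To see why this is a division by zero, the divisor has to be evaluated rather than pattern-matched. The sketch below assumes that `limits<int>` in the fragment behaves like `std::numeric_limits<int>` and that `int` is 32 bits wide; `kDivisor` and `divisorIsZero` are names made up for this illustration.

```cpp
#include <limits>

// digits is the number of value bits without the sign bit: 31 for a 32-bit int.
constexpr int kDivisor = std::numeric_limits<int>::digits - 31;

// Hypothetical helper: true when the divisor from the fragment folds to zero.
constexpr bool divisorIsZero() {
    return kDivisor == 0;
}
```

On a platform with a 32-bit `int`, `divisorIsZero()` is true, so `x / (limits<int>::digits - 31)` is exactly `x / 0`. Nothing in the text of the expression looks like a literal zero.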

Data flow analysis
    @Override
    public String getText(Mode mode) {
      StringBuilder sb = new StringBuilder();
      ....
      if (filter.getMessage()
                .toLowerCase(Locale.ENGLISH)
                .startsWith("Each ")) {
        sb.append(" has base power and toughness ");
      } else {
        sb.append(" have base power and toughness ");
      }
      ....
      return sb.toString();
    }
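
The condition in this fragment can never be true: the string is converted to lower case first, so it cannot start with the capital letter in "Each ". A rough C++ analogue of the same situation is sketched below; `toLowerAscii` and `startsWith` are hypothetical helpers written for this illustration and are not taken from the checked project.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns a copy of s with ASCII letters converted to lower case.
std::string toLowerAscii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// True if s begins with prefix.
bool startsWith(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}
```

`startsWith(toLowerAscii(message), "Each ")` is false for every `message`, while the comparison with `"each "` behaves as presumably intended. Detecting this requires tracking what is known about the value after `toLowerCase`, which is the job of data flow analysis.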

Data flow analysis
    uint32_t* BnNew() {
      uint32_t* result = new uint32_t[kBigIntSize];
      memset(result, 0, kBigIntSize * sizeof(uint32_t));
      return result;
    }

    std::string AndroidRSAPublicKey(crypto::RSAPrivateKey* key) {
      ....
      uint32_t* n = BnNew();
      ....
      RSAPublicKey pkey;
      ....
      if (pkey.n0inv == 0)
        return kDummyRSAPublicKey; // <=
      ....
    }

Learning on many projects
• «So many projects on GitHub! The analyzer will learn from their repositories and commits» turns into collecting and labeling commits.
• If a manually collected learning base is unreliable, what can we expect from an automatically collected one?

Learning on many projects
• Check out a commit with the word «fix»: [screenshot not preserved]

Outdated code
• The analyzer has to keep up with the language it checks
• Most projects use outdated standards
• Most projects don’t use new constructions

Example
New construction:
    std::vector<int> numbers;
    ....
    for (int num : numbers)
      foo(num);
New error pattern:
    for (int num : numbers)
      numbers.push_back(num * 2);
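
The error pattern is dangerous because `push_back` may reallocate the vector and invalidate the iterators that the range-based `for` is using. One possible fix is sketched below, assuming the intent was to append a doubled copy of every element that was already in the container; `appendDoubled` is a name made up for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Appends num * 2 for every element that was present before the call.
void appendDoubled(std::vector<int>& numbers) {
    const std::size_t originalSize = numbers.size();  // snapshot before growing
    numbers.reserve(originalSize * 2);                // optional: one allocation up front
    for (std::size_t i = 0; i < originalSize; ++i)
        numbers.push_back(numbers[i] * 2);            // indices stay valid after reallocation
}
```

Iterating by index over the original size avoids holding iterators across the insertion. An analyzer trained mostly on pre-C++11 code would see few examples of either form.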

Why documentation matters
• Code example:
    char check(const uint8 *hash_stage2) {
      ....
      return memcmp(hash_stage2, hash_stage2_reassured, SHA1_HASH_SIZE);
    }
• The analyzer hypothetically suggests the following fix:
    int check(const uint8 *hash_stage2) {
      ....
      return memcmp(hash_stage2, hash_stage2_reassured, SHA1_HASH_SIZE);
    }
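
The reason for the fix comes from the documented contract of `memcmp`: it returns an `int` of which only the sign is meaningful, so a non-zero result can have a zero low byte. The sketch below illustrates the loss. The helper names are hypothetical, the value 256 is chosen purely for illustration, and the result of the narrowing assumes an 8-bit `char` (before C++20 the conversion is formally implementation-defined).

```cpp
#include <cstddef>
#include <cstring>

// Narrowing as in the first version of check(): only the low byte survives.
char narrowToChar(int memcmpResult) {
    return static_cast<char>(memcmpResult);
}

// Uses only what the documentation guarantees: zero means "equal".
bool buffersEqual(const void* a, const void* b, std::size_t size) {
    return std::memcmp(a, b, size) == 0;
}
```

Under those assumptions, `narrowToChar(256)` yields 0, so two different hashes could be reported as equal. This rule follows from reading the documentation, not from statistics over existing code.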

Classic approach: documentation

Why documentation matters
Code example:
    ObjectOutputStream out = new ObjectOutputStream(....);
    SerializedObject obj = new SerializedObject();
    obj.state = 100;
    out.writeObject(obj);
    obj.state = 200;
    out.writeObject(obj);
    out.close();

Why documentation matters
The analyzer suggests:
    ObjectOutputStream out = new ObjectOutputStream(....);
    SerializedObject obj = new SerializedObject();
    obj.state = 100;
    out.writeObject(obj);
    obj = new SerializedObject(); // Add this line
    obj.state = 200;
    out.writeObject(obj);
    out.close();

Why documentation matters
What happens without the edit:
    ObjectOutputStream out = new ObjectOutputStream(....);
    SerializedObject obj = new SerializedObject();
    obj.state = 100;
    out.writeObject(obj); // stores the object with the state = 100
    obj.state = 200;
    out.writeObject(obj); // stores the object with the state = 100
    out.close();

False positives
    std::vector<int> numbers;
    ....
    for (int num : numbers) {
      if (num < 5) {
        numbers.push_back(0);
        break; // or, for example, return
      }
    }

False positives
• The reason for getting a warning may be unclear. The reason for NOT getting a warning may be unclear as well.
• How to fix?
  • Additional learning (will it help?)
  • A mechanism to hide warnings (not universal)

If the analyzer is trained successfully

Promising directions
• Code style by specific symbols
• Collecting additional metrics and information
• Best practices for a specific framework/code base/platform

Download a one-month PVS-Studio trial and check your projects using classic static analysis:
https://pvs-studio.com/en/pvs-studio/download/

Q&A
viva64.com
khanieva@viva64.com
