The document discusses clone detection in Python, identifying duplicated code as a significant issue in software development. It categorizes code clones into four types based on similarity, and outlines various clone detection techniques, including text-based, token-based, syntax-based, and graph-based methods. Additionally, it suggests the use of machine learning to improve clone detection accuracy by analyzing structural and lexical features of code.
Introduction to clone detection in Python presented by Valerio Maggio during a 2013 seminar.
Focuses on the prevalence (5-50%) of duplicated code, its causes, and the need for unification.
Describes various definitions of software clones, identifying different types including exact copy, parameter substituted, structure substituted, and semantic clones.
Lists various clone detection tools and compares detection techniques, including token-based, syntax-based, and graph-based methods. Explores the potential of machine learning to identify code clones through structural and contextual analysis.
Outlines the overall detection process for clones in Python, including preprocessing, extraction, detection, and aggregation.
Discusses empirical evaluation of clone detection precision/recall in Python, highlighting comparisons to previous tools in the field.
Thanks audience for participation, concluding the presentation on clone detection in Python.
DATE: May 13, 2013, Florence, Italy
Clone Detection in Python
Valerio Maggio (valerio.maggio@unina.it)

Introduction
Duplicated Code
Number one in the stink parade is duplicated code. If you see the same code structure in more than one place, you can be sure that your program will be better if you find a way to unify them.

Duplicated Code
‣ Exists: 5% to 30% of code is similar
  • In extreme cases, even up to 50%
    - This is the case of Payroll, a COBOL system
‣ Is often created during development
  • due to time pressure for an upcoming deadline
  • to overcome limitations of the programming language
‣ Three Public Enemies:
  • Copy, Paste and Modify

Part I: Clone Detection

Code Clones
(Def.) “Software Clones are segments of code that are similar according to some definition of similarity” (I.D. Baxter, 1998)
‣ There can be different definitions of similarity, based on:
  • Program Text (text, syntax)
  • Semantics
‣ Four Different Types of Clones

The original one

    # Original Fragment
    def do_something_cool_in_Python(filepath, marker='---end---'):
        lines = list()
        with open(filepath) as report:
            for l in report:
                if l.endswith(marker):
                    lines.append(l)  # Stores only lines that end with "marker"
        return lines  # Return the list of different lines

Type 1: Exact Copy
‣ Identical code segments except for differences in layout, whitespace, and comments

    # Original Fragment
    def do_something_cool_in_Python(filepath, marker='---end---'):
        lines = list()
        with open(filepath) as report:
            for l in report:
                if l.endswith(marker):
                    lines.append(l)  # Stores only lines that end with "marker"
        return lines  # Return the list of different lines

    # Type 1 Clone
    def do_something_cool_in_Python (filepath, marker='---end---'):
        lines = list()  # This list is initially empty

        with open(filepath) as report:
            for l in report:  # It goes through the lines of the file
                if l.endswith(marker):
                    lines.append(l)
        return lines

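One way to check the Type 1 definition above is to compare token streams after dropping comments and layout. The following is only a minimal sketch using the standard `tokenize` module; the function names `normalized_tokens` and `is_type1_clone` are illustrative and do not come from the talk.

```python
import io
import tokenize

# Token types that only carry comments or blank-line layout.
_SKIPPED = (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER)
# Token types whose text depends on layout (indent width, line ending),
# so only their type name is kept.
_STRUCTURAL = (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE)


def normalized_tokens(source):
    """Return the tokens of `source` with comments and layout details removed."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type in _STRUCTURAL:
            result.append(tokenize.tok_name[tok.type])
        else:
            result.append(tok.string)
    return result


def is_type1_clone(source_a, source_b):
    """True if the two fragments differ only in layout, whitespace and comments."""
    return normalized_tokens(source_a) == normalized_tokens(source_b)
```

Block structure (INDENT/DEDENT) is kept on purpose, because in Python indentation is part of the program and not just layout.
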
Type 2: Parameter Substituted Clones
‣ Structurally identical segments except for differences in identifiers, literals, layout, whitespace, and comments

    # Original Fragment
    def do_something_cool_in_Python(filepath, marker='---end---'):
        lines = list()
        with open(filepath) as report:
            for l in report:
                if l.endswith(marker):
                    lines.append(l)  # Stores only lines that end with "marker"
        return lines  # Return the list of different lines

    # Type 2 Clone
    def do_something_cool_in_Python(path, end='---end---'):
        targets = list()
        with open(path) as data_file:
            for t in data_file:
                if t.endswith(end):
                    targets.append(t)  # Stores only lines that end with "end"
        # Return the list of different lines
        return targets

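A possible way to test for Type 2 clones is to parse both fragments and replace every identifier and literal with a placeholder before comparing the trees. This is a rough sketch with the standard `ast` module (Python 3.8+ assumed); `_Normalizer` and `is_type2_clone` are illustrative names, not part of the talk. Because all identifiers collapse to the same placeholder, consistent renaming is not verified.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Replace identifiers and literals with fixed placeholders."""

    def visit_FunctionDef(self, node):
        node.name = "ID"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = "ID"
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="ID", ctx=node.ctx), node)

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value="LIT"), node)


def normalized_dump(source):
    """Textual dump of the AST after identifier/literal normalization."""
    tree = _Normalizer().visit(ast.parse(source))
    return ast.dump(tree)


def is_type2_clone(source_a, source_b):
    """True if the fragments have the same structure up to names and literals."""
    return normalized_dump(source_a) == normalized_dump(source_b)
```

Attribute names such as `endswith` and `append` are left untouched in this sketch, so a clone calling a different method would not match.
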
Type 3: Structure Substituted Clones
‣ Similar segments with further modifications such as changed, added (or deleted) statements, in addition to variations in identifiers, literals, layout and comments

    import os

    def do_something_with(path, marker='---end---'):
        # Check if the input path corresponds to a file
        if not os.path.isfile(path):
            return None

        bad_ones = list()
        good_ones = list()
        with open(path) as report:
            for line in report:
                line = line.strip()
                if line.endswith(marker):
                    good_ones.append(line)
                else:
                    bad_ones.append(line)
        # Return the lists of different lines
        return good_ones, bad_ones

Type 4: “Semantic” Clones
‣ Semantically equivalent segments that perform the same computation but are implemented by different syntactic variants

    # Original Fragment
    def do_something_cool_in_Python(filepath, marker='---end---'):
        lines = list()
        with open(filepath) as report:
            for l in report:
                if l.endswith(marker):
                    lines.append(l)  # Stores only lines that end with "marker"
        return lines  # Return the list of different lines

    # Type 4 Clone
    def do_always_the_same_stuff(filepath, marker='---end---'):
        report = open(filepath)
        file_lines = report.readlines()
        report.close()
        # Filters only the lines ending with marker
        # (list() keeps the result a list on Python 3, where filter() returns an iterator)
        return list(filter(lambda l: len(l) and l.endswith(marker), file_lines))

What are the consequences?
‣ Do clones increase the maintenance effort?
‣ Hypothesis:
  • Cloned code increases code size
  • A fix to a clone must be applied to all similar fragments
  • Bugs are duplicated together with their clones
‣ However: it is not always possible to remove clones
  • Removal of clones is harder if variations exist.

Clone Detection Tools
Duplix, Scorpio, PMD, CCFinder, Dup, CPD, Shinobi, Clone Detective, Gemini, iClones, KClone, ConQAT, Deckard, Clone Digger, JCCD, CloneDr, SimScan, CLICS, NiCAD, Simian, Duploc, Dude, SDD
‣ Syntax Based Tools:
  • Syntax subtrees are compared to each other
‣ Graph Based Tools:
  • (sub)graphs are compared to each other

Clone Detection Techniques
‣ String/Token based Techniques:
  • Pros: Run very fast
  • Cons: Too many false clones
‣ Syntax based (AST) Techniques:
  • Pros: Well suited to detect structural similarities
  • Cons: Not properly suited to detect Type 3 Clones
‣ Graph based Techniques:
  • Pros: The only ones able to deal with Type 4 Clones
  • Cons: Performance issues

The idea: Use Machine Learning, Luke
‣ Use Machine Learning techniques to compute the similarity of fragments by exploiting specific features of the code.
‣ Combine different sources of information
  • Structural Information: ASTs (Abstract Syntax Trees), PDGs (Program Dependence Graphs)
  • Lexical Information: Program Text

Kernel Methods for Structured Data
‣ Well-grounded on solid and awful Math
‣ Based on the idea that objects can be described in terms of their constituent parts
‣ Can be easily tailored to specific domains
  • Tree Kernels
  • Graph Kernels
  • ...

Defining a Kernel for Structured Data
The definition of a new Kernel for a Structured Object requires the definition of:
‣ A set of features to annotate each part of the object
‣ A Kernel function to measure the similarity on the smallest part of the object
  • e.g., Nodes for ASTs and Graphs
‣ A Kernel function to apply the computation on the different (sub)parts of the structured object

Kernel Methods for Clones: Tree Kernels Example on AST
‣ Features: We annotate each node with a set of 4 features
  • Instruction Class, e.g., LOOP, CONDITIONAL_STATEMENT, CALL
  • Instruction, e.g., FOR, IF, WHILE, RETURN
  • Context, i.e., the Instruction Class of the closest statement node
  • Lexemes, i.e., lexical information gathered (recursively) from the leaves

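The four features listed above could be attached to Python AST nodes roughly as follows. This is a hedged sketch, not the annotation scheme of the talk: the `INSTRUCTIONS` table, the `OTHER` fallback class, the `ROOT` context and the `RETURN_STATEMENT` class name are all assumptions made for illustration.

```python
import ast

# Hypothetical mapping: AST node type -> (instruction class, instruction).
INSTRUCTIONS = {
    ast.For: ("LOOP", "FOR"),
    ast.While: ("LOOP", "WHILE"),
    ast.If: ("CONDITIONAL_STATEMENT", "IF"),
    ast.Call: ("CALL", "CALL"),
    ast.Return: ("RETURN_STATEMENT", "RETURN"),  # class name is an assumption
}


def lexemes(node):
    """Identifiers and literals gathered from the subtree rooted at `node`."""
    found = []
    for sub in ast.walk(node):
        if isinstance(sub, ast.Name):
            found.append(sub.id)
        elif isinstance(sub, ast.Constant):
            found.append(repr(sub.value))
    return found


def annotate(node, context="ROOT"):
    """Yield (node, features) pairs in pre-order.

    `context` is the instruction class of the closest enclosing statement.
    """
    default = ("OTHER", type(node).__name__.upper())
    instruction_class, instruction = INSTRUCTIONS.get(type(node), default)
    yield node, {
        "instruction_class": instruction_class,
        "instruction": instruction,
        "context": context,
        "lexemes": lexemes(node),
    }
    # Only statement nodes change the context seen by their descendants.
    child_context = instruction_class if isinstance(node, ast.stmt) else context
    for child in ast.iter_child_nodes(node):
        if isinstance(child, ast.expr_context):
            continue  # Load/Store markers carry no information here
        yield from annotate(child, child_context)
```

For example, in `for x in items: if x: print(x)` the `If` node would get the context `LOOP`, because its closest enclosing statement is the `for` loop.
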
Kernel Methods for Clones: Tree Kernels Example on AST
‣ Kernel Function:
  • Aims at identifying the maximum isomorphic tree/subtree

$$K(T_1, T_2) = \sum_{n \in T_1} \sum_{n' \in T_2} \Delta(n, n') \cdot K_{subt}(n, n')$$

$$K_{subt}(n, n') = sim(n, n') + (1 - \mu) \sum_{(n_1, n_2) \in Ch(n, n')} k(n_1, n_2)$$

[Figure: the ASTs of the two code fragments being compared]

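To make the recursive shape of the kernel function concrete, here is a small sketch over Python ASTs. Several choices are assumptions and not taken from the slides: node similarity `sim` is 1.0 for equal node types and 0.0 otherwise, the selector `delta` keeps only pairs of nodes with the same type, children are paired by position, the recursive term reuses `k_subt`, and the weight `MU` is set to 0.5.

```python
import ast

MU = 0.5  # assumed weight for the children's contribution


def children(node):
    """Child nodes, ignoring Load/Store context markers."""
    return [c for c in ast.iter_child_nodes(node)
            if not isinstance(c, ast.expr_context)]


def sim(n1, n2):
    """Assumed node similarity: 1.0 for equal node types, 0.0 otherwise."""
    return 1.0 if type(n1) is type(n2) else 0.0


def delta(n1, n2):
    """Assumed selector: only pairs of nodes with the same type are compared."""
    return 1.0 if type(n1) is type(n2) else 0.0


def k_subt(n1, n2):
    """Similarity of the two nodes plus the weighted kernel of paired children."""
    paired = zip(children(n1), children(n2))  # children paired by position
    return sim(n1, n2) + (1 - MU) * sum(k_subt(a, b) for a, b in paired)


def all_nodes(source):
    return [n for n in ast.walk(ast.parse(source))
            if not isinstance(n, ast.expr_context)]


def kernel(source_a, source_b):
    """Sum of k_subt over all selected node pairs of the two ASTs."""
    return sum(delta(n1, n2) * k_subt(n1, n2)
               for n1 in all_nodes(source_a)
               for n2 in all_nodes(source_b))
```

With these assumptions, `kernel("x = 1", "y = 2")` is larger than `kernel("x = 1", "pass")`, because the first pair shares a whole assignment subtree while the second shares only the module root.
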
Part II: Clones and Python

The Overall Process Sketch
1. Preprocessing
2. Extraction
3. Detection
4. Aggregation

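The four steps above can be pictured as a small pipeline. The sketch below only illustrates the flow and makes several assumptions that are not in the slides: fragments are function definitions, the similarity in the detection step is a `difflib` ratio over node type names (a stand-in for the kernel), and the `threshold` value is arbitrary.

```python
import ast
import difflib
import itertools


def preprocess(source):
    """Step 1: parse the source text into an AST."""
    return ast.parse(source)


def extract(tree):
    """Step 2: extract candidate fragments (here: function definitions)."""
    return [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]


def _shape(node):
    return [type(n).__name__ for n in ast.walk(node)]


def detect(fragments, threshold=0.9):
    """Step 3: report pairs of fragments whose similarity reaches the threshold."""
    pairs = []
    for a, b in itertools.combinations(fragments, 2):
        score = difflib.SequenceMatcher(
            None, _shape(a), _shape(b), autojunk=False
        ).ratio()
        if score >= threshold:
            pairs.append((a.name, b.name))
    return pairs


def aggregate(pairs):
    """Step 4: merge clone pairs that share a fragment into clone groups."""
    groups = []
    for a, b in pairs:
        touching = [g for g in groups if a in g or b in g]
        merged = {a, b}.union(*touching)
        groups = [g for g in groups if g not in touching] + [merged]
    return groups
```

A run would chain the steps as `aggregate(detect(extract(preprocess(source))))`.
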
Empirical Evaluation
‣ Comparison with another (pure) AST-based tool: Clone Digger
  • It has been the first clone detector for and in Python :-)
  • Presented at EuroPython 2006
‣ Comparison on a system with randomly seeded clones
‣ Results refer only to Type 3 Clones
‣ On Type 1 and Type 2 we got the same results

Precision/Recall Plot
[Figure: Precision, Recall and F-Measure (F1) curves; horizontal axis from 0.6 to 0.98, vertical axis from 0 to 1.00]
‣ Precision: How accurate are the obtained results? (Altern.) How many errors do they contain?
‣ Recall: How complete are the obtained results? (Altern.) How many clones have been retrieved w.r.t. Total Clones?

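The three measures in the plot follow the standard definitions and can be computed from the set of detected clone pairs and the set of known (seeded) clone pairs. A minimal sketch, where the function name and the representation of clones as pairs of names are assumptions:

```python
def precision_recall_f1(detected, actual):
    """Return (precision, recall, F1) for detected vs. known clone pairs."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    # Precision: fraction of reported clones that are real clones.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: fraction of real clones that were reported.
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.
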
Is Python less clone prone?
Roy et al., IWSC, 2010