{"url": "http://mondego.ics.uci.edu/projects/SourcererCC/", "content": "\n\n\nCode Clone Detection \n\n\n\n\n\t\t\n\t\t\n\t\t\t\n\t\t\t\t\t\t\n\n\t\t\t\n\t\t\t\t\n \n \n\t\t
\n\t\t\t

SourcererCC: Scaling Type-3 Clone Detection to Large Software Repositories

Team @UC Irvine: Hitesh Sajnani, Vaibhav Saini, Cristina Lopes

Team @University of Saskatchewan: Jeff Svajlenko, Chanchal Roy

Project Description

Given the availability of large-scale source-code repositories, there have been many applications for clone detection. Unfortunately, despite a decade of active research, there is a marked lack of clone detectors that scale to large software repositories, in particular for detecting near-miss clones, where significant editing activity may take place in the cloned code.
We present SourcererCC, a token-based clone detector that targets the first three clone types and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. SourcererCC uses an optimized inverted index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, and the number of token comparisons needed to judge a potential clone.
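The index-and-filter scheme just described can be sketched as follows. This is a minimal illustration, not the tool itself: SourcererCC is written in Java, works on token bags (multisets) rather than the plain sets below, and every name here is invented for the example.

```python
import math
from collections import Counter, defaultdict

def prefix_length(size, theta):
    # A clone pair at threshold theta must share ceil(theta * size) tokens,
    # so two blocks in a fixed global token order must share at least one
    # token within their first size - ceil(theta * size) + 1 tokens; only
    # that sub-block ever needs to be indexed.
    return size - math.ceil(theta * size) + 1

def detect_clones(blocks, theta=0.8):
    # blocks: {block_id: set of tokens}. Sets keep the sketch short; the
    # real tool uses token bags. Returns unordered clone pairs.
    freq = Counter(t for toks in blocks.values() for t in toks)
    rare_first = lambda toks: sorted(toks, key=lambda t: (freq[t], t))
    index = defaultdict(set)   # token -> ids whose sub-block contains it
    pairs = set()
    for bid in sorted(blocks, key=lambda b: len(blocks[b])):
        toks = rare_first(blocks[bid])
        candidates = set()
        for t in toks[:prefix_length(len(toks), theta)]:
            candidates |= index[t]   # query the inverted index ...
            index[t].add(bid)        # ... then register this block
        for cid in candidates:       # verify each candidate exactly
            need = math.ceil(theta * max(len(blocks[bid]), len(blocks[cid])))
            if len(blocks[bid] & blocks[cid]) >= need:
                pairs.add(frozenset((bid, cid)))
    return pairs
```

With theta = 0.8, two five-token blocks qualify as clones only if they share at least four tokens, and only the two rarest tokens of each block ever enter the index, which is what shrinks both the index and the number of candidate comparisons.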
We evaluate the scalability, execution time, recall, and precision of SourcererCC, and compare it to four publicly available state-of-the-art tools. To measure recall, we use two recent benchmarks: (1) an exhaustive benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find that SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (250 MLOC) using a standard workstation.

Tool Download and Usage

To run the tool, follow the steps below:

A. Generating the input file for the project in which you want to detect clones

  1. Click here to download the input generator for the clone detector (ast.zip).
  2. Unzip ast.zip and import the project ast into your Eclipse workspace.
  3. Run it as an "Eclipse Application". This opens another Eclipse instance, into which you import the projects for which you want to generate the input file.
  4. After importing the projects into the workspace of the new Eclipse instance, click "Sample Menu" in the top menu bar, then click "Sample command" to run. This generates the output (the desired input file) in the path specified by the variable "outputdirPath".
  5. Note that you will have to change the output directory on line 61 of SampleHandler.java, i.e. this.outputdirPath = "/Users/vaibhavsaini/Documents/codetime/repo/ast/output/";, to your desired output directory.
  6. The generated input file name has the format <ProjectName>-clone-INPUT.txt. For example, if your project name is jython, the generated input file will be jython-clone-INPUT.txt.
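The naming convention in the last step can be captured in a one-line helper (purely illustrative; the Eclipse plug-in derives this name itself):

```python
def input_file_name(project_name):
    # Convention from step 6: <ProjectName>-clone-INPUT.txt
    return project_name + "-clone-INPUT.txt"

print(input_file_name("jython"))  # jython-clone-INPUT.txt
```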
B. Running the clone detection tool on the generated input file

  1. Click here to download the CloneDetector (tool.zip).
  2. Unzip tool.zip and navigate to tool/ in a terminal.
  3. Copy the input file generated above (<ProjectName>-clone-INPUT.txt) into the input/dataset directory.
  4. Open cd.sh and assign <ProjectName> as the value of the variable arrayname (line #5). For example, if your generated input file is jython-clone-INPUT.txt, line #5 should read arrayname=(jython).
  5. Execute the command ./cd.sh
C. Generated output

  1. The generated output will be in the ./output folder.
  2. Files with the .txt extension contain the computed clones; files with the .csv extension contain the time taken to detect the clones.
D. Source Code

The source code of SourcererCC can be found here on GitHub.

E. SourcererCC-I

SourcererCC-I is an interactive version of the tool, integrated with the Eclipse IDE, that helps developers instantly find clones during software development and maintenance.

A short video of SourcererCC-I in action can be found here, and a link to install the Eclipse plug-in is available here.

Precision data as reported in the paper

We randomly selected 390 clone pairs detected by SourcererCC for manual inspection. This is a statistically significant sample with a 95% confidence level and a +/- 5% confidence interval. We split the validation effort across three clone experts, which prevents any one judge's personal subjectivity from influencing the entire measurement. The judges found 355 pairs to be true positives and 35 to be false positives, for a precision of 91%.

Reviewer | True Positives | False Positives
Judge 1  | TP-1           | FP-1
Judge 2  | TP-2           | FP-2
Judge 3  | TP-3           | FP-3
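The sample size and precision figures above can be sanity-checked with the usual normal-approximation sample-size formula (an assumption on our part; the page does not state which formula was used):

```python
import math

def required_sample(z=1.96, margin=0.05, p=0.5):
    # n = z^2 * p * (1 - p) / margin^2, with worst-case proportion p = 0.5;
    # z = 1.96 corresponds to a 95% confidence level.
    return math.ceil(z * z * p * (1 - p) / margin ** 2)

print(required_sample())      # 385, so inspecting 390 pairs suffices
print(round(355 / 390, 2))    # 0.91, the reported 91% precision
```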

Effectiveness of Filtering Heuristics (Figure 1 in paper)

The effectiveness of the filtering heuristics in eliminating candidate comparisons is demonstrated on 35 open-source Apache Java projects. These projects vary in size and span various domains, including search and database systems, server systems, distributed systems, machine learning and natural language processing libraries, network systems, etc. Most of these subject systems are highly popular in their respective domains. Subject systems exhibiting such variety in size and domain help counter a potential bias of our study towards any specific kind of software system.

The details of the projects, including project name, size, and number of methods, are reported in Table II below. Column 3 (# Methods) shows the total number of methods (total), the number of methods remaining after removing those smaller than 25 tokens (>25 tokens), and the number of methods that are not exact duplicates (unique). Column 5 (Time Taken), Column 6 (# Candidates), and Column 7 (Terms Compared) show the time taken to detect clones, the number of candidates compared, and the total number of tokens compared for:
(i) Naive - no filtering heuristics;
(ii) Prefix - the Sub-block filtering heuristic; and
(iii) Pos - both the Sub-block and Token Position filtering heuristics together.
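A hedged sketch of how token-position information can cut a candidate comparison short (names invented; the inputs are token lists sorted under one shared total order, duplicates removed for brevity):

```python
import math

def verify(x, y, theta=0.8):
    # x, y: token lists sorted under the same global order.
    # Returns the overlap if the pair meets the threshold, else 0.
    need = math.ceil(theta * max(len(x), len(y)))
    i = j = matched = 0
    while i < len(x) and j < len(y):
        # Position filter: tokens already passed can never match later, so
        # the final overlap is at most matched + min(remaining lengths).
        if matched + min(len(x) - i, len(y) - j) < need:
            return 0               # threshold unreachable; abort early
        if x[i] == y[j]:
            matched, i, j = matched + 1, i + 1, j + 1
        elif x[i] < y[j]:
            i += 1
        else:
            j += 1
    return matched if matched >= need else 0
```

In the worst case the loop still scans both lists, but a typical non-clone is rejected after a handful of token comparisons, which is the effect the Pos variant above is meant to capture.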

The tabulated data is also charted below. The horizontal axis shows the 35 subject systems sorted by the number of methods they contain (smallest on the left). The vertical axis shows the performance metric value. The black circles, red triangles, and green plus marks show the performance metric values when no filtering is applied, when only sub-block filtering is applied, and when both sub-block and token-position filtering are applied, respectively.
\n\n", "encoding": "ascii"}