Threshold-free Code Clone Detection for a Large-scale Heterogeneous Java Repository


Code clones are unavoidable in software ecosystems, and a variety of clone-detection algorithms are available for finding them. For Type-3 clone detection at method granularity (i.e., similar methods with changes in statements), the dissimilarity threshold is one of the possible configuration parameters. Existing approaches use a single threshold to detect Type-3 clones across a repository. However, our study shows that detecting Type-3 clones at method granularity on a large-scale heterogeneous repository often requires multiple thresholds. We find that clone-detection performance improves when different thresholds are selected for different groups of clones in a heterogeneous repository (i.e., various applications). In this paper, we propose a threshold-free approach to detect Type-3 clones at method granularity across a large number of applications. Our approach uses an unsupervised learning algorithm, i.e., k-means, to determine true and false clones. We use a clone benchmark with 330,840 tagged clones from 24,824 open source Java projects for our study. We observe that our approach improves performance significantly, by 12% in terms of F-measure. Furthermore, our threshold-free approach eliminates practitioners' concern about possible misconfiguration of Type-3 clone detection tools.
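The idea of replacing a fixed threshold with clustering can be illustrated with a minimal sketch. The abstract does not say which features the k-means step operates on, so this sketch assumes the simplest case: one dissimilarity score per candidate clone pair, clustered separately for each application with k = 2. The class name `ThresholdFreeSketch`, the method `labelClones`, and the scores in `main` are hypothetical and made up for illustration; this is not the authors' implementation.

```java
import java.util.Arrays;

final class ThresholdFreeSketch {

    /**
     * Runs 1-D k-means with k = 2 over the dissimilarity scores of the
     * candidate clone pairs of one application (an assumed feature choice).
     * Returns true for each pair assigned to the low-dissimilarity cluster,
     * which this sketch treats as the "true clone" cluster.
     */
    public static boolean[] labelClones(double[] dissimilarity) {
        if (dissimilarity.length == 0) {
            return new boolean[0];
        }
        // Seed the two centroids with the smallest and largest score.
        double low = Arrays.stream(dissimilarity).min().getAsDouble();
        double high = Arrays.stream(dissimilarity).max().getAsDouble();
        boolean[] isClone = new boolean[dissimilarity.length];

        for (int iter = 0; iter < 100; iter++) {
            double sumLow = 0, sumHigh = 0;
            int nLow = 0, nHigh = 0;
            // Assignment step: each score joins its nearest centroid.
            for (int i = 0; i < dissimilarity.length; i++) {
                double d = dissimilarity[i];
                isClone[i] = Math.abs(d - low) <= Math.abs(d - high);
                if (isClone[i]) {
                    sumLow += d;
                    nLow++;
                } else {
                    sumHigh += d;
                    nHigh++;
                }
            }
            // Update step: move each centroid to the mean of its cluster.
            double newLow = nLow > 0 ? sumLow / nLow : low;
            double newHigh = nHigh > 0 ? sumHigh / nHigh : high;
            if (newLow == low && newHigh == high) {
                break; // centroids stopped moving
            }
            low = newLow;
            high = newHigh;
        }
        return isClone;
    }

    public static void main(String[] args) {
        // Made-up scores for two hypothetical applications.
        double[] appA = {0.05, 0.10, 0.12, 0.30, 0.35};
        double[] appB = {0.30, 0.35, 0.38, 0.80, 0.85};
        System.out.println(Arrays.toString(labelClones(appA)));
        System.out.println(Arrays.toString(labelClones(appB)));
    }
}
```

In this toy data, a score of 0.30 falls in the high-dissimilarity cluster for the first application and in the low-dissimilarity cluster for the second. That is the situation the abstract describes, where no single repository-wide threshold suits every group of clones.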



Iman Keivanloo, Feng Zhang, Ying Zou, Threshold-free Code Clone Detection for a Large-scale Heterogeneous Java Repository, Proceedings of the 22nd IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER'15), 2015, Montreal, Quebec, Canada. (Acceptance Rate = 27%)


 title={Threshold-free code clone detection for a large-scale heterogeneous Java repository},
 author={Keivanloo, Iman and Zhang, Feng and Zou, Ying},
 booktitle={Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on},
 keywords={Java;public domain software;F-measure;clone benchmark;dissimilarity threshold;heterogeneous Java repository;method granularity;open source Java projects;software ecosystems;threshold-free code clone detection algorithms;type-3 clone detection tools;Benchmark testing;Cloning;Clustering algorithms;Google;Java;Optimization methods;Software systems;clone detection;clone search;clustering;large-scale repository;threshold-free;unsupervised learning},