Incremental Clustering of Source Code: A Machine Learning Approach

Abstract: Technical debt at the architectural level is a severe threat to software development projects. Technical debt that is allowed to accumulate unchecked hinders speedy development and maintenance, introduces bugs and other problems in the software product, and may ultimately result in the abandonment of the source code. Debt accumulation can be detected by analyzing the source code against the intended modules in the software architecture. However, this is seldom done in practice, since it requires a correct and up-to-date mapping from source code to the intended modules. Creating and maintaining this mapping takes significant manual effort and is often considered too costly and laborious. We investigate how to automate the mapping from source code to intended modules. State-of-the-art approaches treat this as an incremental clustering problem, in which source code entities are clustered to the intended modules based on some similarity measure. As the system evolves and source code entities are added or modified, the clustering needs to be updated. The state-of-the-art techniques determine similarity based on either syntactic or semantic features, e.g., dependencies or identifier names. These features are tuned by large sets of parameters, e.g., weights for various types of dependencies, and the parameters have a significant impact on how well the clustering performs. Unfortunately, we have not been able to identify any heuristics that help human experts determine a good set of parameters for a given system. Based on the parameters determined by, e.g., genetic optimization, it seems unlikely that general heuristics exist. Instead, we compute the similarity using a multinomial naïve Bayes text classifier trained on tokens from the source code entities. We also include a novel feature that captures dependencies as text, which adds syntactic information.
Our classifier, which relies on significantly fewer parameters, outperforms the state-of-the-art techniques even when their parameters are set to near-optimal values. We find that machine learning provides better mapping performance with fewer required parameters, and that syntactic information can be combined with semantic information without additional parameters. We provide an open-source tool suite with a reference implementation of the different techniques and a curated set of systems that can act as a ground-truth benchmark.
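
The general idea of mapping entities with a multinomial naïve Bayes text classifier, with dependencies encoded as text, can be illustrated with a minimal Python sketch. This is not the thesis's reference implementation. The `tokenize` helper, the `dep_` prefix used to turn a dependency into a token, the Laplace smoothing constant `alpha`, and the toy entities and module names (`persistence`, `ui`) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(source_text, dependencies=()):
    """Split identifiers into lowercase word tokens and append one synthetic
    token per dependency (hypothetical 'dep_<target>' encoding)."""
    words = re.findall(r"[A-Z]?[a-z]+", source_text)
    tokens = [w.lower() for w in words]
    tokens += ["dep_" + d.lower() for d in dependencies]
    return tokens


class MultinomialNaiveBayes:
    """Multinomial naive Bayes over token counts with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # smoothing constant (assumed)
        self.token_counts = defaultdict(Counter)  # module -> token counts
        self.entity_counts = Counter()            # module -> mapped entities
        self.vocabulary = set()

    def fit(self, token_lists, modules):
        for tokens, module in zip(token_lists, modules):
            self.token_counts[module].update(tokens)
            self.entity_counts[module] += 1
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, tokens):
        total_entities = sum(self.entity_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for module, counts in self.token_counts.items():
            # Log prior: share of already-mapped entities in this module.
            score = math.log(self.entity_counts[module] / total_entities)
            denom = sum(counts.values()) + self.alpha * vocab_size
            for token in tokens:
                # Tokens never seen during training are skipped.
                if token in self.vocabulary:
                    score += math.log((counts[token] + self.alpha) / denom)
            scores[module] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy ground truth: (source text, dependency targets, intended module).
mapped = [
    ("class OrderRepository { saveOrder loadOrder }", ["Database"], "persistence"),
    ("class CustomerTable { insertRow deleteRow }", ["Database"], "persistence"),
    ("class OrderView { renderOrder drawButton }", ["Widget"], "ui"),
    ("class LoginDialog { showDialog drawButton }", ["Widget"], "ui"),
]

model = MultinomialNaiveBayes().fit(
    [tokenize(src, deps) for src, deps, _ in mapped],
    [module for _, _, module in mapped],
)

# A newly added entity is mapped to the highest-scoring module.
new_entity = tokenize("class InvoiceRepository { saveInvoice }", ["Database"])
predicted = model.predict(new_entity)
print(predicted)
```

In this sketch the only tunable value is the smoothing constant, and the dependency tokens enter the model exactly like identifier tokens, so combining the two kinds of information needs no extra weights. When new entities are mapped, calling `fit` again with them updates the counts incrementally.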
