Towards Higher Code Quality in Scientific Computing

Abstract: In scientific computing and data science, computer programs employing mathematical and statistical models are used to obtain knowledge in different application domains. The results of these programs form the basis of, among other things, scientific papers and important decisions that may, for example, affect people's health. Consequently, the correctness of the programs is of great importance. To reduce the risk of defects in the source code, and to avoid wasting human resources, it is important that the code is maintainable, i.e. not unnecessarily hard to analyze, test, modify or reuse. For these reasons, this thesis strives towards increased maintainability and correctness in code bases for scientific computing and data science.

Object-oriented programming is a programming paradigm that facilitates writing maintainable code by providing mechanisms for reuse and for dividing code into smaller components with restricted access to each other's data. Further, it makes it possible to extend a code base without changing the existing code, which increases flexibility and decreases the risk of breaking existing functionality. However, in many cases the benefits of object-orientation come at the cost of performance. For some scientific computing programs, performance is essential, e.g. because the results are unusable if they are produced too late. In the first part of this thesis, it is shown that object-oriented programming can be used to improve the quality of an important group of scientific computing programs, with only a small impact on performance.

The aim of the second part of the thesis is to contribute to the understanding of, and to improve the quality of, source code for data science. A large corpus of Jupyter notebooks is studied; Jupyter is a tool frequently used by data scientists for writing code. The results presented suggest that cloned code, i.e. identical or close to identical code that recurs in different places, is common in Jupyter notebooks. Code cloning is important from a maintenance perspective as well as for research. Additionally, the most frequently called library functions in Python, the language used in the vast majority of the notebooks, are studied. A large number of parameter combinations are identified for which it is possible to pass values that may lead to unexpected behavior or decreased maintainability. The existence and consequences of occurrences of such combinations of values in the corpus are evaluated. To reduce the risk of future defects in source code calling these functions, improvements are suggested to the developers of the functions.
