Deciphering the complexity of biological systems using machine learning applied to big data

Abstract: With the rise of the omics era and recent technological advances in biology and medicine, the volume of generated data is growing exponentially. This growth, together with the complexity and heterogeneity of the data, creates challenges in data analysis, information extraction, and interpretation of results. This thesis aims to reveal the hidden structures and knowledge in complex biological data and to make use of them in fundamental biological and clinical studies using machine learning techniques and pipelines. Specifically, it aims to map the subcellular localization of proteins by building a mass spectrometry (MS)- and machine learning-based pipeline, to automate that pipeline by developing a package, to reveal the proteome subtypes of non-small cell lung cancer (NSCLC), and to build classifier pipelines for NSCLC subtyping. In study I we developed an MS- and machine learning-based pipeline to generate a proteome-wide resource of protein subcellular localization across multiple human cancer cell lines. Furthermore, we analyzed protein domain- and variant-dependent localization, and tested proteome-wide relocalization driven by EGFR inhibition. In study II we performed in-depth quantitative molecular phenotyping of an NSCLC cohort and used unsupervised learning-based clustering to reveal six subtypes with distinct mutation, immune infiltration, and proliferation profiles. Additionally, supervised learning-based classifiers (support vector machine and k-top scoring pairs) were built and validated on independent datasets. In study III we present comprehensive wet-lab and dry-lab protocols for performing protein subcellular localization analysis. Specifically, an R package was developed for the downstream analysis, including preprocessing of the MS output data, building the classifier, and visualizing the output.
In study IV we prototype and develop an MS-based, peptide-centric machine learning classifier for subtyping and stratification of NSCLC patients for improved therapy selection. In summary, the work presented in this thesis shows how to strategically analyze big and complex biological and clinical data to extract useful biological insights. Furthermore, pipelines for data-driven approaches, including machine learning and data science methods, were developed to understand the biology and to transfer that knowledge into the clinic. The development of such methods and pipelines for the analysis of big biological data will indisputably be a core element of a range of applications in biology and medicine.
