Variational methods for phylogeny and single-cell genomics

Abstract: The investigation of the evolutionary history of organisms, both at the cellular level and at the species level, is an active research topic in computational biology. These investigations lead to a deeper understanding of developmental history, cancer progression, the genetic similarity of species, and more. One way to study the relations between single cells or species is to examine the differences in their genomes, including single nucleotide and copy number variations. The genetic material must be extracted and sequenced before it can be analyzed, and this data preparation is prone to errors. Probabilistic models are therefore essential for handling technological artifacts and for carrying uncertainty through the analysis. In this compilation thesis, we present four papers, each addressing a different challenge. First, we focused on single cells from healthy tissue and developed a probabilistic model to reconstruct the cell lineage tree. This task is challenging in several respects: i) healthy cells have a low mutation rate and therefore acquire few mutations at each cell division, ii) healthy cells usually lack significant structural variations that could aid the analysis, and iii) the sequencing technology introduces errors, some of which are hard to distinguish from mutations. In experimental studies, we showed that our model is fast and robust and that it reconstructs lineage trees accurately. Second, we focused on cancer cells. One research topic is identifying structural variations in the cancer cells' genomes and subsequently grouping the cells with similar genome profiles. This two-step process is fragile: errors made in the first step can irreversibly affect the analysis in the second step. To address this problem, we developed a variational inference-based model that performs copy number profiling and cell clustering simultaneously.
In addition, we extended the model to incorporate single nucleotide variations to improve performance. Third, we addressed the phylogenetic tree inference problem with a variational inference-based model. The tree topology space, which contains all possible phylogenetic tree structures, is enormous, and considering every tree individually is intractable. Existing variational inference-based methods typically need to restrict their analysis to a much smaller subset of the tree space. Our proposed model does not require such constraints and achieves similar performance while requiring significantly less time and memory. Finally, we addressed a challenge in variational inference itself. Variational inference methods target a complex, usually multimodal posterior distribution and approximate it with simpler, often unimodal distributions. As a result, a variational model tends to fit only one of the many modes of the target distribution and fails to capture its overall shape. We proposed a simple yet effective way to use separately trained variational models to capture the multimodality of the target distribution and demonstrated the approximation performance using several variational methods and data types. With these four papers, we addressed several challenges in computational biology and contributed to the progress of the field by developing probabilistic models.
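
To give a sense of the size of the tree topology space mentioned above, the short Python sketch below counts labeled binary tree topologies using the standard double-factorial formulas: (2n-5)!! unrooted and (2n-3)!! rooted topologies for n leaves. The function names are illustrative assumptions and are not taken from the thesis; the sketch only counts topologies and performs no inference.

```python
from math import prod


def num_unrooted_topologies(n_leaves: int) -> int:
    """Count unrooted binary tree topologies on n labeled leaves: (2n-5)!!."""
    if n_leaves < 1:
        raise ValueError("n_leaves must be at least 1")
    # Product of the odd numbers 1, 3, ..., 2n-5 (an empty product is 1).
    return prod(range(1, 2 * n_leaves - 4, 2))


def num_rooted_topologies(n_leaves: int) -> int:
    """Count rooted binary tree topologies on n labeled leaves: (2n-3)!!."""
    if n_leaves < 1:
        raise ValueError("n_leaves must be at least 1")
    # Product of the odd numbers 1, 3, ..., 2n-3 (an empty product is 1).
    return prod(range(1, 2 * n_leaves - 2, 2))


if __name__ == "__main__":
    for n in (4, 10, 20, 50):
        count = num_unrooted_topologies(n)
        # Report the number of decimal digits to show how fast the count grows.
        print(f"{n} leaves: {len(str(count))}-digit number of unrooted topologies")
```

Ten leaves already give 2,027,025 unrooted topologies, which illustrates why enumerating every tree is not feasible for realistic data sets.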