Dimension Reduction and Signal Decomposition for Genotype–Phenotype Relations

Abstract: Over the last few decades, DNA sequencing has developed from costing billions of dollars to obtain the complete sequence of the human genome to being a routine procedure performed in labs all around the world. This has transformed the field of experimental biology, since measurements can be made at a level of detail that was not possible before. Still, the relationship between genotype and low-level cellular processes on the one hand, and high-level phenotypic traits on the other, tends to be very complex; measuring does not equal understanding. In the large data sets that are being gathered, it is often hard to uncover patterns that are truly meaningful and not just arising by random chance.

In this work, we present novel methods for representing, exploring and visualizing genotype-phenotype data sets, with a particular focus on tracking changes driven by evolutionary processes as they occur. One challenge is to be able to quickly search for specific patterns in data coming from large genomes. We have adapted algorithms and data structures from the field of Information Retrieval, relying on inherent genomic structure to make searches efficient. In Paper I, we showcase these techniques with visualization of gene fusions in a study of paediatric B-cell precursor acute lymphoblastic leukaemia.

The complexity of biological processes, taken together with the fact that high-throughput measurements, such as DNA/RNA sequencing data, measure many different things at once, means that these data sets will often contain multiple overlaid signals. If data is collected in the field, rather than produced entirely under controlled conditions in the lab, such overlaid signals are practically unavoidable. In Paper III, we present SMSSVD (SubMatrix Selection Singular Value Decomposition), a parameter-free unsupervised signal decomposition and dimension reduction method, particularly useful for data sets with many variables. By adaptively reducing the noise for each signal, SMSSVD creates a representation with many desirable properties inherited from the ordinary SVD, while being able to discover signals closer to the limit of detection.

In Paper II and Paper IV, we describe models for representing genetically related but still heterogeneous microbial populations and show how the composition of the population determines the interaction with the host. The DISSEQT pipeline (DIStribution-based SEQuence space Time dynamics), developed in Paper IV, covers the entire workflow from read alignment to visualization of results. We model each population as a positive measure over sequence space and apply SMSSVD to get a robust representation. Using our model, we follow and visualize the evolutionary trajectories of the populations through time, highlighting important emerging minority variants. Finally, we demonstrate the relevance of our population model by showing that it can accurately predict the population fitness, whereas a model based on the consensus sequence fails.
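
To illustrate why a distribution-based population model can carry information that a consensus sequence discards, the following is a minimal sketch and not the DISSEQT or SMSSVD implementation. It rests on several assumptions that are not from the thesis: the toy reads are invented, each population is summarized only by per-position nucleotide frequencies (a simplification of a measure over sequence space), and the ordinary SVD from NumPy stands in for SMSSVD. The names `position_frequencies` and `consensus` are hypothetical helpers.

```python
import numpy as np

ALPHABET = "ACGT"

def position_frequencies(reads):
    """Per-position nucleotide frequencies for equal-length reads.

    Assumption: this marginal summary is only a stand-in for the
    population model described in the text.
    """
    length = len(reads[0])
    counts = np.zeros((length, len(ALPHABET)))
    for read in reads:
        for pos, base in enumerate(read):
            counts[pos, ALPHABET.index(base)] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def consensus(freqs):
    """Most frequent nucleotide at each position."""
    return "".join(ALPHABET[i] for i in freqs.argmax(axis=1))

# Invented toy populations: same majority sequence, different minority variants.
pop_a = ["ACGT"] * 9 + ["ACGA"] * 1
pop_b = ["ACGT"] * 6 + ["ACGA"] * 4
pop_c = ["ACGT"] * 10
populations = [pop_a, pop_b, pop_c]

freqs = [position_frequencies(p) for p in populations]
consensus_seqs = [consensus(f) for f in freqs]

# One row per population, one column per (position, nucleotide) pair.
X = np.stack([f.ravel() for f in freqs])

# Ordinary SVD of the centered matrix (stand-in for SMSSVD).
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = U * s  # low-dimensional coordinates, one row per population

print(consensus_seqs)
print(coords[:, 0])
```

In this toy setting, all three populations share the consensus `ACGT`, so a consensus-based description cannot tell them apart, while the first coordinate of the frequency-based representation separates them by the share of the minority variant.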
