Development of New Methods for Inferring and Evaluating Phylogenetic Trees

Detta är en avhandling från Uppsala : Acta Universitatis Upsaliensis

Sammanfattning: Inferring phylogeny is a difficult computational problem. Heuristics are necessary to minimize the time spent evaluating non optimal trees. In paper I, we developed an approach for heuristic searching, using a genetic algorithm. Genetic algorithms mimic the natural selections ability to solve complex problems. The algorithm can reduce the time required for weighted maximum parsimony phylogenetic inference using protein sequences, especially for data sets involving large number of taxa. Evaluating and comparing the ability of phylogenetic methods to infer the correct topology is complex. In paper II, we developed software that determines the minimum subtree prune and regraft (SPR) distance between binary trees to ease the process. The minimum SPR distance can be used to measure the incongruence between trees inferred using different methods. Given a known topology the methods could be evaluated on their ability to infer the correct phylogeny given specific data. The minimum SPR software the intermediate trees that separate two binary trees. In paper III we developed software that given a set of incongruent trees determines the median SPR consensus tree i.e. the tree that explains the trees with a minimum of SPR operations. We investigated the median SPR consensus tree and its possible interpretation as a species tree given a set of gene trees. We used a set of ?-proteobacteria gene trees to test the ability of the algorithm to infer a species tree and compared it to previous studies. The results show that the algorithm can successfully reconstruct a species tree.Expressed sequence tag (EST) data is important in determining intron-exon boundaries, single nucleotide polymorphism and the coding sequence of genes. In paper IV we aligned ESTs to the genome to evaluate the quality of EST data. The results show that many ESTs are contaminated by vector sequences and low quality regions. The reliability of EST data is largely determined by the clustering of the ESTs and the association of the clusters to the correct portion of genome. We investigate the performance of EST clustering using the genome as template compared to previously existing methods using pair-wise alignments. The results show that using the genome as guidance improves the resulting EST clusters in respect to the extent ESTs originating from the same transcriptional unit are separated into disjunct clusters.