Methods and applications in DNA sequence alignments

This is a dissertation from Stockholm: Karolinska Institutet, Department of Cell and Molecular Biology

Abstract: DNA sequence alignment is one of the most common bioinformatics tasks. Alignment analysis for eukaryotic genomes is challenging because the datasets are large. Repeat sequences also make the analysis difficult. This thesis describes new methods which we have developed for DNA sequence alignment that address these problems. We have applied these new methods in chicken and Trypanosoma cruzi genome analysis projects, and this publication also describes the results from these projects. Most alignment programs use a seed-and-extend method, where subsequences (seeds) are used to locate potential alignments, which are then verified. There is a tradeoff between sensitivity and specificity in the seeding process, as short seeds are inefficient in eliminating spurious matches and long seeds are more likely to omit true alignments in the presence of sequencing errors and polymorphisms. We developed an approximate seed matching algorithm which reduces the impact of this tradeoff by allowing mismatches within the seeds. Approximate seed matching allows the use of long seeds, which results in high specificity in the seeding and a faster alignment program. At the same time, sequencing errors and polymorphisms between the sequences do not reduce sensitivity. The chicken is both an important agricultural source of protein and a model organism in biological research. The genome sequencing of the wild ancestor of domestic chickens has offered an opportunity to study genetic factors involved in domestication. Sequences from three domestic chicken breeds were available for comparison to the genome sequence. We used these data to find signs of selective sweeps between wild and domestic chickens by searching for regions with low diversity within domestic breeds. The results showed no evidence of large, domestic-specific sweeps. These findings indicate substantial sequence variation within chicken breeds.
Copy number variation is emerging as an important source of genotypic and phenotypic variation in humans. We investigated the presence of such structural variation in the chicken genome through array comparative genomic hybridization of different chicken breeds. The results show extensive copy number variation, in some cases unique to domestic chickens. Trypanosoma cruzi is a protozoan parasite which causes Chagas disease. It has interesting biological features, including a genome structure with many repeated genes. Genes are often repeated in tandem arrays, including surface antigen genes and housekeeping genes. The genome assembly shows numerous gaps and collapsed gene copies. We investigated the copy number of the annotated genes and found the gene content of T. cruzi to be even more repetitive than previously thought. The genome analysis studies described in this thesis validated the DNA sequence alignment methods we have developed, and have provided important information for the chicken and T. cruzi research communities.
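
The idea of approximate seed matching mentioned in the abstract can be illustrated with a small Python sketch. This is not the algorithm developed in the thesis, whose details the abstract does not give. It is a naive illustration that assumes a plain k-mer index and enumerates every seed variant within a fixed number of substitutions. The function names (build_index, neighbors, approximate_seed_hits) and the parameters k and max_mismatches are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from itertools import combinations, product

ALPHABET = "ACGT"


def build_index(reference, k):
    """Map every k-mer in the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def neighbors(seed, max_mismatches):
    """Yield each distinct string within max_mismatches substitutions of seed."""
    seen = set()
    for n in range(max_mismatches + 1):
        for positions in combinations(range(len(seed)), n):
            for letters in product(ALPHABET, repeat=n):
                chars = list(seed)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                candidate = "".join(chars)
                if candidate not in seen:
                    seen.add(candidate)
                    yield candidate


def approximate_seed_hits(query, index, k, max_mismatches=1):
    """Return sorted (query_pos, ref_pos) pairs whose k-mers differ by
    at most max_mismatches substitutions (illustrative, not optimized)."""
    hits = set()
    for q in range(len(query) - k + 1):
        for candidate in neighbors(query[q:q + k], max_mismatches):
            # .get avoids inserting empty entries into the defaultdict
            for r in index.get(candidate, ()):
                hits.add((q, r))
    return sorted(hits)


# Toy data: the query differs from the reference 8-mer at position 4
# by a single substitution, so an exact 8-mer seed would miss it.
reference = "TTTTACGTGCATGGGG"
query = "ACGTGAAT"
index = build_index(reference, 8)
print(approximate_seed_hits(query, index, 8, max_mismatches=0))
print(approximate_seed_hits(query, index, 8, max_mismatches=1))
```

In this toy case the exact lookup finds nothing, while allowing one mismatch recovers the seed hit, which a real aligner would then pass to its extension and verification step. Enumerating all variants grows quickly with seed length and mismatch count, so a practical implementation would need a more efficient lookup strategy.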
