Tools and pipelines for interpreting the impacts of genetic variants

Abstract: Next-generation sequencing (NGS) methods are widely used for diagnosis. As the time and cost of sequencing have fallen sharply during the last decade, genome- and exome-wide sequencing have been used increasingly. Genome and exome projects produce large amounts of variation data, and the clinical relevance of a large proportion of these variations is not known. Among the various types of genetic variations, single nucleotide variations (SNVs) that lead to amino acid substitutions (AASs) are the most challenging to interpret. The best way to characterize the impacts of variations is by experimental studies. Because these experiments are expensive and time-consuming, they cannot be performed for all identified variants. Computational tools can be used to score and rank the variants and to prioritize them for experimental studies. Reliable and fast tools are necessary for accurate variation interpretation and for coping with the amounts of data generated. Several tools are available for predicting the impacts of genetic variations. These tools use various types of information and differ in performance. Performance assessment studies have shown that most of the widely used tools have inconsistent and sub-optimal performance.
In this study, we implemented a systematic approach to develop four computational tools for interpreting the impacts of genetic variations. The tools are based on machine learning algorithms. Benchmark variation datasets were obtained from various sources for training and testing the tools. A systematic feature selection technique was employed to identify relevant and non-redundant features for predicting variation impact. The benchmark datasets and the selected features were used for training the tools. Finally, the tools were tested on independent datasets to estimate their performance on unseen data.
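The idea of keeping features that are relevant yet non-redundant can be illustrated with a small sketch. This is not the feature selection procedure used in the thesis, which the abstract does not specify; it is a simple greedy correlation filter chosen for illustration. The feature names, the toy values, and the thresholds `min_relevance` and `max_redundancy` are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(columns, labels, min_relevance=0.1, max_redundancy=0.9):
    """Greedy filter (illustrative assumption, not the thesis method).

    Features are ranked by |correlation with the label|. A feature is kept
    if it is relevant enough and not too correlated with any kept feature.
    """
    relevance = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    ranked = sorted(columns, key=relevance.get, reverse=True)
    kept = []
    for name in ranked:
        if relevance[name] < min_relevance:
            continue  # not informative about the label
        if all(abs(pearson(columns[name], columns[k])) < max_redundancy for k in kept):
            kept.append(name)  # adds information beyond the kept features
    return kept


# Toy data: 1 = pathogenic, 0 = benign (hypothetical values).
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "conservation": [0.9, 0.8, 0.95, 0.2, 0.3, 0.1],
    "conservation_x2": [1.8, 1.6, 1.9, 0.4, 0.6, 0.2],  # redundant: scaled copy
    "hydrophobicity_change": [0.7, 0.2, 0.6, 0.3, 0.1, 0.4],
    "constant_flag": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # irrelevant: no variation
}
selected = select_features(columns, labels)
print(selected)
```

In this toy case the scaled copy is dropped as redundant and the constant column as irrelevant. A real pipeline would run the selection on training data only, so that the independent test set remains unseen.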
The tools PON-P2, PON-MMR2, and PON-PS predict the impacts of AASs in human proteins, and the PON-mt-tRNA tool predicts the impacts of SNVs in human mitochondrial transfer RNAs (mt-tRNAs). All the tools showed better performance than state-of-the-art tools. They have consistently shown the best performance in our studies as well as in independent studies.
The tools developed in this study are useful for ranking variations and prioritizing the likely harmful ones for further evaluation. The tools were developed for different purposes. Three of them (PON-P2, PON-MMR2, and PON-mt-tRNA) predict the pathogenicity of variations. While PON-P2 is a generic tool for predicting the pathogenicity of AASs in all human proteins, PON-MMR2 and PON-mt-tRNA are specific tools for predicting the pathogenicity of variations in mismatch repair proteins and mt-tRNA genes, respectively. PON-PS is the first tool for predicting disease severity due to AASs. The pathogenicity of a variation indicates its relevance to a disease but does not predict the severity of the phenotype. Early identification of disease severity promotes personalized medicine by facilitating early interventions, such as preventive measures, clinical monitoring, and molecular tests, for patients and their family members.
The developed computational tools were used for analysing the impacts of variations in DNA mismatch repair proteins and mt-tRNA genes, and of somatic variations in cancer. The impacts of all possible AASs in four mismatch repair proteins (MLH1, MSH2, MSH6, and PMS2) were predicted using PON-MMR2, and the impacts of all possible SNVs in 22 human mt-tRNAs were predicted using PON-mt-tRNA. We also studied the distribution of predicted pathogenic and benign variations in the protein domains and three-dimensional structures of proteins and mt-tRNAs. PON-P2 was used to identify harmful somatic AASs from among 5 million somatic variations from 7,042 genomes or exomes grouped into 30 types of cancer.
Only a small fraction of the somatic variations were identified as harmful. Although known cancer genes contained higher numbers of harmful variations, the proportion of harmful variations was only 40%. We prioritized the proteins that were implicated (containing harmful AASs) in the largest number of samples in each cancer type and studied the networks and pathways affected by them. In the functional interaction network, the prioritized proteins were centrally located. The significantly enriched pathways included several new pathways as well as pathways previously implicated in cancer. Our findings facilitate the prioritization of experimental studies in various cancer types as well as the interpretation of variation impacts in mismatch repair proteins and mt-tRNA genes.
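The prioritization step described above, ranking proteins by the number of samples in which they carry at least one harmful AAS within each cancer type, can be sketched as follows. This is a minimal illustration under assumed inputs, not the code used in the study: the record layout, the function name `prioritize_proteins`, the `top_n` cutoff, and the placeholder sample and protein identifiers are all hypothetical.

```python
from collections import defaultdict


def prioritize_proteins(calls, top_n=2):
    """Rank proteins per cancer type by the number of samples with a harmful AAS.

    `calls` is an iterable of (cancer_type, sample_id, protein, is_harmful)
    tuples, where `is_harmful` is the predicted class of one somatic AAS.
    """
    # (cancer_type, protein) -> distinct samples carrying a harmful AAS
    samples = defaultdict(set)
    for cancer_type, sample_id, protein, is_harmful in calls:
        if is_harmful:
            samples[(cancer_type, protein)].add(sample_id)

    per_type = defaultdict(list)
    for (cancer_type, protein), sample_ids in samples.items():
        per_type[cancer_type].append((protein, len(sample_ids)))

    # Most samples first; ties broken alphabetically for a stable order.
    return {
        cancer_type: sorted(rows, key=lambda row: (-row[1], row[0]))[:top_n]
        for cancer_type, rows in per_type.items()
    }


# Placeholder records, not data from the study.
calls = [
    ("colorectal", "S1", "PROT_A", True),
    ("colorectal", "S1", "PROT_A", True),   # second harmful AAS, same sample
    ("colorectal", "S2", "PROT_A", True),
    ("colorectal", "S2", "PROT_B", True),
    ("colorectal", "S3", "PROT_C", False),  # not predicted harmful, ignored
    ("melanoma", "S4", "PROT_B", True),
]
ranking = prioritize_proteins(calls)
print(ranking)
```

Counting distinct samples rather than variations keeps a single heavily mutated sample from dominating the ranking, which matches the "largest number of samples" criterion in the text.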