Structured Learning for Structural Bioinformatics: Applications of Deep Learning to Protein Structure Prediction

Abstract: Proteins are the basic molecular machines of the cell, performing a broad range of tasks, from structural support to catalysis of chemical reactions. Their function is determined by their 3D structure, which in turn is dictated by the order of their components, the amino acids.

This thesis is dedicated to applications of machine learning to the problems of contact prediction, ab initio structure prediction, and model quality assessment. In particular, my research has focused on developing methods that are both effective and easy to use.

In the first paper, we improved the already state-of-the-art model quality assessment (MQA) program ProQ3 by replacing its underlying machine learning algorithm, a support vector machine (SVM), with deep learning. We named the resulting method ProQ3D. The correlation between predicted and true scores improved from 0.85 to 0.90, using the same training data and features.

The second paper joined several programs into a single pipeline for ab initio structure prediction: contact prediction, folding, and model selection. We attempted to predict the structures of all 6379 PFAM families with unknown structure. We believe 558 of the resulting models to be accurate; of these, 415 had not been reported before.

The third paper uses advances in machine learning to build a contact predictor, PconsC4, that is fast and easy to deploy in large-scale studies, since it requires only a single Multiple Sequence Alignment (MSA) and has no external dependencies. The predictions are state-of-the-art, yielding a 12% improvement in precision over PconsC3 while running 244 times faster.

With ProQ4, in the fourth paper, we introduce a novel way of training deep networks for MQA that minimises the bias of the training data and emphasises model ranking, and we demonstrate its viability with a minimal description of the protein. The ranking correlation improved from 0.82 for ProQ3D to 0.90.

Lastly, in the fifth paper, we show the results of ProQ3D and ProQ4 in a completely blind test: CASP13.