Reliable Uncertainty Quantification in Statistical Learning

Abstract: Mathematical models are powerful yet simplified abstractions used to study, explain, and predict the behavior of systems of interest. This thesis is concerned with the last of these uses, prediction. Predictions of such models are often inherently uncertain, as exemplified by weather forecasting and as experienced with epidemiological models during the COVID-19 pandemic. Missing information, such as incomplete atmospheric data, and the very nature of models as approximations ("all models are wrong") imply that predictions are at most approximately correct.

Probabilistic models alleviate this issue by reporting not a single point prediction ("rain"/"no rain") but a probability distribution over all possible outcomes ("80% probability of rain"). This distribution represents the uncertainty of a prediction, so that predictions can be marked as more or less trustworthy. However, simply reporting a probabilistic prediction does not guarantee that the uncertainty estimates are reliable. Calibrated models ensure that the uncertainty expressed by the predictions is consistent with the prediction task, and hence that the predictions are neither under- nor overconfident. Calibration is particularly important in safety-critical applications such as medical diagnostics and autonomous driving, where it is crucial to distinguish between uncertain and trustworthy predictions. Mathematical models do not necessarily possess this property, and complex machine learning models in particular are susceptible to reporting overconfident predictions.

The main contributions of this thesis are new statistical methods for analyzing the calibration of a model, consisting of calibration measures, their estimators, and statistical hypothesis tests based on them. These methods are presented in the five scientific papers in the second part of the thesis.

The first part introduces the reader to probabilistic predictive models, the analysis of calibration, and positive definite kernels, which form the basis of the proposed calibration measures. The contributed tools for calibration analysis cover in principle any predictive model and are applied specifically to classification models with an arbitrary number of classes, to models for regression problems, and to models arising from Bayesian inference. This generality is motivated by the need for more detailed calibration analysis of today's increasingly complex models. To simplify the use of the statistical methods, a collection of software packages for calibration analysis, written in the Julia programming language, is made publicly available and supplemented with interfaces to the Python and R programming languages.