Generalization under Model Mismatch and Distributed Learning

Abstract: Machine learning models are typically trained by minimizing the training error over a given training dataset. The main objective, however, is to obtain models that generalize, i.e., perform well on data unseen during training. A fundamental challenge is that a low training error does not guarantee a low generalization error. While the classical wisdom from statistical learning theory states that models which perfectly fit the training data are unlikely to generalize, recent empirical results show that massively overparameterized models can defy this notion, leading to double-descent curves. This thesis investigates this phenomenon and characterizes the generalization performance of linear models across various scenarios.

The first part of the thesis characterizes the generalization performance as a function of the level of model mismatch between the patterns in the data and the assumed model. Model mismatch is undesirable but often present in practice. We reveal the trade-offs between the number of samples, the number of features, and the possibly incorrect statistical assumptions of the assumed model. We show that fake features, i.e., features present in the model but not in the data, can significantly improve the generalization performance even though they are not correlated with the features of the underlying system.

The second part of the thesis focuses on generalization under distributed learning, where the model parameters are distributed over a network of learners. We consider two settings: the single-task setting, where the network learns a single task, and the continual learning setting, where the network learns multiple tasks sequentially. The obtained results show that, for a wide family of feature distributions, the generalization performance depends heavily on how the model parameters are partitioned over the network. In particular, if the number of parameters per learner is close to the number of training samples, the generalization error may be extremely large compared to that of other partitioning schemes. For continual learning, our results quantify how the most favorable network structure for generalization depends on the task similarities as well as the number of tasks.
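As a rough illustration of the double-descent behaviour and the role of fake features discussed above, the sketch below (not taken from the thesis; the sample size `n`, true feature count `d_true`, and noise level `noise_std` are illustrative assumptions) fits minimum-norm least-squares models of increasing width to synthetic Gaussian data and reports the average test error. Columns beyond the true feature set act as fake features uncorrelated with the underlying signal.

```python
# Minimal sketch: double descent for minimum-norm least squares on synthetic data.
# All parameter choices here are illustrative assumptions, not values from the thesis.
import numpy as np

rng = np.random.default_rng(0)

n = 40          # number of training samples
d_true = 20     # number of features that actually carry signal
noise_std = 0.5

# Ground-truth linear model over the true features.
w_true = rng.normal(size=d_true) / np.sqrt(d_true)

def test_error(d_model, n_test=2000, n_trials=20):
    """Average test MSE of the minimum-norm least-squares fit
    when the assumed model uses d_model features."""
    errs = []
    for _ in range(n_trials):
        # The first d_true columns are the true features; any extra columns
        # are "fake" features uncorrelated with the signal.
        d = max(d_model, d_true)
        X = rng.normal(size=(n, d))
        y = X[:, :d_true] @ w_true + noise_std * rng.normal(size=n)

        # Fit using only the first d_model columns (a possibly mismatched model).
        Xm = X[:, :d_model]
        w_hat = np.linalg.pinv(Xm) @ y   # minimum-norm least-squares solution

        Xt = rng.normal(size=(n_test, d))
        y_t = Xt[:, :d_true] @ w_true + noise_std * rng.normal(size=n_test)
        errs.append(np.mean((Xt[:, :d_model] @ w_hat - y_t) ** 2))
    return np.mean(errs)

for d_model in [5, 10, 20, 30, 40, 50, 80, 160]:
    print(f"d_model = {d_model:3d}  test MSE = {test_error(d_model):.3f}")

# The test error typically peaks near d_model == n (the interpolation threshold)
# and then decreases again as d_model grows past n: a double-descent curve in
# which the extra, uncorrelated features improve generalization.
```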
