On Effectively Creating Ensembles of Classifiers : Studies on Creation Strategies, Diversity and Predicting with Confidence

Sammanfattning: An ensemble is a composite model, combining the predictions from several other models. Ensembles are known to be more accurate than single models. Diversity has been identified as an important factor in explaining the success of ensembles. In the context of classification, diversity has not been well defined, and several heuristic diversity measures have been proposed. The focus of this thesis is on how to create effective ensembles in the context of classification. Even though several effective ensemble algorithms have been proposed, there are still several open questions regarding the role diversity plays when creating an effective ensemble. Open questions relating to creating effective ensembles that are addressed include: what to optimize when trying to find an ensemble using a subset of models used by the original ensemble that is more effective than the original ensemble; how effective is it to search for such a sub-ensemble; how should the neural networks used in an ensemble be trained for the ensemble to be effective? The contributions of the thesis include several studies evaluating different ways to optimize which sub-ensemble would be most effective, including a novel approach using combinations of performance and diversity measures. The contributions of the initial studies presented in the thesis eventually resulted in an investigation of the underlying assumption motivating the search for more effective sub-ensembles. The evaluation concluded that even if several more effective sub-ensembles exist, it may not be possible to identify which sub-ensembles would be the most effective using any of the evaluated optimization measures. An investigation of the most effective ways to train neural networks to be used in ensembles was also performed. The conclusions are that effective ensembles can be obtained by training neural networks in a number of different ways but that high average individual accuracy or much diversity both would generate effective ensembles. Several findings regarding diversity and effective ensembles presented in the literature in recent years are also discussed and related to the results of the included studies. When creating confidence based predictors using conformal prediction, there are several open questions regarding how data should be utilized effectively when using ensembles. Open questions related to predicting with confidence that are addressed include: how can data be utilized effectively to achieve more efficient confidence based predictions using ensembles; how do problems with class imbalance affect the confidence based predictions when using conformal prediction? Contributions include two studies where it is shown in the first that the use of out-of-bag estimates when using bagging ensembles results in more effective conformal predictors and it is shown in the second that a conformal predictor conditioned on the class labels to avoid a strong bias towards the majority class is more effective on problems with class imbalance. The research method used is mainly inspired by the design science paradigm, which is manifested by the development and evaluation of artifacts.