Opportunities, Challenges and Solutions for Automatic Labeling of Data Using Machine Learning

Abstract:

Context: Supervised learning is the most common machine learning paradigm and requires labeled data. Because much industrial data is unlabeled, data labeling is an essential step in data preparation. Labeling takes time and often requires domain knowledge, and many companies lack the resources to assign in-house personnel with domain-specific knowledge to the task. It is therefore relevant for companies to explore cheaper, more accurate, and automated approaches to labeling.

Objective: This research aims to identify industry challenges and mitigation strategies for data labeling.

Method: The research took a multidisciplinary approach. We performed a systematic mapping study to identify the main approaches to labeling and their application domains. We then performed a case study with two companies to understand the challenges and mitigation strategies in industry. The case study consisted of an internship at one of the companies and interviews with data scientists from both. We used thematic analysis to formulate challenges from the collected data and formulated a mitigation strategy for each challenge. The rest of the research used simulations together with the Bayesian Bradley-Terry model and Item Response Theory to study which labeling approaches are most accurate and to evaluate the sustainability of the datasets used to evaluate those approaches.

Results: In this thesis, we present four main findings. First, we give an overview of the most popular data labeling approaches used in different applications and of the datasets used to evaluate them. Second, we define and categorize the industry challenges that data scientists face and formulate mitigation strategies for these challenges. Third, we present the most accurate automated labeling approaches and how much manual effort these algorithms need to reach their best accuracy. Fourth, we present the best benchmark datasets for evaluating automatic labeling approaches.

Outlook: In future work, we want to examine safe and deep semi-supervised learning and how they are used in practice, as we have noticed that semi-supervised learning based on deep learning has become more prevalent in recent years.
