Document Image Processing for Handwritten Text Recognition: Deep Learning-based Transliteration of Astrid Lindgren’s Stenographic Manuscripts

Abstract: Document image processing and handwritten text recognition have been applied to a variety of materials, scripts, and languages, both modern and historic. They are crucial building blocks in the ongoing digitisation efforts of archives, where they aid in preserving archival materials and foster knowledge sharing. The latter is especially facilitated by making document contents available to interested readers who may have little to no practice in, for example, reading a specific script type, and might therefore face challenges in accessing the material.

The first part of this dissertation focuses on reducing editorial artefacts, specifically struck-through words, in manuscripts. The main goal is to identify struck-through words and remove as much of the strikethrough as possible in order to regain access to the original word. This step can serve as preprocessing both for human annotators and readers and for computerised pipelines, such as handwritten text recognition. Two deep learning-based approaches, exploring paired and unpaired data settings, are examined and compared. Furthermore, an approach for generating synthetic strikethrough data, for example for training and testing purposes, is presented, along with three novel datasets.

The second part of this dissertation is centred around applying handwritten text recognition to the stenographic manuscripts of Swedish children's book author Astrid Lindgren (1907–2002). Manually transliterating stenography, also known as shorthand, requires special domain knowledge of the script itself. Therefore, the main focus of this part is to reduce the required manual work, aiming to increase the accessibility of the material. In this regard, a baseline for handwritten text recognition of Swedish stenography is established. Two approaches for improving upon this baseline are examined. Firstly, a variety of data augmentation techniques commonly used in handwritten text recognition are studied. Secondly, different target sequence encoding methods, which aim to approximate diplomatic transcriptions, are investigated. The latter, in combination with a pre-training approach, significantly improves the recognition performance. In addition to the two presented studies, the novel LION dataset is published, consisting of excerpts from Astrid Lindgren's stenographic manuscripts.
