Interpreting the Script: Image Analysis and Machine Learning for Quantitative Studies of Pre-modern Manuscripts

Abstract: The humanities have long been a collection of fields that have not benefited from the advances in computational power predicted by Moore's law. Fields such as medicine, biology, physics, chemistry, geology and economics have all developed quantitative tools that take advantage of the exponential increase in processing power over time. Recent advances in computerized pattern recognition, in combination with the rapid digitization of historical document collections around the world, are about to change this.

The first part of this dissertation focuses on constructing a full system for finding handwritten words in historical manuscripts. A novel segmentation algorithm is presented, capable of finding and separating text lines in pre-modern manuscripts. Text recognition is performed by translating the image data of the text lines into sequences of numbers, called features. Commonly used features are analysed and evaluated on manuscript sources from the Uppsala University library Carolina Rediviva and the US Library of Congress. Decoding the text in the vast number of photographed manuscripts from our libraries makes computational linguistics and social network analysis directly applicable to historical sources. Hence, text recognition is considered a key technology for the future of computerized research methods in the humanities.

The second part of this thesis addresses digital palaeography, using a computer's superior capacity for endlessly performing measurements on ink stroke shapes. Objective criteria of character shapes only partly capture what a palaeographer uses for assessing similarity. The palaeographer often gets a feel for the scribe's style, which is, however, hard to quantify. A method for identifying the scribal hands of a pre-modern copy of the Revelations of Saint Bridget of Sweden, using semi-supervised learning, is presented.
Methods for production year estimation are presented and evaluated on a collection of close to 11,000 medieval charters. The production dates are estimated using a Gaussian process, where the uncertainty is inferred together with the most likely production year.

In summary, this dissertation presents several novel methods related to image analysis and machine learning. In combination with recent advances in the field, they enable efficient computational analysis of very large collections of historical documents.
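
The idea of inferring a production year together with its uncertainty can be illustrated with a minimal Gaussian process regression sketch. This is not the dissertation's implementation: the one-dimensional "script feature", the synthetic charters, the squared-exponential kernel, the hyperparameter values and the function names `rbf_kernel` and `gp_predict` are all assumptions made purely for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between the rows of a and b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


def gp_predict(x_train, y_train, x_test, noise_var=1.0, **kernel_args):
    """Return the predictive mean and standard deviation at x_test."""
    y_mean = y_train.mean()  # constant prior mean: the average training year
    k_tt = rbf_kernel(x_train, x_train, **kernel_args)
    k_st = rbf_kernel(x_test, x_train, **kernel_args)
    prior_var = np.full(len(x_test), kernel_args.get("variance", 1.0))

    # Cholesky factorisation of the noisy training covariance.
    chol = np.linalg.cholesky(k_tt + noise_var * np.eye(len(x_train)))
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train - y_mean))

    mean = y_mean + k_st @ alpha
    v = np.linalg.solve(chol, k_st.T)
    var = prior_var - np.sum(v**2, axis=0) + noise_var
    return mean, np.sqrt(np.maximum(var, 0.0))


# Synthetic stand-in for dated charters (assumption): a single script
# feature that drifts linearly with the production year, plus noise.
rng = np.random.default_rng(0)
years = rng.uniform(1300, 1500, size=60)
features = ((years - 1400) / 100 + rng.normal(0, 0.05, size=60))[:, None]

# One query resembling the training charters, one far outside them.
query = np.array([[0.0], [5.0]])
mean, std = gp_predict(
    features, years, query,
    noise_var=25.0, length_scale=0.5, variance=100.0**2,
)
for m, s in zip(mean, std):
    print(f"estimated year {m:.0f} +/- {s:.0f}")
```

In this sketch the query that lies far from all training data falls back to the average training year with a much larger standard deviation, which is the behaviour that makes a Gaussian process attractive when the reliability of each date estimate matters.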