Unsupervised Learning of Morphology: Survey, Model, Algorithm and Experiments
Abstract: This thesis contains work on a specific problem in the field of Language Technology. The problem can be described as follows: "Can a computer extract a description of word conjugation in a natural language using only written text in the language?" The problem is often referred to as Unsupervised Learning of Morphology (ULM) and has a wide variety of Language Technology applications, including Machine Translation, Document Categorization and Information Retrieval. The ULM problem is also relevant for linguistic theory, and can serve to boost empirical investigations in subfields such as Quantitative Linguistics and Linguistic Typology. The first part of the thesis contains a comprehensive survey of work done on the ULM problem. All the minor and major lines of work are mentioned with a reference and a very brief characterization. Different approaches that have been prevalent in the field as a whole are highlighted and critically discussed. The general picture resulting from the survey is that much work has been repeated over and over, with little exchange and evolution of techniques. The second part of the thesis describes a simple model of concatenative affixation, i.e., how stems and affixes are strung together to form words. The model says that words consist of high-frequency strings ("affixes") attached to low-frequency strings ("stems"), e.g., as in the English play-ing. It is then shown that, from a set of words constructed according to the model, the affixes can be extracted with their correct segmentation. The extraction algorithm is evaluated impressionistically on a diverse set of natural languages. The affix extraction algorithm does not output a full-fledged description of conjugational patterns; it only produces a list of affixes. The third and fourth parts of the thesis show how it can be used in further morphological analysis.
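To make the intuition behind the model concrete, the following minimal Python sketch counts how many distinct words end in each word-final string. It is an illustration only and not the extraction algorithm of the thesis: the function name `count_final_strings`, the `max_len` cutoff and the toy word list are all assumptions introduced here.

```python
from collections import Counter


def count_final_strings(words, max_len=4):
    """Count, for each word-final string of up to max_len characters,
    how many distinct words end in it (hypothetical helper)."""
    counts = Counter()
    for word in set(words):
        # Leave at least one character for the putative stem.
        for k in range(1, min(max_len, len(word) - 1) + 1):
            counts[word[-k:]] += 1
    return counts


# Toy word list (an assumption for illustration, not data from the thesis).
words = [
    "playing", "walking", "talking",
    "played", "walked", "talked",
    "plays", "walks", "talks",
]
counts = count_final_strings(words)

for ending in ("ing", "ed", "s", "king", "ying"):
    print(ending, counts[ending])
```

On this toy list the recurring endings "ing", "ed" and "s" each attach to three words, while stem-specific strings such as "ying" occur once. Raw counts alone cannot choose between "ing" and its shorter tails "ng" and "g", which are equally frequent here; that is the segmentation question the thesis addresses, and this sketch does not attempt to solve it.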
In the third part, an algorithm is presented that decides whether two given words are conjugations of the same stem. The key part is the development of a metric for quantifying which endings tend to attach to the same set of stems. The algorithm has no parameters or human input and works equally well for languages with widely different morphological typology. It achieves almost perfect accuracy on word pairs selected from running text. In the fourth part, the affix extraction model is exploited for the written language identification problem, i.e., to decide which natural language a given text is written in. Existing state-of-the-art techniques to identify the language of a written text most often use a 3-gram frequency table as the basis for "fingerprinting" a language. While this approach performs very well in practice (roughly 99% accuracy) when the text to be classified is, say, 100 characters or longer, it cannot be reliably used to classify shorter input, nor can it detect whether the input is a concatenation of text from several languages. Therefore a more fine-grained model is presented which aims at reliable classification of input as short as one word. In essence, the language of an unseen word is guessed based on any salient affixes that appear on it. Many practical applications do not need this fine level of granularity, but Multilingual Information Retrieval is a major target area where input is usually only one or a few words. The algorithm is given a rigorous evaluation on a 32-language parallel Bible corpus, showing competitive accuracy on short input as well as multilingual input, and not only for a set of European languages with similar morphological typology.
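For readers unfamiliar with the 3-gram baseline mentioned above, the following minimal Python sketch builds a character-trigram frequency table per language and picks the language whose table is most similar to that of the input. It illustrates the general fingerprinting idea only; it is not the affix-based method of the thesis. The cosine similarity measure, the function names and the two one-sentence training samples are assumptions made for this example, and a realistic system would be trained on far more text.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text):
    """Character-trigram frequency table of a text, padded with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(p, q):
    """Cosine similarity between two trigram frequency tables."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    target = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(target, profiles[lang]))


# Tiny toy training samples (assumptions for illustration only).
samples = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "sv": "den snabba bruna räven hoppar över den lata hunden och sedan sover hunden",
}
profiles = {lang: trigram_profile(text) for lang, text in samples.items()}

print(identify("the dog is sleeping", profiles))
print(identify("hunden sover", profiles))
```

Because a whole input is reduced to a single frequency table, this kind of classifier returns one label per input, which is consistent with the limitation noted above for input that mixes several languages.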