Automated subject classification of textual web pages, for browsing

This is a dissertation from the Digital Information Systems Group, Department of Information Technology, Lund University

Abstract: With the exponential growth of the World Wide Web, automated subject classification of Web pages has become a major research issue in information and computer sciences. Organizing Web pages into a hierarchical structure for subject browsing is gaining recognition as an important tool in information-seeking processes. In this thesis, different automated classification approaches, focusing on organizing textual Web pages into a browsable hierarchical structure, were critically examined and compared. Three major approaches to automated subject classification were recognized, each coming from a different research community: machine learning, information retrieval and library science. While these approaches share research aims and a number of methods and techniques, and as such could benefit from each other, it was shown that authors from each of the three communities communicate only to a limited extent with authors from the other two. The two biggest differences between the approaches are whether they employ a vector space model (machine learning and information retrieval) and whether they make use of controlled vocabularies such as classification schemes, thesauri, or ontologies (library science). Certain special characteristics of Web pages (e.g. metadata and structural elements such as title, headings, main text) were investigated to determine how they could best be used in automated classification. The study indicated that all the structural information and metadata available in Web pages should be used in order to achieve the best automated classification results; however, the exact way of combining them proved not to be very important. It has been claimed that well-structured, high-quality controlled vocabularies could serve as good browsing structures. The degree and nature of subject browsing conducted by users of a large Web-based service (Renardus) were studied using log analysis. The study showed that browsing is used to a much larger degree than searching, indicating the usefulness of browsing in such services and possibly implying the suitability of such a controlled vocabulary (Dewey Decimal Classification) for browsing.