Learning Spatiotemporal Features in Low-Data and Fine-Grained Action Recognition with an Application to Equine Pain Behavior
Abstract: Recognition of pain in animals is important because pain compromises animal welfare and can be a manifestation of disease. This is a difficult task for veterinarians and caretakers, partly because horses, being prey animals, display subtle pain behavior, and partly because they cannot verbalize their pain. An automated video-based system has large potential to improve the consistency and efficiency of pain predictions.
Video recording is desirable for ethological studies because it interferes minimally with the animal, in contrast to more invasive measurement techniques, such as accelerometers. Moreover, to be able to say something meaningful about animal behavior, the subject needs to be studied for longer than the exposure of single images. In deep learning, we have not come as far for video as we have for single images, and even more questions remain regarding what types of architectures should be used and what these models are actually learning. Collecting video data with controlled moderate pain labels is both laborious and involves real animals, and the amount of such data should therefore be limited. The low-data scenario, in particular, is under-explored in action recognition, in favor of the ongoing exploration of how well large models can learn large datasets.
The first theme of the thesis is automated recognition of equine pain. Here, we propose a method for end-to-end equine pain recognition from video, finding, in particular, that the temporal modeling ability of the artificial neural network is important for improving the classification. We surpass veterinarian experts on a dataset with horses undergoing well-defined moderate experimental pain induction. Next, we investigate domain transfer to another type of pain in horses: less defined, longer-acting and lower-grade orthopedic pain. We find that a smaller, recurrent video model is more robust to domain shift on a target dataset than a large, pre-trained 3D CNN, even though the two models perform equally well on a source dataset. We also discuss challenges with learning video features on real-world datasets.
Motivated by questions that arose within the application area, the second theme of the thesis is empirical properties of deep video models. Here, we study the spatiotemporal features that are learned by deep video models in end-to-end video classification and propose an explainability method as a tool for such investigations. Further, we explore through empirical study whether different approaches to frame dependency treatment in video models affect their cross-domain generalization ability. We also propose new datasets for lightweight temporal modeling and for investigating texture bias within action recognition.