Self-supervised Representation Learning for Visual Domains Beyond Natural Scenes

Abstract: Supervised learning undoubtedly obtains higher performance, but it imposes limitations due to its reliance on human supervision. To reduce this reliance, transfer learning has proven effective for fine-tuning tasks, yet it does not leverage unlabeled data. Self-supervised representation learning has successfully reduced the need for labeled data in both the natural language processing and vision domains. Recent advances in learning effective visual representations without human supervision through self-supervised learning are thought-provoking. Specifically, the Joint Embedding Architecture \& Method (JEAM) paradigm, primarily conceptualized over multiple views of input images, the embedding capabilities of joint architectures, and specialized loss objectives, underlies these recent advances in self-supervision. Recent JEAM-based methods have significantly improved downstream task performance while reducing the need for human-labeled data. However, these methods are sensitive to trivial (collapsed) solutions during pre-training, and different approaches have been devised to prevent them. In general, JEAM-based self-supervised methods are categorized by the objective they use to avoid trivial solutions in representation learning: similarity maximization and redundancy reduction. One common fact about all these methods is that they are well-optimized for natural-scene visual domains, e.g., ImageNet. Despite their differences, representation learning in all the above-stated methods critically depends on human-designed augmentation pipelines to generate views.
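To make the similarity-maximization objective concrete, the following is a minimal NumPy sketch of an NT-Xent-style contrastive loss of the kind used by JEAM methods such as SimCLR; the function name, batch layout, and default temperature are illustrative assumptions, not taken from any specific implementation discussed in the thesis.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """Similarity maximization: pull the embeddings of two views of the
    same image together and push apart all other pairs in the batch."""
    z = np.concatenate([z1, z2], axis=0)               # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize rows
    sim = (z @ z.T) / temperature                      # cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z1)
    # the positive for sample i is its counterpart in the other view
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())
```

Identical views yield a lower loss than randomly paired views, which is exactly the behavior the objective is designed to produce.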
This dependency makes those methods sub-optimal for adaptation to specialized visual domains, e.g., medical imaging, remote sensing, and bio-images beyond the RGB spectral domain. The works proposed in this thesis build on the hypothesis that the lack of human knowledge about the visual concepts of a specialized visual domain hinders the network's capability to learn efficient representations in a self-supervised manner. To overcome this limitation, it is proposed to shift the focus from a human-induced prior, i.e., hand-designed data augmentations, to a data prior, i.e., a supervision signal from the data itself, in order to adapt self-supervised representation learning to the aforementioned specialized domains. This proposal is empirically evaluated in two separate works based on contrastive learning: one in microscopic medical imaging and one in the 3-dimensional particle measurement sensor domain for mining material. The first work presents a novel self-supervised pre-training method for learning efficient representations without labels on histopathology medical images by utilizing magnification factors, a supervision signal present in the data. The proposed method, Magnification Prior Contrastive Similarity (MPCS), enables self-supervised learning of representations without labels on the small-scale breast cancer dataset BreakHis by exploiting the magnification factor, inductive transfer, and a reduced human prior.
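As a sketch of how a data prior can replace handcrafted augmentations in MPCS-style pre-training, the snippet below forms a positive pair from two magnifications of the same specimen. The dataset layout and function name are hypothetical; BreakHis itself provides images at the magnifications 40x, 100x, 200x, and 400x.

```python
import random

def magnification_pair(dataset, specimen_id, rng=None):
    """Form a positive pair from two different magnifications of the same
    specimen, so the supervision signal comes from the data itself
    (the magnification factor) rather than from a handcrafted
    augmentation pipeline."""
    rng = rng or random.Random()
    views = dataset[specimen_id]              # {magnification: image}
    m1, m2 = rng.sample(sorted(views), 2)     # two distinct magnifications
    return views[m1], views[m2]
```

The two returned images would then be fed to the joint embedding architecture in place of two augmented crops of a single image.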
The proposed method matches fully supervised state-of-the-art performance in malignancy classification when only 20\% of the labels are used in fine-tuning, and it outperforms previous works in fully supervised learning settings on three public breast cancer datasets, including BreakHis. In the same direction, the second work presents a novel self-supervised representation learning method for learning efficient representations without labels on images from a 3DPM sensor (3-Dimensional Particle Measurement; it estimates the particle size distribution of material) by utilizing RGB images and depth maps of mining material on the conveyor belt. The proposed method, Depth Contrast, enables self-supervised learning of representations without labels on the 3DPM dataset by exploiting depth maps and inductive transfer. Depth Contrast outperforms ImageNet transfer learning on material classification in the fully supervised learning setting and achieves an 11\% improvement over ImageNet transfer learning in a semi-supervised setting when only 20\% of the labels are used in fine-tuning. Some common trends are observed in both works. First, using a supervision signal from the data makes it possible to adapt self-supervised representation learning to domains beyond natural scenes. Second, the proposed methods enable self-supervised pre-training on small-scale datasets, unlike previous works. Finally, the learnt representations transfer better to downstream tasks than supervised knowledge transfer. Although adapting self-supervised methods beyond natural scenes by focusing on supervision signals from the data has shown consistent results, it still needs to be evaluated against broader evaluation criteria and more complex downstream tasks.
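The modality pairing behind Depth Contrast can be sketched as follows; the function name and the detail of replicating the depth map to three channels (so both views fit a standard image encoder) are assumptions for illustration, not the thesis's exact pipeline.

```python
import numpy as np

def depth_contrast_views(rgb, depth):
    """Treat the RGB frame and its aligned depth map from the 3DPM sensor
    as the two views of one scene, removing the need for a handcrafted
    augmentation pipeline to generate a positive pair."""
    assert rgb.shape[:2] == depth.shape[:2], "views must be spatially aligned"
    depth3 = np.repeat(depth[..., None], 3, axis=-1)  # 1 channel -> 3
    return rgb, depth3
```

Each view would then pass through the embedding network, with the contrastive objective pulling the RGB and depth embeddings of the same scene together.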
