A Picture is Worth a Thousand Words: Natural Language Processing in Context

Abstract: Modern NLP models learn language from lexical co-occurrences. While this method has enabled significant breakthroughs, it has also exposed potential limitations of modern NLP methods. For example, NLP models are prone to hallucinate, represent a biased world view, and may learn spurious correlations that solve the dataset rather than the task at hand. This is to some extent a consequence of training the models exclusively on text. In text, concepts are defined only by the words that accompany them, and the information in text is incomplete due to reporting bias. In this work, we investigate whether additional context in the form of multimodal information can be used to improve the representations of modern NLP models. Specifically, we consider BERT-based vision-and-language models that receive additional context from images. We hypothesize that visual training should primarily improve the models' visual commonsense knowledge, i.e. obvious knowledge about visual properties. To probe for this knowledge we develop the evaluation tasks Memory Colors and Visual Property Norms. Generally, we find that the vision-and-language models considered do not outperform their unimodal counterparts. In addition, we find that the models switch their answers depending on the prompt when evaluated for the same type of knowledge. We conclude that more work is needed on understanding and developing vision-and-language models, and that extra focus should be put on how to successfully fuse image and language processing. We also reconsider the usefulness of measuring commonsense knowledge in models that cannot represent factual knowledge.
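As a concrete illustration of the cloze-style probing described above, the following is a minimal sketch of querying a masked language model for a visual property and checking whether its answer is stable across paraphrased prompts. It assumes the Hugging Face transformers library with the bert-base-uncased checkpoint; the prompt templates and object list are illustrative stand-ins, not the actual Memory Colors or Visual Property Norms items.

    # Minimal cloze-style probe for visual commonsense (memory colors).
    # Illustrative only: the templates and objects below are assumptions,
    # not the benchmark items from the thesis.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model="bert-base-uncased")

    # Two paraphrases of the same query; a model with robust knowledge
    # should not switch its answer between them.
    templates = [
        "The color of a {} is [MASK].",
        "Q: What is the color of a {}? A: [MASK].",
    ]
    objects = ["banana", "frog", "snow"]

    for obj in objects:
        answers = []
        for template in templates:
            # Take the single highest-scoring token for the [MASK] slot.
            prediction = unmasker(template.format(obj), top_k=1)[0]
            answers.append(prediction["token_str"])
        consistent = "yes" if len(set(answers)) == 1 else "no"
        print(f"{obj}: {answers} consistent={consistent}")

A vision-and-language model exposed through the same masked-LM interface could be probed identically, which is what makes prompt sensitivity directly comparable across unimodal and multimodal models.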
