Publication Date

8-2022

Date of Final Oral Examination (Defense)

4-12-2022

Type of Culminating Activity

Thesis

Degree Title

Master of Science in Computer Science

Department

Computer Science

Major Advisor

Cassey Kennington, Ph.D.

Advisor

Francesca Spezzano, Ph.D.

Advisor

Tim Anderson, Ph.D.

Abstract

Transformer-based language models (LMs) learn contextual meanings for words from huge amounts of unlabeled text. These models show outstanding performance on various Natural Language Processing (NLP) tasks. However, what LMs learn is far from what meaning is for humans, partly because humans can differentiate between concrete and abstract words, whereas language models make no such distinction. Concrete words have a physical representation in the world, such as “chair”, while abstract words denote ideas, such as “democracy”. The process of learning word meanings starts in early childhood, when children acquire their first language. Children learn their first language by interacting with the physical world using simple referring expressions. They do not need many examples to learn from, and they learn concrete words first, through interaction with their physical world, and abstract words later. Language models, by contrast, are not capable of referring to objects or learning concrete aspects of words.

In this thesis, I draw motivation from the way children acquire language and incorporate a concrete representation of certain words into LMs while leveraging their existing training regime. My methodology uses referring expressions to visual objects as a way of linking visual-world representations (images) with text. Specifically, I extract word-level visual embeddings for concrete words from images and word-level contextual embeddings for abstract words from text, and then use both to train language models. To enable the model to differentiate between concrete and abstract words, I use a dataset that indicates the level of concreteness of each word to determine how information about that word is applied during training.
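As a rough illustration of the concreteness-based routing described above, the following minimal sketch picks a visual embedding for words rated as concrete and a contextual embedding otherwise. It is not the thesis implementation: the function name `choose_embedding`, the 1-to-5 rating scale, the threshold of 3.0, and all toy scores and vectors are assumptions made up for this example.

```python
from typing import Dict, List, Tuple

Vector = List[float]

def choose_embedding(
    word: str,
    concreteness: Dict[str, float],
    visual: Dict[str, Vector],
    contextual: Dict[str, Vector],
    threshold: float = 3.0,  # hypothetical cutoff on an assumed 1-to-5 scale
) -> Tuple[str, Vector]:
    """Return (source, vector) for a word based on its concreteness rating."""
    score = concreteness.get(word)
    # Use the image-derived vector only for words rated concrete
    # that actually have a visual embedding available.
    if score is not None and score >= threshold and word in visual:
        return "visual", visual[word]
    # Abstract or unrated words fall back to the text-derived vector.
    return "contextual", contextual[word]

# Toy data, invented purely for illustration.
concreteness = {"chair": 4.6, "democracy": 1.8}
visual = {"chair": [0.9, 0.1]}
contextual = {"chair": [0.2, 0.3], "democracy": [0.5, 0.7]}

for w in ("chair", "democracy"):
    source, vec = choose_embedding(w, concreteness, visual, contextual)
    print(w, source, vec)
```

In this sketch, a word with no rating or no available image-derived vector falls back to its contextual embedding; how such cases are actually handled is not stated in the abstract.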

The work presented in this thesis is evaluated on a standard language understanding benchmark by analyzing the effect of the proposed training regime on the language model and comparing its performance with that of traditional language models trained on large text corpora. The results demonstrate that using referring expressions as the input text to train language models yields better performance on some language understanding tasks than using traditional, corpus-based text. However, the results do not confirm that adding visual knowledge and/or knowledge of the concreteness distinction enriches LMs.

DOI

https://doi.org/10.18122/td.1975.boisestate
