Publication Date

5-2025

Date of Final Oral Examination (Defense)

3-7-2025

Type of Culminating Activity

Dissertation

Degree Title

Doctor of Philosophy in Computing

Department

Computer Science

Supervisory Committee Chair

Casey Kennington, Ph.D.

Supervisory Committee Member

Edoardo Serra, Ph.D.

Supervisory Committee Member

Timothy Andersen, Ph.D.

Abstract

The impressive results recently achieved in natural language processing and artificial intelligence have been driven primarily by the transformer deep learning architecture and by increasingly large models trained on enormous datasets. The size of these models and their training data requirements impose costly demands that freeze many researchers out of training cutting-edge models. Beyond these practical implications, current methods learn from text alone, without the rich array of sensory information that human beings use in learning language. As a result, language models are often incapable of reasoning about the concrete world we inhabit, which is a fundamental aspect of intelligence. In this dissertation, I present a body of research aimed at addressing these concerns, spread across four research objectives. As preliminary objectives and proof of concept, we explore the capacity of transformer models to perform natural language understanding tasks using "human scale" data with a minimum of model parameters. We also present a proof-of-concept study showing that sensory information can, to some degree, improve the performance of a language model. After these preliminary studies in pure natural language processing, we shift our focus to jointly processing vision alongside language. Because the field is still emerging, vision-language (VL) modeling software is often not up to the standard required for research. Therefore, as part of this dissertation, we created a versatile software platform for answering research questions about VL models. In the final phase of research, we use this platform to design and test a compact model that handles both vision and language inputs and can be trained on a university-scale budget.

DOI

10.18122/td.2405.boisestate
