Publication Date
5-2025
Date of Final Oral Examination (Defense)
3-7-2025
Type of Culminating Activity
Dissertation
Degree Title
Doctor of Philosophy in Computing
Department
Computer Science
Supervisory Committee Chair
Casey Kennington, Ph.D.
Supervisory Committee Member
Edoardo Serra, Ph.D.
Supervisory Committee Member
Timothy Andersen, Ph.D.
Abstract
The impressive results recently achieved in natural language processing and artificial intelligence have been driven primarily by the introduction of the transformer deep learning architecture, increasingly large models with many parameters, and enormous datasets. The size of these models and their training data requirements impose costs that freeze many researchers out of training cutting-edge models. Beyond these practical implications, current methods learn from text alone, without the rich array of sensory information that human beings use in learning language. This means that language models are often incapable of reasoning about the concrete world that we inhabit, which is a fundamental aspect of intelligence. In this dissertation, I present a body of research aimed at addressing these concerns, spread across four research objectives. As preliminary objectives and proof of concept, we explore the capacity of transformer models to perform natural language understanding tasks using "human scale" data with a minimum of model parameters. We also present a proof-of-concept study showing that sensory information can, to some degree, improve the performance of a language model. After these preliminary studies in pure natural language processing, we shift our focus to jointly processing vision alongside language. The emerging nature of the field means that vision-language modeling software is often not up to the standard required for research. Therefore, we created a versatile software platform for answering research questions about vision-language (VL) models as part of this dissertation. In the final phase of research, we use the software that we created to design and test a compact model that handles both vision and language inputs and can be trained on a university-scale budget.
DOI
10.18122/td.2405.boisestate
Recommended Citation
Fields, Clayton, "Modeling Language and Vision at Human Scales" (2025). Boise State University Theses and Dissertations. 2405.
10.18122/td.2405.boisestate