Publication Date

May 2023

Date of Final Oral Examination (Defense)

May 2023

Type of Culminating Activity

Thesis

Degree Title

Master of Science in Computer Science

Department

Computer Science

Supervisory Committee Chair

Casey Kennington, Ph.D.

Supervisory Committee Member

Tim Andersen, Ph.D.

Supervisory Committee Member

Michael Ekstrand, Ph.D.

Abstract

A common metric for evaluating Automatic Speech Recognition (ASR) is Word Error Rate (WER), which solely takes into account discrepancies at the word level. Although WER is useful, it is not guaranteed to correlate well with intelligibility or with performance on downstream tasks that make use of ASR. Meaningful assessment of ASR mistakes becomes even more important in high-stakes scenarios such as healthcare. I propose two general measures for evaluating the quality or severity of mistakes made by ASR systems, one based on sentiment analysis and another based on text embeddings; both have the potential to overcome the limitations of WER. I evaluate these measures on simulated patient-doctor conversations. Measures of severity based on sentiment ratings and text embeddings correlate with human ratings of severity, and measures based on text embeddings can predict human severity ratings better than WER. These measures are then incorporated into metrics used alongside WER in an overall evaluation of five ASR engines; the results show that these metrics capture characteristics of ASR errors that WER does not. Furthermore, I train an ASR system using severity as a penalty in the loss function, demonstrating the potential for using severity not only in the evaluation but also in the development of ASR. Advantages and limitations of this methodology are analyzed and discussed.
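To make the contrast with WER concrete: WER counts the substitutions (S), deletions (D), and insertions (I) needed to turn the hypothesis into the reference, normalized by the reference length N, i.e. WER = (S + D + I) / N, so it assigns the same cost to a benign typo and a meaning-flipping omission. Below is a minimal sketch (not the thesis's actual implementation) of an embedding-based severity measure of the kind the abstract describes. The jiwer library and the "all-MiniLM-L6-v2" sentence-embedding model are illustrative assumptions, as is the example clinical sentence.

```python
# Sketch: embedding-based severity vs. WER (illustrative assumptions,
# not the thesis's actual models or data).
from jiwer import wer                                  # word error rate
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Assumed embedding model; any sentence encoder would do for this sketch.
model = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_severity(reference: str, hypothesis: str) -> float:
    """Severity as 1 - cosine similarity between sentence embeddings."""
    ref_vec, hyp_vec = model.encode([reference, hypothesis])
    return 1.0 - float(cos_sim(ref_vec, hyp_vec))

reference = "the patient reports no chest pain"
benign = "the patient reports no chest pains"  # minor inflection error
severe = "the patient reports chest pain"      # meaning-flipping deletion

for hyp in (benign, severe):
    print(f"WER={wer(reference, hyp):.2f}  "
          f"severity={embedding_severity(reference, hyp):.2f}")
```

Both hypotheses differ from the reference by a single word edit, so their WER is identical, yet the embedding-based score separates the harmless inflection change from the negation error, which is the behavior a severity-aware ASR metric is after.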

DOI

https://doi.org/10.18122/td.2086.boisestate
