Data Quality Relevance in Linguistic Analysis: The Impact of Transcription Errors on Multiple Methods of Linguistic Analysis

Steven Pentland, Boise State University
Lee Spitzley, SUNY Albany
Christie Fuller, Boise State University
Doug Twitchell, Boise State University


An enormous amount of recorded speech is generated daily, and quickly transcribing and analyzing the text of this speech could have tremendous value to organizations and researchers. However, the speech transcription process has historically been laborious, expensive, and slow. Automatic speech recognition (ASR) tools have matured a great deal in the last decade and may be a suitable method for generating large-scale, high-quality transcriptions. These tools are fast and economical, but they generally produce errors at a much greater rate than human transcribers. It is unknown whether these errors matter when conducting psycholinguistic research. In this study, we will investigate the accuracy of earnings conference call transcripts produced by multiple tools and the impact of that transcription accuracy on the results of subsequent text mining analysis. While prior studies have focused on a single form of text mining, we will conduct three types of text analysis: bag-of-words-based classification, lexicon-based classification, and sentiment analysis. The results will show whether different types of text mining require different levels of transcription quality and how feasible automated transcription services are across a range of text mining applications.
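
The abstract does not state how transcription accuracy will be measured. One common choice for ASR output is word error rate (WER): the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch under that assumption; the function name and the example sentences are illustrative and are not taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: tokens are lowercased and split on whitespace. A real
    pipeline would likely also normalize punctuation and numerals.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of six reference words.
human = "revenue grew five percent this quarter"
asr = "revenue grew nine percent this quarter"
wer = word_error_rate(human, asr)
```

A score of 0 means the ASR output matches the reference word for word. Because insertions count as errors, WER can exceed 1 when the hypothesis is much longer than the reference.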