Electrical and Computer Engineering Faculty Publications and Presentations

A Study of Style Effects on OCR Errors in the MEDLINE Database

Penny Garrison, Boise State University
Diane Davis, Boise State University
Tim Andersen, Boise State University
Elisa Barney Smith, Boise State University

Document Type

Conference Proceeding

Publication Date

1-19-2005

DOI

http://dx.doi.org/10.1117/12.589408

Abstract

The National Library of Medicine has developed a system for the automatic extraction of data from scanned journal articles to populate the MEDLINE database. Although the 5-engine OCR system used in this process performs well overall, it makes character recognition errors that must be corrected for the process to achieve the requisite accuracy. The correction process feeds words containing characters with less than 100% confidence (as determined automatically by the OCR engine) to a human operator, who must then manually verify the word or correct the error. The majority of these errors occur in the affiliation information zone, where the characters are in italics or small fonts. Therefore, only affiliation information data is used in this research. This paper examines the correlation between OCR errors and various character attributes in the MEDLINE database, such as font size, italics, and bold, as well as OCR confidence levels. The motivation for this research is that if a correlation exists between character style and types of errors, it should be possible to use this information to improve operator productivity by increasing the probability that the correct word option is presented to the human editor. We have determined that this correlation exists, in particular for characters with diacritics.
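The review rule described in the abstract (a word goes to the human operator when any of its characters has less than 100% confidence) can be sketched as follows. This is only an illustration, not the National Library of Medicine's implementation: the names `OcrChar`, `OcrWord`, `needs_review`, and `review_queue` are hypothetical, and the 0.0 to 1.0 confidence scale is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrChar:
    """One recognized character (hypothetical record layout)."""
    char: str
    confidence: float      # assumed scale: 0.0 to 1.0, where 1.0 means 100%
    italic: bool = False   # example style attribute


@dataclass
class OcrWord:
    """A word as a sequence of recognized characters."""
    chars: List[OcrChar]

    @property
    def text(self) -> str:
        return "".join(c.char for c in self.chars)


def needs_review(word: OcrWord, threshold: float = 1.0) -> bool:
    """True if any character in the word falls below the confidence threshold."""
    return any(c.confidence < threshold for c in word.chars)


def review_queue(words: List[OcrWord]) -> List[OcrWord]:
    """Keep only the words that would be shown to the human operator."""
    return [w for w in words if needs_review(w)]
```

Under these assumptions, a word whose characters all report a confidence of 1.0 is skipped, while a word containing a single lower-confidence character (for example an italic character with a diacritic) is queued for manual verification.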

Copyright Statement

Copyright 2005 Society of Photo-Optical Instrumentation Engineers. One print or electronic copy may be made for personal use only. Systematic reproduction and distribution, duplication of any material in this paper for a fee or for commercial purposes, or modification of the content of the paper are prohibited. DOI: 10.1117/12.589408

Publication Information

Garrison, Penny; Davis, Diane; Andersen, Tim; and Barney Smith, Elisa. (2005). "A Study of Style Effects on OCR Errors in the MEDLINE Database". Proceedings of SPIE-IS&T Electronic Imaging, 5676.

Included in

Electrical and Computer Engineering Commons
