Publication Date

5-2010

Date of Final Oral Examination (Defense)

6-28-2010

Type of Culminating Activity

Thesis

Degree Title

Master of Science in Computer Science

Department

Computer Science

Major Advisor

Elisa Barney Smith, Ph.D.

Major Advisor

Timothy Andersen, Ph.D.

Advisor

Amit Jain, Ph.D.

Abstract

With the advent of more powerful personal computers, inexpensive memory, and digital cameras, curators around the world are working towards preserving historical documents on computers. Since many of the organizations for which they work have limited funds, there is world-wide interest in a low-cost solution to obtaining these digital records in a computer-readable form. An open-source layout analysis system called OCRopus is being developed for such a purpose. In its original state, though, it could not process documents that contained information other than text. Segmenting the page into regions of text and non-text areas is the first step of analyzing a mixed-content document, but it did not exist in OCRopus. Therefore, the goal of this thesis was to add this capability so that OCRopus could process a full spectrum of documents. By default, the RAST page segmentation algorithm processed text-only documents at a target resolution of 300 DPI. In a separate module, the Voronoi algorithm divided the page into regions, but did not classify them as text or non-text. Additionally, it tended to oversegment non-text regions and was tuned to a resolution of 300 DPI. Therefore, the RAST algorithm was improved to recognize non-text regions and the Voronoi algorithm was extended to classify text and non-text regions and merge non-text regions appropriately. Finally, both algorithms were modified to perform at a range of resolutions. Testing on a set of documents of different types showed an improvement of 15-40% for the RAST algorithm, giving it an average segmentation accuracy of about 80%. Partially due to the representation of the ground truth, the Voronoi algorithm did not perform as well as the improved RAST algorithm, averaging around 70% overall. Depending on the layout of the historical documents to be digitized, though, either algorithm could be sufficiently accurate to be utilized.

Recommended Citation

Winder, Amy Alison, "Extending the Page Segmentation Algorithms of the OCRopus Documentation Layout Analysis System" (2010). Boise State University Theses and Dissertations. 122.
https://scholarworks.boisestate.edu/td/122

Included in

Other Computer Sciences Commons
