Publication Date

5-2013

Type of Culminating Activity

Thesis

Degree Title

Master of Science in Computer Science

Department

Computer Science

Supervisory Committee Chair

Tim Andersen, Ph.D.

Abstract

We present an overview of the document classification process and report research conducted on the newly constructed SBIR-STTR corpus. Specifically, we survey the current methods for annotation, corpus construction, feature construction, feature weighting, and classifier algorithms. We introduce a new dataset derived from public data downloaded from sbir.gov, along with the Text Annotation Toolkit (TAT)₁, for use in classification research.

TAT is a collection of independent components packaged into a single open-source application, engineered to support the document classification workflow. It addresses three fundamental issues: tracking changes in a working corpus, saving the data used to train classifiers so that results are reproducible, and providing a mechanism for interacting with copyright-protected corpora. TAT is built on the Open IDE [35] framework, which gives plug-in developers access to standard, well-tested libraries and saves years of development time. The main goal of TAT is to minimize the labor-intensive process of creating labelled data that can be used to train, test, and deploy machine learning models for automated text annotation. Additionally, TAT gives researchers an easy way to automatically reproduce prior results. The toolkit can facilitate the annotation of text using different machine learning packages, as well as corpora with different metadata specifications.

₁ TAT is freely available for download from trac.boisestate.edu.

Recommended Citation

Panter, Shane K., "Document Classification" (2013). Boise State University Theses and Dissertations. 362.
https://scholarworks.boisestate.edu/td/362

Included in

Computer Sciences Commons

Boise State University Theses and Dissertations

Document Classification
