Boise State University Theses and Dissertations

On the K-Mer Frequency Spectra of Organism Genome and Proteome Sequences with a Preliminary Machine Learning Assessment of Prime Predictability

Nathan O. Schmidt, Boise State University

Publication Date

8-2012

Type of Culminating Activity

Thesis

Degree Title

Master of Science in Computer Science

Department

Computer Science

Major Advisor

Tim Andersen

Abstract

A regular-expression and region-specific filtering system for biological records in the National Center for Biotechnology Information (NCBI) database is integrated into an object-oriented sequence counting application, and a statistical software suite is designed and deployed to interpret the resulting k-mer frequencies, with a priority focus on nullomers. The proteome k-mer frequency spectra of ten model organisms and the genome k-mer frequency spectra of two bacteria and virus strains for the coding and non-coding regions are comparatively scrutinized. We observe that the naturally evolved (NCBI/organism) and the artificially biased (randomly generated) sequences exhibit a clear deviation from the artificially unbiased (randomly generated) histogram distributions. Furthermore, a preliminary assessment of prime predictability is conducted on chronologically ordered NCBI genome snapshots over an 18-month period using an artificial neural network; three distinct supervised machine learning algorithms are used to train and test the system on customized NCBI data sets to forecast future prime states, revealing that, to a modest degree, it is feasible to make such predictions.
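
For readers unfamiliar with the terms, the following is a minimal illustrative sketch of what a k-mer frequency spectrum and a nullomer list look like for a short DNA string. It is not the thesis's counting application: the function names (`kmer_counts`, `nullomers`, `frequency_spectrum`), the four-letter alphabet default, and the toy input are all assumptions made here for illustration, and a nullomer is taken to mean a k-mer over the alphabet that never occurs in the sequence.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k, alphabet="ACGT"):
    """Count overlapping k-mers, skipping any that contain symbols outside the alphabet."""
    allowed = set(alphabet)
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= allowed:
            counts[kmer] += 1
    return counts


def nullomers(sequence, k, alphabet="ACGT"):
    """Return, in sorted order, every possible k-mer that is absent from the sequence."""
    counts = kmer_counts(sequence, k, alphabet)
    every_kmer = ("".join(p) for p in product(alphabet, repeat=k))
    return sorted(kmer for kmer in every_kmer if kmer not in counts)


def frequency_spectrum(sequence, k, alphabet="ACGT"):
    """Map each occurrence count to the number of distinct k-mers having that count.

    The entry for count 0 is the number of nullomers.
    """
    counts = kmer_counts(sequence, k, alphabet)
    spectrum = Counter(counts.values())
    absent = len(alphabet) ** k - len(counts)
    if absent:
        spectrum[0] = absent
    return dict(spectrum)


if __name__ == "__main__":
    toy = "ACGTACGT"  # hypothetical toy input, not data from the thesis
    print(kmer_counts(toy, 2))
    print(nullomers(toy, 2))
    print(frequency_spectrum(toy, 2))
```

Passing the 20-letter amino acid alphabet as `alphabet` would give the analogous proteome spectrum; at realistic genome scale a dictionary of strings like this would likely be replaced by a more compact encoding.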

Recommended Citation

Schmidt, Nathan O., "On the K-Mer Frequency Spectra of Organism Genome and Proteome Sequences with a Preliminary Machine Learning Assessment of Prime Predictability" (2012). Boise State University Theses and Dissertations. 346.
https://scholarworks.boisestate.edu/td/346

Included in

Numerical Analysis and Scientific Computing Commons
