Publication Date
8-2012
Type of Culminating Activity
Thesis
Degree Title
Master of Science in Computer Science
Department
Computer Science
Supervisory Committee Chair
Tim Andersen
Abstract
A regular expression and region-specific filtering system for biological records at the National Center for Biotechnology Information (NCBI) database is integrated into an object-oriented sequence counting application, and a statistical software suite is designed and deployed to interpret the resulting k-mer frequencies, with a priority focus on nullomers. The proteome k-mer frequency spectra of ten model organisms and the genome k-mer frequency spectra of two bacteria and virus strains for the coding and non-coding regions are comparatively scrutinized. We observe that the naturally-evolved (NCBI/organism) and the artificially-biased (randomly-generated) sequences exhibit a clear deviation from the artificially-unbiased (randomly-generated) histogram distributions. Furthermore, a preliminary assessment of prime predictability is conducted on chronologically ordered NCBI genome snapshots over an 18-month period using an artificial neural network; three distinct supervised machine learning algorithms are used to train and test the system on customized NCBI data sets to forecast future prime states, revealing that, to a modest degree, it is feasible to make such predictions.
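For readers unfamiliar with the terms in the abstract, the following is a minimal sketch of what a k-mer frequency spectrum and a nullomer list might look like in code. It is not the thesis's counting application. The function names (`count_kmers`, `find_nullomers`, `frequency_spectrum`) are hypothetical, and the four-letter DNA alphabet is an assumption; a proteome would need the 20-letter amino acid alphabet passed in instead.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence, k, alphabet="ACGT"):
    """Count overlapping length-k substrings, skipping windows that
    contain any character outside the assumed alphabet (e.g. 'N')."""
    allowed = set(alphabet)
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= allowed:
            counts[kmer] += 1
    return counts


def find_nullomers(counts, k, alphabet="ACGT"):
    """Return every k-mer over the alphabet that was never observed."""
    candidates = ("".join(p) for p in product(alphabet, repeat=k))
    return [kmer for kmer in candidates if kmer not in counts]


def frequency_spectrum(counts, k, alphabet="ACGT"):
    """Map each occurrence count to the number of distinct k-mers
    with that count; bin 0 holds the number of nullomers."""
    spectrum = Counter(counts.values())
    spectrum[0] = len(set(alphabet)) ** k - len(counts)
    return dict(sorted(spectrum.items()))


if __name__ == "__main__":
    # Toy sequence chosen for illustration only; not data from the thesis.
    toy = "ACGTACGT"
    counts = count_kmers(toy, 2)
    print(dict(counts))
    print(find_nullomers(counts, 2))
    print(frequency_spectrum(counts, 2))
```

The spectrum here is a histogram over occurrence counts, which is the kind of distribution the abstract compares between organism sequences and randomly generated ones. Enumerating every possible k-mer is only practical for small k, since the candidate set grows as the alphabet size raised to the power k.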
Recommended Citation
Schmidt, Nathan O., "On the K-Mer Frequency Spectra of Organism Genome and Proteome Sequences with a Preliminary Machine Learning Assessment of Prime Predictability" (2012). Boise State University Theses and Dissertations. 346.
https://scholarworks.boisestate.edu/td/346