Publication Date

8-2012

Type of Culminating Activity

Thesis

Degree Title

Master of Science in Computer Science

Department

Computer Science

Supervisory Committee Chair

Tim Andersen

Abstract

A regular expression and region-specific filtering system for biological records at the National Center for Biotechnology database is integrated into an object oriented sequence counting application, and a statistical software suite is designed and deployed to interpret the resulting k-mer frequencies|with a priority focus on nullomers. The proteome k-mer frequency spectra of ten model organisms and the genome k-mer frequency spectra of two bacteria and virus strains for the coding and non-coding regions are comparatively scrutinized. We observe that the naturally-evolved (NCBI/organism) and the artificially-biased (randomly-generated) sequences exhibit a clear deviation from the artificially-unbiased (randomly-generated) histogram distributions. Furthermore, a preliminary assessment of prime predictability is conducted on chronologically ordered NCBI genome snapshots over an 18-month period using an artificial neural network; three distinct supervised machine learning algorithms are used to train and test the system on customized NCBI data sets to forecast future prime states|revealing that, to a modest degree, it is feasible to make such predictions.

Share

COinS