Fuzzy Conservation-Based Algorithm for Protein Family Classification

Publication Date


Type of Culminating Activity


Degree Title

Master of Science in Engineering, Computer Engineering


Electrical and Computer Engineering

Major Advisor

Scott F. Smith


The development of advanced computational techniques to classify protein sequences by their evolutionary relationships is an important problem in bioinformatics. Most protein classification methods rely on patterns of residue conservation within sequences to identify evolutionary relationships; however, the sequences of related proteins can vary dramatically while their structure remains conserved. These remotely homologous sequences are difficult to classify using traditional sequence-only methods, but a classification that uses both residue and structure conservation may reveal the relationship.

This thesis presents a protein classification method that uses standard fuzzy logic methods to combine the residue, secondary structure, and solvent accessibility conservation patterns found in the multiple sequence alignments (MSAs) of protein families. The combined conservation of each alignment position is used to weight a position-specific scoring matrix (PSSM) of the protein family.
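
The weighting idea described above can be illustrated with a minimal sketch. The abstract does not give the membership functions, rule base, or defuzzification used in the thesis, so everything below is an assumption made for illustration: conservation scores are taken to lie in [0, 1], the "high" and "low" fuzzy sets are simple linear ramps, the four rules and their output levels (1.0, 0.6, 0.1) are hypothetical, and the PSSM is represented as a list of per-position dictionaries. The function names `combined_conservation` and `weight_pssm` are likewise hypothetical.

```python
def high(x):
    """Membership in an assumed 'highly conserved' set: a linear ramp on [0, 1]."""
    return min(1.0, max(0.0, x))


def low(x):
    """Membership in the complementary 'poorly conserved' set."""
    return 1.0 - high(x)


def combined_conservation(residue, secondary, accessibility):
    """Combine three per-position conservation scores into one weight.

    Uses min for fuzzy AND, max for fuzzy OR, and a weighted average of
    constant rule outputs (zero-order Sugeno style). The rules and output
    levels are illustrative assumptions, not taken from the thesis.
    """
    structure_high = min(high(secondary), high(accessibility))
    structure_low = max(low(secondary), low(accessibility))
    rules = [
        # (firing strength, output level)
        (min(high(residue), structure_high), 1.0),  # everything conserved
        (min(low(residue), structure_high), 0.6),   # structure conserved only
        (min(high(residue), structure_low), 0.6),   # residue conserved only
        (min(low(residue), structure_low), 0.1),    # nothing conserved
    ]
    total = sum(strength for strength, _ in rules)
    if total == 0.0:
        return 0.0
    return sum(strength * level for strength, level in rules) / total


def weight_pssm(pssm, weights):
    """Scale every score in each PSSM position by that position's weight."""
    return [
        {res: w * s for res, s in column.items()}
        for column, w in zip(pssm, weights)
    ]


# Toy two-position example with made-up scores.
toy_pssm = [{"A": 2.0, "C": -1.0}, {"A": -1.0, "C": 4.0}]
toy_weights = [
    combined_conservation(1.0, 1.0, 1.0),
    combined_conservation(0.0, 1.0, 1.0),
]
toy_weighted = weight_pssm(toy_pssm, toy_weights)
```

In this sketch, a position whose residues vary but whose secondary structure and solvent accessibility are conserved still receives a moderate weight, which is one way a structure-aware PSSM could remain sensitive to remote homologs.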

Statistical randomization methods were used to test reliability. The results were excellent, with 99.84% of the PSSMs able to differentiate between family and non-family members. Several potential remote homologs were identified, and the conservation patterns for the three families that performed poorly may help researchers identify alternative classifications for the sequences in these families.
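
The abstract does not say which randomization procedure was used. As a rough illustration of the general idea only, the sketch below scores a sequence against a PSSM and compares that score with the scores of shuffled copies of the same sequence to obtain an empirical p-value. The ungapped scoring, the shuffle-based null model, and the names `score` and `empirical_p_value` are all assumptions made for this example.

```python
import random


def score(pssm, seq):
    """Ungapped score: sum the PSSM entry for each residue at its position.

    Residues missing from a column contribute 0.0 (an assumption).
    """
    return sum(column.get(res, 0.0) for column, res in zip(pssm, seq))


def empirical_p_value(pssm, seq, n_shuffles=1000, seed=0):
    """Fraction of shuffled sequences scoring at least as well as the original.

    Shuffling preserves residue composition but destroys position-specific
    signal. The +1 terms keep the estimate strictly above zero.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    observed = score(pssm, seq)
    residues = list(seq)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        if score(pssm, residues) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)
```

A small p-value would suggest that the sequence matches the family profile better than expected from its composition alone; a value near 1 would suggest no position-specific signal.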
