Predicting Functional Activity of Ribozyme Mutations Using Machine Learning
Additional Funding Sources
This research has been sponsored by the National Science Foundation under Award No. 1950599.
Abstract
The risk associated with new variants of RNA viruses has revealed the importance of RNA mutations for national security. SARS coronaviruses and influenza viruses are two examples of the many viruses that have RNA genomes. The ability to predict how mutations to an RNA sequence will change its function has been limited by the difficulty of predicting the functional consequences of combined mutations and by the hazards of working with potentially dangerous viruses. Here, we worked with mutation data from the CPEB3 ribozyme, a small non-coding RNA found in humans. We trained a Long Short-Term Memory (LSTM) machine learning model on a data set containing the ribozyme function of every possible single mutation and every possible pair of mutations in the 69 nt long CPEB3 ribozyme. The trained model was then used to predict the function of sequences carrying combinations of three or more mutations. LSTM models are often applied to text and string recognition. To feed the data to the model, each RNA sequence was encoded so that each nucleotide was a separate string, delimited by spaces, enabling the model to learn how changing combinations of strings altered the functional output. Our initial results show considerable predictive power (as measured by R² and Pearson correlation coefficients) for combinations of two and three mutations that were withheld from the training data. These results suggest that data sets with relatively small numbers of RNA mutations can be used to accurately predict very large numbers of mutational combinations. Future work will aim to optimize the models and data sets and to test the generalizability of the approach with other experimental systems.
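The encoding and evaluation steps described above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration, not the authors' code: the function names are hypothetical, the six-nucleotide sequence is a toy placeholder rather than the CPEB3 ribozyme sequence, and R² is computed as the coefficient of determination, which may differ from the definition used in the study. The variant counts assume that each position can change to any of the three other nucleotides.

```python
from math import comb, sqrt

NUCLEOTIDES = "ACGU"


def encode(seq):
    """Join nucleotides with spaces so each one becomes a separate token."""
    unknown = set(seq) - set(NUCLEOTIDES)
    if unknown:
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    return " ".join(seq)


def mutate(seq, changes):
    """Apply point substitutions given as {0-based position: new base}."""
    bases = list(seq)
    for position, base in changes.items():
        bases[position] = base
    return "".join(bases)


def n_variants(length, k):
    """Count sequences with exactly k substitutions (3 alternatives per site)."""
    return comb(length, k) * 3 ** k


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Toy placeholder sequence, NOT the CPEB3 ribozyme.
toy = "GGAUCC"
double_mutant = mutate(toy, {0: "A", 3: "C"})
print(encode(toy))            # space-delimited tokens for the reference
print(encode(double_mutant))  # same encoding for a two-mutation variant

# Size of the sequence space for a 69 nt RNA under the stated assumption.
for k in (1, 2, 3):
    print(k, n_variants(69, k))
```

Under these assumptions, a 69 nt sequence has 207 single mutants and 21,114 double mutants, but 1,414,638 triple mutants, which illustrates why a model trained on singles and pairs would be predicting a much larger space of combinations.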