Title: Deep Learning of millions of randomized APA variants
Advisors: Georg Seelig and Larry Ruzzo
Abstract: Interpreting the gene-regulatory code is integral to predicting the impact of mutations and to engineering synthetic constructs. With recent advances in large-scale genetic data collection, researchers increasingly study regulatory mechanisms in a data-driven way, employing state-of-the-art neural networks to learn about them. At the same time, new methods for interpreting neural networks are constantly being developed, mainly in the image recognition community. This paper investigates the regulation of Alternative Polyadenylation (APA), a post-transcriptional mechanism that is linked to both transcriptome variability and genetic disease. Using millions of synthetically engineered, randomized gene variants, a high-performing neural network is trained to predict APA site selection and cleavage specificity. The network is capable of inferring the impact of mutations and, when fine-tuned on a small number of naturally occurring samples, it can predict APA isoform levels in the human genome. Finally, by adapting visualization techniques to the sequence-based architecture, the regulatory code learned by the network is deciphered.