2020 |
Davuluri, Ramana V. |
R01 Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies. |
Informatics Platform For Mammalian Gene Regulation At Isoform-Level @ State University New York Stony Brook
SUMMARY With each successive discovery in genetics, the true dynamic complexity of the human genome has become increasingly apparent, requiring regular updates to the technical definition of the word "gene". It is now understood that the notion of "one gene makes one protein that functions in one signaling pathway" in human cells is overly simplistic, because the majority of human genes produce multiple functional products (transcript variants and protein isoforms) through alternative transcription and/or alternative splicing. Therefore, our central hypothesis is that the isoform-level gene products ("transcript variants" and "protein isoforms") are the basic functional units in a mammalian cell, and accordingly, the informatics platforms for managing and analyzing gene regulation data in both normal and disease cells should adopt "gene isoform centric" rather than "gene centric" approaches. Towards the goal of broadly impacting gene regulation and functional studies at the gene isoform level, we have been developing novel algorithms for analyses of genome-wide transcriptome (RNA-seq and exon-array) and protein-DNA binding (ChIP-seq) data, and for extending gene-level orthology mapping to exon- and transcript-level mapping between orthologous human and mouse genes. By applying these novel algorithms to public datasets, we have observed significant expression differences between sample groups (e.g., developmental stages, cancer subtypes, normal vs cancer) for numerous genes at the isoform level but not at the overall gene level, and experimentally validated the "significant" isoforms using RT-qPCR in independent bio-specimens. While the application of these algorithms has led to the development of new methods for diagnosis of glioblastoma or a sub-type thereof, the isoform-level transcriptome analysis results also raised some challenging questions, for example:
How do the alternative promoters of a gene show switch-like opposing patterns of activity (one promoter is up-regulated while the other is down-regulated in one condition vs the other), and how do different splice variants of a gene show opposing expression patterns in cancer versus normal tissue samples? We currently lack informatics methods to address these challenging questions. Therefore, we propose to develop novel statistical methods (1) for integrative cluster analysis of isoform-level gene expression information from exon-array and RNA-seq platforms, (2) for identification of differential transcript/isoform usage in heterogeneous cancer samples, and (3) for identification of alternative transcription/splicing quantitative trait loci (sQTL) in tumors, adjusted for somatic genetic and epigenetic changes. In addition, (4) the novel predictions from these algorithms will be experimentally validated by chromatin immunoprecipitation (ChIP), dual-luciferase reporter assays and CRISPR/Cas9 genome editing in U87 and A172 cells. The novel bioinformatics methods developed by this project will support in silico discovery and research, accelerating the linkage of phenotypic and genomic information at the gene-isoform level.
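The kind of signal described above, a gene whose total expression is unchanged while its isoforms switch, can be illustrated with a minimal sketch. This is not the project's statistical method: the function names, the toy promoter labels `P1` and `P2`, and the expression values are all hypothetical, and the comparison is a plain difference in mean usage fractions with no significance test.

```python
from statistics import mean

def isoform_usage(counts):
    """Convert per-isoform expression values into fractions of the gene total."""
    total = sum(counts.values())
    if total == 0:
        return {iso: 0.0 for iso in counts}
    return {iso: value / total for iso, value in counts.items()}

def mean_usage(samples):
    """Average usage fraction of each isoform across a list of samples."""
    usages = [isoform_usage(sample) for sample in samples]
    return {iso: mean(u[iso] for u in usages) for iso in usages[0]}

def switch_summary(group_a, group_b):
    """Return mean gene-level totals for both groups and per-isoform usage change."""
    gene_a = mean(sum(sample.values()) for sample in group_a)
    gene_b = mean(sum(sample.values()) for sample in group_b)
    usage_a = mean_usage(group_a)
    usage_b = mean_usage(group_b)
    delta = {iso: usage_b[iso] - usage_a[iso] for iso in usage_a}
    return gene_a, gene_b, delta

# Hypothetical toy data: two promoters of one gene, two samples per group.
normal = [{"P1": 90.0, "P2": 10.0}, {"P1": 80.0, "P2": 20.0}]
tumor = [{"P1": 20.0, "P2": 80.0}, {"P1": 10.0, "P2": 90.0}]

gene_normal, gene_tumor, delta = switch_summary(normal, tumor)
print(gene_normal, gene_tumor)  # gene-level totals are identical in this toy case
print(delta)                    # usage shifts from P1 to P2
```

In this toy case a gene-level comparison would see no difference, while the usage fractions move by roughly 0.7 in opposite directions for the two promoters. A real analysis would need replicate-aware statistics and handling of sample heterogeneity, which is what the proposed methods are meant to address.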
|
0.933 |
2021 |
Davuluri, Ramana V |
R01 Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies. |
Developing Novel Deep-Learning Based Methods For Deciphering Non-Coding Gene Regulatory Code @ State University New York Stony Brook
SUMMARY This project will contribute novel pre-trained DNA Bidirectional Encoder Representations from Transformers, called DNABERT, and associated deep-learning tools to decipher the language of non-coding DNA and facilitate integration of gene regulatory information from rapidly accumulating sequence data with NLM's genetic databases (for example, dbSNP, dbGaP and ClinVar), which serve both scientists and public health by helping identify the genetic components of disease. While the genetic code explaining how DNA is translated into proteins is universal, the regulatory code that determines when and how genes are expressed varies across cell types and organisms. From a language modeling perspective, non-coding DNA is highly complex due to the existence of polysemy and distant semantic relationships. Recently, deep learning methods have been used to unravel the gene regulatory code, but they have failed to model such language features in the genome globally and robustly, especially in data-scarce scenarios. To address this challenge, we propose DNABERT to model DNA as a language, by adapting the idea of Bidirectional Encoder Representations from Transformers (BERT). Based on recent observations in natural language processing research, we hypothesize that pre-trained transformer-based neural network models offer a promising, and not yet fully explored, deep learning approach for a variety of sequence prediction tasks in the analysis of non-coding DNA. Our preliminary results showed that DNABERT pre-trained on the human genome achieved state-of-the-art performance on promoter and splice-site prediction tasks after easy fine-tuning on small task-specific data (Ji, Y. et al. 2020). The goal of our proposed research is to develop DNABERT for a variety of sequence prediction tasks, and to benchmark it against existing state-of-the-art deep-learning based methods.
Specific aims are to (1) develop novel deep-learning methods by adapting BERT; (2) apply the proposed deep-learning methods to specifically target non-coding DNA sequence analyses and predictions; and (3) predict and validate functional non-coding genetic variants by applying DNABERT prediction models. A major contribution of the proposed research is the development of a pre-trained DNABERT model and prediction algorithms, which provide powerful new methods for the analysis and prediction of DNA sequences. Since the pre-training of DNABERT is resource-intensive, we will provide the source code and pre-trained model on GitHub for future academic research. We will also develop an integrated web server that (1) deploys the DNABERT model, (2) includes a database to store the identified sequence features and predictions, and (3) provides tutorials to help users apply DNABERT to their specific research problems. We anticipate that DNABERT can bring new advancements and insights to the bioinformatics community by bringing an advanced language modeling perspective to gene regulation analyses.
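To make the "DNA as a language" idea concrete, here is a minimal sketch of how a sequence might be turned into tokens and prepared for BERT-style masked pre-training. The abstract does not say how DNABERT tokenizes sequences, so the overlapping k-mer scheme, the function names, the `[MASK]` string, and the 15% mask rate are assumptions for illustration only. Masking each token independently is also a simplification, since overlapping k-mers share bases with their neighbors.

```python
import random

def kmer_tokens(sequence, k=6):
    """Split a DNA string into overlapping k-mers (assumed tokenization scheme)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask symbol.

    Returns the masked token list and a parallel list of labels holding the
    original token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels

# Hypothetical toy sequence; a real model would see much longer inputs.
tokens = kmer_tokens("ATGCGTAC", k=3)
masked, labels = mask_tokens(tokens, mask_rate=0.5, seed=1)
print(tokens)
print(masked)
```

During pre-training, a model would be asked to recover the original token at each masked position from the surrounding context, which is how a BERT-style model can learn sequence features without task-specific labels.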
|
0.933 |