
Mark B. Gerstein - US grants

Affiliations:

Molecular Biophysics and Biochemistry

Yale University, New Haven, CT

Area:

Computational Biology and Bioinformatics

Website:

https://sites.gersteinlab.org/research/


We are testing a new system for linking grants to scientists.

The funding information displayed below comes from the NIH Research Portfolio Online Reporting Tools and the NSF Award Database.
The grant data on this page is limited to grants awarded in the United States and is thus partial. It can nonetheless be used to understand how funding patterns influence mentorship networks and vice versa, which has deep implications for how research is done.
You can help! If you notice any inaccuracies, please sign in and mark grants as correct or incorrect matches.


High-probability grants

According to our matching algorithm, Mark B. Gerstein is the likely recipient of the following grants.

Years	Recipients	Code	Title / Keywords	Matching score
1997–2001	Gerstein, Mark	N/A	Development of a Database of Protein Motions and Associated Tools @ Yale University. The goal of this proposal is to construct a working database of protein motions. This database will be made into a publicly accessible WWW community resource, of general use to the molecular biophysics community. It will meet the needs of researchers trying to understand the principles of protein structure and function (since motions are often the link between structure and function). The database will be associated with a classification of protein motions, based on the packing at mobile interfaces. The construction of this database has recently become feasible because of a number of developments: the great increase in the number of solved protein structures, the creation of an infrastructure of linked biological databases, and work showing how packing can be used to rationalize mechanisms for protein motions. The development of the database will proceed in two phases with two objectives in each phase (one computational and the other biological). The principal objective in the first phase will be to establish a working, prototype database of protein motions. Motions in this database will initially be arranged hierarchically in terms of size (i.e. fragment, domain, and subunit) and then packing. The packing classification will depend on whether or not the motion involves sliding over a continuously maintained interface. To achieve the first objective it will be necessary to develop a standardized nomenclature and conceptual-information model for protein motions. This will include information such as the number of hinges, the magnitude of the rigid-body rotation, and the number and size of the mobile interfaces.
Building upon the framework established in the first phase, the second phase of the project will expand the database and associated motions classification. The goal will be to have an entry for every known protein motion. In addition, "inferred motions," in sequence and structure homologues of a protein with known motion, will also be included. To handle the large amount of data this will involve analyzing and to help "populate" the database, an automatic conformation comparison tool will be developed. This will rapidly align and compare two arbitrary protein conformations, identifying rigid core regions, flexible hinges, interface packing differences, and so forth. It will also determine whether these structures have any sequence or structure homologues that share their motion. The expansion of the database will require a more sophisticated structure for describing the relationships between motions and a more detailed motion classification scheme than the size-packing hierarchy initially used. The motions classification will be expanded through detailed geometric analysis and calculation, principally focusing on the packing at mobile interfaces.	0.915
1999–2002	Gerstein, Mark Bender	P01 (Activity Code Description: For the support of a broadly based, multidisciplinary, often long-term research program which has a specific major objective or a basic theme. A program project generally involves the organized efforts of relatively large groups, members of which are conducting research projects designed to elucidate the various aspects or components of this objective. Each research project is usually under the leadership of an established investigator. The grant can provide support for certain basic resources used by these groups in the program, including clinical components, the sharing of which facilitates the total research effort. A program project is directed toward a range of problems having a central research focus, in contrast to the usually narrower thrust of the traditional research project. Each project supported through this mechanism should contribute or be directly related to the common theme of the total research effort. These scientifically meritorious projects should demonstrate an essential element of unity and interdependence, i.e., a system of research activities and projects directed toward a well-defined research program goal.)	Structural Genomics of Membrane Proteins @ Yale University. Our goal is to compare microbial genomics in terms of membrane protein structure, work falling into the emerging field of structural genomics. We will focus on the occurrence of membrane proteins composed of transmembrane (TM) helices and the interactions between pairs of these proteins. (i) Our first aim will be to inventory all the TM-helix proteins in the recently sequenced microbial genomes. Initially, we will use membrane-protein prediction methods based on transfer energy scales. Then we will try to improve upon these by building a Hidden Markov Model to identify membrane proteins.
This probabilistic approach will allow us to systematically combine, in a Bayesian framework, prior information from biophysical scales with statistical information from the known membrane proteins. (ii) Our second aim is to look at protein-protein interactions among helical membrane proteins from a database perspective. We will find all the common helix-helix interfaces in the database of known structures and compare these to the TM-helix oligomerization motifs found in genetic screens by the Beckwith and Engelman groups. In particular, we will measure the packing efficiency for all the helix-helix interfaces, trying to determine whether membrane-protein interfaces are packed less tightly than soluble ones. We will also see how often sequence motifs associated with TM-helix oligomerization occur in a number of genomes, estimating the fraction of proteins in a genome that could potentially interact via these motifs. (iii) Our final aim is to integrate into a comprehensive database the information on the occurrence and interaction of membrane proteins from the first two parts with further information, e.g. related to expression. This will allow us to compare genomes in terms of membrane-protein fold usage and look for TM-proteins common to many diverse organisms. It will also allow us to put the patterns of occurrence of TM-proteins into context, by comparing them to those of soluble proteins. We expect our analysis will initially involve approximately 10 genomes with this number increasing to approximately 100 during the funding period.	1
2002–2007	Gerstein, Mark (co-PI); Cheung, Kei-Hoi	N/A	Exploring Approaches For Microarray Databases That Enable Flexible Design and Integrative Analysis @ Yale University. This proposal will explore a number of technologies for representing, storing, and processing microarray data, applied to development of the Yale Microarray Database (YMD). Microarray data sets, while providing information about gene expression or protein interactions, present one of the significant challenges in data management. Three specific technologies are involved. Collectively these approaches will allow us to address a number of the important issues in microarray database design: (i) dealing with large-scale and diverse information; (ii) allowing data dissemination; (iii) facilitating interoperation and exchange of data among heterogeneous information resources; and (iv) supporting integrative data analysis. 1. The use of the flexible EAV data modeling approach. The entity-attribute-value (EAV) approach in modeling a wide variety of sparsely populated attributes in a flexible fashion will be utilized in YMD to represent heterogeneous data including experimental conditions and multi-level analysis results. 2. XML-based interoperability. XML can facilitate the conversion and merging of heterogeneous (but related) data sets from different information sources into a common format so that integrated data access can be facilitated. MAGEML (an XML-based markup language for describing gene expression data) as a common data exchange mechanism between YMD and other MIAME-compliant microarray data repositories will be developed. 3. Parallel computing. A distributed, parallel computing approach to help speed up complex queries and analyses of large amounts of gene expression data will be used, providing users with integrated access to a variety of tools.	0.915
2003–2006	Snyder, Michael (co-PI); Schultz, Martin (co-PI); Gerstein, Mark (co-PI); Zhao, Hongyu	N/A	Statistical and Computational Approaches For Integrated Genomics and Proteomics Analysis and Their Applications to Modeling G1/S Transition During Yeast Cell Cycle @ Yale University. Advances in technologies are changing the field of biology to move beyond genomes to transcriptomes, proteomes and metabolomes. It has become clear that the combination of predictive modeling with systematic experimental verification will be required to gain a deeper insight into living organisms, therapeutic targeting and bioengineering. Although the importance of integrating various types of biological data to address scientific questions is well recognized and appreciated, the potential information carried in different types of data may not be fully realized without a sound and comprehensive statistical framework to integrate these data. In addition, close collaborations among statisticians, biologists, bioinformaticians, and computer scientists are essential to ensure that these statistical methods provide a reasonable description of the biological processes studied and the validity of these methods should be rigorously tested through biological experiments. In this project, a team of researchers with expertise in statistics, genomics and proteomics, bioinformatics, and computer science will develop an integrated approach to reconstructing biological pathways. Statistical and computational methods will be developed to better identify transcription factor targets, to integrate yeast two-hybrid data, protein complex data, protein localization data, and gene expression data to infer protein interaction networks, and to further integrate DNA-protein binding data to reconstruct transcriptional regulatory networks.
This project focuses on the G1/S transition during the yeast cell cycle to statistically model and experimentally validate inferred regulatory networks. In addition, parallel computing methods will be developed to overcome the computing bottleneck in the analysis of large-scale networks. The resources generated from this project, both computer programs and network information will be made available to the scientific community. It is anticipated that this project will lead to a statistical framework that can be utilized to dissect biological pathways and also will lead to an approach to integrating expertise from diverse disciplines to address important scientific problems in the post-genome era. With recent progresses in biotechnologies, it has become reality to collect tens of thousands of gene expression and protein expression levels in humans and other organisms. In addition, scientists now are able to monitor interactions among proteins and interactions between proteins and DNA sequences, to investigate the location that each gene is expressed, and to study the overall effects on the whole organism of individual genes through large collections of mutation strains. The availability of such data has led to a revolution in biological and biomedical sciences. Although there is a great potential and an enormous amount of information in these data, the major challenge is how to best integrate, analyze, and interpret these data to understand biological pathways. In this project, statistical and computational methods will be developed to integrate various types of data in an effort to reconstruct biological pathways with a focus on the understanding of gene regulations in cell cycle. The statistical models to be developed will be validated with biological experiments. Computer programs will be developed and distributed to the scientific community after extensive testing to allow biologists and medical researchers to use these tools to study other biological pathways. 
This project will also develop high-performance computing approaches to implementing the developed methods and will involve training activities in the general area of computational biology and bioinformatics. This grant is made under the Joint DMS/NIGMS Initiative to Support Research Grants in the Area of Mathematical Biology. This is a joint competition sponsored by the Division of Mathematical Sciences (DMS) at the National Science Foundation and the National Institute of General Medical Sciences (NIGMS) at the National Institutes of Health.	0.915
2003–2006	Snyder, Michael; Dinesh-Kumar, Savithramma (co-PI); Deng, Xing-Wang; Gerstein, Mark (co-PI)	N/A	Arabidopsis 2010 Project: Development of Arabidopsis Proteome Chip @ Yale University. The complete genome sequence of Arabidopsis has revealed a large number of novel genes; however, DNA sequence alone offers few clues as to their specific functions. Currently, much effort is being devoted toward studying gene function and expression. Although important, these studies are not sufficient to predict the structure, function, and activity of proteins in the cell. Proteomics, the global analysis of proteins, is emerging as an important area of research. In this pilot project, a small collection of expression clones of Arabidopsis open reading frames will be generated. Using this collection, optimal protein expression systems will be developed to produce and purify Arabidopsis proteins. The proteins will be printed to generate pilot protein chips, which will then be screened for various biochemical assays, including protein-protein interaction assays, protein-DNA/RNA interactions, protein-phospholipids interactions and the identification of substrates for kinases and other enzymes. Broader Impacts: The rapid progress in the field of large-scale biology has provided the opportunity to understand the function of biological networks as a whole. Since the biochemical function of a gene is manifested through its encoded protein, recently much emphasis has been devoted to the development of new tools for proteomics. Therefore, this project will ultimately provide an enormously valuable resource to the scientific community for a variety of applications aimed at the high-throughput study of protein function in Arabidopsis.
The advantage of proteome chips is that a comprehensive set of individual proteins can be directly screened in vitro for a wide variety of activities. Furthermore, once the proteins are prepared, proteome screening is significantly faster and cheaper than other methods. The reagents and information generated from this project will be available to the scientific community (http://bioinfo.mbb.yale.edu/genome/plant) on a regular basis and are expected to significantly enhance the analysis of protein function in Arabidopsis.	0.915
2004–2009	Ronald, Pamela (co-PI); Deng, Xing-Wang; Gerstein, Mark (co-PI)	N/A	Virtual Center For Analysis of Rice Genome Transcription @ Yale University. As the international effort to sequence the rice genome is completed, it becomes essential to define each and every transcriptional unit (or gene) encoded in the rice genome. Computer assisted genome annotation suggests that there may be about 60,000 genes in the rice genome. However, it is estimated that the combination of all experimental data available provides expression validation for only approximately half of the predicted genes. Further, due to the high GC content in rice gene coding regions, existing gene prediction programs will likely miss a significant fraction of the active genes. Thus there is an urgent need for experimental verification of all predicted genes and discovery of rice genes that were missed by current prediction programs. This project will attempt to discover all possible transcription units of the just completely sequenced Japonica rice. A high-density genome tiling oligonucleotide array produced with the Maskless Array Synthesizer (MAS) technology will be used, which permits reiteration of the oligo design for each subsequent array slide production. The workflow will be optimized first with rice chromosome 10 and then applied to all remaining chromosomes. Pooled probes derived from representative RNA samples from both normal grown rice and those subjected to various inductive treatments will be included to maximize transcription unit discovery. The data will be integrated into the current and ongoing genome annotation to test predicted gene models and to define structures of the novel transcription units. The raw data will be deposited in the MIAME compliant GEO database as soon as the results are subject to quality control.
All the original data, including raw hybridization data as well as gene models, will be made available to the public through a project web site (www.plantgenomics.yale.edu) once they are quality controlled and verified. This web site will be online at the end of this first funding year (August 30, 2005) and the last data set will be made available by August 30, 2007. All the new or improved rice gene models created from this work will be posted on our project web site and will also be forwarded to the NSF-supported rice genome annotation group at The Institute for Genomic Research (TIGR) at bimonthly intervals for their incorporation into their public rice gene annotation web site. The full data set will be made available to TIGR and the public by the end of this project (August, 2007). Broader Impacts: The experimental identification of all Japonica rice transcription units will provide an essential foundation for rice functional genomics and proteomics research in the future and provide useful comparative data for other cereals. In the process of our proposed studies, a number of postdoctoral researchers and students (both graduate and undergraduate) will obtain training in plant genomics and informatics. The project will work with the Peabody Museum at Yale University to teach the general public and school children about the role of rice cultivation in cultures around the world, and on the potential benefits of rice genomic advances for society.	0.915
2005–2009	Gerstein, Mark Bender	U54 (Activity Code Description: To support any part of the full range of research and development from very basic to clinical; may involve ancillary supportive activities such as protracted patient care necessary to the primary research or R&D effort. The spectrum of activities comprises a multidisciplinary attack on a specific disease entity or biomedical problem area. These differ from program project in that they are usually developed in response to an announcement of the programmatic needs of an Institute or Division and subsequently receive continuous attention from its staff. Centers may also serve as regional or national resources for special research purposes, with funding component staff helping to identify appropriate priority needs.)	Sub-Project 9 @ Rutgers the St Univ of Nj New Brunswick. Abstract not provided.	0.9
2005–2009	Gerstein, Mark Bender	U54	Sub 13 At Yale @ New York Structural Biology Center. Abstract not provided.	0.894
2005–2012	Snyder, Michael; Dinesh-Kumar, Savithramma (co-PI); Gerstein, Mark (co-PI)	N/A	Arabidopsis 2010: Development of An Arabidopsis Proteome Chip @ Yale University. The genome sequence of Arabidopsis suggests that it contains approximately 30,700 protein-coding genes. A major goal of the 2010 program is to understand how each of these genes functions during plant growth, development, and responses to biotic and abiotic cues. Global studies to analyze gene and protein function have largely focused on gene expression, gene disruption, protein interactions and protein localization. To elucidate biochemical activities of the proteome at the global scale, this project will continue to develop a protein chip for Arabidopsis. In the pilot project funded by NSF 2010 program, high throughput techniques for cloning and expression of 1000 Arabidopsis ORFs were optimized to produce high quality active proteins for the generation of an Arabidopsis protein chip. In this project, a collection of expression clones for 4,000 predicted Arabidopsis ORFs will be generated for tandem affinity purification (TAP) tag fusions. These proteins will be expressed in the plant-based transient expression system to produce and purify proteins. The proteins will be printed on various printing surfaces to produce protein microarrays, which will then be used to optimize protocols for analysis of protein activities. The reagents generated will be made available to the entire scientific community and the information will be available through our website (http://www.gersteinlab.org/proj/atpchip/). Broader Impacts: This project will provide a valuable resource for a variety of applications aimed at the high-throughput study of protein function in Arabidopsis.
The project will generate a suite of protocols specific to plant proteomes and generate a resourceful set of plant expression clones and other reagents that are expected to significantly enhance the analysis of protein function in Arabidopsis. Furthermore, the methods developed and information gained from this study can also be directly applied to the analysis of other agricultural and horticultural plants. In addition to generating a community resource, the project provides the unique opportunity to elevate the awareness and importance of genomics and proteomics to a wide range of individuals including visiting scholars from around the world, educators from small colleges, public secondary school teachers in and around the New Haven area and underrepresented students at the undergraduate and graduate level. Specifically, participation in Yale University STARS program will provide opportunities for underrepresented students to conduct cutting edge research in a proteomics lab.	0.915
2009–2010	Gerstein, Mark Bender (co-PI); Grigorenko, Elena L. (co-PI); Vaccarino, Flora M; Weissman, Sherman Morton (co-PI)	R01 (Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.)	Biological Correlates of Altered Brain Growth in Autism @ Yale University. DESCRIPTION (provided by applicant): A consistently replicated biological phenotype in autism spectrum disorders (ASD) is a larger head circumference (HC) in the first years of life. We hypothesize that increased brain size in ASD is attributable to altered dynamics of cell proliferation and/or differentiation due to genetic changes intrinsic to neural cells. In this application, we will derive induced pluripotent stem cells (iPSC) from skin fibroblasts in individuals with ASD and typically developing children with macrocephaly. Whole genome studies examining structural genetic variation in DNA isolated from iPSC as compared to lymphocytes of the same individuals will ensure genetic stability of the reprogrammed cells. In Specific Aim 1, iPSC lines obtained from 23 participants with ASD and 11 typically developing individuals with macrocephaly will be characterized with respect to cell proliferation, cell survival and genome wide structural variation such as copy number variations (CNVs) by paired end mapping (PEM) and array capture and sequencing. In Specific Aim 2, genome wide CNV as well as sequence variation datasets will be obtained in blood lymphocyte DNA taken from the same 23 participants with ASD and 11 typically developing individuals, plus a limited number of their family members. This will involve (1) PEM, (2) array capture for exons and promoter regions with sequencing, and (3) genome-wide mapping of retroelement patterns.
Genetic regions potentially important for ASD that will emerge from this study will be validated by targeted resequencing in two larger, independent cohorts of ASD probands and their family members, each comprising about 500 individuals. The immediate goal of our project is to create a new resource and analytical tool. The genetic studies comparing DNA sequence variation in iPSC and blood samples are essential to establish that the iPSC genomic structure corresponds to that identified in the patients. In future studies, iPSC lines generated in this project will be specifically differentiated along the neural lineage and further analyzed with respect to their proliferation, differentiation and survival, allowing us to test whether increased brain size in ASD is attributable to altered dynamics of cell proliferation and/or differentiation. These neural cells derived from iPSC lines will be characterized at the transcript and epigenetic levels, for which the basic characterization proposed in this project will provide a necessary platform. Our ultimate goal is to link neurobiological phenotypes and changes in gene expression during the neural differentiation process, with the underlying genetic structure of the individuals to elucidate disease pathogenesis. Therefore, the proposed project will provide a resource for correlating, in future studies, genomic sequence, regulation and intensity of gene expression, cellular (biological) consequences, and patient behavior. PUBLIC HEALTH RELEVANCE: This project will develop lines of pluripotent cells (iPSC) from individuals with autism spectrum disorders (ASD) with macrocephaly and typically developing children, using cells obtained by a skin biopsy. We will produce several iPSC lines per individual and characterize them with respect to their biology and their structural genetic variation. The aim is to create a resource and analytical tool, which will allow us to examine neuronal differentiation in autism spectrum disorders.	1
2012–2015	Gerstein, Mark Bender (co-PI); Gunel, Murat; Lifton, Richard P; Mane, Shrikant M	U54	Yale Center For Mendelian Disorders @ Yale University. DESCRIPTION (provided by applicant): The identification of mutations causing Mendelian diseases has revolutionized the understanding of diseases of every organ system. While over 3,000 such diseases have been solved at the molecular level, with 21,000 genes in the human genome and about 15% embryonic lethal loci, it is clear that many remain to be discovered. This includes both described and presently undescribed human traits that contribute to both health and disease. With the spectacular 6-log drop in the cost of DNA sequencing over the last 12 years, it has become apparent that selectively sequencing all of the genes in the genome, which comprise only ~1% of the human genome, represents a very cost-effective means for discovering the basis of new Mendelian diseases. We have pioneered the development of the exome sequencing method as well as the tools for analysis, and have shown that both are scalable, with current cost under $1,500 per exome and expected to be under $1,000 in the near future.
We have demonstrated the utility of this approach with the identification of a range of disease genes that were previously intractable due to difficulties in gene mapping owing to high locus heterogeneity, de novo mutations, or small one-of-a-kind families. These considerations motivate new efforts to efficiently solve substantially all Mendelian traits using these technologies. To this end we have established the Yale Center for Mendelian Disorders which will ascertain and acquire samples from patients and families with known or suspected Mendelian diseases, sequence exomes to high coverage sufficient to call 95% of all variants with high specificity and use new analytic approaches we have devised to identify new Mendelian trait genes. We will make all sequences available to the research community as allowed and will establish a Web interface to enable physicians and investigators to submit research samples and retrieve annotated results. These studies will rapidly expand our understanding of the genes and pathways underlying human disease.	1
2013–2018	Galas, David J.; Gerstein, Mark Bender (co-PI); Milosavljevic, Aleksandar	U54	Diac Core @ Baylor College of Medicine. The last decade has seen the discovery of miRNAs in body fluids including plasma and serum. This led to the hypothesis that RNAs play roles as extracellular signaling molecules. It is now clear that these RNAs exist not only in exosomes or other vesicles, but also outside vesicles bound to carrier proteins, such as Argonaute (Ago) proteins. In vesicle-mediated RNA signaling, the vesicles are secreted by a large variety of cells, and contain RNAs (e.g. mRNA, miRNA and other ncRNA) that can be taken up by other cells. Strong evidence shows that they can be functional in other cells. In vesicle-free RNA signaling, exRNAs (e.g. siRNA and miRNA) could also be transported across cell membranes by specific receptors or channels, but evidence is much weaker for this mode. Signaling via RNAs has the potential to play roles as autocrine, paracrine or endocrine signaling, and is therefore of great potential significance in many biological processes.
It is now clear that most body fluids contain miRNA and other ncRNAs. Moreover, these exRNAs are markers for various pathological states, including cancers and toxicity. Although previous studies have observed widespread signaling exRNAs, it is still not fully understood how and why source cells emit RNA, how they are transported, and how the target cells take up and interpret the RNAs. It is also not clear when and why exosomal inclusion is needed, and how the inclusion affects biological function. In order to develop a comprehensive understanding of the complex mechanisms of this intercellular communication, a set of reference maps, an exRNA Atlas, is needed including a map of proteins and other carrier molecules that bind exRNA, as well as profiles of RNA content of exosomes and body fluids. To enable the construction of an exRNA Atlas, we will construct an automated en-exRNA pipeline for creating an exRNA Atlas Database by adapting existing RNA-seq workflows for mRNAs and miRNAs, and develop advanced enexRNA analysis methods and tools for the community and for creating an exRNA Atlas Knowledge Base.	0.88
2013 — 2018	Galas, David J. Gerstein, Mark Bender (co-PI) Milosavljevic, Aleksandar	U54Activity Code Description: To support any part of the full range of research and development from very basic to clinical; may involve ancillary supportive activities such as protracted patient care necessary to the primary research or R&D effort. The spectrum of activities comprises a multidisciplinary attack on a specific disease entity or biomedical problem area. These differ from program project in that they are usually developed in response to an announcement of the programmatic needs of an Institute or Division and subsequently receive continuous attention from its staff. Centers may also serve as regional or national resources for special research purposes, with funding component staff helping to identify appropriate priority needs.	Data Management and Resource Repository For the Exrna Atlas @ Baylor College of Medicine DESCRIPTION (provided by applicant): Extra-cellular RNAs (exRNAs) are emitted into the human bloodstream and other body fluids by different types of cells in the human body and may be taken up by other cells. The exRNAs may also originate from edible plants and from microbes that inhabit the human body. The Extracellular RNA Communication Program (ERCP) will explore this newly discovered mechanism of communication in healthy individuals and in pathological conditions such as cancer. The Data Management Resource and Repository for the exRNA Atlas (DMRR) will integrate the efforts of the ERCP and serve as a community-wide resource for the development of the exRNA Atlas database. DMRR will consist of three components and an Administrative Core. The Data Coordination Component (DCC) will develop data and metadata standards; establish data flow into the exRNA Atlas database; develop tools for download, visualization and analysis of exRNA data; and integrate the exRNA Atlas database with other relevant resources.
The Scientific Outreach Component (SOC) will develop the exRNA Atlas Web Portal to disseminate and provide for visualization of the exRNA Atlas data; ensure accessibility of ERCP-generated resources; and initiate community engagement in exRNA biology using leading biological Wiki sites. In close coordination with the DCC, the SOC will engage the community through knowledge curation jamborees, scientific workshops and symposia. The Data Integration and Analysis Component (DIAC) will provide large-scale integrative and analytic support; evaluate tools and build pipelines to be hosted by DCC and used to populate the exRNA Atlas; build tools to be deployed and distributed by the DCC for use by other consortium participants and the wider scientific community for exRNA data; and lead consortium-wide advanced integrative analyses. Through these coordinated efforts of its DCC, SOC, and DIAC components, the DMRR will help organize the ERCP consortium and open opportunities for rapid progress in the nascent field of exRNA biology.	0.88
2013 — 2018	Galas, David J. Gerstein, Mark Bender (co-PI) Milosavljevic, Aleksandar	U54Activity Code Description: To support any part of the full range of research and development from very basic to clinical; may involve ancillary supportive activities such as protracted patient care necessary to the primary research or R&D effort. The spectrum of activities comprises a multidisciplinary attack on a specific disease entity or biomedical problem area. These differ from program project in that they are usually developed in response to an announcement of the programmatic needs of an Institute or Division and subsequently receive continuous attention from its staff. Centers may also serve as regional or national resources for special research purposes, with funding component staff helping to identify appropriate priority needs.	Soc Core @ Baylor College of Medicine DESCRIPTION (provided by applicant): Advanced genetic and genomic technologies promise to transform our understanding and approach to human health and disease. Such genomic analyses are now common in Western populations of European descent. Studies of host genetic factors underlying long-term non-progressors of HIV infection have led to new therapies through the identification of loci that are important to in vivo control of virus pathogenicity. Similar studies of host genetic factors influencing active TB infection have also identified important loci that could significantly impact the future development of more effective therapeutic and prophylactic strategies. Most of these studies were undertaken in non-African, adult populations, although there are more than 2 million new cases of HIV and HIV-TB in Sub-Saharan Africa every year, including more than half a million in children. 
HIV-infected children - who differ from their adult counterparts in their route of acquisition, clinical course, and pathophysiology - have been conspicuously absent, although they potentially have more to ultimately contribute and gain from therapeutic advances. The Collaborative African Genomics Network (CAfGEN) aims to redress this scientific imbalance by integrating genetic and genomics technologies to probe host factors that are important to the progression of HIV and HIV-TB infection in sub-Saharan African children. The network will incorporate five sites - the Botswana and the Uganda Children's Clinical Centers of Excellence will provide clinical expertise for patient recruitment; Makerere University and the University of Botswana will provide local molecular genetic expertise; and Baylor College of Medicine will provide access to genomics expertise and resources that will ultimately be transitioned to African researchers and institutions in a sustainable manner. The CAfGEN research agenda includes the recruitment of prospective and retrospective cohorts of HIV and HIV-TB infected children; the development of core genomic facilities for sample processing and storage; candidate gene re-sequencing, HLA allelotyping and whole-exome sequencing of patients at the extremes of HIV disease progression; and integrated genomic analyses of active TB progression and associated clinical outcomes using expression quantitative trait loci. These projects will be undertaken through an extensive training and career development plan that will also see significant upgrades in local genomics infrastructure. In so doing, CAfGEN will create a unique, highly synergistic African alliance that can contribute novel and important mechanistic insights to pediatric HIV and HIV-TB disease progression while establishing sustainable genomics technology, expertise, and capacity on the African continent.	0.88
2014 — 2017	Gerstein, Mark Bender (co-PI) Vaccarino, Flora M [⬀] Weissman, Sherman Morton (co-PI) [⬀]	U01Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.	Gene Regulatory Elements and Transcriptome in Ipscs and Embryonic Human Cortex @ Yale University DESCRIPTION (provided by applicant): Drawing data from a variety of cell lines, the ENCODE project found that more than 60% of the human genome is transcribed, and that the majority of these messages are not translated into proteins and are likely to have regulatory functions. Over two thirds of protein non-coding RNAs are novel and specific to a particular cell type and developmental stage. In this project we will (1) provide a genome-wide catalogue of all known and novel transcripts and their regulatory elements in progenitors and early neurons from the embryonic human frontal cerebral cortex; (2) understand whether their expression is recapitulated in vitro during neuronal differentiation from induced pluripotent stem cells (iPSC); (3) establish the brain specificity of novel and known transcripts and their regulatory elements by comparing our dataset with those of the ENCODE project; and (4) identify and catalogue those transcripts and their regulatory elements that are in loci previously implicated in schizophrenia and autism. Gene expression of the embryonic brain is different from that of the postnatal brain. The systematic discovery and analysis of all active genomic elements that we propose here for the mid-gestational embryonic cerebral cortex has not yet been performed, nor is it planned under the ENCODE tier 3 projects. The selected histone marks and transcription factors will identify a large fraction of enhancers/promoters active in any specific cell type. Only some were previously ascertained in the developing brain.
Abnormalities in very early aspects of brain development, and specifically the developing cortex, are likely to underlie the pathogenesis of common neuropsychiatric disorders like schizophrenia and autism. Human genomic variants that have been linked to these disorders often lie in poorly annotated regions of the genome. These variants could play a direct role in disease pathogenesis by modifying the coding regions of novel, non-annotated transcripts and/or modifying transcription factor binding to their promoters/enhancers. Hence, our first priority is to discover, as well as provide a catalogue of such elements. Their functional role in development must then be established. The iPSC model system offers an opportunity to begin answering the question of whether these novel transcripts may have an important biological effect. However, the validity of iPSCs as a true representational model of neurodevelopment needs to be established by performing a direct comparison of their transcripts and epigenetic regulators with those that are active in neural cells in vivo at comparable stages of neuronal development, ideally in the same genetic background. Hence, in this project we will provide the first rigorous validation of the iPSC model by comparing all transcripts and chromatin marks of progenitors and neurons that are derived from iPSC with those that are present in the brain at comparable stages of development. This will allow the future use of iPSCs to elucidate the function of non-coding elements of the genome and their potential relevance to psychiatric disorders.	1
2014 — 2018	Gerstein, Mark Bender Vaccarino, Flora M [⬀]	R01Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.	Genomic Mosaicism in Developing Human Brain @ Yale University DESCRIPTION (provided by applicant): Emerging evidence suggests that not all cells of the human body have identical DNA sequence, a phenomenon called somatic mosaicism. Dividing cells can accumulate single nucleotide variations (SNVs) as well as larger structural variants (SVs), such as copy number variations (CNVs). Our recent studies suggest that somatic mosaicism normally occurs in at least 30% of human skin fibroblasts. The human cerebral cortex displays a very high degree of mitotic expansion during ontogenesis and may be particularly susceptible to accumulating somatic variation during development. Somatic mosaicism could be an adaptive or maladaptive phenomenon, accounting for inter-individual human genetic variability and shaping individual susceptibility and resilience to neuropsychiatric disorders. Yet, the extent of somatic mosaicism in the normal human brain is unknown. In this proposal we will investigate the degree of somatic variation in the developing human brain, using postmortem fetal human tissue. The ideal way to study somatic mosaicism would be to sequence the genome of single cells; however, the extreme degree of amplification that is required creates inevitable artifacts. Our principal approach will be to sequence the genome of clonal cell populations derived from single brain cells, identify genomic variants manifested in each clone, and verify the presence and frequency of these variants in the original brain tissue to verify that it is, indeed, mosaic. Using this comprehensive dataset, we will then evaluate and refine variant calls obtained by whole genome amplification of single brain cells.
In Aim 1, we will construct a map of somatic variations in human brain progenitor cells and estimate their frequency in the developing cerebral cortex and basal ganglia. We will compare the genomes of clonal cell populations and single cells extracted from brain tissue, followed by high resolution analyses to verify their presence and allele frequency in the original brain tissue as well as in the blood. In Aim 2, we will determine the impact of somatic mosaicism on gene expression by assessing whether clone-manifested genomic variants have consequences at the level of gene transcription and/or have effects on biological functions that may confer adaptive advantage to the cells. In Aim 3, we will investigate the most likely biological origin of somatic variants by analyzing sequence features at variation sites, correlating variants with recombination hotspots, CpG islands and histone marks. Together, these specific aims will provide the first comprehensive estimate of the number and allelic frequency of genomic variation in somatic cells of the brain and will yield hypotheses about mechanisms responsible for their creation as well as their significance for brain development.	1
2014 — 2018	Freedman, Jane E Gerstein, Mark Bender (co-PI) Mukamal, Kenneth Jay Odonnell, Christopher J	U01Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.	Racial and Ethnic Diversity in Human Extracellular Rna @ Univ of Massachusetts Med Sch Worcester DESCRIPTION (provided by applicant): Project Summary: RNA may be extracellular (exRNA) and this diverse population of exRNA includes microRNAs, small nucleolar RNAs, piRNA, non-coding, and environmentally-derived RNAs. In humans, exRNAs are found in various body fluids, including plasma and urine. Specific exRNAs regulate key processes central to normal homeostasis and the pathogenesis of disease. Although a rapidly growing number of reports demonstrate that select exRNAs may reflect or regulate disease, a standard referent group of exRNAs has yet to be generated. In this proposal, we postulate that, in healthy adults, circulating plasma and urine levels of exRNAs (i) are associated with sex, race and ethnicity; (ii) are associated with cellular gene expression; (iii) may vary with age; and (iv) are associated with genomic variability. The primary goal of this RFA is the generation of exRNA profiles in healthy individuals. These profiles will both define populations and be used as a reference to facilitate disease diagnosis and discovery. To accomplish this goal, several criteria are paramount. Specifically, samples must (i) come from a comprehensively characterized cohort that can accurately establish absence of disease, (ii) be racially and ethnically referent to the US population, and (iii) have genomic and phenotypic data available. Thus, we will utilize two community-based cohort studies representative of the U.S. population, the Framingham Heart Study (FHS) and the Multi-Ethnic Study of Atherosclerosis (MESA).
exRNA will be isolated from plasma and urine from 800 study participants using an optimized non-commercial isolation method for high-yield plasma RNA extraction. We will conduct high-throughput sequencing on all 1600 samples to identify known and as-yet undiscovered circulating exRNAs. There will be a formal performance, communication, and data sharing plan and all data will be made publicly available in collaboration with the ExRNA Communication Program (ERCP) Data Management & Resource Repository (DMRR). Throughout this proposal, the studies will use state-of-the-art extraction techniques, new technologies, and well-defined, comprehensively characterized observational cohorts to identify exRNA from plasma and urine and determine their patterns of expression in healthy adults.	0.895
2016 — 2020	Gerstein, Mark Bender (co-PI) Gunel, Murat Lifton, Richard P [⬀] Mane, Shrikant M	UM1Activity Code Description: To support cooperative agreements involving large-scale research activities with complicated structures that cannot be appropriately categorized into an available single component activity code, e.g. clinical networks, research programs or consortium. The components represent a variety of supporting functions and are not independent of each component. Substantial federal programmatic staff involvement is intended to assist investigators during performance of the research activities, as defined in the terms and conditions of the award. The performance period may extend up to seven years but only through the established deviation request process. ICs desiring to use this activity code for programs greater than 5 years must receive OPERA prior approval through the deviation request process.	Yale Center For Mendelian Genomics @ Yale University DESCRIPTION (provided by applicant): This is a renewal application for the Yale Center for Mendelian Genomics. The biology linking Mendelian mutations to traits has transformed our understanding of every organ system, identifying therapeutic targets, and allowing preclinical diagnosis and mitigation of disease risk. We know the consequence of mutation of fewer than 3,000 genes. With ~19,000 protein-coding genes, the vast majority of which are conserved across phylogeny, even allowing for 30% lethality, there are doubtless thousands of Mendelian loci awaiting discovery. The full utility of clinical sequencing will not be realized without better understanding of the consequence of mutation of every gene. The advent of robust exome and genome sequencing allows unprecedented opportunity for discovery of new Mendelian trait loci.
In the current cycle, by sequencing more than 7000 exomes from investigators world-wide, we have identified 180 new Mendelian trait loci with high confidence, 35 phenotypic expansions, and hundreds more that are likely new trait loci across a range of traits and genetic mechanisms, including de novo mutations, incomplete penetrance, and complex rare recessive traits. Several new loci have immediate therapeutic implications. These results underscore that many new trait loci remain to be described and solved, motivating efforts to complete the human 'knock out' map. We now propose, by building upon the current studies and through reduction in high quality exome cost to $330, to identify at least another 500 trait loci via the sequencing of more than 20,000 samples, advancing the understanding of genomes, health and disease.	1
2016 — 2018	Gerstein, Mark Bender	U01Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.	Methods and Software to Enhance Genomic Privacy and Sharing of Rna-Seq Data @ Yale University Abstract: Privacy is receiving much attention with the unprecedented increase in the breadth and depth of biomedical datasets, particularly personal genomics datasets. Most studies on genomic privacy are focused on protection of variants in personal genomes. Molecular phenotype datasets, however, can also contain a substantial amount of sensitive information. Although there is no explicit genotypic information in them, subtle genotype-phenotype correlations can be used to statistically link the phenotype and genotype datasets. We will study the methodologies for analysis of sensitive information leakage from phenotype datasets. We will focus on the RNA-seq datasets and the associated sources of sensitive information leakage. These leakages are mediated by the expression quantitative trait loci. We will approach the privacy analysis under 3 aims. We will first aim at proposing statistical metrics that can be used for quantification of the sensitive information leakage from phenotype datasets. These quantifications can be used to evaluate the risks of privacy breaches. In the second aim, we will focus on systematic analysis of how linking attacks can be instantiated and analyzed. We will study how one can generalize linking attacks so that privacy researchers can study the risks associated with these attacks more systematically. We will then evaluate different models of genotype prediction and assess how these can be used in linking attacks. We will focus, specifically, on the outlier gene expression levels and evaluate how the outliers can be used for genotype prediction and in the linking attacks.
In the third aim, we will develop tools that implement the quantification, risk estimation, and risk management methodologies and integrate these in a coherent software suite for a comprehensive privacy analysis, which enables protecting RNA-seq datasets at different levels of summarization of the datasets, e.g., reads, gene and transcript quantifications. We will aim at increasing the number of software tools for genomic privacy analysis. We will study different algorithmic approaches to tackle the high computational complexity of anonymization techniques in the literature. We will study sources of sensitive information leakage other than gene expression levels, e.g. splicing and non-coding transcription. These sources of information will be studied in the context of risk quantification and management strategies presented in the previous aims. We will finally use the tools to quantify the sensitive information in the publicly available datasets from large sequencing projects, for example ENCODE, 1000 Genomes, TCGA, GEUVADIS, and GTEx.	1
2016 — 2018	Gerstein, Mark Bender	R01Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.	Prioritizing Rare Variants Associated With Cancer Using Non-Coding Annotation @ Yale University DESCRIPTION (provided by applicant): We will investigate potential disease-associated genetic variants in the non-coding regions of the human genome. Recent work in the ENCODE project and in population-scale RNA sequencing has contributed significantly to our knowledge of non-coding elements. Thus, given the focus on coding variation in many previous disease studies, there is much untapped potential in exploring the non-coding variation associated with disease. We plan to prioritize rare, germline non-coding variants for connection to disease, using a generalized framework that we will tune specifically to Prostate Cancer as a test case. Our approach will build upon our existing tool, FunSeq, which prioritizes rare somatic variants in cancer, to create eleVAR - elevating germline VARiants. FunSeq was developed to prioritize somatic variants in regions of the genome depleted of common variants in the general population, based on data from the 1000 Genomes project. eleVAR will use this general principle to analyze germline variations, and build upon it by adding several key features, including: (i) prioritizing variants leading to gain of new transcription-factor (TF) binding sites (in addition to disruption of existing sites), (ii) annotating variants in enhancers and connecting them to target genes, (iii) prioritizing variants highly connected in a variety of biological networks, (iv) annotating variants in non-coding RNAs similarly to those in TF binding sites, and (v) prioritizing variants associated with variable, allele-specific activity.
Our second objective is to use eleVAR to prioritize variants in whole genome sequences from the TCGA/ICGC consortium. Our efficient implementation of eleVAR will include a module for updating parameters in response to high throughput experimental data. We will progressively tune and evaluate eleVAR, first using publicly available data, and then using multiple rounds of high throughput experimental characterization of variants occurring specifically in prostate cancer. Our last objective is to functionally validate a subset of variants in detail. First, we will identify variants in the 6 representative eleVAR positives and look at their frequency of occurrence in a large prostate cancer cohort using targeted re-sequencing. We will use the CRISPR/Cas system to generate endogenous mutations, determining their effects on target gene expression, cell morphology and tumorigenicity, and TF binding by EMSA and chromatin immunoprecipitation.	1
2017 — 2020	Gerstein, Mark	N/AActivity Code Description: No activity code was retrieved: click on the grant title for more information	Collaborative Proposal: Abi Innovation: A Graph Based Approach For the Genome Wide Prediction of Conditionally Essential Genes @ Yale University How does one identify, and characterize at the genome scale, the set of genes that is essential for an organism to grow and thrive under particular conditions? Predicting such sets of genes is a fundamental goal in bioinformatics; this project aims to create methods and tools for making accurate lists of such functional genes. The approach combines phenotype prediction with knowledge about the functional biological networks in cells to infer new knowledge. The network analysis methods developed here can be easily transferred and applied to a large variety of datasets to answer a wide range of questions from inferring gene-phenotype associations to detecting communities on social networks, extensions highly relevant to the network science community. Moreover, the project's state-of-the-art analysis of temporal gene expression data using state-space models and dimensionality reduction techniques is universally applicable to any group of genes - e.g. tissue specific vs universally expressed genes. In addition to advancing functional genomics knowledge in the study organism, yeast, the tools will have an impact on research in fields like personal genomics research, by providing a large-scale system-level identification and molecular characterization of phenotypes. Finally, this project provides new and innovative tools for education in bioinformatics. In more technical terms, this project's major goal is to develop new mathematical models and methods that, given a set of genes or an entire genome, can infer their phenotypes and suggest whether or not these genes are necessary for the organism's survival. Specifically, information will be integrated on two levels: phenotypic and molecular.
At the phenotypic level the structure of biological networks will be used to assign phenotypic attributes to genes and identify sets of genes that share similar essential phenotypes. At the molecular level, the resulting phenotype predictions will be refined by identifying groups of essential genes governed by similar activity patterns. The integration of the information on these two levels will result in a comprehensive gene-phenotype characterization and a refined group of conditionally essential genes. The resulting predictions will be validated experimentally in two yeast systems. All the tools and datasets associated with this project will be made freely available through genopheno.gersteinlab.org.	0.915
2020 — 2021	Gerstein, Mark Bender	R01Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.	Enhancing Open Data Sharing For Functional Genomics Experiments: Measures to Quantify Genomic Information Leakage and File Formats For Privacy Preservation @ Yale University Project Summary/Abstract: With the surge of large genomics data, there is an immense increase in the breadth and depth of different omics datasets and an increasing importance in the topic of privacy of individuals in genomic data science. Detailed genetic and environmental characterization of diseases and conditions relies on the large-scale mining of functional genomics data; hence, there is great desire to share data as broadly as possible. However, there is a scarcity of privacy studies focused on such data. A key first step in reducing private information leakage is to measure the amount of information leakage in functional genomics data, particularly in different data file types. To this end, we propose to derive information-theoretic measures for private information leakage in different data types from functional genomics data. We will also develop various file formats to reduce this leakage during sharing. We will approach the privacy analysis under three aims. First, we will develop statistical metrics that can be used to quantify the sensitive information leakage from raw reads. We will systematically analyze how linking attacks can be instantiated using various genotyping methods such as single nucleotide variant and structural variant calling from raw reads, signal profiles, Hi-C interaction matrices, and gene expression matrices. Second, we will study different algorithms to implement privacy-preserving transformations to the functional genomics data in various forms.
Particularly, we will create privacy-preserving file formats for raw sequence alignment maps, signal track files, three-dimensional interaction matrices, and gene expression quantification matrices that contain information from multiple individuals. This will allow us to study the sources of sensitive information leakages other than raw reads, for example signal profiles, splicing and isoform transcription, and abnormal three-dimensional genomic interactions. Third, we will investigate the reads that can be mapped to the microbiome in the raw human functional genomics datasets. We will use inferred microbial information to characterize private information about individuals, and then combine the microbial information with the information from human mapped reads to increase the re-identification accuracy in the linking attacks described in the second aim. We will use the tools to quantify the sensitive information and privacy-preserving file formats in the available datasets from large sequencing projects, such as the ENCODE, The Cancer Genome Atlas, 1,000 Genomes, GEUVADIS, and Genotype-Tissue Expression projects.	1
2020 — 2021	Gerstein, Mark Bender Kluger, Yuval (co-PI) [⬀] Spudich, Serena S [⬀]	UM1Activity Code Description: To support cooperative agreements involving large-scale research activities with complicated structures that cannot be appropriately categorized into an available single component activity code, e.g. clinical networks, research programs or consortium. The components represent a variety of supporting functions and are not independent of each component. Substantial federal programmatic staff involvement is intended to assist investigators during performance of the research activities, as defined in the terms and conditions of the award. The performance period may extend up to seven years but only through the established deviation request process. ICs desiring to use this activity code for programs greater than 5 years must receive OPERA prior approval through the deviation request process.	The Y-Scorch Data Generation Center At Yale For Single-Cell Opioid Responses in the Context of Hiv @ Yale University Abstract: Opioid use disorder (OUD) and HIV infection are syndemic conditions that independently and synergistically lead to central nervous system (CNS) dysfunction in tens of millions of people globally. However, the cellular circuits altered by OUD and HIV, and their combination, remain elusive. Further, the identities of cell types within the brain that can harbor HIV infection remain controversial. To address this key vexing question of HIV location within the brain and the effects of HIV and OUD on the brain, comprehensive tissue characterization at the single-cell level is needed to identify novel rare cell types, enriched or depleted cellular populations, and cellular circuits tied to pathogenesis. We propose to employ state-of-the-art methodologies in a center at Yale devoted to generating data on Single Cell Opioid Responses in the Context of HIV Discovery (SCORCH), Y-SCORCH.
The center assembles a team of investigators at Yale with leading expertise in neurogenomics, HIV biology, neuroscience of addiction, single-cell analytics and consortium science, and a record of existing collaborations. Our TISSUE Component includes plans to sample 20 brains from four donor groups: controls, HIV (HIV+), HIV with OUD (HIV+OUD+), and OUD without HIV (OUD+). For each brain we will study 4 regions (prefrontal cortex, ventral striatum, insular cortex, and amygdala), representing disease-relevant areas for OUD and HIV. Our ASSAY Component will carry out single nucleus RNA sequencing (snRNA-seq) for 5,000-20,000 cells/sample, single-nucleus ATAC-seq (scATAC-seq), and spatial transcriptomics to generate transcriptomic, epigenetic, and spatial atlases for each donor type and each region. In parallel, we will detect HIV transcripts in the HIV+ groups. Our DATA Deposition & Analysis Component will assemble data standards, facilitate dissemination, and integrate our data with existing brain atlases. It will develop pipelines for high-throughput data analysis, including for single nucleus transcriptomes, detection of HIV transcripts and for scATAC-seq data, and develop innovative analysis methods. Our Prioritization & Functional VALIDATION Component proposes a process of identification of brain regions and donor types that best differentiate disease states, and describes experiments to validate the findings generated by transcriptomic and epigenetic analyses. Finally, our Research MANAGEMENT Component provides a framework for data sharing with the SCORCH Data Center and broader Consortium and ensures timely progress in achieving our milestones, which include generating 'omics data from 640 assays. 
Our Specific Aims are to: (1) Establish a workflow from tissue samples to single-cell data deposited in the data center, (2) Run the combined experimental and computational workflows on procured specimens, and (3) Follow up on large-scale data production with validation and further analysis. Y-SCORCH has established expertise in all approaches necessary to successfully create single-nucleus transcriptomic data to provide a scaffold for future discovery to inform pathophysiological understanding of CNS effects of OUD and HIV.	1
2020 — 2021	Gerstein, Mark Bender	R01 Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.	A Big Data Approach to Identify Epigenetic, Transcriptomic, and Network Dynamics as Immune Dysfunction Drivers Associated With HIV Infection and Substance Use Disorder @ Yale University The opioid crisis was declared a public health emergency in 2017. It has led to an increased incidence of opioid overdose, injection substance use, and, eventually, HIV transmission. More than 171,000 people in the United States are living with HIV as a result of substance use disorder (SUD). Although both HIV and SUD are known to significantly disturb innate and adaptive immunity, their underlying molecular mechanisms and their interplay in immune dysfunction remain unexplored. Comprehensive functional characterization at a single-cell resolution is essential to provide new molecular insights and discover therapeutic targets. Recent advances in novel sequencing technologies and community efforts to share genomic data provide unprecedented opportunities to understand the molecular dynamics of immune dysfunction upon HIV infection and SUD. This application describes the development of integrative strategies and machine learning methods to combine novel assays (such as STARR-seq) with high-dimensional, multi-scale genomic profiles to elucidate the transcriptional, epigenetic, and network alterations and to identify key immune dysfunction drivers associated with HIV and SUD. 
Specifically, we will (1) integrate novel functional genomics assays with single-cell multi-omics data to construct cell-type-specific multi-modal gene regulatory networks (GRNs) in healthy individuals, (2) build a comprehensive immune profiling data hub for HIV/SUD-affected individuals and construct disease- and cell-type-specific GRNs, and (3) uncover how key network changes and aberrant behaviors of transcription factors (TFs) upon HIV infection and/or SUD can lead to immune dysfunction. Distinct from existing efforts focusing on transcriptome analyses, this proposed work presents a genuinely novel big-data approach for both modeling gene regulation and investigating disease-risk factors by incorporating heterogeneous multi-omics profiles at a single-cell resolution. The resultant comprehensive list of cis-regulatory elements at a single-cell resolution will expand the number of known functional regions. The constructed immune cell atlas, GRNs, and identified key drivers of immune dysfunction will be accessible to the public via web services and annotation databases. Our integrative computational methods will be distributed as open-source programs. Altogether, our released resource will accelerate research in the broader scientific community by providing essential tools to investigate immune function, which will benefit other investigators exploring the genetic underpinnings of immune system function in HIV and/or SUD.	1
2021	Cherry, Joe Michael; Gerstein, Mark Bender	U24 Activity Code Description: To support research projects contributing to improvement of the capability of resources to serve biomedical research.	A Data and Administrative Coordinating Center For the Impact of Genomic Variation On Function Consortium @ Stanford University The goals of the IGVF Data Administrative and Coordinating Center (DACC) are to support the IGVF Consortium by defining and establishing a strategy that connects all participants to the project's science. By creating avenues of access that distribute these data to the greater biological research community, the DACC provides a critical connection between scientific producers and consumers. The IGVF Consortium brings together laboratories that generate complex data types via novel experimental assays, often focusing on the single-cell level of gene expression. This work is extended and regularized by laboratories that integrate these unique data using computational analyses to discover the associations and networks between human variation, chromosomal elements and molecular phenotypes for the purpose of elucidating their complex relationship in human cells and tissues. The DACC's participation enhances the data created by the consortium by creating structured procedures for the verification and validation of all submitted data and by providing processes for the documentation of metadata that describe each biological sample and assay method. To facilitate access to all the data created, the DACC will construct a state-of-the-art data warehouse, design and develop robust software to enable data submission, and harden unified data processing pipelines. All experimental and computational results will be made available via the IGVF Portal, developed by the DACC. The Portal will integrate these data resources and provide enhanced search and browsing capabilities, along with powerful web services. 
The DACC will develop tools for semantically-enhanced graph-based searches of experiment metadata, individual genomic elements, variation and phenotype, and will implement methods to distribute these results in matrices suitable for machine learning. Beyond computational infrastructure to house and distribute consortium data, the DACC will also function as the administrative hub of the IGVF. Consortium science thrives on clear and forthright communication between its component parts, and it is the DACC's responsibility to manage this relationship. This effort will be facilitated by management of consortium working groups, organization of scientific results and publications, and regular reporting and feedback to the Steering Committee. To fully support the community, the DACC will act as a service organization, allowing biomedical research to take full advantage of the results from the IGVF. To this end, the DACC will organize and host consortium-focused and user-focused meetings, and will provide documentation via many media including written documentation, video tutorials, webinars, and meeting presentations. The various component projects of the IGVF (DACC, mapping, systematic characterization, genetic network regulation, modeling of genomic variation centers and groups) will be tightly woven together to create the IGVF Consortium.	0.97