2019 — 2021 |
Howe, Kevin; Schedl, Tim (co-PI); Stein, Lincoln D.; Sternberg, Paul Warren |
U24 Activity Code Description: To support research projects contributing to improvement of the capability of resources to serve biomedical research. |
Wormbase: a Core Data Resource For C. Elegans and Other Nematodes @ California Institute of Technology
Project Summary WormBase is the major publicly available database of information related to Caenorhabditis elegans, an important organism for basic biomedical research, and other nematodes of medical and agricultural significance. Although WormBase is a crucial daily resource for members of the C. elegans research field, its users extend to the larger parasitology, biomedical, and bioinformatics research communities. WormBase acts as a central forum through which every research group can contribute to the global effort to comprehend nematode genomes and biology. Most users access WormBase via the Internet (www.wormbase.org); some install the database locally. WormBase offers extensive coverage of C. elegans core genomic, genetic, anatomical and functional information, allowing the biomedical community to fully utilize the results of intensive molecular genetic analyses and functional genomic studies of this organism in the study of human disease. These data include all available nematode genomic data (such as genome sequence, transcripts and cis-regulatory sites prioritized by species), large-scale functional genomic datasets, the function and interactions of genes and gene products as they relate to development, physiology and behavior, and biological reagents and their source information. WormBase comprises a set of databases storing a wide range of biological information; a website that allows users to access stored information and precomputed analyses based on these data; and tools for programmatic access such as an application programming interface, a data mining platform, and bulk downloads. Curation activities include extraction and integration of information from the literature (assisted by the use of information retrieval tools), incorporation of large-scale datasets from a range of research projects, and gene model verification from experimental data.
We will curate many nematode genome sequences, along with their annotations and core genetic information, as well as data on gene function, pathways and transcriptional regulatory networks for C. elegans and select other species. We will expand tools available for data mining, workflow management, visualization, and community annotation, and integrate, store and distribute data in a maintainable, interoperable and scalable system. The project team involves three sites: Caltech primarily curates functional information and develops ontologies; EBI carries out sequence-based curation and builds databases for public release; and OICR develops and supports the web presence and visualization. The three sites work closely together and share tasks to ensure timely incorporation, storage and display of information, as well as user outreach and education.
|
0.916 |
2020 — 2021 |
Flicek, Paul (co-PI); Haussler, David H (co-PI); Howe, Kevin; Paten, Benedict |
R01 Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies. |
Enabling Comparative Pangenomics @ University of California Santa Cruz
Project Summary: Enabling Comparative Pangenomics To many in the field, it is clear that we are moving rapidly toward a golden age of vertebrate comparative genomics in which thousands of high quality genomes of different species are publicly available and used in understanding the human genome. Despite the opportunity presented by the growth in available genomes, there has been relative stagnation in the software used to compare complete genomes, most of the software developed being old and limited in capabilities. To remedy this situation, we will create a hardened toolkit for genome comparison and annotation that can be robustly applied to thousands of vertebrate genomes. To demonstrate this toolkit and deliver its results to the broader genomics community, we will apply it to create a resource within the existing UCSC and Ensembl Genome Browsers that will incorporate thousands of vertebrate genomes. Large, well organized consortia have coalesced to take on the challenge of sequencing and assembling vertebrate genomes. Our alignments will form a backbone of these projects' analysis, and our synthesis of their data will create a resource that is much greater than the sum of what might otherwise be a series of smaller, fragmented and not directly comparable efforts. We will gather together more than 600 vertebrate genomes into our proposed resource in the first year of the proposal, rapidly delivering results. Paralleling the growth in available reference genomes, the last decade has been marked by an explosion in population sequencing projects. Although much of the cataloged human variation has a very recent evolutionary origin, there is a tremendous opportunity to combine and so better understand intra- and inter-species change using models from population genetics. We will create pangenome software to (i) avoid reference bias in species comparisons (i.e., avoiding assumptions about which alleles are fixed when comparing between species, which is important in quasi-species such as cichlids), (ii) allow ancestral alleles to be comprehensively estimated, including those that are part of structural variation, and (iii) more easily enable the study of balancing selection. To demonstrate the utility of comprehensive variation integration we will create a prototype of a pan-genome for the apes. We will use this graph to identify ancestral alleles and to dynamically convert annotations between species and assembly versions, and, via population mapping experiments, we will demonstrate its power for typing segregating but ancient variation. Using knowledge of ape evolution, we will ultimately extend this graph to adequately model the most complex regions of the human genome.
|
0.952 |