2013 — 2018 |
Szalay, Alexander [⬀] Salzberg, Steven Meneveau, Charles (co-PI) [⬀] Thakar, Aniruddha Burns, Randal Rippin, Michael |
N/A |
CIF21 DIBBs: Long Term Access to Large Scientific Data Sets: The SkyServer and Beyond @ Johns Hopkins University
The Project aims to create a sustainable collaborative ecosystem built around several large scientific data sets for the broader science community. Based upon the expertise developed for the Sloan Digital Sky Survey (SDSS) SkyServer and the associated projects, the Project will formalize the main system components and reengineer them to be much more reusable.
The Project will take full ownership of the Sloan Digital Sky Survey archive and will provide a robust environment for its continued operations, using an economy of scale enabled by common, shared building blocks derived from the existing SDSS SkyServer framework, based upon a large, scalable database system.
Using these building blocks, the team will build and operate open data archives from large observations and numerical simulations, including computational fluid dynamics, ocean circulation and astrophysics, reaching PB scales. The Project will further extend the tools to the life sciences, such as large-scale, next-generation genome sequencing experiments, as well as high-throughput neuroscience imaging data. The resulting distributed, parallel database framework will be linked to small, user-created data sets that can also be used collaboratively, in conjunction with each other and the large data collections.
The Project will work with selected communities to help deploy and serve data using our building blocks, demonstrating portability, generality and economies of scale; will help and encourage other institutions and communities to use the tools, while seeking collaborations that result in disruptive changes; and will build tools that shorten the time needed to deploy new services and applications and to rapidly test new ideas.
The Project will enable individual users to bring their "small data" and analyze it collaboratively in the context of the large data. Our particular goals are:
(i) Take full ownership of the SDSS Archive (database and flat files) and ensure a scalable and robust environment for its continued operation;
(ii) Build upon our decade-long effort on SDSS and its ad-hoc spinoffs, through reengineering its components into portable and general building blocks;
(iii) Systematically address curation issues arising from using a service-oriented architecture (SOA), and the resulting service life-cycle;
(iv) Work with projects from additional scientific domains to help deploy and serve data using our building blocks, demonstrating portability, generality and economies of scale;
(v) Develop scalable extensions to our database cluster in order to deal with large numerical simulations scaling up to petabytes, and turn them into open numerical laboratories;
(vi) Use our CasJobs Collaborative Environment to address the problem of small but complex data in the "Long Tail" of science.
|
1 |
2018 |
Salzberg, Steven L |
R01 (Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.) |
Computational Methods For Genome Assembly, Transcript Assembly, and Variant Discovery @ Johns Hopkins University
DESCRIPTION (provided by applicant): Improvements in sequencing technology have spurred a tremendous increase in the use of sequencing to answer a wide range of questions in biology and medicine. Numerous DNA sequencing projects are being launched for species whose genomes have not yet been sequenced. Sequencing of messenger RNA has led to an explosion of RNA-seq projects to characterize gene expression in multiple cell types and conditions, and simultaneously to discover new genes and new splice variants of known genes. These sequencing-based studies generate enormous amounts of data, which in turn require sophisticated, efficient, and innovative new algorithms to assemble these genomes and identify their gene content. We propose to develop new computational methods for three specific problems: first, we will develop new assembly algorithms, building on existing methods wherever possible, to assemble genomes from reads generated by the latest sequencing technologies including emerging single molecule technology. In parallel, we will continue to improve our existing assemblers, extending them to handle new and diverse data types, and to evaluate multiple other assembly systems to determine what methods work best for different WGS projects. We will also continue to collaborate with outside groups to help them assemble particularly challenging genomes. Second, we will develop new methods for discovering sequence variants, using a combination of alignment and assembly-based algorithms. These include a new method that finds variants without using alignment to the reference genome, dramatically reducing false positive rates. The method uses very fast alignment algorithms to achieve significant gains in computational speed. We propose another method that uses localized assembly to detect insertions and deletions, one of the weaknesses of most current methods.
Third, one of the most exciting technology developments in genome analysis of the past five years is RNA-seq, a protocol for sequencing the RNA in a cell. Our group has previously developed two widely used tools for RNA-seq analysis, the TopHat spliced aligner and the Cufflinks transcript assembler, which were the first to be able to discover previously unknown splice sites and isoforms. Here we propose a novel transcript assembly algorithm, StringTie, which uses a network flow algorithm, a method imported from mathematical optimization theory, combined with de novo assembly to assemble and quantitate transcripts. StringTie is the first transcript assembler to use both assembly and reference-based alignment together. One key advantage of StringTie's algorithm is that it assembles and quantifies gene transcripts simultaneously. As compared to Cufflinks and all other competing methods, StringTie produces more complete reconstructions of genes and splice variants, and more accurate estimates of expression levels on both real and simulated data.
|
0.958 |
2019 — 2021 |
Salzberg, Steven L |
R35 (Activity Code Description: To provide long term support to an experienced investigator with an outstanding record of research productivity. This support is intended to encourage investigators to embark on long-term projects of unusual potential.) |
Computational Methods For Microbial and Microbiome Sequence Analysis @ Johns Hopkins University
Project Summary This project will support our work on computational methods for microbial sequence analysis, including gene finding, whole-genome alignment, genome assembly, and metagenomic sequence analysis. Over the years we have developed multiple systems to solve problems in these areas, some of which are very widely used. These tools need continued updates and improvements to keep pace with changes in sequencing technology, changes in experimental design, and the ever-growing number of sequenced genomes. One of these systems is Glimmer, a computational method for finding genes in bacteria, viruses, archaea, and simple eukaryotes. Glimmer is highly accurate, finding over 99% of the genes in most prokaryotic genomes. It has been used by thousands of scientists around the world and in the majority of published bacterial genome sequencing projects over the past decade. Collectively the three main publications describing Glimmer have been cited over 4,700 times, including >700 citations in 2016-17 alone. Usage of Glimmer has increased in recent years due to the explosion in next-generation sequencing projects, which are particularly cost-effective for bacterial genomes. A second system, MUMmer, is an efficient whole-genome aligner that is used to compare genomes to one another and to compare genome assemblies to detect changes, both large and small. MUMmer and its components, especially Nucmer, have been widely used and incorporated in other systems, including multi-genome aligners and several genome assembly packages. The three main publications describing MUMmer have been cited over 3,600 times, including >750 citations in 2016-17. In recent years we have focused our efforts on developing methods for the analysis of metagenomics data, producing several newer tools, including Kraken and Centrifuge. Both of these systems attempt to assign a species identifier to every read in a metagenomics data set.
Because the Kraken algorithm is not only accurate but far faster than earlier methods, it was rapidly adopted by many labs soon after its release, and its usage continues to grow. The even newer and more space-efficient Centrifuge system has also been highly successful and was recently incorporated into the analysis package of one of the new third-generation sequencing companies. We continue to work on improving the performance of both algorithms, and this project will allow us to extend them to handle the newest long-read data that is increasingly being used for metagenomics experiments. Finally, a new direction of the lab is the use of metagenomic shotgun sequencing to diagnose infections, for which we are not only modifying our algorithms, but also building customized genome databases where we rigorously screen the genomes to identify and remove contaminants and low-complexity sequences that create false positives. As we have done for many years, we will release all of the software and data generated by this project for free under an open source license, allowing other scientists to use, modify, and redistribute them without restrictions of any kind.
|
0.958 |
2020 — 2021 |
Salzberg, Steven L |
R01 |
Computational Methods For Genome Assembly, Transcript Assembly, and Gene Discovery @ Johns Hopkins University
Project Summary Improvements in sequencing technology have spurred a tremendous increase in the use of sequencing to answer a wide range of questions in biology and medicine. Thousands of new human genomes are being sequenced each year in efforts to track down the genetic causes of human diseases. In parallel with this increase in whole-genome sequencing, RNA sequencing has also exploded in popularity, due to its power to characterize gene expression in a multitude of cell types and conditions, and to its potential to discover new genes and new splice variants. These enormous data sets require highly efficient and accurate computational methods for analysis, and they also present opportunities for discovery. Furthermore, to properly analyze the many diverse humans being sequenced, we can no longer afford to rely on a single reference genome that is missing much of the variation found in the human population, and that makes it very difficult to analyze sequences that do not match the reference. We propose to address these challenges in four specific ways: first, we will develop new and improved assembly algorithms that take advantage of the latest long-read technology to create genomes of unprecedented contiguity and completeness. This effort will include a method for creating haplotype-resolved assemblies when sequences from both parents are available, and a method to use an existing reference genome to create a highly contiguous assembly at minimal cost. Second, we will apply these methods to build new human reference genomes, assembled and annotated as thoroughly as the current human reference. These genomes, each representing a single individual, can then serve as the basis for many future studies of the relevant populations. Third, in the area of RNA-seq analysis our lab has previously developed two widely-used spliced aligners, TopHat and HISAT, and two equally popular transcriptome assemblers, Cufflinks and StringTie, which now have many thousands of users.
We will extend and improve the StringTie algorithm, augmenting its novel network flow algorithm with de novo assembly plus new alignment methods to handle long reads and to improve its construction and quantification of transcripts. Fourth, we propose to systematically assemble thousands of RNA-seq experiments to discover new genes and to re-build the human gene catalog, an effort that could have a major impact on a broad array of human genetic and genomic studies. We have recently released our first version of this effort as CHESS, a human gene catalog built from a massive RNA-seq database that represents a comprehensive, reproducible, and open method for annotating the human genome. The CHESS database already agrees more closely with each of the two most widely-used human gene databases than those two agree with each other, and we will improve it further so that it can provide a basis for biomedical research for many years to come.
|
0.958 |
2021 |
Salzberg, Steven L |
R01 |
Comprehensive Human Expressed Sequences in Brain (CHESS-BRAIN) and Their Roles in Neuropsychiatric Illness @ Johns Hopkins University
Project Summary The widespread use of RNA sequencing technology over the past decade has allowed scientists to discover a far larger and richer repertoire of genes and transcripts encoded by the human genome than were known just a decade ago. At least 90% of human genes have multiple isoforms, including splicing variants, alternative sites of transcription initiation and termination, exon skipping events, and more. The number of human transcripts in standard gene databases has grown enormously, from ~40,000 in the late 2000s to over 200,000 today, but it is still likely far from complete. Our previous work using exon-exon splice junctions and other fragmentary transcripts has demonstrated the clinical relevance of unannotated but expressed genes in the human brain, including associations with schizophrenia and its genetic risk. This project will attempt to discover and characterize novel gene isoforms collected from both healthy and diseased brains, using the latest computational methods for transcriptome assembly and an extensive collection of brain RNA-seq datasets. The project is organized into three aims: first, we will develop new algorithms designed to assemble RNA-seq data from samples that have been sequenced using ribosomal RNA depletion, a technique that is widely used in human brain studies but that is not used in most other RNA-seq experiments, which instead use polyA+ enrichment. We will implement these methods as extensions to the HISAT and StringTie systems for RNA-seq alignment and assembly, both of which were developed in the PI's and co-PI's labs. We will then apply these improved methods to thousands of publicly available RNA-seq samples from human brain tissue to create a new CHESS-BRAIN (Comprehensive Human Expressed Sequences in Brain) gene annotation database. 
This effort will also determine which transcripts are tissue-specific and brain-region specific; i.e., expressed at significantly higher or lower levels in brain tissues and in various brain regions as compared to other tissues. In the second aim, we will use these methods to quantify gene expression levels in hundreds of post-mortem brain RNA-seq samples from subjects diagnosed with schizophrenia (SCZD), major depression (MDD), bipolar disorder (BPD), autism spectrum disorder (ASD), and post-traumatic stress disorder (PTSD), whom we will compare to matched controls to identify the contribution of unannotated transcription in these disorders. In our third aim we will perform expression quantitative trait loci (eQTL) mapping across the entire CHESS-BRAIN dataset, both within and across brain regions and diagnoses, to identify genetic regulation of unannotated transcripts, including both coding and noncoding transcripts. This analysis will identify genes and transcripts whose expression levels change significantly in different tissues and diseases. We will combine these results to identify novel transcripts associated with genetic risk for each of the psychiatric disorders.
|
0.958 |