2009 — 2012 |
Zhang, Nancy |
N/A |
New Change-Point Problems in Genomic Profiling
DNA copy number data, which measures gains and losses of segments of genomes, is an important data type for understanding genetic variation and for clinical research. The analysis of DNA copy number data motivates new statistical problems, especially in the areas of change-point detection and high-dimensional data analysis. This proposal identifies these problems, formulates statistical models, and proposes methods for their solution. The topics covered include model selection for irregular high-dimensional models, simultaneous change-point detection in a large number of aligned sequences, and segmentation of partially observed sequences. These developments in statistical methodology are a direct response to the current analysis needs at the Stanford Genome Technology Center and in the Cancer Genome Atlas Project, and open source software will be made available to these and broader communities.
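As an illustration of the kind of change-point problem described above, the following is a minimal sketch of locating a single mean shift in one copy number profile. It assumes independent Gaussian noise with unit variance, and the function name `best_single_changepoint` is hypothetical; it is a textbook-style scan over candidate split points, not the method proposed in this grant.

```python
import math
from itertools import accumulate


def best_single_changepoint(y):
    """Scan all split points of y and return (t, stat).

    t is the index such that y[:t] and y[t:] differ most in mean, and stat is
    the absolute standardized difference of the two segment means, assuming
    (for illustration only) unit-variance noise.
    """
    y = [float(v) for v in y]
    n = len(y)
    if n < 2:
        raise ValueError("need at least two observations")
    csum = list(accumulate(y))  # csum[i] = y[0] + ... + y[i]
    total = csum[-1]
    best_t, best_stat = 1, -1.0
    for t in range(1, n):
        left_mean = csum[t - 1] / t
        right_mean = (total - csum[t - 1]) / (n - t)
        # Scale the mean difference by its standard deviation under no change.
        stat = abs(left_mean - right_mean) * math.sqrt(t * (n - t) / n)
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t, best_stat


# Toy profile: baseline copy number for 50 probes, then a gain for 50 probes.
profile = [0.0] * 50 + [1.0] * 50
split, statistic = best_single_changepoint(profile)
```

Applying the same scan recursively to each resulting segment gives a simple binary segmentation; deciding when to stop splitting is the model selection question the abstract refers to.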
Cancer and other genetic diseases are no stranger to genome scientists: high-throughput technologies and statistical analyses have always promised to provide a systems-level view of disease inheritance and progression. In recent years, concurrent advances in genomics and statistics, including more efficient high-throughput data-collection methods, larger patient sample sets, an atmosphere of more open collaboration, and greater sophistication in study design and data analysis, have positioned us to make major new advances in studying genetic disease. Despite this promise, much remains to be done. In particular, statistical methods for the analysis of genome-wide profiling data lack the sophistication to deal with the many issues that arise in modern data collection schemes. These issues include high dimensionality, missing observations, and simultaneous inference in a large number of patient samples. In this proposal, the investigator and her colleagues formulate these new problems and put forth models with practical solutions. These developments in statistical methodology are a direct response to the current analysis needs at the Stanford Genome Technology Center and in the Cancer Genome Atlas Project, and open source software will be made available to these and broader communities.
|
0.915 |
2010 — 2015 |
Siegmund, David (co-PI); Ji, Hanlee; Zhang, Nancy |
N/A |
ATD: Statistical Methods For Threat Detection
Successful statistical analysis of the massive amounts of data available today can lead to successful, early threat detection. This proposal consists of two parts. The first part focuses on the detection of mutations in pathogen samples. Many emerging health threats are due to new mutations in evolving pathogen populations, which can now be profiled using massively parallel sequencing experiments. The investigators work with Dr. Hanlee Ji's laboratory in the Stanford Genome Technology Center, whose deep sequencing platform allows the detection of low prevalence mutations in pathogen samples. This problem was previously treated mainly from an algorithmic perspective, lacking statistical models for error estimates. The investigators propose methods for analysis of single nucleotide changes and general structural variants, and consider the analysis of single samples, the simultaneous analysis of multiple samples, and the comparison of matched samples. The second part of the proposal considers threat detection in a more general framework: detection of changes from background condition in one or more parallel streams of data. Examples are cyber-attacks on computer networks, introduction of belligerent agents (e.g. landmines, aircraft) into previously quiescent environments, appearance of noxious chemicals, genetic modifications of viruses or bacteria, etc. The main contribution is a general conceptual framework for integrating data from a large number of distributed sources, when the signal of interest may be present in only a small fraction of the sources. This proposal motivates theoretical developments in the areas of change-point detection, mixture estimation, empirical Bayes estimation, and false discovery rate control.
Successful statistical analysis of the massive amounts of data collected in modern scientific and technological activities can lead to successful, early threat detection. This proposal consists of two parts. The first part focuses on the detection of mutations in pathogen samples. Many emerging health threats are due to new mutations in evolving pathogen populations, which can now be profiled using next-generation sequencing experiments. The accurate detection of new mutations is important, because they may confer a survival advantage on the pathogens that carry them. To date, this problem has been treated mainly from an algorithmic perspective, lacking statistical models for error estimates. The methods developed in this proposal will bridge this gap. The second part of the proposal considers threat detection in a more general framework: detection of changes from background conditions in one or more parallel streams of data. Examples include cyber-attacks on computer networks, introduction of belligerent agents (e.g. landmines, aircraft) into previously quiescent environments, appearance of noxious chemicals, and genetic modifications of viruses or bacteria. The main contribution is a general conceptual framework for integrating data from a potentially large number of distributed sources, when the signal of interest may be present in only a small fraction of the sources.
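To make the second part concrete, here is a minimal sketch of detecting a common change time across parallel data streams when only a few streams carry the signal. The thresholded sum of squared per-stream statistics used here is an illustrative choice, not the investigators' framework; the names `stream_z` and `scan_streams`, the unit-variance noise assumption, and the default threshold of 2.0 are all assumptions made for this example.

```python
import math


def stream_z(y, t):
    """Standardized mean difference between y[:t] and y[t:], assuming unit variance."""
    n = len(y)
    left_mean = sum(y[:t]) / t
    right_mean = sum(y[t:]) / (n - t)
    return (left_mean - right_mean) * math.sqrt(t * (n - t) / n)


def scan_streams(streams, threshold=2.0):
    """Return (t, score, flagged) for the best common change time.

    At each candidate time t, only streams whose |z| exceeds the threshold
    contribute z**2 to the score, so streams without a signal add no noise.
    flagged lists the indices of the contributing streams at the best t.
    """
    n = len(streams[0])
    if n < 2 or any(len(s) != n for s in streams):
        raise ValueError("streams must share a common length of at least 2")
    best_t, best_score, best_flagged = None, -1.0, []
    for t in range(1, n):
        zs = [stream_z(s, t) for s in streams]
        flagged = [i for i, z in enumerate(zs) if abs(z) > threshold]
        score = sum(zs[i] ** 2 for i in flagged)
        if score > best_score:
            best_t, best_score, best_flagged = t, score, flagged
    return best_t, best_score, best_flagged


# Toy example: five streams, of which only streams 0 and 3 shift at time 20.
quiet = [0.0] * 40
shifted = [0.0] * 20 + [1.0] * 20
change_time, score, carriers = scan_streams([shifted, quiet, quiet, shifted, quiet])
```

Thresholding before summing is one simple way to keep the many unaffected streams from drowning out a signal present in only a small fraction of sources; calibrating the threshold and the resulting false alarm rate is the statistical question the abstract raises.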
|
0.915 |
2011 — 2013 |
Ji, Hanlee P; Zhang, Nancy R |
R01 (Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.) |
Statistical Models For Genome Sequencing and Association
DESCRIPTION (provided by applicant): In the next few years, high-throughput short-read sequencing will become the de facto method for profiling genome variation. With experimental platforms moving beyond the proof-of-principle stage, large multi-sample studies are underway at the Stanford Genome Technology Center, with a focus on profiling mutations in cancers and in evolving virus populations. Current methods for DNA variant detection are mostly designed for the analysis of DNA from normal samples, and lack power for the analysis of genetically heterogeneous cell populations such as tumors and viruses. The goal of this proposal is to develop statistical models and methods for detecting mutations and estimating their prevalence in genetically heterogeneous samples, and to derive fast, analytic approaches for estimating their significance and power. Methods will also be developed for the aggregation of genetic profiles across multiple samples in the search for mutation hotspots associated with clinical outcome. Our specific aims are: 1. Develop statistical models for the calling of single nucleotide polymorphisms/mutations, copy number changes, and structural variants in genetically heterogeneous samples. Derive fast, simulation-free methods to estimate the false discovery rates of detection schemes under these models. 2. Develop a statistical framework for aggregating mutation profiles across samples. Most current studies group mutations into genes or exons, or use arbitrary binning schemes. We propose a new approach to this problem by modeling the mutation profile across patients as aligned point processes. We will extend our work on multi-sample scan statistics to develop a genome-wide, variable-window-width adaptive test for identifying genomic regions where the occurrence of mutations is associated with a given phenotype. This framework can potentially also be applied to genetic association studies with rare variants.
The PI, Dr. Nancy R. Zhang, was trained in mathematics (BA), computer science (MS) and statistics (PhD), and, as a faculty member in the Department of Statistics at Stanford University, has focused on the statistical analysis of DNA copy number and other types of genome-wide profiling data. Much of her published work addresses the issue of cross-sample and cross-platform aggregation and multiple-testing control in genome profiling studies. At the heart of this proposal is the collaboration with Dr. Hanlee Ji, an assistant professor in the Department of Medicine and senior associate director at the Stanford Genome Technology Center. This proposal is a timely response to the growing need for a statistical data analysis platform for genome resequencing at Stanford and in the larger scientific community. Public, open source software will be made available for all of the developed methods.
|
1 |
2012 — 2015 |
Ji, Hanlee P; Kuo, Calvin J; Zhang, Nancy R |
U01 (Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.) |
Integrated Genomic Discovery and Functional Validation of Colorectal Cancer Loci
DESCRIPTION (provided by applicant): Colorectal carcinoma (CRC) arises from multiple mutations and genomic aberrations in distinct driver cancer genes that act in concert to spur neoplastic development and phenotype. This inherent genetic complexity greatly complicates both personalized diagnosis and treatment. Previously published studies have confirmed that large numbers of genes will be mutated or subject to genomic aberrations in CRC. A significant and emerging challenge for the post-genomic era is to identify which of these mutated genes are driver loci that functionally drive colon cancer development, versus passenger loci without functional relevance. Finally, it is of the highest priority that one link these genetic observations with correlates of prognosis and clinical outcome. This can only be done if we better understand the unified biological ramifications of the combined and diverse multigenic driver background, which acts synergistically to promote CRC tumorigenesis. This proposal details an integrated analysis that will rely on the CRC genomic data generated by the Cancer Genome Atlas Project (TCGA) to discover novel candidate CRC genes, study multigenic CRC driver gene co-mutated/dysregulated modules within the genetic context of other drivers, and provide biological validation in a powerful in vitro primary culture CRC model that can be engineered for multiple genetic events. To accomplish these goals, we will develop and implement novel statistical methodologies for the integrative analysis of multiple TCGA genomic and clinical data sets. The goal is to identify and prioritize novel CRC genes either singly or as co-mutated modules in combination with other known driver CRC genes. We will use the rich TCGA data set to conduct an integrated CRC genomic analysis of point mutations, gene expression, copy number aberrations and methylation data.
We will prioritize the discovery of mutations and other genomic aberrations of these novel CRC genes that are associated with specific clinical stages of disease and other clinical parameters. These statistical and computational studies will then be directly coupled to rapid and robust functional target validation of candidate loci using our rigorously characterized in vitro primary intestinal culture methodology (Ootani et al., Nature Medicine, 2009), in which we have recently demonstrated the transforming activity of established CRC loci such as APC, KRAS and TP53. Genetic deletion and retroviral expression of shRNA, cDNA or mutants thereof will be utilized to evaluate putative individual driver loci, as well as combinatorial oncogene modules. This proposal directly addresses fundamental problems in the exploration and translation of novel colorectal cancer gene discovery in the context of clinical data which is available from TCGA. RELEVANCE (See instructions): Colorectal cancer (CRC) represents the third most commonly diagnosed cancer in the United States. This proposal utilizes a fusion of genomic analysis of a large population of patients, mathematical modeling and culture of intestinal fragments to functionally identify genes that are critical for colon cancer development. These studies have implications for generation of novel diagnostic and therapeutic strategies for colon cancer.
|
1 |
2014 — 2016 |
Ji, Hanlee P; Zhang, Nancy R |
R01 |
Statistical Models and Analysis of Complex Genomic Variation in Clonal Mixtures @ University of Pennsylvania
DESCRIPTION (provided by applicant): Next generation DNA sequencing (NGS) approaches are widely used in studying human diseases and identifying causative genetic variants. Increasingly, NGS methods are being used to define biologically relevant clonal mixtures, a frequently observed phenomenon in human disease. Examples of clonal mixtures in human disease include tumor cell subpopulations that are a part of cancer. Within a single tumor, and clearly evident in metastatic tumor sites, cancer cell clonal populations exist that are genetically distinct and carry their own unique sets of somatic variants. A similar phenomenon occurs in viral infection, where multiple viral quasispecies are harbored within an infected individual; each quasispecies has its own unique set of genetic variants. One can quantitatively measure expansion or shrinkage of clonal populations as seen in changes in allelic representation of clonal variants. Specific cellular phenotypes are attributable to the unique clonal variants, and changes in their representation can be indicators of evolutionary processes. This is frequently the case for drug resistance in cancer and viral infections. Thus, clonal genetic variation has major implications for the pathogenesis of human disease and is increasingly being tested as a longitudinal indicator of disease progression and treatment resistance. The general availability of whole genome and deep targeted resequencing provides an opportunity to conduct systematic analysis of heterogeneous DNA mixtures that have different clonal components. However, in many cases the genetic variant of interest is present at very small proportions (< 5%), and this makes the delineation of these clonal variants exceedingly difficult. Many of the widely employed NGS analysis methods are optimized for detecting normal diploid genome variation. These approaches are not optimal for delineating genomic variants from complex clonal mixtures.
Some genomic DNA variant classes such as genomic rearrangements are extremely difficult to detect in the context of clonal mixtures. To improve the assessment of clonal variation and evolution of specific clonal populations, we will develop innovative models and robust, sensitive statistical procedures. These methods will enable one to deconvolute genomic variation in clonal mixtures and consider clonal alterations through time and space. We will focus on improving the delineation of complex variations such as genomic rearrangements and other structural variations in genetic mixtures. To develop our methods, we will use heterogeneous DNA sequence data sets with in silico spiked-in variants and consider the lowest threshold of detection that we can achieve with the best sensitivity and specificity. Subsequently, we will test these methods on NGS data sets from clinical samples, delineate clonal populations based on unique variants and consider quantitative changes in allelic representation as seen in clonal expansion. These samples will be subject to whole genome and targeted resequencing. Cancer-relevant samples will include tumors with matched normal, primary and metastatic DNA. We will consider viral quasispecies for a set of clinical samples where we have matched viral nucleic acid samples obtained longitudinally over the course of infection from a single individual. As a final milestone, we will release our methods as open source software for the biomedical research community.
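To illustrate why variants present at under 5% are hard to delineate, the following minimal sketch tests whether the number of variant-supporting reads at one site exceeds what sequencing error alone would produce. It assumes independent reads and a known, uniform per-base error rate, which is a strong simplification; the function names `binom_tail` and `variant_pvalue` and the example numbers are hypothetical, and this is not the model developed in the proposal.

```python
import math


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def variant_pvalue(alt_reads, depth, error_rate):
    """P-value for seeing at least alt_reads non-reference reads out of depth
    reads if every one of them were a sequencing error (assumed error model)."""
    return binom_tail(alt_reads, depth, error_rate)


# Hypothetical site: 1000x depth with an assumed 1% per-base error rate.
p_strong = variant_pvalue(30, 1000, 0.01)  # 3% of reads support the variant
p_weak = variant_pvalue(12, 1000, 0.01)    # 1.2% of reads support the variant
```

With these assumed numbers, a 3% variant stands out clearly from the error background while a 1.2% variant does not, which is one way to see how detection limits depend on depth and error rate.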
|
0.954 |
2017 — 2020 |
Li, Mingyao; Zhang, Nancy R |
R01 |
Statistical Methods For Single-Cell Transcriptomics @ University of Pennsylvania
PROJECT SUMMARY Cells are the basic biological units of multicellular organisms. Recent technological breakthroughs have made it possible to measure gene expression at the single-cell level, thus paving the way for exploring gene expression heterogeneity among cells. The collection of abundances of all RNA species in a cell forms its "molecular fingerprint", enabling the investigation of many fundamental biological questions beyond those possible with traditional bulk RNA-seq experiments. Single-cell RNA-seq (scRNA-seq) allows us to better describe the lineage and type of single cells, characterize the stochasticity of gene expression across cells, and improve our understanding of cellular function in health and disease. ScRNA-seq analysis is transforming biomedical sciences, has already made a great impact in fields such as neuroscience and immunology, and can enhance our understanding of disease development in numerous other contexts including cardiometabolic diseases. However, scRNA-seq data present new challenges that standard analytical methods are not designed to confront. Current scRNA-seq protocols are complex, often introducing technical biases that vary across cells, which, if not properly removed, can obscure cell type identification and lead to biased results in downstream analyses. Published scRNA-seq studies have mainly been proof-of-principle studies illustrating the utility of scRNA-seq in cell type classification and other basic biological analyses. However, as the use of scRNA-seq continues to grow, researchers are beginning to explore its utility in disease gene discovery. Building upon our expertise in statistical methods development and our experience with analysis of genomics data for human cardiometabolic diseases, we propose to develop novel statistical methods to address some of the key analytical challenges in scRNA-seq analysis.
We will guide methods development through the analysis of scRNA-seq data generated from ongoing collaborations with investigators at the University of Pennsylvania and Columbia University. We propose the following specific aims. Aim 1: Develop methods to recover gene expression and identify cell types. Aim 2: Develop methods to detect gene expression changes between cell types or conditions. Aim 3: Develop methods to estimate isoform-specific gene expression and detect differential alternative splicing. Aim 4: Develop methods to model allele-specific transcriptional bursting and its genetic regulation. This proposal addresses critical challenges in scRNA-seq analysis, and it brings together an exceptional team of scientists with a proven track record in statistical genomics, single-cell biology, and cardiometabolic disease. The successful completion of this project will allow researchers to better disentangle complex cellular heterogeneity, precisely relate genomic sequence to gene regulation, and facilitate the translation of basic research findings into clinical studies of human disease.
|
0.954 |
2017 — 2019 |
Ji, Hanlee P; Zhang, Nancy R |
R01 |
Genomic and Cellular Variation From Single Molecules to Single Cells @ University of Pennsylvania
PROJECT ABSTRACT Defining the features of cellular mixtures, where diverse cell types with distinct genomic characteristics are physically intermingled, is a central problem in biology. For example, diseases such as cancer are characterized by cellular masses comprised of subpopulations, each with its own set of genetic variants and transcriptional signatures, where inter-population DNA variation is compounded with cell-to-cell RNA expression stochasticity. Characterizing genomic diversity in cellular mixtures and assessing its impact on cell-to-cell gene expression variation require analyses at the resolution of individual cells and contiguous genome molecules. This level of analytical resolution is now feasible with next generation sequencing (NGS) assays that integrate molecular barcoding with single-cell RNA sequencing and single-molecule DNA sequencing. These technological advances surmount key challenges and herald new opportunities for the study of disease, but require new analysis methods: (1) Current NGS methods are not optimal for detecting and phasing genomic variants from cellular mixtures. For example, it is difficult to detect complex structural variants (SVs) that are carried by only a fraction of the genomes present within a mixture. Methods based on short-read data are hindered by the loss of long-range contiguity in heavily fragmented DNA as well as the low mappability of many SV junctions. Single-molecule linked-read DNA sequencing overcomes these drawbacks, but is in need of reliable analysis methods. (2) Single-cell RNA sequencing allows the detection of distinct cellular subpopulations with unique transcriptional signatures; however, data from individual cell transcriptomes have high levels of error and bias. New analysis procedures are needed to make statistically sound inferences.
(3) The existing methods for single-cell expression analysis typically ignore DNA heterogeneity, which can be crucial for some studies, especially for cancer. It is not yet clear how to simultaneously characterize variation at both the DNA and RNA levels in a cellular mixture. This proposal addresses these issues by developing new statistical methods and experimental designs that enable accurate characterization of cellular mixtures exhibiting both DNA and RNA variations. We propose to develop methods to (1) detect, characterize, and phase complex variants using new single-molecule sequencing technology, (2) improve expression estimates obtained from single-cell RNA sequencing data, and (3) combine bulk single-molecule DNA sequencing and single-cell RNA sequencing to quantify the relationship between DNA variation and transcriptomic variation in genetically heterogeneous samples such as cancer.
|
0.954 |
2020 — 2021 |
Ji, Hanlee P; Zhang, Nancy R |
R01 |
Single Cell Transcriptomic and Genetic Diversity by Single Molecule Long Read Sequencing @ University of Pennsylvania
PROJECT SUMMARY Defining the features of cellular mixtures, where diverse cell types with distinct genomic characteristics are physically intermingled, is a central problem in biology. During the past decade, single-cell sequencing technologies have enabled a new era of high-throughput and high-resolution interrogation of cell type diversity, vastly expanding our understanding of the role that cell types play in development and disease. Yet, current studies in single-cell genomics rely on short-read sequencing and thus suffer from limitations, including: (1) Most studies rely on short-read counting, which limits the study of alternative splicing. (2) Cell states are reflected by static snapshots, and while population dynamics can be deduced through trajectory and RNA velocity estimation, robust estimation of these parameters remains a major challenge. (3) Despite advances in single-cell DNA sequencing, there is as yet no cost-effective way to simultaneously characterize both the genetic variants and transcriptome-level changes in a cell, which is crucial for diseases such as cancer. This proposal is motivated by technological breakthroughs in single-molecule sequencing (SMS) and the recent adaptation of SMS to the massively parallel sequencing of single-cell transcriptomes in our lab. We propose to develop computational methods to harness the power of SMS in single-cell transcriptomics. In particular, we have developed a new genomic approach which allows one to repeatedly interrogate complete transcripts from single cells using SMS long reads, rather than 3' or 5' counting with short reads. This technology allows experimental designs where specific transcript subsets and/or cellular subsets can be repeatedly targeted for deeper joint short- and long-read analysis over many iterations, which we will exploit to conduct analyses that were previously intractable.
|
0.954 |