2015–2017
Milenkovic, Olgica; Weissman, Tsachy
U01 Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.
Genomic Compression: From Information Theory to Parallel Algorithms @ University of Illinois at Urbana-Champaign
DESCRIPTION (provided by applicant): One of the highest priorities of modern healthcare research and practice is to identify genomic changes and markers that predispose individuals to debilitating diseases or make them more responsive to certain therapies and emerging treatments. Timely discovery and knowledge mining in this area of medical research is largely enabled by massive DNA sequencing and functional genomic data, the volumes of which are expected to grow dramatically in the near future. It is therefore of paramount importance to develop efficient, accurate, and low-latency data compression and decompression techniques that allow for fast exchange, dissemination, random access, visualization, and search of diversely formatted genomic information. Specialized compression methods for biological data will support the continued growth of NIH databases and their utility, new uses of crowd-sourced computing in medical research, and large-scale dissemination of experimental results.

The specific aims of the proposal are to develop parallel, task-oriented algorithms for a) reference-based and reference-free compression of reads and whole genomes; b) lossy compression of quality scores; and c) compression of functional genomic data. Although the three data categories have different statistical properties and formats, they may be compressed using similar combinations of pre-processing, statistical coding, and parallel algorithms. Furthermore, some universal features of the developed compression techniques will make it possible to apply them successfully to other emerging genomic data formats.

The long-term objectives of the proposed research program are two-fold. The first objective is to perform fundamental analytical studies of lossless and certain restricted forms of lossy compression and dimensionality-reduction methods for genomic and functional genomic data, using information-theoretic techniques. The second objective is to develop a new suite of parallel algorithms for SAM, FASTQ, and Wig track data compression. The developed algorithms are expected to include suitably combined, modified, and extended classical compression methods (e.g., arithmetic, Huffman, and Lempel-Ziv coding), as well as novel solutions based on context mixing and context-tree weighting with biological side information.

Immediate goals of the project include using CUDA, as well as classical parallel computing platforms, to implement current compression algorithms and thereby reduce the latency of compression and decompression. Novel components of the parallel implementations will include extensive use of state-of-the-art hashing, indexing, and string-matching methods. SAM, FASTQ, and Wig data files are ubiquitous in genomic research. Hence, a research program resulting in high-performance software suites for compressing these and other genomic information formats will enable management of, transfer of, and access to the massive data sets crucial to the operation of governmental and NIH-sponsored projects such as ENCODE, TCGA, ClinVar, Genome 10K, the Million Cancer Genome Warehouse, and ADAM.
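To make the reference-based compression aim concrete, the following is a minimal Python sketch in which each read is stored as an alignment position plus a list of mismatches against a shared reference, rather than as raw bases. The function names and the naive exact scan are assumptions for illustration, not the proposal's algorithms, which would instead anchor reads using the hashing and indexing methods mentioned above and entropy-code the residuals (e.g., with arithmetic coding).

    # Sketch: reference-based read compression as (position, mismatch-list) pairs.
    # Illustrative only; a real implementation would locate reads via a hash or
    # index structure and entropy-code the output instead of scanning naively.

    def compress_read(reference: str, read: str):
        """Encode `read` as its best match position in `reference` plus the
        list of (offset, base) substitutions needed at that position."""
        best_pos, best_diffs = 0, None
        for pos in range(len(reference) - len(read) + 1):
            diffs = [(i, b) for i, b in enumerate(read)
                     if reference[pos + i] != b]
            if best_diffs is None or len(diffs) < len(best_diffs):
                best_pos, best_diffs = pos, diffs
        return best_pos, best_diffs

    def decompress_read(reference: str, pos: int, diffs, length: int) -> str:
        """Invert compress_read: copy from the reference, then patch substitutions."""
        bases = list(reference[pos:pos + length])
        for offset, base in diffs:
            bases[offset] = base
        return "".join(bases)

    if __name__ == "__main__":
        ref = "ACGTACGTTTGACCAGT"
        read = "CGTTTGACCG"  # matches ref at position 5 up to one substitution
        pos, diffs = compress_read(ref, read)
        assert decompress_read(ref, pos, diffs, len(read)) == read
        print(pos, diffs)  # 5 [(9, 'G')]

Storing only the differences is what makes the representation compressible: for high-coverage sequencing data, most reads reduce to a position and a short mismatch list.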
2020–2021
Ji, Hanlee P.; Weissman, Tsachy
U01 Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.
K-Mer Indexing for Pan-Genome Reference Annotation
ABSTRACT: The human genome reference sequence is one of the foundations of genome science, especially in the context of next-generation sequencing (NGS) analysis. The reference has enabled discoveries in biomedical research and has been particularly instrumental in human disease gene identification. However, the human genome reference is limited by its static and linear nature. Specifically, the current reference lacks the featural and contextual flexibility to represent the breadth of human variation, so important elements of individual genomes are either missed or incorrectly represented. As a solution that will bridge the next generation of reference assemblies with population genome sequencing studies, we have developed a K-mer-based indexing approach. This method is computationally more efficient, provides accurate representation in the context of populations, and facilitates the analysis of diverse human genomes. Our goal is to use this strategy to develop a robust computational architecture that encodes and annotates large collections of genomes in the context of a pan-genome reference.

First, we plan to develop a scalable, efficient K-mer representation of a large collection of haplotype-phased reference genomes, by 1) generating an index of all K-mers in the human reference genome GRCh38 in a manner that can efficiently store variant information as metadata, 2) incrementally updating the K-mer index to include all novel K-mers derived from ongoing population sequencing efforts, and 3) developing schemes for directly analyzing compressed genomic data.

Second, we plan to apply the K-mer representation to genomic analysis by 1) providing the entirety of known human genetic variation in an aggregated index that is computationally efficient and easy to understand, 2) developing functions for our pan-genomic index that support ultra-rapid queries, such as lookups of clinically important variants, and 3) linking conventional coordinate information to the K-mer metadata in the pan-genome index so that genetic variation can be annotated against a particular genome reference.

Third, we will create an online web portal for the pan-genome, using cloud computing, to maximize the utility of our approach, promote community engagement, and enable contributions from the research community.

We expect that completion of these aims will provide: a scalable computational architecture that incorporates the continuous addition of variant information without loss of resolution or accuracy; rapid query speeds that remain nearly constant as the database grows; and a universally accessible portal built on cloud computing. This work will help solve the issues posed by multiple assemblies. It will improve researchers' ability to understand the relationship between variants and disease, while also providing substantial long-term savings in infrastructure and computing costs.
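As a concrete illustration of K-mer indexing with per-sample metadata and constant-time lookup, here is a minimal Python sketch built on an in-memory dictionary. The data structure, the sample-ID metadata, and all names are assumptions for exposition, not the proposal's compressed, population-scale architecture.

    # Sketch: a K-mer index mapping each K-mer to (source, position) records.
    # Illustrative only; a production index would be compressed and disk-backed,
    # and K would typically be larger (e.g., 21-31).
    from collections import defaultdict

    def build_index(reference: str, k: int):
        """Map every length-k substring of the reference to where it occurs."""
        index = defaultdict(list)
        for i in range(len(reference) - k + 1):
            index[reference[i:i + k]].append(("ref", i))
        return index

    def add_haplotype(index, haplotype: str, sample_id: str, k: int) -> int:
        """Incrementally fold a new haplotype into the index, tagging each
        K-mer with its sample of origin (the variant 'metadata' of this
        sketch). Returns how many K-mers were not previously indexed."""
        novel = 0
        for i in range(len(haplotype) - k + 1):
            kmer = haplotype[i:i + k]
            if kmer not in index:
                novel += 1
            index[kmer].append((sample_id, i))
        return novel

    def query(index, kmer: str):
        """Average-case O(1) hash lookup, independent of how many genomes
        have been folded into the index."""
        return index.get(kmer, [])

    if __name__ == "__main__":
        k = 5
        index = build_index("ACGTACGTGACC", k)
        print(add_haplotype(index, "ACGTTCGTGACC", "sample1", k))  # 5 novel K-mers
        print(query(index, "CGTGA"))  # [('ref', 5), ('sample1', 5)]

Because membership is a single hash probe, lookups do not slow down as new genomes are added, which is the property behind the nearly constant query speeds cited above.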