2017 — 2020 |
Paschou, Peristera; Drineas, Petros |
N/A |
III: Small: Novel Statistical Data Analysis Approaches for Mining Human Genetics Datasets
The advent of modern genotyping and sequencing technologies has revolutionized human genetics research, allowing researchers to truly understand how different we are from one another. Large datasets describing the common patterns of human genetic variation can naturally be viewed as matrices, with the rows representing individuals and the columns representing loci in the genome that correspond to common polymorphisms. The broader impact of such datasets cannot be overemphasized: they are a key resource for researchers seeking to find genes affecting health, disease, and responses to drugs and environmental factors, and to understand the evolutionary and biological history of our species. Extracting useful information from such datasets promotes the progress of science and, at the same time, advances national health, prosperity, and welfare. This project will bridge the gap between state-of-the-art algorithms for data analysis developed in the theoretical computer science and applied mathematics communities and the application of such algorithms to the analysis of the increasingly large volume of datasets in the human genetics community.
First, from an algorithmic perspective, the project team will design and analyze novel algorithms for three prototypical, fundamental research topics that combine linear algebra and randomization, namely sparse Principal Components Analysis, matrix completion, and linear (or kernel) discriminant analysis. All three topics have been widely popular in the theoretical computer science, machine learning, and applied mathematics communities, yet they have been essentially overlooked by the population genetics community. Second, from a population genetics perspective, the team will apply the developed algorithms to gain novel insights regarding population structure, ancestry informative markers, and natural selection, and to improve imputation methods and Genome-Wide Association Studies (GWAS) data analysis. All three methods will be evaluated on population genetics datasets that are available to the PIs. The project will train graduate students and will disseminate the results of the research to a broad community of applied mathematicians, theoretical computer scientists, and population geneticists.
|
0.961 |
2020 — 2023 |
Drineas, Petros; Paschou, Peristera |
N/A |
III: Small: Randomized Matrix-Sketching Approaches for the Analysis of Massive Human Genomics Data
Researchers in human genetics now have access to unprecedented amounts of genetic information characterizing how truly different we are from one another. From a Computer Science and Applied Mathematics perspective, the resulting datasets can be thought of as matrices, with the rows representing individuals and the columns representing loci in the genome that correspond to common or rare polymorphisms. Analyzing such datasets, Genome-Wide Association Studies (GWAS) have reported over 10,000 strong associations between genetic variants and complex traits. However, tools that allow efficient analysis of very large-scale datasets are still missing. Extracting useful information from such datasets promotes the progress of science and, at the same time, advances public health, prosperity, and welfare. This project will bridge the gap between state-of-the-art algorithms developed in the theoretical computer science community and the application of such algorithms to the analysis of the increasingly large volume of datasets in the human genetics community.
This project will explore how randomized linear algebra, from a theoretical and practical standpoint, can be used to speed up human genetics data analytics. The first research direction will investigate Linear Mixed Models (LMMs), which form a linear model of the genetic effects on the phenotype of interest. Randomized linear algebra tools will be used to speed up the solution of the resulting optimization problem without sacrificing accuracy. The second research direction will investigate Polygenic Risk Scores (PRS), which typically operate by first selecting a large number of genetic markers (often in the tens of thousands) out of all available markers (often in the many millions) using single-marker significance tests. This feature selection stage is followed by building regression models on the selected markers to predict phenotypes. Randomized linear algebra tools will be used to speed up PRS approaches while preserving generalization accuracy. Finally, the third research direction will explore how the particular structure of population genetics datasets can be leveraged to design improved randomized linear algebra tools for the analysis of human genetics datasets. The investigators will disseminate their results to a broad community of applied mathematicians, theoretical computer scientists, and population geneticists. They both participate in population genetics conferences and workshops and publish in high-profile journals in population genetics, as well as in conferences and workshops in Computer Science. The investigators will additionally disseminate this knowledge to graduate and undergraduate students. They will involve under-represented groups in their research activities, leveraging their prior track record of involving such groups in cutting-edge research.
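The two-stage PRS pipeline described above (single-marker screening, then regression on the selected markers) can be illustrated with a minimal Python sketch. This is not the investigators' method and contains no randomized speed-up; it only shows the baseline structure. The function names `marginal_scores`, `fit_prs`, and `predict_prs`, the parameter `ridge`, and the use of absolute correlation as the screening statistic are all assumptions made for illustration. For a fixed sample size, ranking markers by absolute correlation is equivalent to ranking them by the t-statistic of a single-marker linear regression. The genotype matrix `G` follows the convention in the abstract: rows are individuals, columns are loci.

```python
import numpy as np


def marginal_scores(G, y):
    """Absolute Pearson correlation of each marker (column of G) with phenotype y."""
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Gc, axis=0) * np.linalg.norm(yc)
    denom[denom == 0] = np.inf  # constant markers (or constant y) score 0
    return np.abs(Gc.T @ yc) / denom


def fit_prs(G, y, k, ridge=1.0):
    """Stage 1: keep the k top-scoring markers. Stage 2: ridge regression on them.

    `ridge` is a hypothetical regularization strength, not a value from the source.
    """
    scores = marginal_scores(G, y)
    selected = np.argsort(scores)[-k:]  # indices of the k largest scores
    mean_g = G[:, selected].mean(axis=0)
    X = G[:, selected] - mean_g  # center the selected markers
    intercept = y.mean()
    # Solve the regularized normal equations (X^T X + ridge * I) w = X^T (y - mean).
    w = np.linalg.solve(X.T @ X + ridge * np.eye(k), X.T @ (y - intercept))
    return selected, mean_g, w, intercept


def predict_prs(G, model):
    """Apply a fitted model to new individuals (rows of G)."""
    selected, mean_g, w, intercept = model
    return (G[:, selected] - mean_g) @ w + intercept
```

In this sketch the dominant costs are the pass over all markers in `marginal_scores` and the linear solve in `fit_prs`; those are the kinds of steps where the abstract proposes applying randomized linear algebra.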
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
|
0.961 |