2014 — 2017 |
Raskutti, Garvesh |
N/A |
A Reliable and Scalable Approach to Causal Inference For Large-Scale Multivariate Data @ University of Wisconsin-Madison
With masses of large-scale data being generated, a key challenge facing many scientists is to infer relationships amongst variables of interest. In particular, inferring causal or functional relationships amongst genes, proteins, and other biological elements is of fundamental interest to scientists. This project will develop methods for inferring causal or functional relations between genetic, proteomic, and transcriptomic features, both for the ENCODE human genome project and for data from mice with different susceptibility to obesity and diabetes. For both types of data, this project will develop frameworks that comprise: (1) domain knowledge that informs the choice of model and algorithm; (2) fast, parallelizable algorithms with provable run-time guarantees; and (3) statistical consistency guarantees for the algorithms developed under assumptions that are likely to be satisfied in practice.
Directed graphical models or Bayesian networks provide a useful framework for representing causal or functional relationships. A number of algorithms have been developed for inferring directed or Bayesian networks from data. However, prior approaches either are unreliable, as they require assumptions that are rarely satisfied in practice, or do not scale to larger datasets. The proposed project will address this issue by developing algorithms for inferring directed networks with both statistical consistency guarantees and run-time guarantees. The new algorithms will involve exploiting connections between techniques in numerical linear algebra for developing fast solvers of linear systems and concepts in graph theory. Algorithms will be coded in R and will exploit parallel processing. Evaluation will involve both small-scale and large-scale synthetic graphical models with known network structure, real datasets involving yeast data where some of the directions are known, and new biochemistry data in which most of the directions are unknown. Theoretical guarantees on run-time and statistical consistency will be provided using a combination of tools from graph theory, numerical linear algebra, and concentration of measure that the PI has used and developed in prior work.
|
0.939 |
2018 — 2021 |
Raskutti, Garvesh |
N/A |
Estimation, Inference and Testing For Large-Scale Directed Network Models @ University of Wisconsin-Madison
Large-scale interaction networks naturally arise in many modern scientific applications. For example, in biochemistry and systems biology, amino acids in different locations of the protein sequence interact, while in computational neuroscience, connectivity networks amongst neurons in the brain naturally trigger responses to particular stimuli. This project will develop reliable and scalable algorithms for learning the underlying interaction network amongst many nodes. Due to the scale, the complexity, and the changing data technologies in the applications described above, the solutions to the challenges addressed in this project will lead both to the development of novel theory and methodology and to the implementation of new algorithms for the application domains.
The goal of the project is to address the challenge of estimation, inference and testing for large-scale network models. Given the size of the networks generated, this project presents a number of computational and statistical challenges that the PI will address by focusing on two methodologies: (i) multivariate time series models; (ii) directed graphical models. The PI's prior work has developed new theory and methodology both for large-scale non-linear time series models and for directed graphical models. This prior work points to a number of significant open challenges for both methodologies that this project will address. These challenges include: (i) lack of sample size/statistical resources for learning complicated dependence structures; (ii) computational challenges due to non-convexity and large search-spaces for dependence models; (iii) incorporating domain knowledge and scientific experiments into the estimation methodologies; and (iv) exploiting learned networks for hypothesis testing, inference, and parameter estimation. The resulting contributions will lead to the development of new methods for network learning.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
|
0.939 |
2018 — 2020 |
Raskutti, Garvesh |
R01 (Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.) |
Novel Methods For Large Scale Presence Only Data in Biological Systems Engineering @ University of Wisconsin-Madison
This proposal will develop a number of novel statistical tools for learning genotype-phenotype mappings from experimental data. Massive genotype-phenotype data sets can be generated by genetic diversification, followed by high-throughput screening/selection and next-generation DNA sequencing of functionally distinct populations. The resulting data present new and interesting statistical challenges, including large numbers of examples, presence-only responses, and noisy/missing data. Presence-only responses arise because most high-throughput screening/selection methods isolate only functional examples (positive responses), while non-functional examples (negatives) are difficult or impossible to obtain. The resulting data sets contain the initial unlabelled variant library and positive examples. The modeling tools developed in this proposal apply to all levels of biological organization, spanning from molecules to ecosystems. The novel statistical methods developed in this proposal will model the relationships between protein sequence, structure, and function, with the goal of gaining insight into biochemical mechanisms and designing new and useful proteins. This proposal will (i) develop new theory and tools to analyze the large quantities of protein sequence-function data that are being generated by emerging high-throughput methods; (ii) address challenges associated with positive-unlabeled (PU) learning, extremely large data size, and low-quality/missing data; and (iii) encode side information from existing databases or physical models. Furthermore, applying the methods and algorithms developed in this work will generate novel scientific insights and engineered biological systems.
|
0.939 |
2018 — 2021 |
Raskutti, Garvesh; Wright, Stephen; Willett, Rebecca |
N/A |
Tripods+X:Res: Collaborative Research: Data Science Frontiers in Climate Science @ University of Wisconsin-Madison
Understanding the factors that determine regional climate variability and change is a challenge with important implications for the economy, security, and environmental sustainability of many regions around the globe. Our understanding and modeling of the large-scale dynamics of the Earth climate system and associated regional-scale climate variability significantly affects our ability to predict and mitigate climatic extremes and hazards. Earth observations and climate model outputs are witnessing an unprecedented increase in data volume, creating new opportunities to advance climate science but also leading to new data science challenges that must be addressed using tools from mathematics, statistics, and computer science. This project focuses on two central challenges at the heart of modern data-enabled climate science: (1) Increasing the predictive capacity of subseasonal forecasts by discovering and quantifying the sources of (un)predictability, including known and emergent climate modes and their interactions and non-stationarities; and (2) Understanding and quantifying the intricate space-time dynamics of the climate system to provide guidance for climate model assessment and regional forecasting. This project brings together an interdisciplinary team that combines expertise in both hydroclimate science and statistical machine learning to create new platforms for climate diagnostics and prognostics. The broader impacts of an enhanced knowledge of the climate system and robust and accurate seasonal forecasts have wide-ranging implications for society as a whole. For example, better seasonal forecasts will allow water resource managers to make sustainable decisions for water allocation.
This TRIPODS+CLIMATE project will develop novel machine learning and network estimation methodologies for analyzing the climate system over a range of space and time scales, to understand climate modes of variability and change and to explore their predictive ability for regional hydroclimatology. The two main objectives of this project are the following. Objective 1: Develop novel classification and regression tools that account for highly-correlated features or covariates, nonlinear interaction terms in high-dimensional settings, and nonstationarity in climate observations. These tools will be used to improve seasonal-to-subseasonal forecasts of regional precipitation using multidimensional climate modes and feature vectors in the presence of evolving dynamics and nonstationarities. Objective 2: Develop network identification methods that leverage recent advances in machine learning and statistics and that can account for the nonstationarity and limited timeframe of climate data. The network representation will be used to analyze the structure and dynamics of the learned dependencies to contextualize and interpret them physically, and to quantify changing patterns in climate modes and their regional predictive capacity. Emphasis will be placed on the western Pacific dynamics where an interhemispheric bi-directional connection has recently been discovered, promising earlier and more accurate seasonal-to-subseasonal forecasts in the southwestern US and other parts of the world.
|
0.939 |