2009 — 2012
Cheng, Guang |
General Semiparametric Inference Via Bootstrap Sampling
This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).
The research objectives of this project are first to prove the theoretical validity of the bootstrap as a general inferential tool for semiparametric models, and then to develop a computationally attractive bootstrap inference procedure, called the k-step bootstrap. Semiparametric modelling provides an excellent framework for modern complex data because of its flexibility: some features of the data are modeled parametrically, while no assumptions are imposed on the remaining features. The bootstrap is the most popular data-resampling method in statistical analysis and has recently been applied to semiparametric models arising in a wide variety of contexts, so systematic theoretical study of bootstrap inference for semiparametric models is fundamentally important. In practice, the computational cost of bootstrap inference is particularly high for semiparametric models. The investigator therefore proposes an approximate bootstrap method, the k-step bootstrap, and will show that this novel approach yields substantial computational savings without sacrificing inference accuracy. In addition, the investigator will develop a set of asymptotic results that elucidate the asymptotic structure of semiparametric M-estimation, which is crucial for future theoretical research. M-estimation refers to a general method of estimation that includes maximum likelihood estimation as a special case.
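To make the computational savings concrete, here is a minimal sketch of a k-step bootstrap for a generic M-estimation problem, in which each bootstrap replicate is refined with only k Newton-Raphson iterations warm-started at the original estimate rather than solved to full convergence. The function names, signatures, and the choice of nonparametric resampling are illustrative assumptions, not the project's exact procedure.

```python
import numpy as np

def k_step_bootstrap(theta_hat, score, hessian, data, B=500, k=2, seed=None):
    """theta_hat: original M-estimate (1-d array); data: (n, ...) array.
    score(theta, data) / hessian(theta, data): gradient and Hessian of the
    empirical criterion. Returns B bootstrap replicates, each computed with
    only k Newton-Raphson steps instead of a full optimization."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = []
    for _ in range(B):
        boot = data[rng.integers(0, n, size=n)]    # nonparametric resample
        theta = np.array(theta_hat, dtype=float)   # warm start at the original fit
        for _ in range(k):                         # k Newton-Raphson updates
            theta -= np.linalg.solve(hessian(theta, boot), score(theta, boot))
        reps.append(theta)
    return np.asarray(reps)  # e.g., percentiles give bootstrap confidence intervals
```

Because each replicate reuses the original fit and performs only k cheap updates, the total cost grows like B times k Newton steps rather than B full optimizations.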
The primary impact of the proposed work is to lay a solid theoretical foundation for general semiparametric inference via bootstrap sampling. The proposed k-step bootstrap is also practically beneficial in several respects; for instance, scientists who bootstrap large data sets will benefit, since the minimal computational cost needed for the k-step bootstrap to achieve satisfactory inference accuracy will be precisely characterized. The broader impacts of the proposed activities are multiple. A key aspect of this project is the integration of research and teaching, achieved by proposing specific projects for students in courses on semiparametric inference and bootstrap computation. This pedagogical approach also facilitates the participation of students from underrepresented groups.
2012 — 2018
Cheng, Guang |
CAREER: Bootstrap M-Estimation in Semi-Nonparametric Models
The PI studies bootstrap inferential strategies for two broad classes of bootstrap methods in the context of semi-nonparametric models. As a general-purpose approach to statistical inference, the bootstrap has found wide application in semi-nonparametric models. Unfortunately, systematic theoretical studies of bootstrap inference are extremely limited, especially when the nonparametric component is not root-n estimable. Two classes of bootstrap methods are considered: the exchangeably weighted bootstrap (EWB) and the model-based bootstrap (also known as the parametric bootstrap). The PI proves that the EWB consistently estimates the asymptotic variance of the Euclidean estimate and is theoretically valid for drawing semiparametric inferences in the framework of penalized M-estimation. However, the EWB may become invalid for inference on the nonparametric components. Hence, the PI considers the model-based bootstrap and theoretically justifies it as a universally valid inference procedure for all parameters in semi-nonparametric models. The proposed research also involves the development of advanced empirical process tools. This research lays the theoretical foundation for general semi-nonparametric inference via various bootstrap sampling schemes and establishes a general framework for non-standard asymptotic theory concerning the nonparametric components.
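For concreteness, the sketch below shows the generic form of an EWB replicate: exchangeable random weights with mean one reweight the criterion, and the weighted criterion is re-minimized. The two weight schemes shown (Dirichlet and multinomial) are standard examples assumed here for illustration; the penalized M-estimation framework analyzed in the project is more involved.

```python
import numpy as np
from scipy.optimize import minimize

def ewb_replicates(weighted_loss, theta_hat, data, B=200, scheme="dirichlet", seed=None):
    """weighted_loss(theta, data, w) = sum_i w_i * loss_i(theta).
    Draws exchangeable weights with mean one and re-minimizes the criterion."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = []
    for _ in range(B):
        if scheme == "dirichlet":
            w = n * rng.dirichlet(np.ones(n))        # Bayesian-bootstrap weights
        else:
            w = rng.multinomial(n, np.ones(n) / n)   # Efron's multinomial weights
        reps.append(minimize(weighted_loss, theta_hat, args=(data, w)).x)
    reps = np.asarray(reps)
    return reps, np.cov(reps.T)  # spread of replicates estimates the asymptotic variance
```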
The immediate need to extract information quickly and efficiently from all dimensions of modern massive data sets has given rise to the increasing popularity of semi-nonparametric models. For example, to understand the recent financial crisis, semi-nonparametric copula models have been applied to characterize tail dependence among shocks to different financial series and to recover the shape of the impact curve for individual financial series. The proposed research promotes the use of semi-nonparametric models in analyzing modern complex data by developing a series of innovative and valid bootstrap inferential tools, eventually yielding substantial scientific productivity across various disciplines. Statistical science benefits from an increasing number of researchers trained in semi-nonparametric modelling from both statistical and scientific viewpoints, including the students funded by this work and through broader collaborative research and educational activities. The research also produces easy-to-implement software for the public.
2014 — 2017
Cheng, Guang |
Collaborative Research: Semiparametric ODE Models For Complex Gene Regulatory Networks
Gene regulation plays a fundamental role in cellular activities and functions such as growth, division, and responses to environmental stimuli. The regulatory interactions among genes and their expression products (RNAs and proteins) intertwine into complex and dynamic gene regulatory networks (GRNs) in cells. Recent technical breakthroughs have enabled large-scale experimental studies of GRNs. A central question in GRN analysis is to elucidate the network topologies and dynamics that give rise to the biological properties under study. However, the magnitude and complexity of these network data pose serious challenges to extracting useful information from them. This project aims to develop statistical and computational tools to reveal the underlying structure, dynamics, and functionality of GRNs. New statistical theory and inference methods will be developed to tackle theoretical and computational challenges in modeling and analyzing large-scale GRNs. Results from this research will establish a novel framework for dissecting dynamic and complex biological networks, and in particular a GRN that regulates cell proliferation in our case study.
Traditional statistical analysis of GRNs typically assumes that interactions between network nodes can be described by linear functions or low-order polynomials. However, biological processes are usually complex, and molecular interactions between network nodes may not be accurately described by such simple functions. The main goal of this project is to develop novel and flexible statistical approaches to dissect and reconstruct GRNs by learning nonlinear interactions from time-course experimental data with either continuous- or discrete-valued gene expression. Specifically, we will develop new modeling and analysis approaches for studying GRNs using semiparametric ordinary differential equations (ODEs), and will develop state-of-the-art computational tools to characterize the structures and dynamics of GRNs, helping scientists address crucial cellular systems regulated by GRNs. The project has two parts. The first part focuses on methods and theory and consists of three aims: (1) to develop new and automated statistical procedures for studying local patterns and dynamic structures in large and complex GRNs; (2) to establish valid statistical inferences on topological features and regulatory interactions of GRNs; and (3) to develop efficient computational algorithms and software for analyzing large-scale GRNs. The methods developed in this research will provide valuable tools for modeling the topologies and dynamics of GRNs using ODEs. In the second part, we will focus on real-data applications. Specifically, we will apply the newly developed tools from the first part to analyze the retinoblastoma (Rb)-E2F gene network, which plays a key role in controlling cell proliferation, and the gene regulation within it.
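As one hedged illustration of how semiparametric ODE estimation can proceed, the sketch below implements the first stage of a standard two-stage gradient-matching strategy: each gene's time course is smoothed with a spline and the derivative dx/dt is read off the smoother, after which (stage 2) the estimated derivatives can be regressed on flexible functions of candidate regulators. The spline smoother and the model form in the comments are assumptions for illustration, not necessarily the estimators developed in this project.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def gradient_matching_stage1(t, X, smooth=1.0):
    """t: (T,) time points; X: (T, G) expression matrix for G genes.
    Smooth each trajectory with a spline and read off dx/dt."""
    fits = [UnivariateSpline(t, X[:, j], s=smooth) for j in range(X.shape[1])]
    X_smooth = np.column_stack([f(t) for f in fits])          # denoised trajectories
    dX = np.column_stack([f.derivative()(t) for f in fits])   # estimated derivatives
    return X_smooth, dX

# Stage 2 (per target gene j): regress dX[:, j] on basis expansions of the
# candidate regulators' smoothed trajectories, e.g. with a sparse linear model,
# and read the inferred regulatory edges off the nonzero coefficients.
```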
2017 — 2020
Cheng, Guang |
Collaborative Research: Nonparametric Bayesian Aggregation For Massive Data
Modern massive data arrive in ever-increasing volume and with high heterogeneity. Examples include internet searches, social networks, mobile devices, satellites, genomics, and medical scans. Bayesian approaches are particularly useful in such contexts because complex structures in the data can be naturally incorporated into Bayesian hierarchical models, and uncertainty quantification can be carried out directly through Bayesian computation. However, due to storage and computational bottlenecks, traditional Bayesian computation implemented on a single machine is no longer applicable to modern massive data. In this project, a set of nonparametric Bayesian aggregation procedures with theoretical justification is developed based on a standard parallel computing strategy known as divide-and-conquer. This research will significantly enhance the availability of Bayesian tools and software for analyzing massive data. The educational plan of the project takes the form of graduate student advising and the offering of special-topics courses.
This project consists of three major components. First, the PIs will establish a Gaussian approximation of general nonparametric posterior distributions, which serves as a theoretical foundation for general distributed Bayesian algorithms. Second, the PIs will develop a nonparametric Bayesian aggregation procedure with theoretical guarantees that is particularly useful for handling massive data in a parallel fashion. Third, the PIs will develop an efficient parallel Markov chain Monte Carlo (MCMC) algorithm for nonparametric Bayesian models that performs as well as traditional MCMC at substantially lower computational cost. This research will lead to the emergence of a "Splitotics (Split+Asymptotics) Theory" providing theoretical guidelines for Bayesian practice. The smoothing spline inference results recently obtained by the PIs will serve as a promising tool for achieving these goals.
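As a minimal baseline for the aggregation step, the sketch below combines MCMC draws from shard-level sub-posteriors by precision-weighted averaging, in the spirit of consensus Monte Carlo; this rule is exact when every sub-posterior is Gaussian, which is what makes the Gaussian approximation in the first component a natural foundation. The nonparametric aggregation procedures developed in the project are more delicate, so treat this as illustration only.

```python
import numpy as np

def aggregate_subposteriors(draws):
    """draws: list of m arrays, each (S, d), of MCMC draws from the
    sub-posterior computed on one data shard. Returns (S, d) aggregated
    draws via precision-weighted averaging of the per-shard draws."""
    precisions = [np.linalg.inv(np.atleast_2d(np.cov(D.T))) for D in draws]
    pooled = np.linalg.inv(sum(precisions))                   # combined covariance
    mixed = sum(P @ D.T for P, D in zip(precisions, draws))   # shape (d, S)
    return (pooled @ mixed).T
```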
2018 — 2021
Cheng, Guang |
CDS&E: Collaborative Research: Scalable Nonparametric Learning For Massive Data With Statistical Guarantees
We now live in an era of data deluge. The sheer volume of data to be processed, together with the growing complexity of statistical models and the increasingly distributed nature of data sources, creates new challenges for modern statistical theory. Standard machine learning methods can no longer accommodate the computational requirements; they need to be redesigned or adapted, which calls for a new generation of scalable learning algorithms for massive data, together with supporting theory. This project aims to provide a collection of state-of-the-art nonparametric learning tools for big-data analysis that can be used directly by scientists and practitioners, with beneficial impacts on fields such as biomedicine, health care, defense and security, and information technology. The deliverables of this project include easy-to-use software packages that will be thoroughly evaluated on a range of application examples and will directly help scientists explore and analyze complex data sets.

Due to storage and computational bottlenecks, traditional statistical inferential procedures designed for a single machine are no longer applicable to modern large datasets. This project aims to design new scalable learning algorithms for a wide range of nonparametric models when data are distributed across a large number of multi-core computational nodes, or via random sketching when only a single machine is available. The computational limits of these new algorithms will be examined from a statistical perspective. For example, in the divide-and-conquer setup, the number of deployed machines can be viewed as a simple proxy for computing cost. The project aims to establish a sharp upper bound on this number: when the number of machines is below this bound, statistical optimality (in terms of nonparametric estimation or testing) is achievable; otherwise, statistical optimality becomes impossible. Related questions will also be addressed for the randomized sketching method in terms of the minimal number of random projections.
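As a concrete baseline for the divide-and-conquer setup, the sketch below averages kernel ridge regression fits computed on m data shards; the statistical question described above is how large m may grow before this averaged estimator loses the optimal nonparametric rate. The kernel and tuning parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def dac_krr(X, y, m, alpha=1e-3, gamma=1.0, seed=0):
    """Fit kernel ridge regression on each of m shards of (X, y) and
    return a predictor that averages the m fitted functions."""
    rng = np.random.default_rng(seed)
    shards = np.array_split(rng.permutation(len(X)), m)
    models = [KernelRidge(alpha=alpha, kernel="rbf", gamma=gamma).fit(X[idx], y[idx])
              for idx in shards]
    return lambda X_new: np.mean([mdl.predict(X_new) for mdl in models], axis=0)
```

Each shard fit costs roughly O((n/m)^3) for the exact kernel solve, so the averaged estimator replaces one O(n^3) computation with m much smaller ones, which is the source of the computational savings.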
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
2018 — 2021
Cheng, Guang; Song, Qifan
High Dimensional Semiparametric Estimation and Inferences
Semiparametric regression models provide data scientists a useful way to analyze complex-structured data sets: they allow researchers to model some features linearly without restricting the effects of the remaining covariates. This flexibility can greatly enhance prediction performance, especially when parametric model assumptions are invalid. In practice, semiparametric modelling has proven useful in many high dimensional applications in biostatistics, econometrics, and neuroscience. In the literature, however, there is a lack of statistical studies on estimation and inference for high dimensional semiparametric models. This project aims to lay a solid theoretical foundation for high dimensional semiparametric analysis in both the frequentist and Bayesian paradigms, and will significantly promote the use of semiparametric analysis for high dimensional complex data.

The project consists of three research components. First, the investigators will establish frequentist estimation theory and obtain new theoretical insights into the asymptotic behavior of estimators in high dimensional semiparametric models. Second, the investigators will develop novel approaches for high dimensional semiparametric inference, such as confidence intervals, and explore related questions of semiparametric efficiency. Third, Bayesian counterparts of the estimation and inference theories will be developed, and the investigators will establish the frequentist validity of Bayesian point and interval estimation. These results will provide important theoretical guidelines for high dimensional semiparametric modeling.
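As one hedged illustration of estimation in this setting, the sketch below fits a partially linear model y = X @ beta + g(z) + noise by first residualizing y and each column of X on z (a Robinson-style step, with a kernel smoother standing in for the conditional expectations) and then applying the lasso to the residuals. The smoother, tuning values, and penalty are assumptions for illustration, not the project's estimator.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Lasso

def partially_linear_lasso(X, z, y, lam=0.1):
    """y = X @ beta + g(z) + noise, with a sparse high dimensional beta.
    Residualize y and each column of X on z, then run the lasso."""
    z = np.asarray(z).reshape(-1, 1)
    smoother = KernelRidge(alpha=1e-2, kernel="rbf", gamma=10.0)
    y_res = y - smoother.fit(z, y).predict(z)                  # y - E[y | z]
    X_res = np.column_stack(
        [X[:, j] - smoother.fit(z, X[:, j]).predict(z) for j in range(X.shape[1])]
    )                                                          # X - E[X | z]
    return Lasso(alpha=lam).fit(X_res, y_res).coef_            # sparse estimate of beta
```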
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
2021 — 2024
Cheng, Guang; Lin, Guang; Chan, Stanley; Honorio, Jean
Collaborative Research: Robust Deep Learning in Real Physical Space: Generalization, Scalability, and Credibility
The vulnerability of deep neural networks to small and imperceptible perturbations is a major challenge in machine learning today. For applications such as autonomous vehicles, security, and medical diagnosis, this weakness has severely limited the deployment of machine learning systems at scale. Existing theoretical studies, while laying a good foundation based on advanced statistical analyses, require various idealistic assumptions that are difficult to validate in real physical environments. Understanding the robustness of deep learning algorithms and their interaction with the real physical environment is therefore a critical step toward a better understanding of explainability, generalization, and trustworthiness. This project aims to close the gap by developing new theories and computer vision systems that can be realistically validated. The outcomes of the research will: create new technologies that can be translated into more secure and reliable commercial products, strengthening the global competitiveness of the United States; produce new trustworthy AI systems that can be deployed in surveillance and defense products to improve national security; expand next-generation workforce capacity through a complete training pipeline spanning K-12 outreach, undergraduate research, graduate mentoring, industry partnership, online learning modules, and curriculum development; broaden participation in STEM by leveraging the accessibility and intrigue of the foundational research concepts in educational outreach targeting female participants from elementary school through graduate school; and promote the exchange of ideas across statistics, theoretical computer science, and image processing.
Robust machine learning in real physical space requires jointly modeling the deep neural networks and the environment in which they operate; research efforts that focus on one domain without interacting with the other are unlikely to solve the problem. The combination of skills in electrical engineering, statistics, and computer science possessed by the Purdue-UCSD team offers a unique opportunity to address it. The team's technical approach is to reformulate the robust adversarial learning problem by incorporating environmental factors. Four specific research objectives will be pursued: (1) parametrizing the physical environment via a hierarchy of deterministic and generative approaches, so that the set of all possible distortions can be constrained; (2) analyzing the generalization bounds of neural networks in the presence of environmental factors and analyzing the credibility of such a system by studying robustness and uncertainty quantification; (3) developing computationally efficient algorithms to seek the equilibrium points of the proposed minimax optimization; and (4) building a computational photography testbed to implement the concepts and validate the theoretical results. On the educational front, the project provides a suite of outreach activities for K-12 students to stimulate their interest in STEM, as well as research opportunities for undergraduates.
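As a baseline for the minimax formulation, the sketch below shows a standard projected-gradient-descent (PGD) inner maximization over an L-infinity ball, which is the unconstrained special case of the physically parametrized distortion sets described above; the perturbation budget and step sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Inner maximization: search for a worst-case perturbation of the
    input batch x within an L-infinity ball of radius eps."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascent step on the loss
            delta.clamp_(-eps, eps)              # project back onto the ball
        delta.grad.zero_()
    return (x + delta).detach()

# Outer minimization (one training step): compute the usual classification
# loss on pgd_attack(model, x, y) instead of on x, then back-propagate.
```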
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.