2005 — 2009
Sinclair, Alistair (co-PI); Mossel, Elchanan; Wainwright, Martin
MSPA-MCS: Markov Random Fields: Structure and Algorithms @ University of California-Berkeley
Markov random fields (MRFs) provide a very general framework for capturing conditional independence in large collections of random variables. The proposal is concerned with the spatial properties of MRFs, and with two widely used algorithmic paradigms for solving inference problems in them: Belief Propagation (BP) and Gibbs Sampling (GS). Despite their popularity, these algorithms (especially BP) lack rigorous performance guarantees in most situations. One major goal of the proposed research is a deeper understanding of the behavior of BP and GS for different classes of MRFs that arise in applications. A central thesis is that the performance of these algorithms is intimately tied to the spatial structure of the underlying MRF. By making this connection precise, the project aims at an improved understanding of the algorithms and their relationship to one another. Based on this insight, a second major goal of the project is to investigate systematic methods for designing MRFs tailored to a specific application; the MRFs should have suitable spatial properties so that algorithms like BP or GS (or variants thereof) are both successful in practice and provably effective.
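To make the Gibbs sampling paradigm concrete, here is a minimal sketch (not taken from the award) of single-site Gibbs updates on a small Ising-type MRF. The grid size, coupling strength, and sweep count are illustrative assumptions.

```python
# A minimal sketch of Gibbs sampling for a small Ising-type Markov random field
# on a 2-D grid; grid size, coupling `beta`, and sweep count are illustrative.
import numpy as np

def gibbs_ising(n=20, beta=0.4, sweeps=100, rng=None):
    """Run single-site Gibbs updates on an n x n grid of +/-1 spins."""
    rng = np.random.default_rng(rng)
    x = rng.choice([-1, 1], size=(n, n))
    for _ in range(sweeps):
        for i in range(n):
            for j in range(n):
                # Sum of the nearest neighbours (free boundary conditions).
                s = 0
                if i > 0:
                    s += x[i - 1, j]
                if i < n - 1:
                    s += x[i + 1, j]
                if j > 0:
                    s += x[i, j - 1]
                if j < n - 1:
                    s += x[i, j + 1]
                # Conditional probability that this spin is +1 given its neighbours.
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * s))
                x[i, j] = 1 if rng.random() < p_plus else -1
    return x

if __name__ == "__main__":
    sample = gibbs_ising(rng=0)
    print("average magnetization:", sample.mean())
```

Each update resamples one variable from its conditional distribution given its neighbors; how quickly such local updates mix is exactly the kind of question that depends on the spatial structure of the field.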
Markov random fields are a rich class of mathematical models that are extremely well suited to capturing the behavior of large systems of independent components whose interactions are best described in statistical terms (rather than in terms of deterministic laws). Such systems are ubiquitous in today's world. As a first example, consider the millions of computers on the internet: the communication time between your computer and a given website is not a fixed quantity, but varies depending on the number of other users, the time of day etc. A second example is the problem of modeling and predicting global climate, which depends on a very large number of factors that interact in statistically variable ways. Modeling such applications with Markov random fields leads to a number of computational problems that---due to their extremely large size---are essentially impossible to solve exactly. The primary goal of this research project is the development and analysis of efficient algorithmic methods for obtaining approximate solutions, with rigorous guarantees on accuracy and running time. In light of the broad range of scientific and engineering contexts in which Markov random fields are used, basic research on these algorithms has the potential for very broad impact in many domains, including modern-day computing and communications infrastructure, intelligent systems for medical diagnosis, and the modeling of complex physical and biological systems.
2006 — 2012
Wainwright, Martin
CAREER: Novel Message-Passing Algorithms For Distributed Computation in Graphical Models: Theory and Applications in Signal Processing @ University of California-Berkeley
Many real-world scientific and engineering systems consist of a large number of interacting subsystems. Examples include wireless sensor networks, in which low-power devices are used to monitor and detect events over an extended spatial region; and data compression in network settings where data is stored in a distributed manner (e.g., large databases distributed over multiple computer servers). Graphical models provide a powerful set of tools for modeling, analyzing, and designing systems of this nature. These models derive their power by combining a probabilistic model (i.e., one in which there is uncertainty or stochasticity in the system operation) with graphs that capture the dependencies among the systems. This research involves developing new algorithms for applying these graphical models to large-scale systems like sensor networks and data compression.
Leveraging the full power of graphical models requires efficient methods for solving a core set of challenges. In this research, the investigator characterizes the limitations of existing algorithms, and moreover develops alternative and ultimately more powerful message-passing algorithms for solving these core computational problems. The following four projects address related aspects of this high-level goal: (a) analysis of provably effective algorithms based on linear programming; (b) novel message-passing algorithms for performing near-optimal lossy data compression; (c) fundamental research on issues of stability and robustness in message-passing; and (d) new methods for automated learning of models from data. These research thrusts are closely coupled with educational initiatives, including recruitment of undergraduates into research; broad dissemination of publicly-available survey papers, tutorial slides and software for graphical models; and the fostering of interaction between Engineering and Statistics via the Designated Emphasis in Communication, Computation and Statistics at UC Berkeley.
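As a toy illustration of the kind of message-passing computation discussed above, the following sketch (an illustrative example, not the project's algorithms) runs sum-product message passing on a chain-structured model with binary variables, where the potentials are arbitrary numbers chosen for demonstration.

```python
# A minimal sketch of sum-product message passing on a chain-structured
# graphical model with binary variables; the potentials are illustrative.
import numpy as np

def chain_marginals(unary, pairwise):
    """unary: (N, 2) node potentials; pairwise: (2, 2) shared edge potential."""
    N = unary.shape[0]
    fwd = np.ones((N, 2))   # fwd[i]: message arriving at node i from the left
    bwd = np.ones((N, 2))   # bwd[i]: message arriving at node i from the right
    for i in range(1, N):
        msg = (unary[i - 1] * fwd[i - 1]) @ pairwise
        fwd[i] = msg / msg.sum()          # normalise for numerical stability
    for i in range(N - 2, -1, -1):
        msg = pairwise @ (unary[i + 1] * bwd[i + 1])
        bwd[i] = msg / msg.sum()
    beliefs = unary * fwd * bwd
    return beliefs / beliefs.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    unary = rng.uniform(0.5, 1.5, size=(6, 2))
    pairwise = np.array([[1.2, 0.8], [0.8, 1.2]])  # mild preference for agreement
    print(chain_marginals(unary, pairwise))
```

On a tree-structured graph such as this chain the computed beliefs are exact marginals; the research questions above concern what happens when the same local updates are run on graphs with cycles.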
2006 — 2010
Yu, Bin; Wainwright, Martin
High-Dimensional Challenges in Statistical Machine Learning: Theory, Models and Algorithms @ University of California-Berkeley
TECHNICAL SUMMARY: This research proposal consists of four closely related research thrusts, all centered around the common goal of an integrated treatment of statistical and computational issues in dealing with high-dimensional data sets arising in information technology (IT). The first two research thrusts focus on fundamental issues that arise in the design of penalty-based and other algorithmic methods for regularization. Key open problems to be addressed include the link between regularization methods and sparsity, consistency and other theoretical issues, as well as structured regularization methods for model selection. Sparse models are desirable both for scientific reasons, including interpretability, and for computational reasons, such as the efficiency of performing classification or regression. The third research thrust focuses on problems of statistical inference in decentralized settings, which are of increasing importance for a broad variety of IT applications such as wireless sensor networks, computer server "farms", and traffic monitoring systems. Designing suitable data compression schemes is the key challenge. On one hand, these schemes should respect the decentralization requirements imposed by the system (e.g., due to limited power or bandwidth for communicating data); on the other hand, they should also be (near-)optimal with respect to a statistical criterion of merit (e.g., Bayes error for a classification task; MSE for a regression or smoothing problem). The fourth project addresses statistical issues centered around the use of Markov random fields, widely used for modeling large collections of interacting random variables, and associated variational methods for approximating moments and likelihoods in such models.
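As a small illustration of a penalty-based method that induces sparsity, the sketch below solves a Lasso problem by iterative soft-thresholding on synthetic data. The data sizes, noise level, and penalty scaling are illustrative assumptions, not settings from the proposal.

```python
# A minimal sketch of sparse, penalty-based regression (the Lasso) solved by
# iterative soft-thresholding (ISTA); data and penalty level are illustrative.
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(X, y, lam, iters=500):
    beta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2     # 1 / Lipschitz constant of the gradient
    for _ in range(iters):
        grad = X.T @ (X @ beta - y)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p, k = 100, 300, 5                      # high-dimensional regime: p > n, k-sparse truth
    X = rng.normal(size=(n, p))
    beta_true = np.zeros(p)
    beta_true[:k] = 3.0
    y = X @ beta_true + 0.5 * rng.normal(size=n)
    lam = np.sqrt(2.0 * n * np.log(p))         # a standard theoretical scaling, here a rough choice
    beta_hat = lasso_ista(X, y, lam)
    print("indices of recovered nonzeros:", np.nonzero(beta_hat)[0][:10])
```

The soft-thresholding step is what produces exact zeros, which is the mechanism linking this class of regularizers to sparse, interpretable models.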
BROAD SUMMARY: The field of statistical machine learning is motivated by a broad range of problems in the information sciences, among them remote sensing, data mining and compression, and statistical signal processing. Its applications range from homeland security (e.g., detecting anomalous patterns in large data sets) to environmental monitoring and assessment (e.g., estimating changes in Arctic ice). A challenging aspect of such applications is that data sets tend to be complex, massive (frequently measured in terabytes of data), and rich in terms of possible features (hundreds of thousands to millions). These characteristics present fundamental challenges in the design and application of statistical models and algorithms for testing hypotheses and performing estimation. Whereas classical statistical methods are designed separately from computational considerations, dealing effectively with extremely high-dimensional data sets requires that computational issues be addressed in a more integrated manner during the design and testing of statistical models, and that issues of over-fitting and regularization, while always statistically relevant, become of paramount importance.
2006 — 2010
Anantharam, Venkatachalam; Nikolic, Borivoje (co-PI); Wainwright, Martin
Theory and Methodology For the Design and Evaluation of High-Performance LDPC Codes @ University of California-Berkeley
The project addresses some of the currently most important basic research challenges in error control coding theory. The bulk of current theory on efficient decoding methods is asymptotic in nature, and thus does not yield useful predictions for the shorter and intermediate blocklengths needed for delay-sensitive applications (e.g., high-speed communications, streaming and peer-to-peer networks). A closely related goal is to gain a deeper understanding of the so-called "error floor" behavior exhibited by certain classes of low-density parity check (LDPC) codes. These error floors seriously limit the usefulness of such codes for very low bit error rate (BER) applications (e.g., high-speed communications, data storage). A current major bottleneck is the lack of systematic and reliable methods to explore this deep BER regime, and we propose to address this challenge through a combination of combinatorial and geometric analysis, hardware-based emulation, and fast stochastic simulation.
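To illustrate the flavor of stochastic BER simulation, the sketch below estimates bit error rates by Monte Carlo for a 3-fold repetition code with soft combining over an AWGN channel. The repetition code is a deliberately simple stand-in for an LDPC code, and the block lengths and SNR points are illustrative assumptions.

```python
# A minimal sketch of Monte Carlo bit-error-rate (BER) estimation; a 3-fold
# repetition code with soft combining stands in for an LDPC decoder here.
import numpy as np

def simulate_ber(snr_db, n_bits=100_000, reps=3, rng=None):
    rng = np.random.default_rng(rng)
    bits = rng.integers(0, 2, size=n_bits)
    symbols = np.repeat(1 - 2 * bits, reps).astype(float)   # BPSK: 0 -> +1, 1 -> -1
    # Noise standard deviation for the given Eb/N0, accounting for the rate-1/reps code.
    ebn0 = 10.0 ** (snr_db / 10.0)
    sigma = np.sqrt(reps / (2.0 * ebn0))
    received = symbols + sigma * rng.normal(size=symbols.size)
    # Soft combining: sum the received values for each bit's repetitions, then slice.
    decisions = (received.reshape(n_bits, reps).sum(axis=1) < 0).astype(int)
    return np.mean(decisions != bits)

if __name__ == "__main__":
    for snr in [0, 2, 4, 6]:
        print(f"Eb/N0 = {snr} dB  estimated BER = {simulate_ber(snr, rng=0):.2e}")
```

Plain Monte Carlo of this kind becomes prohibitively slow in the very low BER regime, which is precisely why the project combines it with hardware emulation and faster stochastic techniques.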
Intellectual merit: The lack of finite-length analysis and the existence of error floors are key bottlenecks slowing down the deployment of high-performance codes in a range of very important applications. Addressing this challenge requires answering a range of deep and fundamental scientific questions, drawing on ideas from polyhedral combinatorics, graph theory, dynamical systems and probability theory. Hardware implementation of message-passing algorithms for high-performance applications is in itself challenging. The research uses the development of high-performance iterative decoders as a design driver for developing a novel emulation-simulation based design approach. This paradigm is novel and poses many fundamental challenges, including the development of fast simulation techniques for gathering error statistics, stochastic adaptive algorithms that exploit error statistics in order to design better codes, and investigation of systematic techniques for efficiently mapping code designs onto a field-programmable gate array (FPGA) platform.
Broader impact: Message-passing algorithms on graphs play a fundamental role across a broad spectrum of scientific and engineering disciplines. The range of applications covers areas as diverse as communication systems, image processing, bioinformatics, statistical physics, and natural language processing, among many others. Consequently, our research, with its explicit goal of gaining a deeper understanding of message-passing algorithms, has the potential for broad impact and dissemination across a variety of areas. In addition, our project has a significant educational and outreach component. Graduate students at Berkeley will be trained in both algorithmic and VLSI design techniques while participating in this research. Students coming out of this program will have a unique skill set that spans both theory and implementation, thereby constituting an invaluable addition to the nation's technical workforce. We are also very active in promoting undergraduate research and in facilitating research experiences for students from historically underrepresented communities. The overall thrust of this project has many well-defined subprojects, which makes it very well-suited to such outreach activity at the undergraduate level.
2009 — 2013
Wainwright, Martin; Bloom, Joshua; El Karoui, Noureddine (co-PI)
CDI-Type II: Real-Time Classification of Massive Time-Series Data Streams @ University of California-Berkeley
The collection of new data in any discipline does not, in general, lead to the creation of new knowledge. With the current data deluge, the human role in scientific discovery, traditionally so important, must now be partially fulfilled by powerful algorithms. However, current tools and technology start to break down when discovery and understanding, by the very nature of the science at hand, must happen quickly and in near real-time.
New astronomical surveys coming online in the next few years, many observing the same regions of the sky repeatedly in time, will collect more data in the next decade than in all of human history so far. Opening up truly new vistas on the dynamic universe requires both rapid data processing and quick decisions about what available resources (e.g., telescopes) worldwide must be marshalled to study newly discovered phenomena. This necessitates an intelligent "real-time" machine-based decision or "classification" framework that should be able to deal with incomplete (and in some cases spurious) information.
This project will produce a framework for extracting novel science from large amounts of data in an environment where the computational needs vastly outweigh the available facilities, and intelligent (as well as dynamic) resource allocation is required. New theory will be developed that will allow current machine learning paradigms to scale to large parallel computing environments. The core result is the production, for projects generating thousands of gigabytes of new data a night (such as the proposed Large Synoptic Survey Telescope), of probabilistic statements about the physical nature of astronomical events. Uncovering anomalous events that do not fit easily into a currently accepted classification taxonomy - events that may lead to completely new scientific discoveries - will be particularly emphasized in this work.
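As a schematic illustration of real-time, one-pass classification on a data stream, the sketch below trains an online logistic-regression classifier that predicts each new event before updating on it. The synthetic feature stream, learning rate, and model are illustrative assumptions, not the project's pipeline.

```python
# A minimal sketch of streaming ("one pass") classification via online logistic
# regression with stochastic gradient updates; all settings are illustrative.
import numpy as np

class OnlineLogistic:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def predict_proba(self, x):
        return 1.0 / (1.0 + np.exp(-x @ self.w))

    def update(self, x, y):
        """Single gradient step on the log-loss for one observation (y in {0,1})."""
        self.w += self.lr * (y - self.predict_proba(x)) * x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 10
    w_true = rng.normal(size=dim)
    model, errors = OnlineLogistic(dim), 0
    for t in range(1, 5001):                                 # simulated event stream
        x = rng.normal(size=dim)
        y = int(x @ w_true + 0.5 * rng.normal() > 0)
        errors += int((model.predict_proba(x) > 0.5) != y)   # predict first, then learn
        model.update(x, y)
    print("cumulative error rate:", errors / 5000)
```

The predict-then-update loop mirrors the real-time constraint described above: each event must be acted on before the full data set, or even the full feature vector, is available.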
Building these computational tools now with concrete scientific returns in mind will form the foundation for more rapid transformative applications in other fields with similar demands and constraints (high-frequency financial data, robotics, medical signal monitoring, geophysics, weather, and particle physics). This endeavor will also serve for years as a training ground for students and researchers across several departments and disciplines, and will broaden their scope towards a truly interdisciplinary education. By exposing students in the physical sciences to cutting-edge computer science and machine learning concepts, this project will provide a framework for computational thinking that will lead to future innovation.
2009 — 2012
Yu, Bin; Wainwright, Martin
Inference in High-Dimension: Statistics, Computation and Information Theory @ University of California-Berkeley
This research proposal consists of two related research thrusts, both centered around the common goal of an integrated treatment of statistical and computational issues. The first research thrust concerns various issues associated with the use of structured regularization methods in high-dimensional inference. Such types of regularization are natural in different settings, including estimation of structured covariance matrices, graphical model selection, and hierarchical data modeling. The researchers propose to provide sharp characterizations of when structured regularization, with its typically higher computational costs, is guaranteed to yield improvements in statistical efficiency, or conversely, when structured regularization might impair statistical efficiency. The second research thrust addresses the role of statistical stability in optimization, and the development of new methodology for choosing path length parameters in iterative algorithms.
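One simple instance of structured regularization is the group lasso, whose proximal operator shrinks whole blocks of coefficients to zero. The sketch below runs a proximal-gradient loop with that operator on synthetic data; the grouping, penalty level, and iteration budget are illustrative assumptions, not the proposal's methodology.

```python
# A minimal sketch of structured regularization: proximal-gradient descent with
# the group-lasso (block soft-thresholding) operator; all settings illustrative.
import numpy as np

def group_soft_threshold(v, groups, tau):
    """Shrink each group of coefficients toward zero as a block."""
    out = np.zeros_like(v)
    for g in groups:
        norm = np.linalg.norm(v[g])
        if norm > tau:
            out[g] = (1.0 - tau / norm) * v[g]
    return out

def group_lasso(X, y, groups, lam, iters=300):
    beta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(iters):
        grad = X.T @ (X @ beta - y)
        beta = group_soft_threshold(beta - step * grad, groups, step * lam)
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(80, 40))
    groups = [list(range(i, i + 4)) for i in range(0, 40, 4)]    # ten groups of four
    beta_true = np.zeros(40)
    beta_true[0:4] = 2.0
    beta_true[8:12] = -1.5
    y = X @ beta_true + 0.3 * rng.normal(size=80)
    beta_hat = group_lasso(X, y, groups, lam=30.0)
    print("active groups:", [i for i, g in enumerate(groups) if np.any(beta_hat[g] != 0)])
```

Compared with an entrywise penalty, the block shrinkage selects or discards entire groups at once, which is the kind of structural prior whose statistical payoff (relative to its computational cost) the first thrust aims to characterize.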
Statistical inference problems of a high-dimensional nature---meaning problems in which the number of observations n is comparable to or even smaller than the number of parameters p---are ubiquitous throughout various areas of science and engineering, among them genomics, neuroscience, remote sensing, natural language processing, data compression, financial time series, and statistical signal processing. As a concrete instance, consider the problem of estimating the structure of a social network consisting of a large number of individuals (p could be 10,000 or larger) based on a relatively small number of snapshots (n could be 100 to 1,000). The overarching theme of the proposal is the development of new methodology and theory for high-dimensional data. Given the ubiquity of such data, these developments have the potential to impact a variety of fields making use of statistical modeling and tools, among them information technology (IT), neuroscience, remote sensing, and data compression. Moreover, the proposal is inter-disciplinary in nature, and so has the potential to strengthen bridges between statisticians and researchers in other departments (e.g., computer science, electrical and civil engineering) also working on IT applications. Via this type of intellectual unification, the proposed research is likely to have broader impact---much beyond any specific technical contributions---in terms of bridging different research communities, and providing broad training to graduate students and postdoctoral researchers.
2011 — 2015
Yu, Bin; Wainwright, Martin
Sparse and Structured Networks: Statistical Theory and Algorithms @ University of California-Berkeley
The proposal focuses on Markov random fields and directed graphical models, classes of statistical models that are based on a marriage between graph theory and probability theory, and allow for flexible modeling of network-structured data. The core of the proposal consists of multiple research thrusts, all centered around the goal of developing practical algorithms and theory for statistical estimation with network-structured data. One research thrust concerns various issues associated with model selection in undirected graphical models, also known as Gibbs distributions or Markov random fields. Problems include determining the information-theoretic limitations of graphical model selection in high dimensions (where the number of vertices may be larger than the sample size), not only for i.i.d. data but also for dependent data; developing methods for tracking sequences of networks that evolve over time; and developing methods for data with hidden variables. Another research thrust concerns various statistical problems associated with directed acyclic graphical structures (DAGs), including estimating equivalence classes of DAGs in the high-dimensional setting; estimating causal relationships via designed interventions; and efficient computational methods for DAG selection using the Lasso and related methods. Overall, the proposed research is inter-disciplinary in nature, drawing on techniques from mathematical statistics, convex optimization, information theory, concentration of measure, and graph theory.
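As a small worked example of undirected graphical model selection, the sketch below fits an l1-penalized precision matrix (the graphical lasso) to simulated chain-structured Gaussian data and reads off edges from its nonzero off-diagonal entries. The data-generating model, penalty level, and the use of scikit-learn's GraphicalLasso as an off-the-shelf solver are all illustrative choices, not the proposal's methods.

```python
# A minimal sketch of graphical model selection for Gaussian data via an
# l1-penalised precision-matrix estimate; data and penalty are illustrative.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
n, p = 500, 8
# Chain-structured Gaussian data: each variable is directly coupled to its predecessor.
x = np.zeros((n, p))
x[:, 0] = rng.normal(size=n)
for j in range(1, p):
    x[:, j] = 0.6 * x[:, j - 1] + rng.normal(size=n)

model = GraphicalLasso(alpha=0.1).fit(x)
# Nonzero off-diagonal entries of the estimated precision matrix are the
# estimated edges of the underlying network.
edges = (np.abs(model.precision_) > 1e-3) & ~np.eye(p, dtype=bool)
print(edges.astype(int))
```

For this chain the true graph is a path, so an accurate estimate shows a tridiagonal pattern; the high-dimensional questions above ask when such recovery is possible as the number of vertices grows relative to the sample size.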
Science and engineering abound with different types of networks. Examples include social networks such as Facebook and Twitter, networks of genes and proteins in molecular biology, network models for economic and market dynamics, neural networks in brain imaging, networks of disease transmission in epidemiology, and information networks in law enforcement. In the real world, the structure of the underlying network is not known; instead one observes samples of the network behavior (e.g., packet counts in a computer network; instances of infection at given times during an epidemic; emails or text messages sent among a group of people), and the goal is to infer the network structure. Methods for solving this network inference problem have a broad range of applications. Examples include inferring brain connectivity and disease etiology in neuroimaging studies, detecting terrorist cells in social networks, monitoring intrusions in computer networks, and understanding the basis of gene-protein interactions in systems biology.
2013 — 2017
Wainwright, Martin
CIF: Medium: Collaborative Research: New Approaches to Robustness in High-Dimensions @ University of California-Berkeley
Rapid development of large-scale data collection technology has ignited research into high-dimensional machine learning. For instance, the problem of designing recommender systems, such as those used by Amazon, Netflix and other on-line companies, involves analyzing large matrices that describe users' behavior in past situations. In sociology, researchers are interested in fitting networks to large-scale data sets involving hundreds or thousands of individuals. In medical imaging, the goal is to reconstruct complicated phenomena (e.g., brain images; videos of a beating heart) based on a minimal number of incomplete and possibly corrupted measurements. Motivated by such applications, the goal of this research is to develop and analyze models and algorithms for extracting relevant structure from such high-dimensional data sets in a robust and scalable fashion.
The research leverages tools from convex optimization, signal processing, and robust statistics. It consists of three main thrusts: (1) Model restrictiveness: Successful methods for high-dimensional data exploit low-dimensional structure; however, many real-world problems fall outside the scope of existing models. This proposal significantly extends the basic set-up by allowing for multiple structures, leading to computationally efficient algorithms while eliminating negative effects of model mismatch. (2) Non-ideal data: Missing data are prevalent in real-world problems, and can cause major breakdowns in standard algorithms for high-dimensional data. The second thrust devises relaxations and greedy approaches for these non-convex problems. (3) Arbitrary Outliers: Gross errors can arise for various reasons, including fault-prone sensors and manipulative agents. The third thrust proposes efficient and randomized algorithms to address arbitrary outliers.
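To illustrate the "non-ideal (missing) data" setting in the low-rank case, the sketch below completes a matrix from a random subset of its entries by iterated singular-value thresholding (a soft-impute style heuristic). The matrix size, rank, observation rate, and threshold are illustrative assumptions, and this is one simple approach rather than the methods proposed above.

```python
# A minimal sketch of low-rank matrix completion from incomplete entries by
# iterated singular-value thresholding; all problem settings are illustrative.
import numpy as np

def svt_complete(observed, mask, tau=5.0, iters=200):
    """observed: matrix whose entries are trusted only where mask is True."""
    estimate = np.zeros_like(observed)
    for _ in range(iters):
        # Keep the observed entries, fill the rest with the current estimate,
        # then shrink the singular values of the filled-in matrix.
        filled = np.where(mask, observed, estimate)
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        estimate = (u * np.maximum(s - tau, 0.0)) @ vt
    return estimate

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, r = 60, 50, 3
    truth = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))    # rank-3 matrix
    mask = rng.random((m, n)) < 0.5                              # observe roughly half the entries
    recovered = svt_complete(np.where(mask, truth, 0.0), mask)
    err = np.linalg.norm(recovered - truth) / np.linalg.norm(truth)
    print(f"relative recovery error: {err:.3f}")
```

The same low-rank structure underlies the recommender-system example above, where most entries of the user-item matrix are unobserved.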
2016 — 2019
Wainwright, Martin
Statistical Estimation in Resource-Constrained Environments: Computation, Communication and Privacy @ University of California-Berkeley
The past decade has witnessed an explosion in the scale and richness of data sets that arise in both science and engineering. A wide variety of application areas have led to large-scale data sets. Examples include social networks such as Facebook, on-line recommender systems such as Amazon and Netflix, neuroscience data including fMRI, EEG, and brain-machine interfaces, and image/video processing, which includes face recognition, surveillance, and security. All of these areas require effective methods for statistical inference---that is, methods that lead to actionable conclusions from the data. The classical approach in statistics is to study inference algorithms without consideration of their computational and storage requirements; this approach leads to many methods that simply cannot be implemented for large-scale problems. The goal of this research is to develop a principled framework for characterizing the fundamental limits of statistical estimation under computational and storage constraints. This shift in perspective will lead to the development of new and computationally efficient methods for statistical estimation in resource-constrained environments.
While the notion of minimax risk characterizes the fundamental limits of statistical estimation, it is based on taking an infimum over all measurable functions of the data, thereby allowing for estimators that have exponential computational complexity, require prohibitive amounts of storage, and/or reveal sensitive data. The goal of this proposal is to study various constrained forms of statistical minimax based on limiting the class of possible estimators. The proposed work is interdisciplinary in nature, combining ideas from mathematical statistics, information theory, optimization theory, and computational complexity. The first research thrust concerns the tradeoffs between computational costs and statistical accuracy. The main goal is to understand when there are gaps between the classical minimax risk and the lowest risk achievable by algorithms that run in polynomial time. Specific model classes of interest include high-dimensional forms of sparse regression, sparse principal component analysis, and classification problems in neural networks. The second research thrust focuses on estimation in distributed settings. Many data sets are so large that they cannot be stored at a single central location, but instead must be split into many pieces and stored on separate machines that can communicate only relatively small amounts of information. Thus, an important problem is to characterize the minimal amount of communication needed for a distributed implementation to match the performance of the centralized estimator.
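A toy example of distributed estimation under a communication budget: each machine keeps its raw data locally and transmits only a single local average, and the server averages those summaries. The data model and machine counts below are illustrative assumptions; for mean estimation with equal-sized shards this tiny amount of communication happens to match the centralized estimator exactly, which is the kind of phenomenon the second thrust seeks to delineate in general.

```python
# A minimal sketch of communication-limited distributed estimation: each machine
# sends one number (its local mean); the data model is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma = 2.0, 1.0
machines, per_machine = 20, 500

local_summaries = []
for _ in range(machines):
    local_data = true_mean + sigma * rng.normal(size=per_machine)   # never leaves the machine
    local_summaries.append(local_data.mean())                       # one float is communicated

distributed_estimate = np.mean(local_summaries)   # equals the full-data average here
print("distributed estimate of the mean:", distributed_estimate)
```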
2019 — 2022
Bartlett, Peter (co-PI); Wainwright, Martin
RI: AF: Small: Optimizing Probabilities For Learning: Sampling Meets Optimization @ University of California-Berkeley
Methods for large-scale machine learning and artificial intelligence (AI) have had major impacts on the world over the past decade, in both industrial and scientific contexts. These spectacular successes are driven by a combination of the availability of massive datasets and appropriate models and algorithms for extracting useful information and insights from these datasets. This research project aims to advance the methodology and understanding of algorithms for large-scale machine learning and AI by exploiting the interplay between sampling and optimization. In particular, two grand challenges are addressed: first, using the tools and insights of optimization theory to develop more effective design and analysis techniques for sampling methods; and second, using these techniques to design and analyze optimization methods for problems such as those that arise in deep learning. Successful research outcomes of this project are likely to increase the understanding of methods used for sampling and for optimization, and to facilitate their principled design. Such outcomes also have significant potential for practical impact in the large and growing set of applications where large-scale sampling and optimization methods are used, including computer vision, speech recognition, and self-driving cars. The research will support the development of graduate students, will be disseminated through large graduate courses at Berkeley and their web-based course materials, and has the potential to benefit the broader community through the application of the methods studied in deployed AI systems.
The project has three main technical directions. First, it aims to identify the inherent difficulty of sampling problems by proving lower bounds. Second, it aims to produce analysis tools and design methodologies for sampling algorithms based on a family of stochastic differential equations known as Langevin diffusions. This will enable the development of sampling algorithms with performance guarantees. Third, it will use the viewpoint of sampling techniques to analyze and design stochastic gradient methods for nonconvex optimization problems, such as the optimization of parameters in deep neural networks. An additional outcome of the project will be the organization of a workshop on the interface between sampling and optimization.
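To make the Langevin connection concrete, here is a minimal sketch of the unadjusted Langevin algorithm (ULA), the Euler discretization of the Langevin diffusion, run on a simple Gaussian target. The target density, step size, and iteration count are illustrative assumptions.

```python
# A minimal sketch of the unadjusted Langevin algorithm (ULA): a discretised
# Langevin diffusion used as a sampler; target and settings are illustrative.
import numpy as np

def ula(grad_log_density, x0, step=0.01, iters=20_000, rng=None):
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(iters):
        noise = rng.normal(size=x.shape)
        # Gradient step on log pi plus injected Gaussian noise of matched scale.
        x = x + step * grad_log_density(x) + np.sqrt(2.0 * step) * noise
        samples.append(x.copy())
    return np.array(samples)

if __name__ == "__main__":
    # Target: a standard 2-D Gaussian, for which grad log pi(x) = -x.
    chain = ula(lambda x: -x, x0=[5.0, -5.0], rng=0)
    burned = chain[5000:]                    # discard an initial transient
    print("sample mean:", burned.mean(axis=0))
    print("sample covariance:\n", np.cov(burned.T))
```

Viewing the update as noisy gradient ascent on the log-density is exactly the sampling-optimization interplay described above: the same iteration, with the noise term removed, is gradient descent on the negative log-density.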
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
2020 — 2025
Yu, Bin; Jordan, Michael (co-PI); Bartlett, Peter; Wainwright, Martin; Hug, Josh
Foundations of Data Science Institute @ University of California-Berkeley
The Foundations of Data Science Institute (FODSI) brings together a large and diverse team of researchers and educators from UC Berkeley, MIT, Boston University, Bryn Mawr College, Harvard University, Howard University, and Northeastern University, with the aim of advancing the theoretical foundations for the field of data science. Data science has emerged as a central science for the 21st century, a widespread approach to science and technology that exploits the explosion in the availability of data to allow empirical investigations at unprecedented scale and scope. It now plays a central role in diverse domains across all of science, commerce and industry. The development of theoretical foundations for principled approaches to data science is particularly challenging because it requires progress across the full breadth of scientific issues that arise in the rich and complex processes by which data can be used to make decisions. These issues include the specification of the goals of data analysis, the development of models that aim to capture the way in which data may have arisen, the crafting of algorithms that are responsive to the models and goals, an understanding of the impact of misspecifications of these models and goals, an understanding of the effects of interactions, interventions and feedback mechanisms that affect the data and the interpretation of the results, concern about the uncertainty of these results, an understanding of the impact of other decision-makers with competing goals, and concern about the economic, social, and ethical implications of automated data analysis and decision-making. To address these challenges, FODSI brings together experts from many cognate academic disciplines, including computer science, statistics, mathematics, electrical engineering, and economics. Institute research outcomes have strong potential to directly impact the many application domains for data science in industry, commerce, science and society, facilitated by mechanisms that directly involve a stream of institute-trained personnel in industrial partners' projects, and by public activities designed to nurture substantive interactions between foundational and use-inspired research communities in data science. The institute also aims to educate and mentor future leaders in data science, through the further development of a pioneering undergraduate program in data science, and by training a diverse cohort of graduate students and postdocs with an innovative approach that emphasizes strong mentorship, flexibility, and breadth of collaboration opportunities. In addition, the institute plans to host an annual summer school that will deliver core curriculum and a taste of foundational research to a diverse group of advanced undergraduates, graduate students, and postdocs. It aims to broaden participation and increase diversity in the data science workforce, bringing the excitement of data science to under-represented groups at the high school level, and targeting diverse participation in the institute's public activities. And it will act as a nexus for research and education in the foundations of data science, by convening public events, such as summer schools and research workshops and other collaborative research opportunities, and by providing models for education, human resource development, and broadening participation.
The scientific focus of the institute will encompass the full range of issues that arise in data science -- modeling issues, inferential issues, computational issues, and societal issues -- and the challenges that emerge from the conflicts between their competing requirements. Its research agenda is organized around eight themes. Three of these themes focus on key challenges arising from the rich variety of interactions between a decision maker and its environment, not only the classical view of data that is processed in a batch or a stream, but also sequential interactions with feedback (the control perspective), experimental interactions designed to answer "what if" questions (the causality perspective), and strategic interactions involving other actors with conflicting goals (the economic perspective). The other research themes focus on opportunities for major impacts across disciplinary boundaries: on elucidating the algorithmic landscape of statistical problems, and in particular the computational complexity of statistical estimation problems; on sketching, sampling, and sub-linear time algorithms designed to address issues of scalability in data science problems; on exploiting statistical methodology in the service of algorithms; and on using breakthroughs in applied mathematics to address computational and inferential challenges. Intellectual contributions to societal issues in data science will feature throughout this set of themes. The institute will exploit strong connections with its scientific and industrial partners to ensure that these research directions enjoy a rich engagement with a broad range of commercial, technological and scientific application domains. Its sequence of research workshops and a collaborative research program will serve the broader research community by nurturing additional research in these key challenge areas. The institute will be led by a steering committee that will seek the help of an external advisory board to prioritize its research themes and activities throughout its lifetime. Its educational programs will include curriculum development from K-12 through undergraduate, a graduate level visit program, and a postdoc training model, aimed at empowering the next generation of leaders to fluidly work across conventional disciplinary boundaries while being mindful of the full set of scientific issues. The institute will undertake a multi-pronged effort to recruit, engage and support the full range of groups traditionally under-represented in mathematics, computer science and statistics.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
2020 — 2023
Wainwright, Martin
Iterative Algorithms For Statistics: From Convergence Rates to Statistical Accuracy @ University of California-Berkeley
Science, engineering, and industry are all being revolutionized by the modern era of data science, in which increasingly large and rich forms of data are now available. The applications are diverse and broadly significant, including data-driven discovery in astronomy, statistical machine learning approaches to drug design, and decision-making in robotics and automated driving, among many others. This grant supports research on techniques and models for learning from such massive datasets, leading to computationally efficient algorithms that can be scaled to the large problem instances encountered in practice. The PI plans to integrate research and education through the involvement of graduate students in the research, the inclusion of the research results in courses at UC Berkeley and in publicly available web-based course materials, as well as in mini-courses at summer schools and workshops. This project will also provide mentoring and support for graduate students and postdocs who are female or who belong to underrepresented minority (URM) communities.
Many estimates in statistics are defined via an iterative algorithm applied to a data-dependent objective function (e.g., the EM algorithm for missing data and latent variable models; gradient-based methods and Newton's method for M-estimation; boosting algorithms used in non-parametric regression). This project comprises several research thrusts centered around exploiting the dynamics of these algorithms in order to answer statistical questions, with applications to statistical parameter estimation; selection of the number of components in a mixture model; and optimal bias-variance trade-offs in non-parametric regression. In more detail, the aims of this project include (i) providing a general analysis of the EM algorithm for non-regular mixture models and related singular problems, in which very slow (sub-geometric) convergence is typically observed; (ii) developing a principled method for model selection based on the convergence rate of EM, and proving theoretical guarantees on its performance; (iii) developing a general theoretical framework for combining the convergence rate of an algorithm with bounds on its (in)stability so as to establish bounds on the statistical estimation error; and (iv) providing a complete analysis of the full boosting path for various types of boosting updates, including kernel boosting as well as gradient-boosted regression trees, and analyzing the "overfitting" regime, elucidating conditions under which overfitting does or does not occur.
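For concreteness, the sketch below runs the EM iteration for a symmetric two-component Gaussian mixture with known unit variances and means +theta and -theta, a standard textbook instance of the kind of data-dependent iteration studied here. The data-generating value of theta, the initialization, and the iteration budget are illustrative assumptions.

```python
# A minimal sketch of EM for a symmetric two-component Gaussian mixture
# 0.5*N(theta,1) + 0.5*N(-theta,1); all settings are illustrative.
import numpy as np

def em_symmetric_mixture(x, theta0=1.0, iters=100):
    theta = theta0
    for _ in range(iters):
        # E-step: posterior probability that each point came from the +theta component,
        # using log N(x; theta, 1) - log N(x; -theta, 1) = 2 * theta * x.
        gamma = 1.0 / (1.0 + np.exp(-2.0 * theta * x))
        # M-step: maximise the expected complete-data log-likelihood in theta.
        theta = np.mean(gamma * x - (1.0 - gamma) * x)
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, theta_true = 2000, 1.5
    labels = rng.integers(0, 2, size=n) * 2 - 1                  # +/-1 with equal probability
    x = theta_true * labels + rng.normal(size=n)
    print("EM estimate of theta:", em_symmetric_mixture(x))
```

When theta_true is well separated from zero the iterates converge geometrically, whereas near the singular case theta_true = 0 the convergence slows dramatically; contrasting these two regimes is the flavor of aim (i) above.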
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.