2015 — 2018
Saha, Barna
CRII: AF: Scaling Up Dynamic Programming For Certain Optimization Problems @ University of Massachusetts Amherst
Dynamic programming is one of the most fundamental and systematic techniques for algorithm design and analysis. Pioneered by Richard Bellman in the 1950s for applications in engineering control theory, the method has since become extremely popular for a huge variety of optimization problems in computer science, applied mathematics, engineering, biology, and economics. However, dynamic programming algorithms typically have high polynomial time complexity--quadratic, cubic, or even higher--and high space requirements as well. This significantly limits the practicality of the method. Any advancement in the scalability of a generic tool like dynamic programming is likely to have a huge impact across different fields.
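To make the quadratic time and space cost concrete, here is a minimal sketch (in Python, purely illustrative and not part of the award) of the classic edit-distance dynamic program over two sequences, a basic pattern-recognition problem over sequences of the kind this project targets; its table-filling loop uses Theta(nm) time and, in this naive form, Theta(nm) space.

```python
# Minimal illustration of a classic quadratic dynamic program:
# edit distance between two strings a and b.
# Time and space are both Theta(len(a) * len(b)) in this naive form.

def edit_distance(a: str, b: str) -> int:
    n, m = len(a), len(b)
    # dp[i][j] = edit distance between the prefixes a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                      # delete all of a[:i]
    for j in range(m + 1):
        dp[0][j] = j                      # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match or substitution
    return dp[n][m]

if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))  # 3
```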
There has been a significant effort to accelerate dynamic programming over the last several decades; however, the gains have mostly been limited to poly-logarithmic factor improvements at best (e.g., the Four Russians method). This limitation is primarily due to the focus on developing exact optimal algorithms. On the other hand, there are many applications where one may relax the need for optimality and instead settle for an approximate solution. Allowing even a small amount of suboptimality may lead to a huge reduction in time and space requirements.
The principal goal of this project is to introduce new techniques that turn dynamic programming into a highly scalable, generic toolkit for designing algorithms. Understanding the trade-off between approximability and time complexity is a major goal. Achieving it will require a new suite of mathematical tools, since existing methods mostly give only a poly-logarithmic improvement in time complexity, or apply only when the data satisfies additional local properties. Randomization will play a crucial role in this project. Various randomized sketching methods and compression techniques will be considered, incorporating fresh tools from information theory, graph theory, and the probabilistic method. While the major focus is on improving running time, the approaches developed under this project will also directly reduce space usage. Finally, a class of basic problems in scheduling and pattern recognition over sequences will be studied; these are not only of theoretical interest, but also have wide application in large-scale data management, bioinformatics, and sustainable computing.
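As a point of reference for the space-reduction theme, a standard textbook trick already cuts the space of the quadratic table above from Theta(nm) to Theta(min(n, m)) by keeping only two rows; the sketch below (again illustrative Python, not the randomized sketching or compression machinery proposed here) shows that variant.

```python
# Standard linear-space variant of the same edit-distance recurrence:
# only the previous row of the table is needed, so Theta(n*m) space
# drops to Theta(min(n, m)).  This is a textbook trick, shown only for
# contrast with the randomized techniques the project proposes.

def edit_distance_linear_space(a: str, b: str) -> int:
    if len(b) > len(a):
        a, b = b, a                       # keep the shorter string as the row
    prev = list(range(len(b) + 1))        # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    print(edit_distance_linear_space("kitten", "sitting"))  # 3
```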
The proposed work will significantly improve the state of the art in the design of scalable algorithms for major optimization problems arising in Big Data domains. The algorithms will be implemented and tested on a variety of publicly available data sets from bioinformatics, industrial cloud cluster usage logs, dynamic network routing, and other domains. The PI's close connections with industry will enable collaborations with practitioners and, where possible, adoption of the developed methodologies. The PI has experience in mentoring minority students and is committed to involving under-represented minorities, at both the graduate and undergraduate levels, in cutting-edge research.
2017 — 2023
Saha, Barna
CAREER: Efficient Fine-Grained Algorithms @ University of Massachusetts Amherst
Ever since the inception of computation, the fundamental questions have been: what problems can be solved by computers, and which of them can be solved in a reasonable amount of time? This led to the distinction between polynomial-time solvable problems, which are considered efficient, and NP-hard problems, which take a prohibitively long time to solve. The field of approximation algorithms, along with the theory of inapproximability, studies which NP-hard problems can be solved efficiently when approximate solutions are allowed. However, this crude distinction of algorithmic efficiency--polynomial versus NP-hard--is insufficient when handling today's large scale of data. We need finer-grained design and analysis of algorithms that pinpoints the exact exponent of the polynomial running time, and a better understanding of when a speed-up is not possible. Unfortunately, except for a few problem-specific innovations, the study of such algorithms is largely lacking in the literature.
This project aims to build a unified theory of fine-grained algorithm design to study fast approximation algorithms and their fine-grained hardness. Developing systematic techniques that emphasize the trade-offs among running time, approximation, and randomness, and that aid in designing low-complexity parallel algorithms, will significantly improve the state of the art. Moreover, motivated by core machine learning applications, the project proposes an alternative model of efficiency via classical query complexity. Tools from classical query complexity have previously inspired the development of fast algorithms and vice versa; the PI expects similar connections to arise in this project. Elements of this endeavor will be integrated into new courses, many of the algorithms developed herein will be implemented, and the PI's close connections with industry may lead to the methodologies being adopted in practice.
The project will address a suite of important optimization problems, from long-standing open questions to modern problems with diverse applications. It will introduce fresh tools from additive combinatorics, Fourier analysis, rate-distortion theory, and circuit complexity for analyzing algorithms and establishing lower bounds. Specifically, new generic techniques--amnesic dynamic programming, fast matrix products over semirings, and the low-degree polynomial method--will be developed to design fine-grained approximation algorithms and their parallel counterparts. Time complexity may not always be the primary measure of efficiency: there are many core machine learning problems where query complexity, which quantifies the amount of labeled data acquired via active querying, is more important. The project will analyze for the first time the query complexity of basic learning problems, and explore its connections to developing fast algorithms.
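For concreteness, the "fast matrix products over semirings" ingredient refers to products such as the (min, +) distance product over the tropical semiring; the sketch below (illustrative Python, showing only the naive cubic-time operation that fine-grained techniques aim to speed up or approximate, not the project's new algorithms) spells out what that product computes.

```python
import math

# Naive (min, +) matrix product over the tropical semiring:
# C[i][j] = min over k of A[i][k] + B[k][j].
# This runs in Theta(n^3) time; fine-grained techniques target faster
# exact or approximate versions of exactly this operation.

def min_plus_product(A, B):
    n, m, p = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "inner dimensions must match"
    C = [[math.inf] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a_ik = A[i][k]
            if a_ik == math.inf:
                continue
            for j in range(p):
                if a_ik + B[k][j] < C[i][j]:
                    C[i][j] = a_ik + B[k][j]
    return C

if __name__ == "__main__":
    INF = math.inf
    # Weighted adjacency matrix of a small directed graph.
    D = [[0,   2,   INF],
         [INF, 0,   3],
         [1,   INF, 0]]
    # One (min, +) squaring gives shortest paths using at most two edges.
    print(min_plus_product(D, D))
```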
2019 — 2022
Mazumdar, Arya; Saha, Barna
CIF: Small: New Directions in Clustering: Interactive Algorithms and Statistical Models @ University of Massachusetts Amherst
Clustering, or partitioning data points based on pairwise similarities, is one of the most important problems in unsupervised learning and data mining. Clustering algorithms often suffer from low accuracy and high time complexity, both because they must process massive amounts of noisy data and because there is no benchmark against which to validate the clustering output. This project studies new directions for clustering algorithms and their fundamental limits by incorporating several new technical tools from information theory. Scalable algorithms for clustering with theoretical guarantees and practical validation should have high impact across all areas of data science.
This project aims to use interactive algorithms that adaptively gather labeled data to improve the accuracy of clustering and related machine learning problems. In recent years, the widespread use of crowdsourcing has facilitated the development of such algorithms. The algorithms that will be developed, as well as the lower-bounding methods, extend to a large class of machine learning problems and help in understanding the trade-off between information-theoretic optimality and computational efficiency. A generalized setup that accounts for erroneous interactions, the use of similarity measures, overlapping clusters, and parallel interactions in the design of algorithms is being pursued. Connections of this setup with well-known random graph community models, such as the stochastic block model, will also be explored in this project. New statistical models, such as the geometric block model, will be studied as alternatives to the popular stochastic block model. Algorithms for community recovery in the geometric block model, along with information-theoretic lower bounds, will be developed. This project will also develop statistical models that faithfully capture the properties of real networks to provide a benchmark for testing clustering algorithms. These include statistical models for community detection and interactive clustering.
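For readers unfamiliar with the benchmark, the sketch below (illustrative Python with made-up parameters, not code from the project) samples a two-community stochastic block model, the random-graph model the abstract refers to: vertices in the same community are joined with probability p, and vertices in different communities with a smaller probability q.

```python
import random

# Minimal sampler for a two-community stochastic block model (SBM).
# Same-community pairs are connected with probability p, cross-community
# pairs with probability q < p.  Parameters here are illustrative only.

def sample_sbm(n: int, p: float, q: float, seed: int = 0):
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]        # two equal-size communities
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            prob = p if labels[i] == labels[j] else q
            if rng.random() < prob:
                edges.append((i, j))
    return labels, edges

if __name__ == "__main__":
    labels, edges = sample_sbm(n=20, p=0.8, q=0.1)
    print(len(edges), "edges sampled; first few:", edges[:5])
```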
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
2019 — 2022 |
Katsoulakis, Markos (co-PI) [⬀] Mcgregor, Andrew [⬀] Mazumdar, Arya (co-PI) [⬀] Flaherty, Patrick Saha, Barna |
N/AActivity Code Description: No activity code was retrieved: click on the grant title for more information |
HDR TRIPODS: Institute For Integrated Data Science: A Transdisciplinary Approach to Understanding Fundamental Trade-Offs and Theoretical Foundations @ University of Massachusetts Amherst
Many areas of science, engineering, and industry are already being revolutionized by the adoption of tools and techniques from data science. However, a rigorous analysis of existing approaches together with the development of new ideas is necessary to a) ensure the optimal use of available computational and statistical resources and b) develop a principled and systematic approach to the relevant problems rather than relying on a collection of ad hoc solutions. In particular, there are many interrelated questions that arise in a typical data science project. First is the acquisition of relevant data: Can data be collected interactively and might this reduce the costs of data acquisition? Is the data noisy and how might this impact the results? Second is the processing of data: If the data cannot fit in the memory of a single machine, how can we minimize the communication costs within a cluster of machines? When are approximate answers sufficient and how does the required accuracy trade off with the computational resources available? Third is the predictive value of the available data: Can the uncertainty of the final results be quantified? How can the modeling assumptions used by our algorithms be efficiently evaluated? This award supports a data science institute with the main goal of developing an understanding of the fundamental mathematical and computational issues underlying the aforementioned questions. Ultimately, this will enable practitioners to make more informed decisions when investing time and money across the life cycle of their data science project. Achieving this goal necessitates a transdisciplinary approach, and the team of investigators includes experts in theoretical computer science; applied and computational mathematics; machine learning and statistics; and coding and information theory. In addition to pursuing the above research goals, the institute will coordinate education and training activities and develop resources for the research community.
Specific research goals explored in this project include: 1) Understanding the trade-off between rounds of interactive data acquisition and statistical and computational efficiency. 2) Minimizing query complexity in interactive unsupervised learning problems. 3) Understanding space/sample complexity trade-offs when processing stochastic data. 4) Developing fine-grained approximation algorithms relevant to core data science tasks. 5) Using coding theory to enable communication-efficient distributed machine learning. 6) Designing variational inference methods with statistical guarantees given limited resources. 7) Developing a principled approach to exploiting trade-offs between bias, model complexity, and computational budget. Specific institute activities include: 1) Technical workshops and training activities for researchers in domain sciences. 2) A virtual speaker series. 3) Education initiatives including the development of new courses that will teach foundational topics in data science and resources that can be used across different institutions. The grant will also train postdoctoral scholars and undergraduate researchers.
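As one concrete illustration of the space/accuracy trade-offs behind research goal 3, the sketch below (illustrative Python, a standard count-min sketch rather than a method proposed by the institute) approximates frequency counts of a data stream in space far smaller than the stream itself, at the cost of a small overestimation error.

```python
import random

# Standard count-min sketch: approximate frequency counts of a stream using
# O(width * depth) counters instead of one counter per distinct item.
# Estimates never undercount and overestimate by a small additive error
# with high probability -- a classic space/accuracy trade-off.

class CountMinSketch:
    def __init__(self, width: int = 256, depth: int = 4, seed: int = 0):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # one hashed bucket per row, salted so the rows differ
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.width

    def add(self, item, count: int = 1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item) -> int:
        return min(self.table[row][col] for row, col in self._buckets(item))

if __name__ == "__main__":
    cms = CountMinSketch()
    for word in ["a", "b", "a", "c", "a", "b"]:
        cms.add(word)
    print(cms.estimate("a"), cms.estimate("b"), cms.estimate("c"))  # 3 2 1
```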
This project is part of the National Science Foundation's Harnessing the Data Revolution (HDR) Big Idea activity.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
2022 — 2027
Dasgupta, Sanjoy (co-PI); Wang, Yusu (co-PI); Chaudhuri, Kamalika (co-PI); Mazumdar, Arya (co-PI); Saha, Barna
Collaborative Research: EnCORE: Institute For Emerging CORE Methods in Data Science @ University of California-San Diego
The proliferation of data-driven decision making, and its increasing popularity, has fueled the rapid emergence of data science as a new scientific discipline. Data science is seen as a key enabler of future businesses, technologies, and healthcare that can transform all aspects of socioeconomic life. Its fast adoption, however, often comes with ad hoc implementation of techniques yielding suboptimal, and sometimes unfair and potentially harmful, results. The time is ripe to develop principled approaches that lay solid foundations for data science. This is particularly challenging because real-world data is highly complex, with intricate structures, unprecedented scale, rapidly evolving characteristics, noise, and implicit biases. Addressing these challenges requires a concerted effort across multiple scientific disciplines: statistics for robust decision making under uncertainty; mathematics and electrical engineering for enabling data-driven optimization beyond the worst case; theoretical computer science and machine learning for new algorithmic paradigms that deal with dynamic and sensitive data in an ethical way; and the basic sciences for bringing the technical developments to the forefront of health sciences and society. The proposed institute for emerging CORE methods in data science (EnCORE) brings together a diverse team of researchers spanning the aforementioned disciplines from the University of California San Diego, the University of Texas Austin, the University of Pennsylvania, and the University of California Los Angeles. It presents an ambitious vision to transform the landscape of the four CORE pillars of data science: C for complexities of data, O for optimization, R for responsible learning, and E for education and engagement. Along with its transformative research vision, the institute fosters a bold plan for outreach and broadening participation by engaging students of diverse backgrounds at all levels, from K-12 to postdocs and junior faculty. The project aims to reach a wide demographic of students by offering collaborative courses across its partner universities and a flexible co-mentorship plan for truly multidisciplinary research. Through regular workshops, summer schools, and seminars, the project aims to engage the entire scientific community and to become the new nexus of research and education on the foundations of data science. To bring the fruits of theoretical development to practice, EnCORE will work continuously with industry partners and domain scientists, and will forge strong connections with the other National Science Foundation Harnessing the Data Revolution (HDR) institutes across the nation.
EnCORE as an institute embodies intellectual merit that has the potential to lead ground-breaking research shaping the foundations of data science in the United States. Its research mission is organized around three themes. The first theme, on data complexity, addresses the complex characteristics of data such as massive size, huge feature spaces, rapid changes, variety of sources, implicit dependence structures, arbitrary outliers, and noise. A major overhaul of the core concepts of algorithm design is needed, with a holistic view of different computational complexity measures. In the face of noise and outliers, uncertainty estimation is necessary, yet it is made difficult by dynamic and changing data. Data heterogeneity poses major challenges even in basic classification tasks.
The structural relationships hidden inside such data are crucial for understanding and processing it, and for downstream data analysis tasks such as those in visualization and neuroscience. The second theme of EnCORE aims to transform the classical area of optimization, where adaptive methods and human intervention can lead to major advances. It plans to revisit the foundations of distributed optimization to incorporate heterogeneity, robustness, safety, and communication, and to address statistical uncertainty due to distributional shift in dynamic data arising in control and reinforcement learning. The third and final theme of EnCORE proposes to build the foundations of responsible learning. Applications of machine learning in human-facing systems are severely hampered when the learned models are hard for users to understand and reproduce, may give biased outcomes, are easily manipulated by an adversary, or reveal sensitive information. Thus, interpretability, reproducibility, fairness, privacy, and robustness must be incorporated in any data-driven decision making. The team's experience and dedication to mentoring and outreach, collaborative curriculum design, a socially aware and responsible research program, extensive institute activities, and industrial partnerships will pave the way for substantial broader impacts for EnCORE. Summer schools with year-long mentoring will take place in three states, involving a large and diverse demographic. Joint courses with hybrid and fully online offerings will be developed. Drawing on prior experience running the Thinkabit Lab, which has reached over 74,000 K-12 students so far, EnCORE will embark on an ambitious and thoughtful outreach program to improve the representation of under-represented groups and to help create a future workforce that is diverse, responsible, and grounded in solid foundations of data science.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.