1998 — 2001
Faloutsos, Christos (co-PI); Spirtes, Peter (co-PI); Wasserman, Larry (co-PI); Moore, Andrew; Nichol, Robert
Kdi: New Algorithms, Architectures and Science For Data Mining of Massive Astrophysics Sky Surveys @ Carnegie-Mellon University
Moore 9873442

There are many massive databases in industry and science, and this is particularly true for astrophysics. There are many kinds of questions that physicists and other users wish to ask of these databases in real time, e.g., 'find outliers'; 'find clusters'; 'find patterns'; 'classify the data records into N predetermined classes.' Wide-ranging statistics and machine learning algorithms similarly need to query databases, sometimes millions of times for a single inference. With millions or billions of records (such as the new generation of astrophysics sky surveys), this can be intractable using current algorithms. This project aims to make repeated statistical querying of huge datasets computationally feasible by transforming massive databases into condensed representations that permit the rapid answering of such questions. To achieve these goals, the investigator and his colleagues explore ways in which tools from statistics (such as Bayesian networks), databases (such as kd-trees/R-trees), and artificial intelligence (such as AD-trees and rule-finders) can help, how they scale up, and how they can be combined.

The investigators intend to help automate the process of scientific discovery for astrophysical data sources in which there is too much information for any unaided human to have a chance of spotting patterns, regularities, or anomalies. Government and industry in the U.S. have invested heavily in ingenious new ways to gather information in all branches of science and industry, from cell biology to the flows of capital in international commerce. Scientists and analysts who have worked so hard to gather magnitudes more data than they had ten years ago are now faced with an equally daunting task: exploiting it fully. It is ironic that in fields such as astrophysics there is now so much data that no human has enough time to see even a tiny fraction of it.
The job of discovering new relationships, anomalies, and even causation must now be at least partly turned over to computers. The investigators comprise a team of statisticians, computer scientists, and astronomers who have each already made progress in this direction. This team develops new algorithms to squeeze as much information as possible from trillion-byte astrophysics databases such as the Sloan Digital Sky Survey. They also make sure that the resulting technology is deployed elsewhere in science and industry.
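The kd-trees mentioned in the abstract are one of the core tools for making repeated queries over huge catalogs tractable. The following minimal, stdlib-only sketch (illustrative, not the project's actual code) shows the basic idea: a kd-tree over 2-D sky positions lets a range-count query prune whole subtrees instead of scanning every record.

```python
# Minimal kd-tree sketch (hypothetical, stdlib-only): spatial indexing turns
# a linear scan into a pruned search, the kind of speedup the project relies
# on for massive sky-survey catalogs.

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree over 2-D points (e.g. RA/Dec pairs)."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def count_in_box(node, lo, hi):
    """Count points inside the axis-aligned box [lo, hi], skipping subtrees
    that cannot intersect the box."""
    if node is None:
        return 0
    axis, p = node["axis"], node["point"]
    total = 1 if all(lo[d] <= p[d] <= hi[d] for d in range(2)) else 0
    if lo[axis] <= p[axis]:      # box may extend into the left subtree
        total += count_in_box(node["left"], lo, hi)
    if hi[axis] >= p[axis]:      # box may extend into the right subtree
        total += count_in_box(node["right"], lo, hi)
    return total

pts = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.1), (0.3, 0.8), (0.7, 0.6)]
tree = build_kdtree(pts)
print(count_in_box(tree, (0.0, 0.0), (0.6, 0.6)))  # 2 points fall in the box
```

On real survey-scale data the same pruning idea is what makes "millions of queries per inference" feasible; production implementations add balancing, cached sufficient statistics (as in AD-trees), and disk-aware layouts (as in R-trees).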
2001 — 2007
Wasserman, Larry (co-PI); Connolly, Andrew; Moore, Andrew; Nichol, Robert; Schneider, Jeff; Miller, Christopher (co-PI)
Itr/Im: Statistical Data Mining For Cosmology @ Carnegie-Mellon University
Scientists are now confronted with many very large, high-quality data sets. The potential scientific benefits of these data are offset by the laborious process of analyzing them to answer questions and test theories. This project will develop new data mining algorithms in pursuit of the goal of computer-assisted discovery. Two key issues in achieving this are computational efficiency and autonomy. If scientists are to focus their energy on understanding, answers must arrive in minutes rather than days; hence the need for efficiency. Autonomy is important from both the data mining and the statistical perspective. Detailed searches for relationships, models, and parameters are too large for humans to undertake manually. New statistical methods will have to autonomously and quickly select models, test their significance, and report the results to search algorithms looking for new discoveries.
The National Virtual Observatory (NVO) currently under construction is a model of the future of science. The NVO will assemble petabytes of data from many multi-wavelength sky surveys into a single repository. The new methods to be developed will be implemented in the domain of cosmology, but they will be applicable to all other sciences.
The members of this project are computer scientists, physicists, and statisticians with a track record of close collaboration. Working together they have produced new algorithmic theory, new statistical theory, and publicly fielded software packages resulting from that theory, while developing new courseware and training students.
This proposal involves research and education in the following areas:
Nonparametric data analysis. Nonparametric statistical models enable powerful analysis techniques that make minimal assumptions, which is critical for scientific accuracy.
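A one-dimensional kernel density estimate is a standard example of the nonparametric approach described here: it estimates a density directly from the data with no assumption about its parametric form. The Gaussian kernel and fixed bandwidth below are illustrative choices for this sketch, not the project's specific methods.

```python
# Sketch of nonparametric density estimation via a Gaussian kernel density
# estimate: f(x) = (1/(n*h)) * sum_i K((x - x_i)/h), with K the standard
# normal kernel. No parametric family is assumed for the data.
import math

def gaussian_kde(data, bandwidth):
    """Return a callable density estimate built from the sample."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def f(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in data)
    return f

# Two clusters of observations; no Gaussian-mixture model is ever fitted.
sample = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
f = gaussian_kde(sample, bandwidth=0.3)
# The estimate is higher near the clusters than in the gap between them.
print(f(1.0) > f(3.0), f(5.0) > f(3.0))  # True True
```

In practice the bandwidth would be selected automatically (e.g. by cross-validation), which is exactly the kind of autonomous model selection the proposal calls for.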
Automated discovery. Statistical models can be used directly for discovery. Individual objects are compared to models to identify anomalies, and data-generated models are compared to theoretical models to refute or confirm hypotheses.
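The "compare individual objects to a model" step can be sketched very simply: summarize the data with a robust model of typical behavior, then flag objects that deviate too far from it. The median/MAD rule and the cutoff below are illustrative stand-ins, not the project's actual anomaly criteria.

```python
# Sketch of model-based anomaly flagging: build a robust summary of the data
# (median and median absolute deviation), then flag objects far from it.
# Robust statistics are used so the anomaly itself does not distort the model.
import statistics

def flag_anomalies(values, n_mad=5.0):
    """Return values whose deviation from the median exceeds n_mad MADs.
    Note: if the MAD is zero (near-constant data), any deviation is flagged."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values if abs(v - med) > n_mad * mad]

fluxes = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 42.0]  # one aberrant object
print(flag_anomalies(fluxes))  # [42.0]
```

A mean/standard-deviation rule would fail here: the outlier inflates the standard deviation enough to hide itself, which is why robust or model-based scores are preferred for automated discovery.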
Computational methods for fast analysis. The project will build on past successes in obtaining orders-of-magnitude speedups on operations such as Expectation-Maximization-based clustering and n-point correlations to make the new methods fast.
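To make concrete what is being accelerated: the two-point correlation function starts from pair counts, the number of object pairs separated by less than a given radius. The naive version below is O(n²); the tree-based methods the project builds on prune whole pairs of subtrees at once to avoid that cost. This is an illustrative baseline, not the accelerated algorithm itself.

```python
# Naive two-point pair counting over 2-D positions: count unordered pairs
# closer than r. This O(n^2) loop is the quantity that dual-tree methods
# accelerate by pruning entire subtree pairs whose bounding boxes are
# provably all-closer or all-farther than r.
import itertools
import math

def pair_count(points, r):
    """Count unordered pairs of 2-D points at separation < r (naive O(n^2))."""
    return sum(1 for a, b in itertools.combinations(points, 2)
               if math.dist(a, b) < r)

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0)]
print(pair_count(pts, 0.2))  # 3 close pairs among the clustered points
```

For a survey with 10⁸ objects the naive count is ~10¹⁶ distance evaluations, which is why the orders-of-magnitude speedups cited above matter.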
Automated simulation parameter searching. Using all of the above methods, a system will be developed that starts with a parameterized simulation and some observational data. The system will search the space of parameters, testing the resulting simulation against the real data using nonparametric methods to determine the best settings.
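The search loop described above can be sketched end to end: run a parameterized simulation, compare its output to the observations with a nonparametric two-sample statistic, and keep the best-fitting parameter. The toy shifted-Gaussian "simulation", the candidate grid, and the Kolmogorov-Smirnov-style distance below are placeholders assumed for illustration, not the project's cosmological codes or its actual test.

```python
# Sketch of nonparametric simulation parameter search: for each candidate
# parameter, simulate data and measure a Kolmogorov-Smirnov-style distance
# (max gap between empirical CDFs) to the observed data; keep the minimizer.
import random

def ks_distance(xs, ys):
    """Max gap between two empirical CDFs: nonparametric and model-free."""
    xs, ys = sorted(xs), sorted(ys)
    def cdf(s, v):
        return sum(1 for u in s if u <= v) / len(s)
    return max(abs(cdf(xs, v) - cdf(ys, v)) for v in xs + ys)

def simulate(theta, n, rng):
    """Toy 'simulation': observations shifted by the unknown parameter."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

observed = simulate(2.0, 500, random.Random(0))   # pretend theta = 2 is truth
candidates = [0.0, 1.0, 2.0, 3.0]
best = min(candidates,
           key=lambda t: ks_distance(simulate(t, 500, random.Random(1)),
                                     observed))
print(best)  # the search recovers theta = 2.0
```

A real system would replace the grid with a smarter search over a high-dimensional parameter space and account for the significance of the test, but the structure — simulate, compare nonparametrically, update — is the same.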