2000 — 2004
Berman, Francine; Ellisman, Mark (co-PI); Casanova, Henri
Itr: Virtual Instruments: Scalable Software Instruments For the Grid @ University of California-San Diego
Ensembles of distributed communication, computation, and storage resources, also known as "Computational Grids", are emerging as a critical platform for high-performance computing. Grids are used effectively to support runs of distributed applications at a scale large enough to provide new disciplinary results to their developers. Researchers in almost every field of science and engineering are especially interested in a class of applications particularly well suited to the Grid: scientific simulations in which many parameterized instances of a given computation are performed. The development of accessible, efficient, fault-tolerant, Grid-enabled versions of simulation software will enable disciplinary scientists to investigate wide-ranging scenarios and to obtain new results orders of magnitude faster than is currently possible.
Many scientists would like to view large-scale simulations as software instruments that support some level of user interaction. This is effective only if simulations can be deployed easily and controlled dynamically, i.e., if the computation can be steered. A traditional scenario is for the user to steer the simulation based on partial results that evolve continuously during execution. The partial results provide an increasingly refined indicator of the final results of the simulation and can be used to identify, mid-execution, which parameter sets are most promising. Given the potential of wide-area, federated Grid environments to deliver the aggregate computational power, data storage, and dissemination facilities for large-scale simulations, and the need for scientists to steer such computations, it is increasingly important to develop performance-efficient and steerable software instruments that target the Grid. This project will address the significant computer science problems that arise from the need to support steerable scientific simulations in large-scale Grid environments.
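To make the steering scenario concrete, the sketch below shows the kind of control loop such an instrument might implement: partial results accumulate for each parameter set, and at periodic review points the least promising runs are pruned so resources flow to the rest. This is a minimal illustration under invented assumptions, not the project's actual design; all names (ParameterSet, partial_score, the pruning policy) are hypothetical.

```python
import random

# Hypothetical steering loop for a parameter-sweep software instrument.
# Each parameter set is refined step by step; its partial score becomes an
# increasingly reliable indicator of the final outcome.

class ParameterSet:
    def __init__(self, params):
        self.params = params
        self.partial_score = 0.0
        self.active = True

    def advance(self):
        # Stand-in for one chunk of simulation work done on Grid resources.
        self.partial_score += random.random()

def steer(instances, total_steps, keep_fraction=0.5, review_every=10):
    """Periodically prune the least promising parameter sets."""
    for step in range(1, total_steps + 1):
        for inst in instances:
            if inst.active:
                inst.advance()
        if step % review_every == 0:
            live = sorted((i for i in instances if i.active),
                          key=lambda i: i.partial_score, reverse=True)
            keep = max(1, int(len(live) * keep_fraction))
            for inst in live[keep:]:
                inst.active = False   # steer resources away from this run
    return [i for i in instances if i.active]

survivors = steer([ParameterSet({"seed": s}) for s in range(20)], 100)
print(len(survivors), "parameter sets retained")
```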
The project will design, develop, and prototype a virtual software instrument that serves as a vehicle for building scalable, steerable scientific simulations for the Grid. It will use a Monte Carlo simulation program, MCell, as a prototype application for development and testing of the virtual instrument. The virtual instrument itself will consist of a set of software modules, libraries, interfaces, and steering-sensitive scheduling algorithms. The project will have impact on both the computer science and disciplinary science communities. It will foster new research in computer science through the development of event models, performance models, data management strategies, and adaptive scheduling and steering algorithms. It will also enable domain scientists to obtain new results in neuroscience.
2002 — 2006
Baru, Chaitanya; Casanova, Henri; Snavely, Allan (co-PI)
Data Intensive Grid Benchmarks @ University of California-San Diego
Efficient development and deployment of Grids will be advanced by defining a suite of benchmarks to measure the expected quality of service of Grid architectures and the anticipated time-to-solution of Grid applications. In particular, benchmarks that focus on the impact of alternative middleware implementations on application performance are needed.
The outcomes of the project will be: (1) a set of low-level benchmark probes for measuring the performance of Grid infrastructure and the overheads of Grid middleware; (2) a synthetic application benchmark suite embodying anticipated Grid usage scenarios of several emerging data-intensive Grid applications; (3) application profiling tools that can summarize the resource usage patterns of Grid applications, and thus inform the refinement of an evolving synthetic benchmark as production applications mature; (4) a maintained website of performance data where Grid users can obtain benchmarks, view results reported by this group and others, and submit their own new results; and (5) enabled research (some as part of this proposal, but also by the wider community) into factors affecting Grid performance.
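As an illustration of outcome (1), a low-level probe can time the same operation issued directly and through a middleware layer, and report the difference as middleware overhead. The sketch below is a generic stand-in with hypothetical run_directly and submit_via_middleware functions; it is not this project's probe suite.

```python
import statistics
import time

def measure(fn, repetitions=10):
    """Return the median wall-clock time of fn() over several runs."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()                      # the operation being probed
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Hypothetical stand-ins: the same task launched directly on a resource
# versus through a Grid middleware layer (e.g., a job submission service).
def run_directly():
    time.sleep(0.010)

def submit_via_middleware():
    time.sleep(0.015)

direct = measure(run_directly)
via_mw = measure(submit_via_middleware)
print(f"middleware overhead: {(via_mw - direct) * 1000:.1f} ms per operation")
```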
As Grids emerge, it is important to deploy measurement methods along with them so that applications and architectures can evolve guided by scientific principles. All sciences need agreed-upon metrics: a common language for communicating results. And a system, to be well engineered, must be measured so that alternative implementations can be compared quantitatively. Also, users of systems need performance objectives that describe system capabilities so that they can develop and tune their applications towards informed objectives. And system architects need examples of how users will exercise the system to inform the design process. Benchmarks are thus an important part of the middleware for Grids, enabling communication about technological advances and design tradeoffs. The proposed work will develop a suite of data-intensive Grid benchmarks to convey information back and forth between people building Grids and people planning to use them.
Data-intensive applications form a class of problems that requires access to multiple data archives remote from the available computational resources. They belong on the Grid because they are already distributed (as opposed to traditional HPC applications, which may or may not scale well to Grids). A benchmark suite that embodies the resource requirements of these emerging applications is needed to help Grid architects understand how applications will stress infrastructure and middleware, and to give users an early look at the quality of service they can expect for their applications.
2002 — 2005
Dongarra, Jack; Beck, Micah; Plank, James (co-PI); Casanova, Henri
Gridsolve: a System For Grid-Enabling General Purpose Problem Solving Environments @ University of Tennessee Knoxville
The purpose of GridSolve is to create the middleware necessary to provide a seamless bridge between the simple, standard programming interfaces and desktop scientific computing environments (SCEs) that dominate the work of computational scientists, and the rich supply of services supported by the emerging Grid architecture. By using GridSolve, users of scientific computing environments can easily access and reap the benefits of shared processing, storage, software, data resources, and so forth. GridSolve will be built upon earlier work on the NetSolve software. The work will lead to a new RPC mechanism, GridRPC; will provide client interfaces for key general-purpose SCEs; will ensure that GridSolve can bind to and utilize all major Grid backends; will adapt GridSolve to use existing Grid services (security, resource monitoring); and will provide support for flexible state management and data logistics through the use of distributed storage resources.
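The GridRPC idea is a function-handle pattern: a client binds a problem name to a server selected by the middleware, then issues blocking or non-blocking remote calls against that handle. The Python mock below illustrates only the pattern; the real GridRPC interface is a C API, and every name here (FunctionHandle, call_async, the toy registry) is hypothetical.

```python
import concurrent.futures

# Illustrative mock of the GridRPC call pattern: handle init, asynchronous
# call, then wait. Not the actual GridRPC or NetSolve API.

class FunctionHandle:
    """Binds a remote problem name to a server chosen by the middleware."""
    def __init__(self, problem_name, registry):
        self.problem = registry[problem_name]   # server-side implementation

class GridRPCClient:
    def __init__(self, registry):
        self.registry = registry
        self.pool = concurrent.futures.ThreadPoolExecutor()

    def handle(self, problem_name):
        return FunctionHandle(problem_name, self.registry)

    def call_async(self, handle, *args):
        # Non-blocking call; returns a request the caller can wait on.
        return self.pool.submit(handle.problem, *args)

# A toy "remote" solver registry standing in for NetSolve-style servers.
client = GridRPCClient({"vec_add": lambda a, b: [x + y for x, y in zip(a, b)]})
h = client.handle("vec_add")
request = client.call_async(h, [1, 2], [3, 4])   # analogous to an async call
print(request.result())                          # analogous to a wait
```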
GridSolve builds on the substantial NetSolve code base and draws upon knowledge and experience gained from using NetSolve with a variety of Grid environments and applications. The effort will make tools like Matlab and Mathematica work on the Grid, with the result that Grid technologies and capabilities will be useful to an even broader base of scientists and researchers.
2002 — 2005
Dongarra, Jack (co-PI); Beck, Micah; Plank, James; Wolski, Richard; Casanova, Henri
Ngs: Optimizing Performance and Reliability in Distributed Computing Systems Through Wide Spectrum Storage Services @ University of Tennessee Knoxville
Logistical Networking promotes an innovative model of network storage that views data transmission and storage as part of a unified framework, just as military or industrial logistics view transportation lines and storage depots as coordinated elements of a single infrastructure. Under the Logistical Computing and Internetworking (LoCI) project, we have created and implemented a primitive abstraction of network storage, called the Internet Backplane Protocol (IBP) [43], as the lowest network layer in a "storage stack" for Logistical Networking.
The goal of the proposed research is to test the hypothesis that this storage stack can support advanced applications, by creating a Logistical Runtime System (LoRS) designed to run such applications on a well-provisioned network of IBP-based storage, and then exploring its value and performance along several dimensions using real-world Grid applications.
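At the bottom of the storage stack, IBP exposes network storage through time-limited allocations addressed by capabilities. The toy depot below sketches that allocate/store/load pattern with simplified, invented semantics; it is not the IBP protocol or any LoCI code.

```python
import time
import uuid

# Toy model of IBP-style network storage: a depot leases byte buffers via
# time-limited "capabilities". The names and semantics are simplified
# illustrations of the allocate/store/load pattern, not the IBP protocol.

class Depot:
    def __init__(self):
        self.allocations = {}   # capability -> [expiry, buffer]

    def allocate(self, size, duration_s):
        cap = uuid.uuid4().hex            # capability naming the allocation
        self.allocations[cap] = [time.time() + duration_s, bytearray(size)]
        return cap

    def store(self, cap, offset, data):
        expiry, buf = self.allocations[cap]
        assert time.time() < expiry, "lease expired"
        buf[offset:offset + len(data)] = data

    def load(self, cap, offset, length):
        expiry, buf = self.allocations[cap]
        assert time.time() < expiry, "lease expired"
        return bytes(buf[offset:offset + length])

depot = Depot()
cap = depot.allocate(size=64, duration_s=600)   # a lease, not permanent storage
depot.store(cap, 0, b"logistical networking")
print(depot.load(cap, 0, 21))
```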
2003 — 2007
Wolski, Richard; Casanova, Henri; Chien, Andrew
Ngs/Models to Support Performance-Engineering of Global Computations @ University of California-Santa Barbara
This proposal focuses on modeling and scheduling methodologies that will enable global computing, thus extending Computational Grids and high-performance distributed computing to a global level. The proposed work has three objectives: (1) develop novel models of complex emerging global and Grid computing platforms, covering both resource availability and resource classification, and attempting to unify the performance characteristics of both platform types; (2) using these models, develop scheduling strategies that enable robust, predictable performance and deliver high availability for applications; (3) validate these approaches with two real-world global computing applications, using both novel simulation techniques and, if possible, a real global computing system. The predictive models and simulation framework will advance the foundational understanding of both Grid and global computing environments. These developments will be key to advancing the state of the art of the global computing infrastructure by addressing computer science research questions that are relevant to both communities.
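One way such availability models feed into scheduling: if a trace-fitted model predicts how long each host tends to stay up, a scheduler can place a task on the host most likely to remain available for the task's whole duration. The sketch below uses an exponential lifetime model and invented numbers purely for illustration; it is not an algorithm from this project.

```python
import math

# Illustrative use of a resource-availability model for scheduling: pick
# the host most likely to stay available long enough to finish a task.
# The exponential lifetime model and all numbers are hypothetical.

hosts = {
    # name: (task duration on this host in hours, mean time between
    #        interruptions in hours, as fitted from an availability trace)
    "desktop-17": (2.0, 6.0),
    "lab-42":     (1.2, 3.0),
    "cluster-03": (0.8, 40.0),
}

def p_completes(duration, mtbf):
    # P(no interruption during the task) under an exponential model.
    return math.exp(-duration / mtbf)

best = max(hosts, key=lambda h: p_completes(*hosts[h]))
for h, (d, m) in hosts.items():
    print(f"{h}: P(success) = {p_completes(d, m):.2f}")
print("schedule task on:", best)
```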
2003 — 2007
Ferrante, Jeanne; Carter, Larry (co-PI); Casanova, Henri
Software: Autonomous Scheduling On Large Distributed Systems @ University of California-San Diego
Advances in network and middleware technologies have brought computing with many widely distributed and heterogeneous resources to the forefront, both in the context of Grid Computing and of Internet Computing. These large distributed platforms allow scientists to solve problems at an unprecedented scale and/or at greatly reduced cost. The high-level goal of this work is to further the development of software methodologies and algorithms that enable scientists, engineers, and others to use large heterogeneous distributed systems.
Application domains that can readily benefit from such platforms are many; they include computational neuroscience, factoring large numbers, genomics, volume rendering, protein docking, and even searching for extra-terrestrial life. These applications are characterized by large numbers of independent tasks, which makes it possible to deploy them on distributed platforms with high network latencies. More specifically, in this work we assume that all application data initially resides in a single repository, and that the time required to transfer that data is a significant factor. Efficiently managing the resulting computation is a difficult and challenging problem, given the heterogeneous and typically dynamic attributes of the underlying components. This work therefore pursues autonomous, decentralized scheduling, which allows for adaptivity and scalability, since decisions and changes can be made locally. This approach is particularly effective for scheduling in environments that are heterogeneous, dynamic, and unstructured, such as global and peer-to-peer computing platforms consisting mostly of home PCs.
This research develops a simple yet general computation and communication model for Grid and Internet platforms, and autonomous, decentralized scheduling techniques based on this model. It analyzes the optimality of these techniques in terms of steady-state and overall application performance. Further, it incorporates adaptability and fault-tolerance, and evaluates the resulting techniques both through simulation and by running real applications on actual testbeds. Its overall impact on the scientific community is to enable scientists to solve important classes of problems faster and in a more cost-effective fashion.
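The steady-state view can be made concrete on a star-shaped platform: a master with a single network link dispenses equal-size independent tasks to heterogeneous workers, and throughput is maximized by spending the scarce link time on the workers that are cheapest to feed. The sketch below illustrates that reasoning, in the spirit of "bandwidth-centric" allocation results from this literature, with invented numbers; it is not this project's published algorithms.

```python
# Illustrative steady-state allocation on a star platform: a master with
# one network link sends independent, equal-size tasks to workers. Worker
# i occupies the master's link for c time units per task and its own CPU
# for w time units per task. All numbers are made up.

workers = {          # name: (c = comm time per task, w = compute time)
    "fast-link-slow-cpu": (1.0, 10.0),
    "fast-link-fast-cpu": (2.0, 4.0),
    "slow-link-fast-cpu": (5.0, 2.0),
}

link_budget = 1.0    # fraction of the master's link available per time unit
throughput = 0.0

# Serve workers with the cheapest communication first: link time, not CPU
# speed, is the resource the master's single link makes scarce.
for name, (c, w) in sorted(workers.items(), key=lambda kv: kv[1][0]):
    rate = min(1.0 / w,            # worker CPU limit (tasks per time unit)
               link_budget / c)    # remaining link capacity limit
    link_budget -= rate * c
    throughput += rate
    print(f"{name}: {rate:.3f} tasks/unit")

print(f"steady-state throughput: {throughput:.3f} tasks per time unit")
```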
2003 — 2008
Ferrante, Jeanne; Carter, Larry (co-PI); Casanova, Henri
Us-France Cooperative Research: Algorithms and Simulations For Scheduling On Large-Scale Distributed Platforms @ University of California-San Diego
Scheduling computational tasks on a given set of processors is a key issue for high-performance computing. Future computing systems, such as the computational grid, are likely to be widely distributed and strongly heterogeneous. This three-year US-France cooperative research award between the University of California at San Diego, the Ecole Normale Superieure in Lyon, and the French National Institute for Research in Informatics and Applied Mathematics (INRIA) addresses the impact of heterogeneity on the design and analysis of static scheduling techniques for grid-based systems. The project has three major objectives: (1) development of hierarchical, steady-state scheduling algorithms for heterogeneous platforms; (2) adaptation of peer-to-peer strategies for client-server applications; and (3) extension of SIMGRID simulation methodologies and tools. SIMGRID is a discrete-event simulation toolkit that can be used for distributed applications and computing environment topologies. The researchers involved in this project are Jeanne Ferrante, Larry Carter, and Henri Casanova of the University of California at San Diego and the San Diego Supercomputer Center; Eddy Caron and Yves Robert of the Ecole Normale Superieure in Lyon; and Frederic Vivien of INRIA.
This award represents the US side of a joint proposal to NSF and INRIA. NSF provides funds for visits to France by US investigators and students, who will participate in joint research and in a concluding workshop at the end of the third year. INRIA supports the visits of French researchers to the United States. The joint activities take advantage of combined US-French expertise in models and algorithmic techniques for scheduling on large-scale distributed, grid-based systems. The project advances NSF's priority area of cyberinfrastructure research and development, which will enable collaboration among scientists and engineers across disciplines and national boundaries.
2005 — 2009
Casanova, Henri
Collaborative Research: Csr-Pdos: Designing Large-Scale Distributed Systems For Realistic Failure Models
Large-scale distributed systems are now being regularly built and deployed. Being large-scale, they are composed of many components (computers, communications infrastructure, storage devices, and so on), all of which are prone to failure, that is, exceptional nondeterministic behavior. Designing large-scale systems to cope efficiently with such failures is a difficult and ongoing problem; to do so effectively requires a much deeper understanding of how they actually fail. To gain such understanding, the PIs are collecting information on the failures of three actual large-scale systems: a data grid, a desktop grid, and a peer-to-peer cooperative backup system. This data is being collected either through cooperation with other funded projects (e.g., the BIRN project) or by deploying the systems themselves. The collected failure data is being used to develop more abstract failure models that can serve as the basis of algorithm and system development. The PIs are using these failure models to understand how the systems being studied can be improved (for example, by achieving higher availability, lower overhead, or better performance than the original systems, which are based on much less precise failure models). The PIs are making all the failure information they collect available (after anonymization) via the web. Failure models and protocols are being made available through papers, which are also distributed when ready via the project web site. This information is of interest to those designing, constructing, and deploying new large-scale systems.
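A common way, in this literature, to turn raw availability traces into an abstract failure model is to fit a parametric distribution to the observed times between failures. The sketch below fits a Weibull distribution with SciPy; the trace is synthetic, and the choice of Weibull is illustrative rather than a claim about what this project's data shows.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a trace of times-between-failures (hours),
# e.g., as might be extracted from desktop-grid host availability logs.
rng = np.random.default_rng(0)
uptimes = rng.weibull(0.7, size=500) * 40.0

# Fit a Weibull distribution (location pinned at 0) to the trace.
shape, loc, scale = stats.weibull_min.fit(uptimes, floc=0)
print(f"fitted Weibull: shape={shape:.2f}, scale={scale:.1f} h")

# A shape parameter below 1 indicates a decreasing hazard rate: the longer
# a host has been up, the less likely it is to fail soon, which a scheduler
# can exploit when placing long-running tasks.
model = stats.weibull_min(shape, loc=0, scale=scale)
print(f"P(host survives 10 h from boot) = {model.sf(10):.2f}")
```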
2009 — 2013
Casanova, Henri
Collaborative Research: Ii-New: Distributed Research Testbed (Dirt)
Through Grid and Cloud computing, the importance of distributed computing has risen dramatically in recent years, increasing the computational power available to a widening audience of scientific and commercial users. Gains in computing power have caused a drastic increase in the volume of data produced by users, requiring new research on improved management of, and access to, distributed data. These gains also drive the need for efficient scheduling and leasing of computational resources and for adapting current work in machine virtualization to a distributed context. These research directions require the development and evaluation of new models for computational, communication, and storage costs, but existing infrastructures make model evaluation difficult or impossible, since they are in constant use by other researchers. This project addresses these concerns by providing a diverse group of researchers with a Distributed Research Testbed (DiRT) on which to develop and evaluate new technologies. The clusters making up the testbed are located at the University of Chicago, the University of Florida, the University of Hawai'i, the University of Notre Dame, and the University of Mississippi. Unlike working grid environments, the testbed gives us complete low-level control of the hardware and complete knowledge of where the data and computation are located. We will use the testbed to address problems faced today by the growing number of users of distributed computing. Because high performance computing is essential to the conduct of modern science, this project will have significant impact on research and education in a wide variety of scientific disciplines.
2017 — 2019
Casanova, Henri
Collaborative Research: Si2-Sse: Wrench: a Simulation Workbench For Scientific Workflow Users, Developers, and Researchers
Many scientific breakthroughs can only be achieved by performing complex processing of vast amounts of data efficiently. In domains as crucial to our society as climate modeling, oceanography, particle physics, seismology, or computational biology (and in fact in most fields of physics, chemistry, and biology today), scientists nowadays routinely define "scientific workflows". These workflows are complex descriptions of scientific processes as data and inter-dependent computations on these data. When executed, typically with great expenses of computing, storage, and networking hardware, these workflows can produce groundbreaking results. A famous and recent example is the workflow that was used as part of the LIGO project to confirm the first detection of gravitational waves from colliding black holes. Scientific workflows are mainstays in today's science. Their efficient execution (in terms of speed, reliability, and cost) is thus crucial. This project seeks to provide a software framework, called WRENCH (Workflow Simulation Workbench), that will make it possible to simulate large-scale hypothetical scenarios quickly and accurately on a single computer, obviating the need for expensive and time-consuming trial and error experiments. WRENCH potentially enables scientists to make quick and informed choices when executing their workflows, software developers to implement more efficient software infrastructures to support workflows, and researchers to develop novel efficient algorithms to be embedded within these software infrastructures. In addition, WRENCH makes it possible to bring scientific workflow content into undergraduate and graduate computer science curricula. This is because meaningful knowledge can be gained by students using a single computer and the WRENCH software stack, making such learning possible even at institutions without access to high-end computing infrastructures, such as many non-Ph.D.-granting and minority-serving institutions. As a result, this work will contribute to producing computer science graduates better equipped to take an active role in the advancing of science. Due to its potentially transformative impact on scientific workflow usage, development, research, and education, this project promises to promote the progress of science across virtually all its fields, ultimately resulting in broad and numerous benefits to our society.
Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. In spite of many success stories, building large-scale workflows and orchestrating their executions efficiently (in terms of performance, reliability, and cost) remains a challenge given the complexity of the workflows themselves and the complexity of the underlying execution platforms. A fundamental necessary next step is the establishment of a solid "experimental science" approach for future workflow technology development. Such an approach is useful for scientists who need to design workflows and pick execution platforms, for WMS developers who need to compare alternate design and implementation options, and for researchers who need to develop novel decision-making algorithms to be implemented as part of WMSs. The broad objective of this work is to provide foundational software, the Workflow Simulation Workbench (WRENCH), upon which to develop the above experimental science approach. Capitalizing on recent advances in distributed application and platform simulation technology, WRENCH makes it possible to (i) quickly prototype workflows, WMS implementations, and decision-making algorithms; and (ii) evaluate/compare alternative options scalably and accurately for arbitrary, and often hypothetical, experimental scenarios. This project will define a generic and foundational software architecture that is informed by current state-of-the-art WMS designs and planned future designs. The implementations of the components in this architecture, taken together, form a generic "scientific instrument" that can be used by workflow users, developers, and researchers. This scientific instrument will be instantiated for several real-world WMSs and used for a range of real-world workflow applications. In a particular case study, it will be used with a popular WMS (Pegasus) to revisit published results and scheduling algorithms in the area of workflow planning optimizations. The objective is to demonstrate the benefit of using an experimental science approach for WMS research. Another impact of this project is that it makes it possible to include scientific workflow content pervasively in undergraduate and graduate computer science curricula, even for students without any access to computing infrastructure, by defining meaningful pedagogic activities that only require a computer and the WRENCH software stack. This educational impact will be demonstrated in the classroom in both undergraduate and graduate courses at our institutions.
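To give a feel for what simulating a workflow execution involves, the toy below steps a four-task workflow through a greedy schedule on a hypothetical two-host platform and reports the makespan. It is a deliberately minimal sketch of the general idea; the workload, hosts, and cost model are invented, and none of this is the WRENCH (or SimGrid) API.

```python
# Minimal sketch of simulating a workflow execution on a hypothetical
# two-host platform. This illustrates the kind of question a simulation
# workbench answers ("what would the makespan be?"); it is a toy.

workflow = {                  # task: (work in Gflop, dependencies)
    "stage_in":  (10.0, []),
    "analyze_a": (200.0, ["stage_in"]),
    "analyze_b": (150.0, ["stage_in"]),
    "merge":     (50.0, ["analyze_a", "analyze_b"]),
}
hosts = {"node1": 20.0, "node2": 10.0}   # Gflop per second

host_free = {h: 0.0 for h in hosts}      # time at which each host goes idle
finish = {}                              # task -> simulated completion time

def ready(done):
    return [t for t, (_, deps) in workflow.items()
            if t not in done and all(d in done for d in deps)]

while len(finish) < len(workflow):
    for task in ready(finish):
        work, deps = workflow[task]
        earliest = max((finish[d] for d in deps), default=0.0)
        # Greedily pick the host that would finish this task soonest.
        host = min(hosts, key=lambda h: max(host_free[h], earliest)
                   + work / hosts[h])
        start = max(host_free[host], earliest)
        finish[task] = start + work / hosts[host]
        host_free[host] = finish[task]

print(f"simulated makespan: {max(finish.values()):.1f} s")
```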
2019 — 2022
Casanova, Henri
Collaborative Research: Cybertraining: Implementation: Small: Integrating Core Ci Literacy and Skills Into University Curricula Via Simulation-Driven Activities
Scientific and societal progress in the 21st century relies on a large, heterogeneous, and evolving ecosystem of Parallel and Distributed Computing (PDC) technologies. And yet, most college students graduating today from computing curricula have little exposure to PDC concepts and practices. There is thus an imminent risk that the emerging scientific workforce will be ill-prepared for using and developing the computing infrastructures that are key to progress. Teaching PDC early and effectively in university curricula is notoriously difficult, in part due to the need to provide students with access to, and meaningful hands-on learning opportunities on, actual PDC platforms. This project addresses this challenge directly by relying on simulation technology: it provides students with hands-on learning opportunities that do not require access to any PDC platform. This makes it possible to teach the full gamut of PDC conceptual and practical topics effectively and at any higher education institution in the nation. The pedagogic activities being developed in this project can be integrated into existing university courses and also provide a sound basis for developing new courses, starting at freshman levels. By supporting education with a view to modernizing the scientific workforce, this project promotes the progress of science, as stated in NSF's mission.
Years of Cyberinfrastructure research and development have resulted in a rich set of abstractions and interoperable software implementations that can leverage a wide range of hardware platforms. It is crucial to provide students with hands-on pedagogic activities through which they can acquire the PDC conceptual and practical knowledge necessary to join a workforce that develops and uses this Cyberinfrastructure. Requiring that these activities be conducted on actual hardware and software stacks limits participation, because only a few institutions have access to secure, representative, stable, and possibly large deployments that can be used for educational purposes. The main insight behind this work is that simulation promotes both participation and pedagogy, because it allows students to experience arbitrary Cyberinfrastructure scenarios while requiring only access to a standard laptop computer. This is feasible due to the recent development of simulation frameworks for easily building simulators of complex distributed systems, which afford simulations that are both pedagogically accurate and scalable. Given this insight and this recent development, this project develops simulation-driven interactive pedagogic activities for a spectrum of Student Learning Objectives (SLOs), ranging from standard PDC SLOs to SLOs relevant to current and emerging Cyberinfrastructure practices. The activities are organized in modules with a prerequisite structure, and come with guidelines for integration into existing university courses, starting at freshman levels. Several pedagogic strategies are employed through which students execute interactive simulations with configurable levels of detail along various narrative paths. Research questions include determining which strategies, with which levels of simulation detail, work best for which SLOs.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
2020 — 2021
Casanova, Henri
Ccri: Planning: Collaborative Research: Infrastructure For Enabling Systematic Development and Research of Scientific Workflow Management Systems
Scientific workflows are used almost universally across research and engineering disciplines and have underpinned some of the most significant discoveries of the past several decades (e.g., first detection of gravitational waves from colliding black holes, the discovery of the Higgs boson, and the detection of an exotic nuclear decay). Workflow management systems (WMSs) are software systems that provide abstraction and automation for facilitating access to and management of distributed and heterogeneous compute and storage resources. They enable a broad range of researchers to easily define sophisticated computational processes and to then execute them efficiently on parallel and distributed computing platforms. Unfortunately, in spite of widespread adoption of workflows, the technology landscape is segmented and presents significant barriers to entry due to the existence of dozens of seemingly comparable, yet incompatible, systems. The research landscape is also disjoint, making it difficult to compare and contrast approaches, verify and reproduce results, and build upon existing work.
This project will engage with representatives from the workflows community, including researchers, developers, science and engineering users, and cyberinfrastructure experts. Through targeted community surveys and focused workshops, the project will gather a diverse set of perspectives, create a community-owned WMS inventory and common knowledge taxonomy, define an experimental methodology for measuring WMS capabilities, and develop a blueprint for a community research infrastructure. This proposed infrastructure has the potential to truly democratize workflows research, enabling researchers, postdocs, and students, irrespective of their institutions, to access cutting-edge infrastructure for comparison, evaluation, and verification of workflows research results.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
2021 — 2024
Casanova, Henri
Collaborative Research: Elements: Simulation-Driven Evaluation of Cyberinfrastructure Systems
Most scientific breakthroughs and discoveries are now preconditioned on performing complex processing of vast amounts of data as conveniently, reliably, and efficiently as possible. This requires high-end interconnected compute and storage resources, as well as software systems to automate the processing on these resources. An enormous amount of effort has been invested in producing such "cyberinfrastructure" software systems. And yet, developing and evolving these systems so that they are as efficient as possible, while anticipating future cyberinfrastructure opportunities and needs, is an open challenge. This project transforms the way in which these systems are evaluated, so that their capabilities can be developed and evolved judiciously. The traditional evaluation approach is to observe executions of these systems on real-world hardware resources. Although seemingly natural, this approach suffers from many shortcomings. Instead, this project focuses on simulating these executions. Simulation has tremendous, and untapped, potential for transforming the development cycle of cyberinfrastructure systems. Specifically, this project produces software elements that can be easily integrated into existing and future systems to afford them with simulation capabilities. These capabilities make it possible for developers to put their systems through the wringer and observe their behaviors for arbitrary operating conditions, including ones that go beyond current hardware platforms and scientific applications. Simply put, these capabilities will make it possible to establish a solid experimental science approach for the development of cyberinfrastructure systems that support current and future scientific endeavors that are critical to the development of our society.
The cyberinfrastructure has been the object of intensive research and development, resulting in a rich set of interoperable software systems that are used to support science. A key challenge is the development of systems that can execute application workloads efficiently, while anticipating future cyberinfrastructure opportunities and needs. This project aims to transform the way in which these systems are evaluated, so that their capabilities can be evolved based on a sound, quantitative experimental science approach. The traditional evaluation approach is to use full-fledged software stacks to execute application workloads on actual cyberinfrastructure deployments. Unfortunately, this approach suffers from several shortcomings: real-world experiments are time- and labor-intensive, and they are limited to currently available hardware and software configurations. An alternative to real-world experiments that does not suffer from these shortcomings is simulation, i.e., the implementation and use of a software artifact that models the functional and performance behaviors of software and hardware stacks of interest. This project uses simulation to transform the way in which cyberinfrastructure systems are evaluated as part of their long-term development cycles. This is achieved via software elements for enhancing production cyberinfrastructure systems with simulation capabilities so as to enable quantitative evaluation of these systems for arbitrary execution scenarios. Creating these scenarios requires little labor, and executions can be simulated accurately and orders of magnitude faster than their real-world counterparts. Furthermore, simulations are perfectly reproducible and observable. While this approach is general, its effectiveness will be demonstrated by applying it to a number of production systems, namely, workflow management systems. This project capitalizes on the years of development invested in the SimGrid and WRENCH simulation frameworks.
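As a toy illustration of the kind of "what-if" question simulation-enhanced systems can answer cheaply, the sketch below replays one synthetic workload against several hypothetical platform configurations, including ones no current testbed offers, and reports the predicted completion time of each. The cost model and every number are invented; this is not SimGrid or WRENCH code.

```python
# Toy "what-if" scan: estimate workload completion time for hypothetical
# platform configurations. The simple cost model (compute overlapped with
# data transfer over a shared link) and all parameters are invented.

workload = {"tasks": 5000, "gflop_per_task": 120.0, "gb_in_per_task": 0.5}

configs = [
    # (label, nodes, Gflop/s per node, shared link bandwidth in GB/s)
    ("today's cluster",     64, 50.0,  1.25),
    ("2x nodes",           128, 50.0,  1.25),
    ("2x network",          64, 50.0,  2.50),
    ("future accelerator",  64, 400.0, 1.25),
]

def simulated_time(nodes, speed, link):
    compute = workload["tasks"] * workload["gflop_per_task"] / (nodes * speed)
    transfer = workload["tasks"] * workload["gb_in_per_task"] / link
    # Bottleneck model: compute and transfers overlap; the slower dominates.
    return max(compute, transfer)

for label, nodes, speed, link in configs:
    t = simulated_time(nodes, speed, link)
    print(f"{label:>18}: {t / 3600:.2f} h")
# For this invented workload the scan shows a network-bound regime:
# doubling nodes changes nothing, while doubling the link halves the time.
```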
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
2021 — 2024
Casanova, Henri
Collaborative Research: Oac Core: Simulation-Driven Runtime Resource Management For Distributed Workflow Applications
Many scientific breakthroughs in domains such as health, climate modeling, particle physics, seismology, etc., can only be achieved by performing complex processing of vast amounts of data. This processing is automated by software systems that use the compute, storage, and network hardware provided by the cyberinfrastructure. In addition to automation, a key objective of these systems is the efficient use of the resources as measured by cost and energy usage, while making the processing as fast as possible or as needed. To this end, these systems must make decisions regarding which resources should be used to do what and when. Many such systems are used in production today and make such decisions. Yet making good, let alone best, decisions is still an open research challenge. Theoretical research has proposed solutions that are difficult to put into practice, and practical solutions are known to not make good decisions, or at least not consistently so. However, both theory and practice follow the same basic philosophy: make decisions by reasoning about known information on what needs to be computed and on what hardware resources are available. This philosophy has shown its limits, so this project adopts a radically different approach. The key idea is to repeatedly execute fast, computationally inexpensive simulations of the application execution in order to evaluate large sets of potential resource management decisions and automatically select the most desirable ones. The benefits of this approach will be demonstrated for several software systems used to support scientific applications that are critical for the development and sustainability of society.
Software systems are used to run scientific applications on advanced cyberinfrastructure. These systems automate application execution and make resource management decisions along several axes, including selecting and provisioning (virtualized) hardware, picking application configuration options, and scheduling application activities in time and space. Their objective is to optimize application performance as well as a set of resource usage efficiency metrics that include monetary and energy costs. Consequently, the resource management decision space is enormous, and making good decisions is a steep challenge that has been the subject of countless efforts, both from theoreticians and practitioners. However, the challenge is far from solved: theoreticians produce solutions that are rarely used by practitioners, and conversely practitioners implement solutions that may be highly sub-optimal because they are not informed by theory. This project resolves this disconnect by obviating the need to hand-design complex resource management strategies. The key idea is to use online simulations to search the resource management decision space rapidly at runtime. Large numbers of fast simulations of the application's execution are executed throughout that very execution, so as to evaluate many potential resource management options and automatically select desirable ones. This approach thus shifts the overall problem from the design of complex resource management algorithms to the enumeration of many resource management decisions. This transformation of resource management practice in cyberinfrastructure systems not only renders the resource management problem tractable but also unlocks previously out-of-reach resource management decisions. The benefits of this transformation will be demonstrated for a critical class of production systems and applications, specifically Workflow Management Systems and the scientific applications they support.
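The core loop is easy to sketch: at a decision point during execution, enumerate candidate resource management decisions, score each with a fast simulation of the remaining work, and enact the best. The toy below does this for a hypothetical VM-provisioning decision; the decision space, the surrogate simulator, and all numbers are invented for illustration.

```python
import itertools
import random

# Sketch of online, simulation-driven resource management: enumerate
# candidate decisions, score each with a fast simulation of the remaining
# execution, and enact the best. Everything here is hypothetical.

random.seed(1)
pending_tasks = [random.uniform(50, 200) for _ in range(40)]   # Gflop each

def fast_simulation(num_vms, vm_gflops, tasks):
    """Cheap surrogate for the remaining execution: greedy list schedule."""
    loads = [0.0] * num_vms
    for work in sorted(tasks, reverse=True):
        i = loads.index(min(loads))
        loads[i] += work / vm_gflops
    return max(loads)                       # predicted remaining makespan

def cost(num_vms, makespan, dollars_per_vm_hour=0.10):
    return num_vms * (makespan / 3600) * dollars_per_vm_hour

# Candidate decisions: how many VMs of which type to provision next.
candidates = itertools.product([2, 4, 8, 16], [("small", 10.0), ("big", 40.0)])
best = None
for num_vms, (vm_type, gflops) in candidates:
    mk = fast_simulation(num_vms, gflops, pending_tasks)
    score = (mk, cost(num_vms, mk))         # minimize time, then dollars
    if best is None or score < best[0]:
        best = (score, num_vms, vm_type)

(mk, dollars), n, t = best
print(f"enact: {n} x {t} VMs -> predicted {mk:.0f} s, ${dollars:.4f}")
```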
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.