2009 — 2014 | Sankaralingam, Karthikeyan
Career: a Vertically Integrated Approach to Microprocessor Design For Unpredictable Silicon @ University of Wisconsin-Madison
The basic hardware technology used to build microprocessors and the applications we run on them are radically changing. Hardware building blocks are becoming increasingly brittle and error-prone, leading to a fundamentally unpredictable hardware substrate. Applications are changing as well: emerging classes of applications increasingly rely on probabilistic methods, have an inherent tolerance for uncertainty, and do not require the hardware to be correct all the time, which provides an opportunity to use hardware creatively.
This CAREER proposal develops a vertically integrated approach that exposes these hardware properties to applications in order to develop more efficient system-wide solutions. This simplifies the physical design and manufacturing of microprocessors and can make them more efficient. The PI unifies analyses across different parts of the microelectronics system stack to develop new interfaces between layers and techniques that expose more information across these layers. In the integrated education plan, the PI seeks to broaden the computer architecture curriculum by educating students on the implications of these physical silicon technology drivers for microprocessor design. The research component of this proposal has broad implications across all microelectronic systems, since it is becoming harder to manufacture ever-smaller devices. The proposed innovations can enable continued technology scaling, have the potential to reduce complexity and cost in the overall microelectronics system stack, and help maintain the US lead in microelectronics.
This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).
2009 — 2013 | Townsend, Richard; Sankaralingam, Karthikeyan
Grassy: a Hardware-Accelerated Spectral Synthesis Engine For Asteroseismic Mode Identification @ University of Wisconsin-Madison
A hardware-accelerated, graphics-processing-unit-based computational machine will be designed, developed, and commissioned to support reduction and analysis of time series of stellar spectra. The goal of this project is to study stellar variability in order to infer the internal structure of the observed stars. This technique, referred to as asteroseismology, is a unique method for probing a star's internal structure from measurements of its discrete oscillation spectrum. Asteroseismology was originally developed for studying the sun, but advances in space-based observations are allowing its application to other stars.
2009 — 2013 | Estan, Cristian (co-PI); Sankaralingam, Karthikeyan
Nets:Small:a Unified Lookup Framework to Enable the Rapid Deployment of New Protocols in High-Speed Routers @ University of Wisconsin-Madison
High-end routers and switches perform performance-critical lookups into large data structures based on network packet content. Current designs rely on specialized hardware modules for high performance. While mostly sufficient today, emerging new protocols create a problem: deploying them requires expensive equipment upgrades. To support the rapid deployment of new data-plane protocols, we need a flexible hardware/software framework.
This project is developing such a unified framework for future network devices with four components: a) an abstract execution model to represent hardware, b) a toolchain to ease implementation, c) quantitative performance evaluation of protocols, and d) programmable hardware architectures specialized for the core operations in network protocols. Leveraging trends in tiled architectures, the project is developing a specialized tiled architecture that moves computation close to storage and thus provides efficient lookups. The project is also investigating how these design methodologies extend beyond protocol processing to payload inspection, as required by intrusion prevention, application identification, and other functionality.
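The "core operations in network protocols" above center on lookups, and longest-prefix match (as used in IP forwarding) is the canonical example. A minimal, hedged software model of that operation follows; it is an illustrative linear-scan sketch, not the project's tiled hardware design, and the table contents are invented for the example.

```python
import ipaddress

def lpm(table, addr):
    """Return the next hop of the longest prefix containing addr.

    A naive linear scan over the routing table; hardware frameworks
    like the one proposed implement this with specialized lookup
    structures, but the input/output behavior is the same.
    """
    best, hop = -1, None
    ip = ipaddress.ip_address(addr)
    for prefix, nexthop in table.items():
        net = ipaddress.ip_network(prefix)
        if ip in net and net.prefixlen > best:
            best, hop = net.prefixlen, nexthop
    return hop

# Example routing table (illustrative values only).
table = {"10.0.0.0/8": "A", "10.1.0.0/16": "B", "0.0.0.0/0": "C"}
assert lpm(table, "10.1.2.3") == "B"      # most specific prefix wins
assert lpm(table, "10.2.0.1") == "A"
assert lpm(table, "192.168.1.1") == "C"   # default route
```

The contrast between this O(n) scan and the constant-time lookups that routers require is precisely why specialized lookup hardware, and a framework for programming it, matters.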
Deploying and implementing new protocols is challenging. This project will result in the specification of flexible hardware for high-end routers. The flexibility to support multiple protocols using a single toolchain and architecture framework will reduce the development costs of future high-end routers and speed up the deployment of new protocols. Another outcome will be the open-source release to the academic community of the toolchain for developing high-speed implementations of protocols, reference implementations of various protocols, standard data sets, and a NetFPGA implementation of the proposed hardware architecture.
2009 — 2013 | Vernon, Mary (co-PI); Sankaralingam, Karthikeyan
Shf: Small: Multi-Core Architecture, Applications, and Tools Co-Design @ University of Wisconsin-Madison
Technology limitations, emerging applications, and changing usability trends are ushering in a new era in multi-core computer systems. A key problem for both application and microprocessor design is that applications largely evolve independently from the architectural development of microprocessors. This is a problem for architecture research because developing efficient architectural solutions requires realistic characterization of next-generation applications. From a system-design perspective, understanding application behavior is crucial for building an efficient system, since applications must be optimized to exploit the mechanisms the architecture provides. This proposal seeks to rethink next-generation multi-core systems - both software and hardware architectures - using state-of-the-art quantitative design tools.
The research explores two key ideas. The first is a hybrid task-level/data-level parallelism execution model for emerging applications that have abundant, but not synchronization-free, parallelism. The second is the development of new, highly accurate and efficient quantitative models to evaluate architecture and application design alternatives at scale and over a wide range of application workloads. The project seeks to provide a suite of quantitative tools that close the development loop of design, evaluation, and analysis of software's behavior on hardware, allowing the tuning of both software and hardware. The project takes real-time graphics as a challenge application and derives a full system, called Copernicus, for future real-time graphics that can provide significantly higher image quality. The project will also integrate these quantitative models into the curriculum and disseminate them to the research community.
The innovations proposed in this research have the potential for significantly aiding microprocessor and application development for future systems. The development of a full system specification for ray-tracing can trigger an inflection in the evolution of both programmable processors and specialized graphics processors.
2011 — 2014 | Sankaralingam, Karthikeyan
Csr: Small: Formal Models, Processor Architecture, and Evaluation of Sampling For Hardware Reliability @ University of Wisconsin-Madison
Device physics, manufacturing, and engineering challenges in process scaling make it significantly harder to produce reliable transistors for future technologies. Many academic experts, industry consortia, and research panels have warned that future generations of silicon technology are likely to be much less reliable, and that multi-core chips whose cores fail in the field due to faults in silicon are around the corner. Concurrently with this reduced reliability, the energy efficiency of individual transistors is not keeping up with the increase in transistor density. These two trends portend a perfect storm: as improvements in transistor energy efficiency slow down, transistors are becoming highly unpredictable, which will force further inefficiencies. Addressing hardware reliability is a fundamental problem for microprocessors and hence for sustaining the IT revolution. This project investigates mechanisms that allow chips and the higher levels of software to continue working even when devices fail. The basic question the project addresses is how to detect when chips fail.
The core idea this project builds upon is the principle of sampling. Instead of checking for failures all the time, the idea is to use a periodic sampling window to check for device failures. The project investigates formal models, hardware implementations, and evaluation to understand the effect of device failures and the impact of the detection techniques.
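The key property of periodic sampling is that a permanent fault's detection latency is bounded by the sampling period, while the checker is powered on for only a fraction of the time. A hedged back-of-the-envelope simulation of that bound follows; the cycle counts, window size, and period are invented parameters, not figures from the project.

```python
import random

def run_with_sampling(total_cycles, window, period, fault_cycle):
    """Simulate periodic-sampling fault detection.

    A checker is active only during a `window`-cycle slice at the
    start of every `period` cycles. Returns the cycle at which a
    permanent fault appearing at `fault_cycle` is first observed,
    or None if it escapes within the simulated interval.
    """
    for cycle in range(fault_cycle, total_cycles):
        if cycle % period < window:  # checker active this cycle
            return cycle
    return None

# A permanent fault at a random cycle is caught no later than the
# start of the next sampling window, so detection latency < period,
# while the checker duty cycle is only window/period = 10%.
random.seed(0)
fault = random.randrange(10_000)
detected = run_with_sampling(10_200, window=10, period=100, fault_cycle=fault)
assert detected is not None
assert detected - fault < 100   # latency bounded by the period
```

This is of course a toy for intuition: it models only permanent (stuck) faults, whereas the project's formal models also have to account for intermittent failures that may not persist until the next window.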
2012 — 2016 | Sifakis, Eftychios; Sankaralingam, Karthikeyan; Patel, Jignesh
Csr: Small: Accelerating Towards the Hardware Specialization Era: a Holistic Approach @ University of Wisconsin-Madison
Due to fundamental power limitations, the energy efficiency of transistors is no longer improving, making computer chips power-limited. Without significant innovations in how microprocessor hardware is built, the performance improvements we have grown accustomed to cannot be sustained. Specialization using accelerators, which reduces power, is a promising way forward, as evidenced by the commercial success of SIMD accelerators and GPUs and by new computer hardware research concepts. There are many choices in this design space of accelerators, spanning the granularity of the work they target, the baseline processor they accelerate, the compilation and execution model, and the abstraction presented to the application developer. This wide choice introduces a fundamental problem: tools and application frameworks for reasoning about the design space do not exist. This project's overarching goal is to develop quantitative tools, mechanisms, and detailed application studies covering the design space of accelerators.
The project provides an array of techniques for the specialization era: a novel framework to guide microprocessor designs and application developers to reason about accelerators, architecture mechanisms, and highly optimized applications for accelerators. The tools and mechanisms help sustain performance improvements in future power-constrained technology generations. The project's prototype system, simulation enhancements, and compiler framework are released as tools for the community. The project's focus on computational-physics applications has numerous uses that positively impact broad swaths of society by enabling applications ranging from virtual surgery to modeling vehicle traversal on snow terrains. The PIs continue mentoring under-represented students and undergraduate students as part of this project.
2012 — 2017 | Jimenez, Daniel; Jha, Somesh (co-PI); Sankaralingam, Karthikeyan
Shf: Medium: Idempotent Processing and Architectures @ University of Wisconsin-Madison
For many decades, Moore's Law has allowed exponential growth in computing capability while simultaneously reducing the power consumed by digital devices. Due to fundamental material properties and engineering challenges, in the future the power and energy efficiency of transistors that are the building blocks of digital devices will not improve significantly. Thus to continue providing performance improvements without increasing power consumption, new techniques to design microprocessors are required. This research project looks at a new approach to build microprocessors to make them more energy efficient. The main idea in this research project is to develop techniques allowing microprocessors to efficiently predict without having to expend power-hungry resources to recover in case the prediction is wrong. The research leverages the mathematical principle of idempotence (doing something multiple times producing the same result) in a novel way. In this project, this principle is applied to microprocessor design to develop a class of processors called Idempotent Processors. The research addresses formal theoretical analysis of the technique, ways to build software compilers, and microprocessor designs spanning CPUs to GPUs to exploit this principle.
The core idea of this project is to use the property of idempotence: performing an idempotent operation many times produces the same result. The research builds upon the following insight: applications naturally decompose into a continuous series of idempotent regions; i.e., their execution can be broken down into a set of regions where each region is idempotent - re-execution has no side effects. The research develops the idea of Idempotent Processors, whose fundamental abstraction is executing idempotent regions of code. This allows novel modifications to the microprocessor pipeline and permits many forms of speculation without the need to restore any state prior to re-execution. This design approach unifies speculation for performance, reliability, and energy-efficient execution under one principled approach. The static-analysis research formalizes the notion of idempotence and investigates mechanisms for determining idempotent regions. The compiler implementation for various ISAs (instruction set architectures), CPUs (central processing units), and GPUs (graphics processing units) evaluates the approach quantitatively.
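The property described above can be made concrete with a small sketch: a region is idempotent when it never overwrites its own live inputs, so recovery from a fault or misspeculation is just a jump back to the region's entry, with no checkpoint to restore. The code below is an illustrative software analogy (the function names and the retry harness are invented for the example, not the project's compiler mechanism).

```python
def saxpy_region(a, x, y):
    """An idempotent region: reads a, x, y; writes only a fresh
    output list, so it never clobbers its own inputs and can be
    re-executed from the start any number of times."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def execute_with_retry(region, *inputs, failures=0):
    """Model of recovery on an Idempotent Processor: after a detected
    fault, simply re-enter the region -- no state restoration needed.
    `failures` earlier attempts are treated as having faulted."""
    attempts = 0
    while True:
        result = region(*inputs)  # side-effect-free re-execution
        attempts += 1
        if attempts > failures:
            return result

x, y = [1.0, 2.0], [10.0, 20.0]
first = saxpy_region(2.0, x, y)                               # one clean run
again = execute_with_retry(saxpy_region, 2.0, x, y, failures=2)  # two "faults"
assert first == again == [12.0, 24.0]   # same result despite re-execution
```

By contrast, a region that updated `y` in place would not be idempotent: a second execution would read its own partial output, which is exactly the case the project's static analysis must detect and break into separate regions.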
The project's end-to-end solutions across multiple synergistic directions have the potential for disruptive impact. The project involves collaborative work between UW-Madison and UT-San Antonio, involves undergraduate researchers and exchange visits between institutions, and explores integrated curriculum enhancement and outreach across UW and UTSA. The project's multi-disciplinary, multi-institution collaboration provides distributed impact.
2012 — 2017 | Jha, Somesh; Sankaralingam, Karthikeyan
Twc: Phase: Medium: Collaborative Proposal: Understanding and Exploiting Parallelism in Deep Packet Inspection On Concurrent Architectures @ University of Wisconsin-Madison
Deep packet inspection (DPI) is a crucial tool for protecting networks from emerging and sophisticated attacks. However, it is becoming increasingly difficult to implement DPI effectively due to the rising need for more complex analysis, combined with the relentless growth in the volume of network traffic that these systems must inspect. To address this challenge, future DPI technologies must exploit the power of emerging highly concurrent multi- and many-core platforms. Unfortunately, current DPI systems severely limit their use of parallelism by either resorting to coarse-grained load balancing or restricting their analysis to very simple, hard-coded detectors.
To fully exploit parallel hardware platforms, this project develops a comprehensive approach that introduces parallelism across all stages of the complex DPI pipeline. We investigate application-independent scheduling strategies that take existing DPI analyses and automatically parallelize their processing. We do so by mapping them into a domain-specific intermediate representation that abstracts from the specifics of the underlying hardware architecture while providing low-level consistency guarantees. Conceptually, the project's goal is to virtualize and abstract parallelism as a fundamental primitive, just as virtual memory abstracts away physical memory size limitations from programmers.
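The "coarse-grained load balancing" baseline that this project aims to go beyond can be sketched in a few lines: packets of the same flow are hashed to the same worker, which preserves per-flow analysis order but caps parallelism at the number of distinct, active flows. This is a hedged illustration of the baseline only; the field names are generic, not from any particular DPI system.

```python
import hashlib

def flow_worker(packet, n_workers):
    """Coarse-grained DPI scheduling: hash the 5-tuple so every
    packet of a flow lands on the same worker, keeping per-flow
    state and ordering local to one core."""
    five_tuple = (packet["src"], packet["dst"],
                  packet["sport"], packet["dport"], packet["proto"])
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_workers

# Three packets of one flow: all deterministically map to one worker,
# so a single heavy flow cannot be spread across cores -- the
# limitation motivating finer-grained, analysis-level parallelism.
pkts = [{"src": "10.0.0.1", "dst": "10.0.0.2",
         "sport": 1234, "dport": 80, "proto": "tcp"}] * 3
workers = {flow_worker(p, n_workers=8) for p in pkts}
assert len(workers) == 1
```

The project's intermediate-representation approach targets exactly the case this sketch cannot handle: parallelizing the analysis *within* a flow while still providing the consistency guarantees that per-flow serialization gives for free.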
2014 — 2017 | Lu, Shan; Sankaralingam, Karthikeyan
Xps: Full: Cca: Production-Run Failure Recovery Based Approach to Reliable Parallel Software
Concurrency bugs are a severe threat to system reliability in the multi-core era. Approaches to handling concurrency bugs and improving the reliability of production-run parallel software are sorely needed. This project aims to create a new parallel-computing paradigm. Its intellectual merit lies in pioneering the treatment of run-time failure recovery as the default for parallel programs, reshaping every aspect of parallel-program development and maintenance. The project's broader significance is that it will help lower the costs of software development, in-house testing, failure diagnosis, and bug repair, broadly benefiting society through better-performing parallel software.
Specifically, the proposed framework includes five components: (1) a featherweight run-time recovery framework that utilizes natural program idempotence for concurrency-bug failure recovery; (2) a new code-development system that guides developers to write software with improved recoverability; (3) a new in-house testing system in which the testing focus is shifted towards hard-to-recover code; (4) a new run-time monitoring system that performs monitoring on demand to support run-time recovery; (5) a new off-line failure-diagnosis system that leverages feedback from recovery for failure diagnosis and fixing. These five components work together to significantly improve the reliability and lower the development cost of parallel software.
2016 — 2019 | Sankaralingam, Karthikeyan
Shf:Small:Pushing the Limits of Transparent Specialization @ University of Wisconsin-Madison
For many years, microprocessors took advantage of Moore's Law (successively smaller generations of transistors) and the accompanying reduction in power consumption to build better-performing chips. Due to fundamental limitations, it is no longer possible to cost-effectively make smaller transistors or significantly reduce their power consumption. This research project develops future microprocessors that are organized differently, using a limited transistor budget more effectively to provide better chips. The main idea this research project explores is a way to tailor certain parts of a chip to certain tasks, thereby making each component very small and power-efficient. A program's execution moves from one such component to another, each tuned for that phase of the program. The curriculum enhancements will give students significant experience in designing hardware. This research will help steer microprocessor designs in novel ways to sustain performance improvements and help sustain information-technology leadership.
The specific approach taken by this research is a unique and novel form of specialization called behavior-specialized acceleration (BSA). This paradigm exploits program behaviors and their relationship to the hardware microarchitecture, and it is workload-domain agnostic. Specializing for program behaviors is advantageous both because fewer accelerators can target a large variety of codes, and because these behaviors are typically analyzable by a compiler, meaning their use can be transparent to programmers. In particular, the research will develop a chip organization that includes an ExoCore fabric and an EndoCore fabric. The ExoCore fabric is a processor-core organization that uses behavior specialization to transparently improve the execution of general-purpose workloads. The EndoCore fabric employs different behavior specializations and attempts to support computationally intensive domains using a single hardware architecture and well-defined software interfaces. This research develops mechanisms, an FPGA prototype of a family of ExoCore and EndoCore designs, an accompanying compiler, and a detailed performance evaluation.
2017 — 2018 | Sankaralingam, Karthikeyan; Ganco, Martin (co-PI)
Pfi:Air - Tt: Memory Processing Unit: a Low Power Processor For Analytics Applications @ University of Wisconsin-Madison
This PFI: AIR Technology Translation project focuses on translating 3D chip-stacking technology to develop a new microprocessor architecture called the Memory Processing Unit (MPU). The new architecture offers the promise of faster, more energy-efficient computation than current solutions. This is important because a large body of applications, such as deep learning, big-data analytics, and data science, require significant processing capability. While capability requirements continue to increase to meet the needs of these new applications, the rate of improvement in microprocessor power efficiency is decreasing (in other words, the trends predicted by Moore's Law are beginning to slow).
This project will result in a prototype chip-design of the MPU with complete software API (Application Programming Interface) for end-to-end application demonstrations based on real-time speech recognition and big-data analytics for Internet search capabilities. The MPU includes the following unique features: an energy-efficient simple core implementation (including exploration of mechanisms for pipeline organization, virtual-memory, and coherence), an implementation that connects 128 such cores together, cores connected through 3D links to memory, and end-to-end software implementation. Compared to state-of-art server chips in the market, the MPU architecture provides two-fold to 12-fold calculation speedup while reducing energy consumption by 10-fold.
This project addresses the following two technology gaps as it translates from research discovery toward commercial application: a) the difficulty of developing end-to-end software on this new highly concurrent architecture and the determination of the mechanisms required, and b) extensive exploration of various application domains (initially demonstrating speech recognition and internet-search capabilities) to determine quantitatively the benefits of this architecture over the state of the art.
Personnel involved in this project, including graduate students and undergraduates, will receive innovation and entrepreneurship experience through technology-commercialization activities, customer interviews, and business development. In addition, the team will work with entrepreneurship programs such as D2P at UW-Madison.
2018 — 2021 | Sankaralingam, Karthikeyan
Fomr: Collaborative Research: Single-Thread Multi-Accelerator Execution to Close the Dennard Scaling Gap @ University of Wisconsin-Madison
With ever-slowing scaling trends in microprocessor technologies, traditional techniques of improving processor performance are no longer viable, and achieving higher performance requires a dramatically different approach. This project develops a multicore chip microarchitecture using specialized accelerators and code-injection techniques without needing to modify user-level software, compilers and operating systems. The impact of this research is to help steer microprocessor design in novel ways that can help sustain performance improvements, especially for datacenter and big-data computing.
This project builds on a recent promising technique that involves offloading program phases onto specialized processors (accelerators) tuned to execute programs with specific characteristics (i.e., parallelism, control dependence, memory behavior) at extremely high efficiency. Two main challenges motivate the major thrusts of this work. The first is how to design a practical system for managing the execution of heterogeneous accelerators and dynamic translation. The second is how to design a set of accelerators that provide integer factors of improvement over general-purpose processors' performance and energy efficiency. This work addresses the first challenge by designing a disaggregated translation subsystem, including region-detection hardware at each core, a set of disaggregated compiler cores and a translation cache, and a hardware/software layer that dynamically re-maps logical threads to physical cores based on dynamic code properties and load balancing. To design a set of balanced accelerators, this work analyzes programs to identify key program behaviors and develops targeted accelerators for each. Finally, this work includes the design of synthesis-time resource-allocation algorithms that will co-optimize the choice of cache interface, general core attributes, and accelerator execution model.
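The re-mapping of logical threads to physical cores based on dynamic code properties can be sketched as a simple dispatch table. This is a hypothetical illustration: the property names (`data_parallel`, `regular_memory`, `branch_heavy`) and engine labels are invented for the example and are not the project's actual behavior taxonomy or accelerator set.

```python
def choose_engine(phase):
    """Map a detected program phase to the execution engine best
    suited to its dynamic properties (illustrative policy only)."""
    if phase["data_parallel"] and phase["regular_memory"]:
        return "vector-accelerator"        # dense, regular loops
    if phase["data_parallel"]:
        return "gather-scatter-accelerator"  # parallel but irregular
    if phase["branch_heavy"]:
        return "out-of-order-core"         # control-dependent code
    return "in-order-core"                 # low-ILP default

# Two phases detected at run time: a dense loop and a branchy phase.
phases = [
    {"data_parallel": True,  "regular_memory": True,  "branch_heavy": False},
    {"data_parallel": False, "regular_memory": False, "branch_heavy": True},
]
assert [choose_engine(p) for p in phases] == [
    "vector-accelerator", "out-of-order-core"]
```

The point of the sketch is the transparency claim in the abstract: the decision consumes only dynamically detected properties, so user-level software, compilers, and operating systems need no modification.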
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.