2006 — 2010
Taylor, James (co-PI); Nekrutenko, Anton; Albert, Istvan (co-PI)
Activity Code: N/A (no activity code was retrieved)
Tailoring Genomic Data to the Needs of Experimental Biologists and Educators @ Pennsylvania State Univ University Park
Pennsylvania State University is awarded a grant to develop a resource that integrates various types of genomic data (sequences, alignments, microarray data) with an extensive array of analysis tools. The system, G2, will enable researchers to follow the "data->analyses->experiments" path that is critically needed to realize the goals of genome sequencing and functional genomics projects. It will have a very simple, intuitive interface that will allow instructors to demonstrate principles of genomic biology in school and college classrooms. Building on longstanding work from the bioinformatics group at Penn State University and its collaborators in developing genome analysis tools and databases, the G2 metaserver represents a new generation of comparative genomics resources. The project will develop a simple, compact, and powerful system for the integration of any genome-based information; create a framework that makes it exceptionally easy for any developer to add analysis tools to the system; and explore the utility of new "Google-like" user interfaces that interpret simple input and guide the user during the search process. An educational version of the system will enable teachers to show genomic biology in action and to demonstrate the strong connection between genomics and classical disciplines such as evolutionary biology. Undergraduate and graduate students working on the project will acquire interdisciplinary training in large-scale genome analysis, data mining, and programming. A minority undergraduate student will be recruited to work on this project in the summer. The project data will be utilized in a lecture and in a computer class presented at the Biotechnology and Bioinformatics Workshops for K-12 science educators conducted annually at Penn State.
2009 — 2011
Nekrutenko, Anton
Activity Code: R01 — To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.
An Efficient Lightweight Environment For Biomedical Computation @ Pennsylvania State University-Univ Park
DESCRIPTION (provided by applicant): The translation from large volumes of experimental data to clinically relevant insights relies on sophisticated computational analysis tools that can handle the enormous high-throughput sequence, polymorphism, and functional datasets. Developing appropriate tools is necessary but not sufficient, because the independent analysis tools in themselves do not solve an increasingly problematic barrier blocking the bench-to-bedside path outlined in the NIH Roadmap for medical research: making powerful new computational tools readily accessible and useful for experimental biologists. Developing usable and consistent user interfaces requires significant effort, and few tool developers can afford to devote time and resources to this goal. Currently many powerful, independent analysis tools exist, but lack integrated, easy-to-use interfaces that would allow experimental biologists to take advantage of them. Thus, developing tools to analyze overwhelming amounts of data is no longer the main challenge in biomedical research. Instead the problem lies in making existing tools usable for bench biologists so that they can take full advantage of existing data. We have developed a system - GALAXY - that makes substantial progress toward solving this problem. For experimental biologists, it provides an intuitive and consistent interface for performing sophisticated analyses with minimal effort, regardless of the scale of data involved. For computational tool developers, it makes it easy to integrate existing tools with a modern user interface by writing a simple, concise interface description. For data providers, it features a simple, elegant data access protocol. 
Thus, GALAXY bridges a critically important gap between data resources, computational tools and users, by making it easy to modernize the interfaces of any existing tool, freeing developers of new tools from the need to develop interfaces from scratch, and facilitating tool interoperability and complex analyses by seamlessly integrating analysis outputs, applications and external data. Here we propose to develop novel features specifically designed for translational research. First, we will engineer a tool integration framework streamlining delivery of analysis software to experimentalists. Second, we will develop a statistical genetics toolkit allowing clinicians to manipulate and interpret human variation data on any scale. Third, we will implement the first integrated system for analysis of short-read sequencing data. Fourth, we will design utilities for manipulation of the most valuable comparative genomics resource - multi-genome alignments. Finally, we will build a workflow system to enable reproducible and collaborative analysis of genomic data. PUBLIC HEALTH RELEVANCE: Genomic data discovery is no longer a limiting factor for much of the medical research. The NIH Roadmap recognizes that many challenges in biomedical research will only be overcome through appropriate investment to improve integrative access to existing data and tools, so researchers can more effectively and rapidly translate their findings into practice. The proposed project addresses this challenge by allowing biomedical researchers to take advantage of the enormous sequence, polymorphism, and functional datasets easily and effectively.
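The "simple, concise interface description" that tool developers write can be illustrated with a minimal sketch in the style of Galaxy's XML tool configuration. This is an illustrative example, not a tool from the proposal: the tool id, command, and parameter names are invented.

```xml
<tool id="toupper_example" name="Uppercase converter" version="0.1.0">
  <!-- Shell command the framework runs; $input and $output are filled in automatically -->
  <command>tr a-z A-Z &lt; '$input' &gt; '$output'</command>
  <inputs>
    <!-- One dataset chosen from the user's analysis history -->
    <param name="input" type="data" format="txt" label="Text file to convert"/>
  </inputs>
  <outputs>
    <!-- Result dataset placed back into the history -->
    <data name="output" format="txt"/>
  </outputs>
  <help>Converts a text dataset to upper case.</help>
</tool>
```

From a description like this, the framework generates the web form, executes the command, and tracks provenance; the tool author writes no interface code at all.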
2009 — 2014 |
Lazarus, Ross Taylor, James (co-PI) [⬀] Nekrutenko, Anton Pond, Sergei |
N/AActivity Code Description: No activity code was retrieved: click on the grant title for more information |
Cyberinfrastructure For Accessible and Reproducible Research in Life Sciences @ Pennsylvania State Univ University Park
Pennsylvania State University University Park is awarded a grant to extend an existing software system, Galaxy (http://galaxyproject.org), that integrates computational biology tools for experimental biologists to access sequence, polymorphism and functional datasets. Analytic functions to be enhanced include a set of tools allowing bench scientists to perform large-scale analyses with whole-genome multiple alignments without ever leaving their web browsers; an interactive statistical genetics tool kit for comprehensive analysis of data generated from high-throughput genotyping experiments; an evolutionary analysis toolset for detection of selection, recombination and other critical evolutionary parameters, that can be applied directly to existing sequence and alignment datasets; and novel approaches for improving reproducibility of research in life sciences by implementing simple yet powerful analysis workflows that can be constructed by example and explicitly. The project is a collaborative effort with Brigham and Women's Hospital, Emory University, and University of California, San Diego.
Galaxy bridges a critically important gap between computational tools and users, by making it trivial to modernize the interfaces of existing tools, freeing developers of new tools from the need to develop interfaces from scratch, and facilitating tool interoperability and complex analyses by seamlessly integrating diverse tools and databases. Galaxy has two thriving communities of users: those who use the public Galaxy instance to access a variety of tools and databases with nothing more than a web browser, and those who take advantage of the Galaxy framework's portability and minimal installation requirements to create their own Galaxy instances and provide easy access to locally developed tools and private data.
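The "constructed by example" idea — deriving a reusable workflow from the recorded steps of a completed analysis — can be sketched in a few lines of Python. The data structures, tool names, and field names below are illustrative assumptions, not Galaxy's actual implementation:

```python
def extract_workflow(history):
    """Turn a concrete analysis history into a reusable template:
    tool ids, parameters, and step wiring are kept; the specific
    datasets that were analyzed are dropped."""
    steps = []
    for step in history:
        steps.append({
            "tool": step["tool"],
            "params": step["params"],
            # index of the upstream step that produced this step's input
            "input_from": step.get("input_from"),
        })
    return steps

# A recorded two-step analysis (hypothetical tools and dataset names)
history = [
    {"tool": "fastq_groomer", "params": {"quality": "sanger"}, "dataset": "run1.fastq"},
    {"tool": "bowtie", "params": {"genome": "hg18"}, "input_from": 0, "dataset": "run1.sam"},
]
workflow = extract_workflow(history)  # reusable on any new input dataset
```

The extracted template keeps the provenance links between steps, which is what makes the analysis reproducible on new data.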
2009 — 2010
Nekrutenko, Anton; Taylor, James Peter
Activity Code: RC2 — To support high impact ideas that may lay the foundation for new fields of investigation; accelerate breakthroughs; stimulate early and applied research on cutting-edge technologies; foster new approaches to improve the interactions among multi- and interdisciplinary research teams; or, advance the research enterprise in a way that could stimulate future growth and investments and advance public health and health care delivery. This activity code could support either a specific research question or propose the creation of a unique infrastructure/resource designed to accelerate scientific progress in the future.
Dynamically Scalable Accessible Analysis For Next Generation Sequence Data
DESCRIPTION (provided by applicant): Project Summary Wide availability of "next-generation" sequencing (NGS) instruments has enabled any investigator, for a modest cost, to produce enormous amounts of DNA sequence data. However, working with these raw sequences presents significant problems for individual investigators, small labs, or core facilities. For an experimental group with no computational expertise, simply running a data analysis program is a barrier, let alone building a compute and data storage infrastructure capable of dealing with NGS data. Fortunately, a computational model - "Cloud computing" - has recently emerged and is ideally suited to the analysis of large-scale sequence data. In this model, computation and storage exist as virtual resources, which can be dynamically allocated and released as needed. Importantly, cloud resources can provide storage and computation at far less cost than dedicated resources for certain use cases. However, formidable challenges need to be addressed to make these resources available to individual investigators. Specifically, although cloud computing provides a way to acquire computational resources on demand, the resources provided are either virtual machines on the Internet or specific programming libraries, which are unusable for experimentalists. Thus, a viable analysis solution needs to be accessible and deployable without informatics expertise; it must efficiently and automatically use dynamically scalable resources, while taking into account time and cost; it must include appropriate analysis tools and easily support addition of new tools as they emerge. We have previously developed a software system - Galaxy (http://galaxyproject.org) - that provides a robust framework for addressing these needs. Here we propose to significantly extend this framework to allow any experimentalist to perform large-scale NGS analyses utilizing the power of cloud computing infrastructure.
In particular, we will modify the existing Galaxy framework to run entirely within the cloud. We will adapt the way Galaxy schedules and executes jobs to make effective use of cloud-style resources. We will provide a mechanism for individual users to create and deploy custom Galaxy instances on a cloud through an entirely web-based interface. Finally, we will test our approach by applying the developed facilities to the existing human re-sequencing data in order to uncover hidden patterns of mutations causing human genetic disease on a very large scale. PUBLIC HEALTH RELEVANCE: Project Narrative Increasingly available and inexpensive high-throughput DNA sequencing holds great promise for biomedical research, but informatics challenges block the full realization of the potential of this transformative technology. In particular, progress is limited by the informatics and engineering expertise of biomedical researchers, and the availability of sufficient computational infrastructure to analyze these enormous datasets. This project will address these problems by bringing together Galaxy, a system for making complex computational analysis accessible and reproducible, with "cloud computing", an infrastructure model where computing resources are purchased on demand as needed, making it possible for investigators with no informatics expertise to perform data-intensive analysis using cloud resources.
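The scheduling adaptation described above — acquiring cloud capacity "while taking into account time and cost" — can be illustrated with a simple sizing heuristic. This is an assumed illustration of the tradeoff, not the project's actual scheduler:

```python
import math

def workers_needed(queued_jobs, minutes_per_job, deadline_minutes, max_workers):
    """Smallest number of identical cloud workers that finishes the
    queued jobs by the deadline, capped by a budget-imposed maximum."""
    if queued_jobs == 0:
        return 0  # release everything: idle cloud nodes still cost money
    total_work = queued_jobs * minutes_per_job
    return min(max_workers, math.ceil(total_work / deadline_minutes))

# 120 ten-minute jobs due in 60 minutes -> 20 workers
print(workers_needed(120, 10, 60, max_workers=32))
```

The cap models the cost side: a dedicated cluster must be sized for the peak, whereas a cloud allocation can track the queue and drop to zero between experiments.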
2012 — 2021
Nekrutenko, Anton; Taylor, James Peter
Activity Codes: U24 — To support research projects contributing to improvement of the capability of resources to serve biomedical research. U41 — To support biotechnology resources available to all qualified investigators without regard to the scientific disciplines or disease orientations of their research activities or specifically directed to a categorical program area.
Democratization of Data Analysis in Life Sciences Through Galaxy @ Johns Hopkins University
DESCRIPTION (provided by applicant): Overall: High throughput data production technologies, particularly next generation DNA sequencing, have ushered in the most disruptive changes to biomedical research in decades. Making sense of the large datasets produced by high throughput technologies requires sophisticated statistical and computational methods, as well as substantial computational power. This has led to an acute crisis in biomedical research, as researchers without informatics training attempt to perform computation-dependent analyses. Since 2005, the Galaxy project has been working to address this problem by providing a framework that makes advanced computational tools usable by non-experts. Galaxy seeks to make data-intensive research more accessible, transparent, and reproducible by providing a Web-based environment in which users can perform computational analyses and have all of the details automatically tracked for later inspection, publication, or reuse. In the proposed project, we will improve Galaxy in several specific ways. We will greatly increase Galaxy's usability for working with large numbers of datasets. Modern experiments often involve hundreds of datasets organized in complex ways; we will make analyzing such data simple and intuitive. We will improve the development and distribution of software tools, making it much easier for developers to distribute tools and for users to acquire them, all while preserving provenance. We will greatly improve access to a wide variety of computational resources such as cloud computing and high-performance clusters, enabling biomedical researchers to use resources that have traditionally been difficult to work with. Finally, we will engage in training, outreach, and dissemination, including the development of scalable training materials that can be used by others to conduct biomedical data analysis training.
2014 — 2016
Nekrutenko, Anton
Activity Code: N/A (no activity code was retrieved)
Collaborative Research: CC-NIE Integration: Developing Applications With Networking Capabilities Via End-to-End SDN (DANCES) @ Pennsylvania State Univ University Park
The DANCES project team of network engineers, application developers, and research scientists is implementing a software-defined networking (SDN)-enabled end-to-end environment to optimize support for scientific data transfer. DANCES accomplishes this optimization by integrating high performance computing job scheduling, network control capabilities offered by SDN, and data movement applications in an end-to-end network infrastructure. This integration provides access to control mechanisms for managing network bandwidth. The control of network resources enabled by SDN enhances application stability, predictability and performance, thereby improving overall network utilization. Motivation for the DANCES project is to apply the advantages of advanced network services to the problem of congested metropolitan and campus networks. DANCES uses XSEDENet across Internet2 in conjunction with OpenFlow-enabled network switches installed at the collaborating sites as the end-to-end hardware and software substrate.
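The coupling of job scheduling with SDN bandwidth control can be sketched as admission control over a link's capacity: the scheduler only starts a transfer once bandwidth has been reserved for it. The class, numbers, and names below are hypothetical illustrations, not DANCES code:

```python
class LinkReservations:
    """Track bandwidth reservations on one wide-area link so a job
    scheduler can refuse transfers that would oversubscribe it."""

    def __init__(self, capacity_gbps):
        self.capacity = capacity_gbps
        self.reserved = {}  # transfer id -> reserved Gbps

    def request(self, transfer_id, gbps):
        """Admit the transfer only if capacity remains; on success this
        would correspond to pushing a rate rule to an OpenFlow switch."""
        if sum(self.reserved.values()) + gbps > self.capacity:
            return False
        self.reserved[transfer_id] = gbps
        return True

    def release(self, transfer_id):
        self.reserved.pop(transfer_id, None)

link = LinkReservations(capacity_gbps=10)
assert link.request("jobA", 6)
assert not link.request("jobB", 6)  # would exceed the 10 Gbps link
link.release("jobA")
assert link.request("jobB", 6)      # admitted once capacity frees up
```

This is what gives transfers the "stability and predictability" the paragraph describes: each admitted flow has guaranteed headroom instead of competing for a congested link.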
Knowledge gained through DANCES is being disseminated through educational programs offered by the participating institutions and at existing community workshops, meetings, and conferences. The insights and experience obtained through DANCES will promote a better understanding of the technical requirements for supporting end-to-end SDN across wide area and campus cyberinfrastructure. The resulting SDN-enabled applications will make the request and configuration of high bandwidth connections easily accessible to end users and improve network performance and predictability for supporting a wide range of applications.
2017 — 2021 |
Nekrutenko, Anton Pond, Sergei L Kosakovsky Taylor, James Peter (co-PI) [⬀] |
R01Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies. |
Tuning Big Data Analysis Infrastructure For HIV Research @ Pennsylvania State University-Univ Park
Abstract: The state of big data analytics in the field of HIV/AIDS research is critically lacking. The decreasing cost of sequencing has stimulated the development of novel software tools and analysis frameworks. The bulk of these efforts has been driven by truly expansive (and well-funded) collaborative projects such as the 1000 genomes, ENCODE, modENCODE, GTEx, the Human Microbiome, the Cancer Genome Atlas, and others. While these projects hardened many aspects of NGS data analysis and manipulation, as well as established standards for data representation (e.g. BAM, VCF, CRAM formats), they were facing a set of challenges that is markedly distinct from those faced by HIV researchers, e.g. long stable genomes with few mutations (i.e., human) versus short variable genomes with many mutations (i.e., HIV). Consequently, the development of HIV-specific tools and applications for next generation sequencing (NGS) has largely been the domain of individual labs, independently designing sensible ad hoc, yet disaggregated, solutions to common problems, resulting in a fragmented field largely without accepted standards and gaps between available solutions and the needs of end users. The current practice of writing "full-stack" custom in-house solutions for NGS analyses is not scalable, not maintainable, largely fails to leverage the developments from other domains of NGS data analysis, and hampers the adoption of this transformative technology in HIV research. The specific aims of this proposal address practical aspects of HIV/AIDS-related NGS analysis by assembling proven and newly developed tools and modules into "data to answer" series of workflows, and creating a publicly available and accessible turnkey solution suitable for a large proportion of HIV/AIDS researchers needing to perform routine and bespoke analyses of NGS data.
2019 — 2023 |
Das, Chitaranjan (co-PI) [⬀] Kandemir, Mahmut Nekrutenko, Anton Medvedev, Paul |
N/AActivity Code Description: No activity code was retrieved: click on the grant title for more information |
Frameworks: Re-Engineering Galaxy For Performance, Scalability and Energy Efficiency @ Pennsylvania State Univ University Park
Biomedical research is an important branch of science that deals with the problem of studying biological processes and identifying, preventing and curing diseases. This research forms the pathway to the discovery of new medicines as well as new therapies. As such, biomedical research is crucial to advancing national health and prosperity. Given the geographically distributed research groups and biomedical labs, collaborative science plays a very important role in biomedical research. Galaxy is an open source, web-based framework that is extensively used by more than 20,000 researchers world-wide for conducting research in many application domains, the most prominent of which is biomedical research. It provides a web-based environment in which scientists perform various computational analyses on their data, exchange results from these analyses, explore new research concepts, facilitate student training, and preserve their results for future use. Galaxy currently runs on a large variety of high-performance computing (HPC) platforms including local clusters, supercomputers in national labs, public datacenters and the cloud. Unfortunately, while most of these systems supplement conventional CPUs with significant accelerator capabilities (in the form of Graphical Processing Units (GPUs) and/or Field-Programmable Gate Arrays (FPGAs)), the current Galaxy implementation does not take advantage of these powerful accelerators. This project enhances the Galaxy framework so that it can take full advantage of the tremendous computational capabilities offered by GPUs and FPGAs. By doing so, the important applications running under Galaxy experience significant speedups, thereby accelerating scientific discoveries.
This project consists of four complementary tasks, which follow a logical progression as follows: Task-I focuses on redesigning existing Galaxy tools with GPU/FPGA support and integrating them into Galaxy tool-chains; Task-II provides containerization support for the tools and accelerator-aware orchestration for running Galaxy on cloud platforms; Task-III implements specific policy driven scheduling schemes for Task-I and Task-II; and finally, Task-IV redesigns Galaxy storage to speed up execution and reduce bottlenecks related to data transfer. The proposed enhancements to Galaxy enable the integration of innovation with discovery by providing a state-of-the-art experimental platform to a larger community of researchers across several disciplines. On the broader impact and outreach/educational front, this project impacts the performance and energy efficiency of Galaxy tools and applications and improves the productivity of a typical Galaxy user tremendously; that is, the main beneficiaries of this project are thousands of members of the existing Galaxy Community. However, this project also (i) helps existing GPU and FPGA based (non-Galaxy) applications start using Galaxy, thereby taking full advantage of all existing toolsets within the framework, (ii) enables Galaxy tools to take better advantage of emerging cluster scheduling capabilities, and (iii) creates a synergy with concurrent Galaxy related efforts and existing infrastructure efforts the PIs are involved with, to further expedite scientific discoveries. As such, this proposed system support will have a broad societal impact via the enhanced Galaxy system support.
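Task-III's policy-driven scheduling — matching each tool to CPU, GPU, or FPGA resources — can be sketched as a capability-table dispatcher. The tool names and device table are invented for illustration, not taken from the project:

```python
# Hypothetical capability table: devices each tool has been ported to,
# listed from most specialized to least.
TOOL_DEVICES = {
    "aligner": ["fpga", "gpu", "cpu"],
    "variant_caller": ["gpu", "cpu"],
    "text_filter": ["cpu"],
}

def place(tool, available_devices):
    """Pick the most specialized supported device that is currently free;
    fall back to CPU so every job can still run somewhere."""
    for device in TOOL_DEVICES.get(tool, ["cpu"]):
        if device in available_devices:
            return device
    return "cpu"

print(place("aligner", {"gpu", "cpu"}))   # no FPGA free -> gpu
print(place("variant_caller", {"cpu"}))   # only CPUs available -> cpu
```

A real scheduler would also weigh queue depth, energy cost, and data locality, but the core decision, preferring accelerators when a port exists and a device is free, is the one sketched here.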
On the education side, the project involves under-represented groups in computer science as well as in bioinformatics, outreach to undergraduates, various K-12 related activities (Science-U, CSATS, VIEW), and engagement with researchers in other disciplines (e.g., natural language processing, image processing, drug discovery and cosmology) via a workshop open to the Galaxy community.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
2019 — 2021
Carey, Vincent James; Goecks, Jeremy; Leek, Jeffrey T.; Morgan, Martin T.; Nekrutenko, Anton; Schatz, Michael; Waldron, Levi David
Activity Code: U24 — To support research projects contributing to improvement of the capability of resources to serve biomedical research.
Implementing the Genomic Data Science Analysis, Visualization, and Informatics Lab-Space (AnVIL) @ Johns Hopkins University
Project Summary: The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) will power the next generation of computational genomic research. We will develop the AnVIL environment using the leading national-scale cyberinfrastructure as the foundation supporting the most widely used analysis environments and frameworks vetted by NHGRI researchers. Our user-centered solution for data access, analysis, and visualization will enable investigators across all levels of expertise to fully utilize genomic datasets using environments they are already familiar with, leveraging well-engineered and optimized scientific computing infrastructure for greater efficiency and lower costs. Aim 1: Engineer the AnVIL Data and Compute Platform. We will leverage the TACC Science Cloud and the Agave Science-as-a-Service platform to deploy a cloud-based environment supporting the data storage, access, and compute needs of the NHGRI research community. Aim 2: Develop APIs for Data and Compute Access. To maximize the domain-wide impact of AnVIL, we will draw on community efforts and our own collective experience supporting diverse genomic analyses to define access standards and to design and implement AnVIL APIs. Aim 3: Build an AnVIL metaportal integrating widely used analysis platforms. We will create a single metaportal residing within TACC's Science Cloud providing a unified view of users' data and activities, provenance and billing, and access to several of the most widely used workbenches for genomic research. These workbenches include Bioconductor, Galaxy, the Genome Modeling System, Jupyter, and RStudio. The metaportal will also provide access to the most popular genomic visualization tools. Aim 4: Develop novel data aggregation, indexing and query schemes to increase analysis efficiency and reduce cost.
We will build approaches, including indexing and pre-computation of key statistics, to make better use of existing (e.g., TCGA, GTEx) and future large datasets with the goal of increasing data utility and decreasing the cost of posing scientific queries against massive datasets. Aim 5: Develop training and outreach infrastructure and materials. We will build support for training directly in the AnVIL platform, including tight coupling to MOOC-style courses, self-directed training materials, and support materials for conducting online and in-person training workshops. Aim 6: Engage in effective project governance and assessment. We will establish a leadership and management structure involving key stakeholders from NHGRI, including program staff and the NHGRI appointed Data Steering Committee and External Advisory Committee. The key innovation of this work is our leveraging of existing hardware, software, and human resources to create a practical and pragmatic solution to the challenge of building the AnVIL.
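The indexing and pre-computation strategy of Aim 4 can be illustrated by computing per-group summary statistics in one pass and answering later queries from the index instead of rescanning the raw data. The dataset, gene symbols, and field names here are hypothetical:

```python
def build_index(records):
    """One pass over a large (gene, expression value) dataset, producing
    per-gene count and mean so later queries never touch the raw records."""
    sums = {}
    for gene, value in records:
        total, n = sums.get(gene, (0.0, 0))
        sums[gene] = (total + value, n + 1)
    return {gene: {"n": n, "mean": total / n} for gene, (total, n) in sums.items()}

# Tiny stand-in for a massive expression dataset
records = [("TP53", 2.0), ("TP53", 4.0), ("BRCA1", 1.0)]
idx = build_index(records)
print(idx["TP53"])   # {'n': 2, 'mean': 3.0}
```

The economic point is that the expensive scan is paid once; every subsequent query against the pre-computed index is a cheap dictionary lookup, which is what "decreasing the cost of posing scientific queries" amounts to in practice.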
2020 — 2021
Nekrutenko, Anton
Activity Code: N/A (no activity code was retrieved)
RAPID: Collaborative Research: Agile and Effective Responses to Emerging Pathogen Threats Through Open Data and Open Analytics @ Pennsylvania State Univ University Park
The same types of questions arise during every emergent viral outbreak regarding its origin, its evolution, the manner of spread, and how to detect and mitigate it. Many, if not all, of these questions depend on rapid and reliable genomic analysis of diverse viral sample sequences by multiple laboratories. Early viral investigation is often impaired by a lack of reproducibility, rigor, and data/analytic sharing; the current investigation of COVID-19 is no different. Essential questions such as the extent of intra-host genomic variability (indicative of adaptation or multiple infection), viral evolution (selection, recombination), and transmission (phylogenetic and phylogeographic) cannot be answered reliably if researchers cannot trust/replicate the source data and analytical approaches. The goal of this award is to develop, deploy, and continuously update viral genomic analysis workflows to enable the analysis and monitoring of viral evolution and dynamics for SARS-CoV-2, and to use the lessons learned to prepare capacity for future outbreaks. Bioinformatics workflows for critical tasks in investigating COVID-19 and future viral outbreaks will be developed and COVID-19 deep sequencing data will be analyzed and shared openly via Galaxy and Datamonkey to facilitate and accelerate evolutionary discovery and enable the study of and response to inevitable future infectious disease outbreaks. In an age of digital connectivity, open and accessible shared data and analysis platforms have the potential to transform the way biomedical research is done, opening the way to "global research markets", where competition arises from deriving understanding rather than access to samples and data.
By enabling any researcher with an Internet connection to perform the same analyses as are done by top-flight research groups in resource-rich countries, this award will deliver critical components to resource-limited settings, which are often those that are first and disproportionately affected by viral outbreaks.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.