2014 — 2017 |
Halchenko, Yaroslav |
N/A |
CRCNS US-German Data Sharing: DataGit - Converging Catalogues, Warehouses, and Deployment Logistics Into a Federated 'Data Distribution'
Contemporary neuroscience is heavily data-driven, but today's data management technologies and sharing practices fall at least a decade behind software ecosystem counterparts. Distributed version control systems, such as Git, facilitate collaborative software development, and turnkey distributions, like NeuroDebian, free researchers from tedious and unreliable maintenance tasks. Likewise, neuroscientists will need to incorporate recent technological developments to access, manage, and contribute back to the ever growing array of scientific data more efficiently. Making a rich collection of disjoint datasets available through a simple unified interface can transcend limitations of individual studies and revolutionize how scientific data are managed, distributed, and shared across all fields of science. With support from the National Science Foundation, Dr. Yaroslav O. Halchenko of Dartmouth College, along with Dr. Michael Hanke of the University of Magdeburg (Germany), will develop DataGit, a suite of data distribution tools. DataGit will employ software for data tracking and deployment logistics to unify access to many existing neuroimaging data hosting portals, such as crcns.org, openfmri.org and humanconnectome.org. DataGit will make it easy to access existing data and to share new or derived data with full support for distributed version control, data integrity protection and authenticated access to original data hosting.
Making data management as easy and as versatile as source code management will further the efforts toward open and fully reproducible science. Uniform access to federated collections of data will promote the visibility and accessibility of neuroscientific data inside and outside the field, far beyond the scope of any individual data-sharing effort. The benefits from the proposed developments will translate directly to educators' aims in the classroom. Through integration with software distributions, uniform access to software elements and datasets for online training materials will enable educators to teach not only from textbooks but also through hands-on replication of state-of-the-art original publications. In addition, giving any researcher the ability to easily deploy complex heterogeneous analysis pipelines will be instrumental in translating the achievements of flagship efforts, such as the Human Connectome Project, into accessible tools for clinical applications. Consequently, even more researchers will be able to tackle even larger challenges to benefit society by improving our understanding of the human brain.
A companion project is being funded by the German Ministry of Education and Research (BMBF).
|
0.915 |
2019 — 2022 |
Halchenko, Yaroslav |
N/A |
Collaborative Proposal: CRCNS US-German Data Sharing Proposal: DataLad - a Decentralized System For Integrated Discovery, Management, and Publication of Digital Objects of Science
Scientists collect terabytes of critical data every year. Recently, a strong open science movement has generated traction for the beneficial practice of sharing data across laboratories, universities and research institutions. Yet sharing data is not enough. Data must be shared using standardized formats and accompanied by curated metadata to allow for tracking, search, and organization. Metadata are essential for scientific discovery, as they are routinely used to complete all data analyses. However, to date, most brain projects focus on collecting or analyzing data, not on metadata management. Typical metadata records consist of heterogeneous study descriptions, developed at the study release stage, without consistency across records or standard mechanisms to track changes. This project will increase access to brain data and improve metadata handling by combining two NSF-funded projects. It will develop a first-of-its-kind metadata management system able to track data and metadata distributed across heterogeneous geographical locations, storage systems and data formats. This portion of the project will expand the functionality of DataLad, a previously NSF-funded project. DataLad will also be enhanced to interoperate with major data repositories such as OSF and Figshare. Furthermore, the project will use the NSF-funded cloud computing platform brainlife.io to create a data and metadata marketplace by gathering data from multiple currently separate repositories into a single ecosystem. The goal is to improve interoperability across open science projects and make data and metadata easily searchable and available for computing on national cyberinfrastructure systems, ultimately advancing scientific discovery by increasing data discoverability, utilization, and publication.
This project will generate various technological advances. The core target will be an extensible system capable of automated gathering of metadata from various domains. It will comprise two major components: 1) a set of metadata parser algorithms that extract metadata from datasets and individual files using a flexible JSON-LD based data structure (with the ability to encode controlled vocabularies where available) and 2) an aggregation procedure that merges the extracted metadata across parsers and stores them in compressed files that are optimized for bandwidth-efficient exchange and can be queried directly, or used as input to SQL or graph databases for data discovery applications. Extracted metadata will be included within the same datasets under Git and git-annex version control for unambiguous referencing and versatile data logistics. In parallel, we will improve the interoperability of DataLad with existing data publishing portals (such as Figshare and OSF) by taking advantage of extracted metadata (e.g., Author, Description) to prefill required fields, and also by bundling the entire Git object store within the publication so that published datasets can be installed back by DataLad without any loss of information. To make such published datasets discoverable, we will establish a crowd-sourced registry (with a RESTful API) that will receive announcements of new datasets upon publication and aggregate their metadata to enable querying across datasets and data hosting providers. The final development will be the integration of DataLad within the brainlife.io data marketplace. This will make it possible to search and install datasets on brainlife.io as well as to process the data using the brainlife.io analysis Apps on various NSF-funded national cyberinfrastructure high-throughput computing systems.
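The two-component design described above (per-source parsers followed by an aggregation step that writes compressed, directly queryable files) can be illustrated with a minimal Python sketch. This is not DataLad's implementation or API: every function name here, the input field names (`Name`, `Authors`), the schema.org terms standing in for a controlled vocabulary, and the gzip-compressed JSON Lines storage format are assumptions chosen for illustration only.

```python
import gzip
import json
import os
import tempfile

# Assumption: schema.org terms stand in for a controlled vocabulary.
CONTEXT = {"@vocab": "http://schema.org/"}


def parse_dataset_description(text):
    """Hypothetical parser: extract dataset-level fields from a JSON description."""
    raw = json.loads(text)
    return {"name": raw.get("Name"), "author": raw.get("Authors", [])}


def parse_file_listing(paths_and_sizes):
    """Hypothetical parser: describe individual files as parts of the dataset."""
    parts = [
        {"@type": "MediaObject", "name": path, "contentSize": f"{size} B"}
        for path, size in paths_and_sizes
    ]
    return {"hasPart": parts}


def aggregate(dataset_id, parser_outputs):
    """Merge parser outputs into one JSON-LD style record.

    Earlier parsers take precedence; empty values are skipped.
    """
    record = {"@context": CONTEXT, "@id": dataset_id, "@type": "Dataset"}
    for output in parser_outputs:
        for key, value in output.items():
            if value not in (None, [], ""):
                record.setdefault(key, value)
    return record


def write_records(records, path):
    """Store one record per line in a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, sort_keys=True) + "\n")


def query(path, key, value):
    """Scan the compressed file directly and yield matching records."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            field = record.get(key)
            if field == value or (isinstance(field, list) and value in field):
                yield record


if __name__ == "__main__":
    description = json.dumps({"Name": "demo", "Authors": ["A. Researcher"]})
    files = [("sub-01/anat.nii.gz", 1024), ("sub-01/func.nii.gz", 4096)]
    record = aggregate(
        "demo-dataset",
        [parse_dataset_description(description), parse_file_listing(files)],
    )
    with tempfile.TemporaryDirectory() as tmp:
        store = os.path.join(tmp, "metadata.jsonl.gz")
        write_records([record], store)
        for match in query(store, "author", "A. Researcher"):
            print(match["@id"], match["name"])
```

In this sketch, storing one record per line keeps the file streamable, so a query can scan it without loading everything into memory, and the same lines could be bulk-loaded into an SQL or graph database. A real system would also need conflict handling between parsers, which the simple first-parser-wins rule here glosses over.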
A companion project is being funded by the Federal Ministry of Education and Research, Germany (BMBF).
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
|
0.915 |