Computing Community Consortium Blog

The goal of the Computing Community Consortium (CCC) is to catalyze the computing research community to debate longer range, more audacious research challenges; to build consensus around research visions; to evolve the most promising visions toward clearly defined initiatives; and to work with the funding organizations to move challenges and visions toward funding initiatives. The purpose of this blog is to provide a more immediate, online mechanism for dissemination of visioning concepts and community discussion/debate about them.


CCC Q&A: A Look Into A Pilot Project to Enhance Data Access

September 23rd, 2024 / in CCC / by Petruce Jean-Charles

The National Science Data Fabric (NSDF) is a pilot project funded by the National Science Foundation (NSF), designed to enhance data access and management for research institutions around the country and globally. The project is led by Valerio Pascucci (University of Utah), Michela Taufer (University of Tennessee, Knoxville), Alex Szalay (Johns Hopkins University), John Allison (University of Michigan, Ann Arbor), and Frank Wuerthwein (San Diego Supercomputer Center), and it aims to create a connected framework that supports the integration, security, and sharing of many datasets. CCC spoke with Taufer about her interest in the project and its benefits.

What interested you about this project?

The National Science Data Fabric (NSDF) is an ambitious and groundbreaking initiative supported by NSF that promises to transform how data is accessed, shared, and used in scientific research. What drew Valerio Pascucci, Frank Wuerthwein, Alexander Szalay, John Allison, and me to join forces and initiate NSDF was our bold vision to democratize data-driven discovery, especially for underrepresented institutions. In our research, we experience first-hand how traditional data management infrastructures need substantial improvements to keep up as the amount of data generated in scientific experiments and simulations grows exponentially. More importantly, we observed how colleagues less fortunate than us when it came to infrastructure for data access and use face steep challenges in applying their talent to research.

What are the primary goals of the NSDF in relation to data-driven scientific discovery?

At its core, NSDF aims to promote equity in scientific discovery by democratizing access to large-scale data. We designed the project to provide a suite of services that empowers researchers, regardless of their institution's size or resources, to fully leverage the data needed for groundbreaking discoveries. One of the primary goals is to remove the barriers that currently limit access to and use of large datasets, especially for smaller research groups that may lack the technical infrastructure to handle petabytes of information. NSDF is also about turning access into action. We do not just want to allow researchers to download large datasets; we want to provide them a friendly environment in which to analyze, visualize, and interact with these datasets in real time, unlocking real possibilities for scientific discovery. What I like about NSDF is that it is not just a project; it is a community that values and includes all scientists, regardless of their institutional background.

How does the NSDF architecture enhance access to data across different research institutions? What elements make up the NSDF stack?

We designed NSDF’s architecture on a modular, containerized data delivery environment that makes it platform-agnostic and highly flexible. This means researchers from institutions with different types of infrastructure can all access the same robust system without needing specialized hardware or software. NSDF enables federated access to data by integrating services across multiple institutions, from national labs to academic institutions and MSIs. We created entry points that connect these institutions and interconnected these entry points to facilitate seamless data movement and collaborative data usage across geographically distributed teams. So today, thanks to NSDF, scientists at the University of Utah can access material science data commons at the University of Michigan Ann Arbor, or students modeling terrain conformations and soil moisture at the University of Delaware can analyze, at runtime, remote data located in public repositories such as DataVerse or private data storage such as Seal Storage.

The NSDF stack comprises several vital services that together create a robust and scalable data environment, including:

  • NSDF-Catalog is a comprehensive metadata catalog that allows users to find and retrieve datasets easily.
  • NSDF-Stream is a streaming service for real-time data access, essential for large-scale experiments where data must be processed on the fly.
  • NSDF Dashboards are interactive dashboards that allow researchers to visualize and analyze datasets remotely.
  • NSDF-OpenVisus is a framework for progressive, cache-oblivious visualization of large datasets.
  • NSDF-Cloud is a cloud integration service that facilitates access to data from various cloud providers like AWS, Azure, and Jetstream.
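The general pattern these services enable, finding a dataset in a catalog and then reading from wherever it physically lives, can be sketched in a few lines. The sketch below is purely illustrative: the `Catalog` class, dataset name, and endpoint URL are hypothetical stand-ins, not the actual NSDF-Catalog API.

```python
# Illustrative catalog-then-access pattern: resolve a dataset by name,
# then hand its endpoint to whatever client streams the data.
# All names here are hypothetical, not real NSDF interfaces.
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    name: str
    endpoint: str      # where the data physically lives
    size_bytes: int


class Catalog:
    """Minimal metadata catalog: register datasets, resolve them by name."""

    def __init__(self):
        self._records = {}

    def register(self, record: DatasetRecord) -> None:
        self._records[record.name] = record

    def resolve(self, name: str) -> DatasetRecord:
        # A real federated catalog would query remote indexes here.
        return self._records[name]


catalog = Catalog()
catalog.register(DatasetRecord(
    name="llc4320-ocean",
    endpoint="https://example.org/llc4320",   # placeholder URL
    size_bytes=2_800_000_000_000_000,          # ~2.8 PB, as described above
))

record = catalog.resolve("llc4320-ocean")
print(record.endpoint)  # a streaming client would read from this endpoint
```

The point of the indirection is that the researcher asks for a dataset by name, and the fabric, not the user, decides which institutional endpoint actually serves the bytes.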

What role does Internet2 play in the functionality of the NSDF, and how does its infrastructure support data sharing?

NSDF leverages the power of Internet2, a high-speed network designed for advanced research and education. This infrastructure ensures that even massive datasets, such as NASA’s 2.8-petabyte ocean circulation model, can be accessed, shared, and processed in real time without overwhelming the networks of individual institutions. With Internet2’s capabilities, NSDF can bypass the bottlenecks that often occur when moving large datasets across standard internet connections. This is especially important when dealing with petabyte-scale datasets, such as those from NASA’s oceanographic models or the Cornell High Energy Synchrotron Source (CHESS), which require real-time processing capabilities.

What specific research projects are currently being supported by the NSDF pilot, and how do they benefit from this data fabric infrastructure?

NSDF already supports several high-impact research projects, including projects in climate science, materials science, and astronomy. These projects demonstrate the transformative potential of NSDF.

For example:

  • NASA’s LLC4320 ocean dataset: This 2.8-petabyte dataset is being used to study global ocean circulation and its role in climate change. NSDF enables researchers to access and process this enormous dataset in real-time, overcoming the limitations of traditional computational infrastructure.
  • CHESS beamlines: NSDF supports real-time data access and experiment steering for researchers at the Cornell High Energy Synchrotron Source. By integrating NSDF’s data management tools, scientists can collaborate remotely and publish their findings in real time.
  • Soil Moisture and Terrain Parameters Project: This project focuses on processing high-resolution terrain data using Digital Elevation Models (DEMs) to study soil moisture patterns. Using NSDF services such as OpenVisus and IDX format conversion, researchers can visualize large-scale geospatial datasets and extract critical information about terrain parameters like slope, aspect, and hill shading. NSDF’s cloud infrastructure significantly speeds up the analysis, making it possible to handle terabytes of data efficiently. The project particularly benefits earth science research by providing timely, scalable access to critical environmental data.
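The terrain parameters mentioned in the last bullet, slope and aspect, come from finite differences over the DEM grid. The sketch below shows the generic computation with NumPy; it is not NSDF’s actual pipeline, and the 3×3 toy DEM, the one-meter cell size, and the north-up orientation assumption are all made up for illustration.

```python
# Generic slope/aspect derivation from a DEM grid (not NSDF's pipeline).
# Assumes a north-up raster where rows run along the north-south axis
# and columns along west-east, with square cells of size `cell_size`.
import numpy as np


def slope_aspect(dem: np.ndarray, cell_size: float):
    """Return per-cell slope (degrees) and aspect (degrees clockwise from north)."""
    dz_dy, dz_dx = np.gradient(dem, cell_size)          # elevation change per meter
    slope = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
    # Aspect is the azimuth of the downslope direction (-dz_dx, -dz_dy).
    aspect = np.degrees(np.arctan2(-dz_dx, -dz_dy)) % 360.0
    return slope, aspect


# Toy 3x3 DEM: a plane rising 1 m per cell toward the east.
dem = np.array([[0.0, 1.0, 2.0],
                [0.0, 1.0, 2.0],
                [0.0, 1.0, 2.0]])
slope, aspect = slope_aspect(dem, cell_size=1.0)
print(slope[1, 1], aspect[1, 1])  # 45-degree slope, facing west (270)
```

On a plane rising one meter per one-meter cell, every cell has a 45° slope and faces west (aspect 270°), which gives a quick sanity check before pointing the same code at real terabyte-scale DEMs.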

NSDF is designed to address challenges by making vast, complex datasets more accessible and usable for scientists across a broad range of disciplines, from astrophysics to biology.