
Backing Up Critical US Databases

HIFIS Helps the UFZ Back Up US Databases

In this blog post, we outline how the team of the Helmholtz Centre for Environmental Research (UFZ) is supporting researchers in the Earth and Environment domain to back up critical US databases. This initiative is an effort of the Research Data Management (RDM) team at UFZ, in collaboration with HIFIS, to ensure the continuity of research in light of the uncertainty surrounding the sustainability of key databases.

UFZ and the HIFIS team collaborated closely on this effort. The HIFIS team provided the necessary infrastructure and guidance to facilitate the backup process, along with finding the best solutions to make the data accessible. Collaborating has been crucial in ensuring that the data is preserved and can be accessed by researchers in the long term.

With increasing uncertainty about the long-term availability and sustainability of these resources, it is essential to proactively safeguard access to critical data. Researchers are encouraged to regularly review and inventory the databases they rely on, prioritizing those with unique datasets, high usage, and limited alternative sources.

Critical Datasets

Four datasets are already in our backup, and more are being added. Most of them are hosted by the US Environmental Protection Agency (EPA). Relying on the EPA has become a liability, particularly given the targeted attacks on its science division. The following databases are currently backed up:

  • EPA Ecotox Knowledgebase: A comprehensive database of ecotoxicological data, including information on the effects of chemicals on aquatic and terrestrial organisms.
  • EPA Toxicity Reference Database (ToxRefDB): A database that provides toxicity data for chemicals, including information on their effects on human health and the environment.
  • EPA ToxValDB: A database that contains toxicity values for chemicals, including information on their effects on human health and the environment.
  • EPA invitrodb: EPA’s ToxCast dataset, which allows the ToxCast program to predict the toxicity of chemicals based on their chemical structure and biological activity.

How HIFIS Supports These Efforts

HIFIS not only provides large-scale infrastructure, but also has dedicated staff who can respond quickly to urgent matters like these. Both Codebase and dCache are resilient, sustainable, large-scale services that can accommodate the tasks required in these rescue operations.

You Can Help: Assessing Your Data Needs

You might assume that your data set is safe from deletion or from otherwise becoming inaccessible. In reality, many researchers are unaware of the risks of relying on external databases, especially those hosted in the US. Recent developments have highlighted the need for researchers to take proactive steps to ensure the continuity of their research data.

(Simplified) Diagram of the data rescue workflow. It is essential for researchers to be able to continue working on presently existing data.

Furthermore, researchers at UFZ are not the only ones who rely on US-hosted databases. Resources like the EPA databases are widely used across the Helmholtz Association and beyond. It is therefore crucial for all researchers to assess their data needs and identify which databases are critical for their work. Any researcher can take this proactive step to ensure the continuity of their research, not only in the environmental and earth sciences, but also in other domains.

Once you have identified the databases that are critical for your work, estimate the total data volume and the effort required for backup. Before engaging with the UFZ RDM team, you can check a few things yourself: the data volume, format, metadata, access method, and any specific requirements for the backup. This will help the RDM team provide you with the best possible support.
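To make the volume and format check concrete, here is a minimal Python sketch that walks an already downloaded copy of a data set and reports its total size and the file formats it contains. The function names `inventory` and `suggest_service` are hypothetical, and the cut-off simply reuses the rough 400 MB guideline given in the dCache section of this post; treating 1 MB as 10**6 bytes is an assumption of the sketch, not an official rule.

```python
from collections import Counter
from pathlib import Path

# Guideline from this post: projects above roughly 400 MB go to dCache.
# Treating 1 MB as 10**6 bytes is an assumption of this sketch.
DCACHE_THRESHOLD_BYTES = 400 * 10**6


def inventory(root):
    """Return (total_bytes, file_count_by_extension) for a directory tree."""
    total_bytes = 0
    by_extension = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            total_bytes += path.stat().st_size
            # Files without a suffix are grouped under "(none)".
            by_extension[path.suffix.lower() or "(none)"] += 1
    return total_bytes, dict(by_extension)


def suggest_service(total_bytes):
    """Map a data volume to the service this post recommends for that size."""
    return "dCache" if total_bytes > DCACHE_THRESHOLD_BYTES else "Codebase"


if __name__ == "__main__":
    import sys

    size, formats = inventory(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(f"{size / 10**6:.1f} MB in total -> {suggest_service(size)}")
    for ext, count in sorted(formats.items()):
        print(f"{ext}: {count} file(s)")
```

Running the script with the data directory as its only argument prints a short summary that you can include when you contact the RDM team.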

If you have questions on the process, feel free to contact us!

Using Codebase for Small Data

For smaller datasets, especially those distributed as code repositories, scripts, or small data packages, codebase.helmholtz.cloud is the recommended solution. The workflow is simple:

  1. Download the data set to your local machine.
  2. Create a new project in Codebase in the group “Backup-US-Science-Data” https://codebase.helmholtz.cloud/ufz/backup-us-science-data.
  3. Push the repository or upload files to Codebase, preserving commit history and metadata.

This approach is also ideal for many text-based data sets, such as tabular data, as the text can be browsed in the browser without downloading the data set. If you have any questions regarding licensing terms, and how they would impact the backup, please reach out to HIFIS support.
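The push step of the Codebase workflow can be sketched with plain git commands, driven from Python here. This is an illustration under assumptions, not the team's official procedure: it assumes that git is installed and configured with a user name and email, that the project from step 2 already exists, and that Codebase accepts a standard git push over HTTPS. The remote name `backup`, the function names, and the project path `my-dataset` in the usage note are hypothetical.

```python
import subprocess


def git_backup_commands(remote_url, existing_repo=False,
                        message="Add backup of data set"):
    """Return the git commands that publish a directory to remote_url.

    For plain files, a new repository is created and everything is
    committed first. For an existing clone, nothing is committed, so
    the upstream commit history is pushed unchanged.
    """
    commands = []
    if not existing_repo:
        commands += [
            ["git", "init"],
            ["git", "add", "--all"],
            ["git", "commit", "-m", message],
        ]
    commands += [
        # A separate remote name avoids clashing with an existing "origin".
        ["git", "remote", "add", "backup", remote_url],
        ["git", "push", "backup", "--all"],   # all local branches
        ["git", "push", "backup", "--tags"],  # all tags
    ]
    return commands


def run_backup(data_dir, remote_url, existing_repo=False):
    """Run the commands inside data_dir, stopping at the first failure."""
    for command in git_backup_commands(remote_url, existing_repo):
        subprocess.run(command, cwd=data_dir, check=True)
```

A call could look like `run_backup("ecotox_download", "https://codebase.helmholtz.cloud/ufz/backup-us-science-data/my-dataset.git")`, where both the local directory and the final path segment are placeholders for your own project.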

Using dCache for Large Datasets

For larger datasets (typically above 400 MB per project), dCache in the Helmholtz Cloud provides scalable and robust storage. The workflow is as follows:

  1. Request access to dCache by contacting the UFZ RDM team via rdm-support@ufz.de or via the HIFIS Support portal https://support.hifis.net.
  2. Upload data: dCache supports both browser-based uploads and command-line tools. Preliminary tests indicate that, while there were no issues with the browser-based upload, the command-line tool provides more control over the process.
  3. Organize datasets with clear folder structures and metadata files to facilitate future access and sharing.

dCache is well-suited for large datasets and supports collaboration across Helmholtz centers.
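One possible form for the metadata files mentioned in step 3 is a small manifest stored next to the data. The Python sketch below records each file's relative path, size, and SHA-256 checksum, together with the source URL and a license note. The file name `MANIFEST.json` and the field names are assumptions made for illustration; they are not a format prescribed by dCache or the RDM team.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root, source_url, license_note, name="MANIFEST.json"):
    """Write a JSON manifest into root and return its content as a dict."""
    root = Path(root)
    manifest_path = root / name
    files = [
        {
            "path": p.relative_to(root).as_posix(),
            "bytes": p.stat().st_size,
            "sha256": sha256_of(p),
        }
        for p in sorted(root.rglob("*"))
        # Skip the manifest itself so that re-running stays consistent.
        if p.is_file() and p != manifest_path
    ]
    manifest = {
        "source_url": source_url,
        "license": license_note,
        "files": files,
    }
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

Recomputing the checksums after the upload and comparing them with the manifest is a simple way to confirm that the stored copy matches what was downloaded.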

Rescued Data List

The list of backed-up data can be accessed here: https://codebase.helmholtz.cloud/ufz/backup-us-science-data/backup-us-science-data. It shows the data sets that are currently backed up, along with their storage location and further information. The license and access information is particularly important moving forward. While we aim to make all data sets publicly available, this is not always possible.

Conclusion: Proactive Data Stewardship

As the sustainability of key US databases remains uncertain, taking proactive steps to back up critical data is essential for the continuity of research in the Earth and Environment domain. By leveraging Helmholtz Cloud services like Codebase and dCache, researchers can ensure that their valuable datasets are preserved and accessible for future use. Regularly reviewing and updating your data inventory will help maintain resilience against potential disruptions in data availability.