
Mass Storage for Machine Learning in Seismology


From dCache to DESY Storage (HDF) — HIFIS mass storage from DESY for Helmholtz


dCache is an open-source project that develops a system for storing and retrieving large amounts of data with world-wide access. It was built, and continues to be developed, by Deutsches Elektronen-Synchrotron (DESY), the Fermi National Accelerator Laboratory (FNAL) and the Nordic e-Infrastructure Collaboration (NeIC).

This made the system a natural candidate for DESY to provide mass storage Helmholtz-wide via HIFIS. In fact, it was one of the first services connected to the Helmholtz AAI as a demonstrator, in early 2020. It is now becoming a regular service and will be available via the Helmholtz Cloud Portal in autumn 2022, branded as “DESY Storage (HDF)”, with HDF being short for Helmholtz Data Federation.

SeisBench — A toolbox for machine learning in seismology

The first users of what was then a prototype are seismologists from the REPORT-DL project (Rapid Earthquake Phase Analysis of Ocean-bottom, Regional and Teleseismic events with Deep Learning). The project was funded in 2019 by Helmholtz AI, another Helmholtz Incubator platform, and it was in this context that SeisBench, a toolbox for machine learning in seismology, was developed.


SeisBench is an open-source Python toolbox that aims to standardise access to datasets and models for seismic waveform processing with deep learning. In this way, SeisBench both reduces the overhead for developers of such models and bridges the gap between model developers and seismic practitioners.

A key part of SeisBench is the ability to directly access benchmark datasets and pretrained models. To facilitate the sharing of this data, the SeisBench team uses DESY Storage (HDF). The service gives them a high-performance repository that makes it convenient to share datasets of several hundred gigabytes. Additional functionality provided through WebDAV makes it possible to implement convenience functions, such as enumerating the available model weights. More detailed information can be found in the project’s documentation.
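To illustrate how enumerating files over WebDAV can work in principle, here is a minimal Python sketch using only the standard library. It is not SeisBench's actual implementation: the function names, the sample paths and the assumption that the collection can be read without authentication are all hypothetical. The sketch relies only on the standard WebDAV `PROPFIND` method with a `Depth: 1` header, which returns a multistatus XML document with one `href` per entry.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import unquote, urlparse

# XML namespace used by WebDAV multistatus responses.
DAV_NS = {"d": "DAV:"}


def list_entries(multistatus_xml: str, collection_path: str) -> list[str]:
    """Return the sorted names of the direct children in a PROPFIND response."""
    root = ET.fromstring(multistatus_xml)
    base = collection_path.rstrip("/")
    names = []
    for response in root.findall("d:response", DAV_NS):
        href = response.findtext("d:href", default="", namespaces=DAV_NS)
        # href may be an absolute URL or a path; keep only the decoded path.
        path = unquote(urlparse(href.strip()).path).rstrip("/")
        # Skip the collection itself and anything outside of it.
        if path == base or not path.startswith(base + "/"):
            continue
        rest = path[len(base) + 1:]
        if "/" not in rest:  # keep direct children only
            names.append(rest)
    return sorted(names)


def fetch_listing(url: str) -> list[str]:
    """Send a Depth-1 PROPFIND to `url` (a placeholder; assumes anonymous read access)."""
    request = urllib.request.Request(url, method="PROPFIND", headers={"Depth": "1"})
    with urllib.request.urlopen(request) as reply:
        body = reply.read().decode("utf-8")
    return list_entries(body, urlparse(url).path)


# A hand-written sample response with hypothetical paths, for illustration only.
SAMPLE = """<d:multistatus xmlns:d="DAV:">
  <d:response><d:href>/example/models/phasenet/</d:href></d:response>
  <d:response><d:href>/example/models/phasenet/weights_b.pt</d:href></d:response>
  <d:response><d:href>/example/models/phasenet/weights_a.pt</d:href></d:response>
</d:multistatus>"""

available = list_entries(SAMPLE, "/example/models/phasenet/")
```

Separating the parsing step from the network request keeps the listing logic easy to check against a saved response before pointing `fetch_listing` at a real WebDAV endpoint.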

Schematic diagram of SeisBench. By Jack Woollam, license: GPLv3

In the nine months since publication, SeisBench has grown an active user base of almost 200 users, who access the DESY Storage (HDF) repository around 5000 times per month. Users are located around the world and include researchers at world-leading institutions (e.g. Harvard, Cambridge, Cornell). The majority of users come from outside the Helmholtz community, which highlights the importance of granting easy, world-wide access to such content.

In addition to the infrastructure, DESY Storage (HDF) offers the SeisBench team detailed statistics on usage patterns. This allows them to identify which parts of their software are most used by the community, e.g., which models attract the most interest. They use this information to plan the future focus of SeisBench development.

How to use DESY Storage (HDF) via HIFIS for your projects

The storage service can be used by any user group with central Helmholtz stakeholders, but it is not intended for single users. To apply, please contact HIFIS support with a brief request. So that the service can be set up optimally, the request should include:

  • a main contact (if multiple users are involved)
  • the purpose of usage (brief description), including the approximate resources needed
  • the approximate number of users
  • the Helmholtz centres / other organisations of the user(s)
  • the envisioned time frame of usage

Get in contact

For dCache / DESY Storage (HDF): Christian Voss, Paul Millar

For SeisBench: Jannes Münchmeyer, Jack Woollam, Andreas Rietbrock

For HIFIS: HIFIS Support


Changelog

  • 2022-08-02 – adapt the date for release of DESY Storage (HDF) in the Helmholtz Cloud Portal