
Docker for Science (Part 3)

This post is part of a short blog series on using Docker for scientific applications. My aim is to explain the motivation behind Docker, show you how it works, and offer an insight into the different ways that you might want to use it in different research contexts.

After working through the previous two parts, the next big question is: What now? We’ve downloaded other people’s Docker images, and we’ve created our own Docker image – how do we get other people to download, and do something useful with, our Docker image? Likewise, how do we use our Docker image in a wider set of circumstances, for example on HPC systems, or a shared team VM?

This post focuses more specifically on scientific applications, so we’ll look mainly at using GitLab and Singularity for these purposes, as these are among the most commonly-used tools in scientific computing.

Sharing Docker Images

The first problem is how to move Docker images from one computer to another, so that they can be shared between members of a team, or even between one team and another.

Over the last two posts, I mentioned Docker Hub, the main Docker container registry. This is both the source of all official Docker images (hello-world, python, etc.) and the host of a number of unofficial, third-party images (jupyter/scipy-notebook, rocker/shiny)1. If you’re working with truly open-source code, and want to share it with the wider Docker ecosystem, you can create an account and upload images here in the same way that you might host Python packages on PyPI.

However, in practice, a lot of scientific programming is hosted internally via your institution’s own Git host, and most scientific applications are fairly specific, and probably not of huge use outside the purpose that they were developed for.

For this purpose, a number of git-hosting tools (such as GitLab and GitHub) also include per-project Docker registries. This means that you can build and save your Docker images in the same place that you keep your code.

For the purposes of this post, I’ll assume the use of GitLab, because it is one of the most common options in Helmholtz institutions2. When enabled by your administrator, GitLab projects include a private container registry for each project. You can enable it for your project by going to Settings > General > Visibility, project features, permissions. This will add a “Packages and Registries > Container Registry” option to the project sidebar, which will take you to an empty page, because you probably don’t have any images stored yet.

How do you store an image here? Let’s start off by doing it manually, and then do it “properly” – by which I mean get it to happen automatically. If you want to follow along, create a new private repository that you can use as a sandbox, and push the code from the previous post to play around with.

Saving Images – The Manual Process

In the top-right corner of this Container Registry page, there is a button that says “CLI Commands”. This will walk us through the main steps of getting the image that we generated earlier into this registry. The first command it gives us is docker login, followed by a URL for the local registry. Copying this into your terminal and pressing enter will either use your GitLab SSH key (if you’re using one), or ask for your username and password for your GitLab account.
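
As a rough sketch, assuming your GitLab instance’s registry lives at registry.hzdr.de (the example registry URL used later in this post), this first step looks something like:

# log in to the GitLab container registry (copy the exact command from the "CLI Commands" button)
docker login registry.hzdr.de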

If you can set up SSH for your GitLab account, please do so – this means your password does not need to be stored on disk! You can find more details [here](https://docs.gitlab.com/ee/ssh/README.html), or in the documentation for your local GitLab instance.

Once we’ve logged in, we can move to the next step. The suggestion given by GitLab is a command to build the image, but we already built our image while we were learning about Dockerfiles in the previous post. However, the image name used by GitLab is different from the one we used then – why?

Well, in GitLab at least, the name of your image is a combination of the project name, and the group or user who owns that project. For example, if your username is “user123”, and you create a project called docker-test inside your personal area in GitLab, your image will be called user123/docker-test. In addition, Docker requires that if you use a registry that isn’t the main Docker registry, you specify that registry as part of the image name. So, in this case, you’ll actually need to name your image <registry-url>/user123/docker-test, where <registry-url> is whatever you used to log in in the previous step.

This isn’t a problem at all – we can just run the build again, and because of Docker’s clever caching mechanism, we shouldn’t even need to wait for the build to happen again. We can just run the command that GitLab gave us in our project directory, and we get the renamed tag for free.
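
As a sketch, using the hypothetical username user123 and project docker-test from above, and registry.hzdr.de as the registry URL, the build command looks something like:

# rebuild the image from the previous post, this time with the registry-qualified name
docker build -t registry.hzdr.de/user123/docker-test .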

The final step is to push the image – for this, we simply use the docker push command, giving it the image name that we just used. When this is done, we should be able to refresh the container registry page, and see our image sitting there, with a little note that says “1 Tag”. Clicking on the image will show that tag – it should be the latest tag that Docker always defaults to if none is specified. To upload a different tag, just specify it at the end of the image name – for example: registry.hzdr.de/user123/docker-test:my-tag.
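
Continuing the same hypothetical example, the push step is simply:

# push the default "latest" tag
docker push registry.hzdr.de/user123/docker-test
# or push a specific tag instead
docker push registry.hzdr.de/user123/docker-test:my-tag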

Hopefully, it’s clear that the manual process here isn’t too complicated – we log in, we do the build as we did in the previous post, and we add a push command as well. The most complicated part is the image name, but we can get this from GitLab. However, three manual commands may well be three more commands than we actually need – how does the automatic process compare, and is it simpler?

Saving Images – The Automatic Process

In GitLab, we can use CI Pipelines to make things happen automatically when we update our code. Often, this will be building our project, running a linter or typechecker (e.g. mypy), or running any automatic tests associated with the project. GitLab makes it fairly easy to use these pipelines to build and upload images to the container registry, and the HIFIS team have created some templates that can be used to make this completely simple.

To use pipelines, a project needs to have a file in it called .gitlab-ci.yml, which defines a series of Jobs that need to be executed. In the base project directory, create this file, and type the following into it:

include:
    # include the HIFIS Docker template, so that we can extend the predefined jobs there
    - "https://gitlab.com/hifis/templates/gitlab-ci/-/raw/master/templates/docker.yml"

stages:
    - build # => build the dockerfile
    - release # => upload images to the repository

docker-build:
    extends: .docker-build
    stage: build

# this will update the `latest` tag every time the master branch changes
release-latest:
    extends: .docker-tag-latest
    stage: release
    needs:
        - job: docker-build
          artifacts: true

This creates a pipeline with two jobs: one that builds the Docker image, and one that uploads it to the registry. If you push this to the master branch and click on the CI/CD > Pipelines tab in your GitLab project, you should already be able to see the pipeline being executed.
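
Once the pipeline has run on the master branch, anyone with access to the project should be able to pull the freshly-built image in the usual way; a sketch, using the same hypothetical names as before:

docker pull registry.hzdr.de/user123/docker-test:latest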

The documentation for this template is available here.

Sharing Images

Having an image in a registry is one thing, but sharing it with other people is another. Private GitLab projects will also have private registries, which means that anyone else who wants to access the registry will need to log in to GitLab via Docker (as we did in the manual saving process) and have sufficient privileges in the team.

However, there is another route. GitLab also provides access tokens that can be given to people to allow them to pull images from the registry, but not to make any other changes. They don’t even need to have a GitLab account!

In a project’s settings, under Settings > Access Tokens, there is a page where you can create tokens to share with other people. These tokens are randomly-generated passwords that are linked to a particular project and specify exactly what the bearer is able to access. For the purposes of sharing a Docker image, the read_registry permission is enough – this will allow the bearer of the token to access the registry, but not to push new images there, or access other project features.

To create an access token, give the token a name to describe what it’s being used for, select the relevant permissions that you want to grant3, and optionally give an expiry date, if you know that the token will only be needed until a certain time. In response, GitLab will provide a string of letters, digits, and other special characters, which can be copied and sent to the people who need to use it.

To use this token, use the docker login command with your normal GitLab username, and the token provided. For more information, see the documentation here.
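
As a sketch, one way to do this without typing the token into the password prompt is Docker’s --password-stdin option (assuming the same registry URL as before, a hypothetical file token.txt containing the token, and <your-username> as a placeholder):

# log in using a read-only access token instead of a password
docker login registry.hzdr.de --username <your-username> --password-stdin < token.txt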

Docker for HPC (High Performance Computing)

Once you’ve got a Docker image that you can build and run on your computer, it makes sense to look for more useful places to run this image.

In research software, it’s common to run programs on an HPC, or High Performance Computing system. This is a shared cluster of high-performance servers, often equipped with GPUs, managed centrally by a research institute where users can run their programs for longer periods of time (several hours, or even days) without having to keep their own computers running. Generally, the user will log on, schedule a job, and then be notified when their job has finished running so they can view the results.

Unfortunately, for very good reasons, HPC administrators are often very reluctant to install Docker on their systems. One of the side-effects of the way that Docker works is that it is generally possible for a Docker container running on a server to gain administrator access on the host server, essentially “breaking out” of the container. This makes the administrator’s job much more difficult in terms of locking down each user’s processes and isolating them from each other. As a result, it’s generally not a good idea to run Docker in this way.

However, perhaps surprisingly, Docker isn’t the only way to run Docker images. There are a number of other containerisation tools, and one in particular is both designed to run on HPC systems and able to interoperate with Docker, meaning you can usually run your Docker image just like normal.

This tool is called Singularity. It is actually a complete containerisation tool in its own right, with its own format for defining containers, and its own way of running containers4. More importantly, it knows how to convert other container formats (including Docker) into its own .sif format. In addition, it runs as the current user – it doesn’t require any magical higher privileges like Docker. (This is a trade-off, but for the purposes of scientific applications, it’s usually a reasonable one to make.)

If you want to install Singularity and try it out yourself, you will need a Linux machine and a Go compiler, along with a few other dependencies. You can find the full instructions here. Running Singularity on an HPC system will also depend on how exactly that HPC system has been set up, but it will generally involve requesting a job, and running the singularity command as part of that job, with the desired resources.

One of the key questions when using a private registry such as GitLab (see above) is how to log in to that registry. Interactively, Singularity provides a --docker-login flag when pulling containers. In addition, it’s possible to use SSH keys for authentication in certain circumstances.
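
As a rough sketch, pulling and running the hypothetical image from earlier on a machine with Singularity installed might look something like this (the --docker-login flag prompts interactively for the registry credentials):

# convert the Docker image into Singularity's .sif format, logging in to the private registry
singularity pull --docker-login docker://registry.hzdr.de/user123/docker-test
# run the resulting container as the current user
singularity run docker-test_latest.sif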

Docker in The Wild

So far, we’ve generally assumed that the Docker containers being created are wrapping up whole programs for use on the command line. However, there are also situations where you might want to send a whole environment to other people, so that they have access to a variety of useful tools.

If you’ve used GitLab CI (or other similar systems), this is how it works. When GitLab runs a job in a pipeline, it creates a fresh Docker container for that job. That way, the environment is (mostly) freshly-created for each job, which means that individual jobs are isolated from each other. It also means that the environment can be anything that the user wants or needs.

By default, this will probably be some sort of simple Linux environment, like a recent Ubuntu release, or something similar. However, if a CI job needs specific tools, it may well be simpler to find a Docker image that already has those tools installed, than to go through the process of reinstalling those tools every time the job runs. For example, for a CI job that builds a LaTeX document, it may be easiest to use a pre-built installation such as aergus/latex.
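
For example, a minimal job using that image might look something like the following sketch (document.tex is a hypothetical file in the repository, and we assume latexmk is available in the image):

build-pdf:
    image: aergus/latex # pre-built image with a full LaTeX installation
    script:
        - latexmk -pdf document.tex
    artifacts:
        paths:
            - document.pdf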

In fact, in GitLab, it’s even possible to pull custom images from the registries of other projects, and use them in jobs elsewhere. You can even use one job to create an image that is then used by other jobs, if that’s something that you really need.

Conclusion

Here, as they say, endeth the lesson.

Over the course of these three blog posts, we’ve talked about the purpose of Docker, and how it can be used to package applications and their dependencies up in a convenient way; we’ve got started with Docker, and learned how to run Docker containers on our system; we’ve walked through how to create our own Docker containers using a Dockerfile; and finally, in this post, we’ve talked about some of the ways that we can use Docker practically for scientific software development.

Docker can often be a hammer when all you need is a screwdriver – very forceful, and it’ll probably get the job done, but sometimes a more precise tool is ideal. The motivating example for this blog series was the complexity of Python project development, where trying to remember which packages are installed, and which packages are needed by a particular project, can cause a lot of issues when sharing that project with others. For this case alone, Docker can be useful, but you may want to consider a package manager such as Poetry, which can manage dependencies and virtual Python environments in a much simpler way.

However, when different tools, languages, and package management needs come together, using Docker can often be a good way to make sure that the system really is well-defined, for example by ensuring that the right system packages are always installed, as well as the right Python packages, or the right R or Julia software.

If you feel like dependency management for your project is becoming too complex, and you’re not sure what packages need to exist, or how to build it on any computer other than your own, then hopefully this approach of building a Docker container step-by-step can help. However, if you would like more support for your project, HIFIS offers a consulting service, which is free-of-charge, and available for any Helmholtz-affiliated groups and projects. Consultants like myself can come and discuss the issues that you are facing, and explore ways of solving them in the way that is most appropriate to your team.

For more details about this, see the “Get In Touch” box below.


Get In Touch

HIFIS offers free-of-charge workshops and consulting to research groups within the Helmholtz umbrella. You can read more about what we offer on our services page. If you work for a Helmholtz-affiliated institution, and think that something like this would be useful to you, send us an e-mail at support@hifis.net, or fill in our consultation request form.

Footnotes

  1. Notice that all third-party images have two parts – a group/maintainer name (e.g. jupyter), and a specific image name (e.g. scipy-notebook). This is the main way that you can tell the difference between official and third-party images. 

  2. Unfortunately, the second-most common code hosting option at Helmholtz, Bitbucket, doesn’t include a container registry. You can check with your local administrators whether they have a tool like JFrog Artifactory available. Alternatively, part of the evolution of the HIFIS project is to provide code hosting infrastructure across the whole Helmholtz community, which will include access to package and container registries, so please keep an eye out for more HIFIS news on this blog! 

  3. Selecting which permissions to grant is an interesting question of security design that we shouldn’t go into too much here, but a general guideline is “everything needed to do the job required, and not a step more”. That is, give only the permissions that are actually needed right now, not permissions that might be useful at some point.

    This probably doesn’t matter so much in the scientific world, where open research is increasingly important, but it’s a good principle when dealing with computers in general. Consider a badly-written tool (they do exist… 😉) that is designed to clean up images that aren’t needed any more. One mistake in the filter for deciding which images aren’t needed any more, and this sort of tool could rampage through all the registries that it is connected to, deleting everything it can see. (This sort of thing happens more often than you would think - see this bug and this bug and this bug and this bug – all just from one small rm command!) By limiting the access scope to read-only, we can limit how much these sorts of problems affect us. At least until we decide to run this particularly (thankfully fictional) clean-up tool ourselves, and make the same mistake… 

  4. It has its own container system? And it’s more suited to scientific applications? Why are these blog posts about Docker then – why not just go straight to this clearly more convenient tool?

    Two reasons: Firstly, Docker is way more popular than Singularity, or indeed any other similar tool. This means more documentation, more starter images to base our own changes on, more people to find and fix bugs in the software, and more support in third-party tools like GitLab. Secondly, Singularity only runs on Linux, and the installation process involves cloning the latest source code, installing a compiler for the Go programming language, and compiling the whole project ourselves.

    Given that Singularity can run Docker images, we can use Docker in the knowledge that we can also get the advantages of Singularity later on.