
Docker for Science (Part 2)

This post is part of a short blog series on using Docker for scientific applications. My aim is to explain the motivation behind Docker, show you how it works, and offer an insight into the different ways that you might want to use it in different research contexts.

An Example Dockerfile

Let’s get straight to business: Here’s what an example Dockerfile for a simple Python project might look like. (The comments are added to make it easier to reference later in this post.)

# (1)
FROM python:3.8.5

# (2)
WORKDIR /opt/my-project

# (3)
COPY . /opt/my-project

# (4)
RUN pip install -r requirements.txt

# (5)
ENTRYPOINT [ "python3", "main.py" ]

Building Our Example Project

First, let’s figure out how to turn this Dockerfile into a container that we can run. Before anything else, you’ll need the code – you can find it in this repository, so you can clone it and follow along.

The first step to getting this ready to run is docker build. To build an image, you need a Dockerfile, a name for the image, and a context. The Dockerfile is what tells Docker how to build the image, the name is what Docker will use to reference this image later (e.g. python or hello-world), and the context is the set of files from your file system that Docker will have access to when it tries to build the project.

The context is usually the project directory (typically also the directory where the build command is run from). Likewise, by convention, a Dockerfile is generally called Dockerfile (with no extension) and lives in the project’s root directory. If this isn’t the case, there are additional flags to pass to docker build that specify where it is located. The name is given with the -t flag, along with any tags that you want to provide (as usual, these default to :latest). The -t flag can be provided multiple times, so you can tag one build with multiple tags – for example, if your current build should belong to both the latest tag and a fixed tag for the current release version.
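For example, if the Dockerfile lives in a subdirectory and you want to apply several tags in one go, the invocation might look something like this (the names, tags, and path here are just placeholders):

$ # -f points at a Dockerfile outside the default location; each -t adds a name/tag
$ docker build -f docker/Dockerfile -t my-analyser:latest -t my-analyser:1.0.0 .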

Having cloned the example repository, you can run this build process like this:

$ # builds the file at ./Dockerfile, with the current working directory as the context,
$ # with the name `my-analyser`.
$ docker build -t my-analyser .
Sending build context to Docker daemon  20.48kB
Step 1/5 : FROM python:3.8.5
3.8.5: Pulling from library/python
d6ff36c9ec48: Pull complete 
c958d65b3090: Pull complete 
edaf0a6b092f: Pull complete 
80931cf68816: Pull complete 
7dc5581457b1: Pull complete 
87013dc371d5: Pull complete 
dbb5b2d86fe3: Pull complete 
4cb6f1e38c2d: Pull complete 
0b3d7b2fc317: Pull complete 
Digest: sha256:4c62d8c5ef331e485143c7a664fd6deeea4595ac17008ef5c10cc470d259e39f
Status: Downloaded newer image for python:3.8.5
 ---> 62aa40094bb1
Step 2/5 : WORKDIR /opt/my-project
Removing intermediate container 3e718c528a63
 ---> f6845bcf9e20
Step 3/5 : COPY . /opt/my-project
 ---> 8977a9a29d1c
Step 4/5 : RUN pip install -r requirements.txt
 ---> Running in 8da06d6427d0
Collecting numpy==1.19.1
  Downloading numpy-1.19.1-cp38-cp38-manylinux2010_x86_64.whl (14.5 MB)
Collecting click==7.1.2
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Installing collected packages: numpy, click
Successfully installed click-7.1.2 numpy-1.19.1
Removing intermediate container 8da06d6427d0
 ---> ba22084bd57e
Step 5/5 : ENTRYPOINT [ "python3", "main.py" ]
 ---> Running in d1c9dc9bc09f
Removing intermediate container d1c9dc9bc09f
 ---> d12d76ae371b
Successfully built d12d76ae371b
Successfully tagged my-analyser:latest

There are a few things to notice here. Firstly, Docker sends the build context (that’s the . part) to the Docker daemon. We’ll discuss the role of the Docker daemon a bit more in the next post, but for now, the daemon is the process that actually does the work here. After that, we go through the steps defined in the Dockerfile (you’ll notice that the five steps match up with the five commands). We’ll go through what each command actually does in a moment, although it might be interesting to form your own idea of what each line is doing before reading on.

Before we explore the individual commands, however, we should figure out how to actually run this compiled image. The Python script that we’re running is a fairly simple one – it has two commands, one to tell us how many items of data we’ve got, and another to give us the average values from that data. We can run it like this:

$ docker run my-analyser
Usage: main.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  analyse-data
  count-datapoints
$ docker run my-analyser count-datapoints
My Custom Application
datapoint count = 100
$ docker run my-analyser analyse-data
My Custom Application
height = 1.707529904338
weight = 76.956408654431

This is very similar to the hello-world container that we ran, except that there’s no need to download anything (because the image has already been built on our system). We’ll look at transferring the container to other computers in the next post, but, in principle, this is all we need to do to get a completely self-sufficient container with all the code that we need to run our project.

For now, let’s go through the Dockerfile step-by-step and clarify what each command given there is doing.

Step-by-step Through The Dockerfile

The first thing (1) a Dockerfile needs is a parent image. In our case, we’re using one of the pre-built Python images. This is an official image provided by Docker that starts with a basic Debian Linux installation, and installs Python on top of it. We can also specify the exact version that we want (here we use 3.8.5).

There are a large number of these pre-built official images available, for tools such as Python, R, and Julia. There are also unofficial images that often bring together a variety of scientific computing tools for convenience. For example, the Jupyter Notebooks team have a wide selection of different images with support for different setups. Alternatively, most Linux distributions, including Ubuntu and Debian1, are available as parent images. You may need to do more work to get these set up (for example, you’ll need to manually install Python first) but you also have more flexibility to get things set up exactly how you want.
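As a rough sketch, starting from a plain Ubuntu parent image and installing Python by hand might look like this (a minimal example, not taken from the project in this post):

FROM ubuntu:20.04

# Ubuntu images don't include Python by default, so install it manually
RUN apt-get update && apt-get install -y python3 python3-pip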

Once we’ve got our base image, we want to make this image our own. Each of the commands in this file adds a new layer on top of the previous one. The first command we use (2) is fairly simple – it just sets the current working directory. It’s a bit like running cd to get to the directory you want to start working in. Here, we set it to /opt/my-project. It doesn’t really matter what we use here, but I recommend /opt/<project-name> as a reasonable default.
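If you want to convince yourself that later commands really do run in this directory, you can override the entrypoint and print the working directory (a quick check against the image we built above):

$ # should print /opt/my-project, the directory set by WORKDIR
$ docker run --rm --entrypoint pwd my-analyser
/opt/my-project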

The next step (3) is to add our own code to the image. The image that we’re building will be (mostly)2 isolated from the computer that we run it from, so if we want our code to be available inside the image, we need to explicitly put it there. The COPY command is the way to do that. It creates a new layer that contains files from our local system (.) at the location in the image that we specify (/opt/my-project).
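In the same way, you can list what COPY actually placed inside the image (the exact file list will depend on the contents of the repository):

$ # list the files copied into /opt/my-project
$ docker run --rm --entrypoint ls my-analyser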

At this point, we have a Python project inside a Docker image. However, our project probably has some third-party dependencies that will also need to be installed. As I pointed out before, the Docker container that we’re aiming for is isolated from the computer that we will run it from, which means that any dependencies we need must also be installed inside the container. The RUN command (4) allows us to run arbitrary commands inside the container. After running the command, Docker creates a new layer with the changes that the command made.
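Each of these layers is recorded in the final image; if you’re curious, you can list them and see what each step added (using the image name from our build above):

$ docker history my-analyser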

Here, we run the pip command to install all of our dependencies3. We load the dependencies from a file called requirements.txt – if you’re not familiar with this system, it’s a way of defining dependencies reproducibly, so that any future user can look through a project and see exactly what they will need to run it. It’s important to emphasize that Docker doesn’t need to replace requirements.txt, CMake, or other dependency management tools. Rather, Docker can work together with these and other tools to provide additional reproducibility guarantees.
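For reference, judging by the build output above, the requirements.txt in this example pins just two dependencies – something along these lines:

$ cat requirements.txt
numpy==1.19.1
click==7.1.2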

The final part of our Dockerfile is the ENTRYPOINT command (5).

Part of the idea of Docker is that each Docker container does one thing, and it does it well. (You might recognise the UNIX philosophy here.) As a result, a Docker container should generally contain one application, and only the dependencies that that application needs to run. The ENTRYPOINT command, along with the CMD command, tells Docker which application should run.

The difference between ENTRYPOINT and CMD is a bit subtle, but it roughly comes down to how you use the docker run command. When we ran it in the previous post, we generally used the default commands set by the containers – for hello-world, the default command was the executable that printed out the welcome message, while for python, the default command was the Python REPL. However, it’s possible to override this default when invoking docker run. For example, we can run the Python container and jump straight into a bash shell, skipping the Python process completely:

$ docker run -it python:3.8.5 bash # note the addition of 'bash' here to specify a different command to run
root@f30676215731:/# 

This ability to replace the default command comes from using CMD. In the Python Dockerfile, there is a line that looks like CMD python, which essentially tells Docker “if nobody has a better plan, just run the Python executable”.

The arguments to ENTRYPOINT, on the other hand, are simply prepended to whatever that command ends up being. (It is possible to override the entrypoint as well, but it’s not as common.) For example, consider the following Dockerfile:

FROM ubuntu:20.04

# using `echo` allows us to "debug" what arguments get
# passed to the ENTRYPOINT command
ENTRYPOINT [ "echo" ]

# this command can be overridden
CMD [ "Hello, World" ]
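To try this out, this Dockerfile needs to be built first – here I’m assuming it sits in its own directory and is tagged echotest to match the commands below:

$ docker build -t echotest .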

When we run this container, we get the following options:

$ docker run echotest # should print the default CMD value
Hello, World
$ docker run echotest override arguments # should print the overridden arguments
override arguments
$ docker run -it --entrypoint bash echotest # overrides the entrypoint

As a rule, I would recommend using ENTRYPOINT when building a container for a custom application, and CMD when building a container that you expect to be a base layer, or an environment in which you expect people to run a lot of other commands. In our case, using ENTRYPOINT allows us to add subcommands to the main.py script that can be run easily from the command line, as demonstrated in the opening examples. If we’d used CMD instead of ENTRYPOINT, then running docker run my-analyser count-datapoints would have tried to run a count-datapoints command in the container, which doesn’t exist, and would have caused an error.
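To make the difference concrete, here is roughly how the pieces combine for our example image (a sketch based on the Dockerfile above):

$ # ENTRYPOINT is [ "python3", "main.py" ], so arguments given to `docker run`
$ # are appended to it:
$ docker run my-analyser count-datapoints   # effectively runs: python3 main.py count-datapoints
$ # with CMD [ "python3", "main.py" ] instead, the same call would try to execute
$ # `count-datapoints` on its own inside the container, and fail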

Next: Part 3 – Practical Applications in Science

In this second of three parts, we’ve looked at an example project and its Dockerfile. We saw how to build the image and run the resulting container, and went through the most important commands needed to set up a Dockerfile for a project.

In the final part, I want to explore some of the different ways that you might use Docker as part of research: for example, how to distribute Docker images to other machines, how to run Docker containers on HPC systems, how to build images via Continuous Integration, and other places where you might see Docker being used.

View part three here.


Get In Touch

HIFIS offers free-of-charge workshops and consulting to research groups within the Helmholtz umbrella. You can read more about what we offer on our services page. If you work for a Helmholtz-affiliated institution, and think that something like this would be useful to you, send us an e-mail at support@hifis.net, or fill in our consultation request form.

Footnotes

  1. If you look deeper into Docker, you might notice that a distribution called “Alpine Linux” crops up a lot. This is an alternative distribution that is specifically designed to be as light as possible. It can save a lot of space in Docker images, but it also comes with some additional complexities. I recommend starting with a Debian-based distribution, particularly for Python-based projects, and then switching to Alpine Linux later if you find that your Docker images are getting too large to handle. 

  2. “Mostly” is an important caveat here! To usefully run a Docker container, we need to send some input in and get some sort of output out – this is mostly handled with command-line arguments and the console output of whatever runs inside Docker. However, for some applications (less so scientific ones), we will also want to access a service running inside the container, e.g. a webserver. Alternatively, we may want to access files inside the container while running it, or even allow the container to access files from the “parent” computer that’s running it. These things can all be enabled using different arguments to the docker run command (see the short sketch at the end of this footnote).

    I’ll talk a little bit more about some specifics here in the final part of this series, where I’ll also mention tools like Singularity (that you’re more likely to run into on HPC systems), and explain some of the limitations of these tools a bit more clearly. 
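    As a rough sketch of what those arguments look like (the image names, paths, and port here are only placeholders):

    $ # make a port from inside the container reachable on the host
    $ docker run -p 8888:8888 my-webserver-image
    $ # mount a local directory into the container so it can read and write files there
    $ docker run -v "$(pwd)/data:/data" my-image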

  3. If you have a lot of different Python projects, you might (rightly!) ask why I haven’t used something like virtualenv to isolate the Python environment. The answer is that, in this case, it’s not really necessary. The Docker image that we build will have isolation built-in – and not only for Python, but for all our other dependencies too.