Remove Redundancies
Gems and Jewels to Collect
In the course of this episode you will learn a couple of techniques to remove redundancies in your GitLab CI pipeline, so that the file is easier to read and much easier to maintain.
Introduction
Up to now the CI pipeline does its job:
- It makes sure the source code complies with coding and licensing guidelines,
- it executes the tests automatically, and
- it also runs the application.
The implementation, though, comes with a lot of duplications and redundancies.
A very popular principle in software engineering and beyond is the DRY principle: Don't Repeat Yourself. It means that you should not repeat a concept you already introduced somewhere else, but add it to your project in a way that lets you reuse it in different contexts. The most important reason is maintainability. If you change a certain aspect of your code or documentation, you do not want to search the whole code base or documentation for duplicates. Because this manual step is error-prone, you will most probably miss some of them, which introduces inconsistencies.
A GitLab CI pipeline can quickly grow in terms of lines of code. You should constantly watch out for repetitions in your pipeline and refactor it as soon as you are about to introduce duplications. In this lesson, we will learn how to remove these redundancies while keeping the functionality of the pipeline unchanged.
Set Global Defaults for Keywords
GitLab CI allows defining global default values in CI pipelines with the default keyword.
Only a subset of the job keywords may appear in the default section, among them image, services, before_script, after_script, cache, artifacts, retry, timeout, tags and interruptible;
the script keyword, for instance, is not allowed to be defined as a default.
Let's have a look at an example:
stages:
  - test

default:
  image: python:3.11
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install

test:python:
  stage: test
  script:
    - poetry run pytest tests/

test:python:3.12:
  stage: test
  image: python:3.12
  script:
    - poetry run pytest tests/
As you can see, defaults such as the image and the before_script shell commands are grouped under the default section of the CI pipeline.
These defaults are used in all CI jobs as long as they are not overridden there.
Please compare the two CI jobs above.
Both use the default before_script section, but only the first job uses the default image, while the test:python:3.12 job overrides the global image keyword.
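A job can also switch off a default instead of replacing it with another value. The following is a minimal sketch, assuming two hypothetical jobs, check:readme and check:license, that do not need Poetry; the job names, the alpine image tag and the shell commands are illustrative assumptions and not part of the example project.

```yaml
stages:
  - test

default:
  image: python:3.11
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install

# Hypothetical job: keeps the default image but skips the default before_script.
check:readme:
  stage: test
  before_script: []        # an empty list overrides the default commands
  script:
    - test -f README.md

# Hypothetical job: opts out of all defaults and sets its own image.
check:license:
  stage: test
  inherit:
    default: false         # do not inherit anything from the default section
  image: alpine:3.19       # assumed image tag, any small image would do
  script:
    - test -f LICENSE
```

Overriding a single keyword is usually enough; the inherit keyword is the broader switch for jobs that should ignore the default section entirely.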
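Defaults can also be written down without using the default keyword at all, directly at the top level of the file.
GitLab's documentation describes this top-level form as deprecated, so prefer the default section in new pipelines: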
stages:
  - test

image: python:3.11

before_script:
  - pip install --upgrade pip
  - pip install poetry
  - poetry install

test:python:
  stage: test
  script:
    - poetry run pytest tests/

test:python:3.12:
  stage: test
  image: python:3.12
  script:
    - poetry run pytest tests/
Reuse Configurations
A powerful keyword to reduce repetitions in your pipeline is the extends keyword.
YAML anchors are a similar concept in YAML itself that can be used for the same purpose.
Let us start with the extends keyword, which is the simpler of the two.
It lets you add a block of YAML to the CI pipeline that is not a CI job, and is therefore not executed on its own, but serves as a reusable block that can be referenced from different locations later in the YAML file.
If a CI job pulls in such a block with the extends keyword, the keys of the block override the global defaults in the same way as keys written directly in a CI job do.
Let us explore the following example:
stages:
  - stage-1

.my-extension: # block to be reused in CI jobs
  stage: stage-1
  before_script:
    - echo "Output in before_script section."
  script:
    - echo "Output in script section."

my-ci-job-1:
  extends: .my-extension # reuse block in CI job

my-ci-job-2:
  extends: .my-extension # reuse block in CI job
The example shows that the names of reusable blocks start with a period, e.g. .my-extension, which makes GitLab treat them as hidden jobs that are never run on their own.
They can be reused with the extends keyword inside your CI jobs, or even inside other reusable blocks, by specifying the name of the reusable block.
Let us assume for the sake of this example that you do not want to use the default section for the before_script block but declare another block that you can refer to multiple times somewhere else.
stages:
  - stage-1

.my-extension: # reusable YAML block
  stage: stage-1
  before_script:
    - echo "Output 1 in before_script section."
    - echo "Output 2 in before_script section."
  script:
    - echo "Output in script section."

my-ci-job-1:
  extends: .my-extension # reuse block in CI job

my-ci-job-2:
  extends: .my-extension # reuse block in CI job
For our example CI pipeline we could write this down as follows:
stages:
  - test

default:
  image: python:3.11

.testing: # block to be reused in CI jobs
  stage: test
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install
  script:
    - poetry run pytest tests/

test:python:3.11:
  extends: .testing # reuse block in CI job

test:python:3.12:
  image: python:3.12
  extends: .testing # reuse block in CI job
The greatest benefit of using the extends keyword is that there is only a single location to change if you decide to adapt, for example, the shell command in the script section and add some command-line options to the pytest call.
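The extends keyword merges YAML hashes key by key but does not merge YAML lists, though: a list defined in the CI job replaces the list of the reusable block as a whole.
For these cases YAML offers a concept called YAML anchors.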
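The list limitation of the extends keyword can be sketched as follows. The job test:verbose and the variables LOG_LEVEL and REPORT are hypothetical names introduced only for illustration; the comments describe the merged result that GitLab is expected to produce.

```yaml
stages:
  - test

.testing:                        # reusable block
  stage: test
  image: python:3.11
  variables:
    LOG_LEVEL: "info"            # hypothetical variable
  before_script:
    - pip install poetry
    - poetry install
  script:
    - poetry run pytest tests/

test:verbose:                    # hypothetical job
  extends: .testing
  variables:
    REPORT: "junit"              # hash: merged, the job sees LOG_LEVEL and REPORT
  script:
    - poetry run pytest --verbose tests/   # list: replaces the script of .testing
  # before_script is not redefined here, so it is taken over unchanged.
```

In other words, there is no way to append a single command to an inherited list with the extends keyword alone; the whole list has to be repeated or shared by other means.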
YAML anchors are very similar to extensions but have a slightly different syntax. There are even two different syntaxes depending on the context. The example above could look like this if YAML anchors were used:
stages:
  - stage-1

.my-sequence-anchor: &my-sequence-anchor-name # reusable block as YAML sequence
  - echo "Output 2 in before_script section."
  - echo "Output 3 in before_script section."

.my-hash-anchor: &my-hash-anchor-name # reusable block as nested YAML hash
  stage: stage-1
  before_script:
    - echo "Output 1 in before_script section."
    - *my-sequence-anchor-name # reuse YAML sequence in nested YAML hash
  script:
    - echo "Output in script section."

my-ci-job-1:
  <<: *my-hash-anchor-name # reuse nested YAML hash in CI job

my-ci-job-2:
  <<: *my-hash-anchor-name # reuse nested YAML hash in CI job
Here, we defined two blocks, one that is a simple YAML sequence and one that is a nested YAML hash.
The former is then used in the latter.
As you can see, in contrast to extensions you can use anchors both for YAML sequences and for nested YAML hashes.
A reusable YAML block is declared by giving the block a name prefixed by a period, followed by a colon and an anchor name with a leading ampersand, e.g. .my-sequence-anchor: &my-sequence-anchor-name or .my-hash-anchor: &my-hash-anchor-name.
A block is referred to either by writing an asterisk followed by the anchor name in case of a YAML sequence, for example *my-sequence-anchor-name, or by writing two less-than signs and a colon, followed by an asterisk and the anchor name in case of a nested YAML hash, for example <<: *my-hash-anchor-name.
The corresponding implementation for the example CI pipeline could look like this:
stages:
  - test

default:
  image: python:3.11

.before-testing: &before-testing # a reusable block as a YAML sequence
  - pip install --upgrade pip
  - pip install poetry
  - poetry install

.testing: &testing # a reusable block as a nested YAML hash
  stage: test
  before_script:
    - *before-testing # reuse YAML sequence in nested YAML hash
  script:
    - poetry run pytest tests/

test:python:3.11:
  <<: *testing # reuse nested YAML hash in CI job

test:python:3.12:
  image: python:3.12
  <<: *testing # reuse nested YAML hash in CI job
Both versions of reusable YAML, the extends keyword and the YAML anchors, improve readability as well as maintainability significantly, because many duplications are stripped away.
Use matrix Jobs
In this section we will first talk about a simple example for the so-called matrix keyword and then extend the concept to a more general one.
The motivation is to provide a list of variable values with n elements to a single CI job definition so that it is instantiated n times.
Let us consider a test stage that contains three CI jobs, each testing with a different Python interpreter version.
To remove the redundancy arising from these three test jobs, we would like to define the job just once but instantiate it several times by providing a list of variable values.
This can be done with the matrix keyword, which defines parameterized CI jobs.
By applying this approach to our example CI pipeline we arrive at the following pipeline:
stages:
  - test

.testing:
  stage: test
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install
  script:
    - poetry run pytest tests/

test:python:
  image: python:${VERSION}
  extends: .testing
  parallel:
    matrix:
      - VERSION: ["3.10", "3.11", "3.12"]
As a result, the respective CI job is defined only once and the duplications have been removed nicely.
From this single CI job definition, one CI job instance is created for each list item, i.e. three in total.
All of these jobs are then executed in parallel, because the parameterized CI job is assigned to stage test and all its instances therefore belong to the same stage.
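To make the effect concrete, the parameterized job is roughly equivalent to writing the three jobs below by hand. This is only a sketch: it assumes that GitLab names the instances after the job and the matrix value, e.g. test:python: [3.10], and the explicit VERSION variables stand in for what the matrix keyword sets for each instance.

```yaml
stages:
  - test

.testing:
  stage: test
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install
  script:
    - poetry run pytest tests/

# Hand-written counterparts of the three matrix instances.
"test:python: [3.10]":
  extends: .testing
  image: python:3.10
  variables:
    VERSION: "3.10"

"test:python: [3.11]":
  extends: .testing
  image: python:3.11
  variables:
    VERSION: "3.11"

"test:python: [3.12]":
  extends: .testing
  image: python:3.12
  variables:
    VERSION: "3.12"
```

Adding a further Python version then means adding one list item in the matrix version, but a whole job in the hand-written one.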
Matrices in a More General Context
So far we used the matrix keyword with just one list of variable values, but, as the name of the keyword implies, it is also capable of working with matrices.
An m x n matrix is a table-like structure with m rows and n columns.
It can be used to write down all permutations of two lists of items:
| | b1 | b2 | b3 | b4 |
|---|---|---|---|---|
| a1 | c11 | c12 | c13 | c14 |
| a2 | c21 | c22 | c23 | c24 |
| a3 | c31 | c32 | c33 | c34 |
In this example we get twelve permutations of two lists consisting of three and four items, respectively.
The matrix keyword is a similar concept in GitLab CI.
Two variables with m and n values can be specified in a CI job, so that m times n instances of the CI job are executed in parallel, as in this example:
stages:
  - run

my-ci-job:
  stage: run
  image: python
  script:
    - python -m my_python_module --param1 ${PARAMETER_1} --param2 ${PARAMETER_2}
  parallel:
    matrix:
      - PARAMETER_1: ["1", "2", "3"]
        PARAMETER_2: ["1", "2"]
The following table lists all permutations of the elements of both lists:
PARAMETER_1 | PARAMETER_2 |
---|---|
1 | 1 |
1 | 2 |
2 | 1 |
2 | 2 |
3 | 1 |
3 | 2 |
The matrix keyword also works with multiple matrices: each list item below matrix is a matrix of its own, and the CI job is instantiated for the permutations of every one of them.
This simplifies the YAML file quite a bit since the job is specified only once.
Please be aware of the limitation that the number of permutations in a parameterized CI job cannot exceed a fixed limit (50 in older GitLab releases, 200 according to more recent GitLab documentation).
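A job with more than one matrix could look like the following sketch, which reuses the job from the example above; the concrete parameter values are assumptions chosen for illustration. The first matrix yields 2 x 2 = 4 instances, the second one 1 x 3 = 3 instances, so seven instances are created in total.

```yaml
stages:
  - run

my-ci-job:
  stage: run
  image: python
  script:
    - python -m my_python_module --param1 ${PARAMETER_1} --param2 ${PARAMETER_2}
  parallel:
    matrix:
      # first matrix: (1,1), (1,2), (2,1), (2,2)
      - PARAMETER_1: ["1", "2"]
        PARAMETER_2: ["1", "2"]
      # second matrix: (3,3), (3,4), (3,5)
      - PARAMETER_1: ["3"]
        PARAMETER_2: ["3", "4", "5"]
```

Splitting the values into several matrices is useful when not every value of one variable makes sense with every value of the other.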
Exercise
Exercise 1: Refactor CI Pipeline and Remove Redundancies in Exercise Project
Technical debt builds up quickly, so removing redundancy and refactoring your CI pipeline should start as early as possible and be repeated on a regular basis.
In this exercise we’ll practice refactoring our CI pipeline from the exercise project. Remember:
- Define defaults globally with the default keyword.
- Reuse YAML blocks in different CI jobs with the extends keyword and YAML anchors.
- Execute parameterized CI jobs in parallel with the matrix keyword.
Please note that the dependencies keyword does not work well in combination with the matrix keyword.
A parameterized CI job, e.g. the test job, cannot use its matrix variable in the dependencies section to refer to the matching instance of another parameterized CI job, e.g. the build job.
That means, this does not work:
build:gcc:
  [...]
  image: gcc:${VERSION}
  artifacts:
    paths:
      - "build"
  parallel:
    matrix:
      - VERSION: [ "11", "12", "13" ]
  [...]

test:gcc:
  [...]
  image: gcc:${VERSION}
  dependencies:
    - "build:gcc [${VERSION}]"
  parallel:
    matrix:
      - VERSION: [ "11", "12", "13" ]
  [...]
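Instead, give each instance its own artifact directory in the artifacts section and refer to the parameterized CI job as a whole: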
build:gcc:
  [...]
  image: gcc:${VERSION}
  artifacts:
    paths:
      - build_gcc_${VERSION}
  parallel:
    matrix:
      - VERSION: [ "11", "12", "13" ]
  [...]

test:gcc:
  [...]
  image: gcc:${VERSION}
  dependencies:
    - "build:gcc"
  parallel:
    matrix:
      - VERSION: [ "11", "12", "13" ]
  [...]
In a more generic form, the pattern looks like this:
build:
  image: my_image:${VERSION}
  stage: build
  script:
    - make build_${VERSION} # building the app in directory build_[1:3]
  artifacts:
    paths:
      - build_${VERSION} # one artifact directory per parameterized CI job
  parallel:
    matrix:
      - VERSION: [ "1", "2", "3" ]

test:
  image: my_image:${VERSION}
  stage: test
  script:
    - make test_${VERSION} # testing the app in directory build_[1:3]
  dependencies:
    - "build" # refer to all artifact directories build_[1:3] from CI job build
  parallel:
    matrix:
      - VERSION: [ "1", "2", "3" ]
Take Home Messages
In this episode you learned how to simplify our CI pipeline by using defaults, the extends keyword, YAML anchors and the matrix keyword.
The default keyword reduces duplications because defaults are set once for the whole pipeline.
The extends keyword and YAML anchors provide a way to define reusable blocks of YAML.
The matrix keyword lets the pipeline shrink as well by parameterizing CI jobs, which creates instances of a CI job for all permutations of the parameters given.
Next Episodes
In the upcoming episodes we focus on further performance improvements and pipeline optimizations.