Optimize the Pipeline Performance¶
Gems and Jewels to Collect¶
In this episode you will learn more about pipeline optimization techniques. The first technique presented is caching. You may also be able to speed up the pipeline by setting a specific running order of jobs within a stage or even by removing stages from the pipeline completely. To save resources, you may also interrupt running CI pipelines when a newer version of a particular CI pipeline starts.
Introduction¶
Sometimes a CI pipeline runs for a long time. The longer it takes, the later we get feedback about code style violations, defects in the code, or errors during execution of the application. As a rule of thumb, you should start thinking about optimizing the pipeline as soon as it runs longer than roughly 10 minutes. In the following, a couple of techniques are explored.
Caching and GitLab CI¶
The first technique that might come to mind is caching. During a pipeline run, a lot of resources are downloaded. Unless new versions are available, we can reuse those fetched files in later CI jobs. Technically, this is possible because the CI runners are configured to utilize a separate caching service: if a cache is configured, artifacts created during a CI pipeline are uploaded to this service and downloaded again by the next CI job that reuses these cached files. One example in the context of Python packages is caching packages managed with dependency management systems like `pip`, `pipenv` or `poetry`.
```yaml
stages:
  - lint
  - test

default:
  image: python:3.11

variables: # defining environment variables accessible in the whole pipeline
  PY_COLORS: '1' # colour Python output
  CACHE_DIR: ".cache"
  CACHE_PATH: "$CI_PROJECT_DIR/$CACHE_DIR"
  POETRY_VIRTUALENVS_PATH: "$CACHE_PATH/venv"
  POETRY_CACHE_DIR: "$CACHE_PATH/poetry"
  PIP_CACHE_DIR: "$CACHE_PATH/pip"

.dependencies:
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install
  cache:
    key:
      files:
        - poetry.lock
      prefix: $CI_JOB_IMAGE
    paths:
      - "$CACHE_DIR"

license-compliance:
  stage: lint
  extends: .dependencies
  script:
    - poetry run reuse lint

codestyle:
  stage: lint
  extends: .dependencies
  script:
    - poetry run black --check --diff src/
    - poetry run isort --check --diff src/

test:python:
  image: python:${VERSION}
  stage: test
  extends: .dependencies
  script:
    - poetry run pytest tests/
  parallel:
    matrix:
      - VERSION: ["3.10", "3.11", "3.12"]
```
In this example, by using the `cache` keyword we declared a directory called `.cache/` to be cached by GitLab CI. The `key` sub-key of this keyword gives each cache a unique identifying name. All CI jobs that reference the same cache name also use the same cache, even if they are part of different CI pipeline runs. The Python dependencies are specified in a file called `poetry.lock`. If this file changes, the cache must be invalidated. Therefore, it is useful to use the file checksum of `poetry.lock` as the cache key, which can be achieved in GitLab CI by specifying the `files` sub-key. The `prefix` sub-key additionally prepends the job's container image (`$CI_JOB_IMAGE`) to the key, so that jobs running on different Python images get separate caches. Think carefully about the cache key so that caches are reused whenever possible and recreated whenever necessary. Ultimately, you will notice that this pipeline is much faster than the same pipeline without caching. This is because all defined CI jobs reuse the `.cache/` directory that contains the virtual environment and the artifacts downloaded, managed and used by `pip` and `poetry`.
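As a minimal standalone sketch of the same idea, consider a hypothetical project that manages its dependencies with a plain `requirements.txt` instead of Poetry (the file and job names here are illustrative and not part of our example project):

```yaml
# Sketch: caching pip downloads, keyed on a hypothetical requirements.txt
test:
  image: python:3.11
  variables:
    PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
  cache:
    key:
      files:
        - requirements.txt  # recreate the cache whenever the dependencies change
    paths:
      - .cache/pip          # pip stores downloaded packages here
  script:
    - pip install -r requirements.txt
    - pytest
```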
Defining Environment Variables¶
The `variables` keyword is a way to define environment variables globally for the whole CI pipeline or specifically for particular CI jobs as part of the CI job definition. You can access these environment variables the same way as you access shell environment variables, i.e. by the variable name prefixed by a dollar sign, like `$SPEED_OF_LIGHT`. Here is an example:
```yaml
variables:
  SPEED_OF_LIGHT: "299792458"

my-custom-job:
  script:
    - echo "The speed of light in vacuum is $SPEED_OF_LIGHT m/s."
```
In our example CI pipeline above we defined, besides `PY_COLORS` (which merely colours Python output), five environment variables:

- `CACHE_DIR` is the directory name that contains all cached artifacts.
- `CACHE_PATH` is the full path to the cache directory, using a predefined CI variable called `$CI_PROJECT_DIR`, which is the path to the project directory on the CI runner where all the CI actions take place.
- `POETRY_VIRTUALENVS_PATH` defines the path of the virtual environment created and used by *Poetry*.
- `POETRY_CACHE_DIR` defines the path to the directory *Poetry* uses for caching.
- `PIP_CACHE_DIR` defines the path to the directory *pip* uses for caching.
As you can see, you can use predefined variables like `$CI_PROJECT_DIR`, as well as any other predefined variable, in `variables` sections. Later on in the pipeline these environment variables come in handy: because we know where the pipeline stores its artifacts, we can tell the GitLab CI runner to cache this particular `.cache` directory in which everything is included.
The `needs` Keyword¶
Usually, the ordering of the CI jobs is given by the order of the stages. All stages are executed in sequence, while all jobs within a stage are executed in parallel. This is how it looks in a diagram:
This order can be changed with the `needs` keyword, which defines a different running order of CI jobs. A CI job might need to wait for another CI job to finish successfully because it depends on the result of that job. Two examples: the former job creates artifacts that the latter one wants to reuse, or the former job builds the application that is tested later on. Both examples could use the `needs` keyword to define the running order, but with different implications depending on whether both jobs are contained in the same stage or not. If they are contained in the same stage, the second job does not run in parallel but in sequence after the first one. If they are contained in different stages, the second job is also executed after the first one, but not necessarily after the whole previous stage of the pipeline has finished: the depending job might be executed earlier than the stage ordering suggests, immediately after the job it depends on finishes successfully.

Please note that the `needs` keyword takes a list of CI job names as its value. These CI job names define the jobs that this job depends on. In consequence, the job with the `needs` keyword has to wait for all mentioned jobs to finish successfully before it is executed itself.
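Since `needs` takes a list, a job can also wait for several jobs at once. A small standalone sketch (the job names are made up for illustration):

```yaml
my-build-job-1:
  script:
    - echo "Build part 1"

my-build-job-2:
  script:
    - echo "Build part 2"

# This job starts only after both jobs above finished successfully:
my-assemble-job:
  script:
    - echo "Assemble the parts"
  needs: ["my-build-job-1", "my-build-job-2"]
```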
In the following, we will discuss two examples. In the first one we show how to reorder CI jobs inside the same stage. The objective is to execute two jobs in one stage consecutively rather than in parallel. The first two jobs in the next example run in sequence, although they are part of the same stage. The third job starts as soon as the first stage completes successfully.
```yaml
stages:
  - stage-1
  - stage-2

my-ci-job-1:
  stage: stage-1
  script:
    - echo "Execute job 1"
    - sleep 20

my-ci-job-2:
  stage: stage-1
  script:
    - echo "Execute job 2"
    - sleep 40
  needs: ["my-ci-job-1"]

my-ci-job-3:
  stage: stage-2
  script:
    - echo "Execute job 3"
```
For our example CI pipeline, we could run the `license-compliance` job before the `codestyle` job:
```yaml
stages:
  - lint
  - test
  - run

default:
  image: python:3.11
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install

license-compliance:
  stage: lint
  script:
    - poetry run reuse lint

codestyle:
  stage: lint
  script:
    - poetry run black --check --diff .
    - poetry run isort --check --diff .
  needs: ["license-compliance"]

test:python:
  image: python:${VERSION}
  stage: test
  script:
    - poetry run pytest tests/
  parallel:
    matrix:
      - VERSION: ["3.10", "3.11", "3.12"]

running:
  stage: run
  script:
    - poetry run python -m astronaut_analysis
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  artifacts:
    paths:
      - results/
```
Let us have a look at the changed diagram:
Second, we show how to set a new order across stages. Here, the objective is to let a depending job start as soon as the job it depends on finishes successfully. This way the later job does not wait until the whole previous stage passes. In the following example, the third job needs the first job to finish successfully. This makes the pipeline much faster because the third job does not wait for the slower second job to finish.
```yaml
stages:
  - stage-1
  - stage-2

my-ci-job-1:
  stage: stage-1
  script:
    - echo "Execute job 1"
    - sleep 20

my-ci-job-2:
  stage: stage-1
  script:
    - echo "Execute job 2"
    - sleep 40

my-ci-job-3:
  stage: stage-2
  script:
    - echo "Execute job 3"
  needs: ["my-ci-job-1"]
```
In the case of our example CI pipeline, suppose we run the tests as soon as the `codestyle` job finishes successfully, leaving the `license-compliance` job aside. In addition, the `running` job could be executed as soon as the test job for version 3.10 of the Python interpreter finishes successfully, leaving all other test jobs aside:
```yaml
stages:
  - lint
  - test
  - run

default:
  image: python:3.11
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install

license-compliance:
  stage: lint
  script:
    - poetry run reuse lint

codestyle:
  stage: lint
  script:
    - poetry run black --check --diff .
    - poetry run isort --check --diff .

test:python:
  image: python:${VERSION}
  stage: test
  script:
    - poetry run pytest tests/
  parallel:
    matrix:
      - VERSION: ["3.10", "3.11", "3.12"]
  needs: ["codestyle"]

running:
  stage: run
  script:
    - poetry run python -m astronaut_analysis
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  artifacts:
    paths:
      - results/
  needs: ["test:python: [3.10]"]
```
Now, the Directed Acyclic Graph (DAG) of that CI pipeline, which displays the sequence and interrelations of the CI jobs, looks like this:
The overall pipeline might become much faster by introducing a new ordering that is not based on the stage ordering.
Please note the special notation of the `needs` keyword in the previous example when defining a dependency on a parameterized CI job. There, the specific parameter value needs to be given in square brackets alongside the job name; in this example this is `test:python: [3.10]`.
Stageless Pipelines¶
Stageless CI pipelines do not define any stages. They leave out the `stages` keyword completely and set the running order of the CI jobs only by using the `needs` keyword. Your pipeline might look similar to this example, which executes the first two jobs in parallel but the third one after the first one:
```yaml
my-ci-job-1:
  script:
    - echo "Execute job 1"
    - sleep 20

my-ci-job-2:
  script:
    - echo "Execute job 2"
    - sleep 40

my-ci-job-3:
  script:
    - echo "Execute job 3"
  needs: ["my-ci-job-1"]
```
By applying this concept to our example CI pipeline, we could make the tests depend on the `codestyle` job and the `running` job on the `test:python: [3.10]` job. This is equivalent to the following stageless pipeline:
```yaml
default:
  image: python:3.11
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install

license-compliance:
  script:
    - poetry run reuse lint
  needs: []

codestyle:
  script:
    - poetry run black --check --diff .
    - poetry run isort --check --diff .
  needs: []

test:python:
  image: python:${VERSION}
  script:
    - poetry run pytest tests/
  parallel:
    matrix:
      - VERSION: ["3.10", "3.11", "3.12"]
  needs: ["codestyle"]

running:
  script:
    - poetry run python -m astronaut_analysis
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  artifacts:
    paths:
      - results/
  needs: ["test:python: [3.10]"]
```
The resulting CI pipeline might be faster than the CI pipeline with stages defined. The corresponding DAG of the CI pipeline shown above is depicted in the following diagram:
Help Saving Resources¶
The last keyword to be explained in this episode is the `interruptible` keyword. To save infrastructure resources, it might be unreasonable to continue executing a CI pipeline when a newer version of that pipeline is about to start. By setting CI jobs as `interruptible`, these jobs are allowed to be canceled before they finish running. If a job is interrupted, the whole pipeline is stopped in favour of the newer one.
This could be exemplified like so:
```yaml
my-job-1:
  script:
    - echo "Execute job 1"
    - sleep 15
  interruptible: true

my-job-2:
  script:
    - echo "Execute job 2"
    - sleep 30
  interruptible: true

my-job-3:
  script:
    - echo "Execute job 3"
  needs: ["my-job-2"]
```
For our own CI pipeline, we allow all lint and test jobs to be interruptible, but not the `running` job, because this one creates artifacts that could be incomplete or missing if the job is canceled:
```yaml
default:
  image: python:3.11
  before_script:
    - pip install --upgrade pip
    - pip install poetry
    - poetry install

license-compliance:
  script:
    - poetry run reuse lint
  needs: []
  interruptible: true

codestyle:
  script:
    - poetry run black --check --diff .
    - poetry run isort --check --diff .
  needs: []
  interruptible: true

test:python:
  image: python:${VERSION}
  script:
    - poetry run pytest tests/
  parallel:
    matrix:
      - VERSION: ["3.10", "3.11", "3.12"]
  needs: ["codestyle"]
  interruptible: true

running:
  script:
    - poetry run python -m astronaut_analysis
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  artifacts:
    paths:
      - results/
  needs: ["test:python: [3.10]"]
```
Consequently, the pipeline will be stopped if the `running` job has not been reached when a newer version of the pipeline starts. This might save a lot of resources in the long run and avoids blocking CI runners for long periods when CI pipelines are computationally expensive and run for a significant amount of time.
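Whether GitLab actually cancels the redundant pipeline also depends on the project's auto-cancel settings. On sufficiently recent GitLab versions (16.10 and later, to the best of our knowledge) this behaviour can additionally be configured in the pipeline itself; a sketch, assuming such a version:

```yaml
# Sketch (assumes GitLab 16.10+): auto-cancel a redundant pipeline as soon as
# a newer commit triggers a new one, honouring each job's `interruptible` flag.
workflow:
  auto_cancel:
    on_new_commit: interruptible
```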
Note: When using the `needs` keyword, artifacts are no longer downloaded from all jobs of previous stages. In jobs that use the `needs` keyword, you can enable downloading artifacts from the CI jobs that the job depends on by using the `artifacts` sub-key like this:
```yaml
stages:
  - build
  - test

default:
  image: python:3.11

building:
  stage: build
  script:
    - make build
  artifacts:
    paths:
      - build

testing:
  stage: test
  script:
    - make test
  needs:
    - job: building
      artifacts: true
```
Exercise
Exercise 1: Optimize CI Pipeline Performance of the Exercise Project¶
The following exercise is about optimizing the CI pipeline for our exercise project. Remember:

- Caching will speed up your pipeline, even though it might not be applicable in the exercise project.
- Using stageless pipelines helps to avoid CI jobs blocking each other.
- Making jobs interruptible will cancel a pipeline if a newer run has started, thus saving resources.
Take Home Messages
In this episode we presented some more concepts to optimize the whole CI pipeline. This can be done by caching dependencies, by defining a more efficient running order in a pipeline, or even by defining stageless pipelines. Last, we configured CI jobs as interruptible so that a pipeline can be stopped in favour of a newer pipeline, which saves infrastructure resources.
Next Episodes¶
In the last episode of this workshop we will work again on the topic of removing duplications and reusing particular parts of the CI pipeline.