Designing a Continuous Integration system means facing a lot of challenges and choices. As with blank page syndrome, we are faced with (almost) infinite possibilities. What is the right choice? Which system should we choose? How much should we invest? How do we train developers and users?

To navigate complex choices I tend to set aside the practical details and use principles and values as a North Star to guide me through the mess. The risk here is ending up in a flame-war kind of debate where the discussion revolves around “what is the best tool™”. Such a discussion is wrong: it is pointless, time consuming, and will bring you nowhere near the solution you need.

The discussion to have is about which trade-offs you are willing to make today. Not tomorrow, when you don’t know what the context around you will be. Today.
Choosing a CI system is a revocable decision, but the cost of changing it means the change will happen only once the system has become really detrimental (and only if there is the willingness to change). For this reason, I see the choice as based on a simple and long-lasting trade-off: the more aligned the system is with your current needs, the less complex it is and the less expensive (in money, time, and people) it will be in the long run. There is no “best” here, only “better in the long run for us”.

The CI problem

Before stating the principles, let me take a detour through my analysis of the “CI problem”.

I see this discussion as composed of these separate problems:

developer toolchain; answers “how do I install all the needed tools & requirements?”
reproducibility across environments; answers “how do I run the same tools that others are running?”
command discovery; answers “which command performs this task?” and “how do I make sure I’m using the same command everyone else is using?”
troubleshooting; answers “this is not working for me, who can I get help from?”

All of them are intertwined, but not so much that they can’t each be considered on their own.

Developer toolchain

The complexity of this issue grows with the number of OSs and system architectures to support. This is a hard problem to solve: installing tools reliably across all of them is hard, evaluating the cost/benefit is difficult, and setting up a working environment is not something that happens frequently.

There are promising tools that may help here and there (looking at you, NixOS!), but then opinions come into play. Vim or Emacs? Maybe VSCode or IntelliJ? What about Docker or Podman? goenv or asdf? make or mage? What about task? This list can go on forever.

We don’t care how tools are installed; we care that those tools are available and at the version we expect. So let’s start from here: provide guidance to check that each tool is available and that the expected version is installed. Installation is left to the reader.

Doing this does not mean having to write documentation that no one maintains. Write a small script that is capable of checking tools & versions, and have your devs run it regularly. Scripts to Rule Them All is the perfect example.
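
As a minimal sketch of such a check, assuming Python is available on every machine: the functions below look each tool up on the PATH and compare the first version number it prints against an expected one. The EXPECTED table, the tool names and the version numbers are hypothetical placeholders, not a recommendation.

```python
import re
import shutil
import subprocess
from typing import List, Optional

# Hypothetical expectations: tool name -> (command that prints the version, expected version).
EXPECTED = {
    "git": (["git", "--version"], "2.43.0"),
    "go": (["go", "version"], "1.22.1"),
}


def extract_version(output: str) -> Optional[str]:
    """Return the first dotted version number (e.g. 1.22.1) found in the output."""
    match = re.search(r"\d+(?:\.\d+)+", output)
    return match.group(0) if match else None


def check_tool(name: str, command: List[str], expected: str) -> Optional[str]:
    """Return a human-readable problem, or None when the tool matches."""
    if shutil.which(command[0]) is None:
        return f"{name}: not installed (expected {expected})"
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    # Some tools print their version on stderr, so look at both streams.
    found = extract_version(result.stdout + result.stderr)
    if found != expected:
        return f"{name}: found {found}, expected {expected}"
    return None


def main() -> int:
    """Print every mismatch and return a shell-style exit code."""
    problems = []
    for name, (command, expected) in EXPECTED.items():
        problem = check_tool(name, command, expected)
        if problem is not None:
            problems.append(problem)
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

Ending the file with a call such as sys.exit(main()) would turn it into a script that developers can run locally and that can also serve as a CI step. Note that it only checks; installation is still left to the reader.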

Reproducibility across environments

Keep in mind how deep the rabbit hole is. Depending on the depth you want to achieve there are various tools.

Why not start by keeping things simple? Just start by providing tool and version information. Are we running linter@v1.2.3? This information should be added to your (Git?) repo. Make it part of your codebase. Write it in a file. Write small tooling to double-check it. Aligning your local environment and the CI is then just a matter of keeping this file in sync: not the easiest task, but the complexity involved is low. Since it is a file, you can run cat <file> | grep <tool> to compare versions. Tools like asdf may prove very helpful here.
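
To illustrate the "small tooling" idea, here is a sketch that parses a versions file and reports drift against another source of truth (for example, what a CI runner reports). It assumes a simple format of one "name version" pair per line with # comments, similar to the .tool-versions file used by asdf; the function names and the sample tool names are hypothetical.

```python
def parse_tool_versions(text: str) -> dict:
    """Parse lines of the form 'name version' into a {name: version} dict."""
    pinned = {}
    for raw_line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        name, *versions = line.split()
        if versions:
            # If several versions are listed, keep only the first one.
            pinned[name] = versions[0]
    return pinned


def find_drift(pinned: dict, actual: dict) -> list:
    """List every tool whose actual version differs from the pinned one."""
    drift = []
    for name, expected in sorted(pinned.items()):
        found = actual.get(name)
        if found != expected:
            drift.append(f"{name}: pinned {expected}, found {found or 'nothing'}")
    return drift


# Hypothetical usage: compare the repo file with what an environment reports.
repo_file = "linter 1.2.3\nnodejs 20.11.0  # example comment\n"
reported = {"linter": "1.2.3", "nodejs": "18.0.0"}
for line in find_drift(parse_tool_versions(repo_file), reported):
    print(line)
```

Keeping the parsing this small is deliberate: anyone on the team should be able to read it and fix it without knowing the rest of the toolchain.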

Moving up the complexity scale, Docker images are the next step, but this heavily depends on what your CI system of choice integrates with (most will support Docker). Tools like localstack or testcontainers may be handy here. Be mindful of disk space though, as Docker images may exhaust it quickly.

Command discovery

Command discovery is the ability to know which command everybody else (and the CI runners) runs to perform a task. How do you build the binary? The dev one? The release one? How do you lint the code?

This is where I think tools like make do not really shine. They already bundle too much complexity; after all, their primary goal is not to be simple but to be flexible. Not that simplicity and flexibility must be antagonists, but they usually are. By allowing complex things to be created, these tools bring in more cognitive load. They also tend to fall short on debugging and on being easy for newcomers (i.e. support for “dry-run modes”).

This is also where documentation tends to fail in the long term. It becomes stale and out of date.

To solve this, it is great to write commands down in a task runner (software whose goal is to run tasks, more or less integrated with your development environment) that developers use regularly.
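
To make the idea concrete, here is a deliberately tiny sketch of what a task runner boils down to: a single table of named commands that both developers and CI go through, plus a dry-run mode that only prints what would run. The TASKS table and its commands are hypothetical examples; in practice an existing task runner would play this role.

```python
import shlex
import subprocess

# Hypothetical task table: task name -> list of commands, run in order.
TASKS = {
    "build": [["go", "build", "./..."]],
    "lint": [["golangci-lint", "run"]],
}


def list_tasks() -> list:
    """Answer 'which tasks exist?' without reading any documentation."""
    return sorted(TASKS)


def run_task(name: str, dry_run: bool = False) -> int:
    """Run every command of a task, stopping at the first failure."""
    if name not in TASKS:
        raise KeyError(f"unknown task {name!r}; known tasks: {', '.join(list_tasks())}")
    for command in TASKS[name]:
        # Always show the exact command, so it can be copied and run by hand.
        print("$", shlex.join(command))
        if dry_run:
            continue
        code = subprocess.run(command, check=False).returncode
        if code != 0:
            return code
    return 0
```

Calling run_task("lint", dry_run=True) prints the lint command without executing it, which is the kind of transparency that helps newcomers see what a task actually does.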

Troubleshooting

The linter failed on CI for someone else, but you can’t reproduce the failure. These issues can take an incredible amount of time to solve, can be flaky, can happen for unrelated reasons, and can terribly frustrate developers and maintainers.

In my experience the only way to address these is to make all steps explicit enough that providing help does not require prior or deep knowledge of the toolchain. If you always resort to asking the same person or team about “CI problems”, or if you need guides or documentation, the system may be too complex and frustrating.

This is the area where I think the Unix philosophy of small single-purpose tools shines and makes our life easier. Instead of trying to have a single tool do everything, keep each tool small and precise, but with a high degree of reliability.

Whatever setup you choose, there will always be something that does not work, but are devs able to unblock themselves, or will they get stuck and need help from someone else? Will they be able to get help from anyone, or just from very specific (busy) people?

Principles

Detour done, here are the principles:

Ease of use
The chosen system should be the easiest one that fulfils the requirements. Additional complexity “for future use” should be avoided at all costs. The more complex the tool, the more cognitively demanding its maintenance, and since our focus is on the product, a toolchain that needs heavy maintenance is, in my opinion, a big risk in the long run.
Needs to be compatible with the build tool of choice
There is no point in having tools that do not integrate well together. Look for constraints in your setup and build with them in mind.
CI is just another dev computer
Whatever runs on the CI needs to be reproducible locally or in a clean env. There must be no need to run things in CI to ensure the CI will pass. If you consider the CI just another development computer (with added security requirements), your setup will be reproducible in another CI system or on the new dev laptop for example.
Manage complex dependencies (or don’t)
Are you a multinational company with millions of lines of code, thousands of developers working on it, and complex dependencies across projects? If the answer is no, don’t think about managing complex dependencies (or tools to do that) and take the opposite route: aim for 0 dependencies. Don’t try to handle them, remove them.
Delegation over complication
Instead of trying to cram everything into a single tool, delegate to specialised tools where needed, and keep the integration point simple.

All these principles together aim at a single goal: in 6 months, when you look at your local dev environment and CI integration, you will be able to understand it and modify it.

A philosophy of CI design