Title: Managing a fleet of NixOS Part 1 - Design choices
Author: Solène
Date: 02 September 2022
Tags: bento nixos nix
Description: In this series of articles, I'll explain my steps toward
designing an infrastructure to centrally manage a fleet of NixOS
systems.
# Introduction | |
I have a grand project in mind, and I need to think about it before
starting any implementation. This blog is the right place for me to
explain what I want to do and the different solutions.
It's related to NixOS. I would like to ease the management of a fleet | |
of NixOS workstations that could be anywhere. | |
This could be useful for companies using NixOS for their employees, to
manage all the workstations remotely, but also for people who manage
NixOS systems in various places (cloud, datacenter, home, family
computers).
With central management, it makes sense not to give your users root
access. They would have to call their technical support to ask for
a change, and their system could be updated quickly to reflect the
request. This can be super useful for remote family computers when
they need an extra program that is not currently installed, and you
took on the responsibility of handling their system...
With NixOS, this setup totally makes sense: you can potentially
reproduce users' bugs as you have their configuration, stage new changes
for testing, and users can roll back to a previous working state in
case of a big regression.
The company Cachix made this possible before I figured out a solution.
It's still not too late to propose an open source alternative.
Cachix Deploy | |
# Defining the project | |
The purpose of this project is to have a central management system on
which you keep the configuration files for all the NixOS systems around,
and to allow the administrator to make the remote NixOS systems pick up
the new configuration as soon as possible when required.
We can imagine three different implementations at the highest level: | |
* a scheduled job on each machine looking for changes in the source. | |
The source could be a git repository, a tarball or anything that could | |
be used to carry the configuration. | |
* NixOS systems could connect to something like a pub/sub and wait for
an event from the central management to trigger a rebuild, the event
may or may not contain information / sources.
* the central management system could connect to the remote NixOS to
trigger the build / push the build.
These designs all have pros and cons. Let's look at them in more detail.
## Solution 1 - Scheduled job | |
In this scenario, the NixOS system would use a cron job or systemd timer
to periodically check for changes and trigger the update.
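
To make this more concrete, here is a minimal sketch of the script such a timer could run. It is only an illustration under assumptions: the repository URL, the state file path and the use of a flake-based `nixos-rebuild` are all hypothetical choices of mine, not something decided in this project.

```python
import subprocess
from pathlib import Path

# Hypothetical values, chosen only for this sketch.
REPO_URL = "https://example.org/fleet-config.git"
STATE_FILE = Path("/var/lib/fleet/last-revision")


def remote_revision(repo_url: str) -> str:
    """Return the commit hash of the remote HEAD without cloning anything."""
    out = subprocess.run(
        ["git", "ls-remote", repo_url, "HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()[0]


def needs_update(remote: str, state_file: Path) -> bool:
    """True when the remote revision differs from the last one applied."""
    if not state_file.exists():
        return True
    return state_file.read_text().strip() != remote


def check_and_update(repo_url: str = REPO_URL, state_file: Path = STATE_FILE) -> bool:
    """Rebuild only if the source changed; return True when a rebuild ran."""
    revision = remote_revision(repo_url)
    if not needs_update(revision, state_file):
        return False
    # Assumption: the repository is a flake exposing a configuration
    # named after the local hostname.
    subprocess.run(
        ["nixos-rebuild", "switch", "--flake", f"git+{repo_url}"],
        check=True,
    )
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(revision + "\n")
    return True
```

The timer would simply call `check_and_update()` at each run. Comparing a single commit hash keeps each poll cheap, but every machine still has to contact the source on every run.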
### Pros | |
* low maintenance | |
* could interactively ask the user when they want to upgrade if not now | |
### Cons | |
* may not run at all if the system is not up at the scheduled time, or
could run late depending on the situation
* can't force an update as soon as possible
* not really bandwidth efficient if you poll often
* no feedback to the central management about which systems
made/received the update (except by adding a call to the server?)
## Solution 2 - Remote systems are listening for changes (publisher / subscribe… | |
In this scenario, the NixOS system would always be connected to the
central management, using a protocol such as MQTT, BOSH or
similar.
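
The flow can be illustrated with a toy in-memory model. This is not MQTT or any real protocol, all the names (`Broker`, `make_client`) are made up for the sketch, and a real client would run a rebuild where the comment says so. It only shows the two properties I care about: the server knows who is connected, and each event gets an acknowledgment.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Broker:
    """Toy stand-in for the central pub/sub endpoint (hypothetical)."""
    subscribers: dict[str, Callable[[str], bool]] = field(default_factory=dict)

    def connect(self, host: str, on_event: Callable[[str], bool]) -> None:
        # A host registers the handler to call when an event is published.
        self.subscribers[host] = on_event

    def disconnect(self, host: str) -> None:
        self.subscribers.pop(host, None)

    def online_hosts(self) -> list[str]:
        # The server always knows which systems are currently connected.
        return sorted(self.subscribers)

    def publish(self, revision: str) -> dict[str, bool]:
        """Send the event to every connected host and collect acknowledgments."""
        return {host: handler(revision) for host, handler in self.subscribers.items()}


def make_client(applied: list[str]) -> Callable[[str], bool]:
    """Build an event handler that records the revisions it was asked to apply."""
    def on_event(revision: str) -> bool:
        # A real client would fetch the configuration and rebuild here.
        applied.append(revision)
        return True  # acknowledgment sent back to the broker
    return on_event
```

Hosts that are disconnected when `publish` is called simply miss the event, so a real implementation would need a way to replay it when they come back.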
### Pros | |
* you know which systems are up | |
* events from the central management are delivered instantly, and the
sender can wait for an acknowledgment
* updates should propagate very quickly | |
* could interactively ask the user when they want to upgrade if not now | |
### Cons | |
* this can lead to privacy issues as you know when each host is
connected
* this adds complexity to the server | |
* this adds complexity on each client | |
* firewalls usually don't like long-lived connections, an HTTPS-based
solution would help bypass firewalls
## Solution 3 - The central management pushes the updates to the remote systems | |
In this scenario, the NixOS system would be reachable over a protocol
that allows running commands, such as SSH. The central management
system would run a remote upgrade on it, or push the changes using
tools like deploy-rs, colmena, morph or similar...
Awesome-nix list: deployment-tools | |
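
Without any of these tools, the push could be sketched with plain `nixos-rebuild` and its `--target-host` option. The snippet below is a rough illustration under assumptions: I suppose the configurations live in a flake where each configuration is named after the host, and that root login over SSH is allowed. Hosts that could not be reached are returned so they can be retried later.

```python
import subprocess
from typing import Callable, Iterable


def push_command(host: str, flake_dir: str = ".") -> list[str]:
    """Build a nixos-rebuild call that builds locally and activates on `host` over SSH."""
    return [
        "nixos-rebuild", "switch",
        # Assumption: the flake exposes a configuration named like the host.
        "--flake", f"{flake_dir}#{host}",
        # Assumption: root login over SSH is allowed on the target.
        "--target-host", f"root@{host}",
    ]


def run(cmd: list[str]) -> bool:
    """Run a command and report whether it succeeded."""
    return subprocess.run(cmd).returncode == 0


def push_all(
    hosts: Iterable[str],
    runner: Callable[[list[str]], bool] = run,
) -> list[str]:
    """Push to every host and return the ones that failed, to retry later."""
    return [host for host in hosts if not runner(push_command(host))]
```

Calling `push_all(["laptop", "desktop"])` would update both machines one after the other and return the names of those that were unreachable or failed to switch.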
### Pros | |
* update is immediate | |
* SSH could be exposed over Tor or I2P for maximum firewall bypassing
capability
### Cons | |
* offline systems may be complicated to update, you would need to try
to connect to them often until they are reachable
* you can connect to the remote machine and potentially spy on the user.
In the alternatives above, you can potentially achieve the same by
reconfiguring the computer to allow this, but it would have to be done
on purpose
# Making a choice | |
I tried to state the pros and cons of each setup, but I can't see a
clear winner. However, I'm not convinced by Solution 1: as you
don't have any feedback or direct control over the systems, I prefer to
abandon it.
Solutions 2 and 3 are still in the competition, so we basically end up
with a choice between a PUSH and a PULL workflow.
# Conclusion | |
In order to choose between 2 and 3, I will need to experiment with the
Solution 2 technologies as I have never used them (MQTT, RabbitMQ, BOSH
etc…).