Title: Managing a fleet of NixOS Part 2 - A KISS design
Author: Solène
Date: 03 September 2022
Tags: bento nixos nix
Description: In this series of articles, I'll explain my steps toward
designing an infrastructure to centrally manage a fleet of NixOS
systems.
# Introduction

Let's continue my series on designing a fleet management system for
NixOS.

Yesterday, I came up with 3 solutions:

1. periodic data checkout
2. pub/sub - event driven
3. push from central management to workstations

I retained only solutions 2 and 3 because they were the only ones
providing instantaneous updates. However, I didn't want to give up on
the KISS solution 1, and I realized we could have a hybrid setup.

In my opinion, the best we can create is a hybrid setup of 1 and 3.
# A new solution

In this setup, all workstations connect periodically to the central
server to look for changes, and trigger a rebuild when they find some.
This simple mechanism can be greatly extended per-host to fit all our
needs:

* periodicity can be configured per-host
* the rebuild service can be triggered manually by the user clicking
  on a button on their computer
* the rebuild service can be triggered manually by a remote sysadmin
  having access to the system (using a VPN), this partially implements
  solution 3
* the central server can act as a binary cache if configured per-host,
  it can be used to build each configuration beforehand to avoid
  rebuilding on the workstations, this is one of the selling points of
  Cachix Deploy
* using ssh multiplexing, remote checks for the repository can reuse
  an existing connection, which reduces bandwidth usage
* a log of the update can be sent to the SFTP server
* the SFTP server can be used to check connectivity and activate a
  rollback to the previous state if the workstation can't reach it
  anymore (like "magic rollback" with deploy-rs)
* the SFTP server is a de facto available target for potential backups
  of the workstation using restic or duplicity

The mechanism is so simple that it could be adapted to many cases, like
using GitHub or any other data source instead of a central server. I
will personally use this with my laptop as the central system to manage
remote servers, which is funny as my goal is to use a server to manage
workstations :-)
# File access design

One important issue I didn't address in the previous article is how to
distribute the configuration files:

* each workstation should be restricted to its own configuration only
* how to send secrets? We don't want them in the nix-store
* should we use flakes or not? Better to have the choice
* the sysadmin on the central server should manage everything in a
  single git repository and be able to use common configuration files
  across the hosts

Addressing each of these requirements is hard, but in the end I've been
able to design a solution that is simple and flexible:

Figure: Design pattern for managing users

The workflow is the following:

* the sysadmin writes configuration files for each workstation in a
  dedicated directory
* the sysadmin creates a symlink to a directory of common modules in
  each workstation directory
* after a change, the sysadmin runs a program that copies each
  workstation configuration into its own directory inside a chroot,
  symlinks have to be resolved during the copy
* OPTIONAL: we can dry-build each host configuration to check that it
  works
* OPTIONAL: we can build each host configuration to provide it through
  a binary cache
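
The copy step of this workflow could be sketched as follows. This is
only a minimal sketch, not the actual implementation: the repository
layout (a "hosts" directory with one sub-directory per workstation),
the "config" target name and the function name publish_hosts are all
assumptions made for the example.

```python
import shutil
from pathlib import Path


def publish_hosts(repo: Path, chroot_root: Path) -> list[str]:
    """Copy each directory under repo/hosts to chroot_root/<host>/config.

    The "hosts" and "config" names are hypothetical, chosen for this sketch.
    """
    published = []
    for host_dir in sorted((repo / "hosts").iterdir()):
        if not host_dir.is_dir():
            continue
        target = chroot_root / host_dir.name / "config"
        # Start from a clean directory so files deleted in the repository
        # also disappear from the published copy.
        if target.exists():
            shutil.rmtree(target)
        # symlinks=False copies the content a symlink points to instead of
        # the link itself, so the shared modules end up as real files.
        shutil.copytree(host_dir, target, symlinks=False)
        published.append(host_dir.name)
    return published
```

Resolving the symlinks matters because a link pointing outside the
chroot would be dangling from the point of view of the workstation
downloading its directory.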

The directory holding a configuration is likely to contain a flake.nix
file (which can be a symlink to something generic), a configuration
file, a directory with a hierarchy of files to copy as-is into the
system (for things like secrets or configuration files not managed by
NixOS), and a symlink to a directory of nix files shared by all hosts.

Each NixOS client will connect to its dedicated user on the central
server over ssh using its private key. This makes it possible to
isolate each client on the host system and to restrict what it can
access using the SFTP chroot feature.
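
As an illustration of the SFTP chroot idea, the per-client part of the
sshd configuration could be generated from a list of (hostname, ssh
public key) pairs. This is a rough sketch under assumptions: the
/home/chroot path, the function names and the choice of using the
hostname as the user name are made up for the example. The directives
themselves (Match User, ChrootDirectory, ForceCommand internal-sftp)
are standard OpenSSH ones.

```python
def sshd_match_blocks(hosts, chroot_root="/home/chroot"):
    """Return sshd_config text restricting each client user to SFTP.

    hosts is a list of (hostname, public_key) pairs; the user name is
    assumed to be the hostname, and chroot_root is a hypothetical path.
    """
    blocks = []
    for hostname, _public_key in hosts:
        blocks.append(
            f"Match User {hostname}\n"
            f"    ChrootDirectory {chroot_root}/{hostname}\n"
            "    ForceCommand internal-sftp\n"
            "    AllowTcpForwarding no\n"
            "    X11Forwarding no\n"
        )
    return "\n".join(blocks)


def authorized_keys(hosts):
    """Map each user name to the single public key allowed to log in."""
    return {hostname: public_key for hostname, public_key in hosts}
```

Keep in mind that sshd requires the ChrootDirectory and its parent
directories to be owned by root and not writable by anyone else, so the
files a client downloads have to live in a sub-directory of its chroot.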

A diagram of a real-world case with 3 users would look like this:

Figure: Real world example with 3 users
# Work required for the implementation

The setup is very easy and requires only a few components:

* a program that translates the configuration repository into separate
  directories in the chroot
* some NixOS configuration to create the SFTP chroots, we just need to
  create a nix file with a list of pairs of values containing
  "hostname" and "ssh-public-key" for each remote host, this will
  automate the creation of the ssh configuration file
* a script on the user side that connects, looks for changes and runs
  nixos-rebuild if something changed, maybe rclone could be used to
  "sync" over SFTP efficiently
* a systemd timer for the user script
* a systemd socket triggering the user script, so people can just open
  http://localhost:9999 to trigger the socket and force the update, a
  bookmark named "UPDATE MY MACHINE" can be created on the user system
  for this
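
The decision logic of the user-side script could look like the sketch
below. It assumes the configuration has already been downloaded from
the SFTP server into a local directory, and it only decides whether a
rebuild is needed by comparing a checksum of that directory with the
one saved after the last successful rebuild. The function names and the
state file are hypothetical, and a plain "nixos-rebuild switch" is just
a placeholder for whatever rebuild command fits the host.

```python
import hashlib
import subprocess
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Return a SHA-256 digest covering file names and contents under root."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def run_rebuild() -> None:
    # Placeholder command: assumes the downloaded files are where
    # nixos-rebuild looks for the system configuration.
    subprocess.run(["nixos-rebuild", "switch"], check=True)


def update_if_changed(config_dir: Path, state_file: Path, rebuild=run_rebuild) -> bool:
    """Run rebuild only if config_dir changed since the last success.

    state_file must live outside config_dir, otherwise writing it would
    change the digest.
    """
    current = tree_digest(config_dir)
    previous = state_file.read_text().strip() if state_file.exists() else ""
    if current == previous:
        return False
    rebuild()
    # Only reached if rebuild() did not raise, so a failed rebuild is
    # retried at the next run.
    state_file.write_text(current + "\n")
    return True
```

Both the systemd timer and the systemd socket could start the same
script, since running it when nothing changed costs little.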
# Conclusion

I absolutely love this design: it's simple, and each piece can easily
be replaced to fit one's needs. Now, I need to start writing all the
bits to make it real, and offer it to the world 🎉.

I'm aware there is a NixOS module named autoUpgrade, but while it's
absolutely perfect for the average user workstation or server, it's
not practical for managing a fleet of NixOS systems efficiently.