Title: Managing a fleet of NixOS Part 2 - A KISS design
Author: Solène
Date: 03 September 2022
Tags: bento nixos nix
Description: In this series of articles, I'll explain my steps toward designing an infrastructure to centrally manage a fleet of NixOS systems.
# Introduction
Let's continue my series about designing a NixOS fleet management system.
Yesterday, I came up with three solutions:
1. periodic data checkout
2. pub/sub - event driven
3. push from central management to workstations
I had retained only solutions 2 and 3 because they were the only ones providing instantaneous updates. However, I realized we could build a hybrid setup, because I didn't want to throw the KISS solution 1 away.
In my opinion, the best we can create is a hybrid of solutions 1 and 3.
# A new solution
In this setup, all workstations periodically connect to the central server to look for changes, and trigger a rebuild when they find any. This simple mechanism can be greatly extended per host to fit all our needs (a minimal client-side sketch follows the list):
* the periodicity can be configured per host
* the rebuild service can be triggered manually by the user, for instance by clicking a button on their computer
* the rebuild service can also be triggered manually by a remote sysadmin with access to the system (over a VPN), which partially implements solution 3
* the central server can act as a binary cache if configured per host: each configuration can be built there beforehand to avoid rebuilding on the workstations, which is one of Cachix Deploy's selling points
* with ssh multiplexing, the periodic checks against the repository use less bandwidth, for maximum efficiency
* a log of each update can be sent back to the sftp server
* the sftp server can be used as a connectivity check, triggering a rollback to the previous state if it can't be reached anymore (like the "magic rollback" of deploy-rs)
* the sftp server is a de facto available target for backups of the workstation using restic or duplicity
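To make this concrete, here is a minimal sketch of what the client side could look like, written as a NixOS snippet. Everything in it is an assumption for illustration: the central host central.example.com, the per-host user machine1, the key path, and the bento-update name are all placeholders, and rclone is only one possible transport.

```nix
{ pkgs, ... }:
{
  # hypothetical client-side updater; hostname, user, key path and
  # script name are placeholders, not part of the original design
  environment.systemPackages = [
    (pkgs.writeShellScriptBin "bento-update" ''
      set -eu
      state_dir=/var/lib/fleet
      mkdir -p "$state_dir"

      # mirror this host's configuration from its SFTP chroot
      ${pkgs.rclone}/bin/rclone sync \
        :sftp,host=central.example.com,user=machine1,key_file=/etc/fleet/key:/config \
        "$state_dir/config"

      # hash the tree and only rebuild when something actually changed
      new=$(cd "$state_dir/config"; find . -type f -exec sha256sum {} + | sort | sha256sum)
      old=$(cat "$state_dir/last" 2>/dev/null || true)
      if [ "$new" != "$old" ]; then
        # assumes a flake-based configuration and running as root;
        # a non-flake nixos-rebuild invocation would work just as well
        /run/current-system/sw/bin/nixos-rebuild switch --flake "$state_dir/config"
        echo "$new" > "$state_dir/last"
      fi
    '')
  ];
}
```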
The mechanism is so simple that it could be adapted to many cases, such as using GitHub or any other data source instead of a central server. I will personally use it with my laptop as the central system to manage remote servers, which is funny given that my goal is to use a server to manage workstations :-)
# File access design
One important issue I didn't address in the previous article is how to distribute the configuration files:
* each workstation should be restricted to its own configuration only
* how do we ship secrets? We don't want them in the nix-store
* should we use flakes or not? It's better to leave the choice open
* the sysadmin on the central server should manage everything in a single git repository and be able to share common configuration files across the hosts
Addressing each of these requirements is hard, but in the end I was able to design a solution that is simple and flexible:
Design pattern for managing users
The workflow is the following:
* the sysadmin writes configuration files for each workstation in a dedicated directory
* the sysadmin creates a symlink to a directory of common modules in each workstation's directory
* after a change, the sysadmin runs a program that copies each workstation's configuration into its own directory in a chroot, resolving the symlinks along the way (a hypothetical helper is sketched below)
* OPTIONAL: we can dry-build each host configuration to check that it works
* OPTIONAL: we can build each host configuration to provide it through a binary cache
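As an illustration of the copy step, here is a hypothetical helper under assumed conventions: host configurations live in hosts/<name>/ inside the git repository, and each chroot lives under /home/chroot/<name>. The cp -rL flag resolves the symlinks to the shared modules while copying.

```nix
{ pkgs, ... }:
# a sketch only: the hosts/ layout and chroot paths are assumptions
pkgs.writeShellScriptBin "fleet-populate" ''
  set -eu
  cd /path/to/configuration-repository   # placeholder
  for dir in hosts/*/; do
    host=$(basename "$dir")
    dest="/home/chroot/$host/config"
    rm -rf "$dest"
    cp -rL "$dir" "$dest"   # -L dereferences the symlinked common modules

    # OPTIONAL: dry-build the host to catch mistakes before clients pull
    # nix build --dry-run "$dest#nixosConfigurations.$host.config.system.build.toplevel"
  done
''
```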
The directory holding a host's configuration is likely to contain a flake.nix file (possibly a symlink to something generic), a configuration file, a directory with a hierarchy of files to copy as-is into the system (for secrets or configuration files not managed by NixOS), and a symlink to a directory of nix files shared by all hosts.
The NixOS clients will connect to their dedicated users over ssh using their private keys; this separates the clients from each other on the host system and restricts what they can access, thanks to the SFTP chroot feature.
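Here is a hedged sketch of what the server side could look like as NixOS configuration: one dedicated system user per workstation, confined to its own directory by sshd's ChrootDirectory and internal-sftp. The hostnames, keys, group name, and paths are placeholders.

```nix
{ lib, ... }:
let
  # hostname -> public key, placeholders only
  hosts = {
    machine1 = "ssh-ed25519 AAAA... root@machine1";
    machine2 = "ssh-ed25519 AAAA... root@machine2";
  };
in
{
  users.groups.fleet = { };

  # one dedicated user per workstation; the chroot directory itself
  # must be owned by root for sshd to accept it
  users.users = lib.mapAttrs (name: key: {
    isSystemUser = true;
    group = "fleet";
    home = "/home/chroot/${name}";
    openssh.authorizedKeys.keys = [ key ];
  }) hosts;

  # confine each user to its own directory, SFTP only
  services.openssh.extraConfig = lib.concatStrings (lib.mapAttrsToList (name: _: ''
    Match User ${name}
      ChrootDirectory /home/chroot/${name}
      ForceCommand internal-sftp
      AllowTcpForwarding no
      X11Forwarding no
  '') hosts);
}
```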
A diagram of a real-world case with 3 users would look like this:
Real world example with 3 users
# Work required for the implementation
The setup is very simple and requires only a few components:
* a program that translates the configuration repository into separate directories in the chroot (like the helper sketched above)
* some NixOS configuration to create the SFTP chroots: we just need a nix file with a list of "hostname" / "ssh-public-key" pairs, one per remote host, to automate the generation of the ssh configuration (as in the previous sketch)
* a client-side script that connects, looks for changes, and runs nixos-rebuild if anything changed; maybe rclone could be used to "sync" over SFTP efficiently
* a systemd timer for that script (first sketch below)
* a systemd socket triggering the script, so people can just open http://localhost:9999 to force an update; create a bookmark named "UPDATE MY MACHINE" on the user's system (second sketch below)
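For the periodic part, a systemd timer wired to a oneshot service is enough. This sketch assumes the hypothetical bento-update script from the first example; the unit name and intervals are arbitrary, and the interval is exactly the per-host periodicity knob mentioned earlier.

```nix
{ ... }:
{
  systemd.services.fleet-update = {
    description = "Pull configuration from the central server and rebuild";
    serviceConfig = {
      Type = "oneshot";
      ExecStart = "/run/current-system/sw/bin/bento-update";
    };
  };

  systemd.timers.fleet-update = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnBootSec = "5min";
      OnUnitActiveSec = "30min";   # the per-host periodicity knob
    };
  };
}
```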
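And for the "UPDATE MY MACHINE" bookmark, a socket-activated unit can do the trick. A minimal sketch, again assuming the hypothetical bento-update script: each connection to port 9999 spawns an instance of the service. In this bare form the browser gets no proper HTTP reply, only the side effect of the update running.

```nix
{ ... }:
{
  systemd.sockets.fleet-trigger = {
    wantedBy = [ "sockets.target" ];
    socketConfig = {
      ListenStream = "127.0.0.1:9999";
      Accept = true;   # one service instance per connection
    };
  };

  # template unit: instantiated by the socket for each connection
  systemd.services."fleet-trigger@" = {
    serviceConfig = {
      Type = "oneshot";
      ExecStart = "/run/current-system/sw/bin/bento-update";
      StandardInput = "socket";
    };
  };
}
```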
# Conclusion
I absolutely love this design: it's simple, and each piece can easily be replaced to fit one's needs. Now I need to start writing all the bits to make it real, and offer it to the world 🎉.
There is a NixOS module named autoUpgrade; I'm aware of its existence, but while it's absolutely fine for the average user's workstation or server, it's not practical for efficiently managing a fleet of NixOS systems.