| Title: Managing a fleet of NixOS Part 1 - Design choices | |
| Author: Solène | |
| Date: 02 September 2022 | |
| Tags: bento nixos nix | |
| Description: In this series of articles, I'll explain my steps toward | |
| designing an infrastructure to centrally manage a fleet of NixOS | |
| systems. | |
| # Introduction | |
| I have a grand project in mind, and I need to think about it before | |
| starting any implementation. This blog is the right place for me to | |
| explain what I want to do and the different solutions. | |
| It's related to NixOS. I would like to ease the management of a fleet | |
| of NixOS workstations that could be anywhere. | |
| This could be useful for companies using NixOS for their employees, to | |
| manage all the workstations remotely, but also for people who may | |
| manage NixOS systems in various places (cloud, datacenter, house, | |
| family computers). | |
| With this kind of central management, it makes sense not to give your | |
| users root access. They would have to call their technical support to | |
| ask for a change, and their system could be updated quickly to reflect | |
| the request. This can be super useful for remote family computers when | |
| they need an extra program that is not currently installed, and you | |
| took the responsibility of handling their system... | |
| With NixOS, this setup totally makes sense: you can potentially | |
| reproduce users' bugs as you have their configuration, stage new | |
| changes for testing, and users can roll back to a previous working | |
| state in case of a big regression. | |
| The Cachix company made this possible before I figured out a solution. | |
| It's still not too late to propose an open source alternative. | |
| Cachix Deploy | |
| # Defining the project | |
| The purpose of this project is to have a central management system on | |
| which you keep the configuration files for all your NixOS systems, and | |
| which allows the administrator to make the remote systems pick up a new | |
| configuration as soon as possible when required. | |
| We can imagine three different implementations at the highest level: | |
| * a scheduled job on each machine looking for changes in the source. | |
| The source could be a git repository, a tarball or anything that could | |
| be used to carry the configuration. | |
| * NixOS systems could connect to something like a pub/sub and wait for | |
| an event from the central management to trigger a rebuild. The event | |
| may or may not contain information / sources. | |
| * the central management system could connect to the remote NixOS to | |
| trigger the build / push the build | |
| These designs all have pros and cons. Let's look at them in more | |
| detail. | |
| ## Solution 1 - Scheduled job | |
| In this scenario, the NixOS system would use a cron job or a systemd | |
| timer to periodically check for changes and trigger the update. | |
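As a rough illustration of the check a scheduled job would perform, here is a minimal sketch in Python. The names `config_fingerprint`, `poll_once`, `fetch` and `rebuild` are hypothetical and only serve the example: `fetch` stands for whatever retrieves the configuration source (git repository, tarball...), and `rebuild` for whatever applies it.

```python
import hashlib
from typing import Callable, Optional


def config_fingerprint(content: bytes) -> str:
    """Return a stable fingerprint of the configuration source."""
    return hashlib.sha256(content).hexdigest()


def poll_once(
    fetch: Callable[[], bytes],
    last_seen: Optional[str],
    rebuild: Callable[[], None],
) -> str:
    """Run one scheduled check: rebuild only if the source changed.

    Returns the fingerprint to remember for the next run.
    """
    current = config_fingerprint(fetch())
    if current != last_seen:
        rebuild()
    return current
```

On a real system, `rebuild` would presumably end up calling `nixos-rebuild switch`, and the returned fingerprint would have to be stored on disk between two runs of the timer.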
| ### Pros | |
| * low maintenance | |
| * could interactively ask the user when they want to upgrade if not now | |
| ### Cons | |
| * may not run at all if the system is not up at the scheduled time, or | |
| could run late depending on the situation | |
| * can't force an update as soon as possible | |
| * not really bandwidth efficient if you poll often | |
| * no feedback to the central management about which systems received or | |
| applied the update (except by adding a call to the server?) | |
| ## Solution 2 - Remote systems are listening for changes (publisher / subscribe… | |
| In this scenario, the NixOS system would always be connected to the | |
| central management, using some kind of protocol like MQTT, BOSH or | |
| similar. | |
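To make the idea more concrete, here is a toy sketch of the publish / subscribe flow in Python. It is an in-memory stand-in and not a real MQTT client: the `Broker` class and its methods are hypothetical names chosen for the example, and a callback returning `True` plays the role of an acknowledgment.

```python
from typing import Callable, Dict, Set


class Broker:
    """Toy in-memory stand-in for the central management (not real MQTT)."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, Callable[[str], bool]] = {}

    def subscribe(self, host: str, on_event: Callable[[str], bool]) -> None:
        # A connected host registers a callback; this is also how the
        # server learns which systems are currently up.
        self.subscribers[host] = on_event

    def disconnect(self, host: str) -> None:
        # Forget a host that went offline.
        self.subscribers.pop(host, None)

    def publish(self, revision: str) -> Set[str]:
        """Send a rebuild event to every connected host.

        Returns the set of hosts that acknowledged the event.
        """
        acked = set()
        for host, on_event in self.subscribers.items():
            if on_event(revision):
                acked.add(host)
        return acked
```

The value returned by `publish` is what would give the administrator feedback about which systems received the event.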
| ### Pros | |
| * you know which systems are up | |
| * events from the central management are delivered instantly, and the | |
| server should wait for an acknowledgment | |
| * updates should propagate very quickly | |
| * could interactively ask the user when they want to upgrade if not now | |
| ### Cons | |
| * this can lead to privacy issues, as the server knows when each host | |
| is connected | |
| * this adds complexity to the server | |
| * this adds complexity on each client | |
| * firewalls usually don't like long-lived connections; an HTTPS-based | |
| solution would help bypass firewalls | |
| ## Solution 3 - The central management pushes the updates to the remote systems | |
| In this scenario, the NixOS system would be reachable over a protocol | |
| that allows running commands, such as SSH. The central management | |
| system would run a remote upgrade on it, or push the changes using | |
| tools like deploy-rs, colmena, morph or similar... | |
| Awesome-nix list: deployment-tools | |
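The push workflow, including the retries needed for unreachable machines, could be sketched like this in Python. This is only an illustration under assumptions: `upgrade_command` and `push_updates` are hypothetical names, connecting as `root` is a guess about the setup, and the `run` parameter is injected so the logic can be tried without any real host.

```python
from typing import Callable, Iterable, List, Set


def upgrade_command(host: str) -> List[str]:
    """Build an SSH command asking a remote host to rebuild itself.

    Logging in as root is an assumption made for this example.
    """
    return ["ssh", f"root@{host}", "nixos-rebuild", "switch"]


def push_updates(
    hosts: Iterable[str],
    run: Callable[[List[str]], bool],
    max_attempts: int = 3,
) -> Set[str]:
    """Try to upgrade every host; return the hosts still not updated."""
    pending = set(hosts)
    for _ in range(max_attempts):
        if not pending:
            break
        # Keep only the hosts whose command failed (for example offline).
        pending = {h for h in pending if not run(upgrade_command(h))}
    return pending
```

In practice, `run` could be something like `lambda cmd: subprocess.run(cmd).returncode == 0`, and a real tool would wait between attempts instead of retrying immediately.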
| ### Pros | |
| * update is immediate | |
| * SSH could be exposed over Tor or I2P for maximum firewall bypassing | |
| capability | |
| ### Cons | |
| * offline systems may be complicated to update: you would need to try | |
| to connect to them often until they are reachable | |
| * you can connect to the remote machine and potentially spy on the | |
| user. With the alternatives above, you could potentially achieve the | |
| same by reconfiguring the computer to allow it, but that would have to | |
| be done on purpose | |
| # Making a choice | |
| I tried to state the pros and cons of each setup, but I can't see a | |
| clear winner. However, I'm not convinced by Solution 1, as you don't | |
| have any feedback from or direct control over the systems, so I prefer | |
| to abandon it. | |
| Solutions 2 and 3 are still in the competition: we basically ended up | |
| with a choice between a PUSH and a PULL workflow. | |
| # Conclusion | |
| In order to choose between 2 and 3, I will need to experiment with the | |
| Solution 2 technologies, as I have never used them (MQTT, RabbitMQ, | |
| BOSH etc…). | |