Title: Managing a fleet of NixOS Part 1

	Title: Managing a fleet of NixOS Part 1 - Design choices
	Author: Solène
	Date: 02 September 2022
	Tags: bento nixos nix
	Description: In this series of articles, I'll explain my steps toward
	designing an infrastructure to centrally manage a fleet of NixOS
	systems.

	# Introduction

	I have a big project in mind, and I need to think it through before
	starting any implementation. This blog is the right place for me to
	explain what I want to do and the different possible solutions.

	It's related to NixOS. I would like to ease the management of a fleet
	of NixOS workstations that could be anywhere.

	This could be useful for companies using NixOS for their employees, to
	manage all the workstations remotely, but also for people who may
	manage NixOS systems in various places (cloud, datacenter, house,
	family computers).

	With central management, it makes sense not to give your users root
	access. They would have to call their technical support to ask for
	a change, and their system could be updated quickly to reflect the
	request. This can be super useful for remote family computers when
	they need an extra program that isn't currently installed, and you
	took on the responsibility of handling their system...

	With NixOS, this setup totally makes sense: you can potentially
	reproduce users' bugs as you have their configuration, stage new
	changes for testing, and users can roll back to a previous working
	state in case of a big regression.

	The company Cachix made this possible before I figured out a solution.
	It's not too late to propose an open source alternative.

	Cachix Deploy

	# Defining the project

	The purpose of this project is to have a central management system on
	which you keep the configuration files for all your NixOS systems, and
	to allow the administrator to make the remote systems pick up a new
	configuration as soon as possible when required.

	We can imagine three different implementations at the highest level:

	* a scheduled job on each machine looking for changes in the source.
	The source could be a git repository, a tarball or anything that could
	be used to carry the configuration.
	* NixOS systems could connect to something like a pub/sub and wait for
	an event from the central management to trigger a rebuild; the event
	may or may not contain information / sources.
	* the central management system could connect to the remote NixOS to
	trigger the build / push the build

	These designs all have pros and cons. Let's look at them in more detail.

	## Solution 1 - Scheduled job

	In this scenario, the NixOS system would use a cron job or a systemd
	timer to periodically check for changes and trigger the update.
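
	A minimal sketch of one polling cycle, assuming the source is a git
	repository and that the client remembers the last revision it applied.
	The names `remote_revision` and `poll_once` are made up for this
	illustration, and the rebuild step is passed in as a function so the
	logic can be tried without touching a real system.

```python
import subprocess
from typing import Callable, Optional


def remote_revision(repo_url: str, ref: str = "HEAD") -> str:
    """Ask the git remote for the commit hash of `ref` without cloning."""
    out = subprocess.run(
        ["git", "ls-remote", repo_url, ref],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    # Each output line looks like "<hash>\t<ref>"; keep the first hash.
    return out.split()[0]


def poll_once(
    fetch_revision: Callable[[], str],
    last_revision: Optional[str],
    rebuild: Callable[[str], None],
) -> str:
    """Run one polling cycle and return the revision now considered applied.

    `rebuild` is only called when the source revision changed since the
    last cycle, which is the "check for changes, then update" step.
    """
    current = fetch_revision()
    if current != last_revision:
        rebuild(current)
    return current
```

	A cron job or systemd timer would call `poll_once` on each run, with
	`fetch_revision` wrapping `remote_revision` for the repository URL and
	`rebuild` wrapping whatever command applies the configuration. Storing
	the returned revision between runs is left out here.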

	### Pros

	* low maintenance
	* could interactively ask the user when they want to upgrade if not now

	### Cons

	* may not run at all if the system is not up at the scheduled time, or
	could run late depending on the situation
	* can't force an update as soon as possible
	* not really bandwidth efficient if you poll often
	* no feedback to the central management about which systems received
	or applied the update (except by adding a call to the server?)

	## Solution 2 - Remote systems are listening for changes (publisher / subscribe…

	In this scenario, the NixOS system would always be connected to the
	central management, using some kind of protocol like MQTT, BOSH or
	similar.
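
	To make the idea concrete, here is a small in-memory sketch of the
	publish / acknowledge flow. It is not MQTT or any real protocol: the
	`Broker` class and its methods are hypothetical stand-ins for the
	central management, and each connected host is represented by a
	handler function that returns whether it acknowledged the event.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Broker:
    """In-memory stand-in for the central management's pub/sub endpoint."""

    # Maps a host name to the handler that receives events for that host.
    subscribers: Dict[str, Callable[[str], bool]] = field(default_factory=dict)

    def subscribe(self, host: str, handler: Callable[[str], bool]) -> None:
        """Register a host; in a real setup this is the host connecting."""
        self.subscribers[host] = handler

    def unsubscribe(self, host: str) -> None:
        """Forget a host; in a real setup this is the connection dropping."""
        self.subscribers.pop(host, None)

    def publish(self, event: str) -> Dict[str, bool]:
        """Deliver `event` to every connected host and collect acknowledgments."""
        return {host: handler(event) for host, handler in self.subscribers.items()}
```

	Calling `publish("rebuild")` returns one entry per connected host, so
	the server learns both which systems are up and which ones accepted
	the event. Hosts that are offline simply do not appear in the result.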

	### Pros

	* you know which systems are up
	* events from the central management are delivered instantly, and the
	server can wait for an acknowledgment
	* updates should propagate very quickly
	* could interactively ask the user when they want to upgrade if not now

	### Cons

	* this can lead to privacy issues as you know when each host is
	connected
	* this adds complexity to the server
	* this adds complexity on each client
	* firewalls usually don't like long-lived connections; an HTTPS-based
	solution would help get through firewalls

	## Solution 3 - The central management pushes the updates to the remote systems

	In this scenario, the NixOS system would be reachable over a protocol
	that allows running commands, like SSH. The central management system
	would run a remote upgrade on it, or push the changes using tools like
	deploy-rs, colmena, morph or similar...

	Awesome-nix list: deployment-tools
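
	As a rough sketch of the push workflow, the following uses plain
	`nixos-rebuild` with its `--flake` and `--target-host` options instead
	of one of the dedicated tools. It assumes the configurations live in a
	flake whose output names match the host names and that SSH access as
	`root` works; the helper names are made up for this example. Hosts
	that can't be reached are retried for a few rounds.

```python
import subprocess
from typing import Callable, Iterable, List, Set


def push_command(flake: str, host: str, user: str = "root") -> List[str]:
    """Build the command that deploys the configuration `host` over SSH."""
    return [
        "nixos-rebuild", "switch",
        "--flake", f"{flake}#{host}",
        "--target-host", f"{user}@{host}",
    ]


def run_command(command: List[str]) -> bool:
    """Run a command and report whether it exited successfully."""
    return subprocess.run(command).returncode == 0


def push_all(
    flake: str,
    hosts: Iterable[str],
    run: Callable[[List[str]], bool] = run_command,
    max_rounds: int = 3,
) -> Set[str]:
    """Push to every host, retrying failures; return hosts still pending."""
    pending = set(hosts)
    for _ in range(max_rounds):
        if not pending:
            break
        # Keep only the hosts whose deployment failed in this round.
        pending = {h for h in pending if not run(push_command(flake, h))}
    return pending
```

	The returned set contains the hosts that never succeeded, which is
	what the administrator would need to retry later. A real setup would
	probably wait between rounds; the delay is omitted here.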

	### Pros

	* update is immediate
	* SSH could be exposed over Tor or I2P for maximum firewall bypassing
	capability

	### Cons

	* offline systems may be complicated to update: you would need to keep
	trying to connect to them until they are reachable
	* you can connect to the remote machine and potentially spy on the
	user. With the alternatives above, you could achieve the same by
	reconfiguring the computer to allow this, but it would have to be done
	on purpose

	# Making a choice

	I tried to state the pros and cons of each setup, but I can't see a
	clear winner. However, I'm not convinced by Solution 1: as you
	don't have any feedback from the systems or direct control over them,
	I prefer to abandon it.

	Solutions 2 and 3 are still in the competition, so we basically end up
	with a choice between a PUSH and a PULL workflow.

	# Conclusion

	In order to choose between 2 and 3, I will need to experiment with the
	Solution 2 technologies as I have never used them (MQTT, RabbitMQ, BOSH,
	etc.).