Title: Managing a fleet of NixOS Part 2 - A KISS design
Author: Solène
Date: 03 September 2022
Tags: bento nixos nix
Description: In this series of articles, I'll explain my steps toward designing an infrastructure to centrally manage a fleet of NixOS systems.
# Introduction
Let's continue my series about designing a NixOS fleet management system.
Yesterday, I came up with three solutions:
1. periodic data checkout
2. pub/sub - event driven
3. push from central management to workstations
I had retained only solutions 2 and 3 because they were the only ones providing instantaneous updates. However, I realized we could build a hybrid setup, because I didn't want to throw the KISS solution 1 away.
In my opinion, the best we can create is a hybrid of solutions 1 and 3.
# A new solution
In this setup, all workstations periodically connect to the central server to look for changes, and trigger a rebuild when they find any. This simple mechanism can be greatly extended per host to fit all our needs (a minimal client-side sketch follows the list):
* the periodicity can be configured per host
* the rebuild service can be triggered manually by the user, for instance by clicking a button on their computer
* the rebuild service can also be triggered manually by a remote sysadmin with access to the system (over a VPN), which partially implements solution 3
* the central server can act as a binary cache if configured per host: each configuration can be built there beforehand to avoid rebuilding on the workstations, which is one of Cachix Deploy's selling points
* with ssh multiplexing, the periodic checks against the repository use less bandwidth, for maximum efficiency
* a log of each update can be sent back to the sftp server
* the sftp server can be used as a connectivity check, triggering a rollback to the previous state if it can't be reached anymore (like the "magic rollback" of deploy-rs)
* the sftp server is a de facto available target for backups of the workstation using restic or duplicity
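To make this concrete, here is a minimal sketch of what the client side could look like, written as a NixOS snippet. Everything in it is an assumption for illustration: the central host central.example.com, the per-host user machine1, the key path, and the bento-update name are all placeholders, and rclone is only one possible transport.

```nix
{ pkgs, ... }:
{
  # hypothetical client-side updater; hostname, user, key path and
  # script name are placeholders, not part of the original design
  environment.systemPackages = [
    (pkgs.writeShellScriptBin "bento-update" ''
      set -eu
      state_dir=/var/lib/fleet
      mkdir -p "$state_dir"

      # mirror this host's configuration from its SFTP chroot
      ${pkgs.rclone}/bin/rclone sync \
        :sftp,host=central.example.com,user=machine1,key_file=/etc/fleet/key:/config \
        "$state_dir/config"

      # hash the tree and only rebuild when something actually changed
      new=$(cd "$state_dir/config"; find . -type f -exec sha256sum {} + | sort | sha256sum)
      old=$(cat "$state_dir/last" 2>/dev/null || true)
      if [ "$new" != "$old" ]; then
        # assumes a flake-based configuration and running as root;
        # a non-flake nixos-rebuild invocation would work just as well
        /run/current-system/sw/bin/nixos-rebuild switch --flake "$state_dir/config"
        echo "$new" > "$state_dir/last"
      fi
    '')
  ];
}
```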
The mechanism is so simple that it could be adapted to many cases, such as using GitHub or any other data source instead of a central server. I will personally use it with my laptop as the central system to manage remote servers, which is funny given that my goal is to use a server to manage workstations :-)
# File access design
One important issue I didn't address in the previous article is how to distribute the configuration files:
* each workstation should be restricted to its own configuration only
* how do we ship secrets? We don't want them in the nix-store
* should we use flakes or not? It's better to leave the choice open
* the sysadmin on the central server should manage everything in a single git repository and be able to share common configuration files across the hosts
Addressing each of these requirements is hard, but in the end I was able to design a solution that is simple and flexible:
Design pattern for managing users
The workflow is the following:
* the sysadmin writes configuration files for each workstation in a dedicated directory
* the sysadmin creates a symlink to a directory of common modules in each workstation's directory
* after a change, the sysadmin runs a program that copies each workstation's configuration into its own directory in a chroot, resolving the symlinks along the way (a hypothetical helper is sketched below)
* OPTIONAL: we can dry-build each host configuration to check that it works
* OPTIONAL: we can build each host configuration to provide it through a binary cache
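As an illustration of the copy step, here is a hypothetical helper under assumed conventions: host configurations live in hosts/<name>/ inside the git repository, and each chroot lives under /home/chroot/<name>. The cp -rL flag resolves the symlinks to the shared modules while copying.

```nix
{ pkgs, ... }:
# a sketch only: the hosts/ layout and chroot paths are assumptions
pkgs.writeShellScriptBin "fleet-populate" ''
  set -eu
  cd /path/to/configuration-repository   # placeholder
  for dir in hosts/*/; do
    host=$(basename "$dir")
    dest="/home/chroot/$host/config"
    rm -rf "$dest"
    cp -rL "$dir" "$dest"   # -L dereferences the symlinked common modules

    # OPTIONAL: dry-build the host to catch mistakes before clients pull
    # nix build --dry-run "$dest#nixosConfigurations.$host.config.system.build.toplevel"
  done
''
```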
The directory holding a host's configuration is likely to contain a flake.nix file (possibly a symlink to something generic), a configuration file, a directory with a hierarchy of files to copy as-is into the system (for secrets or configuration files not managed by NixOS), and a symlink to a directory of nix files shared by all hosts.
The NixOS clients will connect to their dedicated users over ssh using their private keys; this separates the clients from each other on the host system and restricts what they can access, thanks to the SFTP chroot feature.
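Here is a hedged sketch of what the server side could look like as NixOS configuration: one dedicated system user per workstation, confined to its own directory by sshd's ChrootDirectory and internal-sftp. The hostnames, keys, group name, and paths are placeholders.

```nix
{ lib, ... }:
let
  # hostname -> public key, placeholders only
  hosts = {
    machine1 = "ssh-ed25519 AAAA... root@machine1";
    machine2 = "ssh-ed25519 AAAA... root@machine2";
  };
in
{
  users.groups.fleet = { };

  # one dedicated user per workstation; the chroot directory itself
  # must be owned by root for sshd to accept it
  users.users = lib.mapAttrs (name: key: {
    isSystemUser = true;
    group = "fleet";
    home = "/home/chroot/${name}";
    openssh.authorizedKeys.keys = [ key ];
  }) hosts;

  # confine each user to its own directory, SFTP only
  services.openssh.extraConfig = lib.concatStrings (lib.mapAttrsToList (name: _: ''
    Match User ${name}
      ChrootDirectory /home/chroot/${name}
      ForceCommand internal-sftp
      AllowTcpForwarding no
      X11Forwarding no
  '') hosts);
}
```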
A diagram of a real-world case with 3 users would look like this:
Real world example with 3 users
# Work required for the implementation
The setup is very simple and requires only a few components:
* a program that translates the configuration repository into separate directories in the chroot (like the helper sketched above)
* some NixOS configuration to create the SFTP chroots: we just need a nix file with a list of "hostname" / "ssh-public-key" pairs, one per remote host, to automate the generation of the ssh configuration (as in the previous sketch)
* a client-side script that connects, looks for changes, and runs nixos-rebuild if anything changed; maybe rclone could be used to "sync" over SFTP efficiently
* a systemd timer for that script (first sketch below)
* a systemd socket triggering the script, so people can just open http://localhost:9999 to force an update; create a bookmark named "UPDATE MY MACHINE" on the user's system (second sketch below)
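For the periodic part, a systemd timer wired to a oneshot service is enough. This sketch assumes the hypothetical bento-update script from the first example; the unit name and intervals are arbitrary, and the interval is exactly the per-host periodicity knob mentioned earlier.

```nix
{ ... }:
{
  systemd.services.fleet-update = {
    description = "Pull configuration from the central server and rebuild";
    serviceConfig = {
      Type = "oneshot";
      ExecStart = "/run/current-system/sw/bin/bento-update";
    };
  };

  systemd.timers.fleet-update = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnBootSec = "5min";
      OnUnitActiveSec = "30min";   # the per-host periodicity knob
    };
  };
}
```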
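And for the "UPDATE MY MACHINE" bookmark, a socket-activated unit can do the trick. A minimal sketch, again assuming the hypothetical bento-update script: each connection to port 9999 spawns an instance of the service. In this bare form the browser gets no proper HTTP reply, only the side effect of the update running.

```nix
{ ... }:
{
  systemd.sockets.fleet-trigger = {
    wantedBy = [ "sockets.target" ];
    socketConfig = {
      ListenStream = "127.0.0.1:9999";
      Accept = true;   # one service instance per connection
    };
  };

  # template unit: instantiated by the socket for each connection
  systemd.services."fleet-trigger@" = {
    serviceConfig = {
      Type = "oneshot";
      ExecStart = "/run/current-system/sw/bin/bento-update";
      StandardInput = "socket";
    };
  };
}
```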
# Conclusion
I absolutely love this design: it's simple, and each piece can easily be replaced to fit one's needs. Now I need to start writing all the bits to make it real, and offer it to the world 🎉.
There is a NixOS module named autoUpgrade; I'm aware of its existence, but while it's absolutely fine for the average user's workstation or server, it's not practical for efficiently managing a fleet of NixOS systems.