Title: Linux NILFS file system: automatic continuous snapshots

	Author: Solène
	Date: 05 October 2022
	Tags: linux filesystem nilfs
	Description: In this article, I present the Linux file system NILFS and
	its automatic continuous snapshotting system.

	# Introduction

	Today, I'd like to present a special Linux file system that I really
	enjoy. It's called NILFS and was merged into Linux in 2009, so it's
	not really a new player. Despite being stable and used in production,
	it never became popular.

	In this file system, there is a unique system of continuous checkpoint
	creation. A checkpoint is a snapshot of your system at a given point
	in time, but it can be deleted automatically if some disk space must be
	reclaimed. A checkpoint can be transformed into a snapshot that will
	never be removed.

	This mechanism works very well for workstations or file servers on
	which redundancy is nonexistent, and on which backups are only done
	every day or every week, which leaves room for unrecoverable mistakes.

	NILFS project official website
	Wikipedia page about NILFS

	# NILFS concepts

	NILFS is a Copy-On-Write (CoW) file system, which means that when you
	change a file, the original chunk on the disk isn't modified: a new
	chunk is written with the new content instead. This plays well with
	keeping a history of the files.

	From my experience, it performs very well on SSD devices on a desktop
	system, even during heavy I/O operations.

	The continuous checkpoint creation system may be very confusing, so
	I'll explain how to learn about this mechanism and how to tame it.

	# Garbage collection

	The concept of a garbage collector may seem obvious to most people,
	but if it doesn't ring a bell, let me give a quick explanation. In
	computer science, a garbage collector is a task that looks for
	unused memory and makes it available again.

	On NILFS, as a checkpoint is created every few seconds, used data is
	never freed on its own, and one would run out of disk space pretty
	quickly. This is where the `nilfs_cleanerd` program, the garbage
	collector, comes in: it looks at the oldest checkpoints and deletes
	them to reclaim disk space under certain conditions. Its default
	strategy is to keep checkpoints as long as possible, until it needs to
	make some room to avoid issues. This may not suit a workload creating
	a lot of files, which is why it can be tuned very precisely. For most
	desktop users, the defaults should work fine.

	The garbage collector is automatically started on a volume upon mount.
	You can use the command `nilfs-clean` to control that daemon: reload
	its configuration, stop it, etc.
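
	Before tuning anything, it helps to see what is currently configured.
	Here is a minimal sketch of a helper that reads one setting from the
	configuration file. The function name `cleanerd_setting` is made up
	for this example, and I'm assuming the usual `/etc/nilfs_cleanerd.conf`
	location and a "name value" line format with `#` comments; check your
	distribution if the file lives elsewhere.

	```shell
	# Print the value of one setting from a nilfs_cleanerd configuration file.
	# $1 = setting name
	# $2 = configuration file (assumed default: /etc/nilfs_cleanerd.conf)
	cleanerd_setting() {
		awk -v key="$1" '
			/^[ \t]*#/ { next }            # skip comment lines
			$1 == key  { print $2; exit }  # print the first matching value
		' "${2:-/etc/nilfs_cleanerd.conf}"
	}
	```

	For instance, `cleanerd_setting protection_period` should print, as I
	understand that setting, the number of seconds during which recent
	checkpoints are protected from the garbage collector, provided it is
	present in your file.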

	When you delete a file on a NILFS file system, it doesn't free up any
	disk space because the file is still available in a previous
	checkpoint. You need to wait for the corresponding checkpoints to be
	removed before the space is freed.

	# How to find the current size of your data set

	The output of `df` for a NILFS file system reports the real disk usage
	of your data AND of the snapshots/checkpoints, so it can't be used to
	know how much space your current data actually takes.

	In order to figure out the current disk usage (without accounting for
	older checkpoints/snapshots), we will use the command `lscp` to look at
	the number of blocks contained in the most recent checkpoint. With the
	usual block size of 4096 bytes, we multiply the block count by 4096 to
	get bytes, then turn the total into gigabytes by dividing three times
	by 1024 (bytes -> kilobytes -> megabytes -> gigabytes).

	```shell
	# block count of the last listed checkpoint, converted to gigabytes
	lscp | awk 'END { print $(NF-1)*4096/1024/1024/1024 }'
	```

	This number is the current size of what you have on the partition.
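
	The one-liner above can be wrapped in a small function, which makes the
	block size explicit. This is only a sketch: the name
	`latest_checkpoint_gib` is mine, and it assumes that `lscp` lists the
	newest checkpoint last with the block count in the next-to-last column,
	exactly as the one-liner does.

	```shell
	# Print the size, in gigabytes, of the last checkpoint listed by lscp.
	# Reads the output of `lscp` on standard input.
	# $1 = block size in bytes (assumed default: 4096)
	latest_checkpoint_gib() {
		awk -v bs="${1:-4096}" '
			NR > 1 && NF >= 2 { blocks = $(NF-1) }   # remember the block count of each row
			END { printf "%.2f\n", blocks * bs / 1024 / 1024 / 1024 }
		'
	}
	```

	You would call it as `lscp | latest_checkpoint_gib`, or pass another
	block size as the first argument if your file system was created with
	a different one.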

	# Create a checkpoint / snapshot

	It's possible to create a snapshot of your current system state using
	the command `mkcp`.

	```shell
	mkcp --snapshot
	```

	Or you can turn a checkpoint into a snapshot using the command `chcp`.

	```shell
	chcp ss /dev/sda1 28579
	```

	The opposite operation (snapshot to checkpoint) can be done using
	`chcp cp`.
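
	To check which checkpoints are currently protected as snapshots, you
	can filter the output of `lscp` on its MODE column, which shows `ss`
	for a snapshot and `cp` for a plain checkpoint. A minimal sketch,
	where `list_snapshots` is a name I made up and the column position
	comes from the `lscp` output layout I know (CNO, DATE, TIME, MODE,
	FLG, BLKCNT, ICNT):

	```shell
	# Print the numbers of the checkpoints that are marked as snapshots.
	# Reads the output of `lscp` on standard input.
	list_snapshots() {
		# NR > 1 skips the header line; column 4 is MODE, column 1 is CNO
		awk 'NR > 1 && $4 == "ss" { print $1 }'
	}
	```

	Running `lscp | list_snapshots` right after a `chcp ss` should then
	include the checkpoint number you just converted.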

	# How to recover files after a big mistake

	Let's say you deleted an important work in progress, and you have
	neither a backup nor any other way to retrieve it. Fortunately, you are
	using NILFS and a checkpoint was created every few seconds, so the
	files are still there and within reach!

	The first step is to pause the garbage collector to avoid losing the
	files: `nilfs-clean --suspend`. After this, we can think slowly about
	the next steps without having to worry.

	The next step is to list the checkpoints using the command `lscp` and
	look for a date/time at which the files still existed, preferably in
	their latest version. The best choice is the checkpoint created just
	before the deletion.
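
	Scanning a long `lscp` listing by eye is error-prone, so the search can
	be sketched with `awk`. The helper below (a hypothetical name,
	`checkpoint_before`) prints the number of the most recent checkpoint
	created at or before a given date and time. It assumes the DATE and
	TIME columns use the `YYYY-MM-DD HH:MM:SS` form, so they can be
	compared as plain strings.

	```shell
	# Print the number of the latest checkpoint created at or before a limit.
	# Reads the output of `lscp` on standard input.
	# $1 = limit, written as "YYYY-MM-DD HH:MM:SS"
	checkpoint_before() {
		awk -v limit="$1" '
			BEGIN { best = "" }
			NR > 1 {
				ts = $2 " " $3                   # DATE and TIME columns
				if (ts <= limit && ts >= best) { # newest one not after the limit
					best = ts
					cno = $1
				}
			}
			END { if (cno != "") print cno }
		'
	}
	```

	For example, if you deleted the files around 09:00:15, you could run
	`lscp | checkpoint_before '2022-10-05 09:00:15'`. Nothing is printed
	when every checkpoint is newer than the limit.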

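	Then, we can mount the checkpoint (let's say number 12345 for the
	example) on a different directory. Only a checkpoint marked as a
	snapshot can be mounted this way, so convert it first with
	`chcp ss /dev/sda1 12345` if needed, then use the following command: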

	```shell
	mount -t nilfs2 -r -o cp=12345 /dev/sda1 /mnt
	```

	If it went fine, you should be able to browse the data in `/mnt` to
	recover your files.

	Once you have finished recovering your files, unmount `/mnt` and resume
	the garbage collector with `nilfs-clean --resume`.
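
	The whole procedure can be sketched as two small functions. The names
	`run`, `recovery_begin` and `recovery_end` are mine, and setting
	`DRY_RUN=1` only prints the commands so you can review them first. The
	`chcp ss` step is an assumption on my side: to my knowledge, the `cp=`
	mount option only accepts checkpoints marked as snapshots.

	```shell
	# Run a command, or only print it when DRY_RUN=1 is set.
	run() {
		if [ "${DRY_RUN:-0}" = "1" ]; then
			echo "$*"
		else
			"$@"
		fi
	}

	# Pause the garbage collector and mount a checkpoint read-only.
	# $1 = device, $2 = checkpoint number, $3 = mount point
	recovery_begin() {
		run nilfs-clean --suspend "$1" &&
		run chcp ss "$1" "$2" &&   # assumption: cp= needs a snapshot
		run mount -t nilfs2 -r -o "cp=$2" "$1" "$3"
	}

	# Unmount the checkpoint and let the garbage collector run again.
	# $1 = device, $2 = mount point
	recovery_end() {
		run umount "$2" &&
		run nilfs-clean --resume "$1"
	}
	```

	A cautious way to use it is `DRY_RUN=1 recovery_begin /dev/sda1 12345
	/mnt` first, then the same call as root without `DRY_RUN`, and finally
	`recovery_end /dev/sda1 /mnt` once the files are copied. The sketch
	leaves the checkpoint as a snapshot; `chcp cp` turns it back if you
	don't want to keep it forever.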

	# Going further

	Here is a list of extra pieces you may want to read to learn more about
	nilfs2:

	* nilfs_cleanerd and nilfs_cleanerd.conf man pages to tune the garbage
	collector
	* man pages for lscp / mkcp / rmcp / chcp to manage snapshots and
	checkpoints manually