Title: BTRFS deduplication using bees
Author: Solène
Date: 16 August 2022
Tags: nixos btrfs linux
Description: This explains how to use bees to enable offline
deduplication on a BTRFS file system
# Introduction

BTRFS is a Linux file system that uses a Copy On Write (COW) model. It
provides many features such as on-the-fly compression, volume
management, snapshots, clones, etc.

Wikipedia page about Copy on write

However, BTRFS doesn't natively support deduplication: deduplication
looks for identical chunks of data across files, and when a match is
found, a single copy of the chunk is shared by all the files
referencing it. In some scenarios, this can drastically reduce the
disk space usage.
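To make the sharing idea concrete, here is a small sketch (the path is
only a placeholder) showing how shared extents can be observed on a
BTRFS file system: the "Set shared" column reports data referenced by
more than one file, while "Exclusive" is data referenced only once.

```
# Summarize disk usage of a directory on a BTRFS file system.
# "Exclusive" is data only referenced by these files, "Set shared"
# is data shared with other files through reflinks/deduplication.
sudo btrfs filesystem du -s /home/user/projects
```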
This is where "bees" comes in: a program that does offline
deduplication for BTRFS file systems. In this context, offline means
the deduplication happens after the data has been written, when a
dedicated program scans the file system, while live/on-the-fly
deduplication is applied as the data is written. The HAMMER file
system from DragonFly BSD does offline deduplication, while ZFS does
it live. There are pros and cons to both models: the ZFS documentation
recommends 1 GB of memory per terabyte of disk when deduplication is
enabled, because all the chunk hashes have to be kept in memory.

Bees GitHub project page
# Usage

Bees is a service you need to install and start on your system. It has
some documented limitations and caveats, but it should work for most
users.

You can define a BTRFS file system on which you want deduplication and
a load target. Bees will work silently while your system load is below
the threshold, and will pause when the load exceeds it. This is a
simple mechanism to prevent bees from eating all your system resources
when freshly modified or created files need to be scanned.
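The threshold is compared against the system load average; assuming a
regular Linux system, you can check the current values like this to
pick a sensible target:

```
# Print the 1, 5 and 15 minute load averages; bees' --loadavg-target
# option is compared against this kind of value.
uptime
cat /proc/loadavg
```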
The first time you run bees on a file system that is not empty, it may
take a while to scan everything. After that it stays really quiet,
except when you do heavy I/O operations like downloading big files,
but it does a good job of staying behind the scenes.
# Installation on NixOS

Add this code to /etc/nixos/configuration.nix and run "nixos-rebuild
switch" to apply the changes.

```
services.beesd.filesystems = {
  root = {
    # the BTRFS file system to deduplicate, referenced by its label
    spec = "LABEL=nixos";
    # size of the hash table used to track chunk hashes
    hashTableSizeMB = 256;
    verbosity = "crit";
    # pause deduplication when the load average exceeds 2.0
    extraOptions = [ "--loadavg-target" "2.0" ];
  };
};
```
The code assumes your root partition is labelled "nixos", that you want
a hash table of 256 MB (this memory will be used by bees) and that you
don't want bees to run when the system load is above 2.0.
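If you are not sure about the label of your BTRFS partition, a quick
check (assuming the file system is mounted on /) could look like this:

```
# Print the label of the mounted BTRFS root file system; it should
# match the "spec" value used in the configuration above.
sudo btrfs filesystem label /
# Alternatively, list labels and UUIDs of all block devices.
lsblk -o NAME,FSTYPE,LABEL,UUID
```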
You may want to tune the values, mostly the hash table size, depending
on your file system size. Bees is designed for terabyte-scale file
systems, but this doesn't mean you can't use it on an average user's
disks.
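Once the new configuration is applied, you can check that the service
is running and follow its activity; the unit name below is an
assumption based on the NixOS module and the "root" entry used above.

```
# Check that the bees service for the "root" file system is running.
systemctl status beesd@root.service
# Follow its log to see scanning and deduplication activity.
journalctl -fu beesd@root.service
```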
# Results

I tried it on my workstation, which holds a lot of build artifacts and
git repositories: bees reduced the disk usage from 160 GB to 124 GB,
which is a huge win here.

Later, I tried it again on some Steam games with a few Proton versions:
it didn't save much on the games themselves, but saved a lot on the
Proton installations.

On my local cache server, it saved nothing, but that was to be
expected.
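If you want to measure the savings on your own system, a simple way is
to compare the disk usage before enabling bees and after the initial
scan has finished:

```
# Report allocated and used space on the BTRFS file system; run it
# before enabling bees and again once the initial scan is done.
sudo btrfs filesystem usage /
# The classic df output gives a quick overall view as well.
df -h /
```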
# Conclusion

BTRFS is a solid alternative to ZFS: it requires less memory while
providing volumes, snapshots and compression. The only thing missing
for me was deduplication, and I'm glad it's done offline, so it doesn't
use too much memory.