Title: How to check your data integrity?
Author: Solène
Date: 17 March 2017
Tags: unix security
Description:
Today, the topic is data degradation: bit rot, bitrotting, damaged
files, or whatever you call it. It's when your data gets corrupted over
time, due to a disk fault or some unknown reason.
# What is data degradation? #

I shamelessly paste one line from Wikipedia: "*Data degradation is the
gradual corruption of computer data due to an accumulation of
non-critical failures in a data storage device. The phenomenon is also
known as data decay or data rot.*"

[Data degradation on
Wikipedia](https://en.wikipedia.org/wiki/Data_degradation)
So, how do we know we encountered bit rot?

    bit rot = (checksum changed) && NOT (modification time changed)

While a legitimate update of a file could be mistaken for bit rot,
there is a difference:

    update = (checksum changed) && (modification time changed)
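As a minimal sketch, this test could be expressed in a shell script,
assuming GNU coreutils (**sha256sum** and **stat -c**); the file path
and the old_sum/old_mtime values are hypothetical, they would normally
come from a previously stored database:

    #!/bin/sh
    # Classify a change on one file: bit rot vs legitimate update.
    # old_sum and old_mtime would normally be loaded from our database;
    # they are hardcoded here for illustration.
    file="my_data/some_file"
    old_sum="0000000000000000000000000000000000000000000000000000000000000000"
    old_mtime="1489698240"

    new_sum=$(sha256sum "$file" | cut -d ' ' -f 1)
    new_mtime=$(stat -c %Y "$file")

    if [ "$new_sum" != "$old_sum" ]; then
        if [ "$new_mtime" = "$old_mtime" ]; then
            echo "bit rot detected: $file"
        else
            echo "legitimate update: $file"
        fi
    fi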
# How to check if we encounter bit rot? #

There is no way to prevent bit rot. But there are some ways to detect
it, so you can restore a corrupted file from a backup, or repair it
with the right tool (you can't repair a file with a hammer, except if
it's some kind of HammerFS! :D )

In the following I will describe software I found to check (or even
repair) bit rot. If you know other tools which are not in this list, I
would be happy to hear about them, please mail me.
In the following examples, I will use this method to generate bit rot
on a file:

    % touch -d "2017-03-16T21:04:00" my_data/some_file_that_will_be_corrupted
    % generate_checksum_database_with_tool
    % echo "a" >> my_data/some_file_that_will_be_corrupted
    % touch -d "2017-03-16T21:04:00" my_data/some_file_that_will_be_corrupted
    % start_tool_for_checking
We generate the checksum database, then we alter a file by adding an
"a" at the end of the file and we restore the modification and access
time of the file. Then, we start the tool to check for data corruption.

The first **touch** is only for convenience; we could get the
modification time with the **stat** command and pass the same value to
**touch** after modifying the file.
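With GNU coreutils, that alternative could look like this (a sketch;
BSD stat and touch use different flags):

    % mtime=$(stat -c %y my_data/some_file_that_will_be_corrupted)
    % echo "a" >> my_data/some_file_that_will_be_corrupted
    % touch -d "$mtime" my_data/some_file_that_will_be_corrupted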
## bitrot ##

This is a Python script and it's **very** easy to use. It will scan a
directory and create a database with the checksum of the files and
their modification date.
**Initialization usage:**

    % cd /home/my_data/
    % bitrot
    Finished. 199.41 MiB of data read. 0 errors found.
    189 entries in the database, 189 new, 0 updated, 0 renamed, 0 missing.
    Updating bitrot.sha512... done.
    % echo $?
    0
**Verify usage (case OK):**

    % cd /home/my_data/
    % bitrot
    Checking bitrot.db integrity... ok.
    Finished. 199.41 MiB of data read. 0 errors found.
    189 entries in the database, 0 new, 0 updated, 0 renamed, 0 missing.
    % echo $?
    0

Exit status is 0, so our data are not damaged.
**Verify usage (case Error):**

    % cd /home/my_data/
    % bitrot
    Checking bitrot.db integrity... ok.
    error: SHA1 mismatch for ./sometextfile.txt: expected
    17b4d7bf382057dc3344ea230a595064b579396f, got
    db4a8d7e27bb9ad02982c0686cab327b146ba80d. Last good hash checked on
    2017-03-16 21:04:39.
    Finished. 199.41 MiB of data read. 1 errors found.
    189 entries in the database, 0 new, 0 updated, 0 renamed, 0 missing.
    error: There were 1 errors found.
    % echo $?
    1
Since the exit status is non-zero when a check fails, it's easy to
write a script running it every day/week/month.
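A minimal cron-able sketch, assuming local mail delivery works and
using the hypothetical directory from the examples above:

    #!/bin/sh
    # Run bitrot and mail the report if corruption was detected.
    cd /home/my_data || exit 1
    if ! output=$(bitrot 2>&1); then
        echo "$output" | mail -s "bitrot: corruption detected" root
    fi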
[Github page](https://github.com/ambv/bitrot/)

bitrot is available in OpenBSD ports in sysutils/bitrot since the 6.1
release.
## par2cmdline ##

This tool works with PAR2 archives (see below for more information
about what PAR is) and from them, it is able to check your data
integrity **AND** repair it.

While it has the advantage of being able to repair data, the drawback
is that it's not very easy to use. I would use this one for checking
the integrity of long-term archives that won't change. The main
limitation comes from the PAR specification: the archives are created
from a file list, so if you have a directory with your files and you
add new files, you will need to recompute ALL the PAR archives because
the file list changed, or create new PAR archives only for the new
files, but that makes the verify process more complicated. Creating
new archives for every bunch of files added to the directory doesn't
seem practical.
PAR2 lets you choose what percentage of a file you will be able to
repair; by default it will create the archives to be able to repair up
to 5% of each file. That means you don't need a whole backup of the
files (although not having one would be a bad idea) and only
approximately an extra 5% of your data to store.
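The redundancy level is chosen at creation time with the **-r**
option; for example, to be able to repair up to 10% of each file
(using the same directory as in the example below):

    % par2 create -r10 -a integrity_archive -R my_data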
**Create usage:**

    % cd /home/
    % par2 create -a integrity_archive -R my_data
    Skipping 0 byte file: /home/my_data/empty_file
    Source file count: 17
    Source block count: 2000
    Redundancy: 5%
    Recovery block count: 100
    Recovery file count: 7
    [...cut here...]
    Opening: my_data/[....]
    Computing Reed Solomon matrix.
    Constructing: done.
    Wrote 381200 bytes to disk
    Writing recovery packets
    Writing verification packets
    Done
    % echo $?
    0
    % ls -1
    integrity_archive.par2
    integrity_archive.vol000+01.par2
    integrity_archive.vol001+02.par2
    integrity_archive.vol003+04.par2
    integrity_archive.vol007+08.par2
    integrity_archive.vol015+16.par2
    integrity_archive.vol031+32.par2
    integrity_archive.vol063+37.par2
    my_data
**Verify usage (OK):**

    % par2 verify integrity_archive.par2
    Loading "integrity_archive.par2".
    Loaded 36 new packets
    Loading "integrity_archive.vol000+01.par2".
    Loaded 1 new packets including 1 recovery blocks
    Loading "integrity_archive.vol001+02.par2".
    Loaded 2 new packets including 2 recovery blocks
    Loading "integrity_archive.vol003+04.par2".
    Loaded 4 new packets including 4 recovery blocks
    Loading "integrity_archive.vol007+08.par2".
    Loaded 8 new packets including 8 recovery blocks
    Loading "integrity_archive.vol015+16.par2".
    Loaded 16 new packets including 16 recovery blocks
    Loading "integrity_archive.vol031+32.par2".
    Loaded 32 new packets including 32 recovery blocks
    Loading "integrity_archive.vol063+37.par2".
    Loaded 37 new packets including 37 recovery blocks
    Loading "integrity_archive.par2".
    No new packets found
    The block size used was 3812 bytes.
    There are a total of 2000 data blocks.
    The total size of the data files is 7595275 bytes.
    [...cut here...]
    Target: "my_data/....." - found.
    % echo $?
    0
**Verify usage (with error):**

    % par2 verify integrity_archive.par.par2
    Loading "integrity_archive.par.par2".
    Loaded 36 new packets
    Loading "integrity_archive.par.vol000+01.par2".
    Loaded 1 new packets including 1 recovery blocks
    Loading "integrity_archive.par.vol001+02.par2".
    Loaded 2 new packets including 2 recovery blocks
    Loading "integrity_archive.par.vol003+04.par2".
    Loaded 4 new packets including 4 recovery blocks
    Loading "integrity_archive.par.vol007+08.par2".
    Loaded 8 new packets including 8 recovery blocks
    Loading "integrity_archive.par.vol015+16.par2".
    Loaded 16 new packets including 16 recovery blocks
    Loading "integrity_archive.par.vol031+32.par2".
    Loaded 32 new packets including 32 recovery blocks
    Loading "integrity_archive.par.vol063+37.par2".
    Loaded 37 new packets including 37 recovery blocks
    Loading "integrity_archive.par.par2".
    No new packets found
    The block size used was 3812 bytes.
    There are a total of 2000 data blocks.
    The total size of the data files is 7595275 bytes.
    [...cut here...]
    Target: "my_data/....." - found.
    Target: "my_data/Ebooks/Lovecraft/Quete Onirique de Kadath l'Inconnue.epub" - damaged. Found 95 of 95 data blocks.
    1 file(s) exist but are damaged.
    16 file(s) are ok.
    You have 2000 out of 2000 data blocks available.
    You have 100 recovery blocks available.
    Repair is possible.
    You have an excess of 100 recovery blocks.
    None of the recovery blocks will be used for the repair.
    % echo $?
    1
**Repair usage:**

    % par2 repair integrity_archive.par.par2
    Loading "integrity_archive.par.par2".
    Loaded 36 new packets
    Loading "integrity_archive.par.vol000+01.par2".
    Loaded 1 new packets including 1 recovery blocks
    Loading "integrity_archive.par.vol001+02.par2".
    Loaded 2 new packets including 2 recovery blocks
    Loading "integrity_archive.par.vol003+04.par2".
    Loaded 4 new packets including 4 recovery blocks
    Loading "integrity_archive.par.vol007+08.par2".
    Loaded 8 new packets including 8 recovery blocks
    Loading "integrity_archive.par.vol015+16.par2".
    Loaded 16 new packets including 16 recovery blocks
    Loading "integrity_archive.par.vol031+32.par2".
    Loaded 32 new packets including 32 recovery blocks
    Loading "integrity_archive.par.vol063+37.par2".
    Loaded 37 new packets including 37 recovery blocks
    Loading "integrity_archive.par.par2".
    No new packets found
    The block size used was 3812 bytes.
    There are a total of 2000 data blocks.
    The total size of the data files is 7595275 bytes.
    [...cut here...]
    Target: "my_data/....." - found.
    Target: "my_data/Ebooks/Lovecraft/Quete Onirique de Kadath l'Inconnue.epub" - damaged. Found 95 of 95 data blocks.
    1 file(s) exist but are damaged.
    16 file(s) are ok.
    You have 2000 out of 2000 data blocks available.
    You have 100 recovery blocks available.
    Repair is possible.
    You have an excess of 100 recovery blocks.
    None of the recovery blocks will be used for the repair.
    [...cut here...]
    Target: "my_data/Ebooks/Lovecraft/Quete Onirique de Kadath l'Inconnue.epub" - found.
    % echo $?
    0
par2cmdline is only one implementation; other tools working with PAR
archives exist, and they should all work with the same PAR files.

[Parchive on Wikipedia](https://en.wikipedia.org/wiki/Parchive)

[Github page](https://github.com/Parchive/par2cmdline)

par2cmdline is available in OpenBSD ports in archivers/par2cmdline.

If you find a way to add new files to existing archives, please mail
me.
## mtree ##

One can write a little script using **mtree** (in the base system on
OpenBSD and FreeBSD) which will create a file with the checksum of
every file in the specified directories. If the mtree output differs
from last time, we can send a mail with the differences. This process
is done in the base install of OpenBSD for /etc and some other files,
to warn you if they changed.
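As a sketch, such a script could store a specification with SHA-256
digests and compare the directory against it later (keyword names can
vary between systems, check your mtree manual):

    % mtree -c -K sha256digest -p /home/my_data > /home/my_data.mtree
    # later: compare the directory against the stored specification
    % mtree -f /home/my_data.mtree -p /home/my_data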
While it's suited for directories like /etc, in my opinion, this is
not the best tool for doing integrity checks.
## ZFS ##

I would like to talk about ZFS and data integrity because this is
where ZFS is very good. If you are using ZFS, you may not need any
other software to take care of your data. When you write a file, ZFS
will also store its checksum as metadata. By default, the "checksum"
property is enabled on datasets; you can disable it for better
performance, but you would lose corruption detection.
There is a command to ask ZFS to check the integrity of the files.
Warning: scrub is very I/O intensive and can take from hours to days
or even weeks to complete, depending on your CPU, disks and the amount
of data to scrub:

    # zpool scrub zpool

The scrub command will verify the checksum of every block in the ZFS
pool; if something is wrong, it will try to repair it when possible. A
repair is possible in the following cases:
If you have multiple disks in raid-Z or raid-1 (mirror), ZFS will look
on the other disks for a non-corrupted version of the data; if it
finds one, it will restore it on the disk(s) where it's corrupted.
If you have set the ZFS property "copies" to 2 or 3 (1 is the
default), the file is written 2 or 3 times on the disk. Each file of
the dataset will be allocated 2 or 3 times on the disk, so take care
if you want to use it on a dataset containing large files! If ZFS
finds that a version of a file is corrupted, it will check the other
copies of it and try to restore the corrupted file if possible.
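For example, enabling two copies on a hypothetical dataset and
checking the relevant properties:

    # zfs set copies=2 zpool/my_data
    # zfs get checksum,copies zpool/my_data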
You can see the percentage of the pool already scrubbed with

    # zpool status zpool

and the scrub can be stopped with

    # zpool scrub -s zpool
## BTRFS ##

Like ZFS, BTRFS is able to scrub its data, report bit rot, and repair
it if the data is available on another disk.

To start a scrub, run:

    # btrfs scrub start /

You can check the progress using:

    # btrfs scrub status /

It's possible to use `btrfs scrub cancel /` to stop a scrub and resume
it later with `btrfs scrub resume /`; however, btrfs tries its best to
scrub the data without affecting the responsiveness of the system too
much.
## AIDE ##

Its name is an acronym for "Advanced Intrusion Detection Environment";
it's a complicated piece of software which can be used to check for
bit rot. I would not recommend using it if you only need bit rot
detection.

Here are a few hints if you want to use it for checking your file
integrity:
**/etc/aide.conf**

    /home/my_data/ R
    # Rule definition
    All=m+s+i+md5
    report_summarize_changes=yes
(R is for recursive). The "All" line lists the checks we do on each
file. For bit rot checking, we want to check the modification time,
size, checksum and inode of the files. The
`report_summarize_changes` option displays a list of changes if
something is wrong.
This is the most basic config file you can have. You will then have to
run **aide** once to create the database, and run it again later to
create a new database and compare the two. It doesn't update its
database itself; you will have to move the old database aside and tell
aide where to find it.
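A minimal sketch of that workflow (the database file names are
assumptions, they depend on the paths set in your aide.conf):

    % aide --init                    # writes a new database, e.g. aide.db.new
    % mv aide.db.new aide.db         # make it the reference database
    % aide --check                   # compare the filesystem against aide.db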
# My use case #

I have different kinds of data. On one side, I have static data like
pictures, clips and music, things that won't change over time; on the
other side I have my mails, documents and folders whose content
changes regularly (creation, deletion, modification). I can afford a
backup of 100% of my data, with a few days of backup history, so I am
not interested in file repair.
I want to be warned quickly if a file gets corrupted, so I can still
find the good version in my backup history, because I don't keep every
version of my files for long. I chose to go with the Python tool
**bitrot**; it's very easy to use and it doesn't become a mess with my
folders getting updated often.
I would go with par2cmdline if I were not able to back up all my data.
Having 5% or 10% of redundancy for my files *should* be enough to
restore them in case of corruption without taking too much space.