We had a minor crash in our data center. "Minor" because it only
affected about a dozen of our more than 700 systems.
Still, we had data loss and possibly some silent corruption, and that
silent corruption is the really big issue.
Since we use ZFS for almost all application data, we could simply
scrub the pools. No checksum errors? Perfect, that data is fine.
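A rough sketch of that kind of check, in Python and assuming nothing
beyond the standard zpool CLI (the pool names are simply whatever
"zpool list" reports on the host):

    #!/usr/bin/env python3
    # Rough sketch: start a scrub on every imported pool. Scrubs run in
    # the background, so pool health is checked with "zpool status -x".
    import subprocess

    def pools():
        # "zpool list -H -o name" prints one pool name per line, no header.
        out = subprocess.run(["zpool", "list", "-H", "-o", "name"],
                             capture_output=True, text=True, check=True)
        return out.stdout.split()

    for pool in pools():
        subprocess.run(["zpool", "scrub", pool], check=True)

    # This reports the *current* state; checksum errors found by the
    # scrubs only show up here once they have finished.
    status = subprocess.run(["zpool", "status", "-x"],
                            capture_output=True, text=True)
    print(status.stdout.strip())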
But what about the operating systems? We run full-blown Linux virtual
machines, so every customer gets his own /, /usr, /lib and so on. What
about all that data -- did it get corrupted as well? Sadly, we still
use ext4 there.
On some systems we got lucky: bad superblocks. That means more work
for me, because I have to re-build those systems -- not *that* much
work, though, since we use config management for virtually everything
-- but at least I can be sure that these systems indeed *are*
affected.
Other systems just crashed and successfully rebooted. Now what?
I'm pretty much fed up with this situation. All filesystems should
have checksums in 2017. Fuck performance. Performance is worth nothing
if you operate on faulty data.
I'm currently writing a tool that checksums files and stores the
checksums in extended attributes. This is *far* from satisfactory.
Still, in scenarios like the one above, it would at least let us
manually "scrub" our data to further assess the situation.