We had a minor crash in our data center. It's called "minor" because
 it affected only about a dozen of our more than 700 systems.

 Still,  we  had  data  loss  and possibly some silent corruption. That
 silent corruption is a really big issue.

 Since we use ZFS for almost all application data, we could simply
 scrub the pools. No checksum errors? Perfect, the data is fine then.
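
 As an illustration, a scrub pass over every pool could be kicked off
 roughly like the sketch below (plain calls to the zpool tool; scrubs
 run in the background, so the final status check only means something
 once they have finished):

     #!/usr/bin/env python3
     # Rough sketch: start a scrub on every imported pool, then list
     # pools that are not healthy. Assumes 'zpool' is in PATH and that
     # we run with sufficient privileges.
     import subprocess

     def pool_names():
         # 'zpool list -H -o name' prints one pool name per line,
         # without a header.
         out = subprocess.run(["zpool", "list", "-H", "-o", "name"],
                              capture_output=True, text=True, check=True)
         return out.stdout.splitlines()

     for pool in pool_names():
         subprocess.run(["zpool", "scrub", pool], check=True)

     # '-x' reports only pools with errors or other problems; a clean
     # result prints "all pools are healthy". Run this again after the
     # scrubs have finished to get the actual verdict.
     status = subprocess.run(["zpool", "status", "-x"],
                             capture_output=True, text=True)
     print(status.stdout, end="")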

 But  what  about  operating  systems?  We run full-blown Linux virtual
 machines, so every customer gets his own /, /usr, /lib and so on. What
 about  all  that  data? Did that get corrupted as well? Very sadly, we
 still use ext4 here.

 On some systems we got lucky: bad superblocks. This results in more
 work for me (because I have to rebuild those systems -- which is not
 *that* much work, though, since we use config management for virtually
 everything), but I can be sure that these systems indeed *are*
 affected.

 Other systems just crashed and successfully rebooted. Now what?

 I'm pretty much fed up with this  situation.  All  filesystems  should
 have checksums in 2017. Fuck performance. Performance is worth nothing
 if you operate on faulty data.

 I'm currently in the process of writing a tool  that  checksums  files
 and  stores  the  checksum  in extended attributes. This is *far* from
 satisfactory. Still,  in  scenarios  like  the  one  above,  we  could
 manually "scrub" our data to further assess the situation.