Today was crazy. Discovered we have a really interesting bug with one of our
k8s clusters. Turns out the kernel version we are running has an old and not
so great Ceph module. This module causes the block device to hang when
(I think) terminating a pod using an RBD for storage. The consequences of this
RBD getting stuck are many. For starters, the docker API begin to slow down and
causes a significant increase in docker api timeout errors. The second observed
problem is that some pods won't start back up if others in the deployment
haven't yet finished terminating, causing deployments to fail and finally
causing our monitoring system to lose it's mind because it's trying to scan
disks and getting stuck trying to read block devices that are hung.
Today, I feel a combination of amused and exhausted.