Rook Ceph Crashes
We were getting a crashing error in Rook-Ceph’s operator which sent it into a crashloop. This prevented new disks from being provisioned in Ceph, a major blocker as we’re expanding our cluster. No configuration had been changed in the cluster recently, which caused some confusion!
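The crashloop was visible on the operator pod, and the panic showed up in its logs. Roughly the commands we used, assuming the default rook-ceph namespace and the standard app=rook-ceph-operator label (names may differ on other clusters):

# Operator pod stuck in CrashLoopBackOff
kubectl -n rook-ceph get pods -l app=rook-ceph-operator

# Panic output from the previously crashed container
kubectl -n rook-ceph logs deploy/rook-ceph-operator --previous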
The error was a Go panic:
panic: assignment to entry in nil map
github.com/rook/rook/pkg/operator/ceph/disruption/clusterdisruption.resetPDBConfig(...)
/home/runner/work/rook/rook/pkg/operator/ceph/disruption/clusterdisruption/osd.go:689
Digging into the code brought up this function:
From: rook/pkg/operator/ceph/disruption/clusterdisruption/osd.go
func resetPDBConfig(pdbStateMap *corev1.ConfigMap) {
	// The assignment on the next line is osd.go:689 - the line the panic points at.
	pdbStateMap.Data[drainingFailureDomainKey] = ""
	delete(pdbStateMap.Data, drainingFailureDomainDurationKey)
	delete(pdbStateMap.Data, pgHealthCheckDurationKey)
}
This reads a ConfigMap and resets some values in its Data map. Looking at the ConfigMaps deployed on our cluster brought up a match - rook-ceph-pdbstatemap - which was strangely empty. That explains the ‘assignment to entry in nil map’: a ConfigMap with no data has a nil Data map, and in Go assigning a key into a nil map panics (reads and deletes are fine, but the first line of resetPDBConfig writes).
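A tiny standalone Go snippet (not from Rook, just to illustrate the nil-map behaviour - a ConfigMap’s Data field is a plain map[string]string):

package main

import "fmt"

func main() {
	// An empty ConfigMap deserialises with a nil Data map;
	// this nil map stands in for pdbStateMap.Data.
	var data map[string]string

	// Reading from a nil map is fine and returns the zero value.
	fmt.Println(data["draining-failure-domain"]) // prints an empty string

	// Deleting from a nil map is also a no-op.
	delete(data, "set-no-out")

	// But assigning to a nil map panics:
	// panic: assignment to entry in nil map
	data["draining-failure-domain"] = ""
}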
I spun up a single-node instance of Ceph on minikube to compare the contents of its rook-ceph-pdbstatemap:
draining-failure-domain: ''
set-no-out: ''
This seemed promising: I added these keys to the operational Ceph cluster’s ConfigMap & restarted the Rook operator - no crashes & the operator was back.
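Roughly what that looked like (a sketch, again assuming the default rook-ceph namespace and operator deployment name):

# Add the missing keys back into the state ConfigMap
kubectl -n rook-ceph patch configmap rook-ceph-pdbstatemap --type merge \
  -p '{"data":{"draining-failure-domain":"","set-no-out":""}}'

# Restart the operator so it starts from a clean slate
kubectl -n rook-ceph rollout restart deployment/rook-ceph-operator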
I’m unsure why this ConfigMap was empty - possibly a bug in the operator caused it to be overwritten, or etcd lost the data somehow? One to monitor in the future.