Investigating Spark Operator issues
I’ve been setting up Spark to run on our Kubernetes cluster & hit an issue with the spark-operator, which we use to run Spark jobs on Kubernetes. I needed to mount a ConfigMap into the Spark containers when a job executes. This is supported in the SparkApplication YAML used to submit jobs, but when the job executed, the containers were not mounting the ConfigMap.
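For reference, the mount was declared roughly like this (the names and image here are illustrative, but the `volumes`/`volumeMounts` fields are the ones the spark-on-k8s-operator user guide documents for this):

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: example-job                 # illustrative name
spec:
  type: Scala
  mode: cluster
  image: example.org/spark:latest   # illustrative image
  mainApplicationFile: local:///opt/spark/jars/example.jar
  sparkVersion: "3.1.1"
  volumes:
    - name: app-config
      configMap:
        name: app-config            # the ConfigMap that should be mounted
  driver:
    volumeMounts:
      - name: app-config
        mountPath: /etc/app-config
  executor:
    volumeMounts:
      - name: app-config
        mountPath: /etc/app-config
```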
Searching around, I found this issue, which seemed to match the problem I was encountering & pointed to the operator requiring the webhook to be configured, since the webhook is what modifies the pod spec dynamically. Checking my config, the webhook was installed & available, so that was a bit of a dead end… However, reviewing the spark-operator logs, I did see something odd:
```
http: TLS handshake error from x.x.x.x:yyyyy: remote error: tls: bad certificate
```
Searching for this error brought up this issue, which indicated that restarting the operator would fix it. The underlying problem was supposedly fixed in 2020, but I’m using a later version of the operator & repeatedly restarting it didn’t seem a great option.
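(As a stopgap, a restart does get the certificates back in sync; assuming the operator runs as a Deployment named spark-operator in the spark-operator namespace, which may differ in your install, that’s just:)

```sh
# Assumes the usual chart defaults for the Deployment name & namespace.
kubectl -n spark-operator rollout restart deployment/spark-operator
```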
A temporary fix was suggested in another issue: adding a liveness check that kills the container when the certificates are out of sync. But that’s just patching around the problem, not ideal for a production system.
```yaml
# Fail the probe (and so kill the container) whenever the CA bundle registered
# in the webhook configuration no longer matches the one mounted into the pod.
livenessProbe:
  initialDelaySeconds: 1
  periodSeconds: 1
  failureThreshold: 1
  exec:
    command:
      - sh
      - -c
      - |
        set -e
        # Fetch the caBundle currently registered in the mutating webhook
        # configuration (the config name depends on the Helm release name).
        curl -iks -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
          https://kubernetes.default.svc/apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations/-webhook-config \
          | grep -o '"caBundle": "[^"]*"' \
          | awk -F'"' '{print $4}' \
          | base64 -d > /tmp/registered_ca_bundle.crt
        # Compare it to the CA certificate mounted into the operator pod.
        expected_ca_bundle=$(cat /etc/webhook-certs/ca-cert.pem)
        actual_ca_bundle=$(cat /tmp/registered_ca_bundle.crt)
        if [ "$expected_ca_bundle" != "$actual_ca_bundle" ]; then
          exit 1
        fi
```
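The same mismatch can be checked by hand from outside the cluster. A rough sketch (the webhook configuration name & namespace here are assumptions about your install; the cert path is the one from the probe above):

```sh
# Assumed names: adjust the webhook configuration name & namespace to your install.
# CA bundle registered in the mutating webhook configuration:
kubectl get mutatingwebhookconfiguration spark-operator-webhook-config \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d > /tmp/registered-ca.crt

# CA certificate currently mounted into the operator pod:
kubectl -n spark-operator exec deploy/spark-operator -- \
  cat /etc/webhook-certs/ca-cert.pem > /tmp/mounted-ca.crt

# Any difference means the API server no longer trusts the webhook server.
diff /tmp/registered-ca.crt /tmp/mounted-ca.crt
```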
But why was the webhook failing at all? Why did it only happen to some people, & why were only (some) helm chart users affected?
Excellent detective work by jgeores finally made the problem make sense to me: the webhook’s certificates are created by an init container when the spark-operator is installed by the helm chart. When a helm upgrade occurs, that init container can be re-run, recreating the certificates, while the operator pod itself isn’t restarted, leaving it holding the old certificates. The API server then fails the TLS handshake when calling the webhook, which is exactly the error in the logs, & the pod spec never gets mutated.
But I hadn’t upgraded the helm chart since installing it… except I had installed it using Fleet CI, which uses helm upgrade to keep the operator in sync with its deployment config. Doing things in a reproducible, infrastructure-as-code way had been my downfall!
Removing spark-operator from Fleet isn’t great, since we use Fleet to let CI/CD keep all our Kubernetes applications deployed & up to date, but it’s the quick ‘fix’ for now. Now, do I go down the rabbit hole of fixing the helm chart…
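If I do, the usual Helm pattern for “restart the pod when X changes” is a checksum annotation on the pod template, so that any change to the rendered certificate manifest changes the Deployment & forces a rollout. A sketch of that pattern only (the template name is hypothetical, & whether it maps cleanly onto a chart that generates certs in an init container rather than a template is exactly the rabbit hole):

```yaml
# deployment.yaml (chart template), sketch of the generic Helm trick:
spec:
  template:
    metadata:
      annotations:
        # Re-hash the rendered cert manifest on every upgrade; a new hash
        # changes the pod template, which forces a rolling restart.
        checksum/webhook-certs: {{ include (print $.Template.BasePath "/webhook-secret.yaml") . | sha256sum }}
```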