When a “simple” Redis disk resize in Kubernetes turns into a VPN incident: a post-mortem
TL;DR: the short version
On December 2, 2025, we received an alert that a critical system component (Redis) was running out of disk space. Fixing the capacity issue itself was straightforward. The real trouble started when we tried to “clean things up” afterwards and make the configuration match reality.
Because this component runs in Kubernetes and is set up in a way that doesn’t allow straightforward resource adjustment, we attempted a safe-looking workaround: recreate the controller object without deleting the actual running pods. Kubernetes interpreted that change differently than we expected and restarted the whole Redis cluster. The restart process wasn’t “cautious” enough: it moved too fast, and the cluster came back partially degraded. While we were repairing the degraded nodes, a backend service got overwhelmed by reconnect traffic and started failing with out-of-memory errors. After scaling and tuning, the service recovered. Total downtime was ~50 minutes.
What happened
We received an alert that several instances of the Redis cluster were running out of free disk space. Increasing the disk size, however, is not straightforward in our case. In Kubernetes, stateful workloads such as Redis are usually deployed via a StatefulSet, which is specifically designed for workloads that need stable identities and stable storage.
However, some parts of the StatefulSet spec are locked, and that can be a problem. StatefulSets define storage via volumeClaimTemplates, and two details matter here:
- the storage template is effectively locked once created
- you can’t simply edit it in place to increase disk size
So even if you know exactly what you want, Kubernetes won’t let you change that specific field in the StatefulSet spec.
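For illustration, here is roughly what that rejection looks like. The StatefulSet name and target size below are made up; only the behaviour matters:

```bash
# Attempt to bump the storage request inside volumeClaimTemplates of an
# existing StatefulSet (illustrative names and size):
kubectl patch statefulset redis-cluster --type='json' -p='[
  {"op": "replace",
   "path": "/spec/volumeClaimTemplates/0/spec/resources/requests/storage",
   "value": "100Gi"}
]'
# The API server rejects the patch: apart from a short whitelist of fields
# (replicas, template, updateStrategy, ...), the StatefulSet spec is immutable,
# and volumeClaimTemplates is not on that list.
```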
The immediate fix: increase capacity without restarts
Our storage backend supports online volume expansion. That means we can increase disk capacity underneath a running pod without restarting it.
So we resized the PersistentVolumeClaims (PVCs) directly. This resolved the disk pressure quickly and safely.
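A minimal sketch of what that looked like, with illustrative PVC names and target size. This only works when the StorageClass has allowVolumeExpansion: true and the CSI driver supports online expansion:

```bash
# Grow every data PVC of the Redis StatefulSet in place:
for i in 0 1 2 3 4 5; do
  kubectl patch pvc "data-redis-cluster-$i" \
    -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
done

# The volumes and filesystems are expanded under the running pods;
# no pod restarts are required.
kubectl get pvc -w
```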
But it created a new problem:
- the declarative configuration (Git + StatefulSet template) was still describing the old disk size
- the actual PVC/PV state in the cluster now reflected the new disk size
This mismatch is a classic “configuration drift.”
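Concretely (again with illustrative names and labels), the drift was easy to see by comparing the declared storage template with the live PVCs:

```bash
# The StatefulSet object still declares the old size in its storage template...
kubectl get statefulset redis-cluster \
  -o jsonpath='{.spec.volumeClaimTemplates[0].spec.resources.requests.storage}'

# ...while the live PVCs already report the new capacity:
kubectl get pvc -l app=redis-cluster \
  -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.storage
```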
The “cleanup” attempt: make the code match reality
Because the StatefulSet storage template is immutable, the only straightforward way to align the declared state with the real state is to:
- Delete the StatefulSet object
- Recreate it with the updated storage template
Of course, deleting the StatefulSet normally results in deleting child resources. So we used a Kubernetes trick:
Orphan deletion: “fire the manager, keep the factory running”
We deleted the StatefulSet using orphan / non-cascading deletion. The idea was:
- Kubernetes forgets the “manager” object…
- …but the pods (and their disks) stay alive
Then we recreated the StatefulSet so Git and the cluster would be consistent again.
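In kubectl terms the whole manoeuvre is just a couple of commands. The resource name, label and manifest path here are illustrative:

```bash
# Delete only the controller object; pods and PVCs are left running ("orphaned"):
kubectl delete statefulset redis-cluster --cascade=orphan

# The pods are still there, temporarily without an owner:
kubectl get pods -l app=redis-cluster

# Re-create the StatefulSet from the updated manifest in Git:
kubectl apply -f redis-cluster-statefulset.yaml
```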
This looked safe — but Kubernetes had a different interpretation.
The unexpected part: Kubernetes restarted the entire Redis cluster
After recreating the StatefulSet, Kubernetes reconciled the workload against the new spec. Even though the disks had already been resized, Kubernetes had no way of knowing that everything was already in the desired state, so it went through reconciliation anyway, and that reconciliation resulted in a full restart of the Redis cluster.
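How that showed up in practice (names and labels are illustrative): the freshly created controller immediately began cycling the pods, which was visible both in the rollout status and in the controller-revision-hash label that every StatefulSet pod carries:

```bash
# Watch the rolling restart triggered by the recreated StatefulSet:
kubectl rollout status statefulset/redis-cluster

# Each pod's controller-revision-hash label shows which revision it was created
# from; during the restart the pods were moved onto the newly created revision.
kubectl get pods -l app=redis-cluster -L controller-revision-hash -w
```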
A distributed database/cache cluster can survive a restart sequence, but only if the restart is performed cautiously.
That brings us to the biggest amplifier in this incident.
Why it escalated: restarts were too optimistic
Our restart behavior was not defensive enough:
- no startup or readiness probes
- a pod being merely “running” was treated as “good enough” (even though in reality it still couldn’t handle the actual request load)
- Kubernetes proceeded to the next pod too quickly
In plain language: we had a restart process that didn’t really wait for Redis to be healthy before moving on.
At first, this still didn’t look catastrophic: the backend service complained about Redis connections, but overall service graphs didn’t immediately collapse.
Degraded Redis cluster: replicas got “stuck” to old masters
When the dust settled, the Redis cluster came up partially degraded:
- some replica nodes couldn’t reattach properly
- they retained stale information about which master they belonged to
At that point, recovery required manual Redis Cluster operations (a command sketch follows the list):
- forgetting obsolete node identities on masters
- resetting affected replicas
- re-joining them to the cluster and waiting for the replication catch-up
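A rough sketch of that sequence with redis-cli. The hostnames, port and node IDs below are placeholders, not real values:

```bash
# 1. On every healthy master, drop the obsolete identity of the broken replica:
redis-cli -h redis-cluster-0 cluster forget <old-replica-node-id>

# 2. On the affected replica, wipe its stale view of the cluster:
redis-cli -h redis-cluster-4 cluster reset soft

# 3. Re-introduce the node and attach it to the correct master:
redis-cli -h redis-cluster-4 cluster meet <master-ip> 6379
redis-cli -h redis-cluster-4 cluster replicate <master-node-id>

# 4. Wait for the full resync to finish before touching the next replica:
redis-cli -h redis-cluster-4 info replication | grep master_link_status
```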
This procedure took longer than expected. And while it was still in progress, we hit the second-stage failure.
Secondary impact: backend service overload and out-of-memory failures
While Redis nodes were rejoining and syncing, the backend service faced reconnect storms and elevated load. This turned into cascading out-of-memory failures.
Two things made this worse:
- we initially misattributed which component was actually OOMing
- scaling the backend service was delayed because there was very little spare capacity left under our current resource constraints
Once the failed component was correctly identified, we stabilized the system (a command sketch follows the list) by:
- scaling the backend service
- increasing memory limits for the affected component
- …and temporarily relaxing readiness probes so that recovering instances could start serving traffic immediately. Without this, we got a cascade effect: as soon as a pod was marked “ready,” it was flooded with traffic, became overloaded, and flipped back to “not ready,” which pushed even more load onto the remaining pods, and the cycle repeated.
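A sketch of those mitigations in kubectl terms. The deployment name, replica count, memory values and probe settings are all illustrative, not the real numbers:

```bash
# Add replicas to absorb the reconnect storm:
kubectl scale deployment backend --replicas=12

# Raise the memory limit of the component that was actually OOMing:
kubectl set resources deployment backend \
  --limits=memory=2Gi --requests=memory=1Gi

# Temporarily loosen the readiness probe so recovering pods stop flapping
# between "ready" and "not ready" under the burst of traffic:
kubectl patch deployment backend --patch '
spec:
  template:
    spec:
      containers:
      - name: backend
        readinessProbe:
          periodSeconds: 15
          failureThreshold: 10
'
```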
After that, the service gradually recovered and client traffic returned to normal. Total downtime was approximately 50 minutes (8:10-9:00 UTC).
Events timeline
7:00 UTC: disk size was increased in Git
7:05 UTC: the disk size was manually updated for all PVCs in the cluster
7:10 UTC: all pods finished online disk resizing
8:10 UTC: StatefulSet orphan deletion and recreation were performed
8:10-8:15 UTC: Kubernetes quickly performed a full StatefulSet restart; this is when the problems started to appear
8:15-8:35 UTC: manual cluster restoration: replica recreation and reallocation
8:35-8:50 UTC: dealing with application OOM errors; once these were mitigated, the service was restored with some limitations (e.g. slow responses)
10:00 UTC: the temporary workarounds used for the emergency relaunch of the service were reverted; the service was fully restored and operational
Lessons learnt
1) It wasn’t the disk resize that caused the incident
The capacity fix was straightforward. The incident started with the state alignment step and the controller recreation.
2) Immutable fields push you toward risky workflows
When the system won’t let you edit a field, the workaround often involves replacement — and that replacement can trigger wide reconciliation behavior.
3) Startup probes and strict readiness are not just “nice to have”
They’re the brakes on a rolling restart. Without them, distributed systems restart in the most fragile way possible. This was the most painful part of the incident: with probes in place, Kubernetes would not have restarted the next pod until the current one was actually ready.
4) Recovery plans must assume slow cluster healing
Rejoins and replication catch-ups take time. Dependent services need to be designed to degrade gracefully during that window.
What are we going to do?
1) Implement missing startup probes
Simple startup probes like “do not mark the pod as ‘ready’ until the dataset is loaded” will prevent this situation from happening again.
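A minimal sketch of what that could look like, assuming a long initial dataset load. The StatefulSet and container names, thresholds and the exact checks are illustrative, not our final implementation:

```bash
# Add probes via a strategic-merge patch. The startup probe checks INFO
# persistence and fails while Redis reports loading:1, i.e. until the dataset
# is fully loaded into memory; the readiness probe then gates traffic and the
# rolling restart.
kubectl patch statefulset redis-cluster --patch '
spec:
  template:
    spec:
      containers:
      - name: redis
        startupProbe:              # tolerate a long initial dataset load
          exec:
            command: ["sh", "-c", "redis-cli info persistence | grep -q ^loading:0"]
          periodSeconds: 10
          failureThreshold: 60     # up to ~10 minutes before giving up
        readinessProbe:            # do not send traffic until Redis answers
          exec:
            command: ["sh", "-c", "redis-cli ping | grep -q PONG"]
          periodSeconds: 5
          failureThreshold: 3
'
```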
2) Make our own “control plane” for Redis
This is a more involved effort, but it helps a lot in situations where a restarted pod is “stuck” on an old master IP address.
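To make the idea concrete, here is a purely hypothetical sketch of what such a control-plane loop could do. None of this is our real implementation, and all pod names are made up:

```bash
# Hypothetical watchdog: find replicas whose replication link is down and
# re-point them at a live master. For illustration it simply picks the first
# non-failed master that the stuck node can still see.
for pod in redis-cluster-3 redis-cluster-4 redis-cluster-5; do
  status=$(redis-cli -h "$pod" info replication | tr -d '\r' \
           | awk -F: '/^master_link_status/ {print $2}')
  if [ "$status" = "down" ]; then
    master_id=$(redis-cli -h "$pod" cluster nodes \
                | awk '$3 ~ /master/ && $3 !~ /fail/ {print $1; exit}')
    echo "re-pointing $pod to master $master_id"
    redis-cli -h "$pod" cluster replicate "$master_id"
  fi
done
```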
3) Decrease dependency on this Redis cluster
Right now the Redis cluster looks like a single point of failure (SPOF) for the whole service, so we are planning to migrate some important data to other data sources, such as a separate Redis cluster for each server. We also plan to start using Kafka for some of the backend’s components.