When a “simple” Redis disk resize in Kubernetes turns into a VPN incident: a post-mortem
TL;DR: the short version
On December 2, 2025, we received an alert that a critical system component (Redis) was running out of disk space. Fixing the capacity issue itself was straightforward. The real trouble started when we tried to “clean things up” afterwards and make the configuration match reality.
Because this component runs in Kubernetes and is set up in a way that doesn’t allow straightforward resource adjustment, we attempted a safe-looking workaround: recreate the controller object without deleting the actual running pods. Kubernetes interpreted that change differently than we expected and restarted the whole Redis cluster. The restart process wasn’t “cautious” enough: it moved too fast, and the cluster came back partially degraded. While we were repairing the degraded nodes, a backend service got overwhelmed by reconnect traffic and started failing with out-of-memory errors. After scaling and tuning, the service recovered. Total downtime was ~50 minutes.
What happened
We received an alert that several instances of the Redis cluster were running out of free disk space. Increasing the disk size, however, is not straightforward in our case. In Kubernetes, stateful workloads such as Redis are usually deployed via a StatefulSet, which is specifically designed for workloads that need stable identities and stable storage.
However, some parts of the StatefulSet spec are locked, and that can be a problem. StatefulSets define storage via volumeClaimTemplates, and two details matter here:
- the storage template is effectively locked once created
- you can’t simply edit it in place to increase disk size
So even if you know exactly what you want, Kubernetes won’t let you change that specific field in the StatefulSet spec.
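For illustration, here is roughly what that rejection looks like. The StatefulSet name and target size below are made up; only the behaviour matters:

```bash
# Attempt to bump the storage request inside volumeClaimTemplates of an
# existing StatefulSet (illustrative names and size):
kubectl patch statefulset redis-cluster --type='json' -p='[
  {"op": "replace",
   "path": "/spec/volumeClaimTemplates/0/spec/resources/requests/storage",
   "value": "100Gi"}
]'
# The API server rejects the patch: apart from a short whitelist of fields
# (replicas, template, updateStrategy, ...), the StatefulSet spec is immutable,
# and volumeClaimTemplates is not on that list.
```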
The immediate fix: increase capacity without restarts
Our storage backend supports online volume expansion. That means we can increase disk capacity underneath a running pod without restarting it.
So we resized the PersistentVolumeClaims (PVCs) directly. This resolved the disk pressure quickly and safely.
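A minimal sketch of what that looked like, with illustrative PVC names and target size. This only works when the StorageClass has allowVolumeExpansion: true and the CSI driver supports online expansion:

```bash
# Grow every data PVC of the Redis StatefulSet in place:
for i in 0 1 2 3 4 5; do
  kubectl patch pvc "data-redis-cluster-$i" \
    -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
done

# The volumes and filesystems are expanded under the running pods;
# no pod restarts are required.
kubectl get pvc -w
```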
But it created a new problem:
- the declarative configuration (Git + StatefulSet template) was still describing the old disk size
- the actual PVC/PV state in the cluster now reflected the new disk size
This mismatch is a classic “configuration drift.”
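Concretely (again with illustrative names and labels), the drift was easy to see by comparing the declared storage template with the live PVCs:

```bash
# The StatefulSet object still declares the old size in its storage template...
kubectl get statefulset redis-cluster \
  -o jsonpath='{.spec.volumeClaimTemplates[0].spec.resources.requests.storage}'

# ...while the live PVCs already report the new capacity:
kubectl get pvc -l app=redis-cluster \
  -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.storage
```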
The “cleanup” attempt: make the code match reality
Because the StatefulSet storage template is immutable, the only straightforward way to align the declared state with the real state is to:
- Delete the StatefulSet object
- Recreate it with the updated storage template
Of course, deleting the StatefulSet normally results in deleting child resources. So we used a Kubernetes trick:
Orphan deletion: “fire the manager, keep the factory running”
We deleted the StatefulSet using orphan / non-cascading deletion. The idea was:
- Kubernetes forgets the “manager” object…
- …but the pods (and their disks) stay alive
Then we recreated the StatefulSet so Git and the cluster would be consistent again.
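In kubectl terms the whole manoeuvre is just a couple of commands. The resource name, label and manifest path here are illustrative:

```bash
# Delete only the controller object; pods and PVCs are left running ("orphaned"):
kubectl delete statefulset redis-cluster --cascade=orphan

# The pods are still there, temporarily without an owner:
kubectl get pods -l app=redis-cluster

# Re-create the StatefulSet from the updated manifest in Git:
kubectl apply -f redis-cluster-statefulset.yaml
```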
This looked safe — but Kubernetes had a different interpretation.
The unexpected part: Kubernetes restarted the entire Redis cluster
After recreating the StatefulSet, Kubernetes reconciled the workload against the new spec. Even though the disks had already been resized, Kubernetes had no way of knowing that everything was already in the desired state, so it went through reconciliation anyway, and that reconciliation resulted in a full restart of the Redis cluster.
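How that showed up in practice (names and labels are illustrative): the freshly created controller immediately began cycling the pods, which was visible both in the rollout status and in the controller-revision-hash label that every StatefulSet pod carries:

```bash
# Watch the rolling restart triggered by the recreated StatefulSet:
kubectl rollout status statefulset/redis-cluster

# Each pod's controller-revision-hash label shows which revision it was created
# from; during the restart the pods were moved onto the newly created revision.
kubectl get pods -l app=redis-cluster -L controller-revision-hash -w
```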
A distributed database/cache cluster can survive a restart sequence, but only if the restart is performed cautiously.
That brings us to the biggest amplifier in this incident.
Why it escalated: restarts were too optimistic
Our restart behavior was not defensive enough:
- no startup or readiness probes
- a pod being merely “running” was treated as “good enough” (even though in reality it still couldn’t handle the actual request load)
- Kubernetes proceeded to the next pod too quickly
In plain language: we had a restart process that didn’t really wait for Redis to be healthy before moving on.
At first, this still didn’t look catastrophic: the backend service complained about Redis connections, but overall service graphs didn’t immediately collapse.
Degraded Redis cluster: replicas got “stuck” to old masters
When the dust settled, the Redis cluster came up partially degraded:
- some replica nodes couldn’t reattach properly
- they retained stale information about which master they belonged to
At that point, recovery required manual Redis Cluster operations (a command sketch follows the list):
- forgetting obsolete node identities on masters
- resetting affected replicas
- re-joining them to the cluster and waiting for the replication catch-up
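A rough sketch of that sequence with redis-cli. The hostnames, port and node IDs below are placeholders, not real values:

```bash
# 1. On every healthy master, drop the obsolete identity of the broken replica:
redis-cli -h redis-cluster-0 cluster forget <old-replica-node-id>

# 2. On the affected replica, wipe its stale view of the cluster:
redis-cli -h redis-cluster-4 cluster reset soft

# 3. Re-introduce the node and attach it to the correct master:
redis-cli -h redis-cluster-4 cluster meet <master-ip> 6379
redis-cli -h redis-cluster-4 cluster replicate <master-node-id>

# 4. Wait for the full resync to finish before touching the next replica:
redis-cli -h redis-cluster-4 info replication | grep master_link_status
```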
This procedure took longer than expected. And while it was still in progress, we hit the second-stage failure.
Secondary impact: backend service overload and out-of-memory failures
While Redis nodes were rejoining and syncing, the backend service faced reconnect storms and elevated load. This turned into cascading out-of-memory failures.
Two things made this worse:
- we initially misattributed which component was actually OOMing
- scaling the backend service was delayed because there was very little spare capacity left under our current resource constraints
Once the failed component was correctly identified, we stabilized the system (a command sketch follows the list) by:
- scaling the backend service
- increasing memory limits for the affected component
- …and temporarily relaxing readiness probes so that recovering instances could start serving traffic immediately. Without this, we got a cascade effect: as soon as a pod was marked “ready,” it was flooded with traffic, became overloaded, and flipped back to “not ready,” which pushed even more load onto the remaining pods, and the cycle repeated.
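A sketch of those mitigations in kubectl terms. The deployment name, replica count, memory values and probe settings are all illustrative, not the real numbers:

```bash
# Add replicas to absorb the reconnect storm:
kubectl scale deployment backend --replicas=12

# Raise the memory limit of the component that was actually OOMing:
kubectl set resources deployment backend \
  --limits=memory=2Gi --requests=memory=1Gi

# Temporarily loosen the readiness probe so recovering pods stop flapping
# between "ready" and "not ready" under the burst of traffic:
kubectl patch deployment backend --patch '
spec:
  template:
    spec:
      containers:
      - name: backend
        readinessProbe:
          periodSeconds: 15
          failureThreshold: 10
'
```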
After that, the service gradually recovered and client traffic returned to normal. Total downtime was approximately 50 minutes (8:10-9:00 UTC).
Events timeline
7:00 UTC: disk size was increased in Git
7:05 UTC: the disk size was manually updated for all PVCs in the cluster
7:10 UTC: all pods finished online disk resizing
8:10 UTC: StatefulSet orphan deletion and recreation were performed
8:10-8:15 UTC: Kubernetes quickly performed a full StatefulSet restart; this is when the problems started to appear
8:15-8:35 UTC: manual cluster restoration: replica recreation and reallocation
8:35-8:50 UTC: dealing with application OOM errors; once these were mitigated, the service was restored with some limitations (e.g. slow responses)
10:00 UTC: the temporary workarounds used for the emergency relaunch of the service were reverted; the service was fully restored and operational
Lessons learnt
1) It wasn’t the disk resize that caused the incident
The capacity fix was straightforward. The incident started with the state alignment step and the controller recreation.
2) Immutable fields push you toward risky workflows
When the system won’t let you edit a field, the workaround often involves replacement — and that replacement can trigger wide reconciliation behavior.
3) Startup probes and strict readiness are not just “nice to have”
They’re the brakes on a rolling restart. Without them, distributed systems restart in the most fragile way possible. This was the most painful part of the incident: with probes in place, Kubernetes would not have restarted the next pod until the current one was actually ready.
4) Recovery plans must assume slow cluster healing
Rejoins and replication catch-ups take time. Dependent services need to be designed to degrade gracefully during that window.
What are we going to do?
1) Implement missing startup probes
Simple startup probes like “do not mark the pod as ‘ready’ until the dataset is loaded” will prevent this situation from happening again.
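A minimal sketch of what that could look like, assuming a long initial dataset load. The StatefulSet and container names, thresholds and the exact checks are illustrative, not our final implementation:

```bash
# Add probes via a strategic-merge patch. The startup probe checks INFO
# persistence and fails while Redis reports loading:1, i.e. until the dataset
# is fully loaded into memory; the readiness probe then gates traffic and the
# rolling restart.
kubectl patch statefulset redis-cluster --patch '
spec:
  template:
    spec:
      containers:
      - name: redis
        startupProbe:              # tolerate a long initial dataset load
          exec:
            command: ["sh", "-c", "redis-cli info persistence | grep -q ^loading:0"]
          periodSeconds: 10
          failureThreshold: 60     # up to ~10 minutes before giving up
        readinessProbe:            # do not send traffic until Redis answers
          exec:
            command: ["sh", "-c", "redis-cli ping | grep -q PONG"]
          periodSeconds: 5
          failureThreshold: 3
'
```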
2) Make our own “control plane” for Redis
This is a more involved effort, but it helps a lot in situations where a restarted pod is “stuck” on an old master IP address.
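To make the idea concrete, here is a purely hypothetical sketch of what such a control-plane loop could do. None of this is our real implementation, and all pod names are made up:

```bash
# Hypothetical watchdog: find replicas whose replication link is down and
# re-point them at a live master. For illustration it simply picks the first
# non-failed master that the stuck node can still see.
for pod in redis-cluster-3 redis-cluster-4 redis-cluster-5; do
  status=$(redis-cli -h "$pod" info replication | tr -d '\r' \
           | awk -F: '/^master_link_status/ {print $2}')
  if [ "$status" = "down" ]; then
    master_id=$(redis-cli -h "$pod" cluster nodes \
                | awk '$3 ~ /master/ && $3 !~ /fail/ {print $1; exit}')
    echo "re-pointing $pod to master $master_id"
    redis-cli -h "$pod" cluster replicate "$master_id"
  fi
done
```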
3) Decrease dependency on this Redis cluster
Right now the Redis cluster looks like a single point of failure (SPOF) for the whole service, so we are planning to migrate some important data to other data sources, such as a separate Redis cluster for each server. We also plan to start using Kafka for some of the backend’s components.