AdGuard VPN May 8th incident report: Post Mortem
In recent days, a significant portion of AdGuard VPN users have encountered problems with their apps. The problems affected about 25% of AdGuard VPN for Windows users and about 10% of AdGuard VPN users on other platforms. As of this publication, all issues have been resolved. Make sure your AdGuard VPN for Windows app is updated to the latest version. If you use a different AdGuard VPN app or an AdGuard VPN browser extension, you don't need to do anything: it should be working normally by now.
We apologize for the inconvenience caused and hope that this incident will not affect your overall satisfaction with AdGuard VPN. This was a one-time occurrence that will not happen again.
Below is a timeline of the events that led to the incident, followed by the measures we are taking to prevent anything like this in the future.
Events timeline
May 8
It all started with the release of AdGuard VPN v2.3.0 for Windows on May 8th. The mandatory testing before the release went smoothly, but we overlooked one hard-to-notice problem: under certain circumstances the app began endlessly trying to reach the backend server to check the VPN servers' connectivity.
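For illustration only: the report does not show the app's code, so the following is a minimal sketch of how a connectivity check can be bounded so that it cannot retry forever. The function name `check_connectivity`, the `probe` callback, and all default values are assumptions made for this example, not AdGuard's implementation. It caps the number of attempts and waits a randomized, exponentially growing delay between them.

```python
import random
import time
from typing import Callable, Optional


def check_connectivity(
    probe: Callable[[], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> bool:
    """Call probe() until it succeeds or max_attempts is used up.

    All names and defaults are hypothetical. The key property is that
    the loop always terminates, so a failing backend is never hit
    endlessly by a single client.
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        if probe():
            return True
        if attempt == max_attempts - 1:
            break  # no point in sleeping after the final attempt
        # Exponential backoff with full jitter: the upper bound doubles
        # on every failure, and the actual delay is random so that many
        # clients do not retry in lockstep.
        cap = min(max_delay, base_delay * 2 ** attempt)
        sleep(rng.uniform(0, cap))
    return False
```

With a probe that always fails and `max_attempts=4`, the function makes exactly four calls and then gives up, rather than looping indefinitely.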
May 9
As a result of this mistake, the load on the backend server rose significantly. Although we could see that something was off, we couldn't reproduce the problem and concluded that the high load was a consequence of some of the changes we had made to how the app operates.
May 10
We started receiving the first complaints about constant connection drops and "Can't connect to server" errors. According to user reports, restarting the app helped in the short term. Around this time, an issue describing the problem was also created on GitHub.
May 13
The number of user complaints had reached a level high enough to show that the situation was critical. We still couldn't reproduce the problem on our end, which greatly complicated and delayed the search for its root cause. We had several hypotheses, but to test them we needed the help of affected users. Huge thanks to everyone who sacrificed their time to help us. Despite all these efforts, we weren't able to find a solution that day.
May 14
According to statistical data and our monitoring system, by 09:00 UTC the number of requests per second to the backend server had increased drastically compared to the numbers registered on May 9-10.
Between 09:00 UTC and 16:00 UTC we took measures to limit the spread of the "bad" update among users:
- Blocked the option to update to v2.3.0 from the app
- Removed the option to download the AdGuard VPN for Windows app from the official website
- Recommended that everyone who contacted tech support roll back to the previous version until the problem was fixed
By 15:00 UTC the number of requests to the backend had reached almost 1000 times the norm. To mitigate the negative impact on AdGuard's infrastructure, at around 13:30 UTC we decided to greatly reduce the server rate limit. Unfortunately, a side effect of this measure was an increase in the number of problems some users encountered when accessing the authentication server. This affected not only Windows users, but also users of AdGuard VPN for other platforms.
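To show why lowering a shared rate limit also hurts well-behaved clients, here is a minimal sketch of a token bucket, one common way to implement rate limiting. The report does not say how the server's limit is actually implemented, so the `TokenBucket` class and every number below are assumptions for illustration only. The bucket cannot tell a misbehaving client from a legitimate one: once the tokens run out, every further request is rejected.

```python
from typing import Iterable


class TokenBucket:
    """Hypothetical token-bucket limiter (not AdGuard's implementation).

    Tokens refill at `rate` per second up to `capacity`; each allowed
    request consumes one token.
    """

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill according to the time elapsed since the last request.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def count_allowed(bucket: TokenBucket, arrival_times: Iterable[float]) -> int:
    """Return how many of the requests arriving at the given times pass."""
    return sum(1 for t in arrival_times if bucket.allow(t))


# A burst of 100 requests arriving at the same instant.
burst = [0.0] * 100
normal_limit = count_allowed(TokenBucket(rate=10, capacity=10), burst)
reduced_limit = count_allowed(TokenBucket(rate=1, capacity=2), burst)
```

In this toy setup the regular limit lets 10 of the 100 requests through and the reduced limit only 2. The 98 rejected requests include any that came from clients doing nothing wrong.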
By 18:00 UTC we had identified the cause of the problem.
By 22:00 UTC the rate limit was restored to its regular value for all apps except AdGuard VPN for Windows, which fixed the problem for the majority of affected users. By 23:30 UTC the v2.3.1 hotfix was ready.
May 15
At around 00:45 UTC the v2.3.1 update was released and became available to users.
Conclusions and prevention measures
We are currently making the necessary changes to our quality assurance procedures for product updates, with an emphasis on monitoring the apps' network activity (and their interaction with the authentication server in particular). We are also going to optimize the entire internal infrastructure monitoring system.
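As a rough illustration of what a network activity check might look like, the sketch below flags every one-second window in which the number of requests exceeds a multiple of an expected baseline. The `find_spikes` function, its parameters, and the sample data are hypothetical and are not taken from AdGuard's tooling.

```python
from collections import Counter
from typing import Iterable, List


def find_spikes(
    timestamps: Iterable[float],
    baseline_rps: float,
    factor: float = 10.0,
) -> List[int]:
    """Return the seconds in which requests exceeded baseline_rps * factor.

    `timestamps` are request times in seconds (assumed non-negative);
    each one is assigned to the whole second it falls in.
    """
    per_second = Counter(int(t) for t in timestamps)
    threshold = baseline_rps * factor
    return sorted(sec for sec, count in per_second.items() if count > threshold)


# Two requests in second 0, fifty in second 1, one in second 2.
sample = [0.1, 0.5] + [1.0 + i / 1000 for i in range(50)] + [2.2]
spikes = find_spikes(sample, baseline_rps=2, factor=10.0)
```

Here only second 1 is reported, because its 50 requests exceed the threshold of 20. A check of this kind, run against a test build before release, is one way an endless request loop could be noticed early.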
Apart from that, moving forward we will be quicker and more transparent in informing our users about any critical issues that could lead to major negative consequences.
We would like to once again apologize to all our users who were affected by this incident and express our gratitude to everyone who provided feedback and helped us resolve this difficult situation.