On November 29th, 2024, Napta experienced an incident affecting its users. We would like to share a detailed timeline of events, the root cause, and the actions taken to prevent similar issues in the future.
9:00 AM - 9:20 AM CET
The application slowed down due to server allocation failures from our cloud provider, likely caused by Black Friday demand. A client reported being unable to connect to the platform; once servers were successfully acquired, the issue appeared resolved. It was later determined that this client’s connection issue was not directly linked to the server allocation failures (see below).
1:26 PM CET
Our monitoring dashboards detected sporadic 504 Gateway Timeout errors affecting a small percentage of requests. Initial investigations were launched.
1:38 PM CET
To address the issue, we performed an instance refresh, which completed by 1:51 PM CET. However, the errors persisted, and further investigations were conducted.
2:43 PM CET
All backend tasks began failing health checks and were restarted by our container orchestration system. This resulted in a brief period of downtime, during which the application was unavailable.
2:53 PM CET
New backend tasks were successfully deployed, and the application was restored. No additional 504 errors were observed. Monitoring and investigations continued.
3:55 PM CET
A few clients reported that Napta was stuck on an infinite loading screen. Upon investigation, we discovered that the backend-to-database connections serving these specific clients were stuck. At the same time, a separate asynchronous service handling a high number of tasks was opening an excessive number of database connections, which were also stuck. Terminating the asynchronous service’s database connections temporarily resolved the backend issue. To mitigate the situation, we halted the asynchronous service.
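The stuck connections were diagnosed and cleared at the database level. The statements below are a minimal sketch of that kind of mitigation on PostgreSQL, not the exact queries we ran; the application_name value and the idle threshold are hypothetical.

```sql
-- Inspect connections held by the asynchronous service that have been stuck
-- in an open transaction for a while ('async-worker' is a hypothetical name).
SELECT pid, state, state_change, query
FROM pg_stat_activity
WHERE application_name = 'async-worker'
  AND state = 'idle in transaction'
  AND state_change < now() - interval '5 minutes';

-- Terminate those connections so their pooled slots are released.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE application_name = 'async-worker'
  AND state = 'idle in transaction'
  AND state_change < now() - interval '5 minutes';
```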
4:36 PM CET
The root cause was identified and fixed. The issue stemmed from a regression in the asynchronous service that prevented database connections from closing properly, leaving some of them randomly stuck. Because database connections were shared between the backend and the asynchronous service through PgBouncer, the backend was affected by the asynchronous service’s issue. The fix was deployed successfully, restoring full functionality.
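Napta’s code is not reproduced here, but the regression falls into a well-known class: a task checks out a database connection, and there is at least one code path (typically an exception) on which the connection is never closed. Because the backend and the asynchronous service share the same PgBouncer pool, each leaked connection is one fewer connection available to the backend. The sketch below illustrates the leaky pattern and the fix in Python with psycopg2; the library choice, DSN, and function names are assumptions for illustration only.

```python
import psycopg2  # illustrative assumption: the async service talks to Postgres via psycopg2

# Hypothetical DSN pointing at the shared PgBouncer instance.
DSN = "postgresql://napta@pgbouncer:6432/napta"

def process_task_leaky(task):
    """Regression pattern: if any statement below raises, the connection is
    never closed. PgBouncer keeps the server-side connection assigned, and the
    shared pool gradually fills with stuck connections."""
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute("SELECT 1")  # placeholder for the task's real queries
    cur.fetchall()
    conn.commit()
    conn.close()  # never reached if an earlier statement raises

def process_task_fixed(task):
    """Fixed pattern: the connection is always closed, even on error, so the
    pooled slot is released back to PgBouncer."""
    conn = psycopg2.connect(DSN)
    try:
        with conn:                       # commits on success, rolls back on error
            with conn.cursor() as cur:
                cur.execute("SELECT 1")  # placeholder for the task's real queries
                cur.fetchall()
    finally:
        conn.close()
```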
To prevent similar incidents in the future, we have implemented the following improvements:
Enhanced Monitoring and Alerts:
Additional monitoring and alerting are being implemented for the asynchronous service to detect anomalies earlier.
PgBouncer Configuration Review:
We reviewed PgBouncer’s configuration to ensure that stuck connections can be cleared after a timeout.
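For reference, PgBouncer exposes several timeout settings that reclaim connections stuck in an idle or idle-in-transaction state. The snippet below shows the relevant settings with illustrative values (in seconds); it is not our production configuration.

```ini
; pgbouncer.ini -- illustrative timeout values, not our production settings
[pgbouncer]
; Cancel queries that run longer than this.
query_timeout = 120
; Disconnect clients that stay in "idle in transaction" longer than this.
idle_transaction_timeout = 300
; Close server connections that have been idle longer than this.
server_idle_timeout = 600
```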
We sincerely apologize for the inconvenience caused by this incident and thank our clients for their patience and understanding. Ensuring the stability and reliability of Napta is our top priority, and we are committed to learning from this incident to provide an even better experience moving forward.
If you have any further questions, please don’t hesitate to contact our support team.
The Napta Team