On November 29th, 2024, Napta experienced an incident affecting its users. We would like to share a detailed timeline of events, the root cause, and the actions taken to prevent similar issues in the future.
9:00 AM - 9:20 AM CET
The application slowed down due to server allocation failures from our cloud provider, likely caused by Black Friday demand. A client reported being unable to connect to the platform; once servers were successfully acquired, the issue appeared resolved. It was later determined that this client’s connection issue was not directly linked to the server allocation failures (see below).
1:26 PM CET
Our monitoring dashboards detected sporadic 504 Gateway Timeout errors affecting a small percentage of requests. Initial investigations were launched.
1:38 PM CET
To address the issue, we performed an instance refresh, which completed by 1:51 PM CET. However, the errors persisted, and further investigations were conducted.
2:43 PM CET
All backend tasks began failing health checks and were restarted by our container orchestration system. This resulted in a brief period of downtime, during which the application was unavailable.
2:53 PM CET
New backend tasks were successfully deployed, and the application was restored. No additional 504 errors were observed. Monitoring and investigations continued.
3:55 PM CET
A few clients reported that Napta was stuck on an infinite loading screen. Upon investigation, we discovered that the backend-to-database connections serving these specific clients were stuck. At the same time, a separate asynchronous service handling a high number of tasks was opening an excessive number of database connections, which were also stuck. Terminating the asynchronous service’s database connections temporarily resolved the backend issue. To mitigate the situation, we halted the asynchronous service.
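The stuck connections were diagnosed and cleared at the database level. The statements below are a minimal sketch of that kind of mitigation on PostgreSQL, not the exact queries we ran; the application_name value and the idle threshold are hypothetical.

```sql
-- Inspect connections held by the asynchronous service that have been stuck
-- in an open transaction for a while ('async-worker' is a hypothetical name).
SELECT pid, state, state_change, query
FROM pg_stat_activity
WHERE application_name = 'async-worker'
  AND state = 'idle in transaction'
  AND state_change < now() - interval '5 minutes';

-- Terminate those connections so their pooled slots are released.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE application_name = 'async-worker'
  AND state = 'idle in transaction'
  AND state_change < now() - interval '5 minutes';
```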
4:36 PM CET
The root cause was identified and fixed. The issue stemmed from a regression in the asynchronous service that prevented database connections from closing properly, leaving some of them randomly stuck. Because database connections were shared between the backend and the asynchronous service through PgBouncer, the backend was affected by the asynchronous service’s issue. The fix was deployed successfully, restoring full functionality.
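Napta’s code is not reproduced here, but the regression falls into a well-known class: a task checks out a database connection, and there is at least one code path (typically an exception) on which the connection is never closed. Because the backend and the asynchronous service share the same PgBouncer pool, each leaked connection is one fewer connection available to the backend. The sketch below illustrates the leaky pattern and the fix in Python with psycopg2; the library choice, DSN, and function names are assumptions for illustration only.

```python
import psycopg2  # illustrative assumption: the async service talks to Postgres via psycopg2

# Hypothetical DSN pointing at the shared PgBouncer instance.
DSN = "postgresql://napta@pgbouncer:6432/napta"

def process_task_leaky(task):
    """Regression pattern: if any statement below raises, the connection is
    never closed. PgBouncer keeps the server-side connection assigned, and the
    shared pool gradually fills with stuck connections."""
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute("SELECT 1")  # placeholder for the task's real queries
    cur.fetchall()
    conn.commit()
    conn.close()  # never reached if an earlier statement raises

def process_task_fixed(task):
    """Fixed pattern: the connection is always closed, even on error, so the
    pooled slot is released back to PgBouncer."""
    conn = psycopg2.connect(DSN)
    try:
        with conn:                       # commits on success, rolls back on error
            with conn.cursor() as cur:
                cur.execute("SELECT 1")  # placeholder for the task's real queries
                cur.fetchall()
    finally:
        conn.close()
```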
To prevent similar incidents in the future, we have implemented the following improvements:
Enhanced Monitoring and Alerts:
Additional monitoring and alerting are being implemented for the asynchronous service to detect anomalies earlier.
PgBouncer Configuration Review:
We reviewed PgBouncer’s configuration to ensure that stuck connections can be cleared after a timeout.
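For reference, PgBouncer exposes several timeout settings that reclaim connections stuck in an idle or idle-in-transaction state. The snippet below shows the relevant settings with illustrative values (in seconds); it is not our production configuration.

```ini
; pgbouncer.ini -- illustrative timeout values, not our production settings
[pgbouncer]
; Cancel queries that run longer than this.
query_timeout = 120
; Disconnect clients that stay in "idle in transaction" longer than this.
idle_transaction_timeout = 300
; Close server connections that have been idle longer than this.
server_idle_timeout = 600
```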
We sincerely apologize for the inconvenience caused by this incident and thank our clients for their patience and understanding. Ensuring the stability and reliability of Napta is our top priority, and we are committed to learning from this incident to provide an even better experience moving forward.
If you have any further questions, please don’t hesitate to contact our support team.
The Napta Team