Degraded performance

Incident Report for Napta

Postmortem

On December 14th, 2023, Napta experienced degraded performance on its platform for approximately 27 minutes, impacting customers between 7:59 AM and 8:26 AM UTC. Below is a detailed summary of the incident, its root cause, and the actions we have taken to address it.

Incident Summary

Between 7:59 AM and 8:26 AM UTC, Napta users experienced degraded performance, primarily slowdowns caused by high demand during the morning peak. The root cause was traced to a failure in our automated cluster configuration change jobs, which were blocked by an issue on GitLab’s infrastructure. This prevented critical adjustments to our production cluster that are normally applied automatically.

The incident was detected at 8:07 AM UTC, when our monitoring system raised an alarm because critical response-time thresholds had been exceeded. Manual intervention was required to resolve the issue, and service was fully restored by 8:26 AM UTC.

Timeline

December 13th, 2023

  • 9:52 PM UTC: GitLab identified an issue on their platform that prevented GitLab jobs from running correctly.

December 14th, 2023

  • 6:02 AM UTC: Our automated cluster configuration change job failed due to the ongoing GitLab issue.
  • 6:15 AM UTC: A secondary backup job, designed to run if the first job fails, was blocked by the same GitLab infrastructure issue.
  • 7:59 AM UTC: Platform slowdowns began as traffic increased during the morning peak and the necessary cluster configuration updates were still missing.
  • 8:07 AM UTC: Our monitoring system raised an alert because critical response-time thresholds were exceeded. Investigations began immediately.
  • 8:24 AM UTC: Manual intervention restored the correct cluster configuration.
  • 8:26 AM UTC: All platform performance issues were resolved.

Actions to Prevent Future Incidents

Add Job Redundancy

We have implemented additional redundancy in our GitLab jobs to ensure that critical cluster configuration updates can be executed through alternative paths if primary jobs fail.
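As an illustration only, and not a description of our exact implementation, such a fallback path could be a small script that checks whether the latest configuration pipeline succeeded and, if it did not, applies the configuration directly. The sketch below assumes a Kubernetes-based cluster and uses GitLab's pipelines API; the project ID, token variable, and manifest path are hypothetical placeholders.

    import os
    import subprocess
    import requests

    GITLAB_API = "https://gitlab.com/api/v4"
    PROJECT_ID = os.environ["CONFIG_PROJECT_ID"]  # hypothetical project holding the cluster config
    TOKEN = os.environ["GITLAB_TOKEN"]

    # Look up the most recent pipeline on the configuration branch.
    resp = requests.get(
        f"{GITLAB_API}/projects/{PROJECT_ID}/pipelines",
        headers={"PRIVATE-TOKEN": TOKEN},
        params={"ref": "main", "order_by": "updated_at", "sort": "desc", "per_page": 1},
        timeout=10,
    )
    resp.raise_for_status()
    pipelines = resp.json()

    if not pipelines or pipelines[0]["status"] != "success":
        # Primary path did not run or did not succeed: apply the cluster
        # configuration directly from a local checkout instead of waiting on GitLab.
        subprocess.run(
            ["kubectl", "apply", "-f", "cluster-config/"],  # hypothetical manifest directory
            check=True,
        )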

Enhance Alerting

We have introduced new alarms to detect failed configuration jobs and notify the team earlier, enabling faster response times and minimizing potential impact.
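For example, an alarm of this kind could be a scheduled check that looks for recently failed scheduled pipelines and pages the on-call channel. This is a minimal sketch, not our production alerting; the project ID and webhook URL are hypothetical placeholders, and the "source" filter assumes a reasonably recent GitLab version.

    import os
    import requests

    GITLAB_API = "https://gitlab.com/api/v4"
    PROJECT_ID = os.environ["CONFIG_PROJECT_ID"]   # hypothetical configuration project
    TOKEN = os.environ["GITLAB_TOKEN"]
    WEBHOOK_URL = os.environ["ALERT_WEBHOOK_URL"]  # hypothetical on-call webhook

    # Find scheduled pipelines that have recently failed.
    resp = requests.get(
        f"{GITLAB_API}/projects/{PROJECT_ID}/pipelines",
        headers={"PRIVATE-TOKEN": TOKEN},
        params={"status": "failed", "source": "schedule", "per_page": 5},
        timeout=10,
    )
    resp.raise_for_status()

    # Notify the on-call team for each failed configuration run.
    for pipeline in resp.json():
        requests.post(
            WEBHOOK_URL,
            json={
                "text": f"Scheduled configuration pipeline {pipeline['id']} failed: "
                        f"{pipeline['web_url']}"
            },
            timeout=10,
        )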

Implement Autoscaling Improvements

We will optimize our autoscaling configurations to better handle traffic surges during high-demand periods, reducing the risk of degraded performance even in cases of delayed configuration updates.
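As a sketch of what such a change might look like, assuming a Kubernetes-based deployment and the official Python client, the snippet below raises the replica bounds of a horizontal pod autoscaler; the deployment name, namespace, and replica counts are illustrative, not our actual values.

    from kubernetes import client, config

    config.load_kube_config()
    autoscaling = client.AutoscalingV2Api()

    # Raise the floor and ceiling of the horizontal pod autoscaler so the
    # platform keeps a larger warm pool ahead of the morning traffic peak.
    hpa = autoscaling.read_namespaced_horizontal_pod_autoscaler("api", "production")  # hypothetical names
    hpa.spec.min_replicas = 6
    hpa.spec.max_replicas = 20
    autoscaling.replace_namespaced_horizontal_pod_autoscaler("api", "production", hpa)

Keeping a higher minimum replica count ahead of the peak reduces the reliance on reactive scaling, and on timely configuration jobs, during the first minutes of high traffic.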

Closing Remarks

We sincerely apologize for the inconvenience this incident caused to our customers. At Napta, ensuring a seamless experience for our users is our top priority, and we are committed to learning from this incident to improve the reliability of our platform.

If you have any further questions or concerns, please do not hesitate to contact our support team.

The Napta Team

Posted Dec 19, 2023 - 14:27 UTC

Resolved

Our monitoring shows that users are no longer experiencing issues.
We will mark this incident as resolved. Thank you for your patience.
Posted Dec 14, 2023 - 08:34 UTC

Monitoring

The issue has been mitigated and we are now monitoring the situation.
Posted Dec 14, 2023 - 08:33 UTC

Identified

We've identified the cause of the problem and are working on it.
Posted Dec 14, 2023 - 08:29 UTC

Investigating

We are currently experiencing degraded performance on app.napta.io; our team is looking into this.
Posted Dec 14, 2023 - 08:07 UTC
This incident affected: Application.