On December 14th, 2023, Napta experienced degraded performance on its platform for approximately 27 minutes, impacting customers between 7:59 AM and 8:26 AM UTC. Below is a detailed summary of the incident, its root cause, and the actions we have taken to address it.
Between 7:59 AM and 8:26 AM UTC, Napta users experienced degraded performance, primarily slow response times under the high demand of the morning peak. The root cause was traced to our automated cluster configuration change jobs, which were blocked by an issue on GitLab’s infrastructure. As a result, critical adjustments to our production cluster, which are normally applied automatically, could not be executed.
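To illustrate the kind of change involved, here is a minimal sketch of such a configuration job. It assumes a Kubernetes production cluster managed with kubectl; the deployment name, namespace, and replica count are purely illustrative and not our actual settings.

```python
# Illustrative sketch only: assumes the production cluster is Kubernetes and
# that the job adjusts capacity with kubectl. The deployment name, namespace,
# and replica count are hypothetical.
import subprocess
import sys

DEPLOYMENT = "api"          # hypothetical deployment name
NAMESPACE = "production"    # hypothetical namespace
PEAK_REPLICAS = 12          # hypothetical capacity for the morning peak


def scale_for_morning_peak() -> None:
    """Apply the capacity adjustment normally run automatically before peak hours."""
    result = subprocess.run(
        [
            "kubectl", "scale", f"deployment/{DEPLOYMENT}",
            f"--replicas={PEAK_REPLICAS}", "-n", NAMESPACE,
        ],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # If the job cannot run (e.g. the CI pipeline is blocked), the cluster
        # keeps its off-peak capacity, which is what led to the slowdown.
        print(result.stderr, file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    scale_for_morning_peak()
```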
The incident was detected at 8:07 AM UTC, when our monitoring system raised an alarm because response times had exceeded critical thresholds. Manual intervention was required to resolve the issue, and service was fully restored by 8:26 AM UTC.
December 13th, 2023
December 14th, 2023
Add Job Redundancy
We have implemented additional redundancy in our GitLab jobs to ensure that critical cluster configuration updates can be executed through alternative paths if primary jobs fail.
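The sketch below illustrates the general approach rather than our exact implementation: it attempts the primary GitLab-triggered path first and, if the trigger cannot be reached or fails, applies the same change directly from a standby runner. The project ID, token variable, and fallback script name are hypothetical.

```python
# Illustrative sketch of the redundancy approach, not our exact implementation.
# Assumes a GitLab pipeline trigger (project ID and token are hypothetical) and
# a local fallback script equivalent to the scaling job sketched earlier.
import os
import subprocess

import requests

GITLAB_TRIGGER_URL = "https://gitlab.com/api/v4/projects/12345/trigger/pipeline"  # hypothetical project


def apply_configuration_change() -> None:
    try:
        # Primary path: let the GitLab pipeline apply the cluster configuration.
        response = requests.post(
            GITLAB_TRIGGER_URL,
            data={"token": os.environ["TRIGGER_TOKEN"], "ref": "main"},
            timeout=30,
        )
        response.raise_for_status()
    except (requests.RequestException, KeyError):
        # Alternative path: apply the same change directly from a standby runner
        # so that a GitLab outage no longer blocks the adjustment.
        subprocess.run(["python", "scale_for_morning_peak.py"], check=True)


if __name__ == "__main__":
    apply_configuration_change()
```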
Enhance Alerting
New alarms were introduced to detect failed configuration jobs and notify the team earlier, enabling a faster response and minimizing the potential impact.
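As an illustration (the project ID, ref, and webhook endpoint are hypothetical), a check of this kind can poll GitLab's pipelines API and raise an alert when the latest configuration pipeline failed or never ran:

```python
# Illustrative sketch of the new check, assuming GitLab's pipelines API and a
# generic alerting webhook; the project ID, ref, and webhook URL are hypothetical.
import os

import requests

PROJECT_PIPELINES = "https://gitlab.com/api/v4/projects/12345/pipelines"  # hypothetical project
ALERT_WEBHOOK = "https://alerts.example.com/notify"                       # hypothetical endpoint


def check_latest_configuration_pipeline(token: str) -> None:
    pipelines = requests.get(
        PROJECT_PIPELINES,
        headers={"PRIVATE-TOKEN": token},
        params={"ref": "main", "per_page": 1},
        timeout=30,
    ).json()

    # Alert if the most recent pipeline failed, or if no pipeline ran at all
    # (the failure mode seen during this incident).
    if not pipelines or pipelines[0]["status"] in ("failed", "canceled"):
        requests.post(
            ALERT_WEBHOOK,
            json={"text": "Cluster configuration pipeline did not complete successfully."},
            timeout=10,
        )


if __name__ == "__main__":
    check_latest_configuration_pipeline(os.environ["GITLAB_TOKEN"])
```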
Implement Autoscaling Improvements
We will optimize our autoscaling configurations to better absorb traffic surges during high-demand periods, reducing the risk of degraded performance even if configuration updates are delayed.
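As a sketch of the direction, assuming a Kubernetes deployment managed with kubectl, one way to express tighter autoscaling bounds is shown below; the deployment name and thresholds are illustrative and will be tuned against real traffic patterns.

```python
# Illustrative sketch only: the deployment name, replica bounds, and CPU
# threshold are hypothetical, not our production values.
import subprocess

subprocess.run(
    [
        "kubectl", "autoscale", "deployment/api",
        "--min=6",            # keep headroom available ahead of the morning peak
        "--max=20",           # allow growth during traffic surges
        "--cpu-percent=60",   # scale out relatively early as demand ramps up
        "-n", "production",
    ],
    check=True,
)
```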
We sincerely apologize for the inconvenience this incident caused to our customers. At Napta, ensuring a seamless experience for our users is our top priority, and we are committed to learning from this incident to improve the reliability of our platform.
If you have any further questions or concerns, please do not hesitate to contact our support team.
The Napta Team