[by_company_00017] Front App Outage

Incident Report for Front

Postmortem

At 19:05 UTC (11:05 am PST), Front began a routine application deployment. Front typically deploys new versions of the application multiple times per day as part of normal operation. This release included a change to the way we connect to an internal caching system. The change passed all of our testing requirements, but due to an environment-specific configuration on the caching system, it immediately caused elevated error rates in our production customer environment.

Front was alerted within minutes of the increased error rate and immediately started investigating. At 19:20 UTC we initiated a rollback of the deployment across all regions. Because of the progressive nature of the rollback, some customers saw the application recover quickly, while for others it took up to 30 minutes. By 19:50 UTC all rollbacks were complete and most customers were back to normal.

However, for one slice of Front customers the rollback triggered a second issue that kept the application from recovering. For clarity, Front assigns each customer to one of 40 “cells” that isolate them from one another. This architecture is designed to limit the scope of certain failures, like this one. But for the 2% of customers on the affected cell, we had to take additional steps to recover.
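
To make the isolation model concrete, here is a minimal sketch of cell-based routing under assumed names; the hashing scheme and identifiers are illustrative, not Front's actual implementation:

    import hashlib

    NUM_CELLS = 40  # the number of cells mentioned above

    def cell_for_customer(customer_id: str) -> int:
        """Deterministically pin a customer to a single cell."""
        digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_CELLS

    # All of a customer's traffic is served by that one cell's databases and
    # workers, so a cell-level failure only affects the customers pinned to it.
    print(cell_for_customer("customer-1234"))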

This second issue was triggered by a spike in database load caused by the stampede of traffic from customers returning to the application. In this cell, a particular workload created high database contention; the workload repeatedly timed out and then started over from the beginning, which prevented the database from escaping the overloaded state. Traffic continued to back up and we were unable to work through the backlog of messages.
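
The trap is easier to see in code. The following is a hypothetical sketch, with made-up names, batch sizes, and timeouts rather than Front's actual workload: a worker that times out under contention and retries the same work from scratch re-creates the load that caused the timeout, so calling drain() on a large backlog spins indefinitely.

    import random
    import time

    STATEMENT_TIMEOUT_S = 0.5            # stand-in for a database statement timeout

    def evaluate_rules(batch):
        """Stand-in for rule evaluation that contends on hot database rows."""
        start = time.monotonic()
        for _ in batch:
            time.sleep(random.uniform(0.001, 0.01))   # simulated lock waits
            if time.monotonic() - start > STATEMENT_TIMEOUT_S:
                raise TimeoutError("statement timeout under contention")

    def drain(backlog):
        while backlog:
            batch = backlog[:1000]        # the same slice is retried on every attempt
            try:
                evaluate_rules(batch)
                del backlog[:1000]        # only reached if the whole batch succeeds
            except TimeoutError:
                continue                  # no backoff, no smaller batch: each retry
                                          # re-creates the load and the backlog never shrinks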

Once we identified that the problematic workload was the evaluation of rules, we blocked all rules from processing. This allowed the database and the application to recover immediately, coming back online for users at 21:39 UTC. Rule evaluations continued to queue while blocked, so we slowly restarted processing to make sure the database did not reenter a bad state. All rule evaluations were caught up by 22:20 UTC.
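
The recovery pattern here is, roughly, a kill switch followed by a throttled replay. Below is a minimal sketch under assumed names; the flag and rate values are illustrative, not Front's actual configuration:

    import time

    rules_enabled = False                  # kill switch: block all rule evaluation

    def drain_rule_backlog(backlog, process, rate_per_second=50):
        """Replay queued rule evaluations at a conservative, fixed rate."""
        while backlog:
            if not rules_enabled:
                time.sleep(1)              # while blocked, work simply stays queued
                continue
            process(backlog.pop(0))
            time.sleep(1.0 / rate_per_second)   # cap the load on the database

    # Once the database is healthy, an operator flips rules_enabled to True and
    # gradually raises rate_per_second while watching database load.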

We would like to apologize for the disruption this outage caused and for its unusually long duration. In the immediate aftermath of the incident we have identified a number of actions we can take to prevent this kind of event from occurring again.

The first step is to ensure we have a test environment that accurately represents the configuration of the production customer environment for the caching system; this issue should have been detected before the change was ever deployed. The second is to investigate the rollback system to see whether we can safely improve its speed, so that if we have a similar issue in the future we can recover even faster.
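
One way to enforce that parity is a configuration-drift check that runs before deploys. The sketch below is hypothetical; the setting names and values are assumptions, not Front's actual cache configuration:

    def config_drift(test_cfg: dict, prod_cfg: dict) -> dict:
        """Return settings whose values differ between the two environments."""
        keys = set(test_cfg) | set(prod_cfg)
        return {k: (test_cfg.get(k), prod_cfg.get(k))
                for k in keys
                if test_cfg.get(k) != prod_cfg.get(k)}

    drift = config_drift(
        {"cache_cluster_mode": "single-node", "tls": False},   # test (hypothetical)
        {"cache_cluster_mode": "clustered", "tls": True},      # production (hypothetical)
    )
    if drift:
        raise SystemExit(f"test environment drifted from production: {drift}")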

Posted Nov 22, 2025 - 00:20 UTC

Resolved

This incident has been resolved.
Posted Nov 21, 2025 - 22:18 UTC

Update

Recovering the app for [by_company_00017] customers; delays for rules and calendar remain.
Posted Nov 21, 2025 - 21:45 UTC

Update

Continuing investigation for [by_company_00017] customers only.
Posted Nov 21, 2025 - 21:15 UTC

Update

Investigating continued outages for [by_company_00017] customers only.
Posted Nov 21, 2025 - 20:40 UTC

Update

Customers in us-west-2 are now recovered. We are continuing to monitor.
Posted Nov 21, 2025 - 20:18 UTC

Update

Fully recovered for the us-west-1 and eu-west-1 regions. Continued recovery for us-west-2 customers.
Posted Nov 21, 2025 - 20:15 UTC

Update

Recovering messages for all regions as we restore back to an operational state. Continuing to monitor.
Posted Nov 21, 2025 - 19:57 UTC

Monitoring

A rollback has started recovering the failures in us-west-1. We are monitoring and further investigating the root cause.
Posted Nov 21, 2025 - 19:48 UTC

Investigating

We are currently investigating an outage.
Posted Nov 21, 2025 - 19:21 UTC
This incident affected: App, Rules and Workflows, and Calendar.