Some endpoints with certificates seeing 3200 errors

Incident Report for ngrok

Postmortem

At approximately 2200 Pacific, ngrok engineers began progressively rolling out a change intended to improve endpoint update speed and resiliency. Testing and monitoring across regions showed no issues at the time; however, customer reports of 3200 errors began arriving several hours later. Working with Customer Support, engineers identified a problem with the change that prevented some endpoints with associated certificates from being exposed at our edge. This caused the 3200 errors for impacted endpoints.

Only endpoints created by a new ngrok Edge, or by an agent connecting or reconnecting during the incident window, were impacted, and only if they had an associated certificate, either through "automatic TLS certificates" or a manually uploaded TLS certificate.

At 0625 Pacific, the scope of the issue became apparent, and teams began remediation. By 0830, an appropriate fix had been identified, and the team began rolling it out. The fix was released one region at a time so that any impact could be assessed. At 0930, the rollout to the highest-traffic regions was complete, with the remaining regions following up to 15 minutes later.

The team identified several tests that can catch this class of problem in the future. We are also working on monitoring that catches edge-case errors like those seen in this incident, which our usual monitors miss because of the sheer volume of traffic ngrok handles.
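
To illustrate why a volume-based monitor can miss this kind of failure, here is a minimal sketch using synthetic data. It is not ngrok's monitoring; the endpoint names, traffic counts, and the 0.9 threshold are all assumptions made up for the example. It shows that a handful of endpoints failing every request barely move the aggregate error rate, while a per-endpoint view flags them immediately.

```python
from collections import defaultdict


def error_rates(requests):
    """Compute the aggregate error rate and the error rate per endpoint.

    `requests` is an iterable of (endpoint_name, is_error) pairs.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for endpoint, is_error in requests:
        totals[endpoint] += 1
        if is_error:
            errors[endpoint] += 1

    total = sum(totals.values())
    aggregate = sum(errors.values()) / total if total else 0.0
    per_endpoint = {ep: errors[ep] / n for ep, n in totals.items()}
    return aggregate, per_endpoint


def failing_endpoints(per_endpoint, threshold=0.9):
    """Return endpoints whose own error rate meets the (hypothetical) threshold."""
    return sorted(ep for ep, rate in per_endpoint.items() if rate >= threshold)


# Synthetic traffic: 1000 healthy endpoints with 100 successful requests each,
# plus two hypothetical endpoints that fail every request.
requests = [(f"healthy-{i}", False) for i in range(1000) for _ in range(100)]
requests += [("broken-a", True)] * 50 + [("broken-b", True)] * 50

aggregate, per_endpoint = error_rates(requests)
print(f"aggregate error rate: {aggregate:.4%}")
print("endpoints failing nearly all requests:", failing_endpoints(per_endpoint))
```

In this sketch the aggregate error rate stays below 0.1%, well under what a typical global alert would trigger on, yet the per-endpoint view surfaces both broken endpoints.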

Posted Aug 03, 2023 - 22:35 UTC

Resolved

This incident should be resolved for all customers in all regions. Please let our support team know if you see any further issues.

Posted Aug 02, 2023 - 17:02 UTC

Update

We are continuing to work on a fix for this issue.

Posted Aug 02, 2023 - 16:51 UTC

Update

We are continuing to work on a fix for this issue.

Posted Aug 02, 2023 - 16:47 UTC

Update

We are continuing to work on a fix for this issue.

Posted Aug 02, 2023 - 16:34 UTC

Update

We are continuing to work on a fix for this issue.

Posted Aug 02, 2023 - 16:30 UTC

Update

We are rolling back the changes identified as causing this issue. Customers will see recovery across regions over the next 30 minutes. We are prioritizing the highest-traffic regions first.

Posted Aug 02, 2023 - 16:22 UTC

Identified

We are investigating an incident in which customers with certificates attached to an endpoint, primarily custom domains, are seeing 3200 errors. The issue has been identified, and we are working on remediation.

Posted Aug 02, 2023 - 15:42 UTC

This incident affected: Endpoint Connectivity (Region - AP, Region - AU, Region - EU, Region - IN, Region - JP, Region - SA, Region - US).