At approximately 2200 Pacific, ngrok engineers progressively rolled out a change to improve endpoint update speed and resiliency. Testing and monitoring across regions did not show any issues at the time; however, customer reports of 3200 errors started coming in several hours later. After working with Customer Support, engineers identified a problem with the change that caused some endpoints with associated certificates to not be exposed at our edge. This caused the 3200 errors for impacted endpoints.
Only endpoints created by a new ngrok Edge or by an agent connecting or reconnecting during the incident window were impacted, and only if they had an associated certificate, either through "automatic TLS certificates" or a manually uploaded TLS certificate.
At 0625 Pacific, the scope of the issue became apparent, and teams began remediation. By 0830, an appropriate fix was identified, and the team started rolling it out. The fix was released to regions serially to assess any impact. At 0930, the highest-traffic regions were complete, with other regions following up to 15 minutes later.
The team identified several tests that can catch this in the future. We are also working on monitoring that catches edge-case errors like those seen in this incident, which our usual monitors don't detect due to the sheer volume of traffic handled by ngrok.