End-users experienced 503 and 504 errors and degraded performance on single sign-on (SSO) transactions.
There was no exhaustion of system resources (CPU, memory, disk, or network) on the application or database servers during the incident, and no datacenter-specific infrastructure issue.
The root cause was a proxy instance that sits in front of the PingOne configuration database. The short-term fix was to replace the proxy with an instance that has greater network bandwidth; these updates were made on 9/15 at roughly 22:00 MT and increased overall proxy bandwidth by a factor of eight. The longer-term fix is to remove the proxy from PingOne's database critical path entirely.

Historical data shows that proxy bandwidth usage had been trending upward. It had come close to the network cap a few times, but 9/12 was the first instance of hitting the ceiling. A large majority of the bandwidth is consumed by the ADConnect-API; after investigation, we found that much of it comes from customers running very old versions of ADConnect.
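The "trending toward a cap" pattern above could be caught earlier with a simple utilization check over recent bandwidth samples. The sketch below is illustrative only: the cap value, threshold, and sample numbers are assumptions, not actual PingOne metrics.

```python
# Hypothetical sketch: flag when proxy bandwidth utilization approaches its cap.
# BANDWIDTH_CAP_MBPS and WARN_THRESHOLD are assumed values for illustration.

BANDWIDTH_CAP_MBPS = 1000        # assumed network cap for the proxy instance
WARN_THRESHOLD = 0.80            # alert when peak usage exceeds 80% of cap

def check_utilization(samples_mbps, cap=BANDWIDTH_CAP_MBPS, threshold=WARN_THRESHOLD):
    """Return (peak_utilization, at_risk) for a window of bandwidth samples."""
    peak = max(samples_mbps) / cap
    return peak, peak >= threshold

# Example: usage climbing toward the cap, as seen in the days before 9/12.
window = [610, 700, 790, 850, 980]
peak, at_risk = check_utilization(window)
print(f"peak utilization: {peak:.0%}, at risk: {at_risk}")
```

An alert like this on the proxy's egress metric would have surfaced the near-misses before the ceiling was actually hit.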