Overview of Symptoms
End users experienced HTTP 503 and 504 errors and degraded performance on single sign-on (SSO) transactions.
Conditions of Event
There was no system resource (CPU, memory, disk, network) exhaustion on the application or database servers during the incident. There was no datacenter-specific infrastructure issue during the incident.
Incident Timeline
- 6:00 MT Scheduler-service started batch job
- 6:13 Token-processor to config-service queries started to fail
- 6:57 First user-facing errors began
- 7:02 SSO performance significantly declined
- 7:04 High-sev (paging) alert: critical path failure
- 7:04 SRE began investigation
- 7:25 SRE restarted token-processor instances
- 7:45 SRE deployed additional token-processor and config-service instances
- 8:59 SRE restarted config-service instances
- 9:03 SRE increased timeout on load balancer health checks to reduce flapping
- 9:34 Engineers identified scheduler-service activity as a potential cause
- 9:34 SRE shut down scheduler-service
- 10:07 Critical path failure resolved
- 10:10 SSO performance back to normal
- 10:13 Errors stopped and incident resolved
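
The intervals implied by the timeline above can be computed with a short sketch like the following. The event keys and the choice of which intervals to report are illustrative assumptions, not an official metric definition; the timestamps are copied from the timeline and assumed to fall on the same day in MT.

```python
from datetime import datetime

# Timestamps copied from the timeline above (same day, MT).
# The dictionary keys are hypothetical labels chosen for this sketch.
EVENTS = {
    "config_queries_failing": "6:13",
    "first_user_errors": "6:57",
    "paging_alert": "7:04",
    "scheduler_shut_down": "9:34",
    "incident_resolved": "10:13",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two H:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def interval(start_event: str, end_event: str) -> int:
    """Look up two named events and return the minutes between them."""
    return minutes_between(EVENTS[start_event], EVENTS[end_event])


if __name__ == "__main__":
    pairs = [
        ("config_queries_failing", "paging_alert"),
        ("first_user_errors", "paging_alert"),
        ("paging_alert", "scheduler_shut_down"),
        ("first_user_errors", "incident_resolved"),
    ]
    for start_event, end_event in pairs:
        print(f"{start_event} -> {end_event}: {interval(start_event, end_event)} min")
```

With these timestamps, internal query failures preceded the paging alert by 51 minutes, and user-facing errors lasted 196 minutes from first error to resolution.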
Root Cause
The primary root cause was a proxy instance that sits in front of the PingOne configuration database: it reached its network bandwidth cap. Historical data shows that proxy bandwidth usage had been trending upward. It had come close to the cap a few times, but 9/12 was the first time it hit the ceiling.
The short-term fix was to replace the proxy with an instance that has greater network bandwidth. This change was made on 9/15 at roughly 22:00 MT and increased overall proxy bandwidth by a factor of eight. The longer-term fix is to remove the proxy from PingOne's database critical path entirely.
The large majority of the proxy bandwidth is used by the ADConnect-API. After investigation, we found that much of this bandwidth comes from customers running very old versions of ADConnect.
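
A bandwidth ceiling of this kind could be caught earlier by comparing observed throughput against the cap for the instance size in use. The following is a minimal sketch under stated assumptions: the instance names, the cap values, and the 70%/90% thresholds are hypothetical placeholders (chosen only so that the larger size has eight times the cap of the smaller one), and the throughput samples would have to come from whatever monitoring system is in place.

```python
from dataclasses import dataclass

# Hypothetical per-instance-size bandwidth caps in Mbps. Real values must be
# taken from the provider's documentation for the instance types in use.
BANDWIDTH_CAP_MBPS = {
    "small-proxy": 1000.0,
    "large-proxy": 8000.0,
}


@dataclass
class ProxySample:
    """One throughput observation for a proxy instance."""
    instance_type: str
    throughput_mbps: float


def utilization(sample: ProxySample) -> float:
    """Fraction of the instance's bandwidth cap currently in use."""
    return sample.throughput_mbps / BANDWIDTH_CAP_MBPS[sample.instance_type]


def alarm_level(sample: ProxySample, warn: float = 0.70, critical: float = 0.90) -> str:
    """Classify a sample as 'ok', 'warning', or 'critical' (assumed thresholds)."""
    used = utilization(sample)
    if used >= critical:
        return "critical"
    if used >= warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for sample in (
        ProxySample("small-proxy", 950.0),
        ProxySample("large-proxy", 950.0),
    ):
        print(sample.instance_type, f"{utilization(sample):.0%}", alarm_level(sample))
```

Keying the cap on instance size means the same absolute throughput is judged differently after a resize, which is the point of correlating the alarm with instance sizing.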
Action Items
- SRE-6202 Replace proxy systems in front of the config database with a larger instance (completed)
- SRE-6217 Add alarms on bandwidth utilization of proxy systems, correlated with AWS instance sizing
- SRE-6128 Engineer the connections to RDS to remove the need for a proxy
- SRE-6245 Investigate reported status site customized notifications
- SSD-3280 Improve health metrics, alerting, and logging within our caching services
- SSD-3259 Reduce SSO dependency on caching services
- SSD-3293 Assess adding a cache for ADConnect API
- Plan communication to customers about upgrading from old ADConnect versions. Only versions 3.0 and later are currently supported.
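
To scope the customer communication in the last item, installed ADConnect versions could be checked against the 3.0 minimum. This is a rough sketch only: it assumes versions are reported as dotted numeric strings, and the customer names and the inventory mapping are hypothetical.

```python
from typing import Dict, List, Tuple

# Minimum supported ADConnect version, per the action item above.
MIN_SUPPORTED: Tuple[int, int] = (3, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a dotted numeric version such as '2.9.1' into a tuple of ints.

    Pads to at least two components so that '3' compares equal to '3.0'.
    """
    parts = [int(part) for part in version.strip().split(".")]
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts)


def is_supported(version: str) -> bool:
    """True if the version is at or above the minimum supported version."""
    return parse_version(version) >= MIN_SUPPORTED


def customers_to_notify(installed: Dict[str, str]) -> List[str]:
    """Return customers (sorted by name) running an unsupported version."""
    return sorted(name for name, version in installed.items() if not is_supported(version))


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    inventory = {"customer-a": "2.4", "customer-b": "3.1", "customer-c": "1.9.2"}
    print(customers_to_notify(inventory))
```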