Service Interruption - SSO transactions
Incident Report for Ping Identity

Overview of Symptoms

End users experienced HTTP 503 and 504 errors and degraded performance on single sign-on (SSO) transactions.

Conditions of Event

There was no system resource (CPU, memory, disk, network) exhaustion on the application or database servers during the incident. There was no datacenter-specific infrastructure issue during the incident.

Incident Timeline

  • 6:00 MT Scheduler-service started batch job
  • 6:13 Token-processor to config-service queries started to fail
  • 6:57 First user-facing errors began
  • 7:02 SSO performance significantly declined
  • 7:04 High-sev (paging) alert: critical path failure
  • 7:04 SRE began investigation
  • 7:25 SRE restarted token-processor instances
  • 7:45 SRE deployed additional token-processor and config-service instances
  • 8:59 SRE restarted config-service instances
  • 9:03 SRE increased timeout on load balancer health checks to reduce flapping
  • 9:34 Engineers identified scheduler-service activity as a potential cause
  • 9:34 SRE shut down scheduler-service
  • 10:07 Critical path failure resolved
  • 10:10 SSO performance back to normal
  • 10:13 Errors stopped and incident resolved
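
The intervals implied by the timeline can be worked out with a short script. This is only an illustrative sketch: the timestamps are copied from the timeline above, while the event labels and the `minutes_between` helper are names made up for this example and are not part of any PingOne tooling.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all MT, same day).
# The dictionary keys are hypothetical labels chosen for this sketch.
EVENTS = {
    "internal_queries_failing": "6:13",
    "first_user_errors": "6:57",
    "paging_alert": "7:04",
    "scheduler_shut_down": "9:34",
    "incident_resolved": "10:13",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes elapsed between two labelled timeline events."""
    fmt = "%H:%M"
    delta = datetime.strptime(EVENTS[end], fmt) - datetime.strptime(EVENTS[start], fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    print("Internal failures to page:", minutes_between("internal_queries_failing", "paging_alert"), "min")
    print("User-facing errors to page:", minutes_between("first_user_errors", "paging_alert"), "min")
    print("User-facing impact:", minutes_between("first_user_errors", "incident_resolved"), "min")
    print("Scheduler shutdown to resolution:", minutes_between("scheduler_shut_down", "incident_resolved"), "min")
```

By this arithmetic, internal queries had been failing for 51 minutes before the paging alert fired, and user-facing impact lasted 196 minutes (3 hours 16 minutes).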

Root Cause

The primary root cause was a proxy instance in front of the PingOne configuration database that reached its network bandwidth cap. The short-term fix was to replace the proxy with an instance that has greater network bandwidth. These system updates were made on 9/15 at roughly 22:00 MT and increased overall proxy bandwidth by a factor of eight. The longer-term fix is to remove the proxy from PingOne's database critical path entirely.

Historical data shows that proxy bandwidth usage had been trending upwards. It had come close to the network cap a few times, but 9/12 was the first time it hit the ceiling. A large majority of the bandwidth is used by the ADConnect-API. After investigation, we found that much of this bandwidth comes from customers running very old versions of ADConnect.
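
As a rough illustration of how an upward bandwidth trend could be caught before it reaches a cap, the sketch below classifies peak utilization against a cap and projects when a linear trend would reach it. Everything in it is an assumption made for the example: the function names, the 80% and 95% thresholds, and the sample figures are hypothetical and do not come from the incident data or from any monitoring system used in production.

```python
from statistics import mean
from typing import Optional, Sequence


def classify_utilization(peak_mbps: float, cap_mbps: float,
                         warn: float = 0.80, crit: float = 0.95) -> str:
    """Classify a peak reading against the instance's bandwidth cap.

    The warn/crit ratios are illustrative defaults, not recommended values.
    """
    ratio = peak_mbps / cap_mbps
    if ratio >= crit:
        return "critical"
    if ratio >= warn:
        return "warning"
    return "ok"


def days_until_cap(daily_peaks_mbps: Sequence[float], cap_mbps: float) -> Optional[float]:
    """Project days until the cap using a least-squares line over daily peaks.

    Returns None when there are too few samples or usage is not trending up.
    """
    n = len(daily_peaks_mbps)
    if n < 2:
        return None
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(daily_peaks_mbps)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, daily_peaks_mbps))
             / sum((x - x_bar) ** 2 for x in xs))
    if slope <= 0:
        return None
    intercept = y_bar - slope * x_bar
    # Day index where the fitted line meets the cap, relative to the last sample.
    return max(0.0, (cap_mbps - intercept) / slope - (n - 1))


if __name__ == "__main__":
    # Hypothetical numbers: a 1000 Mbps cap and four days of rising peaks.
    peaks = [400.0, 450.0, 500.0, 550.0]
    print(classify_utilization(peaks[-1], 1000.0))
    print(days_until_cap(peaks, 1000.0))
```

A linear projection is a deliberate simplification. Real traffic is bursty, so an alarm of this kind would more likely be driven by peak or high-percentile readings over a window than by a single fitted line.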

Action items

  • SRE-6202 Replace proxy systems in front of the config database with a larger instance (completed)
  • SRE-6217 Add alarms on bandwidth utilization of proxy systems, correlated with AWS instance sizing
  • SRE-6128 Engineer the connections to RDS to remove the need for a proxy
  • SRE-6245 Investigate reported issues with customized status site notifications
  • SSD-3280 Improve health metrics, alerting, and logging within our caching services
  • SSD-3259 Reduce SSO dependency on caching services
  • SSD-3293 Assess adding a cache for ADConnect API
  • Plan communication to customers about upgrading from old ADConnect versions. Only versions 3.0 and later are currently supported.
Posted Sep 14, 2016 - 11:31 MDT

Resolved
This incident has been resolved. Engineers will post an RCA within 48 hours with full details of the incident.
Posted Sep 12, 2016 - 10:50 MDT
Monitoring
Site Reliability is currently monitoring SSO systems for performance and stability. Automated testing has seen a significant drop in errors over the past 15 minutes, and we are seeing an improvement in overall performance. ETA is still 30 minutes at this point.
Posted Sep 12, 2016 - 10:21 MDT
Identified
Site Reliability has identified possible issues with PingOne's SSO services. We are currently working on a fix. Resolution ETA is less than 30 minutes.
Posted Sep 12, 2016 - 09:46 MDT
Update
Engineering teams are investigating an issue with PingOne Single Sign-on systems. Site Reliability is continuing to investigate and will update this message when the incident has been identified.
Posted Sep 12, 2016 - 08:54 MDT
Update
Engineering teams are investigating an issue with PingOne Single Sign-on systems. Site Reliability will update this message when the incident has been identified. Automated monitoring systems will update affected components and will resolve operational status as systems recover.
Posted Sep 12, 2016 - 08:52 MDT
Investigating
Monitoring systems have detected an issue with PingOne production systems. The Site Reliability Engineering team has been notified and is currently working on the issue. Site Reliability will update this message when the incident has been identified. Automated monitoring systems will update affected components and will resolve operational status as systems recover.

For additional questions please contact support@pingidentity.com, or follow this incident on https://status.pingidentity.com for real-time service updates.
Posted Sep 12, 2016 - 07:39 MDT
This incident affected: PingOne Services (North America Critical Path, Australia Critical Path, Single Sign-on).