Service Interruption - SSO transactions
Incident Report for Ping Identity

Overview of Symptoms

End-users experienced 503 & 504 errors and poor performance on single sign-on (SSO) transactions.

Conditions of Event

There was no system resource (CPU, memory, disk, network) exhaustion on the application or database servers during the incident. There was no datacenter-specific infrastructure issue during the incident.

Incident Timeline - All Times in MDT

  • 0653 - First signs of latency increase for TPN.
  • 0657 - Average latency goes from <50ms to over 10,000ms.
  • 0657 - Synthetics failure pages the on-call.
  • 0705 - SRE on-call begins investigation.
  • 0710 - SRE determines that Scheduler Service is still deactivated.
  • 0719 - Status Page posting is made.
  • 0726 - Latency returns to normal.

Root Cause

The primary root cause was a proxy instance that sits in front of the PingOne configuration database: it reached its network bandwidth cap. Historical data shows that proxy bandwidth usage had been trending upwards. It had come close to the cap a few times, but 9/12 was the first time it hit the ceiling.

The short-term fix was to replace the proxy with an instance that has greater network bandwidth. These system updates were made on 9/15 at roughly 22:00 MT and increased overall proxy bandwidth by a factor of eight. The longer-term fix is to remove the proxy from PingOne's database critical path entirely.

A large majority of the bandwidth is used by the ADConnect-API. After investigation, we found that much of this bandwidth comes from customers running very old versions of ADConnect.
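As an illustration of the kind of check that could catch an upward bandwidth trend before it reaches an instance's network cap, here is a minimal Python sketch. It is not the monitoring Ping Identity runs: the names (`BandwidthSample`, `utilization`, `days_until_cap`, `should_alarm`), the 80% threshold, and all figures are hypothetical and chosen only for illustration. It fits a straight line to daily peak throughput and estimates how many days remain until the projected peak reaches the cap.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BandwidthSample:
    day: int           # days since the start of the observation window
    peak_mbps: float   # peak proxy throughput observed on that day


def utilization(peak_mbps: float, cap_mbps: float) -> float:
    """Fraction of the network cap consumed by the observed peak."""
    return peak_mbps / cap_mbps


def days_until_cap(samples: List[BandwidthSample], cap_mbps: float) -> Optional[float]:
    """Least-squares linear fit of peak throughput against time.

    Returns the projected number of days after the last sample until the
    fitted line reaches cap_mbps, or None if the trend is flat or falling
    (or there is too little data to fit a line).
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(s.day for s in samples) / n
    mean_y = sum(s.peak_mbps for s in samples) / n
    sxx = sum((s.day - mean_x) ** 2 for s in samples)
    if sxx == 0:
        return None
    sxy = sum((s.day - mean_x) * (s.peak_mbps - mean_y) for s in samples)
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (cap_mbps - intercept) / slope
    return max(0.0, crossing_day - samples[-1].day)


def should_alarm(samples: List[BandwidthSample], cap_mbps: float,
                 threshold: float = 0.8) -> bool:
    """Alarm when the most recent peak meets or exceeds threshold * cap."""
    return utilization(samples[-1].peak_mbps, cap_mbps) >= threshold


# Hypothetical data: peak usage growing by 50 Mbps per day.
history = [BandwidthSample(day=d, peak_mbps=500.0 + 50.0 * d) for d in range(5)]
```

With this hypothetical history, calling `days_until_cap(history, cap_mbps=1000.0)` projects the time remaining on the original instance, and passing a cap eight times larger shows how much headroom a larger instance buys under the same linear trend.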

Action Items

  • SRE-6202 Replace proxy systems in front of the config database with a larger instance (completed)
  • SRE-6217 Add alarms on bandwidth utilization of proxy systems, correlated with AWS instance sizing
  • SRE-6128 Engineer the connections to RDS to remove the need for a proxy
  • SRE-6245 Investigate reported status site customized notifications
  • SSD-3280 Improve health metrics, alerting, and logging within our caching services
  • SSD-3259 Reduce SSO dependency on caching services
  • SSD-3293 Assess adding a cache for ADConnect API
  • Plan communication to customers about upgrading from old ADConnect versions. Currently, only versions 3.0+ are supported.
Posted 11 months ago. Sep 23, 2016 - 13:10 MDT

Resolved
This incident has been resolved.

[UPDATED] Engineers will be posting an RCA within the next 7 days. We are taking additional time to investigate SSO issues, and will provide the customer RCA next week.
Posted 11 months ago. Sep 15, 2016 - 09:01 MDT
Monitoring
Site Reliability is currently monitoring SSO systems for performance and stability. Automated testing has seen a significant drop in errors over the past 15 minutes, and we are seeing an improvement in overall performance.
Posted 11 months ago. Sep 15, 2016 - 07:44 MDT
Investigating
Monitoring systems have detected an issue with PingOne production systems. The Site Reliability Engineering team has been notified and is currently working the issue. Site Reliability will update this message when the incident has been identified. Automated monitoring systems will update affected components and will resolve operational status as systems recover.

For additional questions please contact support@pingidentity.com, or follow this incident on https://status.pingidentity.com for real-time service updates.
Posted 11 months ago. Sep 15, 2016 - 07:19 MDT
This incident affected: PingOne Services (North America Critical Path, Australia Critical Path).