System Uptime

System uptime in the past 90 days.

Past Incidents

Welcome to Ping Identity's system status site.

Service Interruption - SSO transactions

Incident Report for Ping Identity

Postmortem

Overview of Symptoms

End users experienced HTTP 503 and 504 errors and poor performance on single sign-on (SSO) transactions.

Conditions of Event

There was no system resource (CPU, memory, disk, network) exhaustion on the application or database servers during the incident. There was no datacenter-specific infrastructure issue during the incident.

Incident Timeline

6:00 MT Scheduler-service started batch job
6:13 Token-processor to config-service queries started to fail
6:57 First user-facing errors began
7:02 SSO performance significantly declined
7:04 High-sev (paging) alert: critical path failure
7:04 SRE began investigation
7:25 SRE restarted token-processor instances
7:45 SRE deployed additional token-processor and config-service instances
8:59 SRE restarted config-service instances
9:03 SRE increased timeout on load balancer health checks to reduce flapping
9:34 Engineers identified scheduler-service activity as a potential cause
9:34 SRE shut down scheduler-service
10:07 Critical path failure resolved
10:10 SSO performance back to normal
10:13 Errors stopped and incident resolved
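The intervals implied by the timeline can be checked with a short Python sketch. The times are copied from the entries above; the dictionary keys and the helper name are illustrative labels, not names used by Ping Identity.

```python
from datetime import datetime

# Times (MT, same day) copied from the incident timeline.
# The keys are illustrative labels chosen for this sketch.
TIMELINE = {
    "internal_queries_failing": "6:13",
    "first_user_errors": "6:57",
    "paging_alert": "7:04",
    "scheduler_shut_down": "9:34",
    "incident_resolved": "10:13",
}


def minutes_between(start_key: str, end_key: str) -> int:
    """Return whole minutes elapsed between two timeline entries."""
    fmt = "%H:%M"
    start = datetime.strptime(TIMELINE[start_key], fmt)
    end = datetime.strptime(TIMELINE[end_key], fmt)
    return int((end - start).total_seconds() // 60)


# Internal failures ran for 51 minutes before the paging alert fired.
print(minutes_between("internal_queries_failing", "paging_alert"))  # 51
# User-facing errors lasted 196 minutes (3 h 16 min).
print(minutes_between("first_user_errors", "incident_resolved"))    # 196
# It took 150 minutes from the alert to shutting down scheduler-service.
print(minutes_between("paging_alert", "scheduler_shut_down"))       # 150
```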

Root Cause

The root cause was network bandwidth saturation on a proxy instance that sits in front of the PingOne configuration database. Historical data shows that proxy bandwidth usage had been trending upward. It had come close to the network cap a few times, but 9/12 was the first time it hit the ceiling. A large majority of the bandwidth is used by the ADConnect-API, and our investigation found that much of it comes from customers running very old versions of ADConnect.

The short-term fix was to replace the proxy with an instance that has greater network bandwidth. These system updates were made on 9/15 at roughly 22:00 MT and increased overall proxy bandwidth by a factor of eight. The longer-term fix is to remove the proxy from PingOne's database critical path entirely.
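As a rough illustration of how bandwidth that trends toward a fixed cap could be flagged before it saturates, here is a minimal Python sketch. The cap, the sample values, the 80% warning threshold, and the function name are all hypothetical; the report does not state the actual figures or the alerting tooling in use.

```python
def classify_utilization(peak_mbps: float, cap_mbps: float,
                         warn_ratio: float = 0.8) -> str:
    """Classify peak bandwidth against an instance's network cap.

    warn_ratio is an assumed threshold, not a value from the report.
    """
    if cap_mbps <= 0:
        raise ValueError("cap_mbps must be positive")
    ratio = peak_mbps / cap_mbps
    if ratio >= 1.0:
        return "saturated"
    if ratio >= warn_ratio:
        return "warn"
    return "ok"


# Hypothetical daily peaks (Mbps) trending upward toward a 1000 Mbps cap.
old_cap = 1000.0
daily_peaks = [700.0, 850.0, 990.0, 1000.0]
print([classify_utilization(p, old_cap) for p in daily_peaks])
# ['ok', 'warn', 'warn', 'saturated']

# With eight times the bandwidth, the same peak leaves ample headroom.
print(classify_utilization(1000.0, old_cap * 8))  # ok
```

A warning threshold below 100% is the point of the sketch: the near misses mentioned above would have raised a "warn" before the first saturation.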

Action items

SRE-6202 Replace proxy systems in front of the config database with a larger instance (completed)
SRE-6217 Add alarms on bandwidth utilization of proxy systems, correlated with AWS instance sizing
SRE-6128 Engineer the connections to RDS to remove the need for a proxy
SRE-6245 Investigate reported status site customized notifications
SSD-3280 Improve health metrics, alerting, and logging within our caching services
SSD-3259 Reduce SSO dependency on caching services
SSD-3293 Assess adding a cache for ADConnect API
Plan communication to customers about upgrading from old ADConnect versions. Only versions 3.0 and later are currently supported.
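To show how clients on unsupported ADConnect versions could be identified for that communication, here is a minimal Python sketch. It assumes version strings are dotted numbers such as "2.9.4", which the report does not confirm, and the function name is hypothetical.

```python
def is_supported(version: str, minimum: tuple = (3, 0)) -> bool:
    """Return True if a dotted numeric version is at or above the minimum.

    Assumes purely numeric components; the real ADConnect version
    format may differ.
    """
    parts = tuple(int(p) for p in version.split("."))
    # Pad short versions such as "3" so they compare as (3, 0).
    if len(parts) < len(minimum):
        parts += (0,) * (len(minimum) - len(parts))
    return parts >= minimum


print(is_supported("2.9.4"))  # False
print(is_supported("3.0"))    # True
print(is_supported("10.1"))   # True
```

Comparing integer tuples rather than raw strings avoids ranking "10.1" below "3.0".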

Posted Sep 14, 2016 - 17:31 UTC

Resolved

This incident has been resolved. Engineers will post an RCA within 48 hours with full details of the incident.

Posted Sep 12, 2016 - 16:50 UTC

Monitoring

Site Reliability is currently monitoring SSO systems for performance and stability. Automated testing has seen a significant drop in errors over the past 15 minutes, and we are seeing an improvement in overall performance. ETA is still 30 minutes at this point.

Posted Sep 12, 2016 - 16:21 UTC

Identified

Site Reliability has identified possible issues with PingOne's SSO services. We are currently working on a fix. Resolution ETA is less than 30 minutes.

Posted Sep 12, 2016 - 15:46 UTC

Update

Engineering teams are investigating an issue with PingOne Single Sign-on systems. Site Reliability is continuing to investigate and will update this message when the incident has been identified.

Posted Sep 12, 2016 - 14:54 UTC

Update

Engineering teams are investigating an issue with PingOne Single Sign-on systems. Site Reliability will update this message when the incident has been identified. Automated monitoring systems will update affected components and will resolve operational status as systems recover.

Posted Sep 12, 2016 - 14:52 UTC

Investigating

Monitoring systems have detected an issue with PingOne production systems. The Site Reliability Engineering team has been notified and is currently working the issue. Site Reliability will update this message when the incident has been identified. Automated monitoring systems will update affected components and will resolve operational status as systems recover.

For additional questions please contact support@pingidentity.com, or follow this incident on https://status.pingidentity.com for real-time service updates.

Posted Sep 12, 2016 - 13:39 UTC

This incident affected: PingOne for Enterprise - Global (Single Sign-on).
