Welcome to Ping Identity's system status site.

Service Interruption - SSO transactions

Incident Report for Ping Identity

Postmortem

Overview of Symptoms

End users experienced HTTP 503 and 504 errors and degraded performance on single sign-on (SSO) transactions.

Conditions of Event

There was no system resource (CPU, memory, disk, network) exhaustion on the application or database servers during the incident. There was no datacenter-specific infrastructure issue during the incident.

Incident Timeline - All Times in MDT

0653 - First signs of latency increase for TPN.
0657 - Average latency goes from under 50 ms to over 10,000 ms.
0657 - Synthetics failure pages the on-call.
0705 - SRE on-call begins investigation.
0710 - SRE determines that Scheduler Service is still deactivated.
0719 - Status Page posting is made.
0726 - Latency returns to normal.
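
The intervals implied by the timeline can be worked out directly from the timestamps above. The following is a minimal sketch, assuming all entries fall on the same day and in the same time zone (MDT); the helper name `minutes_between` is illustrative and not part of any Ping Identity tooling.

```python
from datetime import datetime


def minutes_between(start_hhmm: str, end_hhmm: str) -> int:
    """Return whole minutes between two same-day HHMM timestamps."""
    fmt = "%H%M"
    delta = datetime.strptime(end_hhmm, fmt) - datetime.strptime(start_hhmm, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps are copied from the timeline; the labels are our own.
first_signs_to_page = minutes_between("0653", "0657")
page_to_investigation = minutes_between("0657", "0705")
page_to_status_post = minutes_between("0657", "0719")
first_signs_to_recovery = minutes_between("0653", "0726")

print(first_signs_to_page, page_to_investigation,
      page_to_status_post, first_signs_to_recovery)
```

Under those assumptions, the gap from the first latency signs to recovery is 33 minutes, and the status page posting came 22 minutes after the on-call was paged.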

Root Cause

The main root cause was a proxy instance that sits in front of the PingOne configuration database. Historical data shows that proxy bandwidth usage had been trending upwards. It had come close to the network cap a few times, but 9/12 was the first instance of hitting the ceiling.

The short-term fix was to replace the proxy with an instance that has greater network bandwidth. These system updates were made on 9/15 at roughly 22:00 MT and increased overall proxy bandwidth by a factor of eight. The longer-term fix is to remove the proxy from PingOne's database critical path entirely.

A large majority of the bandwidth used is from the ADConnect-API. After investigation, we found that much of this bandwidth is from customers running very old versions of ADConnect.
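
A bandwidth trend that creeps toward a fixed network cap is the kind of condition a utilization alarm can surface before the ceiling is hit. The following is a minimal sketch of such a check, not the alerting actually deployed: the function names, the 70% and 90% thresholds, and the throughput figures are all hypothetical, and the cap would need to come from the real limit of the instance size in use.

```python
def peak_utilization(samples_mbps, cap_mbps):
    """Return the highest observed throughput as a fraction of the cap."""
    if cap_mbps <= 0:
        raise ValueError("cap_mbps must be positive")
    if not samples_mbps:
        raise ValueError("samples_mbps must not be empty")
    return max(samples_mbps) / cap_mbps


def alarm_level(samples_mbps, cap_mbps, warn_at=0.70, page_at=0.90):
    """Classify a window of throughput samples against the bandwidth cap.

    warn_at and page_at are assumed thresholds, chosen for illustration.
    """
    ratio = peak_utilization(samples_mbps, cap_mbps)
    if ratio >= page_at:
        return "page"
    if ratio >= warn_at:
        return "warn"
    return "ok"


# Hypothetical numbers: a window peaking at 950 Mbps against a 1000 Mbps cap.
print(alarm_level([400.0, 720.0, 950.0], cap_mbps=1000.0))
# The same window after an eightfold increase in the cap.
print(alarm_level([400.0, 720.0, 950.0], cap_mbps=8000.0))
```

Expressing the threshold as a fraction of the cap, rather than as an absolute rate, keeps the alarm meaningful when the instance size changes.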

Action items

SRE-6202 Replace proxy systems in front of the config database with a larger instance (completed)
SRE-6217 Add alarms on bandwidth utilization of proxy systems, correlated with AWS instance sizing
SRE-6128 Engineer the connections to RDS to remove the need for a proxy
SRE-6245 Investigate reported status site customized notifications
SSD-3280 Improve health metrics, alerting, and logging within our caching services
SSD-3259 Reduce SSO dependency on caching services
SSD-3293 Assess adding a cache for ADConnect API
Plan communication to customers about upgrading off of old ADConnect versions. Current supported versions are 3.0+ only.

Posted Sep 23, 2016 - 19:10 UTC

Resolved

This incident has been resolved.

[UPDATED] Engineers will be posting an RCA within the next 7 days. We are taking additional time to investigate SSO issues, and will provide the customer RCA next week.

Posted Sep 15, 2016 - 15:01 UTC

Monitoring

Site Reliability is currently monitoring SSO systems for performance and stability. Automated testing has seen a significant drop in errors over the past 15 minutes, and we are seeing an improvement in overall performance.

Posted Sep 15, 2016 - 13:44 UTC

Investigating

Monitoring systems have detected an issue with PingOne production systems. The Site Reliability Engineering team has been notified and is currently working on the issue. Site Reliability will update this message when the incident has been identified. Automated monitoring systems will update affected components and will resolve operational status as systems recover.

For additional questions please contact support@pingidentity.com, or follow this incident on https://status.pingidentity.com for real-time service updates.

Posted Sep 15, 2016 - 13:19 UTC
