PingID Service Interruption
Incident Report for Ping Identity

Incident Summary

PingID stopped functioning correctly, which prevented users from performing second-factor authentication. The root cause of the outage was a data replication failure in the session management system. In an unusual sequence of events, a failed node was replaced, and the subsequent rebalancing of session data stalled the entire system. Mitigation actions were taken and system functionality was restored (see below).

Customer Impact

Customers in North America were unable to use PingID for the duration of the outage. Depending on their specific configuration, some customers may have experienced a broader outage that also affected users outside North America.

Incident Timeline - May 15, 2017 (MDT)

  • 1500 - Intermittent PingID errors reported
  • 1510 - Operations Team begins investigation
  • 1600 - Internal escalation process initiated
  • 1602 - Status monitoring page updated
  • 1615 - Development Team engaged for troubleshooting
  • 1630 - PingID service fully down
  • 1700 - Restarting web services reduces error rate
  • 1715 - Status monitoring page updated
  • 1730 - Service is restored and internal validation started
  • 1756 - Status monitoring page updated

Affected Services

  • PingID Services NA
  • PingID App NA
  • PingID Authenticator NA
  • PingID Server NA

Resolution

Restarting the web services allowed the stateless session management system to fully recover.

Ping Action Items

  • Improve error monitoring of synthetic tests to detect this type of failure sooner.
  • Improve status update process and method.
  • Implement changes to the PingID session management system to make it more resilient. ETA end of May.
Posted May 18, 2017 - 14:43 MDT

Resolved
This incident has been resolved. PingID service in North America is back to normal.
Posted May 15, 2017 - 17:56 MDT
Identified
Our Site Reliability Engineering team has identified the issue and is working on a fix.
Posted May 15, 2017 - 17:15 MDT
Investigating
Monitoring systems have detected an issue with the PingID service. The Site Reliability Engineering team has been notified and is currently working on the issue. We will update this message when the cause has been identified. Automated monitoring systems will update the affected components and restore their operational status as systems recover.

For additional questions, please contact support@pingidentity.com, or follow this incident on https://status.pingidentity.com for real-time service updates.
Posted May 15, 2017 - 16:02 MDT
This incident affected: PingID Services (PingID App, PingID Authenticator (North America), PingID Server (North America)).