PingID Service Interruption
Incident Report for Ping Identity

Incident Summary

PingID stopped functioning correctly, which prevented users from performing second-factor authentication. The root cause of the outage was a data replication failure in the session management system. In an unusual sequence of events, a failed node was replaced and the subsequent rebalancing of session data stalled the entire system. Mitigation actions were taken and system functionality was restored (see below).

Customer Impact

Customers were not able to use PingID for the duration of the outage.

Incident Timeline - Apr 17, 2017 (MDT)

  • 1745 - PingID errors reported
  • 1750 - Operations Team begins investigation
  • 1751 - System monitoring indicates spike in HttpServerError 500
  • 1753 - Web server stack trace shows problem connecting to the session management system
  • 1756 - Internal escalation process initiated
  • 1806 - Synthetic testing validates problem
  • 1815 - Restarting web services reduces error rate
  • 1817 - Load balancing mechanism marks all nodes as down
  • 1822 - Heartbeats for all nodes return to normal
  • 1832 - Status monitoring page updated
  • 1836 - Service is restored
  • 1851 - Status monitoring page updated
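
The report does not state elapsed times, but they can be derived from the timeline above. The following is a minimal sketch in Python that computes minutes elapsed since the first reported errors. The timestamps are copied from the timeline; the dictionary keys and function names are illustrative assumptions, not part of any Ping Identity tooling.

```python
from datetime import datetime

# Timestamps copied from the incident timeline (Apr 17, 2017, MDT).
# The key names are hypothetical labels chosen for this sketch.
TIMELINE = {
    "errors_reported": "1745",
    "synthetic_test_validates": "1806",
    "status_page_first_update": "1832",
    "service_restored": "1836",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HHMM timestamps."""
    fmt = "%H%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def summarize(timeline: dict) -> dict:
    """Map each later event to minutes elapsed since the first error report."""
    start = timeline["errors_reported"]
    return {
        name: minutes_between(start, stamp)
        for name, stamp in timeline.items()
        if name != "errors_reported"
    }


if __name__ == "__main__":
    for name, minutes in summarize(TIMELINE).items():
        print(f"{name}: +{minutes} min")
```

Under these assumptions, synthetic testing validated the problem 21 minutes after the first reported errors, the status page was first updated after 47 minutes, and service was restored after 51 minutes.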

Affected Services

  • PingID Services
  • PingID App
  • PingID Authenticator
  • PingID Server

Resolution

Restarting the web services allowed the stateless session management system to fully recover.

Ping Action Items

  • Improve error monitoring of synthetic tests to detect this type of failure sooner.
  • Improve status update process and method.
  • Change the PingID session management system implementation to be more resilient.
Posted Apr 20, 2017 - 16:23 MDT

Resolved
This incident has been resolved. PingID services in all regions are back to normal.
Posted Apr 17, 2017 - 18:51 MDT
Investigating
Monitoring systems have detected an issue with the PingID service. The Site Reliability Engineering team has been notified and is currently working on the issue. We will update this message when the cause of the incident has been identified. Automated monitoring systems will update affected components and restore their operational status as systems recover.

For additional questions please contact support@pingidentity.com, or follow this incident on https://status.pingidentity.com for real-time service updates.
Posted Apr 17, 2017 - 18:32 MDT
This incident affected: PingOne Services (North America Critical Path, Europe Critical Path, Australia Critical Path, Admin Portal Monitor, Directory API, Single Sign-on, Single Sign-On (PingOne SSO for SaaS Apps/APS), Administration Portal, OAuth Service, Administration API, AD Connect & Routing Service, PingOne dock (North America), PingOne dock (Europe), PingOne dock (Australia), Directory Login (North America), Directory Login (Europe), Directory Login (Australia), Directory API (North America), Directory API (Europe), Directory API (Australia), Office365 Service (North America), Office365 Service (Europe), Office365 Service (Australia), SCIM Provisioning (North America), SCIM Provisioning (Europe), SCIM Provisioning (Australia)) and PingID Services (PingID App, PingID Authenticator (North America), PingID Authenticator (Europe), PingID Authenticator (Australia), PingID Server (North America), PingID Server (Europe), PingID Server (Australia), Twilio SMS, Twilio REST API).