PingID Service Interruption
Incident Report for Ping Identity
Postmortem

Incident Summary

On December 8, 2017, beginning at 04:45 UTC, a node in the multi-node database cluster became unresponsive. Once the node was recovered, the application began responding, although very slowly. Services were fully restored once all nodes were responsive.

This incident exposed an issue in the configuration between the application servers and the database cluster. When the database node failed, the application assumed the database was not in a consistent state and stopped responding to requests.
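The report does not name the database or the setting involved. Purely as an illustration of how a single node failure can make a client treat a cluster as unusable, here is a minimal sketch of a majority-quorum rule. The function names and replica counts are hypothetical and are not taken from the PingID configuration.

```python
def quorum_size(replica_count: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replica_count // 2 + 1


def can_serve(replica_count: int, healthy_replicas: int) -> bool:
    """True if enough replicas are reachable to satisfy a majority quorum."""
    return healthy_replicas >= quorum_size(replica_count)


# Hypothetical numbers, not taken from the incident report.
print(can_serve(replica_count=3, healthy_replicas=2))  # True: 2 of 3 is a majority
print(can_serve(replica_count=2, healthy_replicas=1))  # False: 1 of 2 is not
```

Under a rule like this, a client whose view of the cluster is too small or out of date may conclude that a majority is unreachable after losing only one node, even though the rest of the cluster is healthy.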

Customer Impacts

On December 8, 2017, beginning at 04:45 UTC, customers were unable to authenticate with PingID MFA through our North American data centers (authenticator.pingone.com). Services began recovering at 05:39 UTC, at which point some authentication sessions succeeded but with longer-than-normal delays. Full service and performance were restored for all customers at 06:25 UTC.

During this incident, the PingID local bypass feature did not trigger as intended because the infrastructure-level health check continued to pass.
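The difference between a check that passes while the application cannot serve requests and one that would fail can be sketched as follows. This is an assumed illustration, not the PingID health check: the function names and the `probe_database` callback are hypothetical.

```python
from typing import Callable


def shallow_health_check(process_running: bool) -> bool:
    """Infrastructure-level check: passes whenever the server process is up."""
    return process_running


def deep_health_check(process_running: bool,
                      probe_database: Callable[[], bool]) -> bool:
    """Application-level check: also requires a database round trip to succeed."""
    if not process_running:
        return False
    try:
        return probe_database()
    except Exception:
        # Any failure to reach the database counts as unhealthy.
        return False


def failing_probe() -> bool:
    # Hypothetical stand-in for a database query that times out.
    raise TimeoutError("database did not respond")


print(shallow_health_check(True))              # True
print(deep_health_check(True, failing_probe))  # False
```

A fallback mechanism keyed to the shallow check would see a healthy service in this scenario, while one keyed to the deep check would see a failure.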

Incident Timeline

December 08, 2017 (all times in UTC)

  • 04:45 - Monitoring systems detect issues with PingID services. On-call SRE notified.
  • 04:55 - Investigation shows a database node was not responsive.
  • 04:59 - On-call SRE escalates to Incident Commander. Database SME engaged.
  • 05:15 - Database node successfully recovered and brought back into the cluster.
  • 05:22 - Testing confirms that the application is still not responsive. Rolling restart of application servers started.
  • 05:39 - Services begin recovering. Some authentication requests are successful, but users experience longer-than-normal delays.
  • 06:25 - Services fully recovered.

Affected Services

PingID Service (North America)

Resolution

Partial restoration of the PingID services occurred when the failed database node was added back into the multi-node cluster. Full service restoration occurred after all database nodes had fully replicated data sets.

Ping Action Items

  • Audit all database and application configurations to ensure they contain correct database cluster information. To be completed in December 2017.
  • Additional nodes will be added to the PingID database cluster to ensure proper availability and data consistency in the event of an availability zone failure. To be scheduled in December 2017.
  • PingID database cluster software will be upgraded and tuned. To be scheduled in December 2017.
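
The reasoning behind adding nodes to survive an availability zone failure can be illustrated with a small check. This sketch assumes a majority-quorum cluster, which the report does not confirm; the zone names and node counts are hypothetical.

```python
from typing import Mapping


def survives_zone_loss(nodes_per_zone: Mapping[str, int]) -> bool:
    """True if a strict majority of nodes remains after losing any one zone."""
    total = sum(nodes_per_zone.values())
    majority = total // 2 + 1
    return all(total - count >= majority for count in nodes_per_zone.values())


# Hypothetical layouts; zone names and counts are illustrative only.
print(survives_zone_loss({"zone-a": 2, "zone-b": 1}))               # False
print(survives_zone_loss({"zone-a": 1, "zone-b": 1, "zone-c": 1}))  # True
```

In the first layout, losing zone-a leaves one node out of three, which is not a majority. Spreading the same three nodes over three zones leaves two out of three after any single zone failure.
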
Posted Dec 13, 2017 - 12:40 UTC

Resolved
This incident has been resolved.
Posted Dec 08, 2017 - 06:36 UTC
Monitoring
PingID services have recovered and authentications are successful. The Site Reliability Engineering team is monitoring to ensure the system is stable.
Posted Dec 08, 2017 - 06:25 UTC
Update
PingID services are still recovering. Successful authentication requests are increasing, although push notifications are significantly delayed.
Posted Dec 08, 2017 - 06:07 UTC
Update
PingID services are in the process of recovering. We are continuing to monitor the recovery process.
Posted Dec 08, 2017 - 05:39 UTC
Identified
The Site Reliability Engineering team has identified the issue and is working on recovering services now. Next status update in 15 minutes.
Posted Dec 08, 2017 - 05:22 UTC
Investigating
Monitoring systems have detected an issue with Ping Identity's PingID Service. The Site Reliability Engineering team has been notified and is currently working on the issue. We will update this message once the issue has been identified.

For additional questions, please contact support@pingidentity.com, or follow this incident on https://status.pingidentity.com for real-time service updates.
Posted Dec 08, 2017 - 05:06 UTC
This incident affected: PingID Services (PingID Authenticator - North America (.com), PingID Server - North America (.com)).