PingID Service Interruption (.com)
Incident Report for Ping Identity
Postmortem

Incident Summary

On July 23rd, 2018, beginning at 06:00 UTC, a regularly scheduled database backup overlapped with a database repair task that had been running since earlier in the day, causing performance issues on the underlying PingID database cluster. From 06:00 to 06:36 UTC, authentication attempts may have appeared slower than usual. At 06:36 UTC, authentication attempts began failing as database performance degraded significantly. Service was restored at 07:22 UTC after a rolling restart of the database cluster was completed and the application nodes were restarted.

Customer Impacts

MFA with PingID for workforce serviced by our North American data center (authenticator.pingone.com) was slow or unavailable during the incident. When the initial fix (terminating the repair task and backup processes) did not resolve the issue, we forced external infrastructure heartbeats to fail so that customers with automatic bypass enabled could bypass PingID.
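
The bypass path described above depends on the heartbeat reporting a failure rather than hanging. The following is a minimal sketch of that idea, not PingID's implementation: the names heartbeat_ok, should_bypass, probe, and timeout_s are hypothetical, and the probe is assumed to be any callable that returns True when the service is healthy.

```python
import threading


def heartbeat_ok(probe, timeout_s=2.0):
    """Return True only if probe() returns a truthy value within timeout_s.

    A probe that raises, returns a falsy value, or is still running when
    the timeout expires is reported as a failed heartbeat.
    """
    result = {"ok": False}

    def run():
        try:
            result["ok"] = bool(probe())
        except Exception:
            # Any error from the probe counts as an unhealthy service.
            result["ok"] = False

    # Daemon thread: a hung probe must not keep the process alive.
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        # The probe hung; treat it as a failure instead of waiting.
        return False
    return result["ok"]


def should_bypass(probe, bypass_enabled, timeout_s=2.0):
    """Hypothetical client-side decision: bypass only if the customer has
    automatic bypass enabled and the heartbeat failed or hung."""
    return bypass_enabled and not heartbeat_ok(probe, timeout_s)
```

With a bounded wait, a hung dependency produces the same outcome as an explicit failure, so clients with automatic bypass enabled would not need a manual intervention to fall back.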

Incident Timeline

July 23, 2018 (all times in UTC)

  • 06:00 - Scheduled database backup begins.

  • 06:36 - Automated monitoring alerts on-call Site Reliability Engineer.

  • 06:45 - Issue escalated to Incident Commander.

  • 06:53 - Issue identified as database performance issue. Both repair task and backup processes terminated.

  • 07:05 - Monitoring and manual testing still show services not responding - rolling restart of database cluster instances initiated.

  • 07:12 - Forced infrastructure heartbeats to fail to ensure customers with the automatic bypass feature enabled would bypass PingID.

  • 07:22 - Database restart completed. Services recovered.

  • 07:25 - Manual verification confirms issue resolved.

Affected Services

  • PingID Services (.com)

Resolution

Service restoration occurred after all database instances restarted.

Ping Action Items

  • Add checks to backup process to ensure no other maintenance tasks are running.
  • Fix PingID Authenticator to fail heartbeat when PingID heartbeat hangs.
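
The first action item above (checking that no other maintenance task is running before a backup starts) can be sketched with a simple lock. This is an illustration under stated assumptions, not the check Ping Identity implemented: maintenance_lock, MaintenanceInProgress, and the lock path are hypothetical, and a lock file only coordinates tasks on a single host, so a database cluster would need an equivalent shared lock.

```python
import os
from contextlib import contextmanager


class MaintenanceInProgress(RuntimeError):
    """Raised when another maintenance task already holds the lock."""


@contextmanager
def maintenance_lock(lock_path, task_name):
    """Hold an exclusive maintenance lock for the duration of the block."""
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        try:
            with open(lock_path) as f:
                holder = f.read().strip() or "another task"
        except OSError:
            holder = "another task"
        raise MaintenanceInProgress(
            f"{task_name} not started: {holder} is still running"
        ) from None
    try:
        # Record who holds the lock so a refused task can report it.
        with os.fdopen(fd, "w") as f:
            f.write(task_name)
        yield
    finally:
        # Release the lock even if the maintenance task fails.
        os.remove(lock_path)
```

A backup job wrapped in `with maintenance_lock(path, "backup"):` would then refuse to start while a repair task holds the same lock, rather than running alongside it.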
Posted Jul 25, 2018 - 02:02 UTC

Resolved
This incident has been resolved.
Posted Jul 23, 2018 - 07:45 UTC
Monitoring
A fix has been implemented and we are closely monitoring the systems for errors.
Posted Jul 23, 2018 - 07:28 UTC
Update
We are continuing to work on a fix for the issue. While the fix is being implemented, we have forced infrastructure heartbeats to fail to ensure customers with the automatic bypass feature enabled can bypass the authentication feature.
Posted Jul 23, 2018 - 07:12 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jul 23, 2018 - 06:53 UTC
Investigating
Monitoring systems have detected an issue with Ping Identity's PingID Service. The Site Reliability Engineering team has been notified and is currently working on the issue. We will update this message when the issue has been identified. Automated monitoring systems will update affected components and will resolve operational status as systems recover.

For additional questions please contact support@pingidentity.com, or follow this incident on https://status.pingidentity.com for real-time service updates.
Posted Jul 23, 2018 - 06:50 UTC
This incident affected: PingID Services (PingID Authenticator - North America (.com), PingID Server - North America (.com)).