PingID Service Degradation (.com)
Incident Report for Ping Identity
Postmortem

Incident Summary

On September 23, 2019, beginning at 22:30 UTC, some customers were unable to authenticate with PingID MFA in our Oregon region because of an underlying database issue. Restarting the database resolved the issue. Traffic was not automatically redirected to our Ohio region because the Oregon service was degraded rather than in a full outage state.

Customer Impacts

On September 23, 2019, beginning at 22:30 UTC, some customers were unable to authenticate with PingID MFA hosted in our Oregon region. Authentications in our Ohio region were not affected. Services were fully recovered at 23:02 UTC.

Incident Timeline

September 23, 2019 (all times in UTC)

  • 22:22 - Monitoring systems detect longer-than-normal response times for PingID services in Oregon. Development begins investigation.
  • 22:30 - Monitoring systems detect higher-than-normal failure rates. Customers start experiencing authentication timeouts.
  • 22:40 - SRE team initiates diagnostic tests on the database cluster. One node is found to be in an inconsistent state.
  • 22:45 - SRE team completes restart of the problematic node. Error rate decreases but remains above normal levels.
  • 22:55 - SRE team completes rolling restart of all database nodes.
  • 23:02 - Error rates return to normal. All services recovered.

Affected Services

  • PingID Authenticator (.com)
  • PingID Server (.com)

Resolution

Service was restored after the problematic database node was restarted, followed by a rolling restart of all database nodes. Normally, a failed node is taken out of the cluster automatically. In this edge case, the rest of the cluster saw the node as operating normally, but the node itself saw the rest of the cluster as unavailable, so it was not removed automatically.
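
The edge case described above, where peers see a node as healthy while that node sees its peers as unavailable, can be illustrated with a small check over each node's membership view. This is a minimal sketch for illustration only, not the monitoring Ping Identity uses: the function name, the `views` structure, and the node names `db-1` to `db-3` are all hypothetical, and it assumes each node can report the set of peers it currently considers reachable.

```python
from typing import Dict, List, Set


def find_asymmetric_nodes(views: Dict[str, Set[str]]) -> List[str]:
    """Return nodes that most peers see as up but that cannot see most peers.

    `views` (hypothetical structure) maps each node name to the set of
    peers that node currently reports as reachable.
    """
    nodes = set(views)
    suspects = []
    for node in sorted(nodes):
        peers = nodes - {node}
        if not peers:
            continue
        # Peers that consider this node healthy.
        seen_up_by = {p for p in peers if node in views[p]}
        # Peers that this node itself considers reachable.
        sees_up = views[node] & peers
        # Asymmetric: a majority of peers see the node as up,
        # but the node does not see a majority of its peers.
        if len(seen_up_by) * 2 > len(peers) and len(sees_up) * 2 <= len(peers):
            suspects.append(node)
    return suspects


# Hypothetical three-node cluster: db-3 looks healthy to its peers
# but reports that it cannot reach any of them.
example_views = {
    "db-1": {"db-2", "db-3"},
    "db-2": {"db-1", "db-3"},
    "db-3": set(),
}
print(find_asymmetric_nodes(example_views))
```

A node flagged this way would not be caught by a check that only asks the cluster whether each member is up, which is why both directions of the view are compared.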

Ping Action Items

  • Improve regional failover process to account for degraded services.
  • Improve monitoring to detect the edge case of a single inconsistent database node.
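
As an illustration of the first action item, a failover decision that accounts for degraded service (and not only a full outage) might look like the sketch below. It is an assumption-laden example, not Ping Identity's failover logic: the `HealthSample` type, the function name, and every threshold value are hypothetical and chosen only to show the idea of triggering on sustained degradation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class HealthSample:
    """One monitoring interval for a region (hypothetical shape)."""
    error_rate: float       # fraction of failed authentications, 0.0 to 1.0
    p95_latency_ms: float   # 95th percentile response time in milliseconds


def should_fail_over(
    samples: Sequence[HealthSample],
    max_error_rate: float = 0.05,         # hypothetical threshold
    max_p95_latency_ms: float = 2000.0,   # hypothetical threshold
    sustained: int = 3,                   # hypothetical number of intervals
) -> bool:
    """Return True when the most recent `sustained` samples are all degraded.

    A sample counts as degraded if either its error rate or its latency
    exceeds the threshold, so a partial failure can trigger failover
    without the region being fully down.
    """
    if len(samples) < sustained:
        return False
    recent = samples[-sustained:]
    return all(
        s.error_rate > max_error_rate or s.p95_latency_ms > max_p95_latency_ms
        for s in recent
    )


# Hypothetical history: one healthy interval, then three degraded ones.
history = [
    HealthSample(error_rate=0.01, p95_latency_ms=300.0),
    HealthSample(error_rate=0.20, p95_latency_ms=2500.0),
    HealthSample(error_rate=0.25, p95_latency_ms=3000.0),
    HealthSample(error_rate=0.15, p95_latency_ms=1800.0),
]
print(should_fail_over(history))
```

Requiring several consecutive degraded intervals is one way to avoid redirecting traffic on a brief spike; the right values would depend on the service's normal behavior.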
Posted Sep 26, 2019 - 15:55 UTC

Resolved
The issue has been resolved.
Posted Sep 23, 2019 - 23:10 UTC

Investigating
We are seeing increased response times in one of our regions and are investigating the issue.
Posted Sep 23, 2019 - 22:40 UTC

This incident affected: PingID Services (PingID Authenticator - North America (.com), PingID Server - North America (.com)).