PingID and Dock Outage (.com)
Incident Report for Ping Identity
Postmortem

Incident Summary

On May 2nd, 2018 beginning at 12:05 UTC, the PingID and PingOne Dock applications began getting errors connecting to a database hosted on AWS RDS. A manual failover of the database node resolved the issue.

PingOne and PingID database infrastructure runs on AWS RDS (an Amazon managed service) and is configured for automated failover in the event of a primary database outage. During this incident, the primary database became unhealthy, but the RDS service did not fail over the database properly, and a manual failover was required to restore service.

Customer Impacts

On May 2nd, 2018 beginning at 12:05 UTC, customers were unable to reach the PingOne Dock (desktop.pingone.com) or to authenticate with PingID MFA (authenticator.pingone.com) hosted in our North American data centers. Services began recovering at 12:31 UTC, at which point PingID authentication sessions were succeeding again. Because of the backlog of requests to the PingOne Dock, additional servers were deployed to handle the increased load, and full service was restored at 12:49 UTC.

Incident Timeline

May 2, 2018 (all times in UTC)

  • 12:05 - Monitoring systems detect issues with PingID and PingOne Dock services. On-call SRE notified.
  • 12:20 - Investigation shows an increased number of errors connecting to the database.
  • 12:27 - On-call SRE initiates manual failover of the suspect RDS instance.
  • 12:31 - PingID IDP services fully restored.
  • 12:42 - Additional Dock servers deployed to handle increased load.
  • 12:49 - PingOne Dock services fully restored.
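
The intervals implied by the timeline above can be worked out directly. The following is a minimal sketch that uses only the timestamps listed in this report; the dictionary keys and the helper name are illustrative and not part of any Ping Identity tooling.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps (UTC) copied from the incident timeline; key names are illustrative.
timeline = {
    "detected": "2018-05-02 12:05",
    "manual_failover": "2018-05-02 12:27",
    "pingid_restored": "2018-05-02 12:31",
    "dock_restored": "2018-05-02 12:49",
}


def minutes_between(start_key, end_key):
    """Return the whole minutes elapsed between two timeline events."""
    start = datetime.strptime(timeline[start_key], FMT)
    end = datetime.strptime(timeline[end_key], FMT)
    return int((end - start).total_seconds() // 60)


print(minutes_between("detected", "manual_failover"))   # 22
print(minutes_between("detected", "pingid_restored"))   # 26
print(minutes_between("detected", "dock_restored"))     # 44
```

By this arithmetic, PingID was affected for about 26 minutes and the PingOne Dock for about 44 minutes, with the manual failover initiated 22 minutes after detection.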

Affected Services

  • PingID Authenticator (.com)
  • PingOne Dock (.com)

Resolution

Service restoration occurred after a manual failover of an AWS RDS database instance.
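
This report does not state how the manual failover was performed. For illustration only, one common way to force a failover on a Multi-AZ RDS instance is a reboot with failover, which boto3 exposes as `reboot_db_instance` with `ForceFailover=True`. The sketch below assumes a Multi-AZ deployment; the function name and the instance identifier are hypothetical.

```python
def force_failover(rds_client, instance_id):
    """Reboot a Multi-AZ RDS instance with failover so the standby is promoted.

    rds_client is expected to behave like boto3.client("rds").
    instance_id is a hypothetical DB instance identifier.
    """
    response = rds_client.reboot_db_instance(
        DBInstanceIdentifier=instance_id,
        ForceFailover=True,  # only valid for Multi-AZ deployments
    )
    # The API returns a description of the instance, including its new status.
    return response["DBInstance"]["DBInstanceStatus"]
```

With AWS credentials configured, a call might look like `force_failover(boto3.client("rds"), "example-db")`. Passing the client in as an argument keeps the function easy to exercise against a stub during a failover drill.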

Ping Action Items

  • Complete PingID RDS to Cassandra Migration.
  • Improve PingOne Dock to reduce reliance on the read/write database.
  • Investigate improvements to RDS database failover (jointly with AWS Engineering teams).
Posted May 03, 2018 - 19:41 UTC

Resolved
Services have been restored and we are closely monitoring the systems.
Posted May 02, 2018 - 12:53 UTC
Update
PingID services have been restored. The Site Reliability Engineering team is still working to restore services to the Dock.
Posted May 02, 2018 - 12:42 UTC
Identified
The Site Reliability Engineers have identified the issue and are working to resolve it now. ETA for resolution is 15 minutes.
Posted May 02, 2018 - 12:34 UTC
Investigating
Monitoring systems have detected an issue with North American PingOne production systems. The Site Reliability Engineering team has been notified and is currently working on the issue. Site Reliability will update this message when the incident has been identified. Automated monitoring systems will update affected components and will resolve operational status as systems recover.

For additional questions please contact support@pingidentity.com, or follow this incident on https://status.pingidentity.com for real-time service updates.
Posted May 02, 2018 - 12:22 UTC
This incident affected: PingOne Services (North America Critical Path (.com), Single Sign-on, PingOne dock - North America (.com)) and PingID Services (PingID Authenticator - North America (.com), PingID Server - North America (.com)).