On July 23, 2018, beginning at 06:00 UTC, a regularly scheduled database backup overlapped with a database repair task that had been running since earlier in the day, and the combined load caused performance issues on the underlying PingID database cluster. From 06:00 to 06:36 UTC, authentication attempts may have appeared slower than usual. At 06:36 UTC, authentication attempts began failing as database performance degraded significantly. Service was restored at 07:22 UTC after a rolling restart of the database cluster was completed and the application nodes were restarted.
MFA with PingID for workforce serviced by our North American data center (authenticator.pingone.com) was slow or unavailable during the incident. When the initial fix did not resolve the issue, we forced external infrastructure heartbeats to fail so that customers with automatic bypass enabled could bypass PingID.
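The relationship between a failed heartbeat and automatic bypass can be pictured with a minimal sketch. This is an illustration only, not PingID's implementation: the `MfaPolicy` type, the `automatic_bypass_enabled` field, and the `should_bypass_mfa` function are hypothetical names introduced here, and the sketch assumes a single boolean heartbeat result.

```python
from dataclasses import dataclass


@dataclass
class MfaPolicy:
    # Hypothetical field standing in for the "automatic bypass" feature.
    automatic_bypass_enabled: bool


def should_bypass_mfa(heartbeat_ok: bool, policy: MfaPolicy) -> bool:
    """Return True when a sign-on would skip MFA in this simplified model.

    Assumption: bypass happens only when the heartbeat has failed AND the
    customer has opted in to automatic bypass.
    """
    return (not heartbeat_ok) and policy.automatic_bypass_enabled


# Heartbeat forced to fail, as at 07:12 in the timeline.
opted_in = MfaPolicy(automatic_bypass_enabled=True)
opted_out = MfaPolicy(automatic_bypass_enabled=False)

print(should_bypass_mfa(heartbeat_ok=False, policy=opted_in))   # True
print(should_bypass_mfa(heartbeat_ok=False, policy=opted_out))  # False
print(should_bypass_mfa(heartbeat_ok=True, policy=opted_in))    # False
```

Under this model, forcing the heartbeat to fail changes behavior only for customers who have enabled automatic bypass, which matches the scope described above.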
July 23, 2018 (all times in UTC)
06:00 - Scheduled database backup begins.
06:36 - Automated monitoring alerts on-call Site Reliability Engineer.
06:45 - Issue escalated to Incident Commander.
06:53 - Issue identified as database performance issue. Both the repair task and the backup process terminated.
07:05 - Monitoring and manual testing still show services not responding - rolling restart of database cluster instances initiated.
07:12 - Forced infrastructure heartbeats to fail to ensure customers with the automatic bypass feature enabled would bypass PingID.
07:22 - Database restart completed. Services recovered.
07:25 - Manual verification confirms issue resolved.
Service restoration occurred after all database instances restarted.
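The durations implied by the timestamps in this report can be checked with a short calculation. This is a sketch for the reader's convenience, not tooling used during the incident; the helper name `minutes_between` is hypothetical, and it assumes all times fall on the same UTC day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the report (all UTC, July 23, 2018).
degraded = minutes_between("06:00", "06:36")      # slower authentication
failing = minutes_between("06:36", "07:22")       # failing authentication
to_escalation = minutes_between("06:36", "06:45") # alert to Incident Commander

print(degraded)            # 36
print(failing)             # 46
print(degraded + failing)  # 82
print(to_escalation)       # 9
```

By this reckoning, authentication was slow for 36 minutes and failing for 46 minutes, for a total impact window of 82 minutes.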