Admin Portal Service Interruption
Incident Report for Ping Identity
Postmortem

Incident Summary

On April 5th, 2018, beginning at 01:50 UTC, the PingOne administrative web portal (admin.pingone.com) was offline for 48 minutes. The outage was caused by three independent defects in internal tools; all three had to occur together for the outage to happen. These tools are responsible for deploying new application servers to production and for routing traffic using a 'canary release' strategy, in which newly deployed code is automatically monitored for errors. As a result of the failure, load balancers were configured to route traffic to servers that did not exist.
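
The failure condition described above (a load balancer pointed at a release with no servers behind it) can be illustrated with a minimal sketch of a pre-switch guard. This is not Ping Identity's tooling; the names Release, LoadBalancer, UnsafeSwitchError, and switch_traffic are hypothetical and exist only to show the idea of refusing a traffic switch when the target release has no healthy servers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Release:
    """A deployed version and the servers backing it (hypothetical model)."""
    name: str
    healthy_servers: List[str] = field(default_factory=list)


@dataclass
class LoadBalancer:
    """Minimal stand-in for a load balancer that routes to one active release."""
    active: Release


class UnsafeSwitchError(Exception):
    """Raised when a switch would leave the load balancer with too few servers."""


def switch_traffic(lb: LoadBalancer, candidate: Release, min_healthy: int = 1) -> None:
    """Point the load balancer at `candidate` only if it has enough healthy servers.

    When the check fails, the currently active release is left untouched.
    """
    healthy = len(candidate.healthy_servers)
    if healthy < min_healthy:
        raise UnsafeSwitchError(
            f"release {candidate.name!r} has {healthy} healthy server(s); "
            f"need at least {min_healthy}"
        )
    lb.active = candidate


if __name__ == "__main__":
    # Assumed example data: one live release and one candidate with no servers.
    lb = LoadBalancer(active=Release("v1", ["10.0.0.1"]))
    try:
        switch_traffic(lb, Release("v2", []))
    except UnsafeSwitchError as err:
        print(f"switch refused: {err}")
    print(f"still routing to: {lb.active.name}")
```

The design choice in this sketch is that the check runs before the active release is changed, so a failed activation leaves traffic on the last known-good servers.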

Customer Impacts

On April 5th, 2018, beginning at 01:50 UTC, customers were unable to log in to the PingOne administrative web portal (admin.pingone.com). Full service and performance were restored for all customers at 02:38 UTC.

Incident Timeline

April 5th, 2018 (all times in UTC)

  • 01:50 - Monitoring systems detect issues with admin.pingone.com. On-call SRE notified.

  • 01:57 - On-call SRE escalates to Incident Commander.

  • 02:05 - Investigation shows no servers capable of carrying web portal traffic.

  • 02:30 - Services begin recovering after redeploy of the application.

  • 02:38 - Services fully recovered.

  • 02:40 - Automated deployment process blocked as root cause investigation continues.

April 6th, 2018 (all times in UTC)

  • 19:38 - Automated deployment process re-opened.

Affected Services

PingOne Admin Portal (North America)

Resolution

Services were restored after the application was redeployed to production.

Ping Action Items

  • Address defect in build pipeline allowing future releases to be deactivated. RESOLVED

  • Address defect in build pipeline allowing releases to become stuck while activating. RESOLVED

Posted Apr 10, 2018 - 16:03 UTC

Resolved
This incident has been resolved.
Posted Apr 05, 2018 - 02:38 UTC
Identified
Our engineers have identified the issue and are working to restore service. Next update in 15 minutes.
Posted Apr 05, 2018 - 02:26 UTC
Investigating
Monitoring systems have detected an issue with PingOne's global administration portal (https://admin.pingone.com). The Site Reliability Engineering team has been notified and is currently working the issue to resolution. Site Reliability will update this message when the incident has been identified. Automated monitoring systems will update affected components and will resolve operational status as systems recover.

For additional questions please contact support@pingidentity.com, or follow this incident on https://status.pingidentity.com for real-time service updates.
Posted Apr 05, 2018 - 01:50 UTC