On April 5th, 2018 beginning at 01:50 UTC, the PingOne administrative web portal (admin.pingone.com) went offline for 48 minutes. The outage was caused by three independent defects in internal tools; all three had to occur together to cause it. These tools are responsible for deploying new application servers to production and for routing traffic to them using a 'canary release' strategy, in which new code is automatically monitored for errors. The combined failure left the load balancers configured to route traffic to servers that did not exist.
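The failure condition described above, a load balancer pointed at servers that do not exist, can be illustrated with a pre-routing safety check. The following is a minimal sketch only and does not represent the actual PingOne deployment tooling: the `Server` record, the `UnsafeRoutingError` exception, and the `validate_routing_targets` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Server:
    """Hypothetical record for one application server (illustration only)."""
    name: str
    exists: bool   # the instance is actually provisioned
    healthy: bool  # the instance passes its health check


class UnsafeRoutingError(Exception):
    """Raised when a routing change would leave no server to carry traffic."""


def validate_routing_targets(targets: List[Server]) -> List[Server]:
    """Return the targets that can carry traffic, or raise if there are none.

    Assumption: a routing change is only safe when at least one target
    both exists and is healthy.
    """
    usable = [s for s in targets if s.exists and s.healthy]
    if not usable:
        raise UnsafeRoutingError(
            "refusing to route: no existing, healthy servers in target set"
        )
    return usable
```

Under these assumptions, a deployment tool would call `validate_routing_targets` before updating the load balancer and keep the previous target set whenever the check raises.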
On April 5th, 2018 beginning at 01:50 UTC, customers were unable to log in to the PingOne administrative web portal (admin.pingone.com). Full service and performance were restored for all customers at 02:38 UTC.
April 5th, 2018 (all times in UTC)
01:50 - Monitoring systems detect issues with admin.pingone.com. On-call SRE notified.
01:57 - On-call SRE escalates to Incident Commander.
02:05 - Investigation shows no servers capable of carrying web portal traffic.
02:30 - Services begin recovering after redeploy of the application.
02:38 - Services fully recovered.
02:40 - Automated deployment process blocked as root cause investigation continues.
April 6th, 2018 (all times in UTC)
19:38 - Automated deployment process re-opened.
PingOne Admin Portal (North America)
Services were restored after the application was redeployed to production.
Address defect in build pipeline allowing future releases to be deactivated. RESOLVED
Address defect in build pipeline allowing releases to be stuck activating. RESOLVED