Incident Summary
On Friday, December 16, 2016, PingOne SSO for SaaS Apps (formerly Application Provider Services (APS)) customers and some PingOne SSO (formerly CAS Lite) customers experienced an outage in the afternoon following an update to PingOne's Token Processor Nodes (TPN).
Customer Impact
Following the update to the TPN, attributes that were previously optional became required, which caused SSO206 error messages to appear for end users. Accounts that did not have the required attributes set experienced SSO206 errors for nearly four hours, until the change was rolled back.
Incident Timeline - December 16, 2016 (MT)
- 1219 - Token Processing changes committed to production Fastlane.
- 1402 - Token Processing changes staged (10% live) in production for canary monitoring (20 minutes long).
- 1403 - First SSO206 events appear in the PingOne log files. Only 30 errors occur during the full 20 minutes of monitoring, which is not sufficient to trigger a failure in the build deployment pipeline.
- 1425 - Token Processing changes activated (100% live) in production after passing all tests and canary monitoring.
- 1426 - Significantly more SSO206 events appear in the PingOne log files.
- 1622 - Support receives initial customer calls reporting SSO206 events.
- 1722 - Following triage, Support escalates to SRE and DEV teams.
- 1745 - Token Processing code change is rolled back. SRE and DEV teams observe an immediate drop in SSO206 errors across all production services.
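The canary step in the timeline passed because 30 errors in 20 minutes fell below the pipeline's failure threshold. The sketch below is illustrative only and does not reflect the actual pipeline. It contrasts a fixed error-count ceiling with a check relative to a pre-change baseline. The names `CanaryWindow`, `absolute_gate`, and `relative_gate`, the ceiling of 100, the ratio of 5.0, and the baseline of zero prior errors are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CanaryWindow:
    """Error counts for one canary monitoring window (hypothetical structure)."""
    new_errors: int       # occurrences of the error code during the canary window
    baseline_errors: int  # occurrences in a comparable window before the change (assumed available)


def absolute_gate(window: CanaryWindow, max_errors: int) -> bool:
    """Pass when the raw error count stays at or below a fixed ceiling."""
    return window.new_errors <= max_errors


def relative_gate(window: CanaryWindow, max_ratio: float) -> bool:
    """Pass only when errors did not grow beyond max_ratio times the baseline.

    A baseline of zero is treated as one so that a brand-new error code
    is compared against a small non-zero floor instead of dividing by zero.
    """
    baseline = max(window.baseline_errors, 1)
    return window.new_errors / baseline <= max_ratio


# 30 errors is the figure from the timeline; the zero baseline is an assumption.
window = CanaryWindow(new_errors=30, baseline_errors=0)
print(absolute_gate(window, max_errors=100))  # True: the fixed ceiling lets the canary pass
print(relative_gate(window, max_ratio=5.0))   # False: the relative check flags the new errors
```

Under these assumptions, a relative check would have flagged the newly appearing error code even at 10% traffic, while a fixed ceiling sized for full traffic would not.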
Affected Services
- PingOne Services - Single Sign-On (PingOne SSO for SaaS Apps/APS)
Resolution
The Token Processing update was rolled back.
Ping Action Items
- Add PingOne “Invited SSO” to system tests - [SSD-3712]
- Make all system attributes optional - [SSD-3711]
- Improve response code reporting for improved monitoring capability - [SSD-1904]
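As a rough illustration of the "make all system attributes optional" action item, the sketch below collects whichever attributes an account has set and skips the rest instead of failing. This is not PingOne's token processing code. The attribute names and the `collect_attributes` helper are hypothetical.

```python
from typing import Dict, Mapping, Optional

# Hypothetical attribute names, used only for illustration.
SYSTEM_ATTRIBUTES = ("email", "first_name", "last_name")


def collect_attributes(account: Mapping[str, Optional[str]]) -> Dict[str, str]:
    """Return the system attributes that are set on the account.

    Missing or empty attributes are skipped rather than treated as errors,
    so an account with partial data can still be processed.
    """
    collected: Dict[str, str] = {}
    for name in SYSTEM_ATTRIBUTES:
        value = account.get(name)
        if value:  # skip attributes that are absent, None, or empty
            collected[name] = value
    return collected


# An account with only one attribute set still yields a usable result.
print(collect_attributes({"email": "user@example.com", "first_name": None}))
```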