A service outage in the Amazon Web Services (AWS) US-EAST-1 region affected the Simple Email Service (SES) and Simple Storage Service (S3), which our internal applications depend on for certain functions. Ping Site Reliability Engineering (SRE) responded by posting notification of the impact on services that rely on S3 and SES and by communicating the expected client impact.
Following an unintended update to the Amazon S3 infrastructure by engineers at Amazon, systems began to error out and fail. Due to the distributed nature of the PingOne and PingID infrastructure, this disruption did not halt the functioning of all components. During this disruption, customers may have experienced the following abnormal application behavior:
- dock icons not loading
- admin portal applications unable to edit or load
- admin portal sometimes receiving 504 gateway timeouts
- PingID emails not sent due to impact on Amazon SES
- PingID authenticator 504 gateway timeouts
The root cause and preventive actions from our provider, Amazon Web Services, can be found at this URL: https://aws.amazon.com/message/41926/.
Incident Timeline - February 28, 2017 (MT)
- 1132 - First reports of issue - Customer cannot access dashboard
- 1135 - Other support team members confirm issue
- 1209 - SRE creates incident with status Identified
- 1249 - Support team reports PingID email being affected
- 1342 - Support team reports some improvement, but still issues with icons
- 1352 - AWS reports S3 service operations beginning to recover
- 1412 - AWS reports object retrieval, listing, and deletion functions are fully recovered. Adding new objects still impacted
- 1508 - AWS reports functionality to S3 fully recovered
- 1530 - After confirming functionality restored, SRE updates incident to resolved
- 1610 - eBay reports continued issues with PingID emails due to SES impact
- 1745 - AWS reports SES functionality fully recovered
Affected Components
- Admin Portal Monitor
- Administration Portal
- PingOne dock (North America)
- PingID App
- PingID Authenticator (North America)
Resolution centered on AWS restoring availability of the S3 and SES services.
Ping Action Items
- Improve the PingOne Admin Portal to allow creation and editing of applications while S3 is not available.
- Enable cross-region replication for critical S3 resources.
- Support multiple regions for SES, or another redundant email service, with MX records in multiple regions.
- Add detection of SES outages.
- If AWS is detected to be down, switch to a redundant email service or to SES in a different AWS region.
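The outage detection and email failover items above could be approached along the following lines. This is a minimal sketch under stated assumptions, not Ping's implementation: the `FailoverMailer` class, the backend names, and the failure threshold of 3 are all hypothetical, and each backend is modeled as a plain callable so the example runs without any AWS SDK. In practice each callable might wrap an SES client for a specific region or a different email provider.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical sender signature: (recipient, subject, body); raises on failure.
Sender = Callable[[str, str, str], None]


@dataclass
class FailoverMailer:
    """Try email backends in order, preferring ones that look healthy."""

    # (name, sender) pairs in priority order, e.g. primary SES region first.
    backends: List[Tuple[str, Sender]]
    # Consecutive failures before a backend is treated as down (assumed value).
    failure_threshold: int = 3
    # Consecutive failure count per backend name.
    failures: Dict[str, int] = field(default_factory=dict)

    def is_down(self, name: str) -> bool:
        """Outage detection: too many consecutive send failures."""
        return self.failures.get(name, 0) >= self.failure_threshold

    def send(self, to: str, subject: str, body: str) -> str:
        """Send via the first backend that works; return its name."""
        # Stable sort: healthy backends keep their priority order and come
        # first; backends marked down are only tried as a last resort.
        ordered = sorted(self.backends, key=lambda b: self.is_down(b[0]))
        last_error = None
        for name, sender in ordered:
            try:
                sender(to, subject, body)
            except Exception as exc:
                self.failures[name] = self.failures.get(name, 0) + 1
                last_error = exc
                continue
            self.failures[name] = 0  # a success clears the failure streak
            return name
        raise RuntimeError("all email backends failed") from last_error
```

With this design, a backend marked down is not retried while a healthy alternative exists, so a separate health probe or a manual reset of its failure count would be needed to move traffic back to the primary after it recovers.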