While removing dedicated IPs for email, we put the IPs into a “standby” mode on Amazon’s recommendation. During this process, an issue in Amazon SES (Simple Email Service) left zero active IP addresses in the dedicated pool, and our services failed to send any email from 20:25 UTC to 23:19 UTC on June 11, 2020. The issue was resolved after Amazon removed the IPs from “standby” mode.
All workflows that depend on emails were disrupted.
June 11, 2020 (all times in UTC)
- Around 20:19 - AWS put our IPs into "standby"
- 20:25 - All emails stopped going out
- 21:19 - Support notified SRE that customers’ OTP emails were not being delivered
- 21:29 - SRE opened a case with AWS about emails being down (case # 7089253821)
- 22:31 - Because the errors returned by SES and the troubleshooting done by AWS support indicated that emails were being throttled, SRE opened a second case to increase the throttling limits (case # 7089347421)
- 23:19 - Additional troubleshooting with AWS indicated that the “standby” IPs might be the issue. AWS removed the IPs from “standby” mode, and emails started flowing again from this point on.
All workflows sending email through AWS SES in NA and EU (Ireland only) were affected.
Removing the IPs from the “standby” mode resolved the issue.
Ping Action Items
- Improve alerting around SES so that an alert fires when the number of emails being sent drops below a threshold.
- Improve the process to limit future AWS provisioning events to a single region where possible.
- Improve email delivery resiliency by leveraging multiple IP pools in multiple regions.
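
As an illustration of the first action item, the check below is a minimal sketch of a send-volume alert, not the alerting that was actually deployed. It assumes per-period send counts are already being collected somewhere (for example from the SES sending metrics in CloudWatch); the function name, the threshold, and the number of consecutive periods are all hypothetical values chosen for the example.

```python
from typing import Sequence


def send_volume_alert(
    send_counts: Sequence[int],
    threshold: int,
    consecutive: int = 3,
) -> bool:
    """Return True when the most recent `consecutive` periods are all below `threshold`.

    `send_counts` holds emails sent per period, oldest first. The name,
    threshold, and window are hypothetical and would need tuning against
    real traffic.
    """
    if consecutive <= 0:
        raise ValueError("consecutive must be positive")
    # Not enough history yet: do not alert on partial data.
    if len(send_counts) < consecutive:
        return False
    recent = send_counts[-consecutive:]
    # Requiring several low periods in a row avoids paging on a single quiet period.
    return all(count < threshold for count in recent)


if __name__ == "__main__":
    # Traffic that drops to zero and stays there, as in this incident.
    print(send_volume_alert([120, 110, 0, 0, 0], threshold=10))
    # A single quiet period followed by recovery.
    print(send_volume_alert([120, 110, 0, 95, 102], threshold=10))
```

With a check like this running on short periods, a complete stop in sending would be flagged after a few periods rather than depending on a report from Support.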