“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” (Leslie Lamport, 1987)
Amazon Web Services’ DynamoDB experienced downtime in the Northern Virginia (us-east-1) region early Sunday morning, September 20, 2015. As a result, a number of other AWS services in that region that depend on DynamoDB also had downtime. Companies and organizations that had built services on top of those systems without geographic load balancing had problems as well.
UPDATE: Amazon’s report, “Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region”, is available as of September 23, 2015.
Affected services include at least CloudWatch, SES, SNS, SQS, SWF, Auto Scaling, CloudFormation, Directory Service, Key Management Service (KMS) and Lambda, according to a report on Hacker News.
Down Detector has a page of AWS outage reports, with details pulled from Twitter: https://downdetector.com/status/aws-amazon-web-services
When core infrastructure goes down, it tends to affect other platforms that depend on that core infrastructure and that hide it from their users. This in turn affects applications built on those platforms.
Docker: “We are currently seeing intermittent errors when pushing and pulling, related to issues that AWS is having. We are currently investigating the causes, and doing what we can to mitigate the problems.” @dockerstatus and http://status.docker.com/
Heroku: “Starting new dynos (unidling, one-off, scaling or restarting crashed apps, new releases) is still unavailable.” “Until this incident is resolved, you might be unable to open new support tickets with us. If you need to communicate with our support staff during this time, please email email@example.com.” @herokustatus and https://status.heroku.com/incidents/811
CircleCI: “Experiencing Issues with Heroku and AWS”. @circleci and http://status.circleci.com/
TravisCI: “Partial System Outage”. @traviscistatus and https://www.traviscistatus.com/ The post-incident report, “Degradations and outages due to AWS us-east-1 domino effect”, is at https://www.traviscistatus.com/incidents/wzyhx97450f4
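The domino effect described above can be sketched as a walk over a dependency graph: start at the failed service and follow the "depends on" edges backwards. The graph below is a small illustrative assumption rather than a real inventory. The AWS-internal edges reflect the affected-services report above, the CircleCI and waffle.io edges reflect their status messages, and the specific Heroku edge is hypothetical.

```python
from collections import deque

# Illustrative "X depends on Y" edges. This is an assumed, simplified graph,
# not an authoritative map of how these services are actually built.
DEPENDS_ON = {
    "SQS": ["DynamoDB"],
    "CloudWatch": ["DynamoDB"],
    "Heroku": ["SQS"],          # hypothetical edge, for illustration only
    "CircleCI": ["Heroku"],
    "waffle.io": ["Heroku"],
    "Twitter": [],              # no dependency on the failed service
}


def affected_by(failed, depends_on):
    """Return every service that transitively depends on `failed`."""
    # Invert the graph: for each service, who depends on it?
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first search outward from the failed service.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


if __name__ == "__main__":
    print(sorted(affected_by("DynamoDB", DEPENDS_ON)))
```

In this toy graph, a DynamoDB failure reaches CircleCI and waffle.io even though neither lists DynamoDB as a direct dependency, which is the situation the Lamport quote describes.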
Many applications are built on AWS and on Heroku, and all of them are at risk of downtime. A comprehensive list is probably impossible, but here are some reports, in alphabetical order.
- Amazon Echo
- Amazon Instant Video
- AWS Console. The API is hosted in each region, so the command line worked, but the console is hosted out of us-east-1 and was unavailable.
- AWS Support Tickets. Use this workaround ???
- Canopy. “Our iOS app is down due to the system-wide #AWSoutage. Updates will be provided soon.”
- Clickfunnels. “Our development team is aware of this, however, at this time, the only thing that we can do is wait until the issues at Amazon AWS are resolved completely.” http://status.clickfunnels.com/
- CoinSimple. “Widespread outage in Amazon Web Services is affecting several Internet services, including @CoinSimple. Please check here for updates.”
- FastSpring. “We were having some intermittent problems due to #awsoutage. The problem is being resolved by Amazon.”
- HashiCorp. “Our infrastructure provider is experiencing issues which may cause our project websites to be inaccessible. Sorry for the inconvenience.”
- IFTTT. “We have identified an issue with our service provider. We will continue to provide updates as more information becomes available.” http://status.ifttt.com/incidents/xnbwnqj608hg
- Lightspeed POS. “Monitoring - Our service provider is experiencing a major issue with several of their services. Our engineers are working to reduce our reliance on these components until their issues are resolved. Some customers may experience slowness related to this problem.” http://status.lightspeedretail.com/
- Medium. “Identified - We’re working to fix a major outage and will be back online as soon as possible.” https://medium.statuspage.io/
- Nest. “We’re investigating a service outage with the Nest mobile app and Cam services, and the team is working on a fix. Details to come.” https://nest.com/support/#status
- Netflix. “We are currently experiencing issues streaming on all devices. We are working to resolve the problem. We apologize for any inconvenience.” https://help.netflix.com/help
- Paper. “account related activities in Paper are down due to problems with our infrastructure provider #AWS”
- Product Hunt. “Our servers are sick. We’re working on it right meow 😻”
- Quandl. “The Quandl website and API are temporarily unavailable, due to problems at #AWS.” http://quandl.statuspage.io
- Shipt. “Unfortunately, we’re down right now with the #AWSOutage.”
- Social Flow
- Takipi. “Due to an AWS outage, we’re experiencing some slowness and connectivity issues. Stay tuned for updates”.
- Twilio. “Update - We have established the DynamoDB backend in another AWS region and are re-routing requests from upstream services to the new region. We will provide another update once we verify that the requests are being serviced properly.” http://status.twilio.com/
- Viber. “We are experiencing disruptions in our service. We are working to resolve this issue. Sorry for the inconvenience.” https://support.viber.com/
- waffle.io “Experiencing Issues with Heroku” http://status.waffle.io/
- Walt Disney World app.
- Wink. “We are noticing increased connection issues for Wink users. Our engineers are working with Amazon to get this resolved ASAP.” http://status.winkapp.com/
Follow the discussion on Twitter under the #AWS hashtag, as well as #awsoutage and #awsdown. Twitter and Slack are both unaffected by this outage, which devops teams are happy about.
This Hacker News discussion is a good one as startups radio in their woes.