The Amazon Web Services Outage

Published Dec 13, 2021, 7:48 PM

On Tuesday, December 7th, 2021, a glitch in Amazon Web Services had a massive impact on everything from smartphone apps to Walt Disney World. We learn about the origins of AWS and what Amazon says went wrong.

Learn more about your ad-choices at https://www.iheartpodcastnetwork.com

Welcome to TechStuff, a production from iHeartRadio. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeartRadio and a love of all things tech, and last week, Amazon's US-East-1 cloud region had a bit of an outage, and the effects were widespread. Amazon delivery services were affected; a lot of deliveries just couldn't be made because the computer system that underlies the whole operation was affected. Computer games like PlayerUnknown's Battlegrounds became unavailable. People discovered that some of their home automation devices weren't working properly. Roombas went berserk and rose up against their human owners. Even down at Walt Disney World, guests found themselves struggling with systems like Genie+, or even just making a park reservation so that they could visit a theme park. Also, I kind of made up the Roomba thing.

So today I thought I would talk a little bit about the history of Amazon Web Services, what it actually does, why it's such a big deal for Amazon the company, and why, when there's an outage, it has such a widespread effect. Now, the history of Amazon Web Services, or AWS, goes back a couple of decades, and it is tied closely to the general rise of cloud computing.

So first, let's define cloud computing just so that we have a common language. If you were to go to Google and query the terms "cloud computing definition," you would likely get something like the following, quote, "the practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server or a personal computer," end quote. So, in its simplest form, cloud computing is when you access computational resources that are on someone else's computer, and you use the Internet to do it.

If you use any sort of cloud storage like OneDrive or Dropbox or any of a thousand others, what you are actually doing is saving files to special data servers that sit in some massive server farm somewhere in the world, probably not too far from where you are. Or maybe you're actually saving that one file to servers that are in a few different massive server farms. You wouldn't necessarily be aware of any of this, because it would be going on in the background, and it would be a matter of redundancy, making sure that your file remains available even if something should happen to any one particular machine. So when you access that file, what you're doing is connecting back to one of the servers that holds that particular file. You might download the file to your local machine, so you're just retrieving it, or, depending on the type of file you're accessing and the type of service you're using, you might be able to do stuff like make changes to that file through a web-based client. If you were to create a document in Google Docs, for example, that would follow that kind of cloud computing model. It's one of the simplest manifestations of cloud computing and is effectively cloud storage with a little bit of editing thrown in.

But cloud computing can go far beyond just storing files. There are cloud-based services that allow developers to build out an app environment. They might do this so that a distributed team, you know, people who aren't all working in the same location, can simultaneously work on the same code and create test environments to make sure that the app performs as expected before they deploy the app to end users, you know, to customers.
Other cloud services serve as an actual deployment platform, so not just to develop, but to deploy. This gives developers the assets they need to push out an app and handle user interactions. Some apps might, quote unquote, live natively on your device. Right? You download a file to whatever you're using, whether it's a computer or smartphone or tablet or whatever, and then all the processes and all the data could be contained right there, locally, on your machine. That's like the old form of computing. But increasingly we're seeing apps that rely on the cloud for functionality. So games could have stuff like leaderboards, or ways that you can compete or cooperate with other players in real time. A weather app needs to fetch data from servers to tell you what the weather will be; your phone on its own doesn't magically know what the weather is going to be. Even a lot of home automation apps will communicate back with a web server somewhere rather than handle everything right there in your home. In fact, that's a sticking point for a lot of home automation folks, right? They don't necessarily want to have the cloud be part of the infrastructure; they would prefer their home to be kind of a self-contained system. You see this a lot with people who have security systems, where they would prefer something that is completely contained within their own home, as opposed to having their security system become a surveillance tool for a company that may or may not be working in conjunction with, say, law enforcement. That's become a big issue, but that's a matter for a different podcast.

Now, building out these kinds of systems is expensive, because you need the physical facilities, right? You need the actual buildings, and they have to be large enough to hold all the servers that are designed to make your app work, and then the facilities themselves have to be designed to allow those servers to operate. That means building out stuff like cooling systems so that your machines don't overheat. So it's not enough just to have a place where you store all the computers; you've got to have it be appropriate for that. You know, it needs to be dry, free of dust, cooled, that kind of stuff. You also need to maintain those systems. You have to repair or replace components as they fail, because we all know technology does fail at some point for a variety of reasons. That's also why you need more than the bare minimum to run your operations, right? You need to do more than just the basics. You need backups for redundancy, so that if and when a specific machine goes down, others can take its place seamlessly without affecting the end user.

So in our Google Docs example, if you were to create a document, that document isn't just sitting on one server that Google owns; it's actually on multiple servers, multiple machines, and if one of those machines goes down, you can still get access to your file. Also, when you make changes to it, essentially you're making changes to the file on one machine, and that machine then sends out a message to all the other machines that hold that same file so that they can all be updated with the newest version. I've done episodes about that kind of background stuff in the Google Docs world. So, moving on: essentially, cloud computing uses network connections to allow you, other people, organizations, and companies to rely on machines that are hosted somewhere else, and that frees you up considerably.
You don't have to invest in buying hard drives so that you can save all your files; you can just subscribe to a service and get some cloud storage. Companies don't have to kit themselves out with massive computer systems, complete with an IT department to support those systems. They can just, you know, spend some money to use a cloud computing service owned by someone else and then host all their operations through that. Though I should add that a lot of companies take a more hybrid approach: they have some systems, typically mission-critical systems, or sometimes ones that require a great deal of privacy and security, that they run on premises, or on-prem, and then they rely on cloud systems for other stuff, like the more administrative side of things.

Cloud computing started becoming a term, a buzz term really... I mean, the idea was older than that, but it was starting to really get circulation around twenty ten or so. But the seeds, like I said, were planted earlier. So let's take a look at Amazon specifically, because it played a big part in this.

Way back in two thousand, Amazon was scrambling to keep up with some scaling issues, and this is something that you hear about with startups pretty frequently. A new startup is usually a fairly small company, and it's nimble, and it's agile, and it might offer a small range of services or products, or it might only serve a relatively small region, or both. Think of companies like Lyft and Uber; those launched in just a couple of cities early on, right, so they were able to grow in a controlled manner. Well, if customer demand is high and investors are pouring money into the startup, it makes some sense to try and grow the company and expand operations. But growing adds new challenges, and making sure that the things you offer are able to scale up and meet demand is a non-trivial matter. That's the situation Amazon was in around two thousand.

One of the things the company was exploring was building out merchant sites for other companies, but still using the Amazon platform. So, for example, Amazon might partner with a retail company like Target to provide an online store, but use Amazon's infrastructure underlying that store. This would bring in a new stream of revenue for Amazon, and it would mean these retail companies could rely on Amazon's platform rather than having to build out an online store all of their own. Amazon called this Merchant.com. But it turned out building Merchant.com was pretty challenging. It was one thing to manage Amazon's rapid growth, but it was another to build out products that could immediately scale to fit the needs of established companies like Target. The initial result was a product that had so many interconnected moving parts and features that it was difficult for a user to navigate and actually use. And I'm sure all of you out there know that if a tool is hard to use, most people don't bother with it, right? You might get it and try it and think, this is too much hassle, so you would rather go without or find some other alternative.

Well, in two thousand two, Amazon began building out Amazon.com Web Service. Now, this would not quite be the same thing as Amazon Web Services, despite the similar name. It was much simpler than that. It used a SOAP and XML interface. And by SOAP, I don't mean the stuff you use to get clean. If you're not a developer, those things probably sound a little confusing, so let's clear it up.
SOAP is a messaging protocol; the name originally stood for Simple Object Access Protocol. And XML means Extensible Markup Language. It is a language, weird as it is to say it that way, a machine-readable and human-readable language that's used to create sets of rules for document encoding. So this is a language used to define rules, as opposed to, you know, programming something. Together, these allowed developers to create processes that can run on pretty much any machine that can speak HTTP. That way, you could create a process that runs on Windows devices or macOS or Linux, all that kind of stuff, without having to program a specific version for each operating system. Amazon's version of this allowed for a pretty limited amount of development around creating processes that could access the Amazon product catalog. This would let web developers create an interface on their own web page that would utilize Amazon's store, with the idea that people could buy a product right there from that web page instead of having to navigate over to Amazon.com itself, and the developers would earn a small commission on every sale made through that, you know, web-page-based point of sale. It was just a tiny dip of the toe into cloud-based infrastructure.

Also, Amazon noticed that developers were, I mean, this happens all the time, developers were taking that tool and making stuff that Amazon had not anticipated or intended. Nothing necessarily bad, but some were making games where they would use this methodology to show a picture of an Amazon product, and it was up to you to guess what that product was, that kind of thing. So they were gamifying certain elements of this, and that kind of got wheels turning over at Amazon. This happens all the time: whenever you create anything and you give it to developers, they immediately figure out ways to misuse it, I mean, use it creatively.

Anyway, around the same time, Amazon executives began to realize that their various development teams were running into the same problems over and over. Namely, each team working on a different internal project would need to go through the same basic steps before they could do any serious work on the project itself, which involved things like establishing systems to handle compute operations, storage solutions to hold all the data, and database solutions to organize everything. A clear picture began to emerge: Amazon's teams were having to reinvent the wheel with every new project. The original projection for seeing a project go from start to finish was supposed to be three months, that was the goal for Amazon, but it turned out that just building out the infrastructure to allow a project team to actually start developing their project would take three months, so everything was running behind schedule. The lesson the executives took from this was that it would be a worthwhile endeavor to establish a centralized internal system that could support the compute, database, and storage needs of all these different project teams. It would need to be a system that could compartmentalize and contain each project so that every one of them would have the resources the teams needed. It meant building out virtual machines and figuring out ways to create redundancy, and it was a matter of necessity for Amazon if those internal teams were ever going to hit that, you know, three-month projection goal.
But it also meant Amazon was building up something that could potentially end up being a service the company could offer to others. It would take a little bit longer for that to come about. Over time, folks at Amazon began to think of this effort as creating something almost like an operating system, but for the Internet rather than for a computer or a mobile device. These ideas first began to take shape around two thousand three, when Amazon executives were attending a company retreat. It would be another few years before the earliest version of Amazon's web services would launch. All right, we're gonna take a quick break. When we come back, we'll talk more about Amazon Web Services.

Okay, we left off in two thousand three. Let's put this in perspective. If you're anything like me, you might say, all right, well, that's less than twenty years ago, I get it. But let's think about other things that were going on. Two thousand three was a year before Facebook would launch at Harvard, let alone expand beyond it. In fact, it was about three years before Facebook would get out of the phase where it was only available to college students. Two thousand three was two years before YouTube launched. It was four years before Apple would introduce the iPhone, and it was just two years after the dot-com crash that had wiped out numerous web-based companies. So this was very early on in thinking about cloud computing and operations at this kind of scale.

The company began to invest in building out data centers, you know, these huge facilities that hold thousands of servers, and engineers developed and tweaked database management services to coordinate and partition these machines effectively. Meanwhile, the product development teams would work on new products to expand what Amazon could do for customers. So in two thousand three, Andy Jassy, who would go on to become the CEO of Amazon as of July fifth of this year, became the project lead for Amazon Web Services. He had suggested to Jeff Bezos that Amazon could take the systems the company had been developing for internal use and open those up as a product for other companies. He was essentially pitching cloud computing to Jeff Bezos, and he got the go-ahead. In two thousand four, Jassy's team had a beta version of this product ready for testing, and over the following two years they would refine and tweak that product until, in two thousand six, AWS was ready to launch its initial product.

Now, this would not be Amazon Web Services as a cohesive whole, but rather a single product called Simple Storage Service, or S3, which debuted on March fourteenth, two thousand six. Amazon described S3 as a tool that would let developers save and retrieve, quote, "any amount of data at any time from anywhere on the web," end quote. So this was a cloud storage product. It is a cloud storage product; it still exists. The experience of Merchant.com had, however, taught Amazon developers a pretty valuable lesson, which I would summarize as: just because you can doesn't mean you should. Now, granted, I usually use that phrase to criticize vocalists who do irritating vocal runs during their songs, ahem, Mariah Carey, but in this case, I'm talking about the issue of feature creep. Feature creep is the tendency to throw extra features and options into a product just because you can. These features don't necessarily contribute to the usefulness of that product.
In fact, more often than not, they can make a product janky and hard to navigate. The Amazon developers didn't want S3 to fall into that trap, so early on the team decided that the only thing that needed to be done was to make sure the storage service was as good as it could be, and to avoid including any extraneous options. Their motto was, quote, "the system should be made as simple as possible, but no simpler," end quote. That's a good point, too. A bare-bones approach is sometimes the best one, but you do still need the bones to be there.

The architecture of the product can be described as objects, buckets, and keys. Objects are essentially data, and that data could be just about anything; S3 doesn't care what the data is. It could be video files, it could be a game, it could be a database, it could be music, it could be whatever. The objects have metadata that describes what the object is and when it was last modified. Next, you've got your buckets, and this is a kind of classification system. Imagine you've got these objects, that is, files, and you've got a lot of different types of them that belong to a lot of different things. You might have a bucket for a specific kind of file, like music files, or, more likely, you might organize buckets according to specific projects, so one project might have all its objects sorted into one or more buckets that belong to that project alone. Now, keys are a kind of ID for each object inside a bucket, and each object has one key. So you can find any object inside S3 if you have two pieces of information: the bucket it is in and the key for the object. Keys are used mainly for retrieval and, you know, that kind of thing. (There's a quick code sketch of this bucket-and-key idea at the end of this segment.)

The Amazon developers created a storage system that was priced at fifteen cents per gigabyte of storage space per month, at least at launch. It is significantly cheaper than that now, which tells you that Amazon has scaled the service dramatically, and considering we're well into the era of big data, that's a good thing for developers. Today, Amazon's S3 Standard storage has three different tiers of cost, which depend on how much storage you're actually using, like how much data you have in the system. Let's say you have fifty terabytes or less in S3 Standard; that would mean you are looking at 2.3 cents per gigabyte per month. If you've got more than five hundred terabytes stored, on the other end of the scale, then you're paying 2.1 cents per gigabyte per month. And yeah, that adds up for companies that need to store a lot of data. Anyway, I bring it up to help illustrate how much things have changed. Fifteen cents per gig per month is way, way, way more expensive than 2.3 cents per gig per month. Oh, and I should also mention that S3 today offers several other storage products that have other features and costs associated with them, but this is not meant to be an ad for S3, so we're just gonna leave that for now.

Anyway, S3 was successful right out of the gate. In fact, just two months after launch, Amazon saw that demand had exceeded their projections by a factor of one hundred. Today, there are more than one hundred trillion objects stored in buckets in S3, and the fact that the product could scale up to accommodate that number of objects attests to good design decisions that were made early on.
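To make that objects, buckets, and keys model a bit more concrete, here is a minimal sketch using boto3, the Python SDK for AWS. The bucket name, key, and file contents are made-up examples, and in real use you would need AWS credentials configured and a bucket you actually own.

```python
# Minimal sketch of the S3 object/bucket/key model using boto3.
# The bucket name and key below are hypothetical examples.
import boto3

s3 = boto3.client("s3")

BUCKET = "techstuff-demo-bucket"        # a bucket groups related objects
KEY = "episodes/aws-outage-notes.txt"   # a key identifies one object in that bucket

# Store an object. S3 doesn't care what the bytes are.
s3.put_object(Bucket=BUCKET, Key=KEY, Body=b"Notes on the December 7 outage")

# Retrieve it later. The bucket plus the key is all you need to find it.
response = s3.get_object(Bucket=BUCKET, Key=KEY)
print(response["Body"].read())      # the object's data
print(response["LastModified"])     # metadata S3 keeps about the object
```

And on the pricing side, counting a terabyte as a thousand gigabytes for simplicity, fifty terabytes at 2.3 cents per gigabyte per month works out to roughly $1,150 a month, whereas the same data at the original fifteen-cent rate would have cost about $7,500 a month.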
The organization system is simple, yeah, but that simplicity also meant that S3 could grow on demand, which it did. In August two thousand six, Amazon launched a new cloud-based service, and this one was called, and still is called, Amazon Elastic Compute Cloud, or EC2. As the name suggests, this product offers up a different element of computing: the actual compute part. That is, this is a system that would allow customers to tap into on-demand computing power. Developers who had a great idea, but who lacked the money or space or both to build out a computer facility, could subscribe to EC2 and lean on Amazon's systems to do the work for them. Like S3, this idea had its roots back in two thousand three, when a couple of Amazon engineers, Chris Pinkham and Benjamin Black, authored a memo suggesting a product that could give developers the chance to run software on Amazon computer systems specifically designated for that task.

Around this same time, Amazon introduced Simple Queue Service, or Amazon SQS. This is a type of message queue, and by message I mean the kinds of communications that go from service to service. Let's say you're running an app on your phone; the app might, in the background, send a request to a remote server to get access to some data, and that would be a message. SQS is a platform that queues up messages so that the back end of a system can respond appropriately to requests, which should give the end user a seamless experience. Now, there's a lot more to SQS than that, but I think that simple explanation will serve us well enough for this episode. (There's a small code sketch of sending and receiving a queued message at the end of this segment.)

So these products, S3, EC2, and SQS, kind of became the backbone for what would grow into Amazon Web Services as a whole. There are a lot of other focused products in that suite, but generally speaking, each one is meant to be really good at doing something specific without having that feature-creep issue come into play. Amazon got the jump on other big companies like Google and Microsoft when it came to offering up cloud-based computing products, and that gave Amazon the chance to establish a dominant position in the market. I mean, when you're effectively the only game in town, it's, you know, not hard to become dominant. Today those other companies, Microsoft and Google and lots more, have their own cloud computing services available, but Amazon's head start means the company still has a very strong presence. According to Synergy Research Group, Amazon's share of the cloud computing market is thirty-two percent, or nearly one third of the entire market. That's more than Microsoft's and Google's products combined; together, those companies make up a smaller share of the market. So about a third of all the cloud computing business that's going on out there is going through Amazon. And like I said, that includes tons of different things, from apps on your phone to video games to Walt Disney World's virtual ticketing system.

Now, I'm not going to say that as long as AWS is running smoothly, everything should go well, because all the products that are built on top of AWS still need to have a good design. I mean, it's possible to make a really lousy product that uses AWS, and it's not the fault of AWS if that product is lousy. But as we saw last week, when things get hairy on AWS, all the products that rely on those services can be affected. So last week, at approximately ten thirty a.m. on Tuesday, December seventh, twenty twenty-one, AWS had what we in the tech biz call a whoopsie.
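To make the message-queue idea above a bit more concrete before getting into the outage itself, here is a minimal sketch of a producer and a consumer using the boto3 SQS client. The queue name and message contents are made-up examples, and again you would need AWS credentials and permissions set up for this to run.

```python
# Minimal sketch of passing a message between services through SQS using boto3.
# The queue name and message body are hypothetical examples.
import boto3

sqs = boto3.client("sqs")

# Create (or look up) a queue and grab its URL.
queue_url = sqs.create_queue(QueueName="techstuff-demo-queue")["QueueUrl"]

# Producer side: an app drops a request onto the queue.
sqs.send_message(QueueUrl=queue_url, MessageBody="fetch forecast for ZIP 30301")

# Consumer side: a back-end worker picks up messages when it is ready.
reply = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=5
)
for message in reply.get("Messages", []):
    print("handling:", message["Body"])
    # Delete the message once it has been handled so it isn't processed twice.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```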
That whoopsie lasted between five and seven hours, depending upon the services you were relying upon. And because AWS has this massive presence in the market, and because so many big companies rely on it to make their stuff work, that whoopsie had a pretty big footprint. According to Amazon, the issue was a glitch in some crucial networking hardware. This hardware is in charge of hosting what Amazon called foundational services, including stuff like EC2, but it also handled stuff like Amazon's Domain Name Service. This service is kind of like the liaison that connects human-readable URL addresses with machine-readable addresses, and without it, you can tell your browser to go to a particular website all you like, but it ain't happening, because the liaison is on, like, a five-to-seven-hour coffee break, and the machines have no idea what you're on about.

Anyway, the AWS internal system became overwhelmed, and that's something that usually doesn't happen; usually there's a cross-network scaling system that kicks in and meets increased demand. But this glitch essentially caused a massive game of telephone within the AWS system, and it overloaded all the circuits, to use a somewhat flimsy analogy. The glitch triggered what Amazon called, quote, "a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks," end quote. So it's almost like a classic denial-of-service attack, only Amazon kind of did it to itself. I guess if we're being fair, we would say the glitch caused it.

Now, a delay in communication normally just means you have an irritating experience like lag, right? You can usually manage that, but, you know, it just makes whatever you're doing more difficult. Except a lot of systems have timeout features, in which, if there is a long enough delay between sending a message and getting a response, you reach a failed state, and that happened a lot last Tuesday. (There's a short code sketch of DNS lookups and timeouts at the end of this segment.) What made matters more difficult was that Amazon's own real-time monitoring services rely on those internal AWS systems. I mean, that's how AWS even got started, right? It was Amazon building out its own infrastructure and then offering up those capabilities to other companies. So that meant the mitigation teams who were working to fix stuff didn't have all their real-time monitoring tools available as they were tackling the problem, and that slowed down the recovery quite a bit. Amazon has since apologized to customers for this outage, and its reps now say that the company is working to distribute its Service Health Dashboard across multiple regions, so that should something similar happen in the future, the fix should theoretically happen much more quickly.

So, yeah, this is another way for us to realize that we have put a tremendous amount of trust and dependence in cloud services. And it's another reminder that you could have designed everything yourself as well as it can possibly be designed, you can have an incredible app, but if the technology that powers that app goes down, it doesn't matter how good your product is. Since you don't control that, since you are dependent upon a cloud provider, then if the cloud provider has problems, that's really a big blow to your own business plans.
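To illustrate the two failure modes just described, the DNS liaison going missing and a response arriving too late, here is a minimal standard-library Python sketch. The hostname and the five-second timeout are arbitrary example values.

```python
# Minimal sketch of the two failure modes described above:
# a DNS lookup that fails, and a request that times out instead of hanging forever.
import socket
import urllib.error
import urllib.request

HOSTNAME = "example.com"    # arbitrary example hostname
TIMEOUT_SECONDS = 5         # arbitrary example timeout

# Step 1: DNS, the "liaison" that maps a human-readable name to a machine address.
try:
    address = socket.gethostbyname(HOSTNAME)
    print(f"{HOSTNAME} resolves to {address}")
except socket.gaierror:
    # If DNS is unavailable, the app never even learns where to connect.
    print(f"could not resolve {HOSTNAME}: DNS lookup failed")

# Step 2: the request itself, with a timeout so a slow reply becomes a failed state.
try:
    with urllib.request.urlopen(f"https://{HOSTNAME}/", timeout=TIMEOUT_SECONDS) as reply:
        print("got response:", reply.status)
except TimeoutError:
    print(f"no response within {TIMEOUT_SECONDS} seconds, giving up")
except urllib.error.URLError as err:
    print("request failed:", err.reason)
```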
It's one of the reasons why companies really debate which services they want to put in the cloud versus on premises. It's a complicated thing, too, because scaling is such a tricky issue. Most companies, you know, ones that aren't huge Fortune 500 companies, don't have the assets necessary to be able to scale, at least not to the massive scales that we're seeing in the global Internet space.

Anyway, I hope you found this episode interesting as we talked about AWS and what happened last week. If you have suggestions for topics I should cover on future episodes of TechStuff, please reach out to me. The best way to do that is on Twitter. The handle for the show is TechStuffHSW, and I'll talk to you again really soon.

TechStuff is an iHeartRadio production. For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts, or wherever you listen to your favorite shows.
