Today, I participated in a free GameDay event at the local AWS Summit here in Chicago. Going into it, I wasn’t sure what I was getting myself into, but the premise sounded intriguing: split up into teams and compete to see who can build the best scalable solution.
The event starts by splitting you up into groups of 3–4 people, pairing beginners with advanced AWS users to balance out the room and encourage teaching at the same time. I thought that was a really neat concept: learn from one another. Walking into the room, I voiced a confident “advanced,” but I left hungry for more and full of questions. It was well worth my time and loads of fun.
First, the CEO of Unicorn Rentals walks in and gives the most “inspiring” talk, which can basically be summed up as: “I told folks on Good Morning America that we are live, good luck, don’t mess up, and make me a lot of money.” Fortunately for us, we were given a comical, half-assed “Runbook” describing the architecture and how it works; sadly, this is probably more documentation than some projects I have worked on in my professional career! From coffee stains to passwords scribbled all over the place, I knew from the get-go that this was going to be a fun little ride.
The first thing we did after logging in was delete all those scratched-out accounts and change the password on the root account! Once the account was secured, we needed to register our team on the scoreboard. This was done by creating a TXT record at the root of the hosted zone with our team name. It appears the organizers listened for these TXT records and registered teams off of them; kinda cool! From here, the fun begins and our application starts taking load.
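For the curious, registering a team this way might look something like the following AWS CLI call. This is a hypothetical sketch: the hosted zone ID, domain, and team name are placeholders, not the event’s actual values.

```shell
# Hypothetical sketch: publish a team name as a TXT record at the zone apex.
# ZONE_ID, example.com, and the team name are all placeholders.
aws route53 change-resource-record-sets \
  --hosted-zone-id ZONE_ID \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "example.com.",
        "Type": "TXT",
        "TTL": 300,
        "ResourceRecords": [{"Value": "\"Team Unicorn Wranglers\""}]
      }
    }]
  }'
```

The organizers could then poll each account’s zone for the TXT value and tie it to the scoreboard.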
When we were handed the account, we were told that it was working today, but that we needed to make it handle increased load arriving in ~30 minute increments. The first thing we checked was how our application was configured. We noticed that the root A record pointed to an individual EC2 instance with a static IP rather than to an ELB/ASG. We corrected this by creating an ALIAS record for the root pointing at the ELB and associating the ELB with the ASG. Next, we were told to create a simple deployment architecture using only User Data and Auto Scaling Groups. We started by copying the existing broken launch configuration, fixed it so that it pulled the new code base and didn’t end with a shutdown command (WHY!!!), and then launched with a scale of 2 instances (hard-coded for now). Keep in mind this was only 30 minutes into the competition.
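A fixed-up version of that User Data deployment might look something like this. Everything here is a placeholder sketch (the build URL, AMI, instance type, and resource names are made up); the point is simply: fetch the new build on boot, run it, and don’t shut the instance down afterwards.

```shell
# Hypothetical sketch of the repaired launch configuration.
# All names, the AMI ID, and the build URL are placeholders.

cat > userdata.sh <<'EOF'
#!/bin/bash
# Fetch and start the new server build on boot.
curl -o /opt/unicorn/server https://example.com/builds/server-latest
chmod +x /opt/unicorn/server
/opt/unicorn/server &
# Note: the broken launch config ended with a shutdown command -- removed!
EOF

# Create a new launch configuration with the fixed User Data...
aws autoscaling create-launch-configuration \
  --launch-configuration-name unicorn-lc-v2 \
  --image-id ami-12345678 \
  --instance-type t2.micro \
  --user-data file://userdata.sh

# ...and point the Auto Scaling group at it, hard-coded to 2 instances.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name unicorn-asg \
  --launch-configuration-name unicorn-lc-v2 \
  --min-size 2 --max-size 2 --desired-capacity 2
```

Launch configurations are immutable, hence the copy-and-replace dance rather than editing in place.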
From here, we faced various issues such as:
- The “Network Engineer” messed up our ACLs with a bad script.
- The same “Network Engineer” killed our main route table, leaving no default route to the Internet Gateway.
- We were met with other nefarious and “accidental” issues along the way; it was great to simulate those “oops” moments.
Aside from these moments, we noticed that our subnet for launching instances was limited to a /28 CIDR block (which, after AWS reserves its five addresses, leaves only 11 usable IPs)! Load by this point was starting to jump to almost double digits; this just wouldn’t do! We fixed it by creating three subnets (one in each AZ) and associating them with the ELB and the Auto Scaling group. With ample address space, we could now focus on improving our scaling policy. At first we scaled based on network requests, but later we determined we should have used the ELB latency alarms and scaled off of those. With a solid policy and enough servers to handle traffic, we figured we were good to go! Wrong, so wrong, I can’t believe how wrong we were.
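Scaling off ELB latency, like we wished we had, could be wired up along these lines. Again a hedged sketch: the group, ELB, and alarm names plus the 1-second threshold are illustrative, not what we actually ran.

```shell
# Hypothetical sketch: scale out when the ELB's average latency stays high.
# Names and the threshold are made up for illustration.
POLICY_ARN=$(aws autoscaling put-scaling-policy \
  --auto-scaling-group-name unicorn-asg \
  --policy-name scale-out-on-latency \
  --adjustment-type ChangeInCapacity \
  --scaling-adjustment 2 \
  --query PolicyARN --output text)

# Fire the policy when average latency exceeds 1s for two straight minutes.
aws cloudwatch put-metric-alarm \
  --alarm-name unicorn-high-latency \
  --namespace AWS/ELB \
  --metric-name Latency \
  --dimensions Name=LoadBalancerName,Value=unicorn-elb \
  --statistic Average \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 1.0 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions "$POLICY_ARN"
```

Latency is a better signal here than raw request counts because it measures what users actually feel, not just how busy the network is.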
Despite having everything in order, our servers still couldn’t keep up with the load! What was going on? There were enough servers, they weren’t crashing, and there was no high load. What could be the problem? We were stuck on this for some time, until we logged into an instance and noticed that the application handled a single request with an average latency of about 4 seconds. Playing with the binary, we noticed there was an ElastiCache (memcached) option we could leverage. We sprang into action, built up a small distributed memcached cluster, and configured our app to use it. Now our average request time was ~0.2 seconds! Sweet!
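Standing up a small memcached cluster like that is a one-liner. This is a sketch with placeholder IDs, node type, and security group, not the exact cluster we built:

```shell
# Hypothetical sketch: a small three-node memcached cluster.
# The cluster ID, node type, and security group are placeholders.
aws elasticache create-cache-cluster \
  --cache-cluster-id unicorn-cache \
  --engine memcached \
  --cache-node-type cache.t2.micro \
  --num-cache-nodes 3 \
  --security-group-ids sg-12345678
```

Once the cluster is up, you point the app at the cluster’s configuration endpoint and let it cache away.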
However, despite the fast response times, we were still getting failures. It didn’t make any sense until we ran the binary and watched it: it would only handle one connection at a time, rejecting any other connections while it serviced that one request. This ended up taking the most time, and we couldn’t implement a solution before the clock ran out. Speaking with some of the Solutions Architects, they mentioned some teams used Docker to run multiple copies on the same host, and some folks found you could run ~4 binaries concurrently on different ports to handle the load. As the competition wound down, I was implementing a solution that ran 4 instances of the binary on the same host, leveraging iptables to round-robin between them; unfortunately, I didn’t get a chance to see it in practice, but I was confident it would have worked.
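The iptables round-robin idea I was racing to finish would look roughly like this. A hedged sketch: the ports, paths, and the binary’s port flag are assumptions, and the real event binary may have differed.

```shell
# Hypothetical sketch: run four copies of the single-threaded binary on
# ports 8081-8084 (ports and the --port flag are assumptions here).
for port in 8081 8082 8083 8084; do
  /opt/unicorn/server --port "$port" &
done

# The statistic/nth match grabs every Nth NEW connection that reaches a
# rule, so stacked rules deal connections out 1-2-3-4, 1-2-3-4, ...
iptables -t nat -A PREROUTING -p tcp --dport 80 -m state --state NEW \
  -m statistic --mode nth --every 4 --packet 0 -j REDIRECT --to-port 8081
iptables -t nat -A PREROUTING -p tcp --dport 80 -m state --state NEW \
  -m statistic --mode nth --every 3 --packet 0 -j REDIRECT --to-port 8082
iptables -t nat -A PREROUTING -p tcp --dport 80 -m state --state NEW \
  -m statistic --mode nth --every 2 --packet 0 -j REDIRECT --to-port 8083
iptables -t nat -A PREROUTING -p tcp --dport 80 \
  -j REDIRECT --to-port 8084
```

Note the shrinking `--every` values: each rule only sees the connections the earlier rules let through, which is why 4, 3, 2, then a catch-all splits traffic evenly.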
All in all, it was a great experience. No, my team did not win, but I got to explain and teach others about AWS and some best practices. Towards the end it was really cool to see the team effort as we tackled harder and harder challenges, and I think we all walked away with a little more knowledge than we came in with. At one point, I had 10 people sitting around me listening to me explain how we do deployments at Gogo, our challenges and pitfalls with AWS, and other random topics. Even some of the Solutions Architects came by to sit and discuss, joking that they “needed stadium seating” with the number of people around me! At that moment I realized we do some pretty wicked stuff at work. If anything, this re-energized me to go back and implement some of the things I learned!
With all that being said, I really have to thank the AWS Summit GameDay organizers! This was great and loads of fun. Please keep doing these kinds of events, especially in other cities!