Thursday, March 10, 2011

Amazon EC2 design FAIL - What, Why, and How to fix it.

This is a rant, a screed, a diatribe, a scream in the wilderness hoping to get the adults in charge to notice a major design flaw in Amazon's otherwise excellent Elastic Compute Cloud service, known as EC2 for short.

However, unlike most rants, screeds, etc., I offer a reasonable and easy-to-implement solution which should work well for all concerned.

What EC2 is:
EC2 lets you create virtual servers that run on Amazon's hardware and networks. It's fast, reliable, and pretty flexible when it comes to getting far more computing resources on short notice than a small company could arrange, let alone finance, because you pay by the hour of computing time and by the megabyte of disk storage.

Amazon offers a wide variety of Linux and Microsoft operating systems to run within these virtual servers, and they make it easy to provision new machines, or "instances".

My story:
Yesterday, I was at work, and for whatever reason, I couldn't find the instance of Windows Media Streaming I had last used on Amazon EC2 about 6 months ago that I needed for a demo. With real servers, it's obvious when you have boxes to look at, hopefully all nicely labeled, but since virtual servers don't actually take up physical space in the office, they end up just like any other misplaced computer file.

I then proceeded to create a new one from scratch. The setup wasn't that long, but my work day became a long one while I got everything set. It got worse when I figured out that the Hardware Streaming Box we were going to use wasn't using the same protocol I had previously used. I got all that sorted out about midnight, but then found out something else was amiss. I thought either the streaming box or the virtual server could be misconfigured, so I created a virtual server in our own local network (using VMware) to divide the problem and more accurately place blame. At about 6 AM I had proof that it was the streaming box, and it had a virus. It needed to be reset to factory standards... I waited for our supplier to call back to get the proper procedure for doing so, and got everything working by 10 AM today. (Now a 26-hour work day.)

I then proceeded to help everyone else test out their parts of the demo, and showed them how everything worked with the box, Amazon, Windows, etc... I was done after lunch at about 1:30PM.  I was taking care of putting the hardware away, cleaning up my office, etc... when I shut down the Virtual Server. I was looking at the configuration of it, and it seemed to be stuck in the process of shutting down (terminating) far longer than expected.

Then I couldn't find it! (Déjà vu)

It was about 15 minutes later that I found out what had happened... Amazon threw my newly configured virtual machine away, assuming I no longer wanted it, merely because I turned it off (using the Windows Shutdown command) to save the compute costs while I wasn't using it. My reaction was one of surprise and sadness, and resignation to an even longer work shift that was now likely to stretch from 8 AM to 5 PM the next day.

I'm upset about this. I understand how someone on the product team might have justified using the word Terminate to signify deleting a server, and how someone else defended the decision to delete servers by default, but it's not the way people use computers.

How you can relate:
Imagine if the mere act of turning off your desktop machine resulted in its disappearance and the need to set up a new one, no matter how inexpensive. This is the problem I faced. I invested hours of time getting everything working just right, and testing it... and then I had to spend another 3 hours to do it all over again.


How to fix it:
Now... here's my message to the folks who control the design of this system...

You have added a "termination prevention" system, which helps to alleviate the problem, if the user has a clear understanding of the NON-STANDARD use of the word termination in this context. This kludge of a fix tells me that the product managers don't quite have a good enough grasp of how people expect things to work.

A far better fix, one that fits with far less ambiguity, and far less pain for all involved, is to use the standard word DELETE when describing the act of removing a virtual machine from existence.

Deletion of a virtual machine, or of a set of files, should NEVER happen merely because a virtual machine powered itself down. It should ALWAYS and ONLY be the result of a positive, direct action at the request of a user, who then gets a message warning them of the full implications before giving their final confirmation.
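
To make the rule concrete, here is a minimal sketch in Python of the behavior I'm asking for. It is a toy model, not Amazon's code or API: the names VirtualMachine, guest_shutdown, delete, and the confirm callback are all made up for illustration. The point is only that a shutdown from inside the machine leads to a stopped state with the disk intact, and that deletion happens solely through an explicit call that shows a warning and gets a yes.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    DELETED = "deleted"


class VirtualMachine:
    """Hypothetical model of a VM; not a real EC2 or VMware interface."""

    def __init__(self, name):
        self.name = name
        self.state = State.RUNNING
        self.disk_intact = True

    def guest_shutdown(self):
        """The OS inside the VM powered off: stop the VM, never delete it."""
        if self.state is State.RUNNING:
            self.state = State.STOPPED

    def start(self):
        """Power a stopped VM back on; a deleted VM cannot be restarted."""
        if self.state is State.DELETED:
            raise RuntimeError(f"{self.name} has been deleted")
        self.state = State.RUNNING

    def delete(self, confirm):
        """Delete only after the user confirms an explicit warning.

        `confirm` is an assumed callback that receives the warning text
        and returns True to proceed or False to cancel.
        """
        if self.state is State.DELETED:
            return False
        warning = (
            f"Deleting '{self.name}' permanently destroys its disk and "
            "configuration. Continue?"
        )
        if not confirm(warning):
            return False
        self.state = State.DELETED
        self.disk_intact = False
        return True


# Usage: shutting down keeps the machine; only a confirmed delete removes it.
vm = VirtualMachine("media-streaming-demo")
vm.guest_shutdown()
print(vm.state, vm.disk_intact)
vm.delete(confirm=lambda warning: False)  # user answers "no"
print(vm.state, vm.disk_intact)
```

In this sketch there is simply no path from guest_shutdown to the deleted state, which is the whole design point: the destructive action has exactly one entry, and it always asks first.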

Summary:
Please take this in the spirit with which it is intended, as CONSTRUCTIVE criticism, and a possible fix.

You'll save all of your new users having to go through this painful experience, and have a better product to boot.

Update: As you can see in the comments, this design fail is making a hole for others to fill.

2 comments:

  1. Mike - I'm sorry to hear of your cloud experience. Sadly, many people get road rash when first picking up the new/mislabeled terms. We've seen it repeatedly at RightScale, and that's why we created ServerTemplates.

    May I suggest "stopping" the instance? Your root drive will be saved (but you'll have to save any attached drives manually).

  2. Stopping an instance does not delete attached drives. I think it used to, but it doesn't anymore.
