Thursday, October 16, 2008

eBay Does It, Why Can't We?

As web applications have taken over the world over the past decade, one thing has become clear: we have a heck of a lot to learn about managing them.

Complexity in the software industry has skyrocketed, and applications that were once versioned and released every six months are now upgraded with new production code every couple of weeks. eBay is a prime example of this model. It is widely known that eBay has become so adept at addressing issues and deploying changes to its servers that every two weeks a whole new version of eBay is up and running in its multitude of data centers around the world.

Now, eBay is not a simple application, and downtime at eBay can cost millions of dollars, not to mention generate an angry mob of users, some desperately trying to buy the latest iPhone 3G, others trying to make their living. This is serious stuff. Downtime at eBay is front-page news.

It’s redundant to point out the dangers of our new Software-as-a-Service paradigm. However, the one-to-many relationship between application server and end users does present many pitfalls that did not exist in Bill Gates’ world. When Microsoft Outlook crashes, you may be upset, say something not very nice, then restart it. When your application servers crash, money starts flowing straight out the back door as your customers all collectively say something not very nice about you.

So why aren’t more companies able to follow the eBay model? What does eBay know that they don’t?

The answer may lie not just in how they deploy new changes, but in how they handle and resolve issues quickly when they do occur. Is this truly one of the great remaining challenges in the realm of software? Some would say so, and they have good reasons to back them up.

Multi-tier applications represent some of the greatest levels of complexity ever seen in the software industry. With pieces of your application running on many heterogeneous, physically dispersed servers and environments, understanding what went wrong can be next to impossible. When an issue occurs, most often the only hope a team has is to attempt to recreate the same conditions that caused the error and hope it happens again. That means the only way to understand the root cause is to recreate the environment, re-populate the database, and generate the required load on the servers. Frequently, the pain of going through this effort is too great, and the issues lie dormant… until the next time something bad happens!

What the software industry is screaming out for is the ability to quickly capture, reproduce, and isolate issues as they occur. What we need is something like ‘TiVo™ for Software’.

One solution that has finally emerged from the chaos introduces the concept of recording and replaying software execution. This technology revolves around the core ability to not only record an application’s execution, but just as importantly, the complex environment in which the application ran.

With this new ability, teams can dispense with a mass of inefficient workflows that have traditionally been manual, iterative, and error-prone.

Imagine this common scenario: your newly outsourced team in India is handling QA for your complex, multi-tier application. They’re doing a great job and have found over 100 issues with your application. You’ve got your problem reports, log files, and the very large database datasets that your application was using when the bad things happened.

Next comes the fun part.

Now it’s your turn to bring up the same environment that your Indian team was running. I hope you’re using virtual servers! Finally, let’s take a shot at generating the same load on our application that existed when the problem occurred. Hopefully, the moons have aligned, and your fingers are crossed…

Now let’s fast-forward to 2008. Your Indian team is using your recording system. You arrive in the morning, log on to your defect tracking system, load the recording of an issue they found, and press ‘play’.

This time, every event that affected your application in that complex environment, including output from your authentication, LDAP, caching, and e-commerce servers, has been recorded and stored. Even the database and its dataset are no longer required. Most importantly, the end-user traffic that ultimately triggered the problem has been recorded as well. All of these elements are perfectly reproduced, allowing you to focus on the most important thing: what went wrong.
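To make the idea concrete, here is a minimal sketch of how record-and-replay of external calls can work. This is an illustration of the general technique, not Replay's actual product API; `RecordReplayProxy` and `ldap_lookup` are hypothetical names. During the original run, every response from an external service (LDAP, cache, database) is logged in call order; during replay, the logged responses are fed back verbatim, so the application sees the exact environment it saw when the bug occurred, with no live servers required.

```python
from typing import Any, Callable, List

class RecordReplayProxy:
    """Wraps a function that talks to an external service.

    In 'record' mode, real calls go through and their responses are logged.
    In 'replay' mode, logged responses are returned in the original order,
    so the application is fed the identical inputs without the real service.
    """

    def __init__(self, fn: Callable[..., Any], mode: str = "record"):
        self.fn = fn
        self.mode = mode
        self.log: List[Any] = []   # recorded responses, in call order
        self.cursor = 0            # replay position

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        if self.mode == "record":
            result = self.fn(*args, **kwargs)
            self.log.append(result)
            return result
        # Replay: hand back the response recorded for this call position.
        result = self.log[self.cursor]
        self.cursor += 1
        return result

# Record a run against the (here, simulated) external LDAP lookup...
def ldap_lookup(user: str) -> dict:
    return {"user": user, "groups": ["buyers"]}

proxy = RecordReplayProxy(ldap_lookup, mode="record")
first = proxy("alice")

# ...then replay it later, with the LDAP server nowhere in sight.
proxy.mode, proxy.cursor = "replay", 0
assert proxy("alice") == first
```

A real record-and-replay system must also capture timing, thread scheduling, and incoming end-user traffic, but the core contract is the same: intercept every nondeterministic input at the boundary, log it once, and replay it deterministically.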

Anyone who has been involved in software development can relate to the age-old conundrum of trying to reproduce an issue that simply doesn’t appear to exist. At least not on your machine. Too many sleepless nights have been wasted chasing down phantom bugs. It’s time for the madness to stop.

The problems we’re facing are only getting more and more complex as new technologies are brought to market. This new software paradigm is here to stay. Luckily, I believe new technologies such as record and replay will help control the chaos.

Thursday, March 27, 2008

Where are the clouds?

Why isn't everything in the cloud these days? Where is the promised land of SaaS?

It feels like SaaS has been a story waiting to happen since around 1999. The network is the computer, thin clients with fat pipes, scaling servers, clustering, virtualization... It seems like all the pieces are in place, and the cloud is up there. But still there are only a handful of winners that have really figured out the SaaS model, and a sea of also-rans that got run over along the way.

Well, some of them were just plain bad ideas. Irrational exuberance and all that. But I think there is another big factor at play here. When you have these giant, sea-change moments in the way that software is designed, built, shipped, and supported, not to mention sold, you'd better have the tools and technology to support you along the way; otherwise it's not going to be easy! In fact, it's going to be hard. Really hard.

I could point you to a room full of ex-CEOs who will attest to this fact. Whipping up your latest Web 2.0 mashup and putting it online is usually about 3% of the challenge. What happens when people actually start using it? Here's where the rubber meets the road.

One of the essential elements of success is getting a solid, scalable application online and running smoothly and securely. But there just hasn't been a lot of innovation here.

Being able to quickly identify, respond to and resolve issues in a SaaS application is critical, because if one server has a bad day, it's not one customer that feels pain, it's hundreds or thousands. And that's bad. SaaS acts like a big hairy amplifier on any defect or scalability issue that might be lurking in your app.

Technologies like Introscope, Patrol, Vantage, Snort, and my company Replay are starting to address these needs, but our customers are still pioneering and forging the landscape as they increasingly feel the pains of this new software paradigm we find ourselves in.

So great job, VMware, Amazon and Linus for getting us to this point where we finally can explore the dream of The Cloud. Now it's up to companies like us to make SaaS applications manageable, cost effective, and safe as we keep things running up there 24/7 with less 'unscheduled maintenance'!