This article originally appeared in the April 2012 issue of Cloud Computing Magazine.
All cloud systems are inherently complex, and complexity is inherently evil. You can’t avoid complexity, since the size and scale that drive efficiency also add complexity. However, you can choose how complex to make your basic system. A winning strategy for any team of cloud builders is to start simple and then get more complex organically over time. Starting with a complex system means multiplying that complexity as you scale, and with it the risk of a major failure.
Systems & Complexity
First, let’s look at two important lessons on complexity from systems theory:
1. Complex systems fail.
2. People love to build complex systems.
Many engineers see understanding and developing complex systems as a rite of passage. In reality, the true test of a great engineer is their ability to make things simpler, not more complex. In software development, this is talked about as “elegance” or “code elegance.”
Complexity is the opposite of elegance. Complexity breeds failures. Systems that are complex or sprawling, and not designed for failure, will fail catastrophically. Frequently, a catastrophic failure will turn into a cascading failure. “High availability” (HA) as typically implemented in an enterprise datacenter will not protect a system from cascading or catastrophic failures. Traditional HA stems from the idea of risk mitigation, but it simply is not possible to ensure robustness by predicting what could go wrong and adding complexity to handle a predicted range of failures. Cloud systems must embrace risk acceptance and planning, the emerging approach for building reliable cloud systems out of unreliable parts.
Failures in Systems
There are a number of works on systems and failures. They are best summarized by “Gall’s Law”:
“A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a working simple system.”
Many of the current approaches to building Infrastructure-as-a-Service (IaaS) clouds are deeply rooted in complexity. In this regard, they look similar to how enterprise datacenters and applications are constructed today: heterogeneous and sprawling, with a multiplicity of silos and no prevailing design patterns or reusability. These kinds of approaches are difficult to scale, to secure, or to maintain with high levels of uptime.
So, simplicity beats complexity. To see it in action, let’s look at AWS.
Examples of AWS EC2’s Design Elegance
When AWS EC2 launched in August of 2006, you could get an m1.small for $0.10/hr. in one region using an API. That was it. That was all of it. You couldn’t even get m1.large or m1.xlarge instances until over a year later.
Even as AWS grows in scope, Amazon maintains simplicity in individual services, such as EC2. Some highlights:
• Only one hypervisor is supported
• The default networking model is a simple flat layer-3 routed networking model
• Instance disk storage is ephemeral, meaning there are no SANs or NAS, just regular old DAS
• Elastic Load Balancing (and similar services) are “lowest common denominator” capabilities: you get just simple L4 load balancing, not a complex L7 load balancer
• EC2 evenly subdivides physical hosts and “bin packs” VM instances onto the same kind of physical hardware designed for that workload/VM-type
• Every VM instance has one network interface (NIC)
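To make the even-subdivision point concrete, here is a minimal sketch of why it keeps placement simple. It is an illustration, not Amazon's scheduler: the slot counts, the host capacity, and the first-fit rule are all assumptions chosen for the example. When every instance size divides the host evenly, "bin packing" reduces to counting free slots.

```python
from dataclasses import dataclass, field

# Hypothetical sizes, measured in equal "slots" of one host.
# The ratios are illustrative, not AWS's actual specifications.
SLOTS_PER_TYPE = {"small": 1, "large": 4, "xlarge": 8}
HOST_SLOTS = 8  # assumed capacity of every (identical) physical host


@dataclass
class Host:
    free: int = HOST_SLOTS
    instances: list = field(default_factory=list)


def place(requests):
    """First-fit placement: use the first host with enough free slots,
    and add a new host only when none has room."""
    hosts = []
    for name in requests:
        need = SLOTS_PER_TYPE[name]
        for host in hosts:
            if host.free >= need:
                break
        else:
            # No existing host could take the instance.
            host = Host()
            hosts.append(host)
        host.free -= need
        host.instances.append(name)
    return hosts


if __name__ == "__main__":
    for i, h in enumerate(place(["small", "large", "small", "xlarge", "large"])):
        print(i, h.instances, "free slots:", h.free)
```

Because the hosts are identical and the sizes are fixed fractions of a host, the scheduler needs no per-host special cases. Mixed hardware or arbitrary sizes would turn the same problem into something much harder to reason about.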
Build It Right
The challenge in building robust and scalable IaaS systems isn’t “will there be enough features?” The challenge is, “will we simplify and grow organically what we know works?” Right now I see more of the former and not nearly enough of the latter. In looking at your own systems, I would keep asking questions about the main inputs of complexity at the outset:
1) Features: What is the minimal set of services that you need to provide to your users to be a viable solution?
2) Options: Do your users really need more than one way to do things, to start? Flexible systems are rarely simple; ask Larry Wall.
3) Best practices: Tried and true IT practices for small systems do not automatically improve a production cloud system. They can, in fact, weaken it. Context is key.
Seek simplicity in your cloud implementations, or be prepared for the unavoidable dangers of complexity.
Randy Bias is co-founder and chief technology officer of CloudScaling.
Edited by Stefania Viscusi