Recently, a major public cloud provider experienced a large outage that took many Internet properties offline for several hours. Technology professionals can take two key lessons away from this event to prevent future outages from impacting their businesses.
First, placing application and data assets in the cloud does not make them foolproof in reliability or availability. Second, anything can fail when human processes are involved. Businesses need to understand these factors and mitigate them when designing and deploying IT infrastructure to support business-critical applications and data.
Clouds Can Fail
This is not the first service outage for a cloud provider. Other cloud providers have had significant outages, impacting application availability as well. Clouds, whether public or private, can’t exist without management and support. While we tend to think that clouds are virtual and nebulous architectures, in reality they use traditional hardware and software to create a specific network architecture to host application services.
As these outages show, failures within the cloud infrastructure can cause catastrophic disruptions to application availability. Businesses need to assess the criticality of the applications and services that the cloud supports. When availability and reliability of access to an application and its data become mission critical, the IT architecture must be designed to be fault tolerant, with no dependence on a single provider or technology.
Within traditional private data center architectures, businesses utilize redundant facilities in geographically diverse locations. If a failure occurs within a data center, or to the connectivity to that data center, the business is protected because the application is still available at the other location. Technologies such as global server load balancing enable automatic and seamless failover to the backup facilities. In a similar manner, IT organizations need to host applications across multiple cloud environments, public and private, to mitigate the risk of any single event disrupting a critical application.
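The failover logic described above can be sketched in a few lines: probe each site in priority order and route traffic to the first healthy one. This is a minimal illustration of the idea behind global server load balancing, not any vendor's implementation; the site names and the simulated health check are hypothetical.

```python
# Minimal sketch of health-check-driven failover across sites.
# Site names are hypothetical; a real GSLB would probe live endpoints.

def pick_endpoint(endpoints, is_healthy):
    """Return the first healthy endpoint in priority order, or None."""
    for ep in endpoints:
        if is_healthy(ep):
            return ep
    return None

# Primary site first, then backups across different providers/locations.
sites = [
    "app.cloud-a.example.com",   # primary public cloud
    "app.cloud-b.example.com",   # second public cloud
    "app.on-prem.example.com",   # private data center
]

# Simulated health check: pretend the primary cloud is down.
down = {"app.cloud-a.example.com"}
target = pick_endpoint(sites, lambda ep: ep not in down)
print(target)  # app.cloud-b.example.com
```

Because the backup sites sit with different providers, no single provider outage can take all entries in the list down at once.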
Mistakes are Part of Human Nature
Many of the disruptions and outages of Internet services that have been reported (and many that were not) were caused by people making mistakes, not by hardware and software failures. Whether it was someone mistyping a command, entering the wrong value, or even tripping over a power cable, human error is one of the primary concerns that must be addressed when developing operational models to support these increasingly complex network architectures.
IT organizations create runbooks, red books, and other operational manuals to detail the specific tasks that need to be carried out in certain situations. The purpose of these documents is to ensure that people perform specific tasks in a specific manner, eliminating the potential for mistakes. Unfortunately, any time a human step is part of an operational procedure, there is the potential for error.
In many outages, accidental mistakes made by people cascade to create significant consequences that impact major portions of the cloud environment. The operational management and maintenance of IT architectures is better suited for automated management and orchestration systems and tools that can take the human error variable out of the equation.
IT operations teams perform recurring, repetitive tasks on a daily basis, and it makes sense that these functions become automated, much as assembly lines automated manufacturing. A lot of work is being done to develop models for the automation and orchestration of IT infrastructures. Open source projects like OpenStack are working to connect different technologies and vendors into a unified management infrastructure.
The Future is Artificial Intelligence
The reduction of human error through automation, combined with the diversity of multiple environments, can significantly reduce the potential impact of events and outages on a business's application delivery. But this design also creates a more diverse environment that requires an advanced understanding of how the different pieces interact with and impact each other.
The idealistic goal is to create a system that can collect the information from different elements within the IT ecosystem, analyze the data in an intelligent manner, and then enact changes to the environment based on the analytics and heuristics. In this vision, the management and orchestration system becomes the central intelligence and nervous system for the network infrastructure. The IT architecture evolves from being a manually built and supported environment to a self-healing and self-evolving ecosystem based on the business needs and policies.
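The collect, analyze, and act cycle described above can be sketched as a simple control loop. This is a toy illustration of the pattern, assuming made-up metrics, thresholds, and site names; a real system would gather telemetry from across the ecosystem and apply far richer analytics.

```python
# Toy collect -> analyze -> act loop for a self-healing setup.
# Metrics, the 5% threshold, and site names are illustrative assumptions.

def collect(state):
    """Gather a metric from raw telemetry."""
    return {"error_rate": state["errors"] / max(state["requests"], 1)}

def analyze(metrics, threshold=0.05):
    """Decide on an action from the metrics (simple heuristic)."""
    return "failover" if metrics["error_rate"] > threshold else "none"

def act(action, state):
    """Enact the decision: shift traffic to the backup site."""
    if action == "failover":
        state["active_site"] = state["backup_site"]
    return state

state = {"requests": 1000, "errors": 120,
         "active_site": "site-a", "backup_site": "site-b"}
state = act(analyze(collect(state)), state)
print(state["active_site"])  # site-b -- 12% error rate triggered failover
```

Run continuously, a loop like this is the "central nervous system" the vision describes: it observes the environment, reasons about it, and changes it without a human in the critical path.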
Frank Yue is the Director of Application Delivery Solutions for Radware. In this role, Yue is responsible for evangelizing technologies and trends around Radware's ADC solutions and products. He writes blogs, produces solution architectures, and speaks at conferences and events around the world about application networking technologies. Prior to joining Radware, Yue was at F5 Networks, delivering their global messaging for service providers. Yue has also covered deep packet inspection, high performance networking, and security technologies. Yue is a scuba diving instructor and background actor when he is not discussing technology.
Edited by Alicia Young