This article originally appeared in the April 2012 issue of Cloud Computing Magazine
No matter how you play it, Disaster Recovery (DR) is an expensive hedge. Maintaining a live replica of your production environment in a remote location cannot be reduced to a negligible expense. Cloud-derived efficiencies can, however, make a functional DR solution far more accessible and less expensive. Rather than expounding upon the obvious benefits of cloud DR, this article focuses on the inherent obstacles using a borrowed theme that symbolically parallels the road from disaster to recovery.
Employing cloud as a core component of a DR plan is, by no means, a new idea. However, cloud DR has historically offered little more than glorified cloud-based backup. Aside from the prohibitive physics of hurriedly shuttling terabytes of data to perform emergency rebuilds of critical systems, storage-centric offerings fundamentally fail to address business continuity.
Should a tornado whisk your production datacenter off to the Land of Oz, a remote non-functional repository of data, no matter how well-groomed and current, serves zero practical function for getting back to work. Had the munchkins kindly pointed Dorothy to pallets of nicely organized yellow bricks and suggested that she pave her way to the Emerald City, it’s fair to assume that her ‘road to recovery’ would have taken a lot longer. Simply put, a sound DR plan ensures some reasonable path to operational continuity.
Since reduced operational capacity is an accepted risk in the context of DR, it may be perfectly reasonable to virtualize systems at the DR site which otherwise demand dedicated hardware for day-to-day production loads. Physical-to-virtual (P2V) conversion is virtualization’s “quest for Oz”: deceptively simple in concept, fraught with peril in practice.
Whoever coined the phrase “gotcha” must have been peering through their crystal ball and referencing P2V conversion. Not to suggest that it can’t or shouldn’t be done, but the challenges, risks and expected rewards must be quantified up front and reviewed with the nearest wise wizard (read: someone who’s done it before) before committing. Sometimes, it makes perfect sense and is worth the effort; in fact, sometimes it’s totally painless. Other times, it’s easier to invest in application-level HA tools and accept an architecturally complex design that behaves more predictably.
Virtual-to-virtual (V2V) conversions can be similarly vexing. Moving VMs between environments that are not strictly homogeneous, even those with the same hypervisor and disk image format, can be problematic for a number of reasons. The upshot of all this is that once the initial conversion issues are resolved, the cloud DR operation becomes routine and reliable, even for dynamic conversions like ongoing physical-production to virtual-DR replication.
Divination of required bandwidth for the DR site can be a significant project for which the popular and convenient click-your-heels-three-times level of effort will assuredly be insufficient. Most organizations have a solid handle on Internet and remote-site WAN needs. However, many have never needed to collect the kind of near-real-time inter-system throughput metrics required for engineering a DR solution. What follows are two plausible “misqueues of the masses” related to DR network throughput:
Emerald City Municipal IT Department leases a burstable private-line circuit to handle DR traffic (its biggest operational nemesis being the wicked ditch-digger of the West, who lands with surprising accuracy and frequency upon unsuspecting fiber-optic cables). The original thought during DR design was that a burstable circuit would hedge against underestimated bandwidth requirements; historical network and system throughput statistics were non-existent, and best guesses were deemed sufficient. In all of Emerald IT’s merriment upon taking the DR site live, they neglected to impose outbound rate-limiting at their production site. Middle-of-the-night data synchronizations began consuming every last drop of available bandwidth (including premium-rate burstable bandwidth) for four to five hours per day on the otherwise dormant circuit. Since Emerald City operates with stereotypical government efficiency, eight months passed before Emerald’s Comptroller mentioned the egregious DR cost overruns. After much undue expense, the configuration fix was implemented in under an hour.

Ruby Shoe Company was an early adopter of cloud technology and adroitly identified mission-critical systems for DR deployment. To determine bandwidth requirements for the desired 24-hour replication frequency, a Ruby Shoe network engineer added up the collective volume of disk space assigned to the aforementioned critical VMs. Knowing how much data must pass within a given time frame allowed for a simple calculation to determine the necessary bandwidth. The engineer surmised that 16 terabytes of data needed to be transferred to the DR site daily; the fundamental oversight was that the VMs were thin-provisioned and actual data utilization was closer to 2 terabytes. This sparkling realization did not occur until after Ruby Shoe committed to a two-year contract for the type of wickedly expensive private-line circuit required to move 16 terabytes per day.
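Ruby Shoe’s arithmetic can be sketched in a few lines. The volumes are the figures from the story above; the simple volume-over-window formula is a back-of-the-envelope assumption that ignores protocol overhead, compression and deduplication:

```python
def required_mbps(data_tb: float, window_hours: float) -> float:
    """Bandwidth needed to move data_tb terabytes within window_hours."""
    bits = data_tb * 1e12 * 8          # decimal terabytes -> bits
    seconds = window_hours * 3600
    return bits / seconds / 1e6        # bits per second -> megabits per second

# Sizing from allocated (thick) capacity vs. actual thin-provisioned usage:
allocated = required_mbps(16, 24)   # the circuit Ruby Shoe bought
actual = required_mbps(2, 24)       # what the data really needed
print(f"allocated: {allocated:.0f} Mbps, actual: {actual:.0f} Mbps")
```

The lesson embedded in the calculation is to measure actual data utilization (and its growth rate) before signing a circuit contract, rather than summing provisioned disk sizes; an eightfold gap between the two is the difference between a reasonable line item and a wickedly expensive one.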
Using cloud for DR purposes makes for a highly compelling story. However, it requires brains, courage and some benevolent guidance to ensure a happy ending.
Edited by Stefania Viscusi