When Bad Things Happen To Good Data, Part 2

Jon Toigo


In part one of this series, we noted that many vendors were promoting data protection strategies that work best in response to only a subset of the causes of downtime, albeit the most common causes of downtime. These are hardware faults, software glitches, human errors and malware, which together with planned maintenance account for the lion’s share of IT downtime annually.

The Risks of Downtime

Of course, that doesn’t mean that other risks – those with a broader geography such as weather disasters, geological disasters, or man-made disasters such as fires, nuclear and chemical releases or terrorist events – never materialize. Vendors of everything from high availability (HA) systems to hypervisor software simply go with the statistical probabilities. With HA clustering, you are protected against the most common disasters, including the 7% annual failure rate of hard disk drives, or the somewhat greater failure rate of power supplies, etc. Storms like Hurricane Sandy happen only once every 200 years, so protecting against those sorts of events may not have the same statistical urgency. To synopsize the argument made by the vendors, “HA trumps DR.”

Only, Hurricane Sandy’s once-in-200-year storm was followed by another once-in-200-year weather event nine months later. And, to be clear, adding the amount of annual downtime attributed to maintenance activity distorts the relative proportions of downtime attributable to unplanned interruption events that are, technically speaking, disasters. If maintenance downtime is removed from the probability models, interruption events like hardware failures, viruses and malware, carbon robot failures (human mistakes), and software glitches would still probably account for the majority of downtime causes, but the proportional difference between downtime from those causes and from disasters with a larger geographical footprint would likely be much smaller.
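To put rough numbers on the statistical argument, here is a minimal sketch in Python. It takes the 7% annual drive failure rate and the once-in-200-year storm cited above and asks how likely each event is over a period of exposure. The 10-year planning horizon, the 24-drive array and the assumption that events are independent are illustrative choices, not figures from any vendor.

```python
# Illustrative arithmetic only. Independence between trials is an assumption.

def prob_at_least_one(per_trial_probability: float, trials: int) -> float:
    """Probability that an event occurs at least once in `trials`
    independent trials, each with the given probability."""
    return 1.0 - (1.0 - per_trial_probability) ** trials

# A "once-in-200-year" storm has a 1-in-200 chance in any given year.
storm_annual = 1 / 200
storm_10yr = prob_at_least_one(storm_annual, 10)  # hypothetical 10-year horizon

# A 7% annual failure rate applied to a hypothetical 24-drive array.
drive_afr = 0.07
any_drive_fails = prob_at_least_one(drive_afr, 24)

print(f"200-year storm within 10 years: {storm_10yr:.1%}")
print(f"At least one of 24 drives fails within a year: {any_drive_fails:.1%}")
```

Under these assumptions the drive failure is by far the more likely event, which is the vendors’ point. The storm probability is small but not negligible over a planning horizon, which is the counterpoint.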

Plus, when you have a “Big D” disaster, it is a disaster, and not an easily surmountable glitch. Rather than needing to pause an application to replace a bit of failed hardware or to remediate a bad line of code, you need to do the full job of disaster recovery: fully restore your critical data, re-host your applications in a different location and on new gear, and reconnect all networks – to users, customers, investors and business operators – all at light speed.

High Availability Architecture

I have nothing against high availability architecture. HA is and has always been a subset of the methodology of disaster recovery planning, one of many techniques that could be selectively applied to deliver a certain measure of recoverability. But HA is not a replacement for DR.

High Availability architecture proceeds from the assumption that you ensure the operational continuity of the data processing platform by building an identical copy of that platform and by replicating data from platform one to platform two so that you can “fail over” from 1 to 2 if 1 has a problem. Easy peasy.

There are a lot of nuances to this strategy, of course. For one, you can either do active-passive or active-active clustering, with the latter involving the sharing of a workload on an ongoing basis and “simply” shifting all workload over to the surviving node if the other one fails. Active-passive, which is the more commonly deployed cluster architecture, involves workload executing on node 1 and only data replication occurring between nodes 1 and 2. If node 1 fails, this failure must be detected (via a monitored “heartbeat” signal), thereby triggering the startup of the passive node and the loading of the operating system or hypervisor and the application workload or virtual machine.
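The heartbeat mechanism in an active-passive cluster can be sketched in a few lines of Python. This is a simplified illustration, not the logic of any particular clustering product: the class name, the 3-second timeout and the explicit timestamps are all hypothetical, and a real cluster would also have to guard against false positives such as a severed heartbeat link.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Runs on the passive node and tracks heartbeats from the active node."""
    timeout_seconds: float        # hypothetical threshold; real clusters tune this
    last_heartbeat: float = 0.0   # timestamp of the most recent heartbeat
    failed_over: bool = False

    def record_heartbeat(self, now: float) -> None:
        """Called whenever a heartbeat arrives from the active node."""
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Return True if this call triggered a failover."""
        silent_for = now - self.last_heartbeat
        if not self.failed_over and silent_for > self.timeout_seconds:
            # In a real cluster, this is where the passive node would start
            # the operating system or hypervisor and the application workload.
            self.failed_over = True
            return True
        return False


monitor = HeartbeatMonitor(timeout_seconds=3.0)
monitor.record_heartbeat(now=100.0)
still_ok = monitor.check(now=102.0)    # 2 s of silence: within the timeout
triggered = monitor.check(now=104.5)   # 4.5 s of silence: failover is triggered
print(still_ok, triggered)
```

Timestamps are passed in explicitly so the behavior is easy to follow. A production monitor would read a clock and run the check on a timer.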

Clustering and Full Redundancy

Typically, clustering requires identical equipment on each node. It also typically requires a flawless data mirroring process, which often carries with it requirements for “identicality” in the storage infrastructure (same storage array or same software-defined storage product, etc.). The need for identical kit has a tendency to drive up the costs of this solution, both in terms of kit acquisition and solution maintenance over time.

In fact, the idea of full redundancy (what we used to call HA) has been part of DR since the beginning of the practice in the 1970s and 1980s. Companies with deep pockets built wholly redundant data centers and purchased expensive wide area network (WAN) interconnects to replicate data over distance between the two facilities. Workload was either shared between facilities (active-active) or one was “lights out” and manned only by a skeleton crew until it was needed for a recovery and went “lights on.”

Full redundancy/HA was an extremely costly approach to ensuring continuous availability and, at a practical level, it was a strategy fraught with challenges that have not gone away. For one thing, whenever a change was made to the infrastructure or software complement in the primary facility, operators needed to be dispatched to the remote facility and make the same changes to infrastructure and software there. Moreover, for the strategy to offer meaningful protection against large footprint disasters, the redundant facility needed to be separated by adequate distance from the primary to avoid being consumed by the same disaster.

Safe Distances Between Facilities

Another issue: what is a minimum safe distance between the facilities? For a time, 80 kilometers (about 50 miles) was touted as the minimum safe distance. Coincidentally, this was also roughly the distance that data could travel over a private WAN link without accruing significant distance-induced latency. Moving data over that distance on a shared WAN service exposed the transfer not only to latency but also to jitter that could combine to create data differences or “deltas” that would make the data at the redundant facility temporally “behind” the data at the primary facility. When systems were started (assuming they could be) at the redundant facility, databases and applications would be using older data, missing the latest transactions or files.
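A back-of-the-envelope calculation shows why distance matters for replication. The sketch below assumes that light travels through optical fiber at roughly 200 kilometers per millisecond (about two-thirds of its speed in a vacuum) and ignores switching, queuing and protocol overhead, so it gives a best-case floor rather than a measured latency. The function names are illustrative.

```python
# Approximate propagation speed in optical fiber. This is an assumption;
# real links add equipment and protocol delays on top of it.
FIBER_KM_PER_MS = 200.0


def round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip propagation delay over a fiber link."""
    return 2 * distance_km / FIBER_KM_PER_MS


def max_serialized_writes_per_second(distance_km: float) -> float:
    """Upper bound on writes per second when each write must wait for a
    remote acknowledgment before the next is issued (synchronous mirroring)."""
    return 1000.0 / round_trip_ms(distance_km)


rtt_80 = round_trip_ms(80)     # the traditional "safe distance"
rtt_200 = round_trip_ms(200)   # a wider separation
print(f"80 km:  {rtt_80:.1f} ms round trip, "
      f"{max_serialized_writes_per_second(80):.0f} serialized writes/s at most")
print(f"200 km: {rtt_200:.1f} ms round trip, "
      f"{max_serialized_writes_per_second(200):.0f} serialized writes/s at most")
```

Every additional kilometer adds delay to each acknowledged write, which is why longer links tend to push replication toward asynchronous operation and, with it, toward the deltas described above.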

The problem is, of course, that contemporary hurricanes and other natural and man-made disasters have been demonstrating damage zones with diameters greater than 80 kilometers, averaging closer to 180 to 200 kilometers. For example, companies that relocated their production operations to Philadelphia from NYC in response to Hurricane Sandy found that the same weather that inundated their primary data centers was coming dangerously close to encroaching on their redundant facilities, as well.

Between their high maintenance cost and perceived weaknesses and deltas, HA/full redundancy strategies have always represented a very small subset of total DR programs used by companies worldwide. Shared commercial recovery facilities (so-called “hot sites” or some contemporary DRaaS cloud services) have been chosen by some companies to provide redundancy at a distance at a lower cost. However, too many firms have been lulled into complacency with respect to disaster recovery planning overall. They believe that on-premises server HA with mirroring delivers adequate protection against “the 90%” of disasters that do not involve facility-wide or milieu-type disaster events.

HA or DR?

So, “HA trumps DR” seems to be a rather foolish premise. HA is part of DR, not an alternative. It is neither a poor DR technique, nor is it the universal solution for surviving all interruption threats. It is a technique with some cachet in the case of certain workloads under certain conditions and will likely be part of many disaster recovery plans going forward, but as part of a hybrid strategy that incorporates other data protection methods, including backup to tape.

That’s right, tape backup. Only, it doesn’t need to be an image created by backup software. Tape has undergone significant changes over just the past few years. First, its capacity per cartridge has grown tremendously due to barium ferrite coatings on tape media. LTO-7 tape, for example, can store up to 16TB of compressed data on a single cartridge, with capacities growing through generation 10 to upwards of 125TB per cartridge compressed. And “enterprise” tape media from IBM and Oracle, also using barium ferrite, are already delivering 10TB and 8.5TB per cartridge uncompressed (more than double those capacities, if 2.5:1 compression is used).
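The compression arithmetic is simple enough to check. The sketch below applies the 2.5:1 ratio to the 10TB and 8.5TB enterprise cartridge capacities mentioned above, then estimates how many cartridges a data set would need. The 500TB data set is a hypothetical figure chosen for illustration, and real compression ratios vary with the data.

```python
import math

COMPRESSION_RATIO = 2.5  # the 2.5:1 ratio cited in the text; actual results vary


def compressed_capacity_tb(native_tb: float) -> float:
    """Effective cartridge capacity at the assumed compression ratio."""
    return native_tb * COMPRESSION_RATIO


def cartridges_needed(dataset_tb: float, native_tb: float) -> int:
    """Whole cartridges required to hold a data set of the given size."""
    return math.ceil(dataset_tb / compressed_capacity_tb(native_tb))


cap_10 = compressed_capacity_tb(10.0)   # 10TB native cartridge
cap_85 = compressed_capacity_tb(8.5)    # 8.5TB native cartridge
dataset_tb = 500.0                      # hypothetical data set size

print(f"10TB native  -> {cap_10}TB compressed, "
      f"{cartridges_needed(dataset_tb, 10.0)} cartridges for {dataset_tb:.0f}TB")
print(f"8.5TB native -> {cap_85}TB compressed, "
      f"{cartridges_needed(dataset_tb, 8.5)} cartridges for {dataset_tb:.0f}TB")
```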

A second noteworthy improvement in tape technology involves the application of a formatting technology called Linear Tape File System, which enables a much simpler integration of tape media with other disk/solid state file systems. That means you can build an archive using tape much more readily and create data copies by writing directly to tape in the same manner as you would write data to any other storage target.

The Evolution of Tape as a Backup

But why tape? First, it enables you to create a data copy with tremendous portability. You can send your backup data on tape to a recovery data center or to a cloud service provider (called cloud seeding) with none of the delays that accrue to WAN link-based data transfers.
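To see why physical shipment can beat the wire, consider how long a bulk transfer takes over a WAN link. The sketch below uses hypothetical figures (a 100TB data set, 1Gbps and 10Gbps links) and assumes the link is fully dedicated to the transfer, which flatters the WAN. Shared links and protocol overhead would stretch these times further.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move `data_tb` terabytes over a link of `link_gbps`
    gigabits per second. `efficiency` is the usable fraction of the link."""
    bits = data_tb * 1e12 * 8                       # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / SECONDS_PER_DAY


days_1g = transfer_days(100, 1)     # hypothetical: 100TB over 1Gbps
days_10g = transfer_days(100, 10)   # hypothetical: 100TB over 10Gbps
print(f"100TB over 1Gbps:  {days_1g:.1f} days")
print(f"100TB over 10Gbps: {days_10g:.1f} days")
```

A courier carrying a case of cartridges is not subject to any of this, which is the portability argument in a nutshell.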

Second, tape provides an air gap. With mirroring, bad data is replicated at the best possible speed between the mirrored disk targets. If a virus corrupts primary data, tape-based data provides the air gap, preventing immediate replication of the corrupted data.

Third, tape is “restore target agnostic.” Tape backups don’t require identical target hardware to be used at the recovery site. Backups made from one vendor’s storage array can be readily restored to any other vendor’s disk hardware. That is the kind of flexibility that you need to recover fast in a bad situation.

For this World Backup Day, consider going back to basics and renewing your tape-based data protection strategy. If tape is viewed as old school, call it an archive strategy: copy your data to the archive, then make additional copies for off-site storage. When the worst happens, at least you will have a shot at recovery.

