When Bad Things Happen to Good Data, Part 1

Jon Toigo


Truth be told, not all data is the same. At a glance, it might seem to be – each anonymous dataset, file or object is just a collection of 1s and 0s. But the reality of the situation is quite different.

Some 1s and 0s may support a mission-critical order processing database that must never be interrupted; others may be historical files that are never used in day-to-day operations. Not all applications are the same in terms of their workload characteristics or their criticality to the business. Data “inherits” its criticality, like DNA, from the application it serves.

If we agree that data criticality differs, then we also need to accept that the methods we use to protect and restore data to a usable state should be selected based on this criticality and the priority of restoral. Expensive, fast restore techniques might be appropriate for critical data, but less appropriate for historical or rarely accessed bits. Similarly, critical data might be served by a multi-layer protection approach involving multiple services such as continuous data protection, mirroring and periodic backup. But such expensive and resource-intensive approaches may not be necessary for historical data.

Backing up historical data

These differences may not strike many users as terribly important when they are talking about their desktop system. All data on “Drive C” (or changed data in more sophisticated methods) simply gets backed up to a disk or tape container at a designated time of the night or week. However, this approach can differ in a data center where many different servers may be operating many different virtualized and non-virtualized applications using different, and sometimes incompatible, storage configurations. Finding time in the schedule to undertake all of the data protection services can often be a major challenge.

This challenge becomes more profound as data growth predictions approach tens of zettabytes. Let’s put this in perspective. A zettabyte is one thousand exabytes, and the exabyte was the last mind-bending multiplier of data quantity that the analysts threw around a few years ago. An exabyte is one billion gigabytes, or one million terabytes – one million of those 1TB hard disks you have been buying for the past few years. A zettabyte is one billion of those 1TB drives.
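The unit arithmetic above is easy to check with a few lines of Python. This is only a sketch: it assumes decimal (SI) units, where 1TB is 10^12 bytes, and the helper name `drives_needed` is made up for illustration.

```python
# Decimal (SI) storage units, expressed in bytes.
# Assumption: drive capacities are quoted in decimal units, as vendors usually do.
GB = 10**9
TB = 10**12
EB = 10**18
ZB = 10**21


def drives_needed(total_bytes: int, drive_bytes: int = TB) -> int:
    """Return how many drives of drive_bytes capacity hold total_bytes (rounded up)."""
    return -(-total_bytes // drive_bytes)  # ceiling division on integers


print(ZB // EB)           # exabytes per zettabyte: 1000
print(EB // GB)           # gigabytes per exabyte: 1000000000
print(drives_needed(EB))  # 1TB drives per exabyte: 1000000
print(drives_needed(ZB))  # 1TB drives per zettabyte: 1000000000
```

By the same arithmetic, a 10-zettabyte world would need ten billion of those 1TB drives, before counting any copies made for protection.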

According to leading analysts, we are looking at the onset of something I have been calling the “zettabyte apocalypse.” Simply put, at the rate data is being produced today, the total will reach between 10 and 60 zettabytes by 2020, depending on the analyst you read. That projection has leading cloud vendors scrambling to find space to store all of the bits. One cloud architect recently determined that the collective annual manufacturing output of the disk makers, the flash storage producers, and the optical disc industry would not be sufficient to deliver adequate storage space to the market to handle even a fraction of the coming data deluge. Only tape, which is charting reasonable capacity growth using Barium Ferrite coatings, improved tracking technology and better signal-to-noise processing, could possibly provide elbow room sufficient for zettabytes of data.

But tape is generally thought to be “old technology” – not part of the new “all silicon” data center model. And, of course, since no work is being done to separate the critical data from the historical data (which might be an excellent candidate for tape), much of the data protection industry is focused on selling as much flash-to-flash or disk-to-disk data replication as possible, which only accelerates everyone toward the zettabyte apocalypse.

Shelter in place

Even some of the latest builders of big data systems have adopted a “shelter in place” mindset to cover data protection requirements. Their topology involves the build-out of hundreds or thousands of servers, each with its own direct-attached or internal storage, all connected in a hyper-cluster. Magical software divides workload processing across tens, hundreds or more server nodes so the data on all of the nodes can be considered by software algorithms intended to provide rapid insights from fast-changing transactional data. Some object storage vendors apply much of the same idea to huge quantities of files: shelter in place and use either mirroring or erasure coding to protect the bits so they can be used in analytical or historical research.

However, there are flaws in these strategies, including exposure to big, or “Big D,” disasters that could wipe out both the original and mirrored copies of data. Since all of the data sits on the same raised floor, and often in the same server rack, it is all subject to the same flood, fire, electrical failure, network failure or any other outage event with a footprint wider than a simple hard drive or power supply failure.

The conceptualization of data

This is part of a broader problem in the conceptualization of data by many organizations. In big data analytics, the value of data is often described as accruing in the seconds or minutes following its creation. After a few minutes, the insight generated by the data is old news and new data has begun to shape a different insight. For instance, once a customer’s credit card has been validated and the purchase approved, that transaction is old news.

If data loses its value so rapidly, perhaps we need to rethink its retention and storage altogether. But, between the needs for regulatory compliance and historical record-keeping, the thought of throwing anything away is simply untenable.

Data classification

Truth be told, to deal with the zettabyte apocalypse, or to become more economical in our data protection strategy, what we really need to do is classify our data and associate the right protection strategy to the right data for the best possible fit. We need to determine how to move data across a storage infrastructure so that its operational utility can be fully gleaned and its historical preservation requirements can be met. Creating such workflows may involve different data protection services applied in different combinations to different data based on what the application or business process dictates.
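To make the idea concrete, here is a minimal sketch of what “associating the right protection strategy to the right data” could look like in Python. Everything in it is an assumption for illustration: the three tiers, the `Dataset` fields, the policy table and the sample datasets are hypothetical, and the middle tier in particular is invented. Only the service names (continuous data protection, mirroring, periodic backup, tape) come from the discussion above.

```python
from dataclasses import dataclass
from enum import Enum


class Criticality(Enum):
    """Hypothetical tiers; a real scheme would come from a business impact analysis."""
    MISSION_CRITICAL = "mission-critical"
    IMPORTANT = "important"
    HISTORICAL = "historical"


# Assumed policy table: which protection services apply to each tier.
PROTECTION_POLICY = {
    Criticality.MISSION_CRITICAL: ["continuous data protection", "mirroring", "periodic backup"],
    Criticality.IMPORTANT: ["periodic backup"],
    Criticality.HISTORICAL: ["tape archive"],
}


@dataclass
class Dataset:
    name: str
    application: str        # the application the data serves
    business_process: str   # the business process that application supports
    criticality: Criticality  # inherited from the application, not set per file


def protection_for(dataset: Dataset) -> list[str]:
    """Look up the protection services for a dataset from its inherited criticality."""
    return PROTECTION_POLICY[dataset.criticality]


# Made-up example datasets.
orders = Dataset("orders_db", "order processing", "sales", Criticality.MISSION_CRITICAL)
old_logs = Dataset("logs_2009", "web server", "marketing", Criticality.HISTORICAL)

for ds in (orders, old_logs):
    print(f"{ds.name}: {', '.join(protection_for(ds))}")
```

The point of the sketch is the direction of the lookup: the data is tied to an application, the application to a business process, and the protection services follow from that chain rather than from the storage device the bits happen to sit on.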

This brings us back to the premise of this blog. In too many cases, we aren’t doing the heavy lifting of data classification that must precede the selection of data protection methods or the purchase of data protection software and technology. Without having done the necessary task of associating data with application and application with business process, we can’t do an effective job of protecting our data assets – or at least not a cost-effective job.

The second truth is that, if you do a fairly good job of classifying assets, the effort will pay off in many ways. In addition to paving the way for better data protection from disasters, knowing your data will also allow you to platform the data more intelligently for operational needs, to design a more efficient governance and compliance strategy, to apply security protection judiciously, and to shave costs off your storage infrastructure investments by leveraging time-honored techniques for hierarchical storage management and archiving.

In fact, the money saved by using tape to archive the data that isn’t re-referenced after 30 days, which can be up to 70 percent of the total, could fund your entire data protection strategy. That’s something to think about on World Backup Day.

