Data de-duplication has been in the lexicon of storage administrators for almost a decade, and various implementations of the technology have come and gone. Software implementations have proven somewhat more durable than hardware appliances, primarily because the latter proved to have all of the disadvantages of monolithic “islands of storage” arrays with none of the benefits.
Clearly, if de-duplication is to be truly useful, it needs to be applied on a grand scale rather than narrowly, to backups or archival data repositories. It is understandable why de-dupe vendors went for the low-hanging fruit of backups: successive backups tend to contain a significant percentage of the same bits, providing a clear-cut example of space wasted by unnecessary data replication. The fact is, however, that the industry was already pushing users toward disk-to-disk replication, rather than backups, for bit protection. So the efficacy of the de-dupe backup scenario, it can be argued, was time-limited from the moment it was introduced.
Early Years of De-Duplication
Early de-duplication products also ran afoul of marketecture: outrageous claims by vendor marketing departments of data reduction ratios that rarely materialized outside the vendor’s own test lab. To realize a 70:1 reduction ratio (the figure used to justify the extraordinarily high prices many vendors charged for their de-duplicating storage arrays), data needed to consist of ASCII text files and spreadsheets with no graphics. Database data was already considered compressed, and “rich media” such as video and graphics were simply ignored by many de-duplication and compression algorithms.
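Why reduction ratios vary so wildly with data type can be seen in a minimal sketch of fixed-block de-duplication. This is an illustrative toy, not any vendor’s actual algorithm; the 4 KB block size and SHA-256 fingerprinting are assumptions chosen for clarity:

```python
import hashlib
import os

def dedupe_ratio(data: bytes, block_size: int = 4096) -> float:
    """Naive fixed-block de-duplication: fingerprint each block,
    store only unique blocks, and report logical/stored ratio."""
    seen = set()          # fingerprints of blocks already "stored"
    total_blocks = 0
    stored_blocks = 0
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).digest()
        total_blocks += 1
        if digest not in seen:
            seen.add(digest)
            stored_blocks += 1
    return total_blocks / max(stored_blocks, 1)

# Highly repetitive data, like successive full backups of the same
# files, de-duplicates extremely well...
repetitive = b"Q" * 4096 * 100
print(dedupe_ratio(repetitive))    # -> 100.0 (100 blocks, 1 unique)

# ...while random or already-compressed data barely de-duplicates,
# since nearly every block is unique.
random_like = os.urandom(4096 * 100)
print(dedupe_ratio(random_like))   # -> 1.0 (every block unique)
```

The gap between those two numbers is the gap between a vendor’s lab demo and a real mixed workload: the ratio is a property of the data’s redundancy, not of the appliance.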
The Legal Concerns of De-Duplication
In a growing number of companies, especially those that are publicly traded and must file periodic financial reports with government regulatory bodies (e.g., the Securities and Exchange Commission), de-duplication has come under increased scrutiny. Some of my financial industry clients won’t touch the technology at all, while others are limiting the use of de-duplication to data sets that are not subject to regulatory or legal governance.
Here’s the concern, as spelled out to me by the CIO of one of the largest financial firms in the country. The SEC requires a “full and unaltered” copy of quarterly and annual data to be filed over the course of the year. The worry is that an irate customer or shareholder suing the firm, for whatever reason, might raise de-duplicated data as a sidebar issue.
Step 1: The plaintiff asks the company lawyer for the past ten years of SEC filings. The company lawyer says that he will oblige the plaintiff.
Step 2: Plaintiff’s attorney asks, almost as an aside, whether the company lawyer knows if any of the data in the financials has ever been de-duplicated. The company lawyer doesn’t know what de-duplication is, but promises to ask the IT department.
Step 3: The next day, the lawyer reports that, yes, some of the data in the financial reports may have been retrieved from a de-duplicated backup somewhere along the way. This causes plaintiff’s counsel to request a sidebar with the judge to explore whether the data complies with SEC rules and whether it is admissible in the current case.
Step 4: Since neither compression nor de-duplication technology has ever been determined to be legally permissible or “in compliance” with legal or regulatory mandates, the question is taken up by the court. Both sides engage “experts” who testify that the technologies do or do not “materially alter” data. The debate over the issue drags on for weeks until…
Whether or not data de-duplication is found to be a compliant technology, reports the CIO, the price to litigate the issue will easily reach tens of millions of dollars. His company does not want to be the test case, so they are avoiding de-duplication technology for all financial data.
This decision requires the client to segregate his storage infrastructure, applying storage “services” such as de-duplication and compression only to select data and excluding other data from the technology altogether. Others may see this as an overreaction, but it is a sober one. Not long after the issue was raised, I solicited an opinion from the Institute of Internal Auditors in New York, asking whether de-duplication technology represented a potential problem from the standpoint of legal or regulatory compliance. Their response was simply, “What’s de-duplication?”
For more information about the role of de-duplication, its potential complications within the legal system, and how it works best for IT departments, stay tuned for my next blog.
In the meantime, learn more about how you can keep your data storage activities compliant. See our eBook: The Lowdown on Compliant Data Storage.