Tech Careers: Dedupe Overview

Published:  Mar 02, 2010

“I’ve seen that before”.

This sentence, in essence, describes a technology known as de-duplication. In the previous post we described the importance of “dedupe”. The amount of world-wide digital data being generated daily is enormous, and dedupe is a strategic technology used to reduce the amount of information that actually gets stored.

If you don’t know anything about this technology, you are at a disadvantage as an interview candidate. If you already have a high-tech job, you need to add dedupe to your knowledge portfolio.

At a very high level, you can think of dedupe as a “pattern recognition” technique. Consider the following diagram, which depicts a data set being sent through a dedupe service.

Dedup Recognizes and Reduces Identical Data

Consider our foundational diagram that depicts an application sending information to a persistent storage device. A Dedup Service scans every bit of information that flows from the application and looks for identical bit patterns. When it recognizes a pattern it has already seen, the service “dedupes” the data: the repeated copy is not stored persistently. In the diagram above, the bit pattern “0101” occurs four times, but it is stored only once. This example results in a 25% capacity savings on disk. For huge data sets, a 25% capacity savings is significant.
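To make the idea concrete, here is a minimal sketch of a dedupe service in Python. It is an illustration under stated assumptions, not how any particular product works: it cuts the data into fixed-size chunks, fingerprints each chunk with SHA-256, and stores a chunk only the first time its fingerprint is seen. The function names (`dedupe`, `restore`, `savings`) and the 12-pattern data set are hypothetical, the patterns are ASCII text standing in for bits, and the space needed to remember the order of the chunks is ignored.

```python
import hashlib


def dedupe(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and keep each distinct chunk once."""
    store = {}   # fingerprint -> chunk bytes (what actually lands on disk)
    recipe = []  # ordered fingerprints needed to rebuild the original data
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in store:      # first time this pattern is seen: store it
            store[fp] = chunk
        recipe.append(fp)        # repeats only add a reference
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the stored chunks."""
    return b"".join(store[fp] for fp in recipe)


def savings(data: bytes, store) -> float:
    """Fraction of capacity saved, ignoring the size of the recipe."""
    stored = sum(len(chunk) for chunk in store.values())
    return 1 - stored / len(data)


# Hypothetical data set: 12 patterns, with "0101" appearing four times.
patterns = [b"0101", b"0000", b"0101", b"0001", b"0010", b"0101",
            b"0011", b"0100", b"0110", b"0101", b"0111", b"1000"]
data = b"".join(patterns)

store, recipe = dedupe(data)
print(len(store), "chunks stored out of", len(recipe))
print("capacity savings:", savings(data, store))
```

With this made-up data set, 9 of the 12 chunks are stored and the savings work out to 25%. Real services differ in how they choose chunk boundaries and fingerprints, so treat the numbers as an illustration only.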

This example also makes it clear that backup applications (which often back up the same data over and over again) are a great use case for this technology.

The dedup service can be implemented in different locations, as depicted in the following diagram.

Dedup in Server versus Storage

Why implement dedup in one location versus another? As seen on the left, running dedup in the server results in less data being sent to a storage device, but more CPU time being spent in the server (remember that dedup must spend CPU time examining every bit of information). On the right, the server spends no time doing dedup but must send every bit of information over to the storage device. Customers often choose one dedup method over another based on their business requirements.
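The trade-off can be sketched with a small, simplified model. Everything here is an assumption for illustration: the function names are hypothetical, "CPU time" is approximated by counting how many fingerprints the server computes, and the traffic needed to exchange fingerprints between server and storage is ignored.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


def dedupe_in_server(chunks):
    """Server fingerprints every chunk and ships only chunks not sent before."""
    seen = set()
    server_hashes = 0
    bytes_sent = 0
    for chunk in chunks:
        fp = fingerprint(chunk)   # CPU work happens on the server
        server_hashes += 1
        if fp not in seen:
            seen.add(fp)
            bytes_sent += len(chunk)
    # Only unique chunks cross the wire, so sent and stored are the same.
    return {"server_hashes": server_hashes,
            "bytes_sent": bytes_sent,
            "bytes_stored": bytes_sent}


def dedupe_in_storage(chunks):
    """Server ships everything; the storage device does the fingerprinting."""
    stored = {}
    bytes_sent = 0
    for chunk in chunks:
        bytes_sent += len(chunk)                      # every chunk is sent
        stored.setdefault(fingerprint(chunk), chunk)  # CPU work in storage
    return {"server_hashes": 0,
            "bytes_sent": bytes_sent,
            "bytes_stored": sum(len(c) for c in stored.values())}


# Hypothetical workload: 12 four-byte chunks, "0101" repeated four times.
chunks = [b"0101", b"0000", b"0101", b"0001", b"0010", b"0101",
          b"0011", b"0100", b"0110", b"0101", b"0111", b"1000"]

in_server = dedupe_in_server(chunks)
in_storage = dedupe_in_storage(chunks)
print("server: ", in_server)
print("storage:", in_storage)
```

In this toy workload both placements end up storing 36 bytes. The server-side version sends 36 bytes but computes 12 fingerprints on the server, while the storage-side version sends all 48 bytes and leaves the server's CPU alone.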

I encourage high-tech job seekers to learn more about de-duplication (especially what’s going on under the covers). For a real-world example of a dedup service running inside of an application server, I recommend reading about EMC’s Avamar product. For dedup running inside of a disk array, study EMC’s Data Domain product.

Steve
http://stevetodd.typepad.com
Twitter: @SteveTodd
EMC Intrapreneur

***