Tech Careers: Know How to Say “Dedupe”
Published: Feb 23, 2010
“How would you handle the explosive growth of information?”
There are many good responses, but the following answer may be the best and most current:
“I’d use dedupe.”
Deduplication (dedupe for short) is a technology that shrinks the amount of digital information that actually gets stored. Think about it. If a corporation had to allocate enough storage to house a petabyte of digital information, what kind of cost savings could they achieve if they actually needed only 30% of that total?
First of all, they wouldn’t have to pay the up-front acquisition costs for all of that storage. Secondly, they wouldn’t have to spend as much money on administrative staff to manage all that storage. Thirdly, they’d spend less money on electricity costs to power all that storage.
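To put rough numbers on the petabyte example, here is a minimal sketch of the arithmetic. The function name `dedupe_savings` is made up for illustration, and 1 PB is treated as 1,000 TB (decimal units). The 30% figure is the one from the example above, not a measured result.

```python
def dedupe_savings(logical_tb: float, stored_fraction: float) -> tuple[float, float]:
    """Return (physical_tb, saved_tb) for a given logical capacity.

    stored_fraction is the share of the logical data that still has to be
    stored after dedupe (0.30 in the petabyte example).
    """
    if not 0.0 <= stored_fraction <= 1.0:
        raise ValueError("stored_fraction must be between 0 and 1")
    physical_tb = logical_tb * stored_fraction
    return physical_tb, logical_tb - physical_tb


LOGICAL_TB = 1000.0  # assumption: 1 PB expressed in decimal terabytes

physical, saved = dedupe_savings(LOGICAL_TB, 0.30)
print(f"Storage to buy: {physical:.0f} TB, storage avoided: {saved:.0f} TB")
```

Every terabyte in the "avoided" column is a terabyte that does not need to be purchased, managed, or powered.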
It’s easy to see why the promise of dedupe is so compelling. It’s also important to research the difference between dedupe and its close cousin, compression (a separate primer covers how compression differs from dedupe).
This article focuses on why dedupe is often used and describes one of the more common use cases where dedupe shines: backups. A future article will dive into the details of how it actually works.
The next logical question might be “how would you use dedupe?” First of all, you can describe the huge disk arrays that are ingesting enormous amounts of freshly generated digital bits. Then you can note that it often takes MORE storage to back up a disk array than the array itself holds, which brings you to your next statement:
“I’d start using dedupe to prevent backing up the same data over and over again.”
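The details of how dedupe works are left for a later post, but a toy sketch can make the "same data over and over" point concrete. Everything here is an assumption chosen for illustration: the `DedupeStore` class, fixed-size 4 KB chunks, SHA-256 fingerprints, and an in-memory dictionary. Real backup products may work quite differently.

```python
import hashlib


class DedupeStore:
    """Toy backup target that stores each unique chunk only once."""

    def __init__(self, chunk_size: int = 4096) -> None:
        self.chunk_size = chunk_size
        self.chunks: dict[str, bytes] = {}  # fingerprint -> chunk contents
        self.logical_bytes = 0              # total bytes sent to the store

    def backup(self, data: bytes) -> list[str]:
        """Ingest data and return the list of fingerprints needed to rebuild it."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Only keep the chunk if this fingerprint has not been seen before.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        self.logical_bytes += len(data)
        return recipe

    def restore(self, recipe: list[str]) -> bytes:
        """Rebuild the original data from its fingerprints."""
        return b"".join(self.chunks[digest] for digest in recipe)

    @property
    def physical_bytes(self) -> int:
        """Bytes actually held after dedupe."""
        return sum(len(chunk) for chunk in self.chunks.values())


# Four distinct 4 KB chunks stand in for a small volume.
volume = b"".join(bytes([i]) * 4096 for i in range(4))

store = DedupeStore()
first = store.backup(volume)   # first full backup
second = store.backup(volume)  # identical full backup a week later

print(store.logical_bytes, store.physical_bytes)
```

In this sketch the second full backup adds nothing to `physical_bytes`, because every chunk it contains is already in the store, yet both backups can still be restored in full.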
Consider the diagram below, which shows information in a disk array being sent to a backup device on a weekly schedule.
On the left we see our “snap copy” configuration from a previous post. Live, production data is periodically “frozen”, and a copy of it is accessed by a backup service (on the right-hand side). On Monday through Friday the backup service accesses the snap copy to pick up the daily, incremental changes to the information. These changes are then sent to a backup device. On Sunday, however, a FULL backup might be performed.
The backup service will certainly generate additional amounts of digital information. Without a disciplined and intelligent backup strategy, the amount of storage required for backup capacity can easily spiral out of control. When backup storage requirements are coupled with the amount of storage found in the primary disk array, it is easy to see why the promise of de-duplicating the information holds so much attraction.
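A back-of-the-envelope model shows how quickly this adds up under the weekly schedule described above (one full backup plus five daily incrementals). The function `backup_capacity_tb`, the 2% daily change rate, and the four-week retention period are all hypothetical values picked for illustration, and the model assumes no dedupe or compression at all.

```python
def backup_capacity_tb(
    primary_tb: float,
    daily_change_rate: float,
    weeks_retained: int,
    incrementals_per_week: int = 5,
) -> float:
    """Estimate backup storage with no dedupe.

    Each retained week holds one full copy of the primary data plus
    one incremental per weekday, each sized at daily_change_rate * primary_tb.
    """
    weekly_full = primary_tb
    weekly_incrementals = incrementals_per_week * daily_change_rate * primary_tb
    return weeks_retained * (weekly_full + weekly_incrementals)


# Assumed numbers: a 100 TB array, 2% daily change, 4 weeks of backups kept.
primary = 100.0
needed = backup_capacity_tb(primary, daily_change_rate=0.02, weeks_retained=4)
print(f"Primary: {primary:.0f} TB, backup capacity: {needed:.0f} TB")
```

With these assumed inputs the backup tier ends up several times larger than the primary array, and most of that capacity is taken up by weekly fulls that largely repeat one another.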
Our next post will dive into some of the details of de-duplication, and also discuss how de-duplication has found its way into primary disk arrays as well.
Steve
http://stevetodd.typepad.com
Twitter: @SteveTodd
EMC Intrapreneur