Deduplication – A Quick Tutorial
This one is a logical next step after the previous tutorial on virtualization, where we discussed what virtualization is and how it is applied across the enterprise and consumer ecosystems. Deduplication is often said to be closely associated with virtualization, and a mere mention of deduplication tends to gather some good press (hopefully good press!) for your product announcement. I am writing this with the same helplessness I felt while learning about virtualization, so once again we will call it “Pre-Reading before you go on to bigger concepts in Deduplication”.
What is Deduplication?
Data deduplication (also known as intelligent compression or single-instance storage) is a method of reducing storage needs by eliminating redundant data. Only one unique instance of the data is actually retained on storage media, such as disk or tape. Redundant data is replaced with a pointer to the unique data copy.
In English: as a (very crude) example, say you have a 5 MB mp3 of The Who's Baba O’Riley saved on your laptop, and you have the same song stored in 10 different folders. If you take a backup, all 10 instances of the same file are saved, requiring 50 MB of storage space. What deduplication can do for you is store only one instance of the song and reference each subsequent instance back to that one saved copy. This might seem lame, as we are saving only 45 MB. Now assume that you have 1000 instances of the same song, or more, and you’ll see the difference.
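The mp3 example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the files live in an in-memory dictionary, the function name `dedupe_files` and the folder names are made up for this sketch, and SHA-256 is used simply as a convenient content fingerprint.

```python
import hashlib

def dedupe_files(files):
    """files maps a path to its bytes; returns (store, pointers)."""
    store = {}      # fingerprint -> content, one copy per unique content
    pointers = {}   # path -> fingerprint (the "reference back" to the copy)
    for path, content in files.items():
        fingerprint = hashlib.sha256(content).hexdigest()
        store.setdefault(fingerprint, content)  # keep only the first copy
        pointers[path] = fingerprint
    return store, pointers

MB = 1024 * 1024
song = b"\x00" * (5 * MB)  # stand-in for a 5 MB mp3
files = {f"folder{i}/baba_oriley.mp3": song for i in range(10)}

store, pointers = dedupe_files(files)
raw_mb = sum(len(c) for c in files.values()) / MB
stored_mb = sum(len(c) for c in store.values()) / MB
print(f"without dedupe: {raw_mb} MB, with dedupe: {stored_mb} MB")
```

All ten paths still resolve to the song through `pointers`, but only one 5 MB copy sits in `store`.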
Data deduplication offers other benefits as well. It reduces storage space requirements, which turns out to be a big saving on disk expenditure. The more efficient use of disk space also allows for longer disk retention periods, which provides better recovery time objectives for a longer time and reduces the need for tape backups. Data deduplication also reduces the data that must be sent across a wide area network (WAN) for remote backups, replication, and disaster recovery.
Data deduplication can generally operate at the file level (e.g. movie files, mp3s), the block level (e.g. a chunk of a file, which might be 512 KB or so), and even the bit level (the bits arranged on the storage device to make up the file). File deduplication eliminates duplicate files (such as “The Who - Baba O’Riley” above), but this is not a very efficient means of deduplication. Block and bit deduplication look within a file and save unique iterations of each block or bit. Each chunk of data is processed using a hash algorithm such as MD5 or SHA-1. This process generates a number for each piece that is almost always unique, and that number is then stored in an index.
If a file is updated, only the changed data is saved. That is, if only a few bytes of a document or presentation are changed, only the changed blocks or bytes are saved; the changes don’t constitute an entirely new file. This behavior makes block and bit deduplication far more efficient. However, block and bit deduplication take more processing power and use a much larger index to track the individual pieces.
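A rough sketch of block-level deduplication follows. It assumes fixed-size blocks (a tiny 4-byte block size so the effect is visible), an in-memory dictionary as the index, and SHA-1 as the hash; the names `store_blocks` and `restore` are hypothetical. Real products may chunk and index very differently.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny for the demo; real systems use far larger blocks

def store_blocks(data, index, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks, store unseen blocks, return the recipe."""
    recipe = []  # ordered list of block hashes needed to rebuild the data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha1(block).hexdigest()
        if digest not in index:      # only new, unique blocks are saved
            index[digest] = block
        recipe.append(digest)
    return recipe

def restore(recipe, index):
    """Rebuild the original bytes by following the recipe through the index."""
    return b"".join(index[digest] for digest in recipe)

index = {}
version1 = b"AAAABBBBCCCCDDDD"
version2 = b"AAAABBBBXXXXDDDD"   # same file with one block changed

recipe1 = store_blocks(version1, index)
blocks_after_v1 = len(index)
recipe2 = store_blocks(version2, index)
blocks_after_v2 = len(index)
print(blocks_after_v1, blocks_after_v2)
```

Storing the second version adds just one block to the index, since three of its four blocks were already there.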
Hash collisions are a potential problem with deduplication. When a piece of data receives a hash number, that number is then compared with the index of other existing hash numbers. If that hash number is already in the index, the piece of data is considered a duplicate and does not need to be stored again. Otherwise, the new hash number is added to the index and the new data is stored. In rare cases, the hash algorithm may produce the same hash number for two different chunks of data. When a hash collision occurs, the system won’t store the new data because it sees that its hash number already exists in the index. This is called a false positive, and it can result in data loss. Some vendors combine hash algorithms to reduce the possibility of a hash collision. Some vendors also examine metadata to identify data and prevent collisions.
In the real world, data deduplication is used in combination with other forms of data reduction such as data compression and delta differencing. Taken together, these three techniques can be very effective at optimizing the use of storage space.
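To make the false positive concrete, here is a small sketch that uses two deliberately weak toy hashes so that a collision is easy to trigger. Both `sum_hash` and `pos_hash` are invented for this illustration and are nothing like MD5 or SHA-1; the point is only to show how combining two hashes can separate chunks that a single hash confuses. A combined key lowers the odds of a collision but does not remove them.

```python
def sum_hash(chunk):
    """Toy hash: sum of the bytes, so reordered bytes collide."""
    return sum(chunk) % 256

def pos_hash(chunk):
    """Second toy hash: position-weighted sum, sensitive to byte order."""
    return sum(i * b for i, b in enumerate(chunk, start=1)) % 256

def combined_hash(chunk):
    """Key built from both hashes, as some vendors are said to do."""
    return (sum_hash(chunk), pos_hash(chunk))

def store(index, chunk, hash_fn):
    """Store chunk unless its hash is already indexed; return True if stored."""
    key = hash_fn(chunk)
    if key in index:
        return False          # treated as a duplicate, nothing is written
    index[key] = chunk
    return True

single = {}
first_stored = store(single, b"ab", sum_hash)
second_stored = store(single, b"ba", sum_hash)    # different data, same hash

combined = {}
both_stored = [store(combined, c, combined_hash) for c in (b"ab", b"ba")]
print(first_stored, second_stored, both_stored)
```

With the single toy hash, `b"ba"` is silently dropped as a "duplicate" of `b"ab"`, which is exactly the data-loss case described above.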
Dedupe Drivers
- Massive data growth (YouTube, Web 2.0, lots of email, movies, BitTorrent, free mp3s)
- Tape has been the only cost-effective option for storing massive amounts of backup and archive data. Experts say the odds of recovery from a given tape backup are about 90%.
- Proliferation: Cheaper Internet access
- $/GB rock bottom
- Blogs – everyone is an expert
- Suddenly, green is the new mantra!
- Global warming
- Disruptive technology
- Virtualization
This is a fairly large list, but I hope it gives us the idea.
Why Deduplication
- Eliminates redundant data, which can significantly shrink storage requirements and improve bandwidth efficiency. (Operations like backup and archiving store extremely redundant information.)
- Primary storage has gotten cheaper over time, and enterprises typically store many versions of the same information so that re-use of old data is possible.
- Deduplication lowers storage costs since fewer disks are needed, and shortens backup/recovery times since there can be far less data to transfer.
- Again in the context of backup and other nearline data, we can make a strong supposition that there is a great deal of duplicate data.
- The same data keeps getting stored over and over again, consuming a lot of unnecessary storage space (disk or tape), electricity (to power and cool the disk or tape drives), and bandwidth (for replication), and creating a chain of cost and resource inefficiencies within the organization.

[Figure not available. Source: TheInfoPro, 2008]
Buzzwords
Source or Destination Deduplication
Source Deduplication
Source deduplication refers to the comparison of data objects (logical pointers) at the source, before they are sent to a destination (usually a data backup destination). The advantage of source deduplication is that less data needs to be transmitted and stored at the destination point. The disadvantage is that the deduplication catalog and indexing components are dispersed over the network, so deduplication can become more difficult to administer.
Destination Deduplication
Destination deduplication refers to the comparison of data objects after they arrive at the destination point. The advantage of destination deduplication is that all the deduplication management components are centralized. The disadvantage is that the entire data object must be transmitted over the network before deduplicating.
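The difference in network traffic between the two approaches can be sketched as follows. This is a simplified model with assumptions of its own: the "network" is just a byte counter, the destination's catalog is a plain dictionary that the source can consult directly, and the small cost of exchanging hashes is ignored. The function names are hypothetical.

```python
import hashlib

def fingerprint(chunk):
    return hashlib.sha256(chunk).hexdigest()

def destination_dedup(chunks, dest_index):
    """Send every chunk, then let the destination discard duplicates."""
    bytes_sent = 0
    for chunk in chunks:
        bytes_sent += len(chunk)                      # everything crosses the network
        dest_index.setdefault(fingerprint(chunk), chunk)
    return bytes_sent

def source_dedup(chunks, dest_index):
    """Compare at the source and send only chunks the destination lacks."""
    bytes_sent = 0
    for chunk in chunks:
        key = fingerprint(chunk)
        if key not in dest_index:   # stands in for a catalog lookup (hash traffic ignored)
            bytes_sent += len(chunk)
            dest_index[key] = chunk
    return bytes_sent

chunks = [b"A" * 1000, b"B" * 1000, b"A" * 1000, b"A" * 1000]
dest_a, dest_b = {}, {}
sent_destination = destination_dedup(chunks, dest_a)
sent_source = source_dedup(chunks, dest_b)
print(sent_destination, sent_source)
```

Both destinations end up storing the same two unique chunks; the approaches differ only in how many bytes had to travel to get there.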
Deduplication Space Savings
Deduplication vendors often claim that their products offer 20:1, 50:1, or even greater data reduction ratios. These claims actually refer to the “time-based” space savings effect of deduplication on repetitive data backups. Because the backups contain mostly unchanged data, once the first full backup has been stored, all subsequent full backups see a very high occurrence of deduplication, much like an incremental backup in which only new and changed data is transmitted. But what if the business doesn’t retain as many as 64 backup copies? What if the backups have a higher change rate? Given that space savings numbers from a vendor’s marketing department often don’t represent a real-life environment, what should be expected for space savings on backup data sets?
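A back-of-the-envelope model shows how strongly the ratio depends on retention and change rate. The model below is an assumption made for this sketch, not a vendor formula: every full backup is the same size, the first one is stored whole, and each later backup adds only a fixed fraction (`change_rate`) of new, unique data.

```python
def dedup_ratio(num_backups, change_rate):
    """Logical data divided by stored data under the simple model above."""
    logical = num_backups                          # N full backups, in units of one backup
    stored = 1 + (num_backups - 1) * change_rate   # first full copy plus the changes
    return logical / stored

for backups, rate in [(64, 0.02), (8, 0.02), (64, 0.10)]:
    print(f"{backups} backups, {rate:.0%} change: {dedup_ratio(backups, rate):.1f}:1")
```

Under these assumptions, keeping fewer backup copies or having a higher change rate pulls the ratio well below the headline figures.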
How Does Deduplication Work?
Regardless of OS, application, or file system type, all data objects are written to a storage system using a data reference pointer, without which the data could not be referenced or retrieved. In traditional (non-deduplicated) file systems, data objects are stored without regard to any similarity with other objects in the same file system. In a deduplicated file system, two new and important concepts are introduced:
- A catalog of all data objects is maintained. This catalog contains a record of all data objects using a “hash” that identifies the unique contents of each object.
- The file system is capable of allowing many data pointers to reference the same physical data object.
Referencing data objects, comparing the objects, and redirecting reference pointers form the basis of the deduplication algorithm.
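The two concepts, a catalog keyed by hash and many pointers to one physical object, can be sketched as a tiny in-memory store. Everything here is hypothetical and simplified: the class name `DedupStore`, the reference counting used to decide when an object can be freed, and the choice of SHA-256 are assumptions of this sketch rather than a description of any real file system.

```python
import hashlib

class DedupStore:
    def __init__(self):
        self.catalog = {}    # hash -> [data, reference count]
        self.pointers = {}   # object name -> hash of its contents

    def write(self, name, data):
        if name in self.pointers:        # overwriting: drop the old reference first
            self.delete(name)
        digest = hashlib.sha256(data).hexdigest()
        entry = self.catalog.get(digest)
        if entry is None:
            self.catalog[digest] = [data, 1]   # new unique object
        else:
            entry[1] += 1                      # redirect pointer to the existing object
        self.pointers[name] = digest

    def read(self, name):
        return self.catalog[self.pointers[name]][0]

    def delete(self, name):
        digest = self.pointers.pop(name)
        entry = self.catalog[digest]
        entry[1] -= 1
        if entry[1] == 0:                # last pointer gone: free the physical object
            del self.catalog[digest]

fs = DedupStore()
fs.write("a.txt", b"hello")
fs.write("b.txt", b"hello")   # same contents, so no second physical copy
fs.write("c.txt", b"world")
print(len(fs.pointers), "names,", len(fs.catalog), "physical objects")
```

Deleting `a.txt` leaves `b.txt` readable, because the shared object is only freed once no pointer references it.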
Deduplication is without a doubt one of this year’s hottest topics in data storage. The rationale behind deduplication is simple: eliminate your duplicate data and reduce the capacity needed during backups and other data copy activities. Unfortunately, the many different deduplication approaches from various vendors, with much hype about their unique benefits, can leave users bewildered. As they consider the variety of deduplication offerings, they often fail to understand the basic design nuances that are important to them. This paper looks beyond the hype and focuses on the important design aspects of deduplication, giving evaluators the information they need to make informed decisions when examining deduplication solutions.