Next Generation Deduplication

With its acquisition of Santa Clara-based Kadena Systems, Arkeia is developing unique block-grain, content-aware, sliding-window data deduplication technology for backup. Block-grain deduplication compresses files by uncovering file elements (blocks) that are shared among multiple files, including multiple versions of the same file. Although block-grain deduplication is well understood in the data storage industry, Kadena's block-grain deduplication technology represents an improvement over traditional variable-block deduplication strategies.

Sliding Window Deduplication

Arkeia leverages Kadena's sliding-window approach to deduplication to maximize deduplication compression ratios. Unlike fixed-block deduplication technologies, sliding-window deduplication can efficiently deduplicate a new file that is created by inserting bytes into a known file. An insertion shifts all the data that follows it, so blocks cut at fixed offsets no longer line up with known blocks, whereas a sliding window can find known blocks at any offset. The key to efficient sliding-window deduplication is minimizing the CPU time needed to identify duplicate blocks. Kadena's approach is to leverage a unique progressive-matching strategy.
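The difference between fixed-block and sliding-window matching can be illustrated with a minimal sketch. This is not Kadena's implementation: the 8-byte block size, the use of SHA-256, and the function names are assumptions chosen for illustration only. The sliding-window version here is deliberately naive and hashes at every byte offset, which is exactly the CPU cost the text says must be minimized.

```python
import hashlib

BLOCK = 8  # hypothetical, unrealistically small block size for illustration


def digest(block: bytes) -> bytes:
    # SHA-256 is an assumed stand-in for whatever hash a real product uses.
    return hashlib.sha256(block).digest()


def fixed_blocks(data: bytes, size: int = BLOCK) -> list:
    # Cut the data at fixed offsets: 0, size, 2*size, ...
    return [data[i:i + size] for i in range(0, len(data), size)]


def known_pool(known: bytes, size: int = BLOCK) -> set:
    # Index every full-sized block of the known file by its hash.
    return {digest(b) for b in fixed_blocks(known, size) if len(b) == size}


def fixed_block_matches(known: bytes, new: bytes, size: int = BLOCK) -> int:
    # Count blocks of the new file, cut at fixed offsets, that are already known.
    pool = known_pool(known, size)
    return sum(1 for b in fixed_blocks(new, size) if digest(b) in pool)


def sliding_window_matches(known: bytes, new: bytes, size: int = BLOCK) -> int:
    # Slide a window one byte at a time; on a match, jump past the matched block.
    pool = known_pool(known, size)
    matches, i = 0, 0
    while i + size <= len(new):
        if digest(new[i:i + size]) in pool:
            matches += 1
            i += size
        else:
            i += 1
    return matches


known = bytes(range(64))                       # 8 distinct blocks of 8 bytes
new = known[:16] + b"\xff\xfe" + known[16:]    # two bytes inserted after block 2
print(fixed_block_matches(known, new))         # only blocks before the insertion
print(sliding_window_matches(known, new))      # blocks on both sides of it
```

In this example the fixed-block scan recognizes only the two blocks that precede the insertion, while the sliding window recovers all eight known blocks.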

Progressive Matching

Kadena's progressive-matching technology speeds up sliding-window deduplication by reducing the CPU time needed to determine whether a block is a duplicate. An early example of progressive-matching technology was the use of efficient checksumming to quickly identify new data blocks. Only if a block is a probable duplicate does Kadena calculate a more CPU-intensive hash to determine whether the candidate block matches a known block in the pool. By eliminating unnecessary hash calculations, Kadena's progressive matching drastically reduces the time required to deduplicate data.
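The two-stage idea described above, a cheap checksum first and an expensive hash only for probable duplicates, can be sketched as follows. The source does not name the checksum or hash Kadena uses; `zlib.adler32` and SHA-256 are assumed stand-ins, and the `BlockPool` class is hypothetical.

```python
import hashlib
import zlib


class BlockPool:
    """Hypothetical pool of known blocks with a two-stage duplicate check."""

    def __init__(self):
        self.by_checksum = {}       # cheap checksum -> set of strong hashes
        self.strong_hash_calls = 0  # counts the expensive computations

    def _strong(self, block: bytes) -> bytes:
        self.strong_hash_calls += 1
        return hashlib.sha256(block).digest()  # assumed strong hash

    def add(self, block: bytes) -> None:
        weak = zlib.adler32(block)              # assumed cheap checksum
        self.by_checksum.setdefault(weak, set()).add(self._strong(block))

    def is_duplicate(self, block: bytes) -> bool:
        candidates = self.by_checksum.get(zlib.adler32(block))
        if candidates is None:
            # Checksum never seen: the block is certainly new, so the
            # expensive hash is skipped entirely.
            return False
        # Probable duplicate: confirm with the strong hash.
        return self._strong(block) in candidates


pool = BlockPool()
pool.add(b"alpha block")
pool.add(b"beta block")
pool.strong_hash_calls = 0               # count only lookups from here on

print(pool.is_duplicate(b"alpha block"))  # known block, strong hash computed
print(pool.is_duplicate(b"gamma block"))  # new block, rejected by checksum alone
print(pool.strong_hash_calls)
```

Because most windows in a sliding-window scan do not match any known block, rejecting them on the cheap checksum alone is where the savings would come from in a design like this.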

Content-aware

The size of the sliding window is automatically adjusted according to the file type in order to optimize compression ratios and speed of deduplication. Different types of data deduplicate best at different window sizes. For example, JPEG files are poor candidates for block-grain deduplication because the JPEG compression process randomizes file content by eliminating virtually all patterns within the file. File-grain deduplication is still relevant because it will identify the same file, whatever its name. To gain the benefits of file-grain deduplication, Kadena sets the window size to the size of the file.

Data in databases generally has very high pattern content and is best deduplicated with a small sliding window. Word processing documents and slide presentations (e.g. Microsoft PowerPoint) compress best with a mid-sized sliding window.
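A content-aware window choice along these lines might look like the sketch below. The source gives no actual window sizes or file-type lists, so every constant, extension set, and function name here is a hypothetical placeholder; only the ordering (small for databases, mid-sized for documents, whole file for JPEG) comes from the text.

```python
import os

# Hypothetical sizes; the source only says "small" and "mid-sized".
SMALL_WINDOW = 4 * 1024
MID_WINDOW = 32 * 1024

# Hypothetical extension lists, for illustration only.
WHOLE_FILE_TYPES = {".jpg", ".jpeg"}
DATABASE_TYPES = {".db", ".mdf", ".dbf"}
DOCUMENT_TYPES = {".doc", ".docx", ".ppt", ".pptx"}


def window_size(filename: str, file_size: int) -> int:
    """Pick a deduplication window size from the file type."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in WHOLE_FILE_TYPES:
        # Already-compressed content: fall back to file-grain deduplication
        # by making the window as large as the file itself.
        return file_size
    if ext in DATABASE_TYPES:
        return SMALL_WINDOW
    if ext in DOCUMENT_TYPES:
        return MID_WINDOW
    return MID_WINDOW  # assumed default for unrecognized types


print(window_size("holiday.JPG", 2_500_000))
print(window_size("customers.db", 80_000_000))
print(window_size("quarterly.pptx", 6_000_000))
```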

Replication of Deduplicated Data

By combining deduplication with Arkeia's backup replication technology, Arkeia customers will be better equipped to protect distributed environments over WAN connections. Data is replicated in its deduplicated form, permitting consolidation of remote data to a central site or replication of mission-critical data to a remote disaster recovery site.
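Why replicating data in deduplicated form suits WAN links can be shown with a minimal sketch: only blocks the destination does not already hold need to cross the wire. This is an illustration under assumptions, not Arkeia's replication protocol; the dictionary-based store, the SHA-256 keys, and the function names are all hypothetical.

```python
import hashlib


def replicate(blocks, remote_store):
    """Send only blocks the remote store lacks.

    Returns the list of block keys needed to rebuild the data (the "recipe")
    and the number of payload bytes that actually had to be transferred.
    """
    recipe, bytes_sent = [], 0
    for block in blocks:
        key = hashlib.sha256(block).hexdigest()  # assumed block identifier
        if key not in remote_store:
            remote_store[key] = block            # block crosses the WAN once
            bytes_sent += len(block)
        recipe.append(key)
    return recipe, bytes_sent


def restore(recipe, store):
    # Rebuild the original data from the stored blocks.
    return b"".join(store[key] for key in recipe)


central = {}  # hypothetical central-site block store
recipe_a, sent_a = replicate([b"aaaa", b"bbbb", b"aaaa"], central)
recipe_b, sent_b = replicate([b"bbbb", b"cccc"], central)
print(sent_a, sent_b)
print(restore(recipe_a, central))
```

In this example the second replication transfers only the one block the central store has not seen, even though it comes from a different backup.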