
Ensuring data integrity is a fundamental requirement in any big data processing environment if it is to produce accurate outcomes. The same holds when moving data in and out of traditional data stores (RDBMS, document repositories, etc.) through various applications, whether that data travels over networks, between devices, or through ETL pipelines. In short, data integrity is the assurance that data remains accurate and consistent throughout its entire life cycle.

In a big data processing environment, data at rest is persisted in a distributed manner because of its sheer volume, which makes guaranteeing data integrity challenging. The Hadoop Distributed File System (HDFS) was designed to store any type of data in a distributed fashion by splitting large files into individual data blocks, and it commits to preserving the integrity of those blocks. Even so, blocks can become corrupt for several reasons, from faulty I/O operations on a disk to network failures.
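As a quick illustration of how a client can lean on these built-in checksums, the sketch below uses the Hadoop FileSystem API to fetch a file-level checksum and to enable checksum verification on reads. The class name and file path are hypothetical, and the example assumes a reachable cluster whose `fs.defaultFS` is configured in a `core-site.xml` on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsChecksumCheck {
    public static void main(String[] args) throws Exception {
        // Assumes the default HDFS file system is configured on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path used only for illustration.
        Path file = new Path("/data/input/sample.txt");

        // Ask HDFS for the file-level checksum, which is derived from the
        // per-block checksums maintained by the DataNodes.
        FileChecksum checksum = fs.getFileChecksum(file);
        System.out.println("Checksum algorithm: " + checksum.getAlgorithmName());
        System.out.println("Checksum value    : " + checksum);

        // With verification enabled (it is on by default), reads through this
        // FileSystem instance validate stored checksums and fail on corruption.
        fs.setVerifyChecksum(true);

        fs.close();
    }
}
```

On the command line, a similar health check can be run with `hdfs fsck /data/input -files -blocks`, which reports missing or corrupt blocks for the given path.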
