
The aim of this article is to share some of the most relevant guidelines for cloud-based big data projects, drawn from my recent engagements. The list is not exhaustive and may be extended according to each organization's or project's specifications.

Guidelines for Cloud-Based and Data-Based Projects

Data Storage

  • Use data partitioning and/or bucketing to reduce the amount of data to process (see the sketch after this list).
  • Use Parquet format to reduce the amount of data to store.
  • Prefer SNAPPY compression for frequently consumed data, and GZIP compression for data that is infrequently consumed.
  • Try as much as possible to store a few sufficiently large files instead of many small ones (on average 256 MB–1 GB per file) to improve read/write performance and reduce costs; the ideal file size depends on the use case and the underlying block storage file system.
  • Consider a table format such as Delta Lake or Apache Iceberg before building custom solutions to manage schema evolution and data updates.
  • “Design by query” can help improve consumption performance; for instance, you can store the same data in different layouts and at different depths depending on the consumption pattern.
  • Secure data stored on S3 with an access model adapted to your needs, along with versioning and archiving.
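
As a rough illustration of the partitioning, Parquet, and compression points above, here is a minimal PySpark sketch; the bucket names, column names, and table names are hypothetical and would need to be adapted to your project.

```python
# Minimal PySpark sketch illustrating partitioning, Parquet, and compression.
# Bucket, column, and table names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-guidelines").getOrCreate()

events = spark.read.json("s3a://my-bucket/raw/events/")  # hypothetical raw input

# Partition by low-cardinality columns so consumers scan only the relevant folders,
# and write Parquet with snappy compression for frequently consumed data.
(
    events
    .repartition("event_date")                   # limits the number of small files per partition
    .write
    .mode("overwrite")
    .partitionBy("event_date", "country")
    .option("compression", "snappy")             # switch to "gzip" for rarely consumed data
    .parquet("s3a://my-bucket/curated/events/")
)

# Bucketing requires writing to a metastore-backed table rather than a raw path.
(
    events
    .write
    .mode("overwrite")
    .bucketBy(16, "user_id")
    .sortBy("user_id")
    .option("compression", "snappy")
    .saveAsTable("curated.events_bucketed")
)
```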

Data Processing

  • When developing distributed applications, rethink your code to avoid data shuffling as much as possible, as it degrades performance.
  • Broadcasting small tables can help achieve better join performance (see the sketch after this list).
  • Once again, use the Parquet format to reduce the amount of data to process, thanks to predicate pushdown and projection pushdown.
  • When consuming data, use native data-access protocols as much as possible to stay close to the data and avoid unnecessary calls and protocol overhead.
  • Before choosing a computation framework, first identify whether your problem needs to be solved using parallelization or distribution.
  • Consider merging and compacting your files to improve read performance and reduce costs (Delta.io can help achieve that).
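
The sketch below, reusing the hypothetical layout from the storage example, shows how filtering and selecting early lets Parquet benefit from predicate and projection pushdown, and how broadcasting a small reference table avoids shuffling the large dataset.

```python
# Minimal PySpark sketch; paths and column names reuse the hypothetical
# layout from the storage example above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("processing-guidelines").getOrCreate()

# Filtering on the partition column and selecting only the needed columns
# lets Spark prune partitions and push predicates/projections down to Parquet.
events = (
    spark.read.parquet("s3a://my-bucket/curated/events/")
    .filter(col("event_date") == "2023-01-01")
    .select("user_id", "country", "amount")
)

countries = spark.read.parquet("s3a://my-bucket/ref/countries/")  # small lookup table

# Broadcasting the small table turns the join into a map-side join
# and avoids shuffling the large events dataset.
enriched = events.join(broadcast(countries), on="country", how="left")

enriched.groupBy("country").sum("amount").show()
```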

Data Locality

  • Move the processing next to the data, not the opposite; the data is generally much larger than the jobs or scripts that process it.
  • Process data in the cloud and extract only the most relevant and necessary results (see the sketch after this list).
  • Limit inter-region transfers.
  • Avoid moving data between infrastructures, proxies, and patterns as much as possible.
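
To illustrate the "process in the cloud, extract only what you need" point, here is a small sketch under the same hypothetical layout: the aggregation runs where the data lives, and only a handful of result rows leave the cluster.

```python
# Minimal PySpark sketch: aggregate close to the data and pull back only
# the small result instead of downloading the raw dataset.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as sum_

spark = SparkSession.builder.appName("locality-guidelines").getOrCreate()

daily_totals = (
    spark.read.parquet("s3a://my-bucket/curated/events/")  # hypothetical path
    .filter(col("event_date") >= "2023-01-01")
    .groupBy("event_date")
    .agg(sum_("amount").alias("total_amount"))
)

# Only the aggregated rows are transferred out of the cluster/region.
result = daily_totals.collect()
```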

Various

  • Use a data lake for analytical & exploratory use cases and use operational databases for operational ones.
  • To ingest data, prefer config-based solutions such as AWS DMS or Debezium over custom solutions, and prefer CDC solutions for long-running ingestion (see the sketch after this list).
  • Keep structures (tables, prefixes, paths…) that are not intended to be shared private.
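
As an example of config-based CDC ingestion, the sketch below registers a hypothetical Debezium PostgreSQL connector through the Kafka Connect REST API; the endpoint, credentials, and property values are placeholders, and property names may vary with your Debezium version.

```python
# Hypothetical sketch: registering a Debezium PostgreSQL connector via the
# Kafka Connect REST API. Endpoint, credentials, and values are placeholders.
import requests

connector = {
    "name": "orders-cdc",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "orders-db.example.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "change-me",
        "database.dbname": "orders",
        "topic.prefix": "orders",  # "database.server.name" on older Debezium versions
        "table.include.list": "public.orders",
    },
}

response = requests.post(
    "http://kafka-connect.example.internal:8083/connectors",
    json=connector,
    timeout=30,
)
response.raise_for_status()
```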

Source: DZone