
The aim of this article is to share some of the most relevant guidelines for cloud-based big data projects, drawn from my recent engagements. The list is not exhaustive and may be extended according to each organization's or project's specifications.

Guidelines for Cloud-Based and Data-Based Projects

Data Storage

  • Use data partitioning and/or bucketing to reduce the amount of data to process (see the sketch after this list).
  • Use Parquet format to reduce the amount of data to store.
  • Prefer SNAPPY compression for frequently consumed data, and GZIP compression for data that is infrequently consumed.
  • Try as much as possible to store a few sufficiently large files instead of many small ones (on average 256 MB–1 GB per file) to improve read/write performance and reduce costs; the ideal file size depends on the use case and the underlying block storage file system.
  • Consider a table format such as Delta Lake or Apache Iceberg before building custom solutions to manage schema evolution and data updates.
  • “Design by query” can help improve consumption performance; for instance, you can store the same data in different layouts and at different depths depending on the consumption pattern.
  • Secure data stored on S3 with an access model adapted to your needs, along with versioning and archiving.
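
As a rough illustration of the partitioning, Parquet, and compression points above, here is a minimal PySpark sketch; the bucket names, column names, and table names are hypothetical and would need to be adapted to your project.

```python
# Minimal PySpark sketch illustrating partitioning, Parquet, and compression.
# Bucket, column, and table names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-guidelines").getOrCreate()

events = spark.read.json("s3a://my-bucket/raw/events/")  # hypothetical raw input

# Partition by low-cardinality columns so consumers scan only the relevant folders,
# and write Parquet with snappy compression for frequently consumed data.
(
    events
    .repartition("event_date")                   # limits the number of small files per partition
    .write
    .mode("overwrite")
    .partitionBy("event_date", "country")
    .option("compression", "snappy")             # switch to "gzip" for rarely consumed data
    .parquet("s3a://my-bucket/curated/events/")
)

# Bucketing requires writing to a metastore-backed table rather than a raw path.
(
    events
    .write
    .mode("overwrite")
    .bucketBy(16, "user_id")
    .sortBy("user_id")
    .option("compression", "snappy")
    .saveAsTable("curated.events_bucketed")
)
```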

Data Processing

  • When developing distributed applications, rethink your code to avoid data shuffling as much as possible, as it degrades performance.
  • Broadcasting small tables can help achieve better join performance (see the sketch after this list).
  • Once again, use the Parquet format to reduce the amount of data to process, thanks to predicate pushdown and projection pushdown.
  • When consuming data, use native data-access protocols as much as possible to stay close to the data and avoid unnecessary calls and protocol overhead.
  • Before choosing a computation framework, first identify whether your problem needs to be solved using parallelization or distribution.
  • Consider merging and compacting your files to improve read performance and reduce costs (Delta.io can help achieve that).
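
The sketch below, reusing the hypothetical layout from the storage example, shows how filtering and selecting early lets Parquet benefit from predicate and projection pushdown, and how broadcasting a small reference table avoids shuffling the large dataset.

```python
# Minimal PySpark sketch; paths and column names reuse the hypothetical
# layout from the storage example above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("processing-guidelines").getOrCreate()

# Filtering on the partition column and selecting only the needed columns
# lets Spark prune partitions and push predicates/projections down to Parquet.
events = (
    spark.read.parquet("s3a://my-bucket/curated/events/")
    .filter(col("event_date") == "2023-01-01")
    .select("user_id", "country", "amount")
)

countries = spark.read.parquet("s3a://my-bucket/ref/countries/")  # small lookup table

# Broadcasting the small table turns the join into a map-side join
# and avoids shuffling the large events dataset.
enriched = events.join(broadcast(countries), on="country", how="left")

enriched.groupBy("country").sum("amount").show()
```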

Data Locality

  • Move the processing next to the data, not the opposite; the data is generally much larger than the jobs or scripts that process it.
  • Process data in the cloud and extract only the most relevant and necessary results (see the sketch after this list).
  • Limit inter-region transfers.
  • Avoid moving data between infrastructures, proxies, and patterns as much as possible.
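
To illustrate the "process in the cloud, extract only what you need" point, here is a small sketch under the same hypothetical layout: the aggregation runs where the data lives, and only a handful of result rows leave the cluster.

```python
# Minimal PySpark sketch: aggregate close to the data and pull back only
# the small result instead of downloading the raw dataset.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as sum_

spark = SparkSession.builder.appName("locality-guidelines").getOrCreate()

daily_totals = (
    spark.read.parquet("s3a://my-bucket/curated/events/")  # hypothetical path
    .filter(col("event_date") >= "2023-01-01")
    .groupBy("event_date")
    .agg(sum_("amount").alias("total_amount"))
)

# Only the aggregated rows are transferred out of the cluster/region.
result = daily_totals.collect()
```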

Various

  • Use a data lake for analytical & exploratory use cases and use operational databases for operational ones.
  • To ingest data, prefer config-based solutions such as AWS DMS or Debezium over custom solutions, and prefer CDC solutions for long-running ingestion (see the sketch after this list).
  • Keep structures (tables, prefixes, paths…) that are not intended to be shared private.
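
As an example of config-based CDC ingestion, the sketch below registers a hypothetical Debezium PostgreSQL connector through the Kafka Connect REST API; the endpoint, credentials, and property values are placeholders, and property names may vary with your Debezium version.

```python
# Hypothetical sketch: registering a Debezium PostgreSQL connector via the
# Kafka Connect REST API. Endpoint, credentials, and values are placeholders.
import requests

connector = {
    "name": "orders-cdc",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "orders-db.example.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "change-me",
        "database.dbname": "orders",
        "topic.prefix": "orders",  # "database.server.name" on older Debezium versions
        "table.include.list": "public.orders",
    },
}

response = requests.post(
    "http://kafka-connect.example.internal:8083/connectors",
    json=connector,
    timeout=30,
)
response.raise_for_status()
```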

Source: DZone