Hi @Martin
Welcome to Microsoft Q&A platform and thanks for posting your query here.
Great question, and thanks for providing detailed context about your use case! You're right, partitioning decisions can be tricky, especially at the data sizes you're working with.
Why the 1 TB Rule?
The 1 TB rule mentioned in the Databricks documentation is more of a guideline than a strict rule. It's based on balancing the trade-offs between metadata overhead and query/write performance. Here’s why it’s suggested:
- Metadata Management Overhead: Every partition adds metadata to the Delta Lake transaction log. For smaller tables, this overhead can outweigh the benefits, because modern query engines such as Spark can already read them efficiently without partitioning.
- Query Patterns Matter: Partitioning is most useful when your queries regularly filter on the partition column. If your data isn't queried by the partition key (in your case, year and month of ingestion), the partitions won't do much to reduce query scan times (see the sketch after this list).
- Write Performance: Writing to partitioned tables involves Spark distributing data across partitions. This can slow down writes, especially for smaller tables. For large tables, the distribution cost is often justified because it simplifies downstream reads.
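To make the trade-off concrete, here is a minimal PySpark sketch of a write partitioned by ingestion year/month and a query that actually benefits from partition pruning. The table name (raw_events), column names (ingest_year, ingest_month), and source path are hypothetical placeholders for your own:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available on Databricks

# Hypothetical source path and table/column names
df = spark.read.format("json").load("/mnt/landing/events/")

(df.write
   .format("delta")
   .mode("append")
   .partitionBy("ingest_year", "ingest_month")  # partition columns
   .saveAsTable("raw_events"))

# Only queries that filter on the partition columns benefit from pruning:
recent = spark.table("raw_events").where(
    "ingest_year = 2024 AND ingest_month = 11"
)
```

If your analysts rarely filter on ingestion date, that last query pattern never happens, and the partitions only add metadata and small files.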
Should You Partition?
Here are a few suggestions based on your scenario:
- For Large Tables (>100 GB and Growing): If you expect these tables to grow significantly over time (approaching or exceeding 1 TB), it's reasonable to partition them preemptively. Partitioning by ingestion year and month might work for now, but consider revisiting the strategy if your query patterns change in the future.
- For Small Tables (3-5 GB): For tables this size, I'd recommend avoiding partitioning. Delta Lake handles small and medium tables efficiently without partitions.
- Alternative to Partitioning – Z-Ordering: If you want to optimize query performance without the overhead of partitions, consider Z-Ordering. It clusters data on one or more columns, which can improve scan times, especially for non-partitioned tables (see the sketch below).
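As a rough sketch, Z-Ordering is applied with the OPTIMIZE command. The table name (raw_events) and the clustering column (customer_id) below are hypothetical; pick a column your queries filter on most often:

```python
# Compact files and cluster data by a frequently filtered column.
spark.sql("""
  OPTIMIZE raw_events
  ZORDER BY (customer_id)
""")
```

Unlike a partition column, you can change the Z-Order column later simply by running OPTIMIZE again, which makes it a lower-commitment choice while your query patterns are still settling.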
Why Does Partitioning Sometimes Speed Up Writes?
Partitioning can sometimes make writes faster because Spark can parallelize writes across partitions. However, this depends on how evenly your data is distributed and the available cluster resources. It's a case-by-case scenario.
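One quick way to check that distribution (a sketch, assuming the same hypothetical raw_events table and ingestion columns) is to count rows per candidate partition value before committing to a partitioning scheme:

```python
# See how evenly rows spread across candidate partition values;
# heavy skew toward a few values undermines parallel writes.
(spark.table("raw_events")
     .groupBy("ingest_year", "ingest_month")
     .count()
     .orderBy("count", ascending=False)
     .show())
```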
What I would recommend
- Stick to non-partitioned tables for now unless you anticipate significant data growth.
- Focus on Z-Ordering for query optimization if your workload demands it.
- Monitor query and write performance closely as your data grows. Testing is key in Spark workloads.
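For the monitoring part, DESCRIBE DETAIL reports a Delta table's current size, file count, and partition columns, which is enough to know when it's time to revisit the decision. A small sketch (table name is again hypothetical):

```python
# Check table size and file count to decide if/when partitioning is worth it.
detail = spark.sql("DESCRIBE DETAIL raw_events").select(
    "numFiles", "sizeInBytes", "partitionColumns"
)
detail.show(truncate=False)
```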
For more information, please refer to the articles below.
- When to partition tables on Databricks
- Best practices: Delta Lake
- How to partition Delta tables efficiently?
Hope this helps. Do let us know if you have any further queries.
If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful".