Hi @Martin
Welcome to Microsoft Q&A platform and thanks for posting your query here.
Great question, and thanks for providing detailed context about your use case! You're right, partitioning decisions can be tricky, especially at the data sizes you're working with.
Why the 1 TB Rule?
The 1 TB rule mentioned in the Databricks documentation is more of a guideline than a strict rule. It's based on balancing the trade-offs between metadata overhead and query/write performance. Here’s why it’s suggested:
- Metadata Management Overhead: Every partition adds metadata to the Delta Lake transaction log. For smaller tables, this overhead can outweigh the benefits, because modern query engines such as Spark can already read them efficiently without partitioning.
- Query Patterns Matter: Partitioning is most useful when your queries regularly filter on the partition column. If your data isn't queried by the partition key (in your case, year and month of ingestion), the partitions won't do much to reduce query scan times (see the sketch after this list).
- Write Performance: Writing to partitioned tables involves Spark distributing data across partitions. This can slow down writes, especially for smaller tables. For large tables, the distribution cost is often justified because it simplifies downstream reads.
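To make the trade-off concrete, here is a minimal PySpark sketch of a write partitioned by ingestion year/month and a query that actually benefits from partition pruning. The table name (raw_events), column names (ingest_year, ingest_month), and source path are hypothetical placeholders for your own:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available on Databricks

# Hypothetical source path and table/column names
df = spark.read.format("json").load("/mnt/landing/events/")

(df.write
   .format("delta")
   .mode("append")
   .partitionBy("ingest_year", "ingest_month")  # partition columns
   .saveAsTable("raw_events"))

# Only queries that filter on the partition columns benefit from pruning:
recent = spark.table("raw_events").where(
    "ingest_year = 2024 AND ingest_month = 11"
)
```

If your analysts rarely filter on ingestion date, that last query pattern never happens, and the partitions only add metadata and small files.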
Should You Partition?
Here are a few suggestions based on your scenario:
- For Large Tables (>100 GB and Growing): If you expect these tables to grow significantly over time (approaching or exceeding 1 TB), it's reasonable to partition them preemptively. Partitioning by ingestion year and month might work for now, but consider revisiting the strategy if your query patterns change in the future.
- For Small Tables (3-5 GB): For tables this size, I'd recommend avoiding partitioning. Delta Lake handles small and medium tables efficiently without partitions.
- Alternative to Partitioning – Z-Ordering: If you want to optimize query performance without the overhead of partitions, consider Z-Ordering. It clusters data on one or more columns, which can improve scan times, especially for non-partitioned tables (see the sketch below).
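As a rough sketch, Z-Ordering is applied with the OPTIMIZE command. The table name (raw_events) and the clustering column (customer_id) below are hypothetical; pick a column your queries filter on most often:

```python
# Compact files and cluster data by a frequently filtered column.
spark.sql("""
  OPTIMIZE raw_events
  ZORDER BY (customer_id)
""")
```

Unlike a partition column, you can change the Z-Order column later simply by running OPTIMIZE again, which makes it a lower-commitment choice while your query patterns are still settling.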
Why Does Partitioning Sometimes Speed Up Writes?
Partitioning can sometimes make writes faster because Spark can parallelize writes across partitions. However, this depends on how evenly your data is distributed and the available cluster resources. It's a case-by-case scenario.
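One quick way to check that distribution (a sketch, assuming the same hypothetical raw_events table and ingestion columns) is to count rows per candidate partition value before committing to a partitioning scheme:

```python
# See how evenly rows spread across candidate partition values;
# heavy skew toward a few values undermines parallel writes.
(spark.table("raw_events")
     .groupBy("ingest_year", "ingest_month")
     .count()
     .orderBy("count", ascending=False)
     .show())
```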
What I would recommend
- Stick to non-partitioned tables for now unless you anticipate significant data growth.
- Focus on Z-Ordering for query optimization if your workload demands it.
- Monitor query and write performance closely as your data grows. Testing is key in Spark workloads.
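For the monitoring part, DESCRIBE DETAIL reports a Delta table's current size, file count, and partition columns, which is enough to know when it's time to revisit the decision. A small sketch (table name is again hypothetical):

```python
# Check table size and file count to decide if/when partitioning is worth it.
detail = spark.sql("DESCRIBE DETAIL raw_events").select(
    "numFiles", "sizeInBytes", "partitionColumns"
)
detail.show(truncate=False)
```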
For more information, please refer to the articles below.
- When to partition tables on Databricks
- Best practices: Delta Lake
- How to partition Delta tables efficiently?
Hope this helps. Do let us know if you have any further queries.
If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful".