Hi @Martin
Welcome to Microsoft Q&A platform.
Thank you for sharing details about your use case! Running interactive clusters continuously to process near real-time Change Data Capture (CDC) workloads is a common pattern in Databricks. Below are some best practices and recommendations for managing continuously running interactive clusters:
Cluster Maintenance
- Periodic Restarts: While Databricks clusters are designed to run continuously, it’s good practice to restart the cluster periodically to avoid issues caused by resource fragmentation (e.g., memory leaks or stale cache). The restart frequency depends on workload stability, but doing it during off-peak hours (e.g., weekly) is common; one way to automate the restart is sketched after this list.
- Backup Cluster: To avoid downtime during restarts, consider setting up a backup cluster. Redirect workloads to the backup cluster temporarily while the primary cluster is restarted or maintained.
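As a rough illustration, a scheduled job (or an Azure Automation runbook) can trigger the restart through the Databricks Clusters REST API. This is a minimal sketch, not a production script; the workspace URL, token, and cluster ID are placeholders you would supply:

```python
import requests

# Placeholders -- substitute your workspace URL, a PAT or AAD token, and the cluster ID.
WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
CLUSTER_ID = "<cluster-id>"

def restart_cluster(cluster_id: str) -> None:
    """Restart a running cluster via the Clusters API (POST /api/2.0/clusters/restart)."""
    resp = requests.post(
        f"{WORKSPACE_URL}/api/2.0/clusters/restart",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"cluster_id": cluster_id},
    )
    resp.raise_for_status()

# Run this during an off-peak window, after redirecting workloads to the backup cluster.
restart_cluster(CLUSTER_ID)
```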
Cluster Configuration
- Autoscaling: Enable autoscaling to adjust resources dynamically based on workload demands. This helps manage sudden spikes in workload while optimizing costs.
- Spot Instances: If cost optimization is a priority, consider using spot instances with fallback to on-demand instances; a sample cluster specification covering both settings is sketched after this list.
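For illustration, the relevant pieces of a cluster specification (as accepted by the Clusters API) might look like the following; the node type, worker counts, and runtime version are example values to adapt:

```python
# Sketch of a cluster spec for the Clusters API (POST /api/2.0/clusters/create).
# All concrete values below (node type, worker counts, runtime) are examples only.
cluster_spec = {
    "cluster_name": "cdc-streaming-cluster",
    "spark_version": "15.4.x-scala2.12",      # example LTS runtime
    "node_type_id": "Standard_DS3_v2",        # example Azure VM size
    "autoscale": {                            # autoscaling: min/max workers
        "min_workers": 2,
        "max_workers": 8,
    },
    "azure_attributes": {
        # Use spot VMs, falling back to on-demand if spot capacity is unavailable.
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "first_on_demand": 1,                 # keep the first node (driver) on-demand
    },
}
```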
Cache Management
- Clear Cache Periodically: Use Spark's `CLEAR CACHE` command to free up memory by removing unused cached data. For example:

```sql
CLEAR CACHE;
```

This can be done on a schedule or after significant schema changes in your data.
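If you would rather trigger this from a scheduled notebook or job instead of SQL, a minimal PySpark equivalent (assuming the usual `spark` session available in Databricks notebooks):

```python
# Remove all cached tables/DataFrames on the cluster; equivalent to SQL's CLEAR CACHE.
spark.catalog.clearCache()
```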
Monitoring and Alerts
- Set Up Cluster Metrics: Use Databricks’ cluster metrics dashboard to monitor CPU, memory usage, and disk I/O. Alerts can be configured for key metrics to proactively address potential issues.
- Log Monitoring: Use tools like Azure Monitor or Datadog to capture and analyze Databricks logs, so you can track both the health of the cluster and the performance of workflows. A minimal example of pulling cluster events programmatically follows this list.
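As one small example alongside those tools, the Clusters API exposes an events endpoint that a monitoring script could poll; this is a sketch only, and the workspace URL, token, and cluster ID are placeholders:

```python
import requests

WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
CLUSTER_ID = "<cluster-id>"

# Fetch recent lifecycle events (resizes, terminations, driver restarts, etc.)
# via POST /api/2.0/clusters/events.
resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": CLUSTER_ID, "limit": 25},
)
resp.raise_for_status()
for event in resp.json().get("events", []):
    print(event["timestamp"], event["type"])
```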
Workflow Optimization
- Stream Processing: Since your workflows process near real-time CDC data, ensure your streaming queries are optimized with proper checkpointing and watermarks to avoid excessive state memory usage (see the streaming sketch after this list).
- Load Balancing: Distribute the workload efficiently across your workflows to prevent overloading the cluster.
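To make the checkpointing and watermark points concrete, here is a minimal Structured Streaming sketch; the source and sink table names, the `event_time` and `key` columns, and the checkpoint path are all hypothetical:

```python
from pyspark.sql import functions as F

# Hypothetical CDC source: a Delta table read as a stream
# (assumes the `spark` session available in Databricks notebooks).
cdc_stream = spark.readStream.table("cdc_source")   # assumed source table name

aggregated = (
    cdc_stream
        # Bound state growth: drop events arriving more than 10 minutes late.
        .withWatermark("event_time", "10 minutes")
        .groupBy(F.window("event_time", "5 minutes"), "key")
        .count()
)

query = (
    aggregated.writeStream
        .format("delta")
        .outputMode("append")
        # Checkpointing lets the query recover its state after a cluster restart.
        .option("checkpointLocation", "/mnt/checkpoints/cdc_counts")  # example path
        .toTable("cdc_counts")                      # assumed sink table name
)
```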
Use Cluster Policies in Databricks to enforce configuration standards (e.g., max/min worker nodes, runtime versions) and ensure consistency across clusters. Regularly update the Databricks runtime version to take advantage of performance improvements and bug fixes.
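For example, a cluster policy definition enforcing worker-count bounds and pinning allowed runtime versions might look like this sketch; the specific limits and versions are assumptions to adapt:

```python
import json

# Cluster policy definitions are JSON documents mapping attribute paths to rules.
# The limits and runtime versions below are illustrative only.
policy_definition = {
    "autoscale.min_workers": {"type": "range", "minValue": 1, "maxValue": 4},
    "autoscale.max_workers": {"type": "range", "minValue": 2, "maxValue": 16},
    "spark_version": {
        "type": "allowlist",
        "values": ["15.4.x-scala2.12", "14.3.x-scala2.12"],
    },
}

# The definition is submitted as a JSON string, e.g. via
# POST /api/2.0/policies/clusters/create with {"name": ..., "definition": ...}.
payload = {"name": "cdc-standard-policy", "definition": json.dumps(policy_definition)}
```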
Hope this helps.
If this answers your query, do click Accept Answer and Yes for "Was this answer helpful". And, if you have any further queries, do let us know.