Constantly Running Interactive Clusters Best Practices

Martin 100 Reputation points
2025-01-06T08:13:58.96+00:00

Hello there,

I’ve been building an ETL/ELT pipeline with Azure Databricks Workflows, Spark, and Azure Data Lake. It should process changes from an Azure SQL Database in near real time (a Change Data Capture process).

For that purpose, I will have several Databricks Workflows (7–10 of them) that run continuously on one and the same interactive cluster. So the cluster will be shared between these 7–10 Workflows and will run perpetually. Each of these Workflows will process CDC data from one table (these are big tables with SQL Server CDC enabled on them).

In addition to that, I will have several Databricks Workflows that run in batch mode on job clusters and process data from small tables.

My question concerns the interactive cluster that will run perpetually in continuous mode: are there any best practices for a continuously running process on one and the same interactive cluster? For example, should one schedule periodic downtime so the cluster can be restarted, or replace it with a backup cluster while it is being restarted? Should one periodically clear the cluster's cache? Or are there other activities that are good to perform periodically?

Many thanks for your answer in advance!


Accepted answer
  Smaran Thoomu 19,050 Reputation points, Microsoft Vendor
    2025-01-06T10:08:50.6333333+00:00

    Hi @Martin
    Welcome to Microsoft Q&A platform.
    Thank you for sharing details about your use case! Running interactive clusters continuously to process near real-time Change Data Capture (CDC) workloads is a common pattern in Databricks. Below are some best practices and recommendations for managing continuously running interactive clusters:

    Cluster Maintenance

    • Periodic Restarts: While Databricks clusters are designed to run continuously, it’s a good practice to restart the cluster periodically to avoid issues caused by resource fragmentation (e.g., memory leaks or stale cache). The restart frequency depends on workload stability but doing it during off-peak hours (e.g., weekly) is common.
    • Backup Cluster: To avoid downtime during restarts, consider setting up a backup cluster. Redirect workloads to the backup cluster temporarily while the primary cluster is restarted or maintained.
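
    For example, a periodic restart can be automated against the Databricks REST API (POST /api/2.0/clusters/restart). The sketch below only builds the request; the host, token, and cluster ID are placeholders you would supply:

```python
# Hedged sketch: restart an interactive cluster via the Databricks REST API.
# Workspace URL, personal access token, and cluster ID are placeholders.
import json
import urllib.request

def restart_request(host: str, cluster_id: str, token: str) -> urllib.request.Request:
    # Build (but do not send) the restart call; schedule the send off-peak.
    body = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/api/2.0/clusters/restart",
        data=body,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# In a real workspace you would then send it:
# req = restart_request("https://adb-123.azuredatabricks.net", "0101-123456-abc", "<PAT>")
# urllib.request.urlopen(req)
```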

    Cluster Configuration

    • Autoscaling: Enable autoscaling to adjust resources dynamically based on workload demands. This helps manage sudden spikes in workload while optimizing costs.
    • Spot Instances: If cost optimization is a priority, consider using spot instances with fallback to on-demand instances.
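
    As an illustration, autoscaling and Azure spot instances with on-demand fallback are both expressed in the cluster specification sent to the Clusters API. The runtime version, node type, and worker counts below are assumptions to adapt:

```python
# Illustrative cluster spec (fields from the Databricks Clusters API).
# Runtime version, node type, and sizes are assumptions, not recommendations.
cluster_spec = {
    "cluster_name": "cdc-shared-interactive",
    "spark_version": "15.4.x-scala2.12",          # pick a current LTS runtime
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "azure_attributes": {
        # Spot VMs with automatic fallback to on-demand on eviction
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "first_on_demand": 1,                     # keep the driver on-demand
    },
    "autotermination_minutes": 0,                 # 0 = never auto-terminate
}
```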

    Cache Management

    • Clear Cache Periodically: Use Spark's CLEAR CACHE SQL command (or spark.catalog.clearCache() from PySpark) to free up memory by removing unused cached data. For example:
        CLEAR CACHE
      
      This can be done on a schedule or after significant schema changes in your data.

    Monitoring and Alerts

    • Set Up Cluster Metrics: Use Databricks’ cluster metrics dashboard to monitor CPU, memory usage, and disk I/O. Alerts can be configured for key metrics to proactively address potential issues.
    • Log Monitoring: Use tools like Azure Monitor or Datadog to capture and analyze logs from Databricks to monitor the health of the cluster and the performance of workflows.
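
    Cluster health can also be polled programmatically via the Clusters Events API (POST /api/2.0/clusters/events). A hedged sketch of the query body, with an illustrative event-type filter:

```python
# Hedged sketch: request body for the Databricks Clusters Events API.
# The event-type filter is illustrative; adjust to the events you care about.
def events_query(cluster_id: str, limit: int = 50) -> dict:
    return {
        "cluster_id": cluster_id,
        "order": "DESC",                  # newest events first
        "limit": limit,
        "event_types": [
            "TERMINATING",                # unexpected shutdowns
            "DRIVER_NOT_RESPONDING",      # driver health problems
            "SPARK_EXCEPTION",            # failures surfaced by Spark
        ],
    }
```

A scheduled job can POST this body and raise an alert when any matching events appear.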

    Workflow Optimization

    • Stream Processing: Since your workflows process near real-time CDC data, ensure your streaming queries are optimized with proper checkpointing and watermarks to avoid excessive state memory usage.
    • Load Balancing: Distribute the workload efficiently across your workflows to prevent overloading the cluster.
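
    For illustration, a minimal PySpark sketch of one such streaming query, assuming Delta-formatted CDC data landed in ADLS. The paths, column names (commit_ts, pk), and trigger interval are assumptions to adapt:

```python
# Hypothetical sketch of one CDC streaming query per table. Storage paths,
# column names ("commit_ts", "pk"), and intervals are placeholders.

def checkpoint_path(base: str, table: str) -> str:
    # One checkpoint directory per table isolates streaming state, so a
    # single query can be reset without touching the other workflows.
    return f"{base}/checkpoints/{table}"

def start_cdc_stream(spark, table: str, base: str):
    # Watermarking bounds the state kept for deduplication; without it,
    # a long-running stream can accumulate unbounded state memory.
    return (
        spark.readStream.format("delta")
        .load(f"{base}/raw/{table}")
        .withWatermark("commit_ts", "10 minutes")
        .dropDuplicates(["pk", "commit_ts"])
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path(base, table))
        .trigger(processingTime="30 seconds")
        .start(f"{base}/silver/{table}")
    )
```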

    Use Cluster Policies in Databricks to enforce configuration standards (e.g., max/min worker nodes, runtime versions) and ensure consistency across clusters. Regularly update the Databricks runtime version to take advantage of performance improvements and bug fixes.
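
    A hedged sketch of such a policy, using the cluster policy definition syntax (fixed, range, and allowlist constraints); the runtime versions and limits are illustrative:

```python
# Illustrative cluster policy definition (Databricks cluster policy syntax).
# Attribute paths follow the Clusters API; versions and limits are examples.
policy_definition = {
    "spark_version": {"type": "allowlist",
                      "values": ["15.4.x-scala2.12", "14.3.x-scala2.12"]},
    "autoscale.min_workers": {"type": "range", "minValue": 1, "maxValue": 4},
    "autoscale.max_workers": {"type": "range", "minValue": 2, "maxValue": 10},
    "autotermination_minutes": {"type": "fixed", "value": 0},
}
```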

    Hope this helps. Do let us know if you have any further queries.


