
Run Databricks Notebooks In Parallel - Python

Databricks

Databricks is an industry-leading, cloud-based data engineering tool used for processing and transforming massive quantities of data and exploring the data through machine learning models. You can use the dbutils library of Databricks to run one notebook from another, and also to run multiple notebooks in parallel.

Run one Notebook from another Notebook

You can run one Databricks notebook from another notebook by using the notebook run command of the dbutils library. Below is the syntax of the command.

dbutils.notebook.run("notebook-name", 60, {"argument": "data", "argument2": "data2", ...})

The second argument is a timeout in seconds; the third is an optional dictionary of parameters that the child notebook can read through its widgets.

Example:

dbutils.notebook.run("../path/to/Notebook", 6000)

Run multiple Notebooks in parallel

You can run multiple Azure Databricks notebooks in parallel by using the dbutils library. Below is Python code based on the sample code from the Azure Databricks documentation on running notebooks concurrently and on notebook workflows, with additional parameterization, retry logic, and error handling.

Note that all child notebooks share resources on the cluster, which can cause bottlenecks and failures in case of resource contention. In that case, it might be better to run the parallel jobs each on its own dedicated cluster using the Jobs API.

You could also use Azure Data Factory pipelines, which support parallel activities, to easily schedule and orchestrate such a graph of notebooks.
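For the Jobs API route, here is a minimal sketch of submitting a one-time run on its own job cluster. It assumes the Jobs API 2.1 runs/submit endpoint; the workspace URL, token, node type, and Spark version are placeholders you would replace with values valid for your workspace:

import requests

# Illustrative placeholders -- replace with your workspace URL and a personal access token.
HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

payload = {
  "run_name": "notebook1-dedicated-run",
  "tasks": [{
    "task_key": "run_notebook1",
    "notebook_task": {"notebook_path": "/path/to/Notebook1"},
    # Each submitted run gets its own job cluster, avoiding resource contention.
    "new_cluster": {
      "spark_version": "13.3.x-scala2.12",  # example version; pick one available in your workspace
      "node_type_id": "Standard_DS3_v2",    # example Azure node type
      "num_workers": 2
    }
  }]
}

resp = requests.post(
  HOST + "/api/2.1/jobs/runs/submit",
  headers={"Authorization": "Bearer " + TOKEN},
  json=payload,
)
print(resp.json())  # contains the run_id of the submitted run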

Python Code:

from concurrent.futures import ThreadPoolExecutor

class NotebookData:
  def __init__(self, path, timeout, parameters=None, retry=0):
    self.path = path
    self.timeout = timeout
    self.parameters = parameters
    self.retry = retry

  @staticmethod
  def submitNotebook(notebook):
    print("Running notebook %s" % notebook.path)
    try:
      if notebook.parameters:
        return dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
      else:
        return dbutils.notebook.run(notebook.path, notebook.timeout)
    except Exception:
      # Re-raise once the retry budget is exhausted; otherwise retry the notebook.
      if notebook.retry < 1:
        raise
      print("Retrying notebook %s" % notebook.path)
      notebook.retry = notebook.retry - 1
      return NotebookData.submitNotebook(notebook)

def parallelNotebooks(notebooks, numInParallel):
  # If you create too many notebooks in parallel the driver may crash when you submit all of the jobs at once.
  # This code limits the number of parallel notebooks.
  with ThreadPoolExecutor(max_workers=numInParallel) as ec:
    return [ec.submit(NotebookData.submitNotebook, notebook) for notebook in notebooks]

# Array of instances of the NotebookData class
notebooks = [
  NotebookData("../path/to/Notebook1", 1200),
  NotebookData("../path/to/Notebook2", 1200, {"Name": "Abhay"}),
  NotebookData("../path/to/Notebook3", 1200, retry=2)
]

res = parallelNotebooks(notebooks, 2)
result = [i.result(timeout=3600) for i in res]  # This is a blocking call.
print(result)
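Since i.result() re-raises any exception from its notebook, a single failed child will abort the whole collection loop above. If you would rather collect per-notebook outcomes, a small variation (using the same names as above) is:

results = []
for i in res:
  try:
    # result() re-raises the exception from a failed notebook run.
    results.append(i.result(timeout=3600))
  except Exception as e:
    # Record the failure instead of aborting the remaining notebooks.
    results.append("FAILED: %s" % e)
print(results)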

Useful resources

You can also browse other categories in our blog for some amazing 8051, Python, ARM, Verilog, and Machine Learning code examples.
