
Computing the total storage size of an ADLS Gen1 or Gen2 folder in PySpark

This post explains how to calculate the total storage size of an Azure Data Lake Storage (ADLS) Gen1 or Gen2 folder in PySpark using Azure Databricks or Azure Synapse Analytics.

Assumptions

  • ADLS Gen1 or Gen2 is already set up and mounted in Azure Databricks or Azure Synapse Analytics.
  • The code below can be used to calculate the folder size. It cannot be used as a user-defined function (UDF) because it calls the Databricks utility functions (dbutils) or the Synapse utility functions (mssparkutils), and these functions are not allowed inside a UDF. Databricks or Synapse will throw the following error: could not serialize object: Exception: You cannot use dbutils within a spark job

Code

def recursiveDirSize(path):
  """Return the total size in bytes of all files under the given path."""
  total = 0
  dir_files = dbutils.fs.ls(path)
  for file in dir_files:
    if file.isDir():
      # Recurse into subdirectories and add their sizes
      total += recursiveDirSize(file.path)
    else:
      # Accumulate the file size instead of overwriting the running total
      total += file.size
  return total

print(recursiveDirSize("/mnt/folder/"))
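In Azure Synapse Analytics, the same recursion can be written with mssparkutils in place of dbutils. The sketch below is a minimal adaptation, not a drop-in from the Microsoft docs: it assumes the FileInfo objects returned by mssparkutils.fs.ls expose isDir, path, and size attributes, and that the same /mnt/folder/ mount is reachable from the Synapse workspace.

from notebookutils import mssparkutils

def recursiveDirSizeSynapse(path):
  """Return the total size in bytes of all files under the given path (Synapse sketch)."""
  total = 0
  for file in mssparkutils.fs.ls(path):
    # isDir is assumed to be a boolean attribute on the returned FileInfo objects
    if file.isDir:
      total += recursiveDirSizeSynapse(file.path)
    else:
      total += file.size
  return total

print(recursiveDirSizeSynapse("/mnt/folder/"))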

Unix command

You can use the disk usage (du) Unix command in a Databricks or Synapse notebook to get the size. Any DBFS directory is also mounted on the Unix filesystem and can be accessed under /dbfs.

%sh
du -h /dbfs/mnt/folder/

The above command can take a long time to run on large folders, so run it cautiously.
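If you only need the grand total rather than a per-subdirectory breakdown, du's summarize flag keeps the output to a single line; it still walks every file, so the runtime is comparable.

%sh
du -sh /dbfs/mnt/folder/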

You can also browse other categories on our blog for 8051, Python, ARM, Verilog, and Machine Learning code examples.
