Azure Databricks: How to Optimize all your Tables
THE MANY SMALL FILE PROBLEM
If you have ever worked with big data tooling, it is very likely that you've encountered the many small files problem, where tools like Hive, Spark, etc. do not perform well across many small files. A common best practice is to store files in sizes between roughly 500 MB and 2 GB.
Another best practice is to either run OPTIMIZE with a WHERE clause after an ETL job, or to create a separate nightly Notebook that runs as a job and fully OPTIMIZEs all tables within a database. Because OPTIMIZE is an expensive command, we often schedule it as a nightly job. The code below walks through how to do that.
THE OPTIMIZE SOLUTION
Originally introduced in Azure Databricks Delta (now Delta Lake), OPTIMIZE solves the many small files problem by leveraging bin-packing (also known as compaction) to improve the organization and sizing of files.
Other ways to leverage OPTIMIZE and the optional ZORDER command:
OPTIMIZE myDatabase.myTable WHERE date >= '2017-01-01'
OPTIMIZE myDatabase.myTable WHERE date >= current_timestamp() - INTERVAL 1 day ZORDER BY (eventType)
Learn more about these commands in the Databricks documentation:
Databricks recommends "you start by running OPTIMIZE on a daily basis (preferably at night when spot prices are low). Then modify your job from there."
The first step is to run SHOW TABLES IN myDatabase, which returns a result with the columns databaseName, tableName, and isTemporary. The next step is to iterate over the returned rows in parallel (i.e. par.foreach, using Scala's parallel collections), placing the column values into variables that we can pass into a Spark SQL command to dynamically run OPTIMIZE on each table. We can pass a variable into a Spark SQL command by adding an s before the query string (i.e. spark.sql(s""), Scala's string interpolation). As a best practice, I always wrap spark.sql commands in triple double-quotes so we can write SQL across multiple lines.
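The steps above can be sketched as the following Scala helper. This is a minimal sketch, not the article's exact notebook code: the database name `myDatabase` and the table names in `main` are placeholder assumptions, and the SQL runner is injected as a function so the sketch compiles and runs without a SparkSession. In a Databricks notebook you would pass `spark.sql(_)` as the runner.

```scala
object OptimizeAllTables {
  // Build the OPTIMIZE statement for one table. Triple double-quotes
  // let the SQL span multiple lines, as described in the article.
  def optimizeSql(db: String, table: String): String =
    s"""OPTIMIZE $db.$table""".trim

  // Run OPTIMIZE for every table. `runSql` stands in for spark.sql so
  // this sketch is self-contained; in a notebook you would first get
  // the table names with something like:
  //   spark.sql(s"SHOW TABLES IN $db").collect().map(_.getString(1))
  // and on Databricks you could replace `tables.foreach` with
  // `tables.par.foreach` to optimize tables in parallel.
  def optimizeAll(db: String, tables: Seq[String], runSql: String => Unit): Unit =
    tables.foreach(t => runSql(optimizeSql(db, t)))

  def main(args: Array[String]): Unit = {
    // Record the generated SQL instead of executing it, for illustration.
    val ran = scala.collection.mutable.ArrayBuffer.empty[String]
    optimizeAll("myDatabase", Seq("events", "users"), ran += _)
    ran.foreach(println)
  }
}
```

Injecting the runner keeps the string-building logic testable outside Databricks; the only notebook-specific pieces are `spark.sql` itself and the `.par` parallel collection call.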