Using the power of Spark and SQL Server to automate the creation of a structured layer on top of Azure Data Lake.
*Important - This can be done in Python, but I have achieved results up to 60% faster using Scala.
*Read First - Azure Databricks Analytics Architecture
Problem - We need to provide a layer of access to data in Azure Data Lake. This process should happen automatically so as not to require additional resources in analytical departments.
Solution - Using Spark and Scala, we can automate this development with parallel processing to create both the Data Definition Language (DDL) and the Data Manipulation Language (DML).
The example code below is a simplified version for learning the basics. It automatically loads the Delta files and creates the Delta tables during the first run.
Step 1 - Set Up Cluster Libraries
Download the azure-sqldb-spark connector JAR from the Azure/azure-sqldb-spark GitHub repository
Go to your Databricks cluster > Libraries > Install New > Upload > JAR
Drag the azure-sqldb-spark-1.0.2.jar (or newer) file onto the upload screen and hit Install.
Step 2 - Create a Scala Notebook
Create a new notebook from the Azure Databricks home page or by right-clicking anywhere in the Workspace tab.
Step 3 - Mount your Azure Data Lake Store
Review the requirements here; we will not go over this step in this blog.
Databricks Notebook (Scala)
Retrieve metadata from SQL Server using sqlContext.read.sqlDB
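A minimal sketch of the metadata read, assuming a hypothetical dbo.DatasetMetadata table in a MetadataDB database; the server name, user, and secret scope are placeholders to replace with your own:

```scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

// Hypothetical connection details. Replace with your own server, database, and table.
val metadataConfig = Config(Map(
  "url"          -> "myserver.database.windows.net",
  "databaseName" -> "MetadataDB",
  "dbTable"      -> "dbo.DatasetMetadata",
  "user"         -> "sql_user",
  "password"     -> dbutils.secrets.get(scope = "my-scope", key = "sql-password")
))

// sqlContext.read.sqlDB is added by the azure-sqldb-spark connector installed in Step 1.
val DF_Metadata = sqlContext.read.sqlDB(metadataConfig)
```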
Create a view over the results called VW_Metadata
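Registering the DataFrame as a temporary view makes it queryable with Spark SQL:

```scala
// Expose the metadata to Spark SQL under the name VW_Metadata.
DF_Metadata.createOrReplaceTempView("VW_Metadata")
```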
Create variables to store ADLS paths
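A sketch of the path variables, assuming the Data Lake was mounted under /mnt/datalake in Step 3 with separate Raw and Standard zones (the exact paths are placeholders):

```scala
// Root folders for the Raw (source files) and Standard (Delta) zones.
val DL_RAW_Directory            = "/mnt/datalake/raw/"
val DL_Standard_Delta_Directory = "/mnt/datalake/standard/"
```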
Map each row to a string that can then be split back into variables inside the foreach
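For example, assuming the metadata table carries DatabaseName, TableName, and FileName columns (hypothetical names; adjust to your schema):

```scala
// Collect the metadata to the driver and flatten each row into one
// pipe-delimited string per dataset.
val metadataRows: Array[String] = spark
  .sql("SELECT DatabaseName, TableName, FileName FROM VW_Metadata")
  .collect()
  .map(row => s"${row.getString(0)}|${row.getString(1)}|${row.getString(2)}")
```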
Create a parallel foreach to iterate over the metadata; the remaining steps all run inside it (see the combined sketch after this list)
Read the current file from the Raw zone, passing in the path using the DF_DL_RAW_Dataset_loadString variable
Write to the Standard zone in Databricks Delta format, passing in the target path using the DL_Standard_Delta_Directory variable
Create the database if the database associated with the table does not exist
Create the Delta table if it does not exist, inferring the schema from the Standard file layer
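Putting the last five steps together, a minimal sketch of the parallel loop; the Raw files are assumed here to be CSVs with header rows, so adjust the reader to your actual format:

```scala
// .par runs the loop body on multiple driver threads, so several
// datasets are loaded concurrently.
metadataRows.par.foreach { metadataRow =>

  // Split the pipe-delimited metadata string back into variables.
  val Array(databaseName, tableName, fileName) = metadataRow.split('|')

  val DF_DL_RAW_Dataset_loadString = DL_RAW_Directory + fileName
  val deltaTargetPath              = DL_Standard_Delta_Directory + tableName

  // Read the current file from the Raw zone.
  val DF_DL_RAW_Dataset = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(DF_DL_RAW_Dataset_loadString)

  // Write to the Standard zone in Databricks Delta format.
  DF_DL_RAW_Dataset.write
    .format("delta")
    .mode("overwrite")
    .save(deltaTargetPath)

  // Create the database if it does not already exist.
  spark.sql(s"CREATE DATABASE IF NOT EXISTS $databaseName")

  // Create the Delta table if it does not exist; its schema is inferred
  // from the files already written to the Standard layer.
  spark.sql(s"""
    CREATE TABLE IF NOT EXISTS $databaseName.$tableName
    USING DELTA
    LOCATION '$deltaTargetPath'
  """)
}
```

On the first run this creates every database and Delta table from scratch; on later runs the IF NOT EXISTS guards make the DDL a no-op and only the data load repeats.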