Sean Forgatch
Azure Databricks Delta - Automate the loading and creation of Tables
Updated: Apr 14, 2019

Using the power of Spark and SQL Server to automate the creation of a structured layer on top of Azure Data Lake.
*Important - This can be done in Python, but I have achieved results up to 60% faster using Scala.
*Read First - Azure Databricks Analytics Architecture
Problem - We need to provide a layer of access to data in Azure Data Lake. This process should happen automatically, so as not to require additional resources from analytical departments.
Solution - Using Spark and Scala, we can automate this development with parallel processing, generating both the Data Definition Language (DDL) and the Data Manipulation Language (DML).
The example code below is a simplified version for learning the basics. It automatically loads the Delta files and creates the Delta tables on the first run.
Step 1 - Setup Cluster Libraries
Download azure-sqldb-spark (the JAR is available from the project's GitHub releases)
Go to your Databricks cluster > Libraries > Install New > Upload > Jar
Drag the azure-sqldb-spark-1.0.2.jar (or newer) file to the upload screen and hit install.

Step 2 - Create a Scala Notebook
Create a new notebook from the Azure Databricks home page or by right-clicking in the Workspace tab.
Step 3 - Mount your Azure Data Lake Store
Review the requirements at the link below; we will not cover the mounting process in detail in this blog.
https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake.html
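For reference, a minimal mount sketch for ADLS Gen1 using a service principal, following the pattern in the linked docs; the store name, application ID, client secret, tenant ID, and mount point are all placeholders you must replace:

```scala
// Minimal ADLS Gen1 mount sketch -- every <...> value below is a placeholder.
val configs = Map(
  "dfs.adls.oauth2.access.token.provider.type" -> "ClientCredential",
  "dfs.adls.oauth2.client.id" -> "<application-id>",
  "dfs.adls.oauth2.credential" -> "<client-secret>",
  "dfs.adls.oauth2.refresh.url" -> "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
)

// Mount the lake under /mnt/datalake so paths can be referenced like local files.
dbutils.fs.mount(
  source = "adl://<datalakestore>.azuredatalakestore.net/",
  mountPoint = "/mnt/datalake",
  extraConfigs = configs
)
```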
Databricks Notebook (Scala)
Import libraries
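With the azure-sqldb-spark library installed in Step 1, the imports look like this:

```scala
// Connector classes from the azure-sqldb-spark library installed in Step 1
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
```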
Retrieve metadata from SQL Server using sqlContext.read.sqlDB
Create a view over the results called VW_Metadata
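A sketch of these two steps; the server, database, and credential values are placeholders, and the metadata table name dbo.DatasetMetadata is an assumption about your setup:

```scala
// Connection details for the metadata database -- all values below are placeholders.
val config = Config(Map(
  "url"          -> "<your-server>.database.windows.net",
  "databaseName" -> "<your-metadata-db>",
  "dbTable"      -> "dbo.DatasetMetadata", // hypothetical metadata table
  "user"         -> "<username>",
  "password"     -> "<password>"
))

// Read the metadata rows from SQL Server and expose them as a temporary view.
val DF_Metadata = sqlContext.read.sqlDB(config)
DF_Metadata.createOrReplaceTempView("VW_Metadata")
```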

Create variables to store ADLS paths
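For example, assuming the mount point from Step 3 (the zone folder names are assumptions about your lake layout):

```scala
// Root paths for the Raw and Standard zones on the mounted lake
// (folder names are assumptions -- adjust to your layout).
val DL_Raw_Directory            = "/mnt/datalake/Raw/"
val DL_Standard_Delta_Directory = "/mnt/datalake/Standard/"
```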
Map each row to string values that can then be referenced inside the foreach using variables
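A sketch of that mapping; the column names below are assumptions about the metadata table's schema:

```scala
// Collect the metadata to the driver and map each Row to plain strings
// so they can be referenced inside the parallel foreach.
// DatabaseName, TableName, and FileName are assumed column names.
val datasets = spark.table("VW_Metadata")
  .collect()
  .map(row => (
    row.getAs[String]("DatabaseName"),
    row.getAs[String]("TableName"),
    row.getAs[String]("FileName")
  ))
```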
Create a parallel foreach to iterate over the metadata; each iteration performs the steps below (see the combined sketch after this list)

Read the current file from Raw, passing in the path via the DF_DL_RAW_Dataset_loadString variable
Write to the Standard zone in Databricks Delta format, passing in the target path via the DL_Standard_Delta_Directory variable
Create the database if the table's associated database does not exist
Create the Delta table if it does not exist, inferring the schema from the Standard file layer
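Putting the above together, a minimal sketch of the parallel loop; the Raw file format (Parquet) and the overwrite mode are assumptions:

```scala
// .par runs one iteration per dataset in parallel on the driver,
// each iteration submitting its own Spark jobs.
datasets.par.foreach { case (databaseName, tableName, fileName) =>

  // Read the current file from the Raw zone.
  val DF_DL_RAW_Dataset_loadString = DL_Raw_Directory + fileName
  val df = spark.read
    .format("parquet") // assumption: Raw files are Parquet
    .load(DF_DL_RAW_Dataset_loadString)

  // Write to the Standard zone in Databricks Delta format.
  val targetPath = DL_Standard_Delta_Directory + tableName
  df.write
    .format("delta")
    .mode("overwrite") // assumption: full reload on each run
    .save(targetPath)

  // Create the database if it does not already exist.
  spark.sql(s"CREATE DATABASE IF NOT EXISTS $databaseName")

  // Create the Delta table if it does not exist; the schema is
  // inferred from the Delta files in the Standard layer.
  spark.sql(
    s"CREATE TABLE IF NOT EXISTS $databaseName.$tableName USING DELTA LOCATION '$targetPath'"
  )
}
```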
