Using the power of Spark and SQL Server to automate the creation of a structured layer on top of Azure Data Lake.
*Important - This can be done in Python, but I have achieved results up to 60% faster using Scala.
*Read First - Azure Databricks Analytics Architecture
Problem - We need to provide a layer of access to data in Azure Data Lake. This process should happen automatically so as not to require additional resources in analytical departments.
Solution - Using Spark and Scala, we can automate this development with parallel processing to create both the Data Definition Language (DDL) and the Data Manipulation Language (DML).
The example code below is a simplified version for learning the basics. It automatically loads the Delta files and creates the Delta tables during the first run.
Step 1 - Set Up Cluster Libraries
Download the azure-sqldb-spark connector JAR from the Azure/azure-sqldb-spark GitHub repository
Go to your Databricks cluster > Libraries > Install New > Upload > JAR
Drag the azure-sqldb-spark-1.0.2.jar (or newer) file onto the upload screen and hit Install.
Step 2 - Create a Scala Notebook
Create a new notebook from the Azure Databricks home page or by right-clicking anywhere in the Workspace tab.
Step 3 - Mount your Azure Data Lake Store
Review the requirements here; we will not go over this step in this blog.
Databricks Notebook (Scala)
Retrieve metadata from SQL Server using sqlContext.read.sqlDB
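A minimal sketch of the metadata read, assuming a hypothetical dbo.DatasetMetadata table in a MetadataDB database; the server name, user, and secret scope are placeholders to replace with your own:

```scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

// Hypothetical connection details. Replace with your own server, database, and table.
val metadataConfig = Config(Map(
  "url"          -> "myserver.database.windows.net",
  "databaseName" -> "MetadataDB",
  "dbTable"      -> "dbo.DatasetMetadata",
  "user"         -> "sql_user",
  "password"     -> dbutils.secrets.get(scope = "my-scope", key = "sql-password")
))

// sqlContext.read.sqlDB is added by the azure-sqldb-spark connector installed in Step 1.
val DF_Metadata = sqlContext.read.sqlDB(metadataConfig)
```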
Create a view over the results called VW_Metadata
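Registering the DataFrame as a temporary view makes it queryable with Spark SQL:

```scala
// Expose the metadata to Spark SQL under the name VW_Metadata.
DF_Metadata.createOrReplaceTempView("VW_Metadata")
```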
Create variables to store ADLS paths
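A sketch of the path variables, assuming the Data Lake was mounted under /mnt/datalake in Step 3 with separate Raw and Standard zones (the exact paths are placeholders):

```scala
// Root folders for the Raw (source files) and Standard (Delta) zones.
val DL_RAW_Directory            = "/mnt/datalake/raw/"
val DL_Standard_Delta_Directory = "/mnt/datalake/standard/"
```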
Map each row to a string that can then be split back into variables inside the foreach
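For example, assuming the metadata table carries DatabaseName, TableName, and FileName columns (hypothetical names; adjust to your schema):

```scala
// Collect the metadata to the driver and flatten each row into one
// pipe-delimited string per dataset.
val metadataRows: Array[String] = spark
  .sql("SELECT DatabaseName, TableName, FileName FROM VW_Metadata")
  .collect()
  .map(row => s"${row.getString(0)}|${row.getString(1)}|${row.getString(2)}")
```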
Create a parallel foreach to iterate over the metadata; the remaining steps all run inside it (see the combined sketch after this list)
Read the current file from the Raw zone, passing in the path using the DF_DL_RAW_Dataset_loadString variable
Write to the Standard zone in Databricks Delta format, passing in the target path using the DL_Standard_Delta_Directory variable
Create the database if the database associated with the table does not exist
Create the Delta table if it does not exist, inferring the schema from the Standard file layer
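Putting the last five steps together, a minimal sketch of the parallel loop; the Raw files are assumed here to be CSVs with header rows, so adjust the reader to your actual format:

```scala
// .par runs the loop body on multiple driver threads, so several
// datasets are loaded concurrently.
metadataRows.par.foreach { metadataRow =>

  // Split the pipe-delimited metadata string back into variables.
  val Array(databaseName, tableName, fileName) = metadataRow.split('|')

  val DF_DL_RAW_Dataset_loadString = DL_RAW_Directory + fileName
  val deltaTargetPath              = DL_Standard_Delta_Directory + tableName

  // Read the current file from the Raw zone.
  val DF_DL_RAW_Dataset = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(DF_DL_RAW_Dataset_loadString)

  // Write to the Standard zone in Databricks Delta format.
  DF_DL_RAW_Dataset.write
    .format("delta")
    .mode("overwrite")
    .save(deltaTargetPath)

  // Create the database if it does not already exist.
  spark.sql(s"CREATE DATABASE IF NOT EXISTS $databaseName")

  // Create the Delta table if it does not exist; its schema is inferred
  // from the files already written to the Standard layer.
  spark.sql(s"""
    CREATE TABLE IF NOT EXISTS $databaseName.$tableName
    USING DELTA
    LOCATION '$deltaTargetPath'
  """)
}
```

On the first run this creates every database and Delta table from scratch; on later runs the IF NOT EXISTS guards make the DDL a no-op and only the data load repeats.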