Sean Forgatch

Apr 7, 20192 min

Azure Databricks Delta - Automate the loading and creation of Tables

Updated: Apr 14, 2019

Using the power of Spark and SQL Server to automate the creation of a structured layer on top of Azure Data Lake.

*Important - This can be done in Python but I have achieved results up to 60% percent faster using Scala.

*Read First - Azure Databricks Analytics Architecture

Problem - We need to provide a layer of access to data in Azure Data Lake. This process should be done automatically as to not require additional resources in analytical departments.

Solution - Using Spark and Scala we can automate this development using parallel processing to create both the Data Definition Language(DDL) and Data Manipulation Language(DML).

The example code below is a simplified version to learn the basics. It can automatically load the delta files and create the delta tables during the first run.

Step 1 - Setup Cluster Libraries

  1. Download azure-sqldb-spark by clicking here

  2. Go to your Databricks clutser> Libraries > Install New > Upload > Jar

  3. Drag the azure-sqldb-spark-1.0.2.jar (or newer) file to the upload screen and hit install.

Step 3 - Create a Scala Notebook

  • Create a new notebook from the Azure Databricks home page or by right clicking somewhere in the Workspace tab.

Step 3 - Mount your Azure Data Lake Store


Databricks Notebook (Scala)

  • Import libraries

  • Retrieve metadata from SQL Server using sqlContext.read.sqlDB

  • Create view over results called VW_Metadata

  • Create variables to store ADLS paths

  • Map rows to an instance of a string which then can be referenced inside the foreach using variables

  • Create a parallel foreach to iterate over metadata

  • Read current file from Raw, passing in the path using DF_DL_RAW_Dataset_loadString variable

  • Write to Standard zone using Databricks Delta format and pas in the target path using the DL_Standard_Delta_Directory variable.

  • Create Database if associated database to the table does not exists

  • Create Delta table if not exists, infer schema from Standard file layer

    26261
    0