Data Factory Ingestion Framework: Part 2 - The Metadata Model
Part 2 of 4 in the series of blogs where I walk through metadata-driven ELT using Azure Data Factory. We will review the primary component that brings the framework together: the metadata model. The metadata model is developed using a technique borrowed from the data warehousing world called Data Vault (the model only). Its primary purpose is storing metadata about a dataset, so that a dataset can remain agnostic to system type (i.e., a SQL Server table, SAP HANA table, Teradata table, or Oracle table); essentially, any dataset available in Azure Data Factory's Linked Services list (over 50!).
Before reading this blog, catch up on Part 1 below, where I review how to build a pipeline that loads the metadata model discussed here in Part 2, as well as an intro to Data Vault.
Benefits of using Data Vault to automate data lake ingestion:
Track historical changes to a schema
Easily keep up with Azure's advancements by adding new Satellite tables, without restructuring the entire model
Easily add a new source system type, also by adding a Satellite table
Benefits of using Azure Data Factory:
Develop pattern-oriented ETL/ELT - I'll show you how you'll only ever need two ADF pipelines to ingest an unlimited number of datasets.
Run a pipeline in batches of 50
Wide range of connectors
Pay for what you use
Model Base Tables
The following are examples of the base model tables. The primary driver behind the design was to automate the ingestion of any dataset into Azure Data Lake (though this concept can be used with other storage systems as well) using Azure Data Factory, while adding the ability to define custom properties and settings per dataset. The solution comprises only two pipelines: one to get and store metadata, the other to read that metadata and go retrieve the actual data. Each system type has its own Satellite table that houses the information schema about that particular system.
This group of tables houses, most importantly, the centerpiece of the entire model: the Hub_Dataset table, whose primary purpose is to uniquely identify a dataset across numerous types of datasets and systems. The Hub_Dataset table separates business keys from the attributes, which are located on the dataset Satellite tables below. The Columns table holds all column information for a dataset. The Option table gets one record per unique dataset and stores simple bit configurations such as isIngestionEnabled, isDatabricksEnabled, and isDeltaIngestionEnabled, to name a few.
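In Data Vault, a hub's hash key is typically derived from the business key itself, so the same dataset always resolves to the same key regardless of which system staged it. Here is a minimal sketch of that derivation in Python; the function name, key parts, and delimiter are my own illustrative assumptions, not part of the model above:

```python
import hashlib

def dataset_hash_key(*business_key_parts: str, delimiter: str = "||") -> str:
    """Build a deterministic Data Vault hash key from a dataset's business key.

    Parts are trimmed and upper-cased so that 'dbo.MyTable' and ' DBO.MYTABLE'
    hash to the same key, then joined with a delimiter and hashed with MD5
    (a common Data Vault convention; SHA-1 or SHA-256 work just as well).
    """
    normalized = delimiter.join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest().upper()

# Example: a SQL Server dataset keyed by system type, database, schema, and table.
key = dataset_hash_key("sql", "SalesDb", "dbo", "Orders")
```

The same approach generates the column-level hash key by appending the column name to the business key parts.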
The different type tables you see here are just examples of some types I've encountered. Their sole purpose is to store the unique attribute data about an individual dataset. Take ..type_sql (SQL Server), for example: this table houses the table name, schema, database, and schema type (i.e., table or view), which then ties back to its dataset key in Hub_Dataset.
I then feed this data back to Data Factory for ETL/ELT: I write a view over the model to pull in all datasets, then send each one to its appropriate activity based on sourceSystemType. This is driven through a batch framework addition that is outside the scope of this blog, but it also ties back to the dataset.
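To make the routing step concrete, here is a hypothetical sketch of how the view's output could be grouped by sourceSystemType and chunked into batches of 50 (the batch size from the benefits list above). The field names are illustrative, not the actual view's columns:

```python
from itertools import islice
from typing import Iterable, Iterator

BATCH_SIZE = 50  # matches the ADF ForEach batch size mentioned above

def batches(items: Iterable[dict], size: int = BATCH_SIZE) -> Iterator[list[dict]]:
    """Yield successive fixed-size batches from the dataset view output."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

def route_by_system_type(datasets: Iterable[dict]) -> dict[str, list[dict]]:
    """Group datasets from the metadata view by their sourceSystemType."""
    routed: dict[str, list[dict]] = {}
    for ds in datasets:
        routed.setdefault(ds["sourceSystemType"], []).append(ds)
    return routed

# Example: three rows from the view, two SQL Server and one Azure SQL.
view_rows = [
    {"datasetKey": "A1", "sourceSystemType": "sql"},
    {"datasetKey": "B2", "sourceSystemType": "asql"},
    {"datasetKey": "C3", "sourceSystemType": "sql"},
]
routed = route_by_system_type(view_rows)
```

In the real framework this dispatch happens inside Data Factory (a Switch or If activity keyed on sourceSystemType feeding a ForEach); the Python above only illustrates the shape of the logic.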
These tables are loaded by a stored procedure and hold the distinct connections to our source systems. Connections are identified by a system type acronym (i.e., sql, asql, sapHana, etc.), and Data Factory executes logic based on that type. In Azure Data Factory we will have only one Linked Service per source system type: if we have 100 source SQL Server databases, we will have 100 connections in the Hub/Sat tables, but only a single parameterized Linked Service for SQL Server in Data Factory. To elaborate, we will be passing connection string properties into a template Linked Service per system type.
sat_LinkedService_Configuration has key-value columns. This accounts for the variable number of properties that can be used on Linked Services.
sat_LinkedService_Options has one record per connection to control settings such as isEnabled.
The last table here is the only Link involved in this model: it ties a dataset to a connection using the hash key from the Hub_Dataset table as well as the hash key from the Hub_LinkedService table.
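Because sat_LinkedService_Configuration stores properties as key-value rows, the pipeline has to pivot them back into a single set of connection properties before passing them into the parameterized Linked Service. A hypothetical sketch of that pivot, with illustrative row and property names rather than the model's actual columns:

```python
def connection_properties(config_rows: list[dict]) -> dict[str, dict[str, str]]:
    """Pivot key-value satellite rows into one property dict per connection.

    Each row carries the connection's hash key plus a single (key, value)
    pair, mirroring sat_LinkedService_Configuration's key-value shape.
    """
    connections: dict[str, dict[str, str]] = {}
    for row in config_rows:
        props = connections.setdefault(row["linkedServiceHashKey"], {})
        props[row["propertyKey"]] = row["propertyValue"]
    return connections

# Example: two key-value rows describing one SQL Server connection.
rows = [
    {"linkedServiceHashKey": "LS1", "propertyKey": "server", "propertyValue": "srv01"},
    {"linkedServiceHashKey": "LS1", "propertyKey": "database", "propertyValue": "SalesDb"},
]
props = connection_properties(rows)
```

The resulting dict per connection maps cleanly onto the parameters of the template Linked Service for that system type.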
An example base model with three source system types: Azure SQL, SQL Server, and Azure Data Lake Store. The Data Vault table types include two Hubs and one Link; the remainder are Satellites, primarily additions to the Hub_Dataset table.
Adding a new Source System Type
When adding a new source system type to the model, there are a few new objects you'll need to create or alter:
Create - Staging Table: a staging table (i.e., adf.stg_sql) to stage the incoming metadata per source type.
Create - View of Staging Table: this view is used in our Data Vault loading procedures as the source for the load, and it generates a hash key for the dataset as well as a hash key for each column on a dataset.
Alter - Load Procedure: finally, the procedure that reads the views and loads the tables mentioned above. In my case I've used one procedure to load the Hub and Satellites for the dataset, and another procedure to load the Link. This is just how I chose to organize it.
To reiterate, these only need to be developed once per system type, not once per connection.
*Adding connections is a one-time activity, therefore we will not be loading the Hub_LinkedService at the same time as the Hub_Dataset.
Load Staging Tables - this is done using the schema loader pipeline from the first blog post in this series (see link at the top).
Load Model - execute the load procedure that loads all dataset-associated tables and the link_Dataset_LinkedService.
What else can I add to this model?
Though not discussed in this article, I've been able to fuel other automation features while tying everything back to a dataset. The metadata currently fuels both Azure Databricks and Azure Data Factory working together. Other tools can certainly be used. These scenarios include:
Change Tracking or Replication automation
Data Warehouse and Data Vault DML/DDL automation
Data Profiling automation
Hope this helps you along in your Azure journey! Look for part 3 in the coming weeks!