Azure Databricks Architecture on Data Lake
Updated: May 31, 2019
This is a simple overview of a mature Data Lake architecture to be used alongside Databricks Delta. Loading the data lake from ingestion into RAW, and processing onward to CUR, can and should be fully automated. We are not eliminating ETL work when doing ELT; rather, we are pushing the transformation further down the pipeline (ELETL). Keep in mind that this is the Data Lake architecture only. It does not take into account what comes after it in Azure: a cloud data warehouse, a semantic layer, and dashboards and reports. This specific architecture is about enabling Data Science and presenting the Databricks Delta tables to the Data Scientist or Analyst conducting data exploration and experimentation.
Data Lake Zones
RAW: Raw is all about data ingestion. We want to get data into RAW as quickly and efficiently as possible. No transformations are allowed here. Data must be stored as-is from its source system; this mitigates the risk of schema changes to RAW and keeps the source ingestion architecture simple and resilient. Data in RAW is stored by ingestion date. No end-user access is to be granted here. Data here can include duplicate rows, multiple versions of a record from successive updates, and so on. Delta ETL reads are done between STD and CUR.
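As a minimal sketch of the ingestion-date layout described above, the helper below lands a source file in RAW unchanged. The `raw/<source>/<yyyy>/<mm>/<dd>` folder convention and the function names are assumptions made for illustration, and a local directory stands in for the lake.

```python
from datetime import date
from pathlib import Path
import shutil


def raw_path(lake_root: Path, source: str, ingestion_date: date, filename: str) -> Path:
    """Build the RAW landing path for a file, keyed by ingestion date (assumed layout)."""
    return (
        lake_root / "raw" / source
        / f"{ingestion_date:%Y}" / f"{ingestion_date:%m}" / f"{ingestion_date:%d}"
        / filename
    )


def land_in_raw(src_file: Path, lake_root: Path, source: str, ingestion_date: date) -> Path:
    """Copy the file byte-for-byte into RAW: no parsing, no transformation."""
    target = raw_path(lake_root, source, ingestion_date, src_file.name)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src_file, target)
    return target
```

Because the file is copied rather than parsed, duplicate rows and changed source columns pass straight through, which is the point of this zone.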
STD: The STD zone has two primary features: standardized file types and data partitioning. For the standard file in this zone, carefully consider whether to compress the data and which file type to use. This will largely depend on the type of data you are working with and your performance requirements. The STD zone allows the CUR zone to be rebuilt faster than it could be rebuilt from RAW.
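To make the two STD features concrete, here is a rough sketch that rewrites one RAW CSV file into a single standard file type, split into partition folders. Gzip-compressed CSV is used only because it needs nothing beyond the Python standard library; the file type, the `column=value` folder naming, and the function name are all assumptions, and a real lake may well standardize on a columnar format instead.

```python
import csv
import gzip
from collections import defaultdict
from pathlib import Path
from typing import List


def standardize(raw_csv: Path, std_dir: Path, partition_col: str) -> List[Path]:
    """Rewrite one RAW CSV as gzip-compressed CSV files, one folder per partition value."""
    groups = defaultdict(list)
    with raw_csv.open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        for row in reader:
            groups[row[partition_col]].append(row)

    written = []
    for value, rows in sorted(groups.items()):
        # Partition folder such as std/sales/country=US/ (assumed naming convention)
        out = std_dir / f"{partition_col}={value}" / "part-0000.csv.gz"
        out.parent.mkdir(parents=True, exist_ok=True)
        with gzip.open(out, "wt", newline="") as g:
            writer = csv.DictWriter(g, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        written.append(out)
    return written
```

A job that rebuilds CUR can then read only the partition folders it needs instead of rescanning every RAW file.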
CUR: The CUR zone is where our Databricks Delta table data will live, as well as any actual curated data files. Consumption can be done from the Databricks Delta table using a Spark connector, such as the one in Power BI. Databricks Delta table data is stored as Snappy-compressed Parquet files. Databricks Delta also has query optimization that alleviates some partitioning requirements. Databricks/Spark can be used to load this zone from STD using the Delta format.
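Loading CUR from STD in Delta format could look roughly like the following. This is a sketch, not a prescribed job: the function name, the paths, the default of Parquet as the STD file type, and the full-overwrite mode are assumptions. It expects an existing `SparkSession` (Databricks notebooks provide one as `spark`) on a cluster with Delta available.

```python
def load_cur_from_std(spark, std_path, cur_path, std_format="parquet", partition_cols=None):
    """Read a STD dataset and rewrite it into CUR as a Delta dataset.

    spark          -- an existing SparkSession
    std_format     -- file type used in STD (assumed to be Parquet by default)
    partition_cols -- optional list of column names to partition the Delta output by
    """
    df = spark.read.format(std_format).load(std_path)

    # Full rebuild of the CUR dataset; an incremental load would use a different mode.
    writer = df.write.format("delta").mode("overwrite")
    if partition_cols:
        writer = writer.partitionBy(*partition_cols)
    writer.save(cur_path)
```

In a notebook this might be called as `load_cur_from_std(spark, "/mnt/lake/std/sales", "/mnt/lake/cur/sales")`, where both paths are hypothetical mount points.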
The table layer is actually fairly straightforward, as we are not building models here (though we could; Data Vault is an excellent choice for this area). Tables in Databricks Delta will represent directories within our data lake. Loading of these tables will be done using Spark within Databricks notebooks.
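One way to express "a table represents a directory" is to register a table over a CUR folder that already holds Delta data. The sketch below builds the DDL and runs it on an existing `SparkSession`; the helper names, the table name, and the location are hypothetical.

```python
def create_table_sql(table_name: str, location: str) -> str:
    """Build DDL that registers a Delta table over an existing lake directory."""
    return f"CREATE TABLE IF NOT EXISTS {table_name} USING DELTA LOCATION '{location}'"


def register_delta_table(spark, table_name: str, location: str) -> None:
    """Run the DDL on an existing SparkSession; the data itself stays in the directory."""
    spark.sql(create_table_sql(table_name, location))
```

From a notebook, `register_delta_table(spark, "cur.sales", "/mnt/lake/cur/sales")` would make the folder queryable as `cur.sales`, assuming the `cur` database exists and the folder contains a Delta dataset.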
Data Lake Architecture
Databricks Delta Guide
Azure Data Lake Best Practices