I've been working with Databricks Delta for clients since it was in preview. It changed the game for how we do ETL/ELT within a data lake and greatly simplified the process. Previously, when doing ETL in a data lake, we would need to extract the inserts, updates, and deletes and load them as append-only files to a table.
Benefits I've encountered
Consider the examples below of metadata-driven ELT.
Just two quick examples of benefits I've encountered when working with Delta. ETL development is changing, things are getting more automated, and this is a good thing!
1. Automated DDL Statements
Here the first step is to write out our file layer; we can do an overwrite or an append depending on the scenario. This writes out snappy-compressed Parquet files along with their schema. If we were to accidentally load the same file twice, Delta can account for that, and you would not end up with duplicate data. So if you have an insert-only architecture, you don't even need to write update statements: you can just append the file to its directory, and if you've created a table over that directory using Delta, it will pick up that data.
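A minimal sketch of that file-layer load, assuming hypothetical landing and lake paths. `COPY INTO` is Databricks SQL and is idempotent, which is what lets an accidental re-load of the same file be skipped rather than duplicated; in plain open-source Spark you would do the equivalent append with `df.write.format("delta").mode("append")`:

```sql
-- Hypothetical paths. Append today's extract into the file layer;
-- Delta writes snappy-compressed Parquet files plus a transaction
-- log that records the schema. Files already loaded are skipped,
-- so re-running this statement does not create duplicates.
COPY INTO delta.`/mnt/lake/raw/customer`
FROM '/mnt/landing/customer/'
FILEFORMAT = PARQUET;
```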
Here we are automating the creation of our data lake datasets. Since we are not explicitly defining the schema, this creates an external table, meaning that if we drop the table, the underlying data itself won't be deleted. The table inherits its schema from the directory we loaded in the previous step.
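The automated DDL can be this short, assuming a hypothetical database and location. Because no column list is given, the table picks up its schema from the Delta transaction log already sitting in the directory, and because a `LOCATION` is specified it is an external table:

```sql
-- Hypothetical database and path. No columns are declared; the
-- schema is inherited from the Delta log in the target directory.
-- DROP TABLE on an external table leaves the underlying files intact.
CREATE TABLE IF NOT EXISTS lake.customer
USING DELTA
LOCATION '/mnt/lake/raw/customer';
```

Since the statement is the same shape for every dataset, it can be generated from metadata (a list of table names and paths) rather than hand-written per table.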
2. Automated DML Statements
In most cases we need to do standard ETL when processing data in a lake. The general best-practice pattern is Load Files > Load Staging > Load Target.
Here we can create reusable patterns, instead of the previous best practice, which was to automatically generate a MERGE script for each source and target table.
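A sketch of one such pattern, assuming hypothetical staging and target tables keyed on `customer_id`. Delta's `UPDATE SET *` / `INSERT *` shorthand means the statement doesn't enumerate columns, so the same template works for any table pair and can be driven entirely from metadata:

```sql
-- Hypothetical tables and key column. The same MERGE shape applies
-- to every staging/target pair, so table names and join keys can be
-- parameterized from a metadata table instead of per-table scripts.
MERGE INTO lake.customer AS tgt
USING staging.customer AS src
  ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET *      -- update all matching columns
WHEN NOT MATCHED THEN INSERT *;     -- insert all columns for new keys
```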
What's next for the world of Big Data?
I started my career in a very interesting space, between two worlds, alongside world-class data management and architecture consultants and industry experts. The first world was the very mature data warehousing world, working on the latest and greatest approaches such as Data Vault, which encompasses implementation methodologies and patterns that can be automated through ETL automation. Data warehousing is more relevant than ever. The second world was the world of big data, which from its inception was defined by the three V's: not four, not five, and definitely not eight. If you're determining whether you need a data lake, ask yourself: is this data big? Is it fast? Is it unstructured?
Well, I enjoyed both worlds, and I saw the benefits of each augmenting the other. But what my mentors told me from the beginning was that we would eventually come full circle. With the open-sourced version of Delta, we can now process data in a data lake just like we can in, say, SQL Server. Which can only mean that, eventually, we'll all be working in relational databases again.