Did Hadoop Kill Data Warehousing or Save It?
Updated: Jul 18, 2019
Shortly after the Cloudera and Hortonworks merger, Cloudera announced the sudden departure of its CEO. The merger, coupled with Cloudera's continued drop in stock price, suggests that Horton (the Big Data elephant) is no longer in the room and that the death of Hadoop is imminent.
Big Data - Best understood as the 3 V's (volume, variety, velocity), introduced by Doug Laney in 2001.
Hadoop - an open-source software framework for storing data and running applications on clusters of commodity hardware.
Data Warehousing - "A data warehouse is a copy of transaction data specifically structured for query and analysis." - Ralph Kimball
How can this be? Last year alone, industries spent over $65b to tackle big data problems, and they continue to do so. Could it be that all the speculation around the death of Hadoop merely reflects a failed business strategy, one in which these vendors tried to capitalize on open-source Apache software and thereby opened themselves up to a larger competitor base? After all, Microsoft and Amazon have both leveraged open-source Apache big data software and developed their own enterprise versions of these tools, which are much easier to adopt and leverage in their cloud platforms; their stock prices have never seen greater heights.
THE DEATH OF HADOOP?
To say that big data is dying is to say that Napster killed the music industry. Napster did not kill the music industry but revolutionized the way we listen to music. A blog on NME put it best when they stated:
Even though Hadoop provided the initial capability to work with big data, Amazon and Microsoft were the true beneficiaries who actually capitalized on it. In an age of going digital, where the recently coined saying is "Digitally Transform or Die" and where it's predicted that 30% of Fortune 500 companies won't exist due to technological advances, we can't speculate about the death of Hadoop by looking at a company that has failed to keep pace with its competitors.
Cloudera and Hortonworks were the pioneers in developing the start of a unified platform for the various tools in the Apache ecosystem. Their implementations still required a great deal of overhead, such as cluster management administrators and a unique engineering skill set to work on these tools. Eventually came the further emergence and maturity of the cloud platforms (e.g., AWS and Azure), as well as other tools such as Databricks, that allowed simple configuration in a single unified platform while leveraging the scalable power of the cloud. This eliminated the need for those extra roles and reduced complexity by allowing an individual to scale to multiple machines with the click of a button.
One of the arguments big data vendors marketed was that Hadoop could replace the data warehouse, and they would often refer to surveys conducted by Gartner and Forrester analysts to support this speculation. This caused much confusion in industries looking to invest in analytics, and in response CIOs and CDOs moved CAPEX dollars toward Hadoop platforms in the hope that new technology could solve an old problem. Because, let's face it, data warehousing and business intelligence are hard, and the solution is not wrapped up in a single tool. We cannot deny, though, that these investments from both industry and venture capitalists helped get us to where we are today: better relational systems in terms of functionality and performance.
So yes, Hadoop is dying. But it brought us so much: it challenged us to create better technology together (open source), it questioned our ethics (Cambridge Analytica), and it brought us to where we are today, an age in which accounting innovators are working toward attaching data assets to the balance sheet. There is still a long way to go, though, both for the company still trying to figure out how to look at yesterday's data and for the company trying to predict the answers to tomorrow's questions.
THE RESURRECTION OF DATA WAREHOUSING
Big data describes a type of data: most commonly, highly unstructured data and data captured at high speed. Large volumes of relational data are better off in databases built specifically for data warehousing, which have matured alongside the growth of the cloud platforms (e.g., SQL Data Warehouse on Azure, Redshift on Amazon, and Snowflake). Take Snowflake as an example: it focuses solely on data warehouse software, so its latest investment rounds offer some insight into the direction of the data warehousing market, whereas getting that granularity of information from Microsoft or AWS would require much more analysis because of their many product categories. Snowflake's IPO is expected in the next 1 to 3 years, and to this point the company has raised nearly $1b across various funding rounds. Without conducting a survey, we can see where investors and businesses are already moving their capital expenditures: back to the data warehouse, so that they can ultimately drive analytics.
Big data isn't dying; Hadoop is, the tool that spurred an industry revolution and fueled the war on big data. Now, leading big data vendors such as Databricks are aligning themselves to better operate on structured data through ACID-compliant databases (e.g., Databricks Delta, which has been open sourced as Delta Lake) with operations that act much like those of a relational database. Essentially, they are taking the technology learned from the Apache ecosystem and maturing it to the point where one can operate on it as if it were a normal RDBMS. A counter to this idea is that data lakes often live in Hadoop environments, and businesses continue to build and invest in them because they remain relevant for unstructured data. Yet Gartner analysts predict, "By 2020, 30% of data lakes will be built on standard relational DBMS technology at equal or lower cost than Hadoop". We are already seeing this possibility with Microsoft's SQL Server 2019, which integrates Apache Spark right into the mix of its flagship database. After all, is not the data we derive from unstructured data... structured?
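To make that last question concrete, here is a minimal Python sketch, not anything taken from Delta Lake or SQL Server. It pulls structured rows out of unstructured text and loads them into a relational table inside a single all-or-nothing transaction. The log lines, the `events` table, and the helper names `extract_rows` and `load` are all invented for illustration, and SQLite stands in for whatever ACID-compliant store you actually use.

```python
import re
import sqlite3

# Hypothetical unstructured input: free-text log lines invented for this sketch.
RAW_LINES = [
    "2019-07-01 user=alice action=login",
    "2019-07-01 user=bob action=purchase",
    "garbled line with no recognizable fields",
    "2019-07-02 user=alice action=purchase",
]

# Assumed line layout: date, then user=<name>, then action=<name>.
PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2}) user=(\w+) action=(\w+)")


def extract_rows(lines):
    """Return (date, user, action) tuples for the lines that match PATTERN."""
    return [m.groups() for m in map(PATTERN.match, lines) if m]


def load(conn, rows):
    """Insert all rows in one transaction; any failure rolls back the whole batch."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "event_date TEXT NOT NULL, user_name TEXT NOT NULL, action TEXT NOT NULL)"
)
load(conn, extract_rows(RAW_LINES))

# Once the data is structured, ordinary SQL answers the business question.
purchases = conn.execute(
    "SELECT user_name, COUNT(*) FROM events "
    "WHERE action = 'purchase' GROUP BY user_name ORDER BY user_name"
).fetchall()
print(purchases)
```

The unparseable line is simply skipped during extraction, and a batch containing an invalid row would be rejected as a whole rather than half-loaded, which is the relational behavior the newer lake technologies are working to reproduce.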
As these Hadoop systems begin their retirement, data warehouse and MPP technology continues its rise. That rise covers not only the tools but also the way we model data, using techniques like Data Vault (invented by Dan Linstedt during his engagement with the U.S. Department of Defense) to create patterns and automate data integration, as well as how we think about and conduct our work, using modern project management techniques like Agile data warehousing. HDFS systems will continue to be used, but for their original intent, which is advanced analytics, and they will largely be based in the cloud.
No, the data warehouse never died, but for a time the understanding of its role and capabilities seemed to walk out of the war room.