Sean Forgatch

Azure Databricks: Data Profiling

Updated: May 31, 2019



Problem: We need to profile a data object, such as a source file, to understand its key metrics in preparation for Data Warehousing, Engineering, or Science.


Solution: We will utilize the pandas-profiling package in a Python notebook.


Step 1: Import pandas-profiling package

Step 2: Create Pandas Dataframe over source File and Run Report

Step 3: Review Profile


pandas-profiling location




Step 1: Import pandas-profiling package


To import the library, all we need to do is type in the PyPI package name, pandas-profiling.
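Once the library has been added, a quick check in a notebook cell can confirm that it is visible to Python. This is only a minimal sketch: the helper name `is_installed` is hypothetical, and it relies on the fact that the PyPI package `pandas-profiling` is imported as the module `pandas_profiling`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the Python path."""
    return importlib.util.find_spec(module_name) is not None


# The PyPI package is "pandas-profiling"; the importable module uses an underscore.
if is_installed("pandas_profiling"):
    print("pandas_profiling is available")
else:
    print("pandas_profiling not found; add the PyPI library to the cluster first")
```

If the second message prints, the library has most likely not been attached to the cluster the notebook is running on.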





Step 2: Create Pandas Dataframe over source File and Run Report


*Note: a Pandas DataFrame is different from a regular (Spark) DataFrame and must be created using the Pandas library.
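As a rough sketch of this step, the cell below builds a Pandas DataFrame and wraps the report creation in a small helper. The sample data, the helper name `build_report`, and the commented-out rendering call are assumptions for illustration, not the original notebook's code. Method names for rendering have also differed between pandas-profiling releases, so check the version installed on your cluster.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the source file. In practice you would
# pass your own file path to pd.read_csv, or call .toPandas() on an existing
# Spark DataFrame to convert it.
sample_csv = io.StringIO(
    "id,name,amount\n"
    "1,alpha,10.5\n"
    "2,beta,\n"
    "3,gamma,7.25\n"
)
pdf = pd.read_csv(sample_csv)


def build_report(frame: pd.DataFrame):
    """Build a pandas-profiling report for a Pandas DataFrame."""
    # Imported lazily so the DataFrame code above runs even without the library.
    import pandas_profiling

    return pandas_profiling.ProfileReport(frame)


# In a Databricks notebook the report could then be displayed, for example:
# report = build_report(pdf)
# displayHTML(report.to_html())
```

Passing a Spark DataFrame directly to `ProfileReport` is not expected to work in this setup, which is why the conversion to Pandas comes first.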



Step 3: Review Profile


The results are far superior to those of other data profiling libraries. However, it is quite difficult to get the raw data out. There is a method that will give you the data, but you will spend quite a lot of time getting it into a usable format.
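If all you need is a handful of raw metrics in tabular form, one alternative is to compute them directly with Pandas. To be clear, the sketch below is a plain-Pandas substitute and does not extract anything from the pandas-profiling report itself. The function name `summarize` and the demo data are hypothetical.

```python
import pandas as pd


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with a few basic profiling metrics."""
    summary = pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),  # column data type as text
            "count": frame.count(),             # non-null values
            "missing": frame.isna().sum(),      # null values
            "distinct": frame.nunique(),        # distinct non-null values
        }
    )
    summary["missing_pct"] = summary["missing"] / len(frame) * 100
    return summary


# Small demo frame with one missing value in the "city" column.
demo = pd.DataFrame({"id": [1, 2, 3], "city": ["Denver", None, "Denver"]})
metrics = summarize(demo)
print(metrics)
```

Because the result is itself a DataFrame, it can be written to a table or joined with other metadata without any reshaping.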








©2018 by Modern Data Engineering.