Sean Forgatch
Azure Databricks: Data Profiling
Updated: May 31, 2019

Problem: We need to profile a source object (such as a file) to understand its key metrics in preparation for Data Warehousing, Engineering, or Science.
Solution: We will utilize the pandas-profiling package in a Python notebook.
Step 1: Import pandas-profiling package
Step 2: Create Pandas Dataframe over source File and Run Report
Step 3: Review Profile
pandas-profiling location
Step 1: Import pandas-profiling package
To import the library, all we need to do is type in the PyPI package name, pandas-profiling.

Step 2: Create Pandas Dataframe over source File and Run Report
*Note: a Pandas DataFrame is different from a regular (Spark) DataFrame and must be created using the Pandas library.
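As a rough sketch of this step, the cell below reads a CSV source into a Pandas DataFrame and hands it to pandas-profiling. The helper names `load_source` and `build_report` and the file path in the usage comments are hypothetical, chosen only for illustration. `ProfileReport` and `to_html()` come from pandas-profiling, and `displayHTML` is the Databricks notebook function for rendering HTML. This assumes the library from Step 1 is already attached to the cluster.

```python
import pandas as pd


def load_source(path_or_buffer):
    """Read a CSV source into a Pandas (not Spark) DataFrame."""
    return pd.read_csv(path_or_buffer)


def build_report(pdf):
    """Run pandas-profiling over a Pandas DataFrame and return the report."""
    # Imported lazily so load_source still works on a cluster
    # where the pandas-profiling library is not attached yet.
    import pandas_profiling

    return pandas_profiling.ProfileReport(pdf)


# Usage in a notebook cell (the path is a hypothetical example):
# pdf = load_source("/dbfs/FileStore/tables/sample.csv")
# report = build_report(pdf)
# displayHTML(report.to_html())
```

If the data already lives in a Spark DataFrame, calling `.toPandas()` on it gives a Pandas DataFrame that can be passed to `build_report`. Keep in mind that this pulls all rows onto the driver, so it suits samples better than very large tables.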

Step 3: Review Profile
The results are far superior to other data profiling libraries. However, it is quite difficult to get the raw data out. There is a method that will give you the data, but you will spend quite a lot of time getting it into a usable format.
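One way to make the raw data easier to work with, assuming it arrives as a nested dictionary, is to flatten it into dotted keys that can be loaded into a table. The method in question is likely `ProfileReport.get_description()`, but the exact shape of its output varies by release, so the `sample` dictionary below is a hypothetical stand-in and `flatten_description` is an illustrative helper, not part of the library.

```python
def flatten_description(description, parent_key="", sep="."):
    """Flatten a nested dict into a single-level {dotted.key: value} dict."""
    flat = {}
    for key, value in description.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested sections and merge their flattened keys.
            flat.update(flatten_description(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical sample shaped like a profiling summary; real output will differ.
sample = {
    "table": {"n": 3, "nvar": 2},
    "variables": {"age": {"mean": 30.0, "n_missing": 0}},
}
flat = flatten_description(sample)
for name, value in flat.items():
    print(name, value)
```

The flattened dictionary can then be turned into rows, for example with `pd.DataFrame(list(flat.items()), columns=["metric", "value"])`, which is usually a more convenient starting point than the nested structure.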



