Sean Forgatch

May 24, 2019 (1 min read)

Azure Databricks: Data Profiling

Updated: May 31, 2019

Problem: We need to profile a data object (such as a source file) to understand its key metrics, for example data types, missing values, and distributions, in preparation for Data Warehousing, Engineering, or Science.

Solution: We will utilize the pandas-profiling package in a Python notebook.


Step 1: Import pandas-profiling package

Step 2: Create Pandas Dataframe over source File and Run Report

Step 3: Review Profile

pandas-profiling location

Step 1: Import pandas-profiling package

To import the library, all we need to do is enter its PyPI package name, pandas-profiling, when adding the library in Databricks.

Step 2: Create Pandas Dataframe over source File and Run Report

*Note: a Pandas DataFrame is different from a regular (Spark) DataFrame and must be created using the Pandas library.
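As a rough illustration of this step, here is a minimal sketch of loading a source file into a Pandas DataFrame and rendering the report. The helper names `load_source` and `build_profile_html` and the file path are hypothetical, chosen only for this example. `ProfileReport` and `to_html()` come from pandas-profiling, and `displayHTML` is the Databricks notebook helper for rendering HTML, so the last two lines only make sense inside a notebook.

```python
import pandas as pd


def load_source(csv_path):
    """Read the source file into a Pandas (not Spark) DataFrame."""
    # If the data is already a Spark DataFrame, spark_df.toPandas() converts it,
    # but that collects every row onto the driver node.
    return pd.read_csv(csv_path)


def build_profile_html(pandas_df):
    """Run pandas-profiling over the DataFrame and return the report as HTML."""
    # Imported inside the function so load_source still works
    # before the library from Step 1 has been attached.
    import pandas_profiling

    report = pandas_profiling.ProfileReport(pandas_df)
    return report.to_html()


# In a Databricks notebook cell (the path below is a hypothetical example):
# df = load_source("/dbfs/mnt/raw/customers.csv")
# displayHTML(build_profile_html(df))
```

Keeping the load and the report in separate functions makes it easy to swap in a different source, such as a table converted with `toPandas()`, without touching the profiling call.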

Step 3: Review Profile

The results are far superior to other data profiling libraries. However, it is quite difficult to get the raw data out. There is a method that will give you the data, but you will spend quite a lot of time getting it into a usable format.
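If the raw numbers matter more than the rendered report, one workaround is to compute a handful of the same metrics with plain Pandas. To be clear, this is a swapped-in alternative and not the pandas-profiling method mentioned above. The function name `basic_profile` and the choice of metrics are assumptions made for this sketch.

```python
import pandas as pd


def basic_profile(pandas_df):
    """Return one row per source column with a few raw profiling metrics."""
    # Each Series below is indexed by column name, so they align row by row.
    return pd.DataFrame({
        "dtype": pandas_df.dtypes.astype(str),  # data type of each column
        "count": pandas_df.count(),             # non-null values
        "missing": pandas_df.isna().sum(),      # null / NaN values
        "distinct": pandas_df.nunique(),        # distinct non-null values
    })


# Small example with one missing value in the "city" column.
sample = pd.DataFrame({"id": [1, 2, 3], "city": ["a", None, "a"]})
summary = basic_profile(sample)
print(summary)
```

The result is an ordinary DataFrame, so it can be filtered, joined, or saved like any other table, which sidesteps the reshaping work described above at the cost of a much thinner set of metrics.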
