• Sean Forgatch

Azure Data Factory: Ingesting the 10TB GDELT Dataset



The GDELT dataset is one of the most interesting datasets in the world and is perfect for Analytics, Big Data, and AI project experimentation.


GDELT Raw Data


The raw dataset includes a master file which currently list around 400,000 file paths to call using an HTTP source in Azure Data Factory. There is also a 15 minute incremental update file which only list the newest of the 3 file types included in the dataset. We will be focusing on the initial load of the data.


*This will cost you in Azure Credits and will most likely exceed the free credit limits you are given, therefore you may not be able to ingest all the files unless you add additional credits.



GDELT Ingestion Pipeline

Data Factory Items to be built:


Pipeline

1 pipeline to accomplish the following:

  1. Copy master list of data files

  2. Run U-SQL job on master list to parse out file names(you can do this in Dataflows if you'd like)

  3. Lookup the output of U-SQL(or dataflows) activity

  4. Filter on ID column for each number (0-9) this is a simple technique for our use case to increase performance.

Datasets

  • ADLS - Gdelt | File path: DataLake/Raw/Gdelt/Master

  • ADLS - Gdelt - data | File path: DataLake/Raw/Gdelt/Data

  • ADLS - Gdelt - refined | File path: DataLake/Raw/Gdelt/Master/Refined

  • HTTP - Gdelt - master | Relative Url: masterfilelist.txt

  • HTTP - Gdelt - data | Relative url: @dataset().DS_SourcePath


Connections

  • Azure Data Lake Store | Standard connection to your Azure Data Lake Store

  • HTTP | See properties below:




Step 1: Create a pipeline named PL_Ingest_Gdelt

Step 2: Create Linked Services: Reference Linked Services items above

Step 3: Create Datasets DS_HTTP_Gdelt_Master (source)

  • Linked Service: Select the LS_HTTP_Gdelt service your previously created

  • Relative Url: masterfilelist.txt

  • Request method: GET

  • File format settings: Click "Detect Text Format"

  • Schema: Click "import schema", this will get 3 columns called Prop_0 - Prop_2

Step 4: Create Dataset DS_ADLS_Gdelt (sink)

  • Directory: DataLake/Raw/Gdelt/Master

  • Filename: Master

  • File format settings: Click "Detect Text Format" and Column names in the first row

  • Schema: Click "Import schema" this should create 3 columns named ID, FileKey, SourcePath. If it does not manually create them as String types.

Step 5: Drag a Copy activity to the template named Copy-Master

  • Source: DS_HTTP_Gdelt_Master

  • Sink: DS_ADLS_Gdelt

  • Mappings: None

Step 6: Drag a U-SQL activity as the next step

Reference the Data Lake Analytics Linked Service and point to the path of a script that adds a column to the dataset which is with only the name of the file

(***will update this section soon)


Step 7: Lookup Activity

  • Lookup the refined output of the U-SQL or Dataflows job

  • Name: Lookup-Master

Step 8: Filters (10)

Repeat this step 10 times for each number 0-9 while replacing the numbers.

  • Drag Filter activity as next step named Batch-1

  • Settings: Items: @activity('Lookup-Master').output.value

  • Settings: Condition: @startswith(item().Id,'1')

Step 9: Foreach Activity

This will copy files in parallel batches up to 50 in our filtered datasets. The expression is referring to the filter activity which comes before it. Do this foreach Filter(reference diagram at top for clarity)

  • Drag ForEach activity to pane

  • Settings: Items: @activity('Batch-1').output.value



This will now execute 10 ForEach copy activities in which each of those are copying files in batches up to 50.


©2018 by Modern Data Engineering. Proudly created with Wix.com