Azure Data Factory: Ingesting the 10TB GDELT Dataset
The GDELT dataset is one of the most interesting datasets in the world and is perfect for Analytics, Big Data, and AI project experimentation.
The raw dataset includes a master file that currently lists around 400,000 file paths, which we will call using an HTTP source in Azure Data Factory. There is also a 15-minute incremental update file that lists only the newest of the 3 file types included in the dataset. We will focus on the initial load of the data.
*This will cost you Azure credits and will most likely exceed the free credit limit you are given, so you may not be able to ingest all the files unless you add additional credits.
GDELT Ingestion Pipeline
Data Factory Items to be built:
1 pipeline to accomplish the following:
Copy master list of data files
Run a U-SQL job on the master list to parse out file names (you can do this in Data Flows if you'd like)
Lookup the output of the U-SQL (or Data Flow) activity
Filter on the ID column for each digit (0-9); this is a simple technique for our use case to increase parallelism and performance.
Datasets:

| Dataset | Key property |
| --- | --- |
| ADLS - Gdelt | File path: DataLake/Raw/Gdelt/Master |
| ADLS - Gdelt - data | File path: DataLake/Raw/Gdelt/Data |
| ADLS - Gdelt - refined | File path: DataLake/Raw/Gdelt/Master/Refined |
| HTTP - Gdelt - master | Relative Url: masterfilelist.txt |
| HTTP - Gdelt - data | Relative Url: @dataset().DS_SourcePath |

Linked Services:

| Linked Service | Key property |
| --- | --- |
| Azure Data Lake Store | Standard connection to your Azure Data Lake Store |
| HTTP | See properties below |
Step 1: Create a pipeline named PL_Ingest_Gdelt
Step 2: Create Linked Services: Reference Linked Services items above
Step 3: Create Dataset DS_HTTP_Gdelt_Master (source)
Linked Service: Select the LS_HTTP_Gdelt service you previously created
Relative Url: masterfilelist.txt
Request method: GET
File format settings: Click "Detect Text Format"
Schema: Click "Import schema"; this will create 3 columns named Prop_0 through Prop_2
Step 4: Create Dataset DS_ADLS_Gdelt (sink)
File format settings: Click "Detect Text Format" and check "Column names in the first row"
Schema: Click "Import schema"; this should create 3 columns named ID, FileKey, SourcePath. If it does not, create them manually as String types.
Step 5: Drag a Copy activity onto the pipeline canvas and name it Copy-Master
Step 6: Drag a U-SQL activity as the next step
Reference the Data Lake Analytics Linked Service and point to the path of a script that adds a column containing only the name of the file
(***will update this section soon)
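While the U-SQL script itself is not shown yet, a rough Python sketch of what this refinement step needs to produce may help. Each master-list row has three fields (size, hash, URL — the Prop_0 through Prop_2 columns imported earlier), and the output needs ID, FileKey, and SourcePath columns. Note that using the row number as the ID is an assumption here; the article does not show how the real script derives it.

```python
# Hypothetical sketch of the refinement the U-SQL (or Data Flow) step performs;
# the real script is not shown in this article.
# Each raw master-list line has three fields: size, md5 hash, and source URL.

def refine_row(row_number, line):
    """Turn one raw master-list line into an (ID, FileKey, SourcePath) row."""
    size, md5, url = line.split()            # Prop_0, Prop_1, Prop_2
    file_key = url.rsplit("/", 1)[-1]        # keep only the file name
    return (str(row_number), file_key, url)  # ID as row number is an assumption

sample = ("150383 297a16b493de7cf6ca809a7cc63446b6 "
          "http://data.gdeltproject.org/gdeltv2/20150218230000.export.CSV.zip")
print(refine_row(1, sample))
```

The refined output then lands in the DataLake/Raw/Gdelt/Master/Refined path for the Lookup activity to read.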
Step 7: Lookup Activity
Lookup the refined output of the U-SQL (or Data Flow) job
Step 8: Filters (10)
Repeat this step 10 times, once for each digit 0-9, substituting the digit in the activity name and condition.
Drag a Filter activity as the next step and name it Batch-1
Settings: Items: @activity('Lookup-Master').output.value
Settings: Condition: @startswith(item().Id,'1')
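The ten Filter conditions simply partition the Lookup output by the leading character of the ID column. A minimal Python sketch of the equivalent logic (the row shape here is assumed from the dataset schema above):

```python
# Mirror of the ADF expression @startswith(item().Id, '1'):
# each Filter activity keeps only the rows whose Id starts with its digit.

def batch_for_digit(rows, digit):
    """Return the rows one Filter activity (e.g. Batch-1) would pass through."""
    return [row for row in rows if row["Id"].startswith(digit)]

rows = [{"Id": "1",  "SourcePath": "a.zip"},
        {"Id": "12", "SourcePath": "b.zip"},
        {"Id": "3",  "SourcePath": "c.zip"}]
print(batch_for_digit(rows, "1"))  # keeps Ids "1" and "12"
```

Note that startswith means multi-digit IDs fall into the batch of their first digit, so the ten batches will not be perfectly even; that is acceptable for this use case.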
Step 9: Foreach Activity
This will copy the files in our filtered batches in parallel, up to 50 at a time. The Items expression refers to the Filter activity that precedes it. Do this for each Filter activity (reference the diagram at the top for clarity).
Drag ForEach activity to pane
Settings: Items: @activity('Batch-1').output.value
This will now execute 10 ForEach activities, each of which copies files in parallel batches of up to 50.
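ADF's ForEach handles this parallelism for you via its batch count setting. Purely as an illustration of the pattern, the equivalent client-side logic looks like a pool of 50 workers each copying one file at a time (the copy_file body is a placeholder, not ADF's implementation):

```python
# Illustration only: a ForEach with a batch count of 50 behaves roughly like
# a thread pool of 50 workers, each running one Copy activity at a time.
from concurrent.futures import ThreadPoolExecutor

def copy_file(source_path):
    """Placeholder for one Copy activity run (HTTP download + write to ADLS)."""
    return f"copied {source_path.rsplit('/', 1)[-1]}"

urls = [f"http://data.gdeltproject.org/gdeltv2/file{i}.zip" for i in range(5)]
with ThreadPoolExecutor(max_workers=50) as pool:  # analogue of ADF batch count
    results = list(pool.map(copy_file, urls))
print(results)
```

With 10 ForEach activities running side by side, the pipeline can have up to 500 copies in flight at once, which is why the credit warning at the top of this article matters.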