Azure Data Factory: Ingesting the 10TB GDELT Dataset
The GDELT dataset is one of the most interesting datasets in the world and is perfect for Analytics, Big Data, and AI project experimentation.
The raw dataset includes a master file that currently lists around 400,000 file paths, which we will call using an HTTP source in Azure Data Factory. There is also a 15-minute incremental update file that lists only the newest of the 3 file types included in the dataset. We will focus on the initial load of the data.
*This will cost you Azure credits and will most likely exceed the free credit limit you are given, so you may not be able to ingest all the files unless you add additional credits.
GDELT Ingestion Pipeline
Data Factory Items to be built:
1 pipeline to accomplish the following:
Copy master list of data files
Run a U-SQL job on the master list to parse out file names (you can do this in Data Flows if you'd like)
Lookup the output of the U-SQL (or Data Flow) activity
Filter on the ID column for each digit (0-9); this is a simple technique for our use case to increase parallelism and performance.
Datasets:

| Dataset | Key property |
| --- | --- |
| ADLS - Gdelt | File path: DataLake/Raw/Gdelt/Master |
| ADLS - Gdelt - data | File path: DataLake/Raw/Gdelt/Data |
| ADLS - Gdelt - refined | File path: DataLake/Raw/Gdelt/Master/Refined |
| HTTP - Gdelt - master | Relative Url: masterfilelist.txt |
| HTTP - Gdelt - data | Relative Url: @dataset().DS_SourcePath |

Linked Services:

| Linked Service | Key property |
| --- | --- |
| Azure Data Lake Store | Standard connection to your Azure Data Lake Store |
| HTTP | See properties below |
Step 1: Create a pipeline named PL_Ingest_Gdelt
Step 2: Create Linked Services: Reference Linked Services items above
Step 3: Create Dataset DS_HTTP_Gdelt_Master (source)
Linked Service: Select the LS_HTTP_Gdelt service you previously created
Relative Url: masterfilelist.txt
Request method: GET
File format settings: Click "Detect Text Format"
Schema: Click "Import schema"; this will create 3 columns named Prop_0 through Prop_2
Step 4: Create Dataset DS_ADLS_Gdelt (sink)
File format settings: Click "Detect Text Format" and check "Column names in the first row"
Schema: Click "Import schema"; this should create 3 columns named ID, FileKey, SourcePath. If it does not, create them manually as String types.
Step 5: Drag a Copy activity onto the pipeline canvas and name it Copy-Master
Step 6: Drag a U-SQL activity as the next step
Reference the Data Lake Analytics Linked Service and point to the path of a script that adds a column containing only the name of the file
(***will update this section soon)
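While the U-SQL script itself is not shown yet, a rough Python sketch of what this refinement step needs to produce may help. Each master-list row has three fields (size, hash, URL — the Prop_0 through Prop_2 columns imported earlier), and the output needs ID, FileKey, and SourcePath columns. Note that using the row number as the ID is an assumption here; the article does not show how the real script derives it.

```python
# Hypothetical sketch of the refinement the U-SQL (or Data Flow) step performs;
# the real script is not shown in this article.
# Each raw master-list line has three fields: size, md5 hash, and source URL.

def refine_row(row_number, line):
    """Turn one raw master-list line into an (ID, FileKey, SourcePath) row."""
    size, md5, url = line.split()            # Prop_0, Prop_1, Prop_2
    file_key = url.rsplit("/", 1)[-1]        # keep only the file name
    return (str(row_number), file_key, url)  # ID as row number is an assumption

sample = ("150383 297a16b493de7cf6ca809a7cc63446b6 "
          "http://data.gdeltproject.org/gdeltv2/20150218230000.export.CSV.zip")
print(refine_row(1, sample))
```

The refined output then lands in the DataLake/Raw/Gdelt/Master/Refined path for the Lookup activity to read.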
Step 7: Lookup Activity
Lookup the refined output of the U-SQL (or Data Flow) job
Step 8: Filters (10)
Repeat this step 10 times, once for each digit 0-9, substituting the digit in the activity name and condition.
Drag a Filter activity as the next step and name it Batch-1
Settings: Items: @activity('Lookup-Master').output.value
Settings: Condition: @startswith(item().Id,'1')
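The ten Filter conditions simply partition the Lookup output by the leading character of the ID column. A minimal Python sketch of the equivalent logic (the row shape here is assumed from the dataset schema above):

```python
# Mirror of the ADF expression @startswith(item().Id, '1'):
# each Filter activity keeps only the rows whose Id starts with its digit.

def batch_for_digit(rows, digit):
    """Return the rows one Filter activity (e.g. Batch-1) would pass through."""
    return [row for row in rows if row["Id"].startswith(digit)]

rows = [{"Id": "1",  "SourcePath": "a.zip"},
        {"Id": "12", "SourcePath": "b.zip"},
        {"Id": "3",  "SourcePath": "c.zip"}]
print(batch_for_digit(rows, "1"))  # keeps Ids "1" and "12"
```

Note that startswith means multi-digit IDs fall into the batch of their first digit, so the ten batches will not be perfectly even; that is acceptable for this use case.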
Step 9: Foreach Activity
This will copy the files in our filtered batches in parallel, up to 50 at a time. The Items expression refers to the Filter activity that precedes it. Do this for each Filter activity (reference the diagram at the top for clarity).
Drag ForEach activity to pane
Settings: Items: @activity('Batch-1').output.value
This will now execute 10 ForEach activities, each of which copies files in parallel batches of up to 50.
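ADF's ForEach handles this parallelism for you via its batch count setting. Purely as an illustration of the pattern, the equivalent client-side logic looks like a pool of 50 workers each copying one file at a time (the copy_file body is a placeholder, not ADF's implementation):

```python
# Illustration only: a ForEach with a batch count of 50 behaves roughly like
# a thread pool of 50 workers, each running one Copy activity at a time.
from concurrent.futures import ThreadPoolExecutor

def copy_file(source_path):
    """Placeholder for one Copy activity run (HTTP download + write to ADLS)."""
    return f"copied {source_path.rsplit('/', 1)[-1]}"

urls = [f"http://data.gdeltproject.org/gdeltv2/file{i}.zip" for i in range(5)]
with ThreadPoolExecutor(max_workers=50) as pool:  # analogue of ADF batch count
    results = list(pool.map(copy_file, urls))
print(results)
```

With 10 ForEach activities running side by side, the pipeline can have up to 500 copies in flight at once, which is why the credit warning at the top of this article matters.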