Automating Metadata Processing with Azure Data Factory & Databricks
In my previous post, I shared how the first step in my AI image processing startup was to enable secure, global uploads to Azure Blob Storage.
Now, we’re moving from storage to intelligence. Using Azure Data Factory, I’ve set up a BlobEventsTrigger that listens for new uploads under the /blobs/input/ path and instantly kicks off a Databricks notebook workflow to process metadata.
📜 The Automation Flow
- User uploads an image via our web app.
- Blob Storage emits a BlobCreated event via Event Grid.
- The Azure Data Factory BlobEventsTrigger picks up the event.
- ADF triggers a Databricks notebook sequence.
- Databricks extracts and saves metadata for the image.
🛠 ARM Template for the Pipeline
{
  "name": "metapipeline",
  "properties": {
    "description": "Pipeline for metadata processing using Databricks notebook",
    "parameters": {
      "triggerTime": { "type": "string" },
      "fileName": { "type": "string" }
    },
    "activities": [
      {
        "name": "metaNotebook",
        "type": "DatabricksNotebook",
        "dependsOn": [],
        "policy": {
          "timeout": "2:00:00",
          "retry": 3,
          "retryIntervalInSeconds": 60,
          "secureOutput": true,
          "secureInput": true
        },
        "userProperties": [
          {
            "name": "purpose",
            "value": "Metadata processing and validation"
          }
        ],
        "typeProperties": {
          "notebookPath": "/meta/Main_Metadata_Processing",
          "baseParameters": {
            "environment": "dev",
            "triggerTime": "@pipeline().parameters.triggerTime",
            "fileName": "@pipeline().parameters.fileName"
          }
        },
        "linkedServiceName": {
          "referenceName": "AzureDatabricksLS",
          "type": "LinkedServiceReference"
        }
      }
    ],
    "concurrency": 1,
    "annotations": [
      "metadata",
      "processing",
      "databricks"
    ],
    "variables": {
      "executionStatus": {
        "type": "String",
        "defaultValue": "pending"
      }
    },
    "lastPublishTime": "2025-03-11T06:24:10Z"
  },
  "type": "Microsoft.DataFactory/factories/pipelines"
}
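Before wiring up the trigger, it’s worth confirming the pipeline runs on demand. Here’s a minimal sketch using the azure-mgmt-datafactory SDK; the factory name is a placeholder, and the parameter values stand in for what the trigger will normally supply:

```python
# Minimal sketch: start a manual run of metapipeline via the ADF management SDK.
# The factory name is a placeholder; parameters mimic what the trigger passes.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "mysubid")

run = adf.pipelines.create_run(
    resource_group_name="pixelresourcegroup",
    factory_name="pixelintel-adf",  # placeholder factory name
    pipeline_name="metapipeline",
    parameters={
        "triggerTime": "2025-03-11T06:24:10Z",
        "fileName": "input/photo_123.jpg",
    },
)

# Poll the run status (Queued / InProgress / Succeeded / Failed).
status = adf.pipeline_runs.get("pixelresourcegroup", "pixelintel-adf", run.run_id)
print(status.status)
```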
⚡ Blob Event Trigger
{
  "name": "upload_event_trigger",
  "properties": {
    "runtimeState": "Started",
    "annotations": [
      "Triggered when new files arrive in input container"
    ],
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "metapipeline",
          "type": "PipelineReference"
        },
        "parameters": {
          "triggerTime": "@trigger().outputs.body.eventTime",
          "fileName": "@trigger().outputs.body.fileName"
        }
      }
    ],
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "scope": "/subscriptions/mysubid/resourceGroups/pixelresourcegroup/providers/Microsoft.Storage/storageAccounts/pixelintelstorage",
      "blobPathBeginsWith": "/blobs/input/",
      "blobPathEndsWith": ".jpg",
      "ignoreEmptyBlobs": true,
      "events": [
        "Microsoft.Storage.BlobCreated"
      ],
      "retryPolicy": {
        "count": 3,
        "intervalInSeconds": 30
      },
      "batchSize": 10,
      "maxConcurrency": 5
    }
  }
}
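The quickest end-to-end test is simply dropping a file into the watched path and watching the trigger fire. A sketch with azure-storage-blob; the container name and connection-string environment variable are assumptions about my storage layout:

```python
# Upload a test image into the watched path to fire upload_event_trigger.
# Container name and env var are assumptions; adjust to your layout.
import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
blob = service.get_blob_client(container="input", blob="photo_123.jpg")

with open("photo_123.jpg", "rb") as f:
    blob.upload_blob(f, overwrite=True)  # emits BlobCreated -> starts metapipeline
```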
🔍 What the Notebooks Do
Step 1: Download Image Data
- Runs Download_Image_Data to fetch the latest image.
- Logs the image path.
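The real notebook is in the repo; as a rough sketch of what this step does, assuming the trigger’s fileName and triggerTime arrive as notebook widgets and the storage account is mounted at a hypothetical /mnt/pixelintelstorage:

```python
# Sketch of Download_Image_Data (runs inside a Databricks notebook, where
# dbutils is available). Paths and mount point are assumptions.
file_name = dbutils.widgets.get("fileName")       # e.g. "input/photo_123.jpg"
trigger_time = dbutils.widgets.get("triggerTime")

src_path = f"/mnt/pixelintelstorage/{file_name}"  # assumed mount of the container
local_path = f"/tmp/{file_name.split('/')[-1]}"
dbutils.fs.cp(src_path, f"file:{local_path}")     # copy blob to driver-local disk

print(f"Downloaded {src_path} -> {local_path} (event time: {trigger_time})")
```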
Step 2: Extract Metadata
- Runs Extract_Metadata to pull EXIF and device info.
- Outputs a rich JSON object with camera and image parameters.
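For a flavor of the extraction, here’s a minimal sketch using Pillow; the actual notebook may differ. Note that camera settings like ISO and exposure live in the Exif sub-IFD rather than the top-level tags:

```python
# Minimal EXIF extraction sketch with Pillow; the real notebook may differ.
from PIL import Image
from PIL.ExifTags import IFD, TAGS

def extract_metadata(image_path: str) -> dict:
    img = Image.open(image_path)
    meta = {"width": img.width, "height": img.height, "format": img.format}

    exif = img.getexif()
    # Top-level IFD holds device info (Make, Model, Orientation, ...).
    for tag_id, value in exif.items():
        meta[TAGS.get(tag_id, str(tag_id))] = str(value)
    # Camera parameters (ExposureTime, ISOSpeedRatings, FocalLength, ...)
    # live in the Exif sub-IFD.
    for tag_id, value in exif.get_ifd(IFD.Exif).items():
        meta[TAGS.get(tag_id, str(tag_id))] = str(value)
    return meta

metadata = extract_metadata("/tmp/photo_123.jpg")  # path from step 1
print(metadata.get("Model"), metadata.get("ISOSpeedRatings"))
```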
Step 3: Save Metadata
- Runs Save_Metadata to persist the extracted details.
- Stores the metadata alongside the image for future ML analysis.
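Again, the full script is in the repo; a sketch of the idea, writing the JSON next to the image under the same assumed mount:

```python
# Sketch of Save_Metadata: persist the JSON next to the source image
# (mount path is an assumption; runs in a Databricks notebook).
import json

def save_metadata(meta: dict, image_path: str) -> str:
    json_path = image_path.rsplit(".", 1)[0] + ".json"  # photo_123.jpg -> photo_123.json
    dbutils.fs.put(json_path, json.dumps(meta, indent=2), True)  # True = overwrite
    return json_path

saved = save_metadata(metadata, f"/mnt/pixelintelstorage/{file_name}")
print(f"Metadata saved to {saved}")
```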
You can explore the notebook scripts here: GitHub repo.
💡 Why Metadata Matters
Some might ask: why bother with all this metadata now? The answer is simple: data is future leverage.
Imagine recommending the best devices for image quality based on actual ML results, or suggesting optimal camera settings like:
- Orientation
- Resolution
- Exposure Time
- ISO Speed Ratings
- Focal Length
We could even design specialized hardware for high-accuracy AI image analysis.
🚀 What’s Next
All of this will run in parallel with my MLflow experiments—more on that in the next post.