Automating Metadata Processing with Azure Data Factory & Databricks
In my previous post, I shared how the first step in my AI image processing startup was to enable secure, global uploads to Azure Blob Storage.
Now, we’re moving from storage to intelligence. Using Azure Data Factory, I’ve set up a BlobEventsTrigger that listens for new uploads under the /blobs/input/ path and instantly kicks off a Databricks notebook workflow to process metadata.
📜 The Automation Flow
- User uploads an image via our web app.
- Blob Storage emits a BlobCreated event via Event Grid.
- The Azure Data Factory BlobEventsTrigger picks up the event.
- ADF triggers a Databricks notebook sequence.
- Databricks extracts and saves metadata for the image.
🛠 ARM Template for the Pipeline
{
  "name": "metapipeline",
  "properties": {
    "description": "Pipeline for metadata processing using Databricks notebook",
    "parameters": {
      "triggerTime": { "type": "string" },
      "fileName": { "type": "string" }
    },
    "activities": [
      {
        "name": "metaNotebook",
        "type": "DatabricksNotebook",
        "dependsOn": [],
        "policy": {
          "timeout": "2:00:00",
          "retry": 3,
          "retryIntervalInSeconds": 60,
          "secureOutput": true,
          "secureInput": true
        },
        "userProperties": [
          {
            "name": "purpose",
            "value": "Metadata processing and validation"
          }
        ],
        "typeProperties": {
          "notebookPath": "/meta/Main_Metadata_Processing",
          "baseParameters": {
            "environment": "dev",
            "triggerTime": "@pipeline().parameters.triggerTime",
            "fileName": "@pipeline().parameters.fileName"
          }
        },
        "linkedServiceName": {
          "referenceName": "AzureDatabricksLS",
          "type": "LinkedServiceReference"
        }
      }
    ],
    "concurrency": 1,
    "annotations": [
      "metadata",
      "processing",
      "databricks"
    ],
    "variables": {
      "executionStatus": {
        "type": "String",
        "defaultValue": "pending"
      }
    },
    "lastPublishTime": "2025-03-11T06:24:10Z"
  },
  "type": "Microsoft.DataFactory/factories/pipelines"
}
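Before wiring up the trigger, it’s worth confirming the pipeline runs on demand. Here’s a minimal sketch using the azure-mgmt-datafactory SDK; the factory name is a placeholder, and the parameter values stand in for what the trigger will normally supply:

```python
# Minimal sketch: start a manual run of metapipeline via the ADF management SDK.
# The factory name is a placeholder; parameters mimic what the trigger passes.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "mysubid")

run = adf.pipelines.create_run(
    resource_group_name="pixelresourcegroup",
    factory_name="pixelintel-adf",  # placeholder factory name
    pipeline_name="metapipeline",
    parameters={
        "triggerTime": "2025-03-11T06:24:10Z",
        "fileName": "input/photo_123.jpg",
    },
)

# Poll the run status (Queued / InProgress / Succeeded / Failed).
status = adf.pipeline_runs.get("pixelresourcegroup", "pixelintel-adf", run.run_id)
print(status.status)
```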
⚡ Blob Event Trigger
{
  "name": "upload_event_trigger",
  "properties": {
    "runtimeState": "Started",
    "annotations": [
      "Triggered when new files arrive in input container"
    ],
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "metapipeline",
          "type": "PipelineReference"
        },
        "parameters": {
          "triggerTime": "@trigger().outputs.body.eventTime",
          "fileName": "@trigger().outputs.body.fileName"
        }
      }
    ],
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "scope": "/subscriptions/mysubid/resourceGroups/pixelresourcegroup/providers/Microsoft.Storage/storageAccounts/pixelintelstorage",
      "blobPathBeginsWith": "/blobs/input/",
      "blobPathEndsWith": ".jpg",
      "ignoreEmptyBlobs": true,
      "events": [
        "Microsoft.Storage.BlobCreated"
      ],
      "retryPolicy": {
        "count": 3,
        "intervalInSeconds": 30
      },
      "batchSize": 10,
      "maxConcurrency": 5
    }
  }
}
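The quickest end-to-end test is simply dropping a file into the watched path and watching the trigger fire. A sketch with azure-storage-blob; the container name and connection-string environment variable are assumptions about my storage layout:

```python
# Upload a test image into the watched path to fire upload_event_trigger.
# Container name and env var are assumptions; adjust to your layout.
import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
blob = service.get_blob_client(container="input", blob="photo_123.jpg")

with open("photo_123.jpg", "rb") as f:
    blob.upload_blob(f, overwrite=True)  # emits BlobCreated -> starts metapipeline
```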
🔍 What the Notebooks Do
Step 1: Download Image Data
- Runs Download_Image_Data to fetch the latest image.
- Logs the image path.
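The real notebook is in the repo; as a rough sketch of what this step does, assuming the trigger’s fileName and triggerTime arrive as notebook widgets and the storage account is mounted at a hypothetical /mnt/pixelintelstorage:

```python
# Sketch of Download_Image_Data (runs inside a Databricks notebook, where
# dbutils is available). Paths and mount point are assumptions.
file_name = dbutils.widgets.get("fileName")       # e.g. "input/photo_123.jpg"
trigger_time = dbutils.widgets.get("triggerTime")

src_path = f"/mnt/pixelintelstorage/{file_name}"  # assumed mount of the container
local_path = f"/tmp/{file_name.split('/')[-1]}"
dbutils.fs.cp(src_path, f"file:{local_path}")     # copy blob to driver-local disk

print(f"Downloaded {src_path} -> {local_path} (event time: {trigger_time})")
```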
Step 2: Extract Metadata
- Runs Extract_Metadata to pull EXIF and device info.
- Outputs a rich JSON object with camera and image parameters.
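For a flavor of the extraction, here’s a minimal sketch using Pillow; the actual notebook may differ. Note that camera settings like ISO and exposure live in the Exif sub-IFD rather than the top-level tags:

```python
# Minimal EXIF extraction sketch with Pillow; the real notebook may differ.
from PIL import Image
from PIL.ExifTags import IFD, TAGS

def extract_metadata(image_path: str) -> dict:
    img = Image.open(image_path)
    meta = {"width": img.width, "height": img.height, "format": img.format}

    exif = img.getexif()
    # Top-level IFD holds device info (Make, Model, Orientation, ...).
    for tag_id, value in exif.items():
        meta[TAGS.get(tag_id, str(tag_id))] = str(value)
    # Camera parameters (ExposureTime, ISOSpeedRatings, FocalLength, ...)
    # live in the Exif sub-IFD.
    for tag_id, value in exif.get_ifd(IFD.Exif).items():
        meta[TAGS.get(tag_id, str(tag_id))] = str(value)
    return meta

metadata = extract_metadata("/tmp/photo_123.jpg")  # path from step 1
print(metadata.get("Model"), metadata.get("ISOSpeedRatings"))
```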
Step 3: Save Metadata
- Runs Save_Metadata to persist the extracted details.
- Stores the metadata alongside the image for future ML analysis.
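Again, the full script is in the repo; a sketch of the idea, writing the JSON next to the image under the same assumed mount:

```python
# Sketch of Save_Metadata: persist the JSON next to the source image
# (mount path is an assumption; runs in a Databricks notebook).
import json

def save_metadata(meta: dict, image_path: str) -> str:
    json_path = image_path.rsplit(".", 1)[0] + ".json"  # photo_123.jpg -> photo_123.json
    dbutils.fs.put(json_path, json.dumps(meta, indent=2), True)  # True = overwrite
    return json_path

saved = save_metadata(metadata, f"/mnt/pixelintelstorage/{file_name}")
print(f"Metadata saved to {saved}")
```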
You can explore the notebook scripts here: GitHub repo.
💡 Why Metadata Matters
Some might ask: why bother with all this metadata now? The answer is simple: data is future leverage.
Imagine recommending the best devices for image quality based on actual ML results, or suggesting optimal camera settings like:
- Orientation
- Resolution
- Exposure Time
- ISO Speed Ratings
- Focal Length
We could even design specialized hardware for high-accuracy AI image analysis.
🚀 What’s Next
All of this will run in parallel with my MLflow experiments—more on that in the next post.