A Practical Guide to MLOps on AWS: Transforming Raw Data into AI-Ready Datasets with AWS Glue (Phase 02)
In Phase 01, we built the ingestion layer of our Retail AI Insights system. We streamed historical product interaction data into Amazon S3 (Bronze zone) and stored key product metadata with inventory information in DynamoDB.
Now that we have raw data arriving reliably, it's time to clean, enrich, and organize it for downstream AI workflows.
## Objective
* Transform raw event data from the Bronze zone into:
* Cleaned, analysis-ready Parquet files in the Silver zone
* Forecast-specific feature sets in the Gold zone under `/forecast_ready/`
* Recommendation-ready CSV files under `/recommendations_ready/`
This will power:
* Demand forecasting via Amazon Bedrock
* Personalized product recommendations using Amazon Personalize
## What We'll Build in This Phase
* AWS Glue Jobs: Python scripts to clean, transform, and write data to the appropriate S3 zone
* AWS Glue Crawlers: Catalog metadata from S3 into tables for Athena & further processing
* AWS CDK Stack: Provisions all jobs, buckets, and crawlers
* Athena Queries: Run sanity checks on the transformed data
## Directory & Bucket Layout
We'll now be working with the following S3 zones:
* `retail-ai-bronze-zone/` → Raw JSON from Firehose
* `retail-ai-silver-zone/cleaned_data/` → Cleaned Parquet
* `retail-ai-gold-zone/forecast_ready/` → Aggregated features for forecasting
* `retail-ai-gold-zone/recommendations_ready/` → CSV with item metadata for Personalize
You'll also notice a fourth bucket, `retail-ai-zone-assets/`, which stores the ETL scripts and the training dataset.
## Step 1 - Creating Glue Resources via CDK
Now that we've set up our storage zones and uploaded the required ETL scripts and datasets, it's time to define the Glue resources with AWS CDK.
We'll create:
* Three Glue jobs:
  * **DataCleaningETLJob** → Cleans raw JSON into structured Parquet for the Silver Zone.
  * **ForecastGoldETLJob** → Aggregates the cleaned data into daily sales features for demand prediction.
  * **RecommendationGoldETLJob** → Prepares the item metadata CSV for Amazon Personalize.
* Four Glue crawlers, one for each data zone
* Athena queries to validate everything
From the project root, generate the construct file:
```bash
mkdir -p lib/constructs/analytics && touch lib/constructs/analytics/glue-resources.ts
```
Make sure your local `scripts/` and `dataset/` directories are present, then upload them to your S3 assets bucket:
```bash
aws s3 cp ./scripts/sales_etl_script.py s3://retail-ai-zone-assets/scripts/
aws s3 cp ./scripts/forecast_gold_etl_script.py s3://retail-ai-zone-assets/scripts/
aws s3 cp ./scripts/user_interaction_etl_script.py s3://retail-ai-zone-assets/scripts/
aws s3 cp ./dataset/events_with_metadata.csv s3://retail-ai-zone-assets/dataset/
aws s3 cp ./scripts/inventory_forecaster.py s3://retail-ai-zone-assets/scripts/
```
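If you'd rather script these uploads, here's a minimal boto3 sketch that does the same thing. The bucket name and keys mirror the commands above, and it assumes your AWS credentials and region are already configured:

```python
import boto3
from pathlib import Path

s3 = boto3.client("s3")
assets_bucket = "retail-ai-zone-assets"

# Local file -> S3 key, mirroring the aws s3 cp commands above
uploads = {
    "./scripts/sales_etl_script.py": "scripts/sales_etl_script.py",
    "./scripts/forecast_gold_etl_script.py": "scripts/forecast_gold_etl_script.py",
    "./scripts/user_interaction_etl_script.py": "scripts/user_interaction_etl_script.py",
    "./scripts/inventory_forecaster.py": "scripts/inventory_forecaster.py",
    "./dataset/events_with_metadata.csv": "dataset/events_with_metadata.csv",
}

for local_path, key in uploads.items():
    if Path(local_path).exists():
        s3.upload_file(local_path, assets_bucket, key)
        print(f"Uploaded {local_path} -> s3://{assets_bucket}/{key}")
    else:
        print(f"Skipping {local_path} (not found)")
```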
### Define Glue Jobs & Crawlers in CDK
Next, we'll define the full CDK logic in `lib/constructs/analytics/glue-resources.ts` to create:
* A Glue job role with required permissions
* The three ETL jobs with their respective scripts
* Four crawlers with S3 targets pointing to Bronze, Silver, Forecast, and Recommendation zones
Open the file and add the following code:
```typescript
import { Construct } from "constructs";
import * as cdk from "aws-cdk-lib";
import { Bucket } from "aws-cdk-lib/aws-s3";
import { CfnCrawler, CfnJob, CfnDatabase } from "aws-cdk-lib/aws-glue";
import {
Role,
ServicePrincipal,
ManagedPolicy,
PolicyStatement,
} from "aws-cdk-lib/aws-iam";
interface GlueProps {
bronzeBucket: Bucket;
silverBucket: Bucket;
goldBucket: Bucket;
dataAssetsBucket: Bucket;
}
export class GlueResources extends Construct {
constructor(scope: Construct, id: string, props: GlueProps) {
super(scope, id);
const { bronzeBucket, silverBucket, goldBucket, dataAssetsBucket } = props;
// Glue Database
const glueDatabase = new CfnDatabase(this, "SalesDatabase", {
catalogId: cdk.Stack.of(this).account,
databaseInput: {
name: "sales_data_db",
},
});
// Create IAM Role for Glue
const glueRole = new Role(this, "GlueServiceRole", {
assumedBy: new ServicePrincipal("glue.amazonaws.com"),
});
bronzeBucket.grantRead(glueRole);
silverBucket.grantReadWrite(glueRole);
goldBucket.grantReadWrite(glueRole);
glueRole.addToPolicy(
new PolicyStatement({
actions: ["s3:GetObject"],
resources: [`${dataAssetsBucket.bucketArn}/*`],
})
);
glueRole.addManagedPolicy(
ManagedPolicy.fromAwsManagedPolicyName("service-role/AWSGlueServiceRole")
);
// Glue Crawler (for Bronze Bucket)
new CfnCrawler(this, "DataCrawlerBronze", {
name: "DataCrawlerBronze",
role: glueRole.roleArn,
databaseName: glueDatabase.ref,
targets: {
s3Targets: [{ path: bronzeBucket.s3UrlForObject() }],
},
tablePrefix: "bronze_",
});
// Glue ETL Job: clean raw Bronze data into Silver Parquet
new CfnJob(this, "DataCleaningETLJob", {
name: "DataCleaningETLJob",
role: glueRole.roleArn,
command: {
name: "glueetl",
pythonVersion: "3",
scriptLocation: dataAssetsBucket.s3UrlForObject(
"scripts/sales_etl_script.py"
),
},
defaultArguments: {
"--TempDir": silverBucket.s3UrlForObject("temp/"),
"--job-language": "python",
"--bronze_bucket": bronzeBucket.bucketName,
"--silver_bucket": silverBucket.bucketName,
},
glueVersion: "3.0",
maxRetries: 0,
timeout: 10,
workerType: "Standard",
numberOfWorkers: 2,
});
// Glue Crawler (for Silver Bucket)
new CfnCrawler(this, "DataCrawlerSilver", {
name: "DataCrawlerSilver",
role: glueRole.roleArn,
databaseName: glueDatabase.ref,
targets: {
s3Targets: [
{
path: `${silverBucket.s3UrlForObject()}/cleaned_data/`,
},
],
},
tablePrefix: "silver_",
});
// Glue Crawler (for Gold Bucket - forecast_ready)
new CfnCrawler(this, "DataCrawlerForecast", {
name: "DataCrawlerForecast",
role: glueRole.roleArn,
databaseName: glueDatabase.ref,
targets: {
s3Targets: [{ path: `${goldBucket.s3UrlForObject()}/forecast_ready/` }],
},
tablePrefix: "gold_",
});
// Glue Crawler (for Gold Bucket - recommendations_ready)
new CfnCrawler(this, "DataCrawlerRecommendations", {
name: "DataCrawlerRecommendations",
role: glueRole.roleArn,
databaseName: glueDatabase.ref,
targets: {
s3Targets: [
{ path: `${goldBucket.s3UrlForObject()}/recommendations_ready/` },
],
},
tablePrefix: "gold_",
});
// Glue ETL Job to output forecast ready dataset
new CfnJob(this, "ForecastGoldETLJob", {
name: "ForecastGoldETLJob",
role: glueRole.roleArn,
command: {
name: "glueetl",
pythonVersion: "3",
scriptLocation: dataAssetsBucket.s3UrlForObject(
"scripts/forecast_gold_etl_script.py"
),
},
defaultArguments: {
"--TempDir": silverBucket.s3UrlForObject("temp/"),
"--job-language": "python",
"--silver_bucket": silverBucket.bucketName,
"--gold_bucket": goldBucket.bucketName,
},
glueVersion: "3.0",
maxRetries: 0,
timeout: 10,
workerType: "Standard",
numberOfWorkers: 2,
});
// Glue ETL Job to output recommendation ready dataset
new CfnJob(this, "RecommendationGoldETLJob", {
name: "RecommendationGoldETLJob",
role: glueRole.roleArn,
command: {
name: "glueetl",
pythonVersion: "3",
scriptLocation: dataAssetsBucket.s3UrlForObject(
"scripts/user_interaction_etl_script.py"
),
},
defaultArguments: {
"--TempDir": silverBucket.s3UrlForObject("temp/"),
"--job-language": "python",
"--silver_bucket": silverBucket.bucketName,
"--gold_bucket": goldBucket.bucketName,
},
glueVersion: "3.0",
maxRetries: 0,
timeout: 10,
workerType: "Standard",
numberOfWorkers: 2,
});
}
}
```
Wire it up in the `retail-ai-insights-stack.ts` file:
```typescript
// Adjust the relative path if your stack file lives elsewhere
import { GlueResources } from "./constructs/analytics/glue-resources";

/**
 * Glue ETL Resources
 */
// bronzeBucket, silverBucket, goldBucket, and dataAssetsBucket come from Phase 01
new GlueResources(this, "GlueResources", {
  bronzeBucket,
  silverBucket,
  goldBucket,
  dataAssetsBucket,
});
```
Once deployed via `cdk deploy`:
1. Navigate to AWS Glue > ETL Jobs – you should see the three jobs: `DataCleaningETLJob`, `ForecastGoldETLJob`, and `RecommendationGoldETLJob`.
2. Go to AWS Glue > Data Catalog > Crawlers – ensure the four crawlers exist: `DataCrawlerBronze`, `DataCrawlerSilver`, `DataCrawlerForecast`, and `DataCrawlerRecommendations` (a quick programmatic check is sketched below).
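If you prefer to verify from code instead of clicking through the console, a short boto3 sketch like this one compares what was deployed against the names defined in the CDK construct. It assumes your credentials and region point at the right account, and it only checks the first page of results:

```python
import boto3

glue = boto3.client("glue")

# First page of job and crawler names in this account/region
jobs = set(glue.list_jobs()["JobNames"])
crawlers = set(glue.list_crawlers()["CrawlerNames"])

expected_jobs = {"DataCleaningETLJob", "ForecastGoldETLJob", "RecommendationGoldETLJob"}
expected_crawlers = {
    "DataCrawlerBronze",
    "DataCrawlerSilver",
    "DataCrawlerForecast",
    "DataCrawlerRecommendations",
}

print("Missing jobs:", expected_jobs - jobs or "none")
print("Missing crawlers:", expected_crawlers - crawlers or "none")
```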
## Step 2 - Run Glue Jobs to Transform Raw Data
Now that our Glue jobs and crawlers are deployed, let’s walk through how we run the ETL flow across the Bronze, Silver, and Gold zones.
#### Locate Raw Data in Bronze Bucket
1. Go to the Amazon S3 Console and open the `retail-ai-bronze-zone` bucket.
2. Drill down through the directories until you reach the raw data file and note the tree structure; in my case it's `dataset/2025/05/26/20`.
3. Copy this full prefix path.
#### Update the ETL Script Input Path
Open `sales_etl_script.py` in VS Code.
On line 36, update the `input_path` variable to reflect the directory path you just copied:
```python
input_path = f"s3://{bronze_bucket}/dataset/2025/05/26/20/"
```
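For orientation, here's a minimal sketch of what a cleaning script like `sales_etl_script.py` can look like (line numbering in the real script will differ). The job argument names (`--bronze_bucket`, `--silver_bucket`) and the `cleaned_data/` output prefix come from the CDK job definition above; the specific columns and casts are illustrative, not the exact contents of the real script:

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Arguments supplied via defaultArguments in the CDK job definition
args = getResolvedOptions(sys.argv, ["bronze_bucket", "silver_bucket"])
bronze_bucket = args["bronze_bucket"]
silver_bucket = args["silver_bucket"]

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Point this at the Firehose prefix you copied from the Bronze bucket
input_path = f"s3://{bronze_bucket}/dataset/2025/05/26/20/"
output_path = f"s3://{silver_bucket}/cleaned_data/"

# Read the raw JSON events written by Firehose
raw_df = spark.read.json(input_path)

# Illustrative cleaning: drop incomplete events and normalize types
cleaned_df = (
    raw_df.dropna(subset=["event_type", "user_id"])
    .withColumn("price", F.col("price").cast("double"))
    .withColumn("event_time", F.to_timestamp(F.col("timestamp")))
)

# Write analysis-ready Parquet to the Silver zone
cleaned_df.write.mode("overwrite").parquet(output_path)
```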
Re-upload the modified script to your S3 data-assets bucket:
```bash
aws s3 cp ./scripts/sales_etl_script.py s3://retail-ai-zone-assets/scripts/
```
Because versioning is enabled on the bucket, this will replace the previous file while preserving version history.
#### Run the ETL Jobs
Now let’s kick off the transformation pipeline:
Run `DataCleaningETLJob`
* Go to AWS Glue Console > ETL Jobs.
* Select the `DataCleaningETLJob` and click Run Job.
* This job will:
* Read raw JSON data from the Bronze bucket.
* Clean, cast, and convert it to Parquet.
* Store the results in the `retail-ai-silver-zone` bucket under `cleaned_data/`
Once successful, navigate to the `retail-ai-silver-zone` bucket and confirm that the Parquet files have been written under `cleaned_data/`.
Run `ForecastGoldETLJob`
* Go to AWS Glue Console > ETL Jobs.
* Select the `ForecastGoldETLJob` and click Run Job.
* This job will:
* Read the cleaned data from `retail-ai-silver-zone/cleaned_data/`
* Aggregate daily sales
* Output the transformed data to `retail-ai-gold-zone/forecast_ready/`
Once completed, visit the Gold bucket and confirm the forecast files are present in that directory.
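For reference, a `forecast_gold_etl_script.py` along these lines would produce that output. The daily aggregation matches the job description above, but the filter on purchase events and the column names (`event_time`, `units_sold`, `revenue`) are assumptions about the dataset, not the author's exact script:

```python
import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession, functions as F

args = getResolvedOptions(sys.argv, ["silver_bucket", "gold_bucket"])
spark = SparkSession.builder.getOrCreate()

# Cleaned events produced by DataCleaningETLJob
cleaned = spark.read.parquet(f"s3://{args['silver_bucket']}/cleaned_data/")

# Aggregate purchases into daily, per-product sales features (illustrative columns)
daily_sales = (
    cleaned.filter(F.col("event_type") == "purchase")
    .withColumn("sale_date", F.to_date("event_time"))
    .groupBy("sale_date", "product_name")
    .agg(
        F.count("*").alias("units_sold"),
        F.sum("price").alias("revenue"),
    )
)

daily_sales.write.mode("overwrite").parquet(f"s3://{args['gold_bucket']}/forecast_ready/")
```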
Run `RecommendationGoldETLJob`
* Go to AWS Glue Console > ETL Jobs.
* Select the `RecommendationGoldETLJob` and click Run Job.
* This job will:
* Read cleaned product data from the Silver zone
* Output only the required item metadata in CSV format
* Save to `retail-ai-gold-zone/recommendations_ready/`
After the job runs successfully, go to the Gold bucket and verify the structure and CSV file.
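As a rough picture of what `user_interaction_etl_script.py` does, the sketch below selects item metadata from the cleaned data and writes a single CSV with a header. Amazon Personalize items datasets key on an `ITEM_ID` column, but the source column names used here (`product_id`, `category`) are assumptions about the schema:

```python
import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

args = getResolvedOptions(sys.argv, ["silver_bucket", "gold_bucket"])
spark = SparkSession.builder.getOrCreate()

cleaned = spark.read.parquet(f"s3://{args['silver_bucket']}/cleaned_data/")

# Keep only the item metadata Personalize needs; the column mapping is illustrative
items = (
    cleaned.select(
        cleaned["product_id"].alias("ITEM_ID"),
        cleaned["product_name"].alias("PRODUCT_NAME"),
        cleaned["category"].alias("CATEGORY"),
        cleaned["price"].alias("PRICE"),
    )
    .dropDuplicates(["ITEM_ID"])
)

# coalesce(1) produces a single CSV part file under recommendations_ready/
(
    items.coalesce(1)
    .write.mode("overwrite")
    .option("header", "true")
    .csv(f"s3://{args['gold_bucket']}/recommendations_ready/")
)
```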
### Run All Glue Crawlers
Once the Glue crawlers are deployed, you’ll see four of them listed in the Glue Console > Data Catalog > Crawlers:
1. Select all four crawlers.
2. Click Run.
3. Once completed, look at the "Table changes on the last run" column; each crawler should show "1 created". (If you prefer to run the crawlers programmatically, see the sketch below.)
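Here's a small boto3 loop that starts all four crawlers and waits for them to return to the READY state. It's a sketch: it assumes the crawler names from the CDK construct, polls every 15 seconds, and doesn't handle crawlers that are already running:

```python
import time

import boto3

glue = boto3.client("glue")
crawler_names = [
    "DataCrawlerBronze",
    "DataCrawlerSilver",
    "DataCrawlerForecast",
    "DataCrawlerRecommendations",
]

# Start every crawler
for name in crawler_names:
    glue.start_crawler(Name=name)

# Poll until all crawlers are back to READY (i.e. finished running)
while True:
    states = [glue.get_crawler(Name=n)["Crawler"]["State"] for n in crawler_names]
    print(dict(zip(crawler_names, states)))
    if all(state == "READY" for state in states):
        break
    time.sleep(15)
```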
#### Validate Table Creation
Navigate to Glue Console > Data Catalog > Databases > Tables. You should now see four new tables, each corresponding to a specific zone.
Each table has an automatically inferred schema, including columns like `user_id`, `event_type`, `timestamp`, `price`, `product_name`, and more.
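You can confirm the same thing with the Glue API. This sketch lists the tables in `sales_data_db` along with a few of the inferred columns (first page of results only):

```python
import boto3

glue = boto3.client("glue")

# Tables created by the four crawlers, with a sample of their inferred columns
tables = glue.get_tables(DatabaseName="sales_data_db")["TableList"]
for table in tables:
    columns = [col["Name"] for col in table["StorageDescriptor"]["Columns"]]
    print(f"{table['Name']}: {columns[:6]}")
```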
### Query with Amazon Athena
Now let’s run SQL queries against these tables:
Open the Amazon Athena Console.
If it's your first time, you'll see a pop-up asking you to set a query result location in S3.
Choose your `retail-ai-zone-assets` bucket and click Save.
#### Sample Athena Query
In the query editor, try running simple SQL queries:
```sql
SELECT * FROM sales_data_db.<TABLE_NAME>
```
Try this query on the `bronze_retail_ai_bronze_zone` table:
```sql
SELECT * FROM sales_data_db.bronze_retail_ai_bronze_zone
```
Try this query on the `silver_cleaned_data` table:
```sql
SELECT * FROM sales_data_db.silver_cleaned_data
```
Try this query on the `gold_forecast_ready` table:
```sql
SELECT * FROM sales_data_db.gold_forecast_ready
```
Try this query on the `gold_recommendations_ready` table:
```sql
SELECT * FROM sales_data_db.gold_recommendations_ready
```
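These sanity checks can also be automated with boto3 if you want to fold them into a pipeline later. This sketch runs one of the queries above and waits for it to finish; the `athena-results/` output prefix is a hypothetical location inside the assets bucket, so adjust it to whatever you configured in the console:

```python
import time

import boto3

athena = boto3.client("athena")

query = "SELECT * FROM sales_data_db.silver_cleaned_data LIMIT 10"
execution = athena.start_query_execution(
    QueryString=query,
    # Hypothetical results prefix inside the assets bucket
    ResultConfiguration={"OutputLocation": "s3://retail-ai-zone-assets/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Wait for the query to reach a terminal state
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(f"Fetched {len(rows)} rows (including the header row)")
else:
    print(f"Query ended in state {state}")
```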
## What You’ve Just Built
In this phase, you've gone beyond basic ETL. You’ve engineered a production-grade data lake with:
* Multi-zone architecture (Bronze, Silver, Gold)
* Automated ETL pipelines using AWS Glue
* Schema discovery and validation through Crawlers
* Interactive querying via Amazon Athena
All of this was built infrastructure-as-code first using AWS CDK, with a clean separation of storage, processing, and access layers, exactly how real-world cloud data platforms are designed.
But this isn’t just about organizing data. You’re now sitting on a foundation that’s:
* AI-ready
* Model-friendly
* Cost-efficient
* And built for scale
### What’s Next?
In Phase 03, we’ll unlock this data’s real potential, using Amazon Bedrock to power AI-based demand forecasting, running nightly on an EC2 instance and storing predictions back into our pipeline.
You’ve built the rails, now it’s time to run intelligence through them.
## Complete Code for the Second Phase
To view the full code for the second phase, check out the repository on GitHub.
🚀 **Follow me on LinkedIn for more AWS content!**