From chaos to clarity: How artificial intelligence is transforming e-commerce catalogs
In e-commerce, engineers often discuss the big infrastructure topics: search architecture, real-time inventory management, personalization engines. But beneath the surface lurks a more insidious problem that plagues almost every online retailer: the normalization of product attributes. A chaotic product catalog with inconsistent values for size, color, material, or technical specifications sabotages everything that follows—filters become unreliable, search engines lose precision, and manual data cleaning consumes resources.
As a Full-Stack Engineer at Zoro, I dealt daily with this problem: How to bring order to 3+ million SKUs, each with dozens of attributes? The answer was not in a black-box AI but in an intelligent hybrid system that combines LLM reasoning with clear business rules and manual control mechanisms.
The Problem at Scale
Superficially, attribute inconsistencies seem harmless. Consider size indicators: “XL”, “Small”, “12cm”, “Large”, “M”, “S”—all describe the same attribute, yet nothing is standardized, and letter sizes sit next to centimeter measurements. Colors are similar: “RAL 3020”, “Crimson”, “Red”, “Dark Red”—some follow color standards (RAL 3020 is a standardized red), others are fanciful names.
Multiply this chaos across millions of products, and the impact becomes dramatic:
Customers see chaotic filters and give up on searching
Search engines cannot rank products correctly
Analytics show false trends
Merchandising teams drown in manual data cleaning
The Strategic Approach: Hybrid AI with Rules
My goal was not a mysterious AI system performing black magic. Instead, I wanted a system that is:
Explainable – you understand why a decision was made
Predictable – no surprises or anomalies
Scalable – across millions of attributes
Human-controllable – business teams can intervene
The result was a pipeline that combines LLM intelligence with clear rules and business oversight. AI with guardrails, not AI without limits.
Why Offline Processing Instead of Real-Time?
The first architectural decision was fundamental: all attribute processing runs in asynchronous background jobs, not in real-time. This may sound like a compromise, but it was a strategic choice with enormous benefits:
Real-time pipelines would cause:
Unpredictable latency on product pages
Fragile dependencies between systems
Cost spikes during traffic surges
Direct impact on customer experience
Offline jobs instead offered:
High throughput: massive batches without affecting live systems
Robustness: processing errors never impact customers
Cost control: calculations during low-traffic times
Isolation: LLM latency isolated from user-facing services
Atomic updates: consistent changes or none at all
Separating customer-facing systems from data processing is essential when working with this volume of data.
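The offline job pattern described above—large batches, atomic writes, failures isolated from customers—can be sketched roughly like this (illustrative only; `process_batch` and `commit` stand in for the real LLM/rules pipeline and database write):

```python
def run_offline_job(attribute_ids, process_batch, commit, batch_size=500):
    """Process attributes in fixed-size batches, fully decoupled from live traffic.

    Each batch is computed completely before commit() writes it, so a failed
    batch leaves the live data untouched (atomic per batch), and failures are
    collected for retry instead of surfacing to customers.
    """
    failed = []
    for start in range(0, len(attribute_ids), batch_size):
        batch = attribute_ids[start:start + batch_size]
        try:
            results = process_batch(batch)   # expensive LLM / rules work
            commit(results)                  # all-or-nothing write per batch
        except Exception:
            failed.append(batch)             # queue for retry; customers unaffected
    return failed
```

The key design choice is that `commit` only ever sees a fully computed batch—a crash mid-computation leaves no half-written state behind.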
The Processing Pipeline
The process unfolded in several phases:
Phase 1: Data Cleaning
Before AI was even involved, data went through a preprocessing step:
Trim whitespace
Remove empty values
Remove duplicates
Convert category context into structured strings
This seemingly trivial step dramatically improved LLM accuracy. The principle: garbage in, garbage out. At this scale, even small errors later cause big problems.
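The preprocessing steps above can be sketched as a small pure function (a minimal sketch; the function name and category format are illustrative, not the actual production code):

```python
def clean_values(raw_values, category_path):
    """Preprocess raw attribute values before any LLM call.

    - trims surrounding whitespace
    - drops empty values
    - removes duplicates while preserving the original order
    - converts the category context into a single structured string
    """
    seen = set()
    cleaned = []
    for value in raw_values:
        value = value.strip()
        if not value or value in seen:
            continue
        seen.add(value)
        cleaned.append(value)
    # Category context as a structured string, e.g. "Tools > Hand Tools > Wrenches"
    context = " > ".join(part.strip() for part in category_path if part.strip())
    return cleaned, context
```

For example, `clean_values([" XL", "", "Small", "XL "], ["Apparel", "Gloves"])` yields `(["XL", "Small"], "Apparel > Gloves")`.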
Phase 2: AI Reasoning with Context
The LLM didn’t just sort alphabetically. It reasoned about the values. The service received the attribute name, the cleaned list of values, and the category context as a structured string, so the model could understand what each value represented and return a semantically correct order.
Phase 3: Deterministic Fallbacks
Not every attribute needs AI. Many are better handled with clear logic—numeric values like lengths or weights can simply be parsed and sorted. The pipeline automatically recognized these cases and applied deterministic logic. This saved costs and guaranteed consistency.
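A deterministic numeric fallback of this kind can be sketched as follows (illustrative; real unit handling and mixed-unit conversion would be more involved):

```python
import re

# Matches values like "12cm", "2.5 mm", "20kg": a number followed by an optional unit
_NUMERIC_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]*)\s*$")

def is_numeric_attribute(values):
    """An attribute qualifies for the deterministic path only if
    every single value parses as number-plus-unit."""
    return all(_NUMERIC_RE.match(v) for v in values)

def sort_numeric(values):
    """Sort by the numeric component; no LLM call required."""
    return sorted(values, key=lambda v: float(_NUMERIC_RE.match(v).group(1)))
```

With this, `sort_numeric(["5cm", "12cm", "2cm", "20cm"])` returns `["2cm", "5cm", "12cm", "20cm"]`, matching the numeric row in the results table below, while a mixed list like `["XL", "12cm"]` fails the check and falls through to the LLM path.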
Phase 4: Merchant Control
Business-critical attributes required manual review checkpoints. Therefore, each category could be tagged as:
LLM_SORT: The model decides
MANUAL_SORT: Merchants define the order
This dual system gave humans the final control. If the LLM made a mistake, merchants could override it—without stopping the pipeline.
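The dual tagging system amounts to a simple per-category dispatch (a sketch; the tag names come from the article, everything else is illustrative):

```python
def sort_attribute(values, category_tag, manual_order=None, llm_sort=None):
    """Route an attribute's values based on the category's sort tag.

    MANUAL_SORT: merchants define the order; values they haven't
                 ranked are pushed to the end.
    LLM_SORT:    delegate to the LLM-backed sorter.
    """
    if category_tag == "MANUAL_SORT":
        rank = {v: i for i, v in enumerate(manual_order or [])}
        return sorted(values, key=lambda v: rank.get(v, len(rank)))
    if category_tag == "LLM_SORT":
        return llm_sort(values)
    raise ValueError(f"unknown sort tag: {category_tag}")
```

Because the merchant override is just data (a tag plus an ordered list), flipping a category from `LLM_SORT` to `MANUAL_SORT` requires no code change and never stops the pipeline.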
Persistence and Downstream Systems
All results were stored directly in MongoDB—a single source of truth for:
Sorted attribute values
Refined attribute names
Category-level sort tags
Product-level sort order
From there, data flowed in two directions:
Elasticsearch: For keyword-based search, where clean attributes power filter menus
Vespa: For semantic and vector-based search, where consistency improves ranking
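A single document per category/attribute pair in MongoDB might look roughly like this (a hypothetical schema; the field names are my own, not the actual collection layout):

```python
# Hypothetical shape of one source-of-truth document per (category, attribute) pair
attribute_doc = {
    "category_id": "cat_12345",
    "attribute_name": "Size",          # refined attribute name
    "sort_tag": "LLM_SORT",            # or "MANUAL_SORT" for merchant-defined order
    "sorted_values": ["Small", "M", "Large", "XL", "12cm"],
}
```

Both downstream consumers read the same document: Elasticsearch uses `sorted_values` to order filter menus, while Vespa uses the consistent vocabulary to improve ranking.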
Filters now appear in logical order. Product pages show coherent specifications. Search engines rank products more accurately. Customers navigate categories without frustration.
Concrete Results
The pipeline transformed chaotic raw data into clean, usable outputs:
Attribute | Raw Data | Sorted Output
Size | XL, Small, 12cm, Large, M, S | Small, M, Large, XL, 12cm
Color | RAL 3020, Crimson, Red, Dark Red | Red, Dark Red, Crimson, RAL 3020
Material | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel
Numeric | 5cm, 12cm, 2cm, 20cm | 2cm, 5cm, 12cm, 20cm
This transformation was consistent across over 3 million SKUs.
Impact and Outcomes
The results extended far beyond technology:
Consistent attribute order across the entire catalog
Predictable behavior for numeric values via deterministic fallbacks
Business control through manual tagging system
Clean product pages with intuitive filters
Improved search relevance for customers
Increased trust and better conversion rates
Not just a technical victory—a business victory.
Key Takeaways
Hybrid pipelines outperform pure AI at scale. Guardrails are not a hindrance—they are a feature.
Context is everything: an LLM with category info and attribute metadata is 10x more accurate than one without.
Offline processing is essential: with this data volume, batch efficiency and fault tolerance matter more than real-time latency.
Human override builds trust: teams accept AI when they can control it.
Data hygiene is fundamental: cleaned inputs = reliable outputs. Always.
Conclusion
Normalizing attribute values sounds trivial—until you have to do it for millions of products. By combining LLM intelligence, clear rules, and human oversight, I turned a hidden, stubborn problem into a scalable system.
It’s a reminder: some of the biggest wins in e-commerce don’t come from flashy tech but from solving the boring problems—those that affect every product page.