From chaos to clarity: How artificial intelligence is transforming e-commerce catalogs

In e-commerce, engineers like to talk about the big infrastructure topics: search architecture, real-time inventory management, personalization engines. But beneath the surface lurks a more insidious problem that plagues almost every online retailer: normalizing product attributes. A chaotic product catalog with inconsistent values for size, color, material, or technical specifications sabotages everything that follows—filters become unreliable, search engines lose precision, and manual data cleaning consumes resources.

As a Full-Stack Engineer at Zoro, I dealt with this problem daily: how do you bring order to 3+ million SKUs, each with dozens of attributes? The answer was not a black-box AI but an intelligent hybrid system that combines LLM reasoning with clear business rules and manual control mechanisms.

The Problem at Scale

On the surface, attribute inconsistencies seem harmless. Consider size values: “XL”, “Small”, “12cm”, “Large”, “M”, “S”—all describe size, but none of them follow a common standard. Colors are similar: “RAL 3020”, “Crimson”, “Red”, “Dark Red”—some follow color standards (RAL 3020 is a standardized red), others are fanciful names.

Multiply this chaos across millions of products, and the impact becomes dramatic:

  • Customers see chaotic filters and give up on searching
  • Search engines cannot rank products correctly
  • Analytics show false trends
  • Merchandising teams drown in manual data cleaning

The Strategic Approach: Hybrid AI with Rules

My goal was not a mysterious AI system performing black magic. Instead, I wanted a system that was:

  • Explainable – you understand why a decision was made
  • Predictable – no surprises or anomalies
  • Scalable – across millions of attributes
  • Human-controllable – business teams can intervene

The result was a pipeline that combines LLM intelligence with clear rules and business oversight. AI with guardrails, not AI without limits.

Why Offline Processing Instead of Real-Time?

The first architectural decision was fundamental: all attribute processing runs in asynchronous background jobs, not in real-time. This may sound like a compromise, but it was a strategic choice with enormous benefits:

Real-time pipelines would cause:

  • Unpredictable latency on product pages
  • Fragile dependencies between systems
  • Cost spikes during traffic surges
  • Direct impact on customer experience

Offline jobs, by contrast, offered:

  • High throughput: massive batches without affecting live systems
  • Robustness: processing errors never impact customers
  • Cost control: calculations during low-traffic times
  • Isolation: LLM latency isolated from user-facing services
  • Atomic updates: consistent changes or none at all

Separating customer-facing systems from data processing is essential when working with this volume of data.
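
The article doesn't show the job orchestration itself, so here is a minimal sketch of one way to get the atomic-update property described above: a stage-and-swap pattern with pymongo. The database and collection names ("catalog", "attributes_staging", "attributes") are assumptions for illustration, not the production setup.

```python
# Minimal stage-and-swap sketch for atomic batch publication (assumed names).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["catalog"]

def run_batch(compute_documents):
    """Do the heavy offline work, then publish the whole result atomically."""
    db.drop_collection("attributes_staging")
    staging = db["attributes_staging"]
    staging.insert_many(compute_documents())  # hours of LLM/rule work, offline
    # renameCollection is atomic within a database: readers see the old batch
    # or the new one, never a half-written state.
    staging.rename("attributes", dropTarget=True)
```

Because the live collection is only touched by a single atomic rename, a failed batch leaves customers on the previous consistent snapshot, which is exactly the "consistent changes or none at all" guarantee.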

The Processing Pipeline

The process unfolded in several phases:

Phase 1: Data Cleaning

Before AI was even involved, data went through a preprocessing step:

  • Trim whitespace
  • Remove empty values
  • Remove duplicate values
  • Convert category context into structured strings

This seemingly trivial step dramatically improved LLM accuracy. The principle: garbage in, garbage out. At this scale, even small errors early in the pipeline cause big problems later.
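
As an illustration, a minimal version of this cleaning step might look like the following; the function names are hypothetical, not the production code.

```python
# Phase 1 sketch: trim, drop empties, deduplicate, build category context.

def preprocess(raw_values: list[str]) -> list[str]:
    """Clean attribute values while preserving first-seen order."""
    seen = set()
    cleaned = []
    for v in raw_values:
        v = v.strip()                 # trim whitespace
        if not v:
            continue                  # remove empty values
        key = v.casefold()            # treat "Red" and "red" as duplicates
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(v)
    return cleaned

def category_context(breadcrumbs: list[str]) -> str:
    """Convert category context into a structured string for the prompt."""
    return " > ".join(breadcrumbs)

print(preprocess([" XL", "", "Small", "xl", "12cm"]))  # ['XL', 'Small', '12cm']
print(category_context(["Power Tools", "Drills"]))     # Power Tools > Drills
```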

Phase 2: AI Reasoning with Context

The LLM didn’t just sort alphabetically. It reasoned about the values. The service received:

  • Cleaned attribute values
  • Category breadcrumbs (e.g., “Power Tools > Drills”)
  • Attribute metadata

With this context, the model could understand:

  • That “Voltage” in power tools should be sorted numerically
  • That “Size” follows a known progression (S, M, L, XL)
  • That “Color” may follow standards like RAL 3020
  • That “Material” has semantic relationships (Steel > Stainless Steel > Carbon Steel)

The model returned (see the sketch after this list):

  • Ordered attribute values
  • Refined attribute names
  • A classification: should this be sorted deterministically or contextually?
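
Here is a hedged sketch of that request/response contract, with a guardrail in the spirit of the article. The prompt wording, JSON keys, and the `call_llm` hook are assumptions; only the inputs and outputs mirror the lists above.

```python
# Phase 2 sketch: LLM sorting with context and a reorder-only guardrail.
import json

def build_prompt(attribute: str, values: list[str], breadcrumbs: str) -> str:
    return (
        "You normalize e-commerce attribute values.\n"
        f"Category: {breadcrumbs}\n"
        f"Attribute: {attribute}\n"
        f"Values: {json.dumps(values)}\n"
        'Reply with JSON: {"sorted_values": [...], "refined_name": "...", '
        '"classification": "DETERMINISTIC" | "CONTEXTUAL"}'
    )

def sort_with_llm(attribute, values, breadcrumbs, call_llm):
    """Ask the model to order the values; reject answers that invent or drop any."""
    result = json.loads(call_llm(build_prompt(attribute, values, breadcrumbs)))
    if sorted(result["sorted_values"]) != sorted(values):
        # Guardrail: the model may only reorder. Anything else falls back
        # to the input order for review instead of corrupting the catalog.
        return {"sorted_values": values, "refined_name": attribute,
                "classification": "CONTEXTUAL"}
    return result
```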

Phase 3: Deterministic Fallbacks

Not every attribute needs AI. Many are better handled with clear logic:

  • Numeric ranges (2cm, 5cm, 12cm, 20cm → sorted ascending)
  • Unit-based values
  • Categorical collections

The pipeline automatically recognized these and applied deterministic logic. This saved costs and guaranteed consistency.
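
For example, the numeric fallback can be pure string parsing with no LLM call at all. The regex and unit list below are illustrative assumptions, and the check assumes one unit per attribute.

```python
# Phase 3 sketch: detect unit-suffixed values and sort them by magnitude.
import re

NUMERIC = re.compile(r"^(\d+(?:\.\d+)?)\s*(cm|mm|m|kg|g|v|w)$", re.IGNORECASE)

def is_deterministic(values: list[str]) -> bool:
    """True when every value is a number with a recognized unit suffix."""
    return all(NUMERIC.match(v) for v in values)

def sort_deterministically(values: list[str]) -> list[str]:
    # Caller is expected to check is_deterministic() first, so every
    # value matches and group(1) is the numeric part.
    return sorted(values, key=lambda v: float(NUMERIC.match(v).group(1)))

print(sort_deterministically(["5cm", "12cm", "2cm", "20cm"]))
# ['2cm', '5cm', '12cm', '20cm']
```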

Phase 4: Merchant Control

Business-critical attributes required manual review checkpoints. Therefore, each category could be tagged as:

  • LLM_SORT: The model decides
  • MANUAL_SORT: Merchants define the order

This dual system gave humans the final control. If the LLM made a mistake, merchants could override it—without stopping the pipeline.
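
A small sketch of how that checkpoint might dispatch: the LLM_SORT and MANUAL_SORT tags come from the article, while the lookup table and function signatures are hypothetical.

```python
# Phase 4 sketch: merchant-defined orders take precedence over the LLM.

MANUAL_ORDERS = {
    ("Apparel", "Size"): ["XS", "S", "M", "L", "XL"],  # merchant-defined
}

def resolve_order(category: str, attribute: str, values: list[str],
                  sort_tag: str, llm_sort) -> list[str]:
    if sort_tag == "MANUAL_SORT":
        template = MANUAL_ORDERS.get((category, attribute), [])
        rank = {v: i for i, v in enumerate(template)}
        # Values missing from the template sink to the end instead of
        # stopping the pipeline.
        return sorted(values, key=lambda v: rank.get(v, len(template)))
    return llm_sort(category, attribute, values)       # LLM_SORT path
```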

Persistence and Downstream Systems

All results were stored directly in MongoDB—a single source of truth for the following (example write sketched below):

  • Sorted attribute values
  • Refined attribute names
  • Category-level sort tags
  • Product-level sort order
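
To make the shape concrete, here is a hedged example of a single stored document and an idempotent write. The field and collection names are assumptions, since the article only lists what was persisted, not the exact schema.

```python
# Persistence sketch: one attribute document, upserted so reruns are safe.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
doc = {
    "category": "Power Tools > Drills",
    "attribute": "Voltage",                 # refined attribute name
    "sorted_values": ["12V", "18V", "20V"],
    "sort_tag": "LLM_SORT",                 # category-level tag
}
client["catalog"]["attributes"].update_one(
    {"category": doc["category"], "attribute": doc["attribute"]},
    {"$set": doc},
    upsert=True,                            # idempotent writes keep reruns safe
)
```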

From there, data flowed in two directions (see the indexing sketch after this list):

  • Elasticsearch: For keyword-based search, where clean attributes power filter menus
  • Vespa: For semantic and vector-based search, where consistency improves ranking
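
A hedged sketch of the Elasticsearch side of that flow, using the official Python client's bulk helper; the index name, ID scheme, and document shape are assumptions.

```python
# Downstream sketch: bulk-index sorted attributes so filter menus render
# in the normalized order.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def sync_attributes(docs):
    """Bulk-index attribute documents keyed by category and attribute."""
    actions = (
        {
            "_index": "product-attributes",
            "_id": f"{d['category']}:{d['attribute']}",
            "_source": d,
        }
        for d in docs
    )
    helpers.bulk(es, actions)

sync_attributes([{
    "category": "Power Tools > Drills",
    "attribute": "Voltage",
    "sorted_values": ["12V", "18V", "20V"],
}])
```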

Filters now appear in logical order. Product pages show coherent specifications. Search engines rank products more accurately. Customers navigate categories without frustration.

Concrete Results

The pipeline transformed chaotic raw data into clean, usable outputs:

Attribute | Raw Data                                        | Sorted Output
----------|-------------------------------------------------|--------------------------------------
Size      | XL, Small, 12cm, Large, M, S                    | Small, M, Large, XL, 12cm
Color     | RAL 3020, Crimson, Red, Dark Red                | Red, Dark Red, Crimson, RAL 3020
Material  | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel
Numeric   | 5cm, 12cm, 2cm, 20cm                            | 2cm, 5cm, 12cm, 20cm

This transformation was consistent across over 3 million SKUs.

Impact and Outcomes

The results extended far beyond technology:

  • Consistent attribute order across the entire catalog
  • Predictable behavior for numeric values via deterministic fallbacks
  • Business control through manual tagging system
  • Clean product pages with intuitive filters
  • Improved search relevance for customers
  • Increased trust and better conversion rates

Not just a technical victory—a business victory.

Key Takeaways

  • Hybrid pipelines outperform pure AI at scale. Guardrails are not a hindrance—they are a feature.
  • Context is everything: an LLM with category info and attribute metadata is 10x more accurate than one without.
  • Offline processing is essential: with this data volume, batch efficiency and fault tolerance matter more than real-time latency.
  • Human override builds trust: teams accept AI when they can control it.
  • Data hygiene is fundamental: cleaned inputs = reliable outputs. Always.

Conclusion

Normalizing attribute values sounds trivial—until you have to do it consistently for millions of products. By combining LLM intelligence, clear rules, and human oversight, I turned a hidden, stubborn problem into a scalable system.

It’s a reminder: some of the biggest wins in e-commerce don’t come from flashy tech but from solving the boring problems—those that affect every product page.
