E-Commerce at Scale: How AI Enforces Consistent Product Attributes Across Millions of SKUs

Scaling an e-commerce platform requires solutions to well-known problems such as distributed search, real-time inventory management, and recommendation engines. But beneath the surface lurks a persistent, often underestimated problem that plagues almost every online retailer: the management and normalization of attribute values. The challenge seems trivial at first, but scaled across several million products with dozens of attributes each, it reveals significant complications.

The Hidden Problem in Product Data Quality

Product attributes serve as the foundation of product discovery. They control filter functions, comparison features, search relevance, and personalized recommendations. In real catalogs, however, attribute values are rarely in optimal form: they exhibit inconsistencies, contain duplicates, have formatting errors, or are semantically ambiguous.

Let’s consider concrete examples:

Size values might arrive mixed as “XL”, “Small”, “12cm”, “Large”, “M”, “S”. Colors are listed just as chaotically: “RAL 3020”, “Crimson”, “Red”, “Dark Red”. Individually, these deviations seem harmless. Multiplied across 3 million SKUs, each with dozens of attributes, the problem becomes structurally critical.

The consequences are immediately felt: filters behave unpredictably, search engines lose precision, manual cleanup processes require immense resources, and product discovery becomes slower and more frustrating for users.

Architectural Approach: Hybrid AI with Strict Control

The solution was not to introduce a black-box AI making opaque decisions. Such systems are hard to interpret, complex to debug, and prone to uncontrolled error propagation. Instead, a hybrid pipeline was designed that:

  • Remains explainable – every decision is traceable
  • Works predictably – no arbitrary variations
  • Is scalable – processes millions of documents
  • Is controllable by humans – control mechanisms are built in

The result was a hybrid architecture combining the contextual reasoning of large language models with deterministic rules and merchant controls: AI with guardrails, not uncontrolled AI.
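To make the dispatch concrete, here is a minimal sketch of that guardrail precedence: merchant override first, deterministic rule second, LLM last. The helpers `deterministic_sort` and `llm_sort` are illustrative stubs, not the actual services:

```python
def deterministic_sort(values):
    """Return an ordering when a simple rule applies, else None."""
    try:
        return sorted(values, key=float)   # e.g. plain numeric sets
    except ValueError:
        return None

def llm_sort(values):
    """Stub standing in for the LLM sorting service."""
    return values

def sort_attribute(values, tag, manual_order=None):
    # Precedence: merchant override > deterministic rule > LLM reasoning.
    if tag == "MANUAL_SORT":
        return manual_order or values      # humans keep final control
    ordered = deterministic_sort(values)
    if ordered is not None:
        return ordered                     # cheap path, no LLM call
    return llm_sort(values)                # contextual reasoning

print(sort_attribute(["5", "12", "2"], tag="LLM_SORT"))  # -> ['2', '5', '12']
```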

Why Offline Processing Was the Right Choice

All attribute normalization is performed not in real-time but in asynchronous background jobs. This was not a compromise but a deliberate architectural decision with significant advantages:

Advantages of batch processing:

  • High throughput: Massive data volumes are processed without burdening live systems
  • Resilience: Failures never impact customer traffic
  • Cost optimization: Calculations run during low-traffic periods
  • System isolation: LLM latency does not affect product pages
  • Determinism: Updates are atomic and reproducible

In contrast, real-time processing would lead to unpredictable latency, fragile dependencies, costly computations, and operational instability. Isolating customer-facing systems from data pipelines is essential at scale.
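A nightly batch runner in this spirit might look like the following sketch; the function names and chunking strategy are assumptions, not the production job:

```python
# Hypothetical nightly batch runner: nothing here sits in the request path,
# so a failure can never reach customer traffic.
from itertools import islice

def chunks(iterable, size):
    """Yield fixed-size batches from any iterable."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

def run_nightly_job(category_ids, process_category, batch_size=100):
    failures = []
    for batch in chunks(category_ids, batch_size):
        for category_id in batch:
            try:
                process_category(category_id)   # normalize + persist, idempotent
            except Exception as exc:
                failures.append((category_id, exc))  # retried on the next run
    return failures
```

Scheduled from cron or an orchestrator during low-traffic hours, a job like this keeps LLM latency and cost entirely out of the serving path.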

Data Persistence as a Stability Guarantee

A critical aspect of the architecture was thoughtful data persistence. All normalized results are stored directly in a centralized Product MongoDB. This persistence strategy serves multiple functions:

  • Operational transparency: Changes are auditable and traceable
  • Flexibility: Values can be manually overridden or categories reprocessed
  • System integration: Easy synchronization with other services
  • Auditability: Complete audit trail for business-critical processes

MongoDB became the central storage for sorted attribute values, refined attribute names, category-specific sort tags, and product-related sortOrder fields. This persistence strategy ensured consistency and stability across the entire ecosystem.
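A sketch of that persistence step with pymongo is below; the field names (sortedValues, sortTag, updatedAt) mirror the article's vocabulary, but the exact schema is an assumption:

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
attribute_sort = client["product"]["attribute_sort"]

def persist_sorted_values(category_id, attribute, sorted_values, sort_tag):
    attribute_sort.update_one(
        {"categoryId": category_id, "attribute": attribute},
        {"$set": {
            "sortedValues": sorted_values,
            "sortTag": sort_tag,                      # LLM_SORT or MANUAL_SORT
            "updatedAt": datetime.now(timezone.utc),  # audit-trail timestamp
        }},
        upsert=True,  # one document per category/attribute pair, updated atomically
    )
```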

The Technical Processing Workflow

Before applying AI, a rigorous preprocessing step reduces noise:

  • Trim whitespace
  • Eliminate empty values
  • Remove duplicate values
  • Standardize category contexts

This seemingly simple step significantly improves LLM accuracy. Garbage in, garbage out – with this data volume, even minor errors can escalate into larger problems later.
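A minimal sketch of this sanitization step, assuming case-insensitive deduplication that keeps the first spelling seen:

```python
def sanitize(values):
    seen = set()
    cleaned = []
    for raw in values:
        value = raw.strip()                # trim whitespace
        if not value:
            continue                       # eliminate empty values
        key = value.casefold()
        if key in seen:
            continue                       # remove duplicates
        seen.add(key)
        cleaned.append(value)
    return cleaned

print(sanitize([" XL ", "", "xl", "Small", "12cm "]))  # -> ['XL', 'Small', '12cm']
```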

The LLM service then receives cleaned input with context:

  • Sanitized attribute values
  • Category hierarchy information
  • Metadata about attribute type

With this context, the model recognizes:

  • That “Spannung” (voltage) in power tools should be sorted numerically
  • That “Size” in clothing follows known progressions
  • That “Color” may need to consider RAL standards
  • That “Material” has semantic relationships

The model returns: ordered values, refined attribute names, and a classification (deterministic vs. contextual).
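Sketched as data, the contract might look like this; the field names are assumptions for illustration, not the actual service API:

```python
# Illustrative request/response shapes for the LLM sorting service.
request = {
    "attribute": "Size",
    "values": ["XL", "Small", "12cm", "Large", "M", "S"],  # sanitized input
    "categoryPath": ["Apparel", "Men", "Shirts"],          # hierarchy context
    "attributeType": "string",                             # type metadata
}

response = {
    "sortedValues": ["Small", "M", "Large", "XL", "12cm"],  # ordered output
    "refinedName": "Size",                                  # cleaned-up label
    "classification": "contextual",  # "deterministic" lets the pipeline skip the LLM next time
}
```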

Deterministic Fallbacks for Efficiency

Not every attribute requires AI reasoning. Numeric ranges, unit-based values, and simple ordered sets can be handled deterministically, which brings:

  • Faster processing
  • Predictable sorting
  • Lower costs
  • Eliminated ambiguity

The pipeline automatically detects such cases and applies deterministic logic—efficient resource use without unnecessary LLM calls.
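One way such a fallback could look, assuming a small unit-conversion table and a regex for magnitude plus unit (both illustrative):

```python
import re

TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}
VALUE = re.compile(r"^(\d+(?:[.,]\d+)?)\s*(mm|cm|m)$", re.IGNORECASE)

def try_deterministic_sort(values):
    parsed = []
    for v in values:
        match = VALUE.match(v.strip())
        if not match:
            return None  # not a pure unit set, escalate to the LLM path
        magnitude = float(match.group(1).replace(",", "."))
        parsed.append((magnitude * TO_MM[match.group(2).lower()], v))
    return [v for _, v in sorted(parsed)]  # sort by base-unit magnitude

print(try_deterministic_sort(["5cm", "12cm", "2cm", "20cm"]))
# -> ['2cm', '5cm', '12cm', '20cm']
```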

Human Control via Tagging System

Merchants need override options, especially for critical attributes. Therefore, each category can be tagged as:

  • LLM_SORT: Model makes the decision
  • MANUAL_SORT: Merchant defines the order manually

This dual tagging system builds trust: humans retain final control while AI handles the bulk workload.
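The tag values below come from the article; the resolution function around them is an illustrative assumption:

```python
from enum import Enum

class SortTag(Enum):
    LLM_SORT = "LLM_SORT"        # the model decides the order
    MANUAL_SORT = "MANUAL_SORT"  # the merchant-defined order is authoritative

def effective_order(tag, llm_order, manual_order):
    # A manual tag always wins, so a merchant can correct the model
    # without redeploying or waiting for the next batch run.
    return manual_order if tag is SortTag.MANUAL_SORT else llm_order

print(effective_order(SortTag.MANUAL_SORT, ["S", "M", "L"], ["L", "M", "S"]))
# -> ['L', 'M', 'S']
```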

Search Integration as a Validation Point

After normalization, sorted values flow into specialized search systems:

  • Elasticsearch for keyword-based search
  • Vespa for semantic and vector-based search

This ensures that:

  • Filters appear in logical order
  • Product pages display consistent attributes
  • Search engines rank products more accurately
  • Customers browse categories more intuitively

Search integration was where attribute consistency became most visible and critical.
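A sketch of the Elasticsearch side of this sync using the official Python client; the index name and document shape are assumptions, and a similar feed job would target Vespa's document API:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def sync_filter_order(category_id, attribute, sorted_values):
    es.index(
        index="attribute-sort",
        id=f"{category_id}:{attribute}",    # deterministic ID keeps the sync idempotent
        document={
            "categoryId": category_id,
            "attribute": attribute,
            "sortedValues": sorted_values,  # filters render in exactly this order
        },
    )
```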

System Architecture Overview

The entire system follows this flow:

  1. Product data arrives from the product information system
  2. Attribute extraction job pulls values and category context
  3. AI sorting service performs intelligent reasoning
  4. Updated documents are persisted in Product MongoDB
  5. Outbound sync job updates the PIM with new sort orders
  6. Elasticsearch & Vespa sync jobs transfer normalized data
  7. API services connect search systems with client applications

This end-to-end flow ensures that every attribute value, whether AI-sorted or manually defined, is reflected in search, merchandising, and customer interactions.
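A glue-level sketch of steps 1-7; every helper is a stub standing in for the jobs described above, not a real API:

```python
def extract_attributes(category_id):                # step 2: values + category context
    return {"Size": ["XL", "Small", "M"]}

def ai_sort(attribute, values):                     # step 3: hybrid sorting service
    return values, "LLM_SORT"

def persist(category_id, attribute, ordered, tag):  # step 4: Product MongoDB
    print(f"persisted {category_id}/{attribute}: {ordered} ({tag})")

def sync_pim(category_id): pass                     # step 5: outbound PIM sync
def sync_search(category_id): pass                  # step 6: Elasticsearch & Vespa

def process_category(category_id):                  # one category, end to end
    for attribute, raw in extract_attributes(category_id).items():
        ordered, tag = ai_sort(attribute, raw)
        persist(category_id, attribute, ordered, tag)
    sync_pim(category_id)
    sync_search(category_id)

process_category("power-tools")
```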

Practical Transformation Results

The pipeline transformed chaotic raw values into consistent output:

| Attribute | Raw Values | Normalized Output |
| --- | --- | --- |
| Size | XL, Small, 12cm, Large, M, S | Small, M, Large, XL, 12cm |
| Color | RAL 3020, Crimson, Red, Dark Red | Red, Dark Red, Crimson, RAL 3020 |
| Material | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel |
| Numeric | 5cm, 12cm, 2cm, 20cm | 2cm, 5cm, 12cm, 20cm |

These examples demonstrate how combining contextual AI reasoning with deterministic rules creates logical, understandable sequences.

Results and Business Impact

The solution delivered significant results:

  • Consistent attribute sorting across 3M+ SKUs
  • Predictable numeric order via deterministic fallbacks
  • Operational control through merchant tagging
  • Visual improvements on product pages with more intuitive filters
  • Increased search relevance and ranking accuracy
  • Greater customer trust and improved conversion rates

This was not just a technical achievement but an immediate business success.

Key Takeaways

  • Hybrid pipelines outperform pure AI: Guardrails and control are essential at scale
  • Context is king: Contextual inputs dramatically improve LLM accuracy
  • Offline jobs are indispensable: They provide throughput, resilience, and cost efficiency
  • Human overrides build trust: Operators accept systems they can control
  • Clean input is fundamental: Data quality is a prerequisite for reliable AI outputs
  • Persistence guarantees stability: Central data storage enables auditability and control

Conclusion

Attribute value normalization may seem simple, but scaling it to millions of products turns it into a real challenge. By combining LLM intelligence with deterministic rules, persistence guarantees, and merchant control, a complex, hidden problem was transformed into a scalable, maintainable system.

The greatest successes often do not come from solving obvious challenges but from tackling underestimated problems—those that are easy to overlook but appear on every product page. Attribute consistency is precisely such a problem.
