E-Commerce at Scale: How AI Enforces Consistent Product Attributes Across Millions of SKUs

The scaling of e-commerce platforms requires solutions for well-known problems such as distributed search, real-time inventory management, and recommendation engines. But beneath the surface lurks a persistent, often underestimated problem that plagues almost every online retailer: the management and normalization of attribute values. While this challenge initially seems trivial, it reveals significant complications once it spans several million products, each with dozens of attributes.
The Hidden Problem in Product Data Quality
Product attributes serve as the foundation of product discovery. They control filter functions, comparison features, search relevance, and personalized recommendations. In real catalogs, however, attribute values are rarely in optimal form: they exhibit inconsistencies, contain duplicates, have formatting errors, or are semantically ambiguous.
Let’s consider concrete examples:
Size values might appear mixed together as “XL”, “Small”, “12cm”, “Large”, “M”, “S”. Colors are listed chaotically: “RAL 3020”, “Crimson”, “Red”, “Dark Red”. Individually, these deviations seem harmless. But when multiplied across 3 million SKUs, each with dozens of attributes, the problem becomes structurally critical.
The consequences are immediately felt: filters behave unpredictably, search engines lose precision, manual cleanup processes require immense resources, and product discovery becomes slower and more frustrating for users.
Architectural Approach: Hybrid AI with Strict Control
The solution was not to introduce a black-box AI making opaque decisions. Such systems are hard to interpret, complex to debug, and prone to uncontrolled error propagation. Instead, a hybrid pipeline was designed that:
Remains explainable – every decision is traceable
Works predictably – no arbitrary variations
Is scalable – processes millions of documents
Is controllable by humans – control mechanisms are built in
The result was a hybrid architecture combining the contextual reasoning of large language models with deterministic rules and merchant controls. AI with guardrails, not AI without control.
Why Offline Processing Was the Right Choice
All attribute normalization is performed not in real time but in asynchronous background jobs. This was not a compromise but a deliberate architectural decision with significant advantages:
Advantages of batch processing:
High throughput: Massive data volumes are processed without burdening live systems
Resilience: Failures never impact customer traffic
Cost optimization: Calculations run during low-traffic periods
System isolation: LLM latency does not affect product pages
Determinism: Updates are atomic and reproducible
In contrast, real-time processing would lead to unpredictable latency, fragile dependencies, costly computations, and operational instability. Isolating customer-facing systems from data pipelines is essential at scale.
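To make the job shape concrete, here is a minimal sketch of such an offline batch run; the function names and the category-by-category worker are assumptions for illustration, not the production implementation.

```python
# Minimal sketch of the offline normalization job; the caller supplies the
# category list and a normalize_category worker, both hypothetical here.
import logging
import time
from typing import Callable, Iterable

logger = logging.getLogger("attribute-normalization")

def run_offline_batch(categories: Iterable[str],
                      normalize_category: Callable[[str], None],
                      batch_size: int = 500,
                      pause_seconds: float = 1.0) -> None:
    """Process all categories in batches, fully isolated from customer traffic."""
    categories = list(categories)
    for start in range(0, len(categories), batch_size):
        for category_id in categories[start:start + batch_size]:
            try:
                normalize_category(category_id)   # preprocessing + sorting + persistence
            except Exception:
                # A failure is logged and retried in the next scheduled run;
                # it never propagates to live product pages.
                logger.exception("Normalization failed for category %s", category_id)
        time.sleep(pause_seconds)                 # gentle throttle on shared resources

# Typically triggered by a scheduler (cron, Airflow, etc.) during low-traffic hours.
```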
Data Persistence as a Stability Guarantee
A critical aspect of the architecture was thoughtful data persistence. All normalized results are stored directly in a centralized Product MongoDB. This persistence strategy served multiple functions:
Operational transparency: Changes are auditable and traceable
Flexibility: Values can be manually overridden or categories reprocessed
System integration: Easy synchronization with other services
Auditability: Complete audit trail for business-critical processes
MongoDB became the central storage for sorted attribute values, refined attribute names, category-specific sort tags, and product-related sortOrder fields. This persistence strategy ensured consistency and stability across the entire ecosystem.
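To illustrate, a single normalized result could be persisted with pymongo roughly as follows; the collection and field names (sortedValues, refinedName, sortTag, classification) are assumptions modeled on the fields described above.

```python
# Sketch of persisting one normalized attribute result (field names are assumptions).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
attributes = client["product_db"]["normalized_attributes"]

normalized_result = {
    "categoryId": "power-tools",
    "attribute": "Spannung",            # raw attribute name
    "refinedName": "Voltage (V)",       # refined attribute name from the pipeline
    "sortedValues": ["12 V", "18 V", "24 V", "36 V"],
    "sortTag": "LLM_SORT",              # category-specific sort tag (see tagging section)
    "classification": "deterministic",  # deterministic vs. contextual
}

# Upsert keeps reprocessing idempotent: rerunning a category overwrites the old result.
attributes.update_one(
    {"categoryId": normalized_result["categoryId"], "attribute": normalized_result["attribute"]},
    {"$set": normalized_result},
    upsert=True,
)
```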
The Technical Processing Workflow
Before applying AI, a rigorous preprocessing step reduces noise:
Trim whitespace
Eliminate empty values
Remove duplicates
Standardize category contexts
This seemingly simple step significantly improves LLM accuracy. Garbage in, garbage out – with this data volume, even minor errors can escalate into larger problems later.
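A minimal version of this preprocessing step might look like the sketch below; the function name and the way the category context is standardized are illustrative assumptions.

```python
def preprocess_values(raw_values: list[str], category_path: list[str]) -> dict:
    """Trim, drop empties, deduplicate (order-preserving), and standardize context."""
    cleaned: list[str] = []
    seen: set[str] = set()
    for value in raw_values:
        value = value.strip()                 # trim whitespace
        if not value:                         # eliminate empty values
            continue
        key = value.casefold()
        if key in seen:                       # remove duplicates
            continue
        seen.add(key)
        cleaned.append(value)
    # Standardize the category context into one canonical string.
    context = " > ".join(part.strip() for part in category_path if part.strip())
    return {"values": cleaned, "categoryContext": context}

# Example:
# preprocess_values(["  XL", "Small", "", "XL", "12cm "], ["Apparel", "Shirts"])
# -> {"values": ["XL", "Small", "12cm"], "categoryContext": "Apparel > Shirts"}
```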
The LLM service then receives cleaned input with context:
Sanitized attribute values
Category hierarchy information
Metadata about attribute type
With this context, the model recognizes:
That “Spannung” (voltage) in power tools should be sorted numerically
That “Size” in clothing follows known progressions
That “Color” may need to consider RAL standards
That “Material” has semantic relationships
The model returns: ordered values, refined attribute names, and a classification (deterministic vs. contextual).
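The exact service contract is not spelled out here, so the sketch below only assumes plausible shapes for the contextual input and the model's answer.

```python
# Sketch of the LLM service contract (shapes and field names are assumptions).
from dataclasses import dataclass

@dataclass
class SortRequest:
    attribute_name: str                 # e.g. "Spannung"
    values: list[str]                   # sanitized attribute values
    category_path: list[str]            # category hierarchy information
    attribute_type: str                 # metadata: "numeric", "enum", "free_text", ...

@dataclass
class SortResponse:
    ordered_values: list[str]           # values in the suggested display order
    refined_name: str                   # cleaned-up attribute name
    classification: str                 # "deterministic" or "contextual"

request = SortRequest(
    attribute_name="Size",
    values=["XL", "Small", "12cm", "Large", "M", "S"],
    category_path=["Apparel", "Shirts"],
    attribute_type="enum",
)
# The service serializes the request into a prompt, calls the model,
# and parses the reply back into a SortResponse.
```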
Deterministic Fallbacks for Efficiency
Not every attribute requires AI reasoning. Numeric ranges, unit-based values, and simple sets benefit from:
Faster processing
Predictable sorting
Lower costs
Eliminated ambiguity
The pipeline automatically detects such cases and applies deterministic logic—efficient resource use without unnecessary LLM calls.
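One possible shape of this deterministic fallback is sketched below: unit-based values are detected with a regular expression and sorted numerically, while anything else is handed back to the contextual path. The detection heuristic is an assumption, not the production rule set.

```python
import re

_UNIT_VALUE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([a-zA-Z%]+)?\s*$")

def try_deterministic_sort(values: list[str]) -> list[str] | None:
    """Sort purely numeric / unit-based values; return None if LLM reasoning is needed."""
    parsed = []
    for value in values:
        match = _UNIT_VALUE.match(value)
        if not match:
            return None                      # mixed or textual values -> contextual path
        number = float(match.group(1).replace(",", "."))
        parsed.append((number, value))
    return [original for _, original in sorted(parsed)]

# try_deterministic_sort(["5cm", "12cm", "2cm", "20cm"]) -> ["2cm", "5cm", "12cm", "20cm"]
# try_deterministic_sort(["Red", "Dark Red"])            -> None (falls through to the LLM)
```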
Human Control via Tagging System
Merchants need override options, especially for critical attributes. Therefore, each category can be tagged as:
LLM_SORT: Model makes the decision
MANUAL_SORT: Merchant defines the order manually
This dual tagging system builds trust: humans retain final control while AI handles the bulk workload.
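In code, this can be reduced to a small dispatch step. The tag names LLM_SORT and MANUAL_SORT come from the system itself; the function and parameter names below are illustrative.

```python
from typing import Callable

def resolve_sort_order(category_tag: str,
                       values: list[str],
                       manual_order: list[str] | None,
                       llm_sort: Callable[[list[str]], list[str]]) -> list[str]:
    """Apply the merchant-defined order (MANUAL_SORT) or delegate to the AI path (LLM_SORT)."""
    if category_tag == "MANUAL_SORT" and manual_order:
        # The merchant-defined order wins; values missing from it are appended at the end.
        known = [v for v in manual_order if v in values]
        unknown = [v for v in values if v not in manual_order]
        return known + unknown
    # LLM_SORT (default): the model, or a deterministic fallback, decides the order.
    return llm_sort(values)
```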
Search Integration as a Validation Point
After normalization, sorted values flow into specialized search systems:
Elasticsearch for keyword-based search
Vespa for semantic and vector-based search
This ensures that:
Filters appear in logical order
Product pages display consistent attributes
Search engines rank products more accurately
Customers browse categories more intuitively
Search integration was where attribute consistency became most visible and critical.
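As one possible integration, the sorted values can be pushed into Elasticsearch with the official Python client; the index name and document shape are assumptions, and the Vespa feed would follow the same pattern through its own client.

```python
# Sketch of syncing a normalized attribute into Elasticsearch (elasticsearch-py 8.x).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def sync_attribute_to_search(category_id: str, attribute: str, sorted_values: list[str]) -> None:
    """Index the normalized sort order so filters render in a logical sequence."""
    es.index(
        index="attribute-sort-orders",          # assumed index name
        id=f"{category_id}:{attribute}",
        document={
            "categoryId": category_id,
            "attribute": attribute,
            "sortedValues": sorted_values,      # order is preserved as a ranked list
        },
    )

# A parallel job would feed the same document into Vespa for semantic/vector search.
```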
System Architecture Overview
The entire system follows this flow:
Product data arrives from the product information system
Attribute extraction job pulls values and category context
AI sorting service performs intelligent reasoning
Updated documents are persisted in Product MongoDB
Outbound sync job updates the PIM with new sort orders
Elasticsearch & Vespa sync jobs transfer normalized data
API services connect search systems with client applications
This persistence strategy ensures that every attribute value—whether AI-sorted or manually defined—is reflected in search, merchandising, and customer interactions.
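Tying the steps together, the end-to-end flow can be sketched as a single orchestration function. Every helper referenced here stands in for a component described above and is purely a placeholder.

```python
from typing import Callable, Iterable, Tuple

def process_category(category_id: str,
                     extract_attributes: Callable[[str], Iterable[Tuple[str, list[str], dict]]],
                     try_deterministic_sort: Callable[[list[str]], list[str] | None],
                     llm_sort: Callable[[list[str], dict], list[str]],
                     persist_result: Callable[[str, str, list[str]], None],
                     sync_to_search: Callable[[str, str, list[str]], None]) -> None:
    """End-to-end flow for one category: extract, sort, persist, sync."""
    for attribute, raw_values, context in extract_attributes(category_id):
        ordered = try_deterministic_sort(raw_values)      # cheap deterministic path first
        if ordered is None:
            ordered = llm_sort(raw_values, context)       # contextual LLM reasoning
        persist_result(category_id, attribute, ordered)   # write to Product MongoDB
        sync_to_search(category_id, attribute, ordered)   # feed Elasticsearch & Vespa jobs
```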
Practical Transformation Results
The pipeline transformed chaotic raw values into consistent output:
| Attribute | Raw Values | Normalized Output |
| --- | --- | --- |
| Size | XL, Small, 12cm, Large, M, S | Small, M, Large, XL, 12cm |
| Color | RAL 3020, Crimson, Red, Dark Red | Red, Dark Red, Crimson, RAL 3020 |
| Material | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel |
| Numeric | 5cm, 12cm, 2cm, 20cm | 2cm, 5cm, 12cm, 20cm |
These examples demonstrate how combining contextual AI reasoning with deterministic rules creates logical, understandable sequences.
Results and Business Impact
The solution delivered significant results:
Consistent attribute sorting across 3M+ SKUs
Predictable numeric order via deterministic fallbacks
Operational control through merchant tagging
Visual improvements on product pages with more intuitive filters
Increased search relevance and ranking accuracy
Greater customer trust and improved conversion rates
This was not just a technical achievement but an immediate business success.
Key Takeaways
Hybrid pipelines outperform pure AI: Guardrails and control are essential at scale
Context is king: Contextual inputs dramatically improve LLM accuracy
Offline jobs are indispensable: They provide throughput, resilience, and cost efficiency
Human overrides build trust: Operators accept systems they can control
Clean input is fundamental: Data quality is a prerequisite for reliable AI outputs
Persistence guarantees stability: Central data storage enables auditability and control
Conclusion
Attribute value normalization may seem simple, but scaling it to millions of products turns it into a real challenge. By combining LLM intelligence with deterministic rules, persistence guarantees, and merchant control, a complex, hidden problem was transformed into a scalable, maintainable system.
The greatest successes often do not come from solving obvious challenges but from tackling underestimated problems—those that are easy to overlook but appear on every product page. Attribute consistency is precisely such a problem.