Behind The Scenes

Recommendation Architecture & Technical Methodology

A deep dive into exploratory data analysis, real-world data quality challenges, the 18-dimensional feature engineering pipeline, and the industrial-style two-stage hybrid architecture (ItemCF retrieval + LightGBM ranking).

1. Exploratory Data Analysis & Quality Audit (EDA)

Deep-dive audit based on 1,000,000 authentic transaction records from sales.csv

Dataset Scale

1,000,000 transaction logs covering 200 SKUs, 50,000 registered customers, and 100 store locations, generating total revenue of $25,486,128.86 at a 40.0% average gross margin.

Category & Brand Distribution

Praline ($6.66M, 26.15%) and White Chocolate ($6.07M, 23.82%) lead category revenue. Ferrero ($4.69M) and Cadbury ($4.67M) are the top two brands by revenue.

Store Channel Contributions

Airport Duty-Free stores contribute the most revenue ($7.61M, 29.9%), followed by Shopping Malls (26.0%), Online Shops (25.1%), and Retail Stores (19.1%).

📊 Real Data Exploratory Visuals

🍫 Category Revenue Distribution
  • Praline: $6.66M (26.15%)
  • White Chocolate: $6.07M (23.82%)
  • Dark Chocolate: $5.30M (20.79%)
  • Truffle: $3.92M (15.40%)
  • Milk Chocolate: $3.28M (12.87%)

🏬 Store Type Revenue Contribution
  • Airport Duty-Free: $7.61M (29.87%)
  • Shopping Mall: $6.63M (26.01%)
  • Online Shop: $6.39M (25.06%)
  • Retail Store: $4.86M (19.06%)

🛠️ Data Quality Issues & Engineering Governance Framework

1. Matrix Sparsity

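Verified via `eda.py`, the user-item interaction matrix (50,000 users × 200 SKUs) is 90.10% sparse. The 1,000,000 transactions fill only about 9.9% of its cells, roughly 20 distinct SKUs per user on average, and a filled cell rarely holds more than 1-2 purchase logs.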

Solution: Neural Embeddings + LightGBM GBDT fitting
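
The sparsity figure is one minus the share of user-SKU cells that contain at least one purchase. The sketch below shows that calculation on toy data; `eda.py` itself is not shown in this write-up, so the function name `interaction_sparsity` and the toy pairs are assumptions for illustration only.

```python
def interaction_sparsity(pairs, n_users, n_items):
    """Return the fraction of empty cells in the user-item matrix.

    `pairs` is any iterable of (customer_id, product_id) tuples. Repeat
    purchases of the same SKU by the same user count as one filled cell.
    """
    filled = len(set(pairs))
    return 1.0 - filled / (n_users * n_items)


# Toy data (hypothetical): 3 users x 4 SKUs, 5 transactions, 4 distinct pairs.
toy_pairs = [
    ("C1", "P1"), ("C1", "P1"),  # repeat purchase, still one cell
    ("C1", "P2"),
    ("C2", "P3"),
    ("C3", "P4"),
]
toy_sparsity = interaction_sparsity(toy_pairs, n_users=3, n_items=4)
```

On the real data the same function would be called with the `(customer_id, product_id)` pairs from sales.csv, `n_users=50_000`, and `n_items=200`.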

2. Cold-Start Guests

New guests and newly registered users have no purchase history in the database, so pure ItemCF has nothing to compute similarities from and fails at candidate retrieval.

Solution: Static demographic fallback profiles (Age 35, neutral preferences)
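
A minimal sketch of the fallback lookup, assuming profiles are kept in a dictionary keyed by customer ID. Only the default age of 35 comes from the text above; the `"Unknown"` gender, the `0` loyalty flag, and the names `DEFAULT_PROFILE` and `get_profile` are illustrative assumptions.

```python
# Static fallback profile. Age 35 is from the text; the other two values
# are assumed "neutral" placeholders.
DEFAULT_PROFILE = {"age": 35, "gender": "Unknown", "loyalty": 0}


def get_profile(customer_id, profiles):
    """Return (profile, is_fallback) for a customer.

    Unknown customers get a fresh copy of the static default so callers
    cannot mutate the shared template by accident.
    """
    profile = profiles.get(customer_id)
    if profile is None:
        return dict(DEFAULT_PROFILE), True
    return profile, False


# Hypothetical stored profiles for illustration.
profiles = {"C000001": {"age": 38, "gender": "Female", "loyalty": 1}}
guest_profile, guest_is_fallback = get_profile("GUEST", profiles)
```

Returning the `is_fallback` flag lets the caller decide how to source candidates for a user with no history, a step the text does not specify.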

3. OOV & Unknown Mapping

Inference input may contain brand or channel strings that were never seen during training, which triggers a lookup failure (a `KeyError` in Python) and can crash the online server.

Solution: Pre-built mapping dictionaries that send unseen values to 'Unknown'
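
One way such a mapping could look is sketched below, assuming integer label encoding with index 0 reserved for 'Unknown'. The helper names `build_encoder` and `encode` are hypothetical and not taken from the project code.

```python
def build_encoder(values, unknown="Unknown"):
    """Map each distinct training value to an integer, reserving 0 for unknown."""
    mapping = {unknown: 0}
    for index, value in enumerate(sorted(set(values) - {unknown}), start=1):
        mapping[value] = index
    return mapping


def encode(value, mapping, unknown="Unknown"):
    """Encode a value, falling back to the 'Unknown' index instead of raising."""
    return mapping.get(value, mapping[unknown])


brand_map = build_encoder(["Ferrero", "Cadbury", "Lindt", "Cadbury"])
seen_code = encode("Lindt", brand_map)
unseen_code = encode("BrandNotInTraining", brand_map)
```

Because `dict.get` supplies the fallback, an unseen string never raises at inference time.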

4. Continuous Feature Imputation

New product cocoa percentages (`cocoa_percent`) or user ages (`age`) might contain missing values, biasing GBDT splits.

Solution: Impute age with the median (`35`) and use a neutral `0.0` for cocoa and other product specs
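
The imputation rule can be sketched as follows, assuming each record is a plain dictionary. The fill values come from the text above; treating `weight_g` as one of the "specs" and the helper name `impute_row` are assumptions.

```python
import math

AGE_FILL = 35.0   # median age stated in the text
SPEC_FILL = 0.0   # neutral value stated in the text


def _is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_row(row):
    """Return a copy of `row` with missing age and spec fields filled in."""
    filled = dict(row)
    if _is_missing(filled.get("age")):
        filled["age"] = AGE_FILL
    for key in ("cocoa_percent", "weight_g"):  # weight_g is an assumed spec field
        if _is_missing(filled.get(key)):
            filled[key] = SPEC_FILL
    return filled


clean = impute_row({"age": None, "cocoa_percent": float("nan"), "weight_g": 100.0})
```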

5. Class Imbalance

Across 1M transactions, positive purchase labels (1) are extremely rare relative to the non-purchased candidates (0): with 200 SKUs, each purchase is paired with up to 199 negatives, an imbalance ratio of roughly 1:199.

Solution: Apply 1:4 heuristic negative sampling during offline training
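
A minimal sketch of 1:4 negative sampling, assuming negatives are drawn uniformly at random from SKUs the user has never bought. The text does not describe the exact heuristic, so uniform sampling is a stand-in, and `sample_negatives` is a hypothetical name.

```python
import random


def sample_negatives(user_positives, all_items, ratio=4, seed=0):
    """Build (user, item, label) rows with `ratio` negatives per positive.

    `user_positives` maps each user to the set of SKUs they purchased.
    Negatives are sampled uniformly from SKUs outside that set.
    """
    rng = random.Random(seed)
    items = sorted(all_items)
    rows = []
    for user in sorted(user_positives):
        positives = user_positives[user]
        pool = [item for item in items if item not in positives]
        for item in sorted(positives):
            rows.append((user, item, 1))
            for negative in rng.sample(pool, min(ratio, len(pool))):
                rows.append((user, negative, 0))
    return rows


# Toy catalog of 10 SKUs and one user with a single purchase.
catalog = [f"P{i:04d}" for i in range(1, 11)]
rows = sample_negatives({"C1": {"P0001"}}, catalog)
```

This brings the training set from roughly 1:199 down to 1:4, at the cost of discarding most negatives.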

6. Tensor Scaling & Multi-threading Lock

Feature scales differ widely (age 18-70 vs. weight 50-200 g), and OpenMP/MKL multi-threading is prone to deadlocks in cloud environments.

Solution: Min-max scaling, plus forcing single-threaded execution with `OMP_NUM_THREADS=1`
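
Both parts of the fix are small. The sketch below sets the environment variable before any numerical library is imported and rescales two toy columns to [0, 1]. The `min_max_scale` helper and the sample values are illustrative assumptions, not the project's code.

```python
import os

# Set this before importing numerical libraries so that OpenMP-backed code
# picks it up when it initializes its thread pool.
os.environ["OMP_NUM_THREADS"] = "1"


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


ages = [18, 44, 70]         # years
weights = [50, 125, 200]    # grams
scaled_ages = min_max_scale(ages)
scaled_weights = min_max_scale(weights)
```

In a real pipeline the minimum and maximum would be fitted on the training split and reused unchanged at inference time.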

1.5 Raw Data Schema & 10-Row Previews

Overview of entity relationship star schema and live 10-row raw table previews

Star Schema Relations

Dimension tables (each primary key is a foreign key in the fact table):
  • products.csv, PK: product_id (200 SKUs)
  • customers.csv, PK: customer_id (50,000 Users)
  • stores.csv, PK: store_id (100 Stores)

Fact table:
  • sales.csv: contains 1,000,000 records linking product_id, customer_id, and store_id with revenue, quantity, and profit
| order_id | order_date | product_id | store_id | customer_id | quantity | unit_price | discount | revenue | cost | profit |
|---|---|---|---|---|---|---|---|---|---|---|
| 0RD00000001 | 2023-01-07 | P0080 | S093 | C040749 | 5 | 14.43 | 0.15 | 61.33 | 42.77 | 18.56 |
| 0RD00000002 | 2023-10-22 | P0173 | S065 | C020161 | 3 | 12.01 | 0.00 | 36.03 | 19.06 | 16.97 |
| 0RD00000003 | 2023-05-07 | P0115 | S078 | C048069 | 2 | 10.02 | 0.00 | 20.04 | 10.29 | 9.75 |
| 0RD00000004 | 2024-06-23 | P0186 | S088 | C047901 | 2 | 14.66 | 0.10 | 26.39 | 16.35 | 10.04 |
| 0RD00000005 | 2024-09-24 | P0197 | S054 | C033950 | 1 | 12.34 | 0.00 | 12.34 | 7.94 | 4.40 |
| 0RD00000006 | 2024-03-29 | P0160 | S089 | C008918 | 4 | 13.52 | 0.00 | 54.08 | 36.59 | 17.49 |
| 0RD00000007 | 2023-02-26 | P0062 | S024 | C002897 | 1 | 11.97 | 0.10 | 10.77 | 7.16 | 3.61 |
| 0RD00000008 | 2023-11-03 | P0111 | S085 | C038072 | 5 | 4.62 | 0.00 | 23.10 | 16.15 | 6.95 |
| 0RD00000009 | 2024-10-11 | P0135 | S029 | C003786 | 4 | 7.88 | 0.00 | 31.52 | 19.90 | 11.62 |
| 0RD00000010 | 2023-12-17 | P0069 | S056 | C043148 | 3 | 8.88 | 0.00 | 26.64 | 18.19 | 8.45 |
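
The preview rows appear to satisfy revenue = quantity × unit_price × (1 − discount) and profit = revenue − cost, although the source does not state these formulas. The sketch below checks that inferred relationship against four of the rows above, allowing for rounding to the cent; `row_is_consistent` is a hypothetical helper.

```python
# (quantity, unit_price, discount, revenue, cost, profit) copied from the preview.
ROWS = [
    (5, 14.43, 0.15, 61.33, 42.77, 18.56),
    (3, 12.01, 0.00, 36.03, 19.06, 16.97),
    (2, 14.66, 0.10, 26.39, 16.35, 10.04),
    (1, 11.97, 0.10, 10.77, 7.16, 3.61),
]


def row_is_consistent(quantity, unit_price, discount, revenue, cost, profit, tol=0.006):
    """Check the inferred revenue and profit formulas within a rounding tolerance."""
    revenue_ok = abs(quantity * unit_price * (1 - discount) - revenue) <= tol
    profit_ok = abs((revenue - cost) - profit) <= tol
    return revenue_ok and profit_ok


all_consistent = all(row_is_consistent(*row) for row in ROWS)
```

A check like this is a cheap audit step to run over the full fact table before any feature engineering.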

2. 18-Dimensional Feature Engineering Pipeline

Detailed feature extractions, aggregations, and selection rationale (cols_order)

🤔 Feature Selection Rationale & Methodology

In the ranking stage of a recommender, relying purely on IDs leads to severe overfitting and data sparsity bottlenecks. Drawing on domain knowledge and statistical aggregation, we constructed this 18-dimensional feature matrix around 4 core themes:

  • ① Demographics Matching: Age group, gender, and loyalty membership correlate strongly with preferences for cocoa percentage (`cocoa_percent`) and package weight (`weight_g`).
  • ② Purchasing Power & Price Sensitivity: Summarizing each user's average spend per order (`user_avg_revenue`) and average discount rate (`user_avg_discount`) lets the model distinguish high-ticket users from promo-driven shoppers.
  • ③ High-Order Cross Interactions: Essential for boosting AUC and ranking accuracy! Aggregating each user's historical purchase counts over specific products, brands, and categories captures brand loyalty and category affinity directly.
  • ④ Contextual Signals: Airport Duty-Free and Online shoppers exhibit distinct immediate purchase intents. Day-of-week and month features capture seasonality and weekend shopping surges.
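
The user aggregates and cross-interaction counts described in ② and ③ could be computed in one pass over the transaction log, as in the sketch below. It assumes each transaction is a dictionary with `customer_id`, `product_id`, `revenue`, and `discount` keys and that a product-to-brand lookup is available; `build_history_features` is a hypothetical name.

```python
from collections import Counter, defaultdict


def build_history_features(transactions, product_brand):
    """Aggregate per-user statistics and user-item / user-brand purchase counts."""
    orders = Counter()
    revenue_sum = defaultdict(float)
    discount_sum = defaultdict(float)
    user_item = Counter()
    user_brand = Counter()
    for tx in transactions:
        user, product = tx["customer_id"], tx["product_id"]
        orders[user] += 1
        revenue_sum[user] += tx["revenue"]
        discount_sum[user] += tx["discount"]
        user_item[(user, product)] += 1
        user_brand[(user, product_brand[product])] += 1
    user_aggs = {
        user: {
            "user_total_purchases": count,
            "user_avg_revenue": revenue_sum[user] / count,
            "user_avg_discount": discount_sum[user] / count,
        }
        for user, count in orders.items()
    }
    return user_aggs, user_item, user_brand


# Toy log (hypothetical values).
toy_log = [
    {"customer_id": "C1", "product_id": "P1", "revenue": 10.0, "discount": 0.0},
    {"customer_id": "C1", "product_id": "P1", "revenue": 20.0, "discount": 0.1},
    {"customer_id": "C1", "product_id": "P2", "revenue": 30.0, "discount": 0.2},
]
user_aggs, user_item, user_brand = build_history_features(
    toy_log, {"P1": "Lindt", "P2": "Lindt"}
)
```

To avoid label leakage, aggregates like these should be computed only from transactions that precede the one being predicted.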

📋 Full 18-Feature Matrix Overview

| # | Group | Feature | Description |
|---|---|---|---|
| 1 | User Profile | `user_age` | Normalized continuous age. Captures preferences for sweetness and dark, high-cocoa chocolate. |
| 2 | User Profile | `user_gender` | Gender encoded (Male/Female/Unknown) for demographic filtering. |
| 3 | User Profile | `user_loyalty` | Loyalty membership flag (0 or 1). Members exhibit higher AOV and repeat rates. |
| 4 | Item Spec | `item_cocoa` | Normalized cocoa percentage (40%-90%). Core divider between Dark and Milk chocolate. |
| 5 | Item Spec | `item_weight` | Package weight (g). Distinguishes snack packs from gift boxes. |
| 6 | Item Spec | `item_category` | Label-encoded category (Praline, White, Dark, Truffle, Milk). |
| 7 | Item Spec | `item_brand` | Label-encoded brand (Ferrero, Cadbury, Lindt, Mars, Godiva, etc.). |
| 8 | Contextual | `store_type` | Store channel type (Airport, Mall, Online, Retail), which influences consumer mindset. |
| 9 | Contextual | `day_of_week` | Day of week (0-6). Differentiates weekday indulgence from weekend shopping. |
| 10 | Contextual | `month` | Transaction month (1-12). Captures Valentine's and Christmas gifting peaks. |
| 11 | User Aggs | `user_total_purchases` | Total historical order count per user. Measures platform loyalty. |
| 12 | User Aggs | `user_avg_revenue` | Historical average spend per order. Quantifies purchasing power. |
| 13 | User Aggs | `user_avg_discount` | Average discount rate the user has received historically. Quantifies price sensitivity. |
| 14 | Item Aggs | `item_total_sales` | Global cumulative sales per item. Serves as a popularity prior. |
| 15 | Item Aggs | `item_avg_discount` | Average item discount rate. Measures inherent promotional strength. |
| 16 | Cross Interaction | `user_item_purchase_count` | 🔥 Core Signal: Repeat purchase count by this user for this exact SKU. |
| 17 | Cross Interaction | `user_brand_purchase_count` | 🔥 Core Signal: Historical purchase count by this user for this specific brand. |
| 18 | Cross Interaction | `user_category_purchase_count` | 🔥 Core Signal: Historical purchase count by this user for this category. |
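
The contextual features (8-10) need nothing beyond the order date and the store's channel type. A minimal sketch, assuming `order_date` is an ISO `YYYY-MM-DD` string as in the sales.csv preview and that day 0 means Monday (Python's `date.weekday()` convention; the text does not say which day is 0):

```python
from datetime import date


def context_features(order_date, store_type):
    """Derive the three contextual features from an ISO date and a store type."""
    parsed = date.fromisoformat(order_date)
    return {
        "store_type": store_type,
        "day_of_week": parsed.weekday(),  # 0 = Monday ... 6 = Sunday (assumed convention)
        "month": parsed.month,            # 1-12
    }


# The date is taken from the first preview row; the store type is illustrative.
ctx = context_features("2023-01-07", "Online")
```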

3. Multi-Algorithm Benchmark & Trade-Off Analysis

Unified offline metrics evaluation across Collaborative Filtering, GBDT trees, and Deep Neural Networks

| Algorithm Class | Precision@5 | Recall@5 | HitRate@5 | Inference Latency | System Role & Commercial Positioning |
|---|---|---|---|---|---|
| Item-based CF | 1.04% | 2.87% | 5.20% | ⚡ 0.19 ms | Retrieval Layer: Sub-millisecond candidate filtering over the full SKU pool |
| XGBoost Ranking | 1.07% | 2.51% | 5.20% | 2.96 ms | Ranking Alternative: High interpretability feature fitting |
| LightGBM Ranking | 1.15% (Highest) | 2.65% | 5.70% | 3.09 ms | Ranking Champion: Optimal balance of accuracy and fast training |
| Neural NCF (PyTorch) | 0.84% | 1.99% | 4.15% | 0.50 ms | Embedding Provider: Captures non-linear deep interactions & generalization |
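
The write-up does not spell out how the three metrics are defined, so the sketch below uses the standard per-user definitions and averages them over users. The function names and toy data are illustrative assumptions.

```python
def metrics_at_k(recommended, relevant, k=5):
    """Return (precision@k, recall@k, hit@k) for one user.

    `recommended` is a ranked list without duplicates; `relevant` is the set
    of held-out SKUs the user actually purchased.
    """
    hits = sum(1 for item in recommended[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    hit = 1.0 if hits > 0 else 0.0
    return precision, recall, hit


def average_metrics(per_user, k=5):
    """Average the per-user metrics over (recommended, relevant) pairs."""
    results = [metrics_at_k(rec, rel, k) for rec, rel in per_user]
    count = len(results)
    return tuple(sum(values) / count for values in zip(*results))


toy_eval = [
    (["P1", "P2", "P3", "P4", "P5"], {"P3", "P9"}),  # one hit
    (["P6", "P7", "P8", "P9", "P1"], {"P2"}),        # no hit
]
avg_precision, avg_recall, hit_rate = average_metrics(toy_eval)
```

Under these definitions Precision@5 can never fall below HitRate@5 divided by 5, a relationship every row of the table satisfies.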

4. Industrial Two-Stage Pipeline Interactive Sandbox

A simulated walk-through in which candidates are pruned by ItemCF retrieval and then scored by LightGBM ranking

🎯 Active User Context
User ID: CUST_000001 (Age 38 | Female | Online Channel)
Core History: Lindt 85% Dark, Sea Salt
Purchase Count: 12 Orders
Stage 1: Full Candidate Inventory (SKU Pool Size: 200 Products)

Inventory containing 200 authentic chocolate SKUs across all brands and cocoa levels...

SKU-101 (Lindt Dark 85%), SKU-104 (Dark Sea Salt 70%), SKU-208 (Godiva Truffle), SKU-305 (Ritter Dark Almond), SKU-112 (Valrhona Cocoa 90%), SKU-402 (Patchi Espresso), ... (+194 authentic SKUs)
Stage 2: ItemCF Collaborative Filtering Retrieval Layer (Latency: ~0.19 ms)

Calculating similarity against active user CUST_000001 history to retrieve Top 10 candidates:
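
A minimal sketch of item-based retrieval, assuming cosine similarity over binary co-purchase sets and assuming that SKUs already in the user's history are excluded from the candidates. The write-up does not state the similarity measure or the exclusion rule, so both are assumptions, as are the function names and toy data.

```python
import math
from collections import defaultdict


def item_similarity(user_items):
    """Cosine similarity between items based on which users bought them."""
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)
    sims = defaultdict(dict)
    names = sorted(item_users)
    for index, a in enumerate(names):
        for b in names[index + 1:]:
            common = len(item_users[a] & item_users[b])
            if common:
                score = common / math.sqrt(len(item_users[a]) * len(item_users[b]))
                sims[a][b] = score
                sims[b][a] = score
    return sims


def retrieve(history, sims, top_n=10):
    """Sum similarities to the user's history and return the top_n unseen items."""
    scores = defaultdict(float)
    for seen in history:
        for other, score in sims.get(seen, {}).items():
            if other not in history:
                scores[other] += score
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Toy purchase sets (hypothetical).
toy_users = {"U1": {"A", "B"}, "U2": {"A", "B", "C"}, "U3": {"B", "C"}, "U4": {"A", "D"}}
candidates = retrieve({"A"}, item_similarity(toy_users), top_n=2)
```

The similarity table can be precomputed offline, which is what keeps the online lookup in the sub-millisecond range reported above.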

Stage 3: LightGBM Real-time Ranking & Feature Scoring (Latency: ~3.0 ms)

Combining user profile (Age 38, Female) with product specs to compute GBDT CTR predictions:
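
The ranking step reduces to scoring each retrieved candidate and sorting. In the sketch below `score_fn` is a hypothetical stand-in for building the 18 features for a (user, SKU) pair and calling the trained LightGBM model; the toy scores are made up for illustration and are not model output.

```python
def rank_candidates(candidates, score_fn, top_k=5):
    """Score each candidate SKU and return the top_k (sku, score) pairs, best first."""
    scored = [(sku, score_fn(sku)) for sku in candidates]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))  # ties broken by SKU id
    return scored[:top_k]


# Made-up scores standing in for the model's predicted purchase probability.
toy_scores = {"SKU-101": 0.91, "SKU-104": 0.74, "SKU-208": 0.12, "SKU-112": 0.63}


def toy_score_fn(sku):
    return toy_scores.get(sku, 0.0)


top3 = rank_candidates(["SKU-208", "SKU-101", "SKU-112", "SKU-104"], toy_score_fn, top_k=3)
```

Passing the scorer in as a function keeps the retrieval and ranking stages decoupled, so the ranker can be swapped (for example LightGBM for XGBoost) without touching the rest of the pipeline.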
