1. Exploratory Data Analysis & Quality Audit (EDA)
Deep-dive audit based on 1,000,000 authentic transaction records from sales.csv
Dataset Scale
1,000,000 transaction logs covering 200 SKUs, 50,000 registered customers, and 100 store locations. Generated total revenue of $25,486,128.86 (40.0% average gross margin).
Category & Brand Distribution
Praline ($6.66M, 26.15%) and White Chocolate ($6.07M, 23.82%) lead category revenue. Ferrero ($4.69M) and Cadbury ($4.67M) are the two highest-revenue brands.
Store Channel Contributions
Airport Duty-Free stores contribute the highest revenue ($7.61M, 29.9%), Shopping Malls account for 26.0%, Online store 25.1%, and Retail shops 19.1%.
Real Data Exploratory Visuals
Category Revenue Distribution
Store Type Revenue Contribution
Data Quality Issues & Engineering Governance Framework
1. Matrix Sparsity
As verified with `eda.py`, the user-item interaction matrix is 90.10% sparse (50,000 users × 200 SKUs). Most users have only 1-2 purchase logs.
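For reference, sparsity here is the share of empty cells in the user-item matrix. The sketch below is not the contents of `eda.py` (which is not shown); it is a minimal illustration on made-up toy data, and `matrix_sparsity` is a hypothetical helper name.

```python
def matrix_sparsity(interactions, n_users, n_items):
    """Fraction of empty cells in the user-item matrix."""
    observed = len(set(interactions))  # distinct (user, item) pairs
    return 1.0 - observed / (n_users * n_items)

# Toy data (made up): 3 users, 4 SKUs, one repeat purchase.
toy = [("C1", "P1"), ("C1", "P1"), ("C2", "P3"), ("C3", "P4")]
toy_sparsity = matrix_sparsity(toy, n_users=3, n_items=4)
print(f"{toy_sparsity:.2%}")  # 3 of 12 cells are filled

# At the reported scale, 90.10% sparsity over 50,000 * 200 = 10,000,000
# cells corresponds to roughly this many observed user-item pairs.
observed_pairs = round((1 - 0.9010) * 50_000 * 200)
print(observed_pairs)
```

Repeat purchases of the same SKU by the same user count once, which is why the pairs are deduplicated before counting.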
Solution: Neural Embeddings + LightGBM GBDT fitting
2. Cold-Start Guests
New guests and newly registered users have no purchase history in the database, so pure ItemCF cannot retrieve any candidates for them.
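A minimal sketch of how a request could be flagged as cold-start and given a static fallback profile. Only the age of 35 comes from this report; the field names, the neutral values, and the customer IDs below are assumptions for illustration.

```python
# Hypothetical fallback profile: age 35 is taken from the report,
# the other fields and their neutral values are assumed.
FALLBACK_PROFILE = {"age": 35, "gender": "Unknown", "preferred_cocoa": 0.0}

def build_user_context(customer_id, profiles, history):
    """Return (profile, is_cold_start) for one request."""
    purchases = history.get(customer_id, [])
    is_cold_start = len(purchases) == 0
    profile = profiles.get(customer_id)
    if profile is None:
        profile = dict(FALLBACK_PROFILE)  # copy so callers cannot mutate the default
    return profile, is_cold_start

# Made-up example data.
profiles = {"C000001": {"age": 41, "gender": "Male", "preferred_cocoa": 0.7}}
history = {"C000001": ["P0080"]}

known_profile, known_cold = build_user_context("C000001", profiles, history)
guest_profile, guest_cold = build_user_context("GUEST", profiles, history)
print(known_cold, guest_cold, guest_profile["age"])
```

A cold-start flag like this lets the serving layer skip ItemCF for users with no history instead of failing on an empty candidate list.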
Solution: Static demographic fallback profiles (Age 35, neutral preferences)
3. OOV & Unknown Mapping
Inference input may contain brand or channel strings that were never seen during training, which raises a key-lookup error (a `KeyError` in Python) and can crash the online server.
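The mapping-dictionary idea can be sketched as follows. The helper names `build_vocab` and `encode` are hypothetical, and reserving id 0 for 'Unknown' is an assumption rather than something the report specifies.

```python
UNKNOWN = "Unknown"

def build_vocab(values):
    """Map each training-time category to an integer id; id 0 is 'Unknown'."""
    vocab = {UNKNOWN: 0}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab

def encode(value, vocab):
    """Look up a category, falling back to the 'Unknown' id instead of raising."""
    return vocab.get(value, vocab[UNKNOWN])

brand_vocab = build_vocab(["Ferrero", "Cadbury", "Ferrero"])
seen_id = encode("Ferrero", brand_vocab)
unseen_id = encode("SomeNewBrand", brand_vocab)
print(seen_id, unseen_id)
```

Using `dict.get` with a default keeps the lookup from raising on unseen strings, so the request degrades to a generic encoding rather than failing.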
Solution: Pre-built mapping dictionaries that map unseen values to 'Unknown'
4. Continuous Feature Imputation
Cocoa percentages of new products (`cocoa_percent`) or user ages (`age`) may be missing, which can bias GBDT splits.
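A minimal sketch of the imputation rule, assuming rows arrive as plain dictionaries and that a missing value shows up as `None` or NaN. The defaults of 35 and 0.0 are the ones stated in this report; `impute_row` is a hypothetical helper.

```python
import math

AGE_DEFAULT = 35       # median age stated in the report
NEUTRAL_DEFAULT = 0.0  # neutral value for cocoa_percent

def is_missing(x):
    """True for None or a float NaN."""
    return x is None or (isinstance(x, float) and math.isnan(x))

def impute_row(row):
    """Return a copy of the row with age and cocoa_percent filled in."""
    out = dict(row)
    if is_missing(out.get("age")):
        out["age"] = AGE_DEFAULT
    if is_missing(out.get("cocoa_percent")):
        out["cocoa_percent"] = NEUTRAL_DEFAULT
    return out

filled = impute_row({"age": None, "cocoa_percent": float("nan"), "weight_g": 100})
print(filled)
```

Applying the same constants at training and inference time keeps the two feature distributions aligned.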
Solution: Impute `age` with the median (`35`) and use a neutral `0.0` for cocoa/specs
5. Class Imbalance
Across the 1M transactions, positive purchase labels (1) are extremely rare compared with the full candidate set (0); the imbalance ratio is roughly 1:199.
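One plausible reading of the 1:199 figure is that each purchase is a single positive against the other 199 of the 200 SKUs. Under that assumption, a 1:4 negative sampler could look like the sketch below. The uniform sampling and the function name are illustrative choices; the report does not say which heuristic is actually used.

```python
import random

def sample_training_pairs(purchases, all_items, neg_per_pos=4, seed=0):
    """Emit (user, item, label) rows: each positive plus sampled negatives."""
    rng = random.Random(seed)
    bought = {}
    for user, item in purchases:
        bought.setdefault(user, set()).add(item)

    rows = []
    for user, item in purchases:
        rows.append((user, item, 1))
        # Negatives are drawn only from SKUs this user never bought.
        pool = sorted(set(all_items) - bought[user])
        for neg in rng.sample(pool, min(neg_per_pos, len(pool))):
            rows.append((user, neg, 0))
    return rows

items = [f"P{i:04d}" for i in range(1, 201)]  # 200 SKUs
purchases = [("C1", "P0001"), ("C1", "P0002"), ("C2", "P0050")]  # made up
rows = sample_training_pairs(purchases, items)
positives = sum(label for _, _, label in rows)
print(len(rows), positives)
```

With three positives this yields twelve negatives, so the training set sits at 1:4 instead of 1:199.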
Solution: Apply 1:4 heuristic negative sampling during offline training
6. Tensor Scaling & Multi-threading Lock
Feature scales differ widely (age 18-70 vs. weight 50-200 g), and OpenMP/MKL multi-threading is prone to deadlocks in cloud environments.
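Both remedies are small in code. The sketch below sets `OMP_NUM_THREADS` before any numerical library is imported and rescales the two features to [0, 1] using the ranges quoted above. `min_max_scale` is a hypothetical helper, not the project's own function.

```python
import os

# Set before importing libraries that initialise OpenMP, since the
# runtime reads this variable when it starts up.
os.environ["OMP_NUM_THREADS"] = "1"

def min_max_scale(values, lo=None, hi=None):
    """Scale values to [0, 1]; lo/hi default to the observed min/max."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # constant feature: avoid division by zero
    return [(v - lo) / span for v in values]

# Ranges from the report: age 18-70, weight 50-200 g.
ages = min_max_scale([18, 44, 70], lo=18, hi=70)
weights = min_max_scale([50, 125, 200], lo=50, hi=200)
print(ages, weights)
```

Passing fixed `lo` and `hi` values, rather than recomputing them per batch, keeps inference-time scaling identical to training-time scaling.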
Solution: Min-Max scaling + force single-thread `OMP_NUM_THREADS=1`
1.5 Raw Data Schema & 10-Row Previews
Overview of entity relationship star schema and live 10-row raw table previews
Star Schema Relations
PK: product_id (200 SKUs)
PK: customer_id (50,000 Users)
PK: store_id (100 Stores)
Fact table: contains 1,000,000 records linking product_id, customer_id, and store_id with revenue, quantity, and profit
| order_id | order_date | product_id | store_id | customer_id | quantity | unit_price | discount | revenue | cost | profit |
|---|---|---|---|---|---|---|---|---|---|---|
| 0RD00000001 | 2023-01-07 | P0080 | S093 | C040749 | 5 | 14.43 | 0.15 | 61.33 | 42.77 | 18.56 |
| 0RD00000002 | 2023-10-22 | P0173 | S065 | C020161 | 3 | 12.01 | 0.00 | 36.03 | 19.06 | 16.97 |
| 0RD00000003 | 2023-05-07 | P0115 | S078 | C048069 | 2 | 10.02 | 0.00 | 20.04 | 10.29 | 9.75 |
| 0RD00000004 | 2024-06-23 | P0186 | S088 | C047901 | 2 | 14.66 | 0.10 | 26.39 | 16.35 | 10.04 |
| 0RD00000005 | 2024-09-24 | P0197 | S054 | C033950 | 1 | 12.34 | 0.00 | 12.34 | 7.94 | 4.40 |
| 0RD00000006 | 2024-03-29 | P0160 | S089 | C008918 | 4 | 13.52 | 0.00 | 54.08 | 36.59 | 17.49 |
| 0RD00000007 | 2023-02-26 | P0062 | S024 | C002897 | 1 | 11.97 | 0.10 | 10.77 | 7.16 | 3.61 |
| 0RD00000008 | 2023-11-03 | P0111 | S085 | C038072 | 5 | 4.62 | 0.00 | 23.10 | 16.15 | 6.95 |
| 0RD00000009 | 2024-10-11 | P0135 | S029 | C003786 | 4 | 7.88 | 0.00 | 31.52 | 19.90 | 11.62 |
| 0RD00000010 | 2023-12-17 | P0069 | S056 | C043148 | 3 | 8.88 | 0.00 | 26.64 | 18.19 | 8.45 |
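The preview rows appear to follow `revenue = quantity * unit_price * (1 - discount)` and `profit = revenue - cost`. That relationship is inferred from the rows above rather than documented, so the sketch below checks it on three of the discounted rows. `check_row` is a hypothetical helper and the one-cent tolerance is an assumption that allows for rounding.

```python
rows = [
    # (quantity, unit_price, discount, revenue, cost, profit), copied from the preview
    (5, 14.43, 0.15, 61.33, 42.77, 18.56),
    (2, 14.66, 0.10, 26.39, 16.35, 10.04),
    (1, 11.97, 0.10, 10.77, 7.16, 3.61),
]

def check_row(quantity, unit_price, discount, revenue, cost, profit, tol=0.01):
    """True if revenue and profit match the inferred formulas within tol."""
    expected_revenue = quantity * unit_price * (1 - discount)
    expected_profit = revenue - cost
    return (abs(expected_revenue - revenue) <= tol
            and abs(expected_profit - profit) <= tol)

results = [check_row(*row) for row in rows]
print(results)
```

A check like this is a cheap audit step to run over the full file before any revenue or margin totals are reported.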
2. 18-Dimensional Feature Engineering Pipeline
Detailed feature extractions, aggregations, and selection rationale (cols_order)
Feature Selection Rationale & Methodology
In recommendation ranking stages, relying purely on IDs leads to severe overfitting and data sparsity bottlenecks. Based on domain knowledge and statistical aggregation, we constructed this 18-dimensional feature matrix across 4 core dimensions:
- ① Demographics Matching: Age group, gender, and loyalty membership correlate strongly with preferences for cocoa percentage (`cocoa_percent`) and package weight (`weight_g`).
- ② Purchasing Power & Price Sensitivity: Summarizing each user's average spend (`user_avg_revenue`) and average discount rate (`user_avg_discount`) lets models distinguish high-ticket users from promo-driven shoppers.
- ③ High-Order Cross Interactions: Essential for boosting AUC and ranking accuracy. Aggregating each user's historical purchase counts over specific products, brands, and categories captures brand loyalty and category affinity directly.
- ④ Contextual Signals: Airport Duty-Free and Online shoppers show distinct immediate purchase intents. Day-of-week and month features capture seasonality and weekend shopping surges.
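The aggregation-style features in the list above can be sketched in plain Python as follows. `user_avg_revenue` and `user_avg_discount` are names from this report; the tuple layout, the toy transactions, and the `day_of_week`, `month`, and `is_weekend` keys are assumptions for illustration, not the actual `cols_order`.

```python
from collections import Counter, defaultdict
from datetime import date

transactions = [
    # (customer_id, product_id, brand, order_date, discount, revenue), made up
    ("C1", "P1", "Ferrero", "2023-01-07", 0.15, 61.33),
    ("C1", "P2", "Ferrero", "2023-10-22", 0.00, 36.03),
    ("C2", "P3", "Cadbury", "2023-05-07", 0.00, 20.04),
]

def user_aggregates(transactions):
    """Per-user mean revenue and mean discount."""
    sums = defaultdict(lambda: {"n": 0, "revenue": 0.0, "discount": 0.0})
    for cust, _prod, _brand, _day, discount, revenue in transactions:
        s = sums[cust]
        s["n"] += 1
        s["revenue"] += revenue
        s["discount"] += discount
    return {
        cust: {
            "user_avg_revenue": s["revenue"] / s["n"],
            "user_avg_discount": s["discount"] / s["n"],
        }
        for cust, s in sums.items()
    }

def user_brand_counts(transactions):
    """Cross feature: how often each user bought each brand."""
    counts = Counter()
    for cust, _prod, brand, *_rest in transactions:
        counts[(cust, brand)] += 1
    return counts

def context_features(order_date):
    """Calendar signals derived from an ISO date string."""
    d = date.fromisoformat(order_date)
    return {"day_of_week": d.weekday(), "month": d.month,
            "is_weekend": int(d.weekday() >= 5)}

agg = user_aggregates(transactions)
brand_counts = user_brand_counts(transactions)
ctx = context_features("2023-01-07")
print(agg["C1"], brand_counts[("C1", "Ferrero")], ctx)
```

In an offline setup these aggregates would normally be computed only from transactions dated before the prediction point, so that the label does not leak into the features.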
Full 18-Feature Matrix Overview
3. Multi-Algorithm Benchmark & Trade-Off Analysis
Unified offline metrics evaluation across Collaborative Filtering, GBDT trees, and Deep Neural Networks
| Algorithm Class | Precision@5 | Recall@5 | HitRate@5 | Inference Latency | System Role & Commercial Positioning |
|---|---|---|---|---|---|
| Item-based CF | 1.04% | 2.87% | 5.20% | 0.19 ms | Retrieval Layer: Sub-millisecond candidate filtering from massive SKUs |
| XGBoost Ranking | 1.07% | 2.51% | 5.20% | 2.96 ms | Ranking Alternative: High interpretability feature fitting |
| LightGBM Ranking | 1.15% (Highest) | 2.65% | 5.70% | 3.09 ms | Ranking Champion: Optimal balance of accuracy and fast training |
| Neural NCF (PyTorch) | 0.84% | 1.99% | 4.15% | 0.50 ms | Embedding Provider: Captures non-linear deep interactions & generalization |
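For readers unfamiliar with the three ranking metrics in the table, the sketch below uses their common textbook definitions, averaged over users with at least one held-out purchase. Whether the benchmark used exactly these definitions is an assumption, and the toy recommendations are made up.

```python
def metrics_at_k(recommended, relevant, k=5):
    """Precision@k, Recall@k and hit indicator for a single user."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    hit = 1.0 if hits > 0 else 0.0
    return precision, recall, hit

def evaluate(recs_by_user, truth_by_user, k=5):
    """Average the per-user metrics over users with non-empty ground truth."""
    users = [u for u, truth in truth_by_user.items() if truth]
    if not users:
        return 0.0, 0.0, 0.0
    totals = [0.0, 0.0, 0.0]
    for u in users:
        for idx, value in enumerate(metrics_at_k(recs_by_user.get(u, []), truth_by_user[u], k)):
            totals[idx] += value
    return tuple(t / len(users) for t in totals)

recs = {"A": ["P1", "P2", "P3", "P4", "P5"], "B": ["P1", "P2", "P3", "P4", "P5"]}
truth = {"A": {"P1", "P9"}, "B": {"P7"}}
precision_5, recall_5, hit_rate_5 = evaluate(recs, truth)
print(precision_5, recall_5, hit_rate_5)
```

Under these definitions, Precision@5 is at least HitRate@5 divided by 5. The Item-based CF row sits exactly at that bound (5.20% / 5 = 1.04%), which would mean each of its hit users had a single correct item in the top 5.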
4. Industrial Two-Stage Pipeline Interactive Sandbox
Run dynamic simulation showing candidates pruned via ItemCF Retrieval and scored by LightGBM Ranking
Inventory containing 200 authentic chocolate SKUs across all brands and cocoa levels...
Calculating similarity against active user CUST_000001 history to retrieve Top 10 candidates:
Combining user profile (Age 38, Female) with product specs to compute GBDT CTR predictions:
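The retrieve-then-rank flow described in this section can be sketched as follows. This is an illustration on made-up data: retrieval uses a simple co-occurrence cosine similarity as a stand-in for the project's ItemCF, and a fixed score lookup stands in for the LightGBM ranker. All function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

def item_similarity(baskets):
    """Cosine-style item-item similarity from per-user purchase sets."""
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for items in baskets.values():
        for i in items:
            item_count[i] += 1
        for i, j in combinations(sorted(items), 2):
            pair_count[(i, j)] += 1
            pair_count[(j, i)] += 1
    return {(i, j): c / math.sqrt(item_count[i] * item_count[j])
            for (i, j), c in pair_count.items()}

def retrieve(user, baskets, sim, top_n=10):
    """Stage 1: score unseen items by similarity to the user's history."""
    seen = baskets.get(user, set())
    scores = defaultdict(float)
    for (i, j), s in sim.items():
        if i in seen and j not in seen:
            scores[j] += s
    return sorted(scores, key=lambda j: (-scores[j], j))[:top_n]

def rank(candidates, score_fn, top_k=5):
    """Stage 2: re-order the retrieved candidates with a ranking model."""
    return sorted(candidates, key=lambda j: (-score_fn(j), j))[:top_k]

baskets = {"U1": {"P1", "P2"}, "U2": {"P1", "P2", "P3"}, "U3": {"P2", "P3", "P4"}}
sim = item_similarity(baskets)
candidates = retrieve("U1", baskets, sim)

# Stand-in for the ranker: made-up purchase-probability scores.
toy_scores = {"P3": 0.2, "P4": 0.7}
final = rank(candidates, lambda j: toy_scores.get(j, 0.0))
print(candidates, final)
```

In this toy case retrieval returns P3 ahead of P4, and the ranking stage reverses that order. That is the division of labour the two stages are meant to have: a cheap filter followed by a more expensive, feature-aware scorer.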