🛠️ Recommendation System Engineering Showcase

1. Exploratory Data Analysis & Quality Audit (EDA)

Deep-dive audit based on 1,000,000 authentic transaction records from sales.csv

Dataset Scale

1,000,000 transaction logs covering 200 SKUs, 50,000 registered customers, and 100 store locations. Generated total revenue of $25,486,128.86 (40.0% average gross margin).

Category & Brand Distribution

Praline ($6.66M, 26.15%) and White Chocolate ($6.07M, 23.82%) lead sales revenue. Ferrero ($4.69M) and Cadbury ($4.67M) form the top revenue dual drivers.

Store Channel Contributions

Airport Duty-Free stores contribute the highest revenue ($7.61M, 29.9%), Shopping Malls account for 26.0%, Online store 25.1%, and Retail shops 19.1%.

📊 Real Data Exploratory Visuals

🍫 Category Revenue Distribution

Praline$6.67M (26.15%)

White Chocolate$6.07M (23.82%)

Dark Chocolate$5.30M (20.79%)

Truffle$3.92M (15.40%)

Milk Chocolate$3.28M (12.87%)

🏬 Store Type Revenue Contribution

Airport Duty-Free$7.61M (29.87%)

Shopping Mall$6.63M (26.01%)

Online Shop$6.39M (25.06%)

Retail Store$4.86M (19.06%)

🛠️ Data Quality Issues & Engineering Governance Framework

Click to Analyze

1. Matrix Sparsity

Verified via `eda.py`, user-item interaction matrix sparsity reaches 90.10% (50,000 users over 200 SKUs). Most users have only 1-2 purchase logs.

Solution: Neural Embeddings + LightGBM GBDT fitting

Click to Analyze

2. Cold-Start Guests

New guests or registered users have zero historical purchases in DB, causing pure ItemCF to fail candidate retrieval.

Solution: Static demographic fallback profiles (Age 35, neutral preferences)

Click to Analyze

3. OOV & Unknown Mapping

Inference input might contain unseen brand or channel strings, triggering KeyNotFoundError or crashing online servers.

Solution: Pre-built mapping dictionaries mapping unseen values to 'Unknown'

Click to Analyze

4. Continuous Feature Imputation

New product cocoa percentages (`cocoa_percent`) or user ages (`age`) might contain missing values, biasing GBDT splits.

Solution: Impute age by median `35`, neutral `0.0` for cocoa/specs

Click to Analyze

5. Class Imbalance

In 1M transactions, positive purchase labels (1) are extremely rare compared to full candidates (0). Imbalance ratio ~ 1:199.

Solution: Apply 1:4 heuristic negative sampling during offline training

Click to Analyze

6. Tensor Scaling & Multi-threading Lock

Large feature scale gaps (Age 18-70 vs Weight 50-200g). OpenMP/MKL multi-threading deadlocks easily in cloud environments.

Solution: Min-Max scaling + force single-thread `OMP_NUM_THREADS=1`

1.5 Raw Data Schema & 10-Row Previews

Overview of entity relationship star schema and live 10-row raw table previews

Star Schema Relations

products.csv
PK: product_id (200 SKUs)

customers.csv
PK: customer_id (50,000 Users)

stores.csv
PK: store_id (100 Stores)

Foreign Keys Mapping to Fact Table

sales.csv (Fact Table)
Contains 1,000,000 records linking product_id, customer_id, store_id with revenue, quantity, and profit

Download sales.csv Download products.csv Download customers.csv Download stores.csv

order_id	order_date	product_id	store_id	customer_id	quantity	unit_price	discount	revenue	cost	profit
0RD00000001	2023-01-07	P0080	S093	C040749	5	14.43	0.15	61.33	42.77	18.56
0RD00000002	2023-10-22	P0173	S065	C020161	3	12.01	0.00	36.03	19.06	16.97
0RD00000003	2023-05-07	P0115	S078	C048069	2	10.02	0.00	20.04	10.29	9.75
0RD00000004	2024-06-23	P0186	S088	C047901	2	14.66	0.10	26.39	16.35	10.04
0RD00000005	2024-09-24	P0197	S054	C033950	1	12.34	0.00	12.34	7.94	4.40
0RD00000006	2024-03-29	P0160	S089	C008918	4	13.52	0.00	54.08	36.59	17.49
0RD00000007	2023-02-26	P0062	S024	C002897	1	11.97	0.10	10.77	7.16	3.61
0RD00000008	2023-11-03	P0111	S085	C038072	5	4.62	0.00	23.10	16.15	6.95
0RD00000009	2024-10-11	P0135	S029	C003786	4	7.88	0.00	31.52	19.90	11.62
0RD00000010	2023-12-17	P0069	S056	C043148	3	8.88	0.00	26.64	18.19	8.45

product_id	product_name	brand	category	cocoa_percent	weight_g
P0001	White Chocolate 80%	Mars	Truffle	80	120
P0002	Dark Chocolate 70%	Cadbury	Praline	70	100
P0003	Truffle Chocolate 70%	Hershey	Praline	70	120
P0004	Milk Chocolate 50%	Mars	Praline	50	80
P0005	White Chocolate 70%	Ferrero	White	70	50
P0006	Milk Chocolate 50%	Hershey	Dark	50	50
P0007	Praline Chocolate 70%	Cadbury	Praline	70	120
P0008	White Chocolate 90%	Godiva	Dark	90	100
P0009	White Chocolate 50%	Ferrero	Dark	50	80
P0010	Milk Chocolate 70%	Hershey	Truffle	70	50

customer_id	age	gender	loyalty_member	join_date
C000001	40	Male	1	2025-05-21
C000002	47	Male	0	2021-12-26
C000003	58	Female	1	2022-09-13
C000004	25	Female	0	2025-02-27
C000005	43	Male	0	2023-08-31
C000006	32	Male	0	2022-05-22
C000007	51	Male	1	2024-07-28
C000008	56	Female	0	2024-08-12
C000009	18	Male	0	2025-06-23
C000010	40	Male	0	2023-06-07

store_id	store_name	city	country	store_type
S001	Chocolate Store 1	New York	Canada	Retail
S002	Chocolate Store 2	Melbourne	Canada	Mall
S003	Chocolate Store 3	Berlin	France	Mall
S004	Chocolate Store 4	Paris	UK	Airport
S005	Chocolate Store 5	Sydney	USA	Online
S006	Chocolate Store 6	Toronto	Canada	Online
S007	Chocolate Store 7	Sydney	France	Mall
S008	Chocolate Store 8	Paris	UK	Mall
S009	Chocolate Store 9	Paris	France	Online
S010	Chocolate Store 10	Toronto	UK	Retail

2. 18-Dimensional Feature Engineering Pipeline

Detailed feature extractions, aggregations, and selection rationale (cols_order)

🤔 Feature Selection Rationale & Methodology

In recommendation ranking stages, relying purely on IDs leads to severe overfitting and data sparsity bottlenecks. Based on domain knowledge and statistical aggregation, we constructed this 18-dimensional feature matrix across 4 core dimensions:

① Demographics Matching: Age groups, gender, and loyalty membership have strong preferences regarding cocoa percentage (`cocoa_percent`) and package weight (`weight_g`).
② Purchasing Power & Price Sensitivity: Summarizing user total spend (`user_avg_revenue`) and discount rates (`user_avg_discount`) allows models to distinguish high-ticket users from promo-driven shoppers.
③ High-Order Cross Interactions: Essential for boosting AUC and ranking accuracy! Aggregating user historical counts over specific products, brands, and categories captures brand loyalty and category affinity directly.
④ Contextual Signals: Airport Duty-Free vs Online shoppers exhibit distinct instant purchase intents. Day-of-week and month features capture seasonality and weekend shopping surges.

📋 Full 18-Feature Matrix Overview

1. User Profile user_age Continuous age normalized. Captures sweetness and cocoa dark preferences.

2. User Profile user_gender Gender encoded (Male/Female/Unknown) for demographic filtering.

3. User Profile user_loyalty Loyalty membership flag (0 or 1). Members exhibit higher AOV and repeat rates.

4. Item Spec item_cocoa Cocoa percentage (40%-90%) normalized. Core divider for Dark vs Milk chocolate.

5. Item Spec item_weight Package weight (g). Distinguishes snack packs from gift boxes.

6. Item Spec item_category Category Label Encoding (Praline, White, Dark, Truffle, Milk).

7. Item Spec item_brand Brand Label Encoding (Ferrero, Cadbury, Lindt, Mars, Godiva, etc.).

8. Contextual store_type Channel store type (Airport, Mall, Online, Retail) influencing consumer mindset.

9. Contextual day_of_week Day of week (0-6). Differentiates weekday indulgence from weekend shopping.

10. Contextual month Transaction month (1-12). Captures Valentine's and Christmas gifting peaks.

11. User Aggs user_total_purchases Total historical order count per user. Measures platform loyalty.

12. User Aggs user_avg_revenue Historical average spend per order. Quantifies purchasing power.

13. User Aggs user_avg_discount Average discount rate enjoyed historically. Quantifies price sensitivity.

14. Item Aggs item_total_sales Global cumulative sales per item. Popularity prior probability.

15. Item Aggs item_avg_discount Average item discount rate. Measures inherent promotional strength.

16. Cross Interaction user_item_purchase_count 🔥 Core Signal: Repeat purchase count by this user for this exact SKU.

17. Cross Interaction user_brand_purchase_count 🔥 Core Signal: Historical purchase count by this user for this specific brand.

18. Cross Interaction user_category_purchase_count 🔥 Core Signal: Historical purchase count by this user for this category.

3. Multi-Algorithm Benchmark & Trade-Off Analysis

Unified offline metrics evaluation across Collaborative Filtering, GBDT trees, and Deep Neural Networks

Algorithm Class	Precision@5	Recall@5	HitRate@5	Inference Latency	System Role & Commercial Positioning
Item-based CF	1.04%	2.87%	5.20%	⚡ 0.19 ms	Retrieval Layer: Sub-millisecond candidate filtering from massive SKUs
XGBoost Ranking	1.07%	2.51%	5.20%	2.96 ms	Ranking Alternative: High interpretability feature fitting
LightGBM Ranking	1.15% (Highest)	2.65%	5.70%	3.09 ms	Ranking Champion: Optimal balance of accuracy and fast training
Neural NCF (PyTorch)	0.84%	1.99%	4.15%	0.50 ms	Embedding Provider: Captures non-linear deep interactions & generalization

4. Industrial Two-Stage Pipeline Interactive Sandbox

Run dynamic simulation showing candidates pruned via ItemCF Retrieval and scored by LightGBM Ranking

🎯 Active User Context

User ID: CUST_000001 (Age 38 | Female | Online Channel)

Core History: Lindt 85% Dark, Sea Salt

Purchase Count: 12 Orders

Ready for live inference...

Stage 1: Full Candidate Inventory SKU Pool Size: 200 Products

Inventory containing 200 authentic chocolate SKUs across all brands and cocoa levels...

SKU-101 (Lindt Dark 85%) SKU-104 (Dark Sea Salt 70%) SKU-208 (Godiva Truffle) SKU-305 (Ritier Dark Almond) SKU-112 (Valrhona Cocoa 90%) SKU-402 (Patchi Espresso) ... (+194 authentic SKUs)

Stage 2: ItemCF Collaborative Filtering Retrieval Layer Latency: ~0.19 ms

Calculating similarity against active user CUST_000001 history to retrieve Top 10 candidates:

Click run button to trigger backend ItemCF retrieval...

Stage 3: LightGBM Real-time Ranking & Feature Scoring Latency: ~3.0 ms

Combining user profile (Age 38, Female) with product specs to compute GBDT CTR predictions:

                            Click run button above to view live LightGBM model score rankings