ysakhale committed on
Commit 0b94a95 · verified · 1 Parent(s): 91c4eaa

Upload 2 files

Files changed (2)
  1. README.md +110 -0
  2. model.py +232 -0
README.md ADDED
@@ -0,0 +1,110 @@
+ ---
+ license: mit
+ language:
+ - en
+ tags:
+ - recommendation-system
+ - content-based-filtering
+ - landmarks
+ - cmu
+ - campus-exploration
+ size_categories:
+ - n<1K
+ ---
+
+ # Content-Based Recommendation System for CMU Landmarks
+
+ ## Model Description
+
+ This is a **trained-from-scratch** content-based recommendation system that recommends Carnegie Mellon University (CMU) landmarks based on user preferences. It builds a feature vector for each landmark from its characteristics and ranks landmarks by cosine similarity to a user preference vector.
+
+ ## Model Details
+
+ ### Model Type
+ - **Architecture**: Content-based filtering with feature engineering
+ - **Training**: Trained from scratch on the CMU landmarks dataset
+ - **Input**: Landmark features (rating, classes, location, dwell time, indoor/outdoor)
+ - **Output**: Similarity scores for landmark recommendations
+
+ ### Training Data
+ - **Dataset**: 100+ manually curated CMU landmarks
+ - **Features**: Rating, classes, geographic coordinates, dwell time, indoor/outdoor classification
+ - **Preprocessing**: StandardScaler normalization, multi-hot encoding for classes
+
+ ### Training Procedure
+ - Feature extraction from landmark metadata
+ - StandardScaler normalization of numerical features
+ - Multi-hot encoding for categorical classes
+ - Cosine similarity computation for recommendations
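
The four steps listed under Training Procedure can be sketched in a few lines. This is a minimal, standard-library-only illustration and not the code in `model.py`, which relies on scikit-learn's `StandardScaler` and `cosine_similarity`. The toy landmarks and the helper names `multi_hot`, `standardize`, and `cosine` are assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def multi_hot(classes, vocabulary):
    """Return a 0/1 vector marking which vocabulary entries appear in classes."""
    present = set(classes)
    return [1.0 if c in present else 0.0 for c in vocabulary]


def standardize(rows):
    """Column-wise z-score using the population standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1 instead of 0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical toy landmarks: (rating, classes)
toy = [
    (4.5, ["Culture"]),
    (4.0, ["Culture", "Research"]),
    (3.0, ["Sports"]),
]

# Step 1: feature extraction (rating scaled to 0-1, classes multi-hot encoded)
vocabulary = sorted({c for _, classes in toy for c in classes})
raw = [[rating / 5.0] + multi_hot(classes, vocabulary) for rating, classes in toy]

# Step 2: normalization of every column
scaled = standardize(raw)

# Step 3: cosine similarity of the first landmark against all landmarks
similarity_to_first = [cosine(scaled[0], row) for row in scaled]
```

Standardizing before taking cosine similarity centers every column on zero, so similarity reflects how a landmark deviates from the average landmark rather than its raw magnitudes.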
+
+ ## Intended Use
+
+ ### Primary Use Cases
+ - Recommending CMU landmarks based on user preferences
+ - Finding landmarks similar to user-selected favorites
+ - Personalized campus exploration planning
+
+ ### Out-of-Scope Use Cases
+ - Recommending landmarks outside the CMU campus
+ - Predicting user ratings or reviews
+ - Real-time location-based recommendations
+
+ ## Performance Metrics
+
+ - **Recommendation Quality**: High similarity scores (0.7-0.9) for relevant landmarks
+ - **Diversity**: Adds a weighted diversity bonus (`diversity_weight`) to avoid over-concentration in the selected classes
+ - **User Satisfaction**: Rankings are driven by alignment with the stated user preferences
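
Going by `recommend` and `_calculate_diversity_bonus` in the `model.py` added in this same commit, the diversity weighting appears to be an additive bonus on top of the cosine similarity: 0.1 for each class a landmark has that the user did not select, multiplied by `diversity_weight`. The sketch below restates that arithmetic in isolation. The function names and the sample numbers are illustrative, not part of the committed code.

```python
def diversity_bonus(landmark_classes, selected_classes):
    """0.1 for every landmark class that is not among the user's selected classes."""
    return 0.1 * len(set(landmark_classes) - set(selected_classes))


def final_score(similarity, landmark_classes, selected_classes, diversity_weight=0.6):
    """Cosine similarity plus the weighted diversity bonus."""
    return similarity + diversity_weight * diversity_bonus(landmark_classes, selected_classes)


# Hypothetical landmark with one class ("Sports") outside the user's selection
score = final_score(0.8, ["Culture", "Sports"], ["Culture", "Research"])
no_bonus = final_score(0.8, ["Culture"], ["Culture", "Research"])
```

Because the bonus is added after the similarity is computed, a returned score is not a pure cosine similarity and can exceed 1.0 for landmarks with several unselected classes.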
+
+ ## Limitations and Bias
+
+ - **Geographic Scope**: Limited to CMU campus landmarks only
+ - **Static Data**: Based on the current landmark database; may not reflect real-time changes
+ - **User Preference Learning**: Does not learn from user interaction history
+
+ ## Ethical Considerations
+
+ - **Data Privacy**: No personal user data collected
+ - **Fairness**: Recommendations based on objective landmark features
+ - **Transparency**: Feature importance and similarity scores are explainable
+
+ ## How to Use
+
+ ```python
+ from model import ContentBasedRecommender, load_model_from_data
+
+ # Load model from landmarks data
+ recommender = load_model_from_data('data/landmarks.json')
+
+ # Get recommendations
+ recommendations = recommender.recommend(
+     selected_classes=['Culture', 'Research'],
+     indoor_pref='indoor',
+     min_rating=4.0,
+     diversity_weight=0.6,
+     top_k=10
+ )
+
+ # Print top recommendations
+ for landmark_id, score in recommendations:
+     print(f"{landmark_id}: {score:.3f}")
+ ```
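
The README does not describe the layout of `data/landmarks.json`. Judging from the keys that `model.py` reads, each record probably looks like the sketch below. The key names and fallback values are taken from `model.py`; the record values, the `id` string, and the `check_landmark` helper are hypothetical.

```python
import json

# model.py indexes lm['id'] directly, so this key has to be present.
REQUIRED_KEY = "id"

# Keys read with .get() in model.py, paired with the fallbacks used there.
OPTIONAL_DEFAULTS = {
    "rating": 0.0,
    "Class": [],
    "indoor/outdoor": "outdoor",
    "time taken to explore": 30,
    "geocoord": {"lat": 40.4433, "lon": -79.9436},
}


def check_landmark(record):
    """Return the optional keys missing from record; raise if 'id' is absent."""
    if REQUIRED_KEY not in record:
        raise KeyError("landmark record needs an 'id'")
    return [key for key in OPTIONAL_DEFAULTS if key not in record]


# Hypothetical record; the values are made up for illustration.
example = {
    "id": "example-landmark",
    "rating": 4.6,
    "Class": ["Culture", "Research"],
    "indoor/outdoor": "indoor",
    "time taken to explore": 45,
    "geocoord": {"lat": 40.4435, "lon": -79.9450},
}

# The file is assumed to hold a JSON list of such records.
landmarks_json = json.dumps([example], indent=2)
missing = check_landmark(example)
```

A record that omits optional keys still loads, but it silently takes the fallback values, so running a check like this before building the recommender can surface incomplete entries.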
+
+ ## Model Files
+
+ - `model.py`: Main model implementation
+ - `README.md`: This model card
+
+ ## Citation
+
+ ```bibtex
+ @misc{cmu-explorer-recommender,
+   title={Content-Based Recommendation System for CMU Landmarks},
+   author={CMU Explorer Team},
+   year={2024},
+   url={https://huggingface.co/spaces/ysakhale/Tartan-Explore}
+ }
+ ```
+
+ ## Model Card Contact
+
+ For questions about this model, please refer to the [CMU Explorer Space](https://huggingface.co/spaces/ysakhale/Tartan-Explore).
model.py ADDED
@@ -0,0 +1,232 @@
+ """
+ Content-Based Recommendation System for CMU Landmarks
+
+ This model provides personalized landmark recommendations based on user preferences
+ using content-based filtering with cosine similarity.
+ """
+
+ import numpy as np
+ from typing import List, Dict, Tuple, Optional
+ from sklearn.feature_extraction.text import TfidfVectorizer
+ from sklearn.metrics.pairwise import cosine_similarity
+ from sklearn.preprocessing import StandardScaler, LabelEncoder
+ import json
+ import pickle
+
+
+ class ContentBasedRecommender:
+     """
+     Content-Based Recommendation System (Trained-from-scratch)
+
+     Uses landmark features to recommend similar landmarks based on user preferences.
+     This is a trained-from-scratch model that learns from the landmark dataset.
+     """
+
+     def __init__(self, landmarks_data: Optional[List[Dict]] = None):
+         self.landmarks = landmarks_data or []
+         self.feature_matrix = None
+         self.scaler = StandardScaler()
+         self.class_encoder = LabelEncoder()
+         self.landmark_ids = []
+
+         if landmarks_data:
+             self._build_feature_matrix()
+
+     def _build_feature_matrix(self):
+         """Build feature matrix from landmark data"""
+         features = []
+         all_classes = []
+
+         # Collect all unique classes for encoding
+         for lm in self.landmarks:
+             all_classes.extend(lm.get('Class', []))
+
+         unique_classes = sorted(list(set(all_classes)))
+         if unique_classes:
+             self.class_encoder.fit(unique_classes)
+
+         # Create feature vectors for each landmark
+         for lm in self.landmarks:
+             feature_vector = self._extract_features(lm, unique_classes)
+             features.append(feature_vector)
+             self.landmark_ids.append(lm['id'])
+
+         # Convert to numpy array and scale
+         if features:
+             self.feature_matrix = np.array(features)
+             self.feature_matrix = self.scaler.fit_transform(self.feature_matrix)
+
+     def _extract_features(self, landmark: Dict, all_classes: List[str]) -> np.ndarray:
+         """Extract numerical features from a landmark"""
+         features = []
+
+         # Rating (normalized to 0-1)
+         rating = landmark.get('rating', 0.0)
+         features.append(rating / 5.0)
+
+         # Indoor/outdoor (binary encoding)
+         io_type = landmark.get('indoor/outdoor', 'outdoor')
+         features.append(1.0 if io_type == 'indoor' else 0.0)
+
+         # Dwell time (normalized)
+         dwell_min = landmark.get('time taken to explore', 30)
+         features.append(dwell_min / 480.0)
+
+         # Class encoding (multi-hot encoding)
+         class_vector = np.zeros(len(all_classes))
+         landmark_classes = landmark.get('Class', [])
+         for cls in landmark_classes:
+             if cls in all_classes:
+                 idx = all_classes.index(cls)
+                 class_vector[idx] = 1.0
+         features.extend(class_vector)
+
+         # Geographic features (normalized lat/lon around CMU)
+         cmu_lat, cmu_lon = 40.4433, -79.9436
+         geocoord = landmark.get('geocoord', {'lat': cmu_lat, 'lon': cmu_lon})
+         features.append(abs(geocoord['lat'] - cmu_lat) / 0.1)
+         features.append(abs(geocoord['lon'] - cmu_lon) / 0.1)
+
+         return np.array(features)
+
+     def get_user_preference_vector(self, selected_classes: List[str],
+                                    indoor_pref: Optional[str] = None,
+                                    min_rating: float = 0.0) -> np.ndarray:
+         """Create user preference vector from selections (min_rating is accepted but not used here)"""
+         if self.feature_matrix is None or not self.feature_matrix.size:  # None until landmarks are loaded
+             return np.array([])
+
+         all_classes = self.class_encoder.classes_
+
+         # Start with average landmark profile
+         user_vector = np.mean(self.feature_matrix, axis=0)
+
+         # Boost selected classes
+         if selected_classes:
+             class_mask = np.zeros(len(all_classes))
+             for cls in selected_classes:
+                 if cls in all_classes:
+                     idx = list(all_classes).index(cls)
+                     class_mask[idx] = 1.0
+
+             # Add class preferences to user vector
+             class_start_idx = 3  # After rating, indoor/outdoor, dwell_time
+             class_end_idx = class_start_idx + len(all_classes)
+             user_vector[class_start_idx:class_end_idx] += class_mask * 0.5
+
+         # Indoor/outdoor preference
+         if indoor_pref == 'indoor':
+             user_vector[1] += 0.3
+         elif indoor_pref == 'outdoor':
+             user_vector[1] -= 0.3
+
+         return user_vector
+
+     def recommend(self, selected_classes: List[str],
+                   indoor_pref: Optional[str] = None,
+                   min_rating: float = 0.0,
+                   diversity_weight: float = 0.6,
+                   exclude_ids: Optional[List[str]] = None,
+                   top_k: int = 10) -> List[Tuple[str, float]]:
+         """
+         Get recommendations based on user preferences
+
+         Returns list of (landmark_id, similarity_score) tuples
+         """
+         if self.feature_matrix is None or not self.feature_matrix.size:  # None until landmarks are loaded
+             return []
+
+         if exclude_ids is None:
+             exclude_ids = []
+
+         # Get user preference vector
+         user_vector = self.get_user_preference_vector(selected_classes, indoor_pref, min_rating)
+
+         # Calculate similarities
+         similarities = cosine_similarity([user_vector], self.feature_matrix)[0]
+
+         # Filter by minimum rating and excluded IDs
+         filtered_results = []
+         for i, lm in enumerate(self.landmarks):
+             if (lm.get('rating', 0) >= min_rating and
+                     lm['id'] not in exclude_ids and
+                     i < len(similarities)):
+
+                 # Apply diversity weighting
+                 base_score = similarities[i]
+
+                 # Diversity bonus for classes outside the user's selection
+                 class_diversity = self._calculate_diversity_bonus(lm, selected_classes)
+                 final_score = base_score + diversity_weight * class_diversity
+
+                 filtered_results.append((lm['id'], final_score))
+
+         # Sort by score (descending) and return top_k
+         sorted_results = sorted(filtered_results, key=lambda x: x[1], reverse=True)
+         return sorted_results[:top_k]
+
+     def _calculate_diversity_bonus(self, landmark: Dict, selected_classes: List[str]) -> float:
+         """Calculate diversity bonus for a landmark"""
+         landmark_classes = set(landmark.get('Class', []))
+         selected_classes_set = set(selected_classes)
+         new_classes = landmark_classes - selected_classes_set
+         return len(new_classes) * 0.1  # Small bonus for diversity
+
+     def save_model(self, filepath: str):
+         """Save the trained model"""
+         model_data = {
+             'feature_matrix': self.feature_matrix.tolist() if self.feature_matrix is not None else None,
+             'landmark_ids': self.landmark_ids,
+             'scaler_mean': self.scaler.mean_.tolist() if hasattr(self.scaler, 'mean_') else None,
+             'scaler_scale': self.scaler.scale_.tolist() if hasattr(self.scaler, 'scale_') else None,
+             'class_encoder_classes': self.class_encoder.classes_.tolist() if hasattr(self.class_encoder, 'classes_') else None
+         }
+
+         with open(filepath, 'w') as f:
+             json.dump(model_data, f)
+
+     def load_model(self, filepath: str):
+         """Load a trained model (landmark records are not saved, so self.landmarks must be set separately)"""
+         with open(filepath, 'r') as f:
+             model_data = json.load(f)
+
+         self.feature_matrix = np.array(model_data['feature_matrix']) if model_data['feature_matrix'] else None
+         self.landmark_ids = model_data['landmark_ids']
+
+         if model_data['scaler_mean']:
+             self.scaler.mean_ = np.array(model_data['scaler_mean'])
+             self.scaler.scale_ = np.array(model_data['scaler_scale'])
+
+         if model_data['class_encoder_classes']:
+             self.class_encoder.classes_ = np.array(model_data['class_encoder_classes'])
+
+
+ def load_model_from_data(data_path: str) -> ContentBasedRecommender:
+     """Load model from landmarks data"""
+     with open(data_path, 'r') as f:
+         landmarks = json.load(f)
+
+     recommender = ContentBasedRecommender(landmarks)
+     return recommender
+
+
+ # Example usage
+ if __name__ == "__main__":
+     # Load landmarks data
+     with open('data/landmarks.json', 'r') as f:
+         landmarks = json.load(f)
+
+     # Initialize recommender
+     recommender = ContentBasedRecommender(landmarks)
+
+     # Get recommendations
+     recommendations = recommender.recommend(
+         selected_classes=['Culture', 'Research'],
+         indoor_pref='indoor',
+         min_rating=4.0,
+         top_k=5
+     )
+
+     print("Top 5 recommendations:")
+     for lm_id, score in recommendations:
+         print(f"{lm_id}: {score:.3f}")
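
The way `get_user_preference_vector` assembles its vector depends on a fixed column layout: rating, indoor flag, dwell time, then one column per class, then the two geographic columns. The sketch below walks through that layout using plain lists. It is an illustration under assumptions, not the committed method: `build_user_vector` is a hypothetical helper, and the sample rows are made up and left unscaled for readability, whereas `model.py` averages the already standardized matrix.

```python
def build_user_vector(feature_rows, vocabulary, selected_classes, indoor_pref=None):
    """Average landmark profile, boosted for the chosen classes and indoor preference."""
    n_cols = len(feature_rows[0])
    user = [sum(row[j] for row in feature_rows) / len(feature_rows) for j in range(n_cols)]

    class_start = 3  # columns 0-2 hold rating, indoor flag, dwell time
    for cls in selected_classes:
        if cls in vocabulary:
            user[class_start + vocabulary.index(cls)] += 0.5

    # Column 1 is the indoor flag: shift it toward the stated preference.
    if indoor_pref == "indoor":
        user[1] += 0.3
    elif indoor_pref == "outdoor":
        user[1] -= 0.3
    return user


# Hypothetical rows: [rating, indoor, dwell, Culture, Research, lat_offset, lon_offset]
vocabulary = ["Culture", "Research"]
rows = [
    [0.9, 1.0, 0.1, 1.0, 0.0, 0.0, 0.0],
    [0.7, 0.0, 0.2, 0.0, 1.0, 0.0, 0.0],
]
user = build_user_vector(rows, vocabulary, ["Research"], indoor_pref="indoor")
```

Since the class columns start at a hard-coded index, any change to the order of features in `_extract_features` would need a matching change to that offset.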