Ride Booking Value Analysis & Prediction
Overview
This project analyzes a large ride-booking dataset to understand the drivers behind Booking Value and creates predictive models. The project follows a complete Data Science lifecycle, initially attempting to predict exact prices via regression, creating engineered features to capture behavioral patterns, clustering rides, and finally reframing the task into a successful classification problem.
Key Objectives:
- Predict booking price (Regression)
- Create engineered features for better performance
- Cluster rides into behavioral groups
- Reframe the task into Classification (High vs Low value rides)
- Train multiple classification models and select a winner
- Export the best model to Hugging Face
Workflow:
EDA → Cleaning → Feature Engineering → Regression → Clustering → Classification → Model Export
Dataset
Each row in the dataset represents one completed ride.
Key Variables:
- Ride_Distance: Distance of the trip.
- Time & Hour: Temporal features.
- Driver_Rating & Customer_Rating.
- Payment_Method.
- Target: Booking_Value (Price).
Cleaning Process:
- Removed missing & invalid values.
- Filtered outliers using the IQR rule.
- Converted categorical variables using One-Hot Encoding.
- Removed high-cardinality columns (e.g., specific locations) to prevent overfitting.
- Removed data leakage columns (features not available at prediction time).
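The IQR filter and One-Hot Encoding steps above can be sketched roughly as follows. This is a minimal illustration on made-up toy data, not the notebook's actual code: the helper name `filter_iqr` is hypothetical, and the 1.5 multiplier is the conventional default, assumed here because the README does not state it.

```python
import pandas as pd


def filter_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df.loc[mask].copy()


# Toy data (hypothetical values, not taken from the real dataset)
rides = pd.DataFrame({
    "Booking_Value": [120.0, 150.0, 180.0, 200.0, 210.0, 5000.0],
    "Payment_Method": ["Cash", "UPI", "Card", "UPI", "Cash", "Card"],
})

# Drop the extreme booking value, then one-hot encode the categorical column
cleaned = filter_iqr(rides, "Booking_Value")
encoded = pd.get_dummies(cleaned, columns=["Payment_Method"])
```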
Exploratory Data Analysis (EDA)
Through statistical analysis and visualization, we uncovered critical insights about the data distribution.
- Distribution: Investigated the spread of Booking Values.
- Correlations: Analyzed the relationship between distance/time and price.
(Distribution of Booking Value)
(Scatter plot: Booking Value vs. Ride Distance)
Core Insight: Booking_Value does NOT correlate strongly with standard ride metrics like distance, hour, or rating in this dataset. This lack of linear correlation explains why standard regression models struggled to predict exact prices.
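A correlation check of this kind can be sketched as below. The data is synthetic, with a target generated independently of the features, so it only shows what "no strong linear correlation" looks like. The column name `Hour` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the ride table (values are made up)
rides = pd.DataFrame({
    "Ride_Distance": rng.uniform(1, 40, size=1000),
    "Hour": rng.integers(0, 24, size=1000),
    "Driver_Rating": rng.uniform(3, 5, size=1000),
    "Booking_Value": rng.normal(500, 150, size=1000),  # independent of the features
})

# Pearson correlation of every feature with the target
correlations = rides.corr()["Booking_Value"].drop("Booking_Value")
print(correlations.round(3))
```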
Feature Engineering
To improve model performance, we engineered new features to capture non-linear patterns and interactions:
- Rush_Hour: Binary flag for peak times.
- Distance_Category: Binning distances into Short / Medium / Long.
- Distance_x_Hour: Interaction feature.
- Polynomial Features: Ride_Distance² to capture non-linear effects.
- Cluster_Label: Derived from Unsupervised Learning (K-Means).
- Distance_to_Centroid: Numerical anomaly measure.
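The first four features in the list above could be built along these lines. This is a sketch under stated assumptions: the README does not give the peak-hour windows or the distance bin edges, so the values below are placeholders, and the column names `Hour` and `Ride_Distance_Sq` are hypothetical.

```python
import pandas as pd

# Toy frame; the values are made up
rides = pd.DataFrame({
    "Ride_Distance": [2.0, 7.5, 18.0, 30.0],
    "Hour": [8, 13, 18, 23],
})

# Rush_Hour: assumed peak windows of 7-9 and 17-19 (not stated in the README)
rides["Rush_Hour"] = rides["Hour"].isin([7, 8, 9, 17, 18, 19]).astype(int)

# Distance_Category: assumed bin edges at 5 and 15 distance units
rides["Distance_Category"] = pd.cut(
    rides["Ride_Distance"],
    bins=[0, 5, 15, float("inf")],
    labels=["Short", "Medium", "Long"],
)

# Interaction and polynomial terms
rides["Distance_x_Hour"] = rides["Ride_Distance"] * rides["Hour"]
rides["Ride_Distance_Sq"] = rides["Ride_Distance"] ** 2
```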
Regression Modeling
We trained three regression models to predict the exact price:
- Linear Regression
- Random Forest Regressor
- Gradient Boosting Regressor
Result: All models achieved an R² score near 0.
- Meaning: Booking Value cannot be predicted well from available numerical/categorical features.
- Hypothesis: Pricing is likely driven by external business logic (surge pricing, specific promotions, internal rules) not present in the dataset features.
Winner: Baseline Linear Regression (simplest model with similar performance to complex ones).
- File: winner_model.pkl
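A comparison of the three regressors can be sketched as follows. The data is synthetic and the target is deliberately generated independently of the features, to illustrate what an R² near 0 looks like. The hyperparameters and the train/test split are assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # stand-in features
y = rng.normal(loc=500, scale=100, size=500)   # target unrelated to X

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Fit each model and record its R² on the held-out split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: R2 = {score:.3f}")
```

When the features carry no signal, the more flexible models tend to score at or below zero on held-out data, which is one reason to prefer the simplest model in that situation.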
Clustering
We applied K-Means Clustering (k=4) to group rides based on Distance and Ride Time.
Identified Clusters:
- Short morning rides
- Long morning rides
- Short evening rides
- Long evening rides
(PCA Visualization of Ride Clusters)
These clusters were added as features (Cluster_Label) to the supervised learning models.
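One way to derive Cluster_Label and Distance_to_Centroid with scikit-learn is sketched below. The data is synthetic, `Hour` stands in for the ride-time column, and the standardization step is an assumption; the README does not say whether features were scaled before clustering.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic rides (values are made up)
rides = pd.DataFrame({
    "Ride_Distance": rng.uniform(1, 40, size=200),
    "Hour": rng.integers(0, 24, size=200),
})

# Scale both features so neither dominates the Euclidean distance (assumed step)
scaled = StandardScaler().fit_transform(rides[["Ride_Distance", "Hour"]])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaled)

rides["Cluster_Label"] = kmeans.labels_
# transform() gives the distance to every centroid; the minimum is the
# distance to the nearest (assigned) centroid
rides["Distance_to_Centroid"] = kmeans.transform(scaled).min(axis=1)
```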
Pivot: Regression → Classification
Since exact price prediction was not feasible, we reframed the business problem: Can we classify rides as High Value vs. Low Value?
- Method: Median Split.
- Classes: 0 = Low Value, 1 = High Value.
- Balance: The split produced nearly balanced classes (~50% / 50%).
Strategy: Since classes are balanced, Accuracy is a valid metric. However, for business purposes, Recall is prioritized to avoid missing out on High-Value customers (False Negatives).
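The median split can be sketched in a few lines. The values are toy data, the name `High_Value` is hypothetical, and treating rides exactly at the median as Low Value (strict `>`) is an assumption.

```python
import pandas as pd

# Toy booking values (made up)
booking_value = pd.Series(
    [120.0, 340.0, 90.0, 560.0, 410.0, 275.0], name="Booking_Value"
)

threshold = booking_value.median()

# 1 = High Value (strictly above the median), 0 = Low Value
high_value = (booking_value > threshold).astype(int).rename("High_Value")

# Share of each class; a median split gives roughly 50% / 50%
balance = high_value.value_counts(normalize=True)
```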
Classification Models
We trained and evaluated three classifiers:
- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier
Metrics Evaluated: Accuracy, Recall, Precision, F1-Score, ROC-AUC.
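The evaluation loop for the three classifiers might look like the sketch below. It runs on a synthetic dataset from `make_classification`, so the scores it prints say nothing about the real ride data. The hyperparameters and split are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for High vs Low value rides
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_score = clf.predict_proba(X_test)[:, 1]  # probability of class 1
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```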
The Winner: Random Forest Classifier
Reasons for selection:
- Highest ROC-AUC score.
- Best Recall for the positive class (High Value).
- Fewer critical mistakes (False Negatives) in the confusion matrix compared to others.
Model Export & Usage
The final models have been serialized and uploaded to this repository.
- winner_model.pkl: Best Regression Model (Baseline).
- best_classifier.pkl: Best Classification Model (Random Forest).
How to Load the Classification Model
You can load the model using Python's pickle module:
import pickle
import pandas as pd
# Load the model
with open("best_classifier.pkl", "rb") as f:
    model = pickle.load(f)
# Example: Predict on new data
# Ensure X_sample has the same engineered features as training data
# prediction = model.predict(X_sample)
Conclusion
This project demonstrates a full machine learning workflow:
- Data cleaning & feature engineering
- Regression modeling and Hugging Face deployment
- Task reframing into classification
- Three classifier evaluations with confusion matrices
Main Insights:
- Booking Value cannot be predicted accurately with available features.
- Classification also performs at chance level.
- Distance and ratings impact price slightly but do not capture true pricing logic.
- Random Forest was the most stable model across tasks.
Project Presentation Video
To watch the full 4-minute walkthrough explaining the project, models, insights, and results:
YouTube Video: https://youtu.be/L5VrUY1rL5s
Google Colab Notebook
The full code used for data cleaning, EDA, feature engineering, regression, classification, and model export is available here:
Colab Notebook: https://colab.research.google.com/drive/1EJgBJnDBvAhnixCsTulG9HmPVDMPkIF3#scrollTo=Kw06OJESuWGp
