🚖 Ride Booking Value Analysis & Prediction

📌 Overview

This project analyzes a large ride-booking dataset to understand which factors determine Booking Value and to build predictive models for it. It follows a complete Data Science lifecycle: an initial attempt to predict exact prices via regression, feature engineering to capture behavioral patterns, clustering of rides, and finally a reframing of the task as a classification problem.

Key Objectives:

Predict booking price (Regression)
Create engineered features for better performance
Cluster rides into behavioral groups
Reframe the task into Classification (High vs Low value rides)
Train multiple classification models and select a winner
Export the best model to Hugging Face

Workflow: EDA → Cleaning → Feature Engineering → Regression → Clustering → Classification → Model Export

📂 Dataset

Each row in the dataset represents one completed ride.

Key Variables:

Ride_Distance: Distance of the trip.
Time & Hour: Temporal features.
Driver_Rating & Customer_Rating.
Payment_Method.
Target: Booking_Value (Price).

Cleaning Process:

Removed missing & invalid values.
Filtered outliers using the IQR rule.
Converted categorical variables using One-Hot Encoding.
Removed high-cardinality columns (e.g., specific locations) to prevent overfitting.
Removed data leakage columns (features not available at prediction time).
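
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name `filter_iqr`, the 1.5 multiplier, and the toy values are assumptions made for the example.

```python
import pandas as pd

def filter_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Toy data (hypothetical values, not taken from the project dataset)
rides = pd.DataFrame({
    "Booking_Value": [120.0, 150.0, 130.0, 140.0, 5000.0, None],
    "Payment_Method": ["Cash", "Card", "Card", "UPI", "Cash", "Card"],
})

clean = rides.dropna(subset=["Booking_Value"])    # remove missing values
clean = filter_iqr(clean, "Booking_Value")        # drop IQR outliers
encoded = pd.get_dummies(clean, columns=["Payment_Method"])  # one-hot encode
print(encoded)
```

In this toy frame the 5000.0 row falls outside the IQR bounds and is dropped, along with the row that has a missing value.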

🔍 Exploratory Data Analysis (EDA)

Through statistical analysis and visualization, we uncovered critical insights about the data distribution.

Distribution: Investigated the spread of Booking Values.
Correlations: Analyzed the relationship between distance/time and price.

(Distribution of Booking Value)

(Scatter plot: Booking Value vs. Ride Distance)

Core Insight: Booking_Value does NOT correlate strongly with standard ride metrics like distance, hour, or rating in this dataset. This lack of linear correlation explains why standard regression models struggled to predict exact prices.
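
A correlation check of this kind might look like the sketch below. The data here is synthetic and generated so that the target is independent of the other columns. It only illustrates the check itself and does not reproduce the project's numbers. The column name `Hour` is an assumption based on the variable list.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-in: Booking_Value is drawn independently of the other columns
df = pd.DataFrame({
    "Ride_Distance": rng.uniform(1, 50, n),
    "Hour": rng.integers(0, 24, n),
    "Driver_Rating": rng.uniform(3, 5, n),
    "Booking_Value": rng.uniform(50, 1000, n),
})

# Pearson correlation of every feature with the target
corr_with_target = df.corr(method="pearson")["Booking_Value"].drop("Booking_Value")
print(corr_with_target.round(3))
```

Coefficients close to zero, as in this synthetic case, indicate that a linear model has little signal to work with.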

🛠 Feature Engineering

To improve model performance, we engineered new features to capture non-linear patterns and interactions:

Rush_Hour: Binary flag for peak times.
Distance_Category: Binning distances into Short / Medium / Long.
Distance_x_Hour: Interaction feature.
Polynomial Features: Ride_Distance² to capture non-linear effects.
Cluster_Label: Derived from Unsupervised Learning (K-Means).
Distance_to_Centroid: Numerical anomaly measure.
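
As a rough sketch, the first four features could be built with pandas as shown here. The peak-hour windows, the distance bin edges, and the column names `Hour` and `Ride_Distance_Sq` are assumptions chosen for illustration; the notebook may define them differently.

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Assumed peak windows: 07:00-09:59 and 17:00-19:59
    out["Rush_Hour"] = out["Hour"].isin([7, 8, 9, 17, 18, 19]).astype(int)
    # Assumed bin edges: (0, 5], (5, 15], (15, inf)
    out["Distance_Category"] = pd.cut(
        out["Ride_Distance"],
        bins=[0, 5, 15, float("inf")],
        labels=["Short", "Medium", "Long"],
    )
    # Interaction and polynomial terms
    out["Distance_x_Hour"] = out["Ride_Distance"] * out["Hour"]
    out["Ride_Distance_Sq"] = out["Ride_Distance"] ** 2
    return out

sample = pd.DataFrame({"Ride_Distance": [2.0, 10.0, 30.0], "Hour": [8, 13, 18]})
features = add_features(sample)
print(features)
```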

📈 Regression Modeling

We trained three regression models to predict the exact price:

Linear Regression
Random Forest Regressor
Gradient Boosting Regressor

Result: All models achieved an R² score near 0.

Meaning: Booking Value cannot be predicted well from available numerical/categorical features.
Hypothesis: Pricing is likely driven by external business logic (surge pricing, specific promotions, internal rules) not present in the dataset features.

Winner: Baseline Linear Regression (simplest model with similar performance to complex ones).

File: winner_model.pkl
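
The comparison described in this section can be sketched with scikit-learn as follows. The data is synthetic, with a target deliberately unrelated to the features, so it only mimics the near-zero R² situation. The hyperparameters and split sizes are assumptions, not the project's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(600, 4))
y = rng.uniform(50, 1000, size=600)  # target unrelated to X by construction

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=30, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=30, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R2 = {scores[name]:.3f}")
```

When all candidates score about the same on held-out data, preferring the simplest one is a reasonable tie-breaker, which matches the choice of the linear baseline here.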

🎯 Clustering

We applied K-Means Clustering (k=4) to group rides based on Distance and Ride Time.

Identified Clusters:

Short morning rides
Long morning rides
Short evening rides
Long evening rides

(PCA Visualization of Ride Clusters)

These clusters were added as features (Cluster_Label) to the supervised learning models.
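
A minimal sketch of how Cluster_Label and Distance_to_Centroid could be derived from K-Means is shown below. It assumes that "Ride Time" means hour of day and that features are standardized before clustering; both are assumptions, and the four groups are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Four synthetic groups of (distance, hour of day): short/long x morning/evening
centers = np.array([[3, 8], [25, 8], [3, 19], [25, 19]])
X = np.vstack([c + rng.normal(0, 1.0, size=(50, 2)) for c in centers])

# Scale so that distance and hour contribute comparably
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)

cluster_label = kmeans.labels_
# transform() returns the distance to every centroid; keep the nearest one
distance_to_centroid = kmeans.transform(X_scaled).min(axis=1)

print(np.bincount(cluster_label))
print(distance_to_centroid[:5].round(3))
```

A large distance to the nearest centroid marks a ride that fits none of the groups well, which is why it can serve as a numerical anomaly measure.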

🔄 Pivot: Regression → Classification

Since exact price prediction was not feasible, we reframed the business problem: Can we classify rides as High Value vs. Low Value?

Method: Median Split.
Classes:
- 0: Low Value
- 1: High Value
Balance: By construction, the median split yields approximately balanced classes (~50% / 50%).

Strategy: Since classes are balanced, Accuracy is a valid metric. However, for business purposes, Recall is prioritized to avoid missing out on High-Value customers (False Negatives).
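
The median split can be sketched in a few lines of pandas. The values below are made up, and treating rides strictly above the median as High Value is an assumption about how ties are handled.

```python
import pandas as pd

# Hypothetical booking values
booking_value = pd.Series(
    [120.0, 300.0, 80.0, 450.0, 210.0, 95.0], name="Booking_Value"
)

median = booking_value.median()
# 1 = High Value (strictly above the median), 0 = Low Value
high_value = (booking_value > median).astype(int)

print(median)
print(high_value.value_counts(normalize=True))
```

To avoid leaking information from the test set, the median would normally be computed on the training split only and then applied to the test split.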

🤖 Classification Models

We trained and evaluated three classifiers:

Logistic Regression
Random Forest Classifier
Gradient Boosting Classifier

Metrics Evaluated: Accuracy, Recall, Precision, F1-Score, ROC-AUC.
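
An evaluation loop over the three classifiers and five metrics might look like this sketch. It uses a synthetic dataset from `make_classification`, and the hyperparameters are assumptions, so the printed scores say nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    proba = clf.predict_proba(X_test)[:, 1]  # probability of class 1 (High Value)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```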

🏆 The Winner: Random Forest Classifier

Reasons for selection:

Highest ROC-AUC score.
Best Recall for the positive class (High Value).
Fewer critical mistakes (False Negatives) in the confusion matrix compared to others.

Confusion Matrix Analysis:
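
As an illustration of how such a matrix is read, the sketch below counts False Negatives (High Value rides predicted as Low Value) with scikit-learn. The labels are invented for the example and are not the model's real predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = High Value, 0 = Low Value
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

recall = tp / (tp + fn)  # share of High Value rides that were caught
print(f"TN={tn} FP={fp} FN={fn} TP={tp} recall={recall:.2f}")
```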

📦 Model Export & Usage

The final models have been serialized and uploaded to this repository.

winner_model.pkl: Best Regression Model (Baseline).
best_classifier.pkl: Best Classification Model (Random Forest).

How to Load the Classification Model

You can load the model using Python's pickle module:

import pickle
import pandas as pd

# Load the model
with open("best_classifier.pkl", "rb") as f:
    model = pickle.load(f)

# Example: Predict on new data
# Ensure X_sample has the same engineered features as training data
# prediction = model.predict(X_sample)

Conclusion

This project demonstrates a full machine learning workflow:

Data cleaning & feature engineering
Regression modeling and Hugging Face deployment
Task reframing into classification
Three classifier evaluations with confusion matrices

Main Insights:

Booking Value cannot be predicted accurately with available features.
Classification also performs at chance level.
Distance and ratings impact price slightly but do not capture true pricing logic.
Random Forest was the most stable model across tasks.

🎥 Project Presentation Video

To watch the full 4-minute walkthrough explaining the project, models, insights, and results:

▶ YouTube Video: https://youtu.be/L5VrUY1rL5s

📓 Google Colab Notebook

The full code used for data cleaning, EDA, feature engineering, regression, classification, and model export is available here:

🔗 Colab Notebook: https://colab.research.google.com/drive/1EJgBJnDBvAhnixCsTulG9HmPVDMPkIF3#scrollTo=Kw06OJESuWGp
