metadata
title: Home Credit Default Risk Prediction
emoji: 🍃
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: true
license: mit
short_description: ML Classification models applied to Home Credit Risk dataset

🏦 Home Credit Default Risk Prediction

Table of Contents

  1. Project Description
  2. Methodology & Key Features
  3. Technology Stack
  4. Dataset

1. Project Description

This project focuses on building a machine learning pipeline to predict a client's ability to repay a loan. It is a binary classification task that uses a real-world financial dataset to identify clients who may face payment difficulties.

The project goes beyond training a standard model and delivers a practical application. It:

  • Preprocesses and cleans the dataset for model training.
  • Trains a machine learning model to predict loan repayment risk.
  • Deploys an interactive predictor app using Marimo, hosted on Hugging Face Spaces.
  • Allows users to make predictions by providing the top 10 most influential features.

This work showcases a complete end-to-end workflow, transforming raw data into a functional, user-friendly tool for risk assessment.
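As a rough illustration of how a "top 10 most influential features" list can be derived, the sketch below ranks features by an importance score and keeps the first ten. The feature names and scores are hypothetical placeholders, not values from this project; in practice the scores would come from the trained model (for example, the `feature_importances_` attribute of LightGBM's scikit-learn wrapper).

```python
def top_k_features(importances, k=10):
    """Return the names of the k features with the highest importance."""
    # Sort (name, score) pairs by score, largest first.
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical importance scores: placeholder names and made-up numbers.
importances = {
    "feature_01": 310, "feature_02": 295, "feature_03": 280,
    "feature_04": 150, "feature_05": 142, "feature_06": 130,
    "feature_07": 121, "feature_08": 98, "feature_09": 87,
    "feature_10": 64, "feature_11": 12, "feature_12": 5,
}

top_features = top_k_features(importances, k=10)
print(top_features)
```

Restricting the app's inputs to this short list keeps the form manageable for a user, at the cost of ignoring whatever signal the remaining features carry.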


2. Methodology & Key Features

  • Model Selection: Four different models were trained and evaluated, with LightGBM selected as the final model due to its superior performance, achieving a ROC AUC score of 0.751 on the test set.
  • Automated Preprocessing: The data preprocessing pipeline handles common tasks such as feature scaling and categorical encoding, ensuring the model receives clean and formatted data.
  • Interactive Predictor: An application built with Marimo allows users to interact with the trained model directly. It uses the top 10 most important features—identified from the final LightGBM model—to generate real-time predictions.
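To make the reported metric concrete, here is a minimal sketch of what ROC AUC measures: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The labels and scores are toy values for illustration only, and the convention that `1` marks a client with payment difficulties is an assumption of this example. This is not the project's evaluation code; a library routine such as scikit-learn's `roc_auc_score` is the usual choice on real data.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data: 1 = payment difficulties (assumed convention), scores are made up.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]

auc = roc_auc(y_true, y_score)
print(f"ROC AUC: {auc:.3f}")
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect one, so 0.751 means the model ranks a client with payment difficulties above one without in roughly three out of four such pairs. The pairwise loop is quadratic in the number of samples and is written for readability rather than speed.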

3. Technology Stack

This project was built using the following technologies and libraries:

Dashboard & Hosting:

  • Marimo: A reactive Python notebook that can be deployed as an interactive app or dashboard.
  • Hugging Face Spaces: Used for hosting and sharing the interactive dashboard.

Data Analysis & Visualization:

  • Pandas: For data manipulation and analysis.
  • Matplotlib: For creating static visualizations.
  • Seaborn: For creating statistical graphics.

Modeling & Training:

  • Scikit-Learn: For machine learning tasks such as preprocessing, feature engineering, and model training.
  • LightGBM: A gradient boosting framework that uses tree-based learning algorithms.

Development Tools:

  • Ruff: A fast Python linter and code formatter.
  • uv: A fast Python package installer and resolver.

4. Dataset

This project uses the Home Credit Default Risk dataset from Kaggle, which is public and contains details on over 246,000 individuals who have made payments on their loans.