GitPulse: Multimodal Time Series Prediction for GitHub Project Health

GitPulse is a multimodal Transformer-based model that combines project text descriptions with historical activity data to predict GitHub project health metrics.

Model Description

GitPulse leverages both textual metadata (project descriptions, topics) and historical time series (commits, issues, stars, etc.) to forecast future project activity. The key innovation is the adaptive fusion mechanism that dynamically balances text and time-series features.

Architecture

  • Text Encoder: DistilBERT-based encoder with attention pooling
  • Time Series Encoder: Transformer encoder with positional embeddings
  • Adaptive Fusion: Dynamic gating mechanism for multimodal fusion
  • Prediction Head: MLP for generating future predictions
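The adaptive fusion component described above can be pictured as a learned gate that decides, per feature dimension, how much to trust the text representation versus the time-series representation. The sketch below is illustrative only (class and argument names are assumptions, not the released implementation):

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Illustrative gated fusion: a learned sigmoid gate balances
    time-series features against text features, per dimension."""
    def __init__(self, d_model: int = 128):
        super().__init__()
        # Gate is computed from both modalities concatenated together
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, ts_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # ts_feat, text_feat: [batch, d_model]
        g = self.gate(torch.cat([ts_feat, text_feat], dim=-1))
        # Convex combination: g -> 1 favors time series, g -> 0 favors text
        return g * ts_feat + (1 - g) * text_feat

fusion = AdaptiveFusion(d_model=128)
fused = fusion(torch.randn(2, 128), torch.randn(2, 128))
print(fused.shape)  # torch.Size([2, 128])
```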

Model Parameters

Parameter   Value
d_model     128
n_heads     4
n_layers    2
hist_len    128
pred_len    32
n_vars      16
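To make the hyperparameters concrete, here is a shape-only sketch of a time-series encoder consistent with the table (d_model=128, n_heads=4, n_layers=2, inputs of shape [batch, hist_len=128, n_vars=16]); the class and layer choices are assumptions for illustration, not the actual GitPulse code:

```python
import torch
import torch.nn as nn

class TSEncoder(nn.Module):
    """Illustrative Transformer encoder with learned positional embeddings."""
    def __init__(self, n_vars=16, d_model=128, n_heads=4, n_layers=2, hist_len=128):
        super().__init__()
        self.proj = nn.Linear(n_vars, d_model)                      # embed the 16 variables
        self.pos = nn.Parameter(torch.zeros(1, hist_len, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):                          # x: [batch, hist_len, n_vars]
        h = self.encoder(self.proj(x) + self.pos)  # [batch, hist_len, d_model]
        return h.mean(dim=1)                       # pooled: [batch, d_model]

enc = TSEncoder()
out = enc(torch.randn(2, 128, 16))
print(out.shape)  # torch.Size([2, 128])
```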

Performance

Evaluated on 636 test samples from 4,232 GitHub projects:

Model          MSE ↓    MAE ↓    R² ↑     DA ↑     TA@0.2 ↑
GitPulse       0.0755   0.1094   0.7559   86.68%   81.60%
CondGRU+Text   0.0915   0.1204   0.7043   84.05%   80.14%
Transformer    0.1142   0.1342   0.6312   84.02%   78.87%
LSTM           0.2142   0.1914   0.3800   56.00%   75.00%
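The card does not define DA and TA@0.2, so the sketch below uses one plausible reading: DA (directional accuracy) as the fraction of steps where the predicted change has the same sign as the true change, and TA@0.2 as the fraction of predictions within an absolute tolerance of 0.2. Treat both definitions as assumptions:

```python
import numpy as np

def directional_accuracy(y_true, y_pred):
    """Fraction of steps where the predicted change matches the true change's sign."""
    true_delta = np.diff(y_true, axis=0)
    pred_delta = np.diff(y_pred, axis=0)
    return np.mean(np.sign(true_delta) == np.sign(pred_delta))

def tolerance_accuracy(y_true, y_pred, tol=0.2):
    """Fraction of predictions within an absolute tolerance of the target."""
    return np.mean(np.abs(y_true - y_pred) <= tol)

y_true = np.array([0.1, 0.3, 0.2, 0.5])
y_pred = np.array([0.15, 0.35, 0.25, 0.4])
print(directional_accuracy(y_true, y_pred))  # 1.0
print(tolerance_accuracy(y_true, y_pred))    # 1.0
```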

Text Contribution

Architecture             TS-Only R²   +Text R²   Improvement
Transformer → GitPulse   0.6312       0.7559     +19.8%
CondGRU → CondGRU+Text   0.3328       0.7043     +111.6%

Usage

Installation

pip install torch transformers

Quick Start

import torch
from transformers import DistilBertTokenizer

# Load model
from model import GitPulseModel
model = GitPulseModel.from_pretrained('./')

# Prepare inputs
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
text = "A Python library for machine learning"
encoded = tokenizer(text, padding='max_length', truncation=True, 
                    max_length=128, return_tensors='pt')

# Time series: [batch, hist_len, n_vars]
time_series = torch.randn(1, 128, 16)

# Predict
model.eval()
with torch.no_grad():
    predictions = model(
        time_series,
        input_ids=encoded['input_ids'],
        attention_mask=encoded['attention_mask']
    )
# predictions shape: [1, 32, 16]

Inference API

# Simple prediction interface
predictions = model.predict(
    time_series=history_data,  # [batch, 128, 16]
    text="Project description...",
    tokenizer=tokenizer
)

Training Details

  • Dataset: GitHub project activity data (4,232 projects)
  • Train/Val/Test Split: 70% / 15% / 15%
  • Optimizer: AdamW (lr=1e-5, weight_decay=0.01)
  • Fine-tuning Strategy: Freeze encoder, train prediction head
  • Hardware: NVIDIA RTX GPU
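The freeze-encoder fine-tuning strategy above can be sketched with a toy stand-in model: disable gradients on the encoder and pass only the head's parameters to AdamW with the stated lr and weight decay. The modules here are placeholders, not the real GitPulse components:

```python
import torch
import torch.nn as nn

# Toy stand-in: a frozen "encoder" plus a trainable prediction head
encoder = nn.Linear(16, 128)
head = nn.Linear(128, 16)

for p in encoder.parameters():
    p.requires_grad = False  # freeze the encoder

# Optimize only the head, matching the card's settings
optimizer = torch.optim.AdamW(
    (p for p in head.parameters() if p.requires_grad),
    lr=1e-5, weight_decay=0.01,
)

x, y = torch.randn(8, 16), torch.randn(8, 16)
loss = nn.functional.mse_loss(head(encoder(x)), y)
loss.backward()
optimizer.step()

# Frozen encoder accumulates no gradients; the head does
print(encoder.weight.grad is None, head.weight.grad is not None)  # True True
```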

Input Features (16 variables)

  1. Commits count
  2. Issues opened
  3. Issues closed
  4. Pull requests opened
  5. Pull requests merged
  6. Stars gained
  7. Forks count
  8. Contributors count
  9. Code additions
  10. Code deletions
  11. Comments count
  12. Releases count
  13. Wiki updates
  14. Discussions count
  15. Sponsors count
  16. Watchers count
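A hypothetical sketch of assembling these 16 variables into the [hist_len, n_vars] layout the model expects; the feature names, synthetic counts, and per-variable z-score normalization are all illustrative assumptions:

```python
import numpy as np

# Illustrative short names for the 16 variables listed above
FEATURES = [
    "commits", "issues_opened", "issues_closed", "prs_opened", "prs_merged",
    "stars_gained", "forks", "contributors", "additions", "deletions",
    "comments", "releases", "wiki_updates", "discussions", "sponsors", "watchers",
]

hist_len = 128
rng = np.random.default_rng(0)
# Stand-in activity counts; real data would come from the GitHub API
raw = {name: rng.poisson(5, size=hist_len) for name in FEATURES}

# Stack to [hist_len, n_vars] and z-score each variable independently
X = np.stack([raw[name] for name in FEATURES], axis=1).astype(np.float32)
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
print(X.shape)  # (128, 16)
```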

Limitations

  • Trained on English project descriptions only
  • Best suited for projects with at least 128 months of history
  • Performance may vary for niche domains not well represented in the training data

Citation

@article{gitpulse2024,
  title={GitPulse: Multimodal Time Series Prediction for GitHub Project Health},
  author={Anonymous},
  journal={arXiv preprint},
  year={2024}
}

License

Apache 2.0
