GitPulse: Multimodal Time Series Prediction for GitHub Project Health
GitPulse is a multimodal Transformer-based model that combines project text descriptions with historical activity data to predict GitHub project health metrics.
Model Description
GitPulse uses both textual metadata (project descriptions, topics) and historical time series (commits, issues, stars, etc.) to forecast future project activity. Its key innovation is an adaptive fusion mechanism: a dynamic gate that balances text and time-series features.
Architecture
- Text Encoder: DistilBERT-based encoder with attention pooling
- Time Series Encoder: Transformer encoder with positional embeddings
- Adaptive Fusion: Dynamic gating mechanism for multimodal fusion
- Prediction Head: MLP for generating future predictions
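The card does not spell out how the adaptive fusion gate is computed. As a minimal sketch of one common form of dynamic gating, assuming the text encoder yields one pooled vector per project and the time-series encoder yields one vector per time step, a sigmoid gate can blend the two per feature. The class name `GatedFusion` and its layout are hypothetical and are not taken from the released code.

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Hypothetical gating layer; an illustration, not the GitPulse implementation."""

    def __init__(self, d_model: int = 128):
        super().__init__()
        # The gate sees both modalities and emits one weight in (0, 1) per feature.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, ts_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # ts_feat: [batch, hist_len, d_model]; text_feat: [batch, d_model]
        text_feat = text_feat.unsqueeze(1).expand_as(ts_feat)
        g = self.gate(torch.cat([ts_feat, text_feat], dim=-1))
        # g near 1 favors the time-series features, g near 0 favors the text features.
        return g * ts_feat + (1.0 - g) * text_feat


fusion = GatedFusion(d_model=128)
ts_feat = torch.randn(2, 128, 128)   # placeholder encoder outputs
text_feat = torch.randn(2, 128)
fused = fusion(ts_feat, text_feat)   # same shape as ts_feat
```

Because the gate is computed from the inputs themselves, the balance between modalities can differ per project and per time step, which is what "dynamic" suggests here.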
Model Parameters
| Parameter | Value |
|---|---|
| d_model | 128 |
| n_heads | 4 |
| n_layers | 2 |
| hist_len | 128 |
| pred_len | 32 |
| n_vars | 16 |
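To make the relationship between these values concrete, the sketch below collects them in a small dataclass and derives the tensor shapes used in the usage example. The class name `GitPulseConfig` is hypothetical, and the comments give the usual reading of each parameter name rather than anything confirmed by the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GitPulseConfig:  # hypothetical name, not part of the released code
    d_model: int = 128   # model (embedding) width
    n_heads: int = 4     # attention heads
    n_layers: int = 2    # encoder layers
    hist_len: int = 128  # input time steps
    pred_len: int = 32   # forecast time steps
    n_vars: int = 16     # activity variables per time step

    def __post_init__(self):
        # Multi-head attention splits d_model evenly across heads.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads


config = GitPulseConfig()
input_shape = (1, config.hist_len, config.n_vars)    # one project's history
output_shape = (1, config.pred_len, config.n_vars)   # one project's forecast
```

With the values in the table, each attention head works on 128 / 4 = 32 dimensions.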
Performance
Evaluated on 636 test samples from 4,232 GitHub projects:
| Model | MSE ↓ | MAE ↓ | R² ↑ | DA ↑ | TA@0.2 ↑ |
|---|---|---|---|---|---|
| GitPulse | 0.0755 | 0.1094 | 0.7559 | 86.68% | 81.60% |
| CondGRU+Text | 0.0915 | 0.1204 | 0.7043 | 84.05% | 80.14% |
| Transformer | 0.1142 | 0.1342 | 0.6312 | 84.02% | 78.87% |
| LSTM | 0.2142 | 0.1914 | 0.3800 | 56.00% | 75.00% |
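MSE, MAE, and R² in the table follow their standard definitions, which the sketch below spells out in plain Python. The card does not define DA or TA@0.2. The `direction_accuracy` function is one common reading of DA (the share of steps where the predicted and actual changes have the same sign) and is an assumption here; TA@0.2 is left out because its definition is not given.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def _sign(x):
    return (x > 0) - (x < 0)


def direction_accuracy(y_true, y_pred):
    """ASSUMED definition of DA: matching sign of step-to-step change."""
    pairs = range(1, len(y_true))
    hits = sum(
        _sign(y_true[i] - y_true[i - 1]) == _sign(y_pred[i] - y_pred[i - 1])
        for i in pairs
    )
    return hits / len(pairs)


# Toy series, not data from the evaluation.
y_true = [1.0, 2.0, 3.0, 2.0]
y_pred = [1.0, 2.5, 2.5, 2.0]
scores = {
    "mse": mse(y_true, y_pred),
    "mae": mae(y_true, y_pred),
    "r2": r2(y_true, y_pred),
    "da": direction_accuracy(y_true, y_pred),
}
```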
Text Contribution
| Architecture | Time-Series-Only R² | +Text R² | Relative Improvement |
|---|---|---|---|
| Transformer → GitPulse | 0.6312 | 0.7559 | +19.8% |
| CondGRU → CondGRU+Text | 0.3328 | 0.7043 | +111.6% |
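The improvement column is a relative change in R², not a difference in percentage points. A short check, using only the values from the table:

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change from baseline to improved, in percent."""
    return (improved - baseline) / baseline * 100.0


rows = {
    "Transformer -> GitPulse": (0.6312, 0.7559),
    "CondGRU -> CondGRU+Text": (0.3328, 0.7043),
}
for name, (ts_only, with_text) in rows.items():
    print(f"{name}: {relative_improvement(ts_only, with_text):+.1f}%")
# Transformer -> GitPulse: +19.8%
# CondGRU -> CondGRU+Text: +111.6%
```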
Usage
Installation
pip install torch transformers
Quick Start
import torch
from transformers import DistilBertTokenizer
# Load model
from model import GitPulseModel
model = GitPulseModel.from_pretrained('./')
# Prepare inputs
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
text = "A Python library for machine learning"
encoded = tokenizer(text, padding='max_length', truncation=True,
                    max_length=128, return_tensors='pt')
# Time series: [batch, hist_len, n_vars] (random placeholder; use real activity history)
time_series = torch.randn(1, 128, 16)
# Predict
model.eval()
with torch.no_grad():
    predictions = model(
        time_series,
        input_ids=encoded['input_ids'],
        attention_mask=encoded['attention_mask']
    )
# predictions shape: [1, 32, 16]
Inference API
# Simple prediction interface
predictions = model.predict(
time_series=history_data, # [batch, 128, 16]
text="Project description...",
tokenizer=tokenizer
)
Training Details
- Dataset: GitHub project activity data (4,232 projects)
- Train/Val/Test Split: 70% / 15% / 15%
- Optimizer: AdamW (lr=1e-5, weight_decay=0.01)
- Fine-tuning Strategy: Freeze encoder, train prediction head
- Hardware: NVIDIA RTX GPU
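The card does not say how the 70% / 15% / 15% split was drawn. As a sketch under assumptions (a seeded shuffle at the project level, fractions rounded down, and the remainder assigned to the test set), it could look like this. The function name `split_projects` and the seed are hypothetical.

```python
import random


def split_projects(project_ids, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle project IDs and cut them into train/val/test lists."""
    ids = list(project_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    # Whatever is left after train and val goes to test.
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


train_ids, val_ids, test_ids = split_projects(range(4232))
print(len(train_ids), len(val_ids), len(test_ids))  # 2962 634 636
```

Under this particular rounding the test share comes to 636, which agrees with the test-set size reported in the Performance section, though that may be a coincidence. Splitting by project rather than by window keeps any one repository from appearing in more than one subset.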
Input Features (16 variables)
- Commits count
- Issues opened
- Issues closed
- Pull requests opened
- Pull requests merged
- Stars gained
- Forks count
- Contributors count
- Code additions
- Code deletions
- Comments count
- Releases count
- Wiki updates
- Discussions count
- Sponsors count
- Watchers count
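The model expects these variables as the last dimension of a `[batch, hist_len, n_vars]` tensor. The sketch below shows one way to assemble that layout from per-step records. The key names, the column order (taken from the list above), and the choice to fill missing values with zero are all assumptions, since the card does not document the expected ordering or preprocessing.

```python
# Hypothetical keys, in the order the variables are listed in this card.
FEATURES = [
    "commits", "issues_opened", "issues_closed", "prs_opened",
    "prs_merged", "stars_gained", "forks", "contributors",
    "additions", "deletions", "comments", "releases",
    "wiki_updates", "discussions", "sponsors", "watchers",
]


def to_matrix(records):
    """Turn a list of per-step dicts into rows of len(FEATURES) floats."""
    # Missing keys become 0.0 (an assumption, not documented behavior).
    return [[float(step.get(name, 0.0)) for name in FEATURES] for step in records]


# Toy history: 128 identical steps, only two variables populated.
records = [{"commits": 12, "stars_gained": 3} for _ in range(128)]
matrix = to_matrix(records)  # 128 rows x 16 columns
```

Wrapping the result with `torch.tensor(matrix).unsqueeze(0)` would give the `[1, 128, 16]` shape used in the Quick Start example.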
Limitations
- Trained on English project descriptions only
- Best suited for projects with at least 128 months of history, the length of the model's input window (hist_len = 128)
- Performance may vary for niche domains not well represented in training
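For projects whose history does not match the 128-step input window, the input still has to be brought to that length. The card does not describe how this is handled; the sketch below shows one simple option, keeping the most recent steps and left-padding short histories with zero rows. The helper `fit_history` is hypothetical, and zero-padded inputs may be unlike anything seen in training, so forecasts for such projects should be treated with caution.

```python
def fit_history(rows, hist_len=128, n_vars=16):
    """Return exactly hist_len rows: truncate old steps or left-pad with zeros."""
    recent = [list(r) for r in rows[-hist_len:]]  # keep the newest steps
    missing = hist_len - len(recent)
    padding = [[0.0] * n_vars for _ in range(missing)]
    return padding + recent


short = [[1.0] * 16 for _ in range(40)]           # 40 steps of history
long = [[float(i)] * 16 for i in range(200)]      # 200 steps of history

fitted_short = fit_history(short)  # 88 zero rows, then the 40 real rows
fitted_long = fit_history(long)    # steps 72..199 only
```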
Citation
@article{gitpulse2024,
title={GitPulse: Multimodal Time Series Prediction for GitHub Project Health},
author={Anonymous},
journal={arXiv preprint},
year={2024}
}
License
Apache 2.0