StarCoderData Code Classifier

Multi-task code classification model for filtering large-scale code datasets. Built to aggressively curate training data for a 1B-parameter code model focused on structured data.

Model Details

  • Base model: microsoft/unixcoder-base (125M params)
  • Architecture: Shared encoder + three task-specific linear heads (see the sketch after this list)
  • Training data: 191,776 code samples from bigcode/starcoderdata, labeled by GPT-5-nano (~$80 via the Batch API)
  • Languages: Python, JavaScript, TypeScript, Java, Go, Rust, SQL, Shell (25K per language)
  • Training: 3 epochs, batch size 16, lr 2e-5, AMP (bf16), torch.compile
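
For concreteness, here is a minimal sketch of the shared-encoder, three-head setup described above, assuming the first-token representation is used as a pooled summary and plain linear heads; class and attribute names are illustrative, not the actual training code:

```python
import torch.nn as nn
from transformers import AutoModel

class CodeCurator(nn.Module):
    """Shared UniXcoder encoder feeding three task-specific linear heads."""

    def __init__(self, base: str = "microsoft/unixcoder-base", n_content_types: int = 9):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        hidden = self.encoder.config.hidden_size  # 768 for unixcoder-base
        self.quality_head = nn.Linear(hidden, 1)                 # code quality, 1-5 (regression)
        self.structured_head = nn.Linear(hidden, 1)              # structured-data relevance, 0-3 (regression)
        self.content_head = nn.Linear(hidden, n_content_types)   # content type (9-way classification)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # first-token embedding as the pooled representation
        return (
            self.quality_head(pooled).squeeze(-1),
            self.structured_head(pooled).squeeze(-1),
            self.content_head(pooled),
        )
```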

Tasks

| Task | Type | Output | Loss |
|------|------|--------|------|
| Code Quality (1-5) | Regression | Float | MSE |
| Structured Data Relevance (0-3) | Regression | Float | MSE |
| Content Type (9 classes) | Classification | Softmax | CrossEntropy |
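
The per-task losses in the table could be combined as in the sketch below. Equal weighting and the label field names are assumptions; the actual loss weights are not documented here.

```python
import torch.nn.functional as F

def multitask_loss(preds, labels):
    """Sum of the three task losses (equal weights assumed, not specified above)."""
    quality_pred, structured_pred, content_logits = preds
    loss_quality = F.mse_loss(quality_pred, labels["quality"].float())            # MSE on the 1-5 scale
    loss_structured = F.mse_loss(structured_pred, labels["structured"].float())   # MSE on the 0-3 scale
    loss_content = F.cross_entropy(content_logits, labels["content_type"])        # 9-way cross-entropy
    return loss_quality + loss_structured + loss_content
```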

Test Set Performance

Code Quality (1-5 scale)

| Metric | Score |
|--------|-------|
| MAE | 0.598 |
| Rounded Accuracy | 55.3% |
| Spearman r | 0.575 |

| Level | Precision | Recall | F1 |
|-------|-----------|--------|-----|
| 1 - Broken/gibberish | 0.72 | 0.35 | 0.47 |
| 2 - Functional but poor | 0.28 | 0.23 | 0.25 |
| 3 - Decent | 0.49 | 0.67 | 0.56 |
| 4 - Good | 0.68 | 0.63 | 0.65 |
| 5 - Excellent | 0.08 | 0.00 | 0.01 |

Structured Data Relevance (0-3 scale)

| Metric | Score |
|--------|-------|
| MAE | 0.421 |
| Rounded Accuracy | 66.7% |
| Spearman r | 0.807 |

| Level | Precision | Recall | F1 |
|-------|-----------|--------|-----|
| 0 - None | 0.73 | 0.70 | 0.71 |
| 1 - Minor | 0.43 | 0.48 | 0.45 |
| 2 - Significant | 0.74 | 0.75 | 0.75 |
| 3 - Primary focus | 0.67 | 0.59 | 0.63 |

Content Type (9 classes)

| Metric | Score |
|--------|-------|
| Accuracy | 87.5% |
| Macro F1 | 0.678 |

| Type | Precision | Recall | F1 | Support |
|------|-----------|--------|-----|---------|
| library | 0.89 | 0.92 | 0.91 | 7,990 |
| application | 0.83 | 0.75 | 0.79 | 3,404 |
| test | 0.92 | 0.93 | 0.93 | 1,818 |
| config | 0.77 | 0.68 | 0.72 | 309 |
| tutorial | 0.56 | 0.37 | 0.45 | 227 |
| data | 0.45 | 0.59 | 0.51 | 129 |
| generated | 0.66 | 0.49 | 0.56 | 316 |
| script | 0.90 | 0.93 | 0.91 | 4,970 |
| other | 0.75 | 0.20 | 0.32 | 15 |

Why These Scores Are Acceptable

This model is designed as a coarse filter, not a precise labeler. The intended workflow is:

  1. Run this model on the full StarCoderData (~250B tokens)
  2. Apply threshold filters (e.g., quality >= 3 AND structured_data >= 2; see the sketch after this list)
  3. Train a 1B parameter model on the filtered subset
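
A minimal sketch of step 2's threshold filter, assuming the per-file predictions have already been materialized as dictionaries (field names are hypothetical):

```python
# Hypothetical per-file predictions from the classifier (field names are illustrative).
predictions = [
    {"path": "api_client.py", "quality": 3.6, "structured_data": 2.2},
    {"path": "scratch.py",    "quality": 2.1, "structured_data": 0.4},
]

def keep(row):
    # Thresholds from the workflow above: quality >= 3 AND structured_data >= 2.
    return row["quality"] >= 3.0 and row["structured_data"] >= 2.0

filtered = [row for row in predictions if keep(row)]
print([row["path"] for row in filtered])  # -> ['api_client.py']
```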

For this filtering use case, what matters is rank ordering, not exact classification:

  • Structured data (Spearman 0.81): The model's strongest dimension. It reliably separates code with heavy structured data usage (APIs, schemas, serialization) from code without it. At the filtering threshold of structured_data >= 2, the model achieves 0.75 F1 — meaning the filtered subset will be genuinely rich in structured data patterns.

  • Quality (Spearman 0.58): The weakest dimension, but still useful for filtering. The model struggles most with the quality 2-3 boundary (decent vs. poor) and virtually ignores quality 5 (only 1.2% of training data). However, for the intended filter of quality >= 3, the model has decent precision at levels 3-4 (0.49-0.68). The key insight: false positives at the boundary (quality-2 code scored as 3) are tolerable because the structured data filter provides a second gate. Code that passes both filters is unlikely to be low quality.

  • Content type (87.5% accuracy): Strong performance on the high-volume categories that matter most for filtering: library (0.91 F1), script (0.91 F1), test (0.93 F1), and application (0.79 F1). The weaker categories (tutorial, data, generated, other) have low support — together they represent only 3.5% of the data. Even with lower recall on these rare types, the model will still flag enough of them for filtering decisions.

  • Errors are symmetric, not catastrophic. A quality MAE of 0.60 means predictions are typically off by less than one level. A file scored as quality 4 is almost certainly quality 3-5, not quality 1. This is precisely the behavior needed for threshold-based filtering — the model rarely makes predictions that are off by more than one level.

How to Improve

The primary bottleneck is training data volume and class balance, not model capacity:

  1. Scale up the GPT-5-nano labeling set. The current model was trained on 192K labeled samples. Doubling to 400K samples (roughly another $80 of Batch API labeling) would particularly help quality levels 2 and 5, where the model struggles most. Level 5 (excellent code) had only 2,345 training examples, far too few for the model to learn the pattern.

  2. Increase max token length. The current model truncates inputs at 512 tokens, but code files often need more context to assess quality. Increasing to 1024 tokens (UniXcoder's maximum; going beyond that would require extending the position embeddings or switching base models) would give the model more signal, particularly for quality assessment, where style and documentation patterns emerge over longer spans.
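
If the context window were raised, the change would be confined to the tokenization step. A minimal sketch, assuming standard Hugging Face preprocessing (the actual preprocessing code is not shown here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
code_string = "def add(a, b):\n    return a + b\n"

# Current setup truncates at 512 tokens; max_length=1024 (UniXcoder's limit) would
# keep more of each file visible to the encoder.
enc = tokenizer(code_string, truncation=True, max_length=1024, return_tensors="pt")
```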
