---
license: cc-by-4.0
language:
- en
tags:
- proteomics
- mass-spectrometry
- peptide-sequencing
- de-novo
- calibration
- fdr
---

## Winnow General Probability Calibrator

**Winnow** recalibrates confidence scores and provides FDR control for *de novo* peptide sequencing (DNS) workflows. This repository hosts a pretrained, general-purpose calibrator that maps raw InstaNovo model confidences and complementary features (mass error, retention time, chimericity, beam features, Prosit features) to well-calibrated probabilities.

- Intended inputs: spectrum input data and the corresponding MS/MS PSM results produced by InstaNovo.
- Outputs: calibrated per-PSM probabilities in `calibrated_confidence`.

### What’s inside

- `calibrator.pkl`: trained classifier
- `scaler.pkl`: feature standardiser
- `irt_predictor.pkl`: Prosit iRT regressor used by RT features

---

## How to use

### Python

```python
from pathlib import Path

from huggingface_hub import snapshot_download
from winnow.calibration.calibrator import ProbabilityCalibrator
from winnow.datasets.data_loaders import InstaNovoDatasetLoader
from winnow.scripts.main import filter_dataset
from winnow.fdr.nonparametric import NonParametricFDRControl

# Local folder for the downloaded model files (example path; change as needed)
general_model = Path("general_model")

# 1) Download model files
snapshot_download(
    repo_id="InstaDeepAI/winnow-general-model",
    allow_patterns=["*.pkl"],
    repo_type="model",
    local_dir=general_model,
)

# 2) Load calibrator
calibrator = ProbabilityCalibrator.load(general_model)

# 3) Load your dataset (InstaNovo-style config)
dataset = InstaNovoDatasetLoader().load(
    "path_to_spectrum_data.parquet",
    "path_to_instanovo_predictions.csv",
)
dataset = filter_dataset(dataset)  # standard Winnow filtering

# 4) Predict calibrated confidences
calibrator.predict(dataset)  # adds dataset.metadata["calibrated_confidence"]

# 5) Optional: FDR control on calibrated confidence
fdr = NonParametricFDRControl()
fdr.fit(dataset.metadata["calibrated_confidence"])
cutoff = fdr.get_confidence_cutoff(0.05)  # 5% FDR cutoff
dataset.metadata["keep@5%"] = dataset.metadata["calibrated_confidence"] >= cutoff
```

### CLI

```bash
# After `pip install winnow`
winnow predict \
  --data-source instanovo \
  --dataset-config-path config_with_dataset_paths.yaml \
  --model-folder general_model_folder \
  --method winnow \
  --fdr-threshold 0.05 \
  --confidence-column calibrated_confidence \
  --output-path outputs/winnow_predictions.csv
```

---

## Inputs and outputs

**Required columns for calibration:**

- Spectrum data (`*.parquet`)
  - `spectrum_id` (string): unique spectrum identifier
  - `sequence` (string): ground truth peptide sequence from database search (optional)
  - `retention_time` (float): retention time (seconds)
  - `precursor_mass` (float): mass of the precursor ion (from MS1)
  - `mz_array` (list[float]): mass-to-charge values of the MS2 spectrum
  - `intensity_array` (list[float]): intensity values of the MS2 spectrum
  - `precursor_charge` (int): charge of the precursor (from MS1)
- Beam predictions (`*_beams.csv`)
  - `spectrum_id` (string): unique spectrum identifier
  - `sequence` (string): ground truth peptide sequence from database search (optional)
  - `preds` (string): top prediction, untokenised sequence
  - `preds_tokenised` (string): comma-separated tokens for the top prediction
  - `log_probs` (float): top prediction log probability
  - `preds_beam_k` (string): untokenised sequence for beam k (k≥0)
  - `log_probs_beam_k` (float): log probability for beam k
  - `token_log_probs_k` (string/list-encoded): per-token log probabilities for beam k

**Output columns (added by Winnow's calibrator on `predict`):**

- `calibrated_confidence`: calibrated probability
- Optional (if requested): `psm_pep`, `psm_fdr`, `psm_qvalue`
- All input columns are retained in place

---

## Training data

- The general model was trained on a pooled, labelled set spanning multiple public datasets to encourage cross-dataset generalisation:
  - HeLa single-shot (PXD044934)
  - *Candidatus* Scalindua brodae (PXD044934)
  - Wound exudates (PXD025748)
  - HepG2 (PXD019483)
  - Immunopeptidomics (PXD006939)
  - HeLa degradome (PXD044934)
  - Snake venoms (PXD036161)
- All default features were enabled when training this model.
- Predictions were obtained using InstaNovo v1.1.1 with knapsack beam search set to 50 beams.

---

## Citation

If you use Winnow or this model, please cite: