slap-combined
SLAP-Combined is a unified dual-encoder model trained on both intrinsic and situational speech style data. It excels on compositional evaluations that require knowledge of both attribute types.
This is part of the SLAP model family from the paper:
SLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining. Anuj Diwan, Eunsol Choi, David Harwath. Interspeech 2026.
Model Details
- Architecture: WavLM-Large (speech encoder) + Granite Embedding 278M (text encoder) with projection to 768-dim shared space
- Speech encoder: microsoft/wavlm-large (317M params)
- Text encoder: ibm-granite/granite-embedding-278m-multilingual (278M params)
- Embedding dimension: 768
- Training data: ParaSpeechCaps (union of intrinsic and situational subsets with equal balancing)
- Training objective: InfoNCE contrastive loss
- Training: 4500 steps, Adam optimizer, lr=1e-5, 4x NVIDIA A40 GPUs, batch size 128
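The InfoNCE objective listed above can be sketched as a symmetric contrastive loss over a batch of paired speech/text embeddings. This is an illustrative sketch, not the repository's implementation; the temperature value is an assumption:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(speech_emb: torch.Tensor, text_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (speech, text) pairs.

    speech_emb, text_emb: (B, D) embeddings where row i of each tensor
    comes from the same example. Matched pairs sit on the diagonal of
    the similarity matrix; all other entries act as in-batch negatives.
    The temperature (0.07) is a common default, not a value from the paper.
    """
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.T / temperature   # (B, B) cosine similarities
    targets = torch.arange(logits.size(0))           # positives on the diagonal
    loss_s2t = F.cross_entropy(logits, targets)      # speech -> text direction
    loss_t2s = F.cross_entropy(logits.T, targets)    # text -> speech direction
    return (loss_s2t + loss_t2s) / 2
```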
Usage
```python
import torch
from slap.model import CLAP
from transformers import AutoTokenizer, Wav2Vec2FeatureExtractor

# Instantiate the dual encoder with the same backbones used during training
model = CLAP(
    speech_name="microsoft/wavlm-large",
    text_name="ibm-granite/granite-embedding-278m-multilingual",
    embedding_dim=768,
)

# Load the released checkpoint; strict=False skips keys not used by the wrapper
state_dict = torch.load("slap-combined.pth.tar", map_location="cpu")
model.load_state_dict(state_dict, strict=False)
model.eval()
```
See the GitHub repository for full usage examples.
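As an illustration of how the shared 768-dim embedding space is typically used, the sketch below ranks candidate style captions against one speech embedding by cosine similarity. How the embeddings are produced from `CLAP` (e.g. dedicated encode methods) is an assumption here; check the repository for the actual API:

```python
import torch
import torch.nn.functional as F

def rank_captions(speech_emb: torch.Tensor, caption_embs: torch.Tensor) -> torch.Tensor:
    """Rank caption embeddings by cosine similarity to one speech embedding.

    speech_emb: (D,) embedding of a single utterance.
    caption_embs: (N, D) embeddings of candidate style captions.
    Returns indices of captions sorted from most to least similar.
    Embeddings are assumed to come from the model's two encoders
    (e.g. hypothetical encode_speech/encode_text methods).
    """
    speech_emb = F.normalize(speech_emb, dim=-1)
    caption_embs = F.normalize(caption_embs, dim=-1)
    sims = caption_embs @ speech_emb          # (N,) cosine similarities
    return torch.argsort(sims, descending=True)
```

At retrieval time you would embed each caption once, cache the matrix, and call `rank_captions` per utterance.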
Citation
@inproceedings{diwan2026slap,
title={{SLAP}: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining},
author={Diwan, Anuj and Choi, Eunsol and Harwath, David},
booktitle={Proc. Interspeech},
year={2026}
}