slap-combined
SLAP-Combined is a unified dual-encoder model trained on both intrinsic and situational speech style data. It excels on compositional evaluations that require knowledge of both attribute types.
This is part of the SLAP model family from the paper:
SLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining. Anuj Diwan, Eunsol Choi, David Harwath. Interspeech 2026.
Model Details
- Architecture: WavLM-Large (speech encoder) + Granite Embedding 278M (text encoder) with projection to 768-dim shared space
- Speech encoder: microsoft/wavlm-large (317M params)
- Text encoder: ibm-granite/granite-embedding-278m-multilingual (278M params)
- Embedding dimension: 768
- Training data: ParaSpeechCaps (union of intrinsic and situational subsets with equal balancing)
- Training objective: InfoNCE contrastive loss
- Training: 4500 steps, Adam optimizer, lr=1e-5, 4x NVIDIA A40 GPUs, batch size 128
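The InfoNCE objective listed above can be sketched as a symmetric contrastive loss over a batch of paired speech/text embeddings. This is an illustrative sketch, not the repository's implementation; the temperature value is an assumption:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(speech_emb: torch.Tensor, text_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (speech, text) pairs.

    speech_emb, text_emb: (B, D) embeddings where row i of each tensor
    comes from the same example. Matched pairs sit on the diagonal of
    the similarity matrix; all other entries act as in-batch negatives.
    The temperature (0.07) is a common default, not a value from the paper.
    """
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.T / temperature   # (B, B) cosine similarities
    targets = torch.arange(logits.size(0))           # positives on the diagonal
    loss_s2t = F.cross_entropy(logits, targets)      # speech -> text direction
    loss_t2s = F.cross_entropy(logits.T, targets)    # text -> speech direction
    return (loss_s2t + loss_t2s) / 2
```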
Usage
```python
import torch
from slap.model import CLAP
from transformers import AutoTokenizer, Wav2Vec2FeatureExtractor

# Instantiate the dual encoder with the same backbones used during training
model = CLAP(
    speech_name="microsoft/wavlm-large",
    text_name="ibm-granite/granite-embedding-278m-multilingual",
    embedding_dim=768,
)

# Load the released checkpoint; strict=False skips keys not used by the wrapper
state_dict = torch.load("slap-combined.pth.tar", map_location="cpu")
model.load_state_dict(state_dict, strict=False)
model.eval()
```
See the GitHub repository for full usage examples.
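As an illustration of how the shared 768-dim embedding space is typically used, the sketch below ranks candidate style captions against one speech embedding by cosine similarity. How the embeddings are produced from `CLAP` (e.g. dedicated encode methods) is an assumption here; check the repository for the actual API:

```python
import torch
import torch.nn.functional as F

def rank_captions(speech_emb: torch.Tensor, caption_embs: torch.Tensor) -> torch.Tensor:
    """Rank caption embeddings by cosine similarity to one speech embedding.

    speech_emb: (D,) embedding of a single utterance.
    caption_embs: (N, D) embeddings of candidate style captions.
    Returns indices of captions sorted from most to least similar.
    Embeddings are assumed to come from the model's two encoders
    (e.g. hypothetical encode_speech/encode_text methods).
    """
    speech_emb = F.normalize(speech_emb, dim=-1)
    caption_embs = F.normalize(caption_embs, dim=-1)
    sims = caption_embs @ speech_emb          # (N,) cosine similarities
    return torch.argsort(sims, descending=True)
```

At retrieval time you would embed each caption once, cache the matrix, and call `rank_captions` per utterance.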
Citation
@inproceedings{diwan2026slap,
title={{SLAP}: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining},
author={Diwan, Anuj and Choi, Eunsol and Harwath, David},
booktitle={Proc. Interspeech},
year={2026}
}