This directory contains ONNX exports of the google/siglip-base-patch16-256-multilingual model.
SigLIP (Sigmoid Loss for Language Image Pre-training) is a multimodal model similar to CLIP, with one key difference: it replaces CLIP's softmax-based contrastive loss with a pairwise sigmoid loss. Each image-text pair is scored independently, so training does not need a batch-wide normalization of similarities. This checkpoint is the multilingual variant, with a text encoder trained on many languages.
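For intuition, here is a minimal NumPy sketch of the pairwise sigmoid loss described in the paper (not this repository's training code); the logit scale and bias values are illustrative stand-ins for the learned parameters, and the mean over all pairs is used for simplicity where the paper normalizes by batch size:

import numpy as np

def siglip_loss(img_emb, txt_emb, logit_scale=10.0, logit_bias=-10.0):
    # img_emb, txt_emb: (n, d) L2-normalized embeddings for n matched pairs
    logits = logit_scale * img_emb @ txt_emb.T + logit_bias
    labels = 2 * np.eye(len(img_emb)) - 1  # +1 on the diagonal (matches), -1 elsewhere
    # -log sigmoid(labels * logits): an independent binary decision per pair,
    # instead of CLIP's softmax over all candidates in the batch
    return np.log1p(np.exp(-labels * logits)).mean()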
Directory structure:

siglip-base-patch16-256-multilingual-onnx/
├── vision/
│   ├── model.onnx                 # Vision encoder
│   ├── config.json                # Model configuration
│   └── preprocessor_config.json
├── text/
│   ├── model.onnx                 # Text encoder
│   ├── config.json                # Model configuration
│   ├── tokenizer.json             # Fast tokenizer
│   ├── special_tokens_map.json
│   └── spiece.model               # SentencePiece model
└── README.md
Install the required packages:

pip install onnxruntime pillow transformers
For GPU support:
pip install onnxruntime-gpu
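With onnxruntime-gpu installed, request the CUDA execution provider when creating a session; onnxruntime falls back to the CPU provider when no GPU is available. A minimal sketch:

import onnxruntime as ort

# Prefer CUDA when available; fall back to CPU otherwise
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
vision_session = ort.InferenceSession("vision/model.onnx", providers=providers)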
Example usage:

import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import AutoProcessor
# Load processors
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-256-multilingual")
# Load ONNX sessions
vision_session = ort.InferenceSession("vision/model.onnx")
text_session = ort.InferenceSession("text/model.onnx")
# Process image
image = Image.open("your_image.jpg")
image_inputs = processor(images=image, return_tensors="np")
image_embeddings = vision_session.run(None, {"pixel_values": image_inputs["pixel_values"]})[0]
# Process text
texts = ["a photo of a cat", "une photo d'un chat", "una foto de un gato"]
text_inputs = processor(text=texts, padding="max_length", return_tensors="np")  # SigLIP was trained with fixed-length padding
text_embeddings = text_session.run(None, {
"input_ids": text_inputs["input_ids"],
"attention_mask": text_inputs["attention_mask"]
})[0]
# Compute similarity using a sigmoid (not a softmax as in CLIP)
# Note: the full SigLIP model L2-normalizes embeddings and applies a learned
# logit scale and bias before the sigmoid; if your export does not include
# those parameters, treat the probabilities below as approximate.
image_embeddings = image_embeddings / np.linalg.norm(image_embeddings, axis=-1, keepdims=True)
text_embeddings = text_embeddings / np.linalg.norm(text_embeddings, axis=-1, keepdims=True)
logits = np.dot(image_embeddings, text_embeddings.T)
probs = 1 / (1 + np.exp(-logits))  # sigmoid activation
print("Probabilities:")
for i, text in enumerate(texts):
    print(f"  {text}: {probs[0][i]:.2%}")
Supported languages: Arabic (ar), Bengali (bn), German (de), Greek (el), English (en), Spanish (es), Finnish (fi), French (fr), Hebrew (he), Hindi (hi), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Norwegian (no), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th), Turkish (tr), Ukrainian (uk), Vietnamese (vi), Chinese (zh)
Citation:

@article{zhai2023sigmoid,
  title={Sigmoid Loss for Language Image Pre-Training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}
Please refer to the original model's license at: https://huggingface.co/google/siglip-base-patch16-256-multilingual