Just wanted to add that I found the GitHub README on transformers that shows how to perform pre- and post-processing yourself using AutoModel and AutoProcessor. I noted that this example applies torch.sigmoid to the raw model output, which leaves the values looking similar to how they look when running it via the pipeline. Following the GitHub example almost exactly with my dog image I can see, for labels:
man, cat, horse, dog (with the template "This is a photo of a xxxx.")
Raw logits: tensor([[-16.1657, -14.3962, -15.7023, -7.3122]])
Post sigmoid: tensor([[9.5352e-08, 5.5954e-07, 1.5156e-07, 6.6692e-04]])
0.0% that image 0 is 'man'
0.0% that image 0 is 'cat'
0.0% that image 0 is 'horse'
0.1% that image 0 is 'dog'
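To double-check the post-processing step, the sigmoid values above can be reproduced with plain Python. This is a minimal sketch; the logits are copied from the raw output above, and the label order is the one listed earlier:

```python
import math

def sigmoid(x: float) -> float:
    # Plain logistic function; matches torch.sigmoid for scalar inputs.
    return 1.0 / (1.0 + math.exp(-x))

# Raw logits for the labels man, cat, horse, dog (copied from above).
logits = [-16.1657, -14.3962, -15.7023, -7.3122]
probs = [sigmoid(x) for x in logits]

for label, p in zip(["man", "cat", "horse", "dog"], probs):
    print(f"{100 * p:.1f}% that image 0 is '{label}'")
```

Because SigLIP uses a sigmoid rather than a softmax, each label is scored independently, so the probabilities are not forced to sum to 1 and a correct label can still come out numerically small (here 'dog' lands at roughly 6.67e-4, which rounds to 0.1%).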
Did you fix it? I'm hitting the same error.
Cross-modal similarity
When I input an image and a text to the model:
import torch
from transformers.image_utils import load_image

def get_output(url, text):
    # Note: the arguments should not be overwritten inside the function;
    # pass the URL and text at the call site instead.
    image = load_image(url)
    inputs = processor(text=[text], images=image, padding="max_length", return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model(**inputs)
    return output

output = get_output(
    url="http://images.cocodataset.org/val2017/000000039769.jpg",
    text="a photo of 2 cats",
)
The output logits are:
SiglipOutput(loss=None, logits_per_image=tensor([[-15.5217]], device='cuda:0'), logits_per_text=tensor([[-15.5217]], device='cuda:0'))
After sigmoid the output is ~0.
How can I fix it?
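For intuition on why the sigmoid output can sit near zero: SigLIP's pairwise logit is a scaled cosine similarity plus a learned bias, z = scale * sim + bias, and the learned bias is strongly negative, so a pair needs fairly high similarity before the logit turns positive. A minimal numeric sketch follows; the scale and bias values here are illustrative assumptions, not the actual learned parameters of any checkpoint:

```python
import math

def siglip_logit(cos_sim: float, logit_scale: float, logit_bias: float) -> float:
    # SigLIP-style pairwise logit: scaled cosine similarity plus a learned bias.
    return logit_scale * cos_sim + logit_bias

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative (assumed) values: a large positive scale and a strongly
# negative bias, mirroring the shape of the SigLIP training setup.
scale, bias = 100.0, -15.0

for sim in (0.0, 0.10, 0.20):
    z = siglip_logit(sim, scale, bias)
    print(f"cos_sim={sim:.2f} -> logit={z:+.1f}, sigmoid={sigmoid(z):.4f}")
```

Under these assumed values, a similarity of 0.10 still yields a negative logit and a sub-1% probability, while 0.20 is enough to push the sigmoid close to 1, which is why even plausible image-text pairs can score near zero.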