Just wanted to add that I found the GitHub README on transformers that shows how to perform pre- and post-processing yourself using AutoModel and AutoProcessor. I noted that this example applies torch.sigmoid to the raw model output, which leaves the values looking similar to how they look when running it via the pipeline. Following the GitHub example almost exactly with my dog image I can see, for labels:
man, cat, horse, dog (with the template "This is a photo of a xxxx.")
Raw logits: tensor([[-16.1657, -14.3962, -15.7023, -7.3122]])
Post sigmoid: tensor([[9.5352e-08, 5.5954e-07, 1.5156e-07, 6.6692e-04]])
0.0% that image 0 is 'man'
0.0% that image 0 is 'cat'
0.0% that image 0 is 'horse'
0.1% that image 0 is 'dog'
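To double-check the post-processing step, the sigmoid values above can be reproduced with plain Python. This is a minimal sketch; the logits are copied from the raw output above, and the label order is the one listed earlier:

```python
import math

def sigmoid(x: float) -> float:
    # Plain logistic function; matches torch.sigmoid for scalar inputs.
    return 1.0 / (1.0 + math.exp(-x))

# Raw logits for the labels man, cat, horse, dog (copied from above).
logits = [-16.1657, -14.3962, -15.7023, -7.3122]
probs = [sigmoid(x) for x in logits]

for label, p in zip(["man", "cat", "horse", "dog"], probs):
    print(f"{100 * p:.1f}% that image 0 is '{label}'")
```

Because SigLIP uses a sigmoid rather than a softmax, each label is scored independently, so the probabilities are not forced to sum to 1 and a correct label can still come out numerically small (here 'dog' lands at roughly 6.67e-4, which rounds to 0.1%).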
Did you fix it? I'm hitting the same error.
Cross-modal similarity
When I input an image and a text to the model:
import torch
from transformers.image_utils import load_image

def get_output(url, text):
    # Note: the arguments should not be overwritten inside the function;
    # pass the URL and text at the call site instead.
    image = load_image(url)
    inputs = processor(text=[text], images=image, padding="max_length", return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model(**inputs)
    return output

output = get_output(
    url="http://images.cocodataset.org/val2017/000000039769.jpg",
    text="a photo of 2 cats",
)
The output logits are:
SiglipOutput(loss=None, logits_per_image=tensor([[-15.5217]], device='cuda:0'), logits_per_text=tensor([[-15.5217]], device='cuda:0'))
After sigmoid the output is ~0.
How can I fix it?
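For intuition on why the sigmoid output can sit near zero: SigLIP's pairwise logit is a scaled cosine similarity plus a learned bias, z = scale * sim + bias, and the learned bias is strongly negative, so a pair needs fairly high similarity before the logit turns positive. A minimal numeric sketch follows; the scale and bias values here are illustrative assumptions, not the actual learned parameters of any checkpoint:

```python
import math

def siglip_logit(cos_sim: float, logit_scale: float, logit_bias: float) -> float:
    # SigLIP-style pairwise logit: scaled cosine similarity plus a learned bias.
    return logit_scale * cos_sim + logit_bias

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative (assumed) values: a large positive scale and a strongly
# negative bias, mirroring the shape of the SigLIP training setup.
scale, bias = 100.0, -15.0

for sim in (0.0, 0.10, 0.20):
    z = siglip_logit(sim, scale, bias)
    print(f"cos_sim={sim:.2f} -> logit={z:+.1f}, sigmoid={sigmoid(z):.4f}")
```

Under these assumed values, a similarity of 0.10 still yields a negative logit and a sub-1% probability, while 0.20 is enough to push the sigmoid close to 1, which is why even plausible image-text pairs can score near zero.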