Model Card for envisage

This is the official model card for envisage, a Vision Transformer (ViT) model fine-tuned for image classification.

This model was fine-tuned from the google/vit-base-patch16-224-in21k base model on the cifar10 dataset, which consists of 60,000 32x32 color images in 10 distinct classes.
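
The base checkpoint name encodes the input geometry: 16x16 patches on a 224x224 input. The sketch below works out the resulting token count. It assumes the 32x32 cifar10 images are upscaled to 224x224 before reaching the model, which is the usual preprocessing for this checkpoint but is not stated in this card.

```python
# Input geometry read off the checkpoint name "vit-base-patch16-224-in21k".
IMAGE_SIZE = 224   # assumed model input resolution (pixels per side)
PATCH_SIZE = 16    # side length of each square patch
CIFAR_SIZE = 32    # native cifar10 resolution (pixels per side)

patches_per_side = IMAGE_SIZE // PATCH_SIZE
num_patches = patches_per_side ** 2
sequence_length = num_patches + 1          # one extra [CLS] token
upscale_factor = IMAGE_SIZE / CIFAR_SIZE   # how much each cifar10 image is enlarged

print(f"{patches_per_side} x {patches_per_side} = {num_patches} patches")
print(f"sequence length with [CLS]: {sequence_length}")
print(f"upscale factor from cifar10: {upscale_factor:g}x")
```

Under that assumption, each cifar10 image is enlarged sevenfold per side, so the patches carry interpolated rather than native detail.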

Model Description

  • Base Model: google/vit-base-patch16-224-in21k
  • Dataset: cifar10
  • Task: Image Classification
  • Framework: PyTorch, Transformers
  • Classes (10): airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
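
The class list above follows the standard cifar10 label order (index 0 is airplane, index 9 is truck). Below is a minimal sketch of the corresponding index-to-name mapping. This card does not say whether the checkpoint's config stores these names or generic placeholder labels, so treat the mapping as an assumption and verify it against the model config.

```python
# Standard cifar10 class order; assumed (not confirmed) to match this model's config.
CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]

# Index -> name and name -> index lookups, in the shape model configs usually use.
id2label = dict(enumerate(CIFAR10_CLASSES))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[3])        # cat
print(label2id["truck"])  # 9
```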

How to Use

The easiest way to use this model for inference is with the pipeline API from the transformers library.

First, ensure you have the necessary libraries installed:

pip install transformers torch pillow requests

Then, you can use the following Python snippet to classify an image:

from transformers import pipeline
from PIL import Image
import requests

# Load the classification pipeline with the fine-tuned model
pipe = pipeline("image-classification", model="louijiec/envisage")

# Download an example image (a cat) and fail early on HTTP errors
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cat-tree.jpeg"
response = requests.get(url, stream=True, timeout=30)
response.raise_for_status()
image = Image.open(response.raw).convert("RGB")

# Get the predictions
predictions = pipe(image)

print("Predictions:")
for p in predictions:
    print(f"- {p['label']}: {p['score']:.4f}")

# By default the pipeline returns the five highest-scoring classes,
# with 'cat' likely having the highest score.

Training Procedure

The model was trained in a Google Colab environment using the transformers Trainer API.

Hyperparameters

  • Learning Rate: 5e-5
  • Training Epochs: 3
  • Batch Size: 16 per device
  • Gradient Accumulation Steps: 4 (Effective batch size of 64)
  • Optimizer: AdamW with a linear learning rate schedule
  • Warmup Ratio: 0.1
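
To make the schedule concrete, here is a rough sketch of how these hyperparameters translate into optimizer steps and a learning-rate curve. It assumes a single device and that training used all 50,000 images of the cifar10 training split; neither is stated in this card. The function is a plain-Python illustration of linear warmup followed by linear decay, not the Trainer's own implementation, and the exact step counts the Trainer reports may differ slightly because of rounding.

```python
import math

# Values stated in the hyperparameter list above.
LEARNING_RATE = 5e-5
EPOCHS = 3
PER_DEVICE_BATCH = 16
GRAD_ACCUM_STEPS = 4
WARMUP_RATIO = 0.1

# Assumptions, not stated in the card: one device, full 50,000-image training split.
NUM_DEVICES = 1
NUM_TRAIN_IMAGES = 50_000

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
steps_per_epoch = math.ceil(NUM_TRAIN_IMAGES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def linear_schedule_lr(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return LEARNING_RATE * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return LEARNING_RATE * remaining / max(1, total_steps - warmup_steps)


print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps} total, {warmup_steps} warmup")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {linear_schedule_lr(step):.2e}")
```

With these assumptions the learning rate reaches its 5e-5 peak after roughly the first tenth of training and then falls linearly to zero at the final step.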

Evaluation

The model was evaluated on the cifar10 test split, which contains 10,000 images.

  • Final Accuracy on Test Set: [TODO: add the accuracy reported by the trainer.evaluate() step]
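
For reference, top-1 accuracy is the fraction of test images whose highest-scoring class matches the true label. The sketch below shows that computation in plain Python. The function names and the three-class toy data are hypothetical and for illustration only; they are not results from this model. In a Trainer workflow the same logic would typically be wrapped in a compute_metrics function.

```python
from typing import Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Top-1 accuracy: share of rows whose argmax equals the true label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(argmax(row) == label for row, label in zip(logits, labels))
    return correct / len(labels)


# Hypothetical toy data with 3 classes, NOT output from this model.
toy_logits = [
    [2.0, 0.1, 0.3],  # predicts 0, label 0 -> correct
    [0.2, 1.5, 0.1],  # predicts 1, label 1 -> correct
    [0.9, 0.2, 0.4],  # predicts 0, label 2 -> wrong
    [0.1, 0.2, 3.0],  # predicts 2, label 2 -> correct
]
toy_labels = [0, 1, 2, 2]

print(accuracy(toy_logits, toy_labels))  # 0.75
```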

Intended Use & Limitations

This model is intended for educational purposes and as a demonstration of fine-tuning a Vision Transformer on a common benchmark dataset. It is expected to work best on images similar to those in the cifar10 dataset (small, low-resolution images of the 10 specified classes).

Limitations:

  • The model will likely perform poorly on images that are significantly different from the cifar10 data (e.g., high-resolution photos, medical images, or classes not seen during training).
  • The training data may reflect biases present in the original cifar10 dataset.
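
One way to soften the first limitation is to decline to answer when the model's top score is low. The sketch below does this for the list of label/score dictionaries that the pipeline returns. The function name, the 0.8 threshold, and the sample predictions are all hypothetical. Classifier scores are often poorly calibrated and can be high even on out-of-distribution images, so this is a coarse filter rather than a reliable detector.

```python
from typing import Dict, List, Optional, Union

Prediction = Dict[str, Union[str, float]]


def top_label_or_none(predictions: List[Prediction], min_score: float = 0.8) -> Optional[str]:
    """Return the best label, or None when its score is below min_score.

    min_score is an arbitrary illustrative threshold, not a tuned value.
    """
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= min_score else None


# Hypothetical pipeline outputs, written by hand for illustration.
confident = [{"label": "cat", "score": 0.97}, {"label": "dog", "score": 0.02}]
uncertain = [{"label": "frog", "score": 0.41}, {"label": "bird", "score": 0.38}]

print(top_label_or_none(confident))  # cat
print(top_label_or_none(uncertain))  # None
```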