TachiwinOCR

for the Indigenous Languages of Mexico

16-bit (BF16) precision

This is a PaddleOCR-VL fine-tune specialized in the 68 Indigenous languages of Mexico and their diverse character and glyph repertoires, a world first for technology access and linguistic rights.

Inference

You can perform inference using the PaddleOCR pipeline or the transformers library.

Option A: Using PaddleOCR (Easy Pipeline)

from paddleocr import PaddleOCRVL

# Load the fine-tuned model
pipeline = PaddleOCRVL(
    vl_rec_model_name="PaddleOCR-VL-0.9B",
    vl_rec_model_dir="path/to/tachiwin_model",  # local directory containing the downloaded Tachiwin weights
)

# Predict on an image
output = pipeline.predict("test.png")

# Print each result and export it to JSON and Markdown
for res in output:
    res.print()
    res.save_to_json(save_path="output")
    res.save_to_markdown(save_path="output")
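
To transcribe a whole folder of scans, the same pipeline can be reused across files. A minimal sketch, assuming the downloaded weights sit in ./tachiwin_model and the page images in ./scans (both paths are placeholders):

from pathlib import Path
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL(
    vl_rec_model_name="PaddleOCR-VL-0.9B",
    vl_rec_model_dir="./tachiwin_model",  # assumed local path to the downloaded weights
)

for image_file in sorted(Path("scans").glob("*.png")):
    for res in pipeline.predict(str(image_file)):
        res.save_to_markdown(save_path="output")  # one Markdown file per page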

Option B: Using Transformers (Advanced Control)

from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# ---- Settings ----
model_path = "tachiwin/PaddleOCR-VL-Tachiwin-BF16"
image_path = "test.png"
# ------------------

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

image = Image.open(image_path).convert("RGB")

# Load the fine-tuned model in bfloat16; trust_remote_code enables the custom PaddleOCR-VL model code
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).to(DEVICE).eval()
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Build a chat-style prompt: the page image followed by the "OCR:" instruction
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "OCR:"},
    ]}
]

# Apply the chat template, tokenize, and move the tensors to the target device
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(DEVICE)

# Generate the transcription and decode the output tokens
outputs = model.generate(**inputs, max_new_tokens=1024, min_new_tokens=1)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(generated_text)
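
Depending on the tokenizer configuration, the decoded string can include the prompt as well as the transcription. A common transformers pattern (an assumption here, not something documented for this specific checkpoint) is to decode only the newly generated tokens:

new_tokens = outputs[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
ocr_text = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(ocr_text)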

📊 Benchmark Results

Tachiwin-OCR was evaluated against the base PaddleOCR-VL model using a diverse subset of Indigenous language samples. The fine-tuning results demonstrate significant improvements in both character and word recognition accuracy.

Summary Metrics

| Metric | Base Model (Raw) | Tachiwin-OCR (Fine-tuned) | Improvement |
| --- | --- | --- | --- |
| Character Error Rate (CER) | 7.59% | 6.80% | 10.4% relative reduction |
| Word Error Rate (WER) | 25.17% | 17.36% | 7.81 pp absolute reduction |
| OCR Accuracy (1 - CER) | 92.41% | 93.20% | +0.79 pp absolute |
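
For reference, CER and WER are edit-distance metrics. The sketch below is not the evaluation script used for these numbers; it simply shows one way to compute both metrics in plain Python against your own ground-truth transcriptions:

def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (x != y),   # substitution (free if the items match)
            ))
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)

reference = "tachiwin ocr example"   # ground-truth transcription (toy string)
hypothesis = "tachiwln ocr example"  # model output with one character error
print(f"CER: {cer(reference, hypothesis):.2%}  WER: {wer(reference, hypothesis):.2%}")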

Detailed Comparison (Sample)

A subset of the evaluation results across languages; tonal languages benefit the most from this fine-tuning:

| Language | Raw CER | FT CER | Raw WER | FT WER | CER Reduction |
| --- | --- | --- | --- | --- | --- |
| stp (Tepehuán) | 10.95% | 0.00% | 43.55% | 0.00% | 10.95 pp |
| maz (Central Mazahua) | 3.29% | 0.41% | 9.09% | 0.00% | 2.88 pp |
| chj (Ojitlán Chinantec) | 16.97% | 2.21% | 52.78% | 9.72% | 14.76 pp |
| maa (Tecóatl Mazatec) | 86.70% | 8.49% | 105.08% | 10.17% | 78.21 pp |

Key Findings

  • High Accuracy Gains: For tonal languages such as Tepehuán (stp) and Mazatec (maa), fine-tuning cut the character error rate from 10.95% to 0.00% and from 86.70% to 8.49%, respectively.
  • Robustness: The model shows high resilience against synthetic distortions implemented during the data generation phase.
  • Word-Level Performance: The reduction in Word Error Rate (from 25.17% to 17.36%) highlights the model's improved ability to contextualize character sequences specific to these language families.

Tachiwin (from the Totonac word for "language") is dedicated to bridging the digital divide for the Indigenous languages of Mexico through AI technology.

  • Developed by: Tachiwin
  • License: apache-2.0
  • Fine-tuned from model: PaddlePaddle/PaddleOCR-VL