AI & ML interests

Language model development for Bulgarian and non-English languages, low-resource language processing, multilingual model fine-tuning, dataset curation and evaluation frameworks, custom model development, inference optimization, and research in underrepresented language AI applications.

Recent Activity

s-emanuilov posted an update 27 days ago
Converted PaddleOCR models to ONNX for easier deployment and faster inference.

These have been working well in production at Monkt.com, so figured I'd share them with the community.

Just straight conversions of the original models—might save you some time if you're building OCR pipelines.

monkt/paddleocr-onnx
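A minimal sketch of running one of the converted models with onnxruntime. The model filename, input size, and ImageNet-style normalization constants here are assumptions based on typical PaddleOCR preprocessing, not taken from the repo — check the model card for the exact values.

```python
import numpy as np


def preprocess(image: np.ndarray) -> np.ndarray:
    """Normalize an HxWx3 uint8 image into a 1x3xHxW float32 NCHW tensor.

    Mean/std values are the usual ImageNet constants PaddleOCR detection
    models tend to use; verify against the model card.
    """
    x = image.astype(np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    x = (x - mean) / std
    return x.transpose(2, 0, 1)[None]  # HWC -> NCHW with batch dim


def run_detection(model_path: str, image: np.ndarray):
    """Run a converted detection model on one image (hypothetical path)."""
    import onnxruntime as ort  # pip install onnxruntime

    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    return sess.run(None, {input_name: preprocess(image)})


# e.g. run_detection("det_model.onnx", my_image_array)
```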
s-emanuilov posted an update 3 months ago
Ran MTEB evaluation on Bulgarian tasks comparing EmbeddingGemma-300M (google/embeddinggemma-300m) vs Multilingual-E5-Large (intfloat/multilingual-e5-large).

EmbeddingGemma-300M scored a 71.6% average while E5-Large got 75.9%. Pretty solid results for EmbeddingGemma considering it's half the size and uses far fewer resources.

EmbeddingGemma actually beats E5-Large on sentiment analysis and natural language inference. E5-Large wins on retrieval and bitext mining tasks.

The 300M model also has a 4x longer context window (2048 vs. 512 tokens) and a lower carbon footprint.

Both models work great for Bulgarian but have different strengths depending on what you need.
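For anyone wanting to rerun a comparison like this, here's a rough sketch using the mteb package. Selecting tasks by the "bul" language code and the output folder name are my assumptions; the original evaluation's exact task list isn't given in the post.

```python
def run_bulgarian_mteb(model_name: str):
    """Evaluate a sentence-transformers model on Bulgarian MTEB tasks.

    Requires: pip install mteb sentence-transformers
    Downloads the model and task datasets on first use.
    """
    import mteb
    from sentence_transformers import SentenceTransformer

    # "bul" is the ISO 639-3 code for Bulgarian; this pulls every task
    # that covers the language, which may be broader than the original run.
    tasks = mteb.get_tasks(languages=["bul"])
    evaluation = mteb.MTEB(tasks=tasks)
    model = SentenceTransformer(model_name)
    return evaluation.run(model, output_folder="results")


# e.g. run_bulgarian_mteb("google/embeddinggemma-300m")
```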

Blog article about the usage: https://huggingface.co/blog/embeddinggemma

PS: Don't forget to use the recommended library versions :D

pip install git+https://github.com/huggingface/transformers@v4.56.0-Embedding-Gemma-preview
pip install "sentence-transformers>=5.0.0"
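Once installed, a quick way to sanity-check either model on Bulgarian text is to compare sentence embeddings directly. The Bulgarian example sentences below are my own illustration, not from the evaluation:

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def compare(model_name: str, s1: str, s2: str) -> float:
    """Embed two sentences with the given model and return their similarity."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    emb = model.encode([s1, s2])
    return cosine_sim(emb[0], emb[1])


# e.g. compare("google/embeddinggemma-300m",
#              "Котката спи на дивана.",   # "The cat sleeps on the sofa."
#              "Кучето лае в двора.")      # "The dog barks in the yard."
```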

s-emanuilov posted an update 3 months ago
s-emanuilov updated a Space 3 months ago