AI & ML interests

Language model development for Bulgarian and non-English languages, low-resource language processing, multilingual model fine-tuning, dataset curation and evaluation frameworks, custom model development, inference optimization, and research in underrepresented language AI applications.

Recent Activity

s-emanuilov posted an update 27 days ago
Converted PaddleOCR models to ONNX for easier deployment and faster inference.

These have been working well in production at Monkt.com, so figured I'd share them with the community.

Just straight conversions of the original models—might save you some time if you're building OCR pipelines.

monkt/paddleocr-onnx
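A minimal sketch of running one of the converted models with onnxruntime. The model filename, input size, and ImageNet-style normalization constants here are assumptions based on typical PaddleOCR preprocessing, not taken from the repo — check the model card for the exact values.

```python
import numpy as np


def preprocess(image: np.ndarray) -> np.ndarray:
    """Normalize an HxWx3 uint8 image into a 1x3xHxW float32 NCHW tensor.

    Mean/std values are the usual ImageNet constants PaddleOCR detection
    models tend to use; verify against the model card.
    """
    x = image.astype(np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    x = (x - mean) / std
    return x.transpose(2, 0, 1)[None]  # HWC -> NCHW with batch dim


def run_detection(model_path: str, image: np.ndarray):
    """Run a converted detection model on one image (hypothetical path)."""
    import onnxruntime as ort  # pip install onnxruntime

    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    return sess.run(None, {input_name: preprocess(image)})


# e.g. run_detection("det_model.onnx", my_image_array)
```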
s-emanuilov posted an update 3 months ago
Ran MTEB evaluation on Bulgarian tasks comparing EmbeddingGemma-300M (google/embeddinggemma-300m) vs Multilingual-E5-Large (intfloat/multilingual-e5-large).

EmbeddingGemma-300M scored a 71.6% average while E5-Large got 75.9%. Pretty solid results for EmbeddingGemma considering it's half the size and uses far fewer resources.

EmbeddingGemma actually beats E5-Large on sentiment analysis and natural language inference. E5-Large wins on retrieval and bitext mining tasks.

The 300M model also has a 4x longer context window (2048 vs. 512 tokens) and a lower carbon footprint.

Both models work great for Bulgarian but have different strengths depending on what you need.
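For anyone wanting to rerun a comparison like this, here's a rough sketch using the mteb package. Selecting tasks by the "bul" language code and the output folder name are my assumptions; the original evaluation's exact task list isn't given in the post.

```python
def run_bulgarian_mteb(model_name: str):
    """Evaluate a sentence-transformers model on Bulgarian MTEB tasks.

    Requires: pip install mteb sentence-transformers
    Downloads the model and task datasets on first use.
    """
    import mteb
    from sentence_transformers import SentenceTransformer

    # "bul" is the ISO 639-3 code for Bulgarian; this pulls every task
    # that covers the language, which may be broader than the original run.
    tasks = mteb.get_tasks(languages=["bul"])
    evaluation = mteb.MTEB(tasks=tasks)
    model = SentenceTransformer(model_name)
    return evaluation.run(model, output_folder="results")


# e.g. run_bulgarian_mteb("google/embeddinggemma-300m")
```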

Blog article about the usage: https://huggingface.co/blog/embeddinggemma

PS: Don't forget to use the recommended library versions :D

pip install git+https://github.com/huggingface/transformers@v4.56.0-Embedding-Gemma-preview
pip install "sentence-transformers>=5.0.0"
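Once installed, a quick way to sanity-check either model on Bulgarian text is to compare sentence embeddings directly. The Bulgarian example sentences below are my own illustration, not from the evaluation:

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def compare(model_name: str, s1: str, s2: str) -> float:
    """Embed two sentences with the given model and return their similarity."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    emb = model.encode([s1, s2])
    return cosine_sim(emb[0], emb[1])


# e.g. compare("google/embeddinggemma-300m",
#              "Котката спи на дивана.",   # "The cat sleeps on the sofa."
#              "Кучето лае в двора.")      # "The dog barks in the yard."
```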

s-emanuilov posted an update 3 months ago
s-emanuilov updated a Space 3 months ago