14 41 461

Matricardi Fabio

FM-1976

https://medium.com/@fabio.matricardi

AI & ML interests

control system engineering, AI, LLM with python. ThePoorGPUguy on substack

Recent Activity

liked a model about 12 hours ago

mradermacher/Llama-3.2-1B-4B-Quad-MoE-GGUF

liked a model about 22 hours ago

ethicalabs/Kurtis-EON1

reacted to mrs83's post with 🔥 about 22 hours ago

In 2017, my RNNs were babbling. Today, they are hallucinating beautifully. 10 years ago, getting an LSTM to output coherent English was a struggle. 10 years later, after a "cure" based on FineWeb-EDU and a custom synthetic mix for causal conversation, the results are fascinating. We trained this on ~10B tokens on a single AMD GPU (ROCm). It is not a Transformer: Echo-DSRN (400M) is a novel recurrent architecture inspired by Hymba, RWKV, and xLSTM, designed to challenge the "Attention is All You Need" monopoly on the Edge. The ambitious goal is to build a small instruct model with RAG and tool usage capabilities (https://huggingface.co/ethicalabs/Kurtis-EON1) 📊 The Benchmarks (Size: 400M) For a model this size (trained on <10B tokens), the specialized performance is surprising: *SciQ*: 73.8% 🦄 (This rivals billion-parameter models in pure fact retrieval). *PIQA*: 62.3% (Solid physical intuition for a sub-1B model). The Reality Check: HellaSwag (29.3%) and Winogrande (50.2%) show the limits of 400M parameters and 10B tokens training. We are hitting the "Reasoning Wall" which confirms we need to scale to (hopefully) unlock deeper common sense. As you can see in the visualization (to be released soon on HF), the FineWeb-EDU bias is strong. The model is convinced it is in a classroom ("In this course, we explore..."). The Instruct Model is not ready yet and we are currently using curriculum learning to test model plasticity. Source code and weights will not be released yet. This is not a fork or a fine-tune: the base model is built in-house at https://www.ethicalabs.ai/, with novel components that do not exist in current open libraries. 🤝 Call for Collaboration: I am looking for Peer Reviewers interested in recurrent/hybrid architectures. If you want to explore what lies beyond Transformers, let’s connect! Training diary: https://huggingface.co/ethicalabs/Kurtis-EON1

View all activity

Organizations

None yet

liked a model about 12 hours ago

mradermacher/Llama-3.2-1B-4B-Quad-MoE-GGUF

4B • Updated about 17 hours ago • 1

liked a model about 22 hours ago

ethicalabs/Kurtis-EON1

Text Generation • Updated 27 days ago • 2

reacted to mrs83's post with 🔥 about 22 hours ago

Post

734

In 2017, my RNNs were babbling. Today, they are hallucinating beautifully.

10 years ago, getting an LSTM to output coherent English was a struggle.
10 years later, after a "cure" based on FineWeb-EDU and a custom synthetic mix for causal conversation, the results are fascinating.

We trained this on ~10B tokens on a single AMD GPU (ROCm). It is not a Transformer: Echo-DSRN (400M) is a novel recurrent architecture inspired by Hymba, RWKV, and xLSTM, designed to challenge the "Attention is All You Need" monopoly on the Edge.

The ambitious goal is to build a small instruct model with RAG and tool usage capabilities ( ethicalabs/Kurtis-EON1)

📊 The Benchmarks (Size: 400M)

For a model this size (trained on <10B tokens), the specialized performance is surprising:

*SciQ*: 73.8% 🦄 (This rivals billion-parameter models in pure fact retrieval).
*PIQA*: 62.3% (Solid physical intuition for a sub-1B model).

The Reality Check:

HellaSwag (29.3%) and Winogrande (50.2%) show the limits of 400M parameters and 10B tokens training.

We are hitting the "Reasoning Wall" which confirms we need to scale to (hopefully) unlock deeper common sense. As you can see in the visualization (to be released soon on HF), the FineWeb-EDU bias is strong. The model is convinced it is in a classroom ("In this course, we explore...").

The Instruct Model is not ready yet and we are currently using curriculum learning to test model plasticity.

Source code and weights will not be released yet. This is not a fork or a fine-tune: the base model is built in-house at https://www.ethicalabs.ai/, with novel components that do not exist in current open libraries.

🤝 Call for Collaboration: I am looking for Peer Reviewers interested in recurrent/hybrid architectures. If you want to explore what lies beyond Transformers, let’s connect!

Training diary: ethicalabs/Kurtis-EON1

4 replies

liked a model about 22 hours ago

NoesisLab/NanoHammer-1.5B-Instruct

Text Generation • 2B • Updated about 1 hour ago • 10 • 3

liked a model 1 day ago

spartan8806/ATLES-1.5B

Text Generation • 2B • Updated 2 days ago • 7 • 1

reacted to marksverdhei's post with 👍 4 days ago

Post

4482

Poll: Will 2026 be the year of subquadratic attention?

The transformer architecture is cursed by its computational complexity.
It is why you run out of tokens and have to compact. But some would argue that this is a feature not a bug and that this is also why these models are so good. We've been doing a lot of research on trying to make equally good models that are computationally cheaper, But so far, none of the approaches have stood the test of time. Or so it seems.

Please vote, don't be shy. Remember that the Dunning-Kruger effect is very real, so the person who knows less about transformers than you is going to vote. We want everyone's opinion, no matter confidence.

👍 if you think at least one frontier model* will have no O(n^2) attention by the end of 2026
🔥 If you disagree

* Frontier models - models that match / outperform the flagship claude, gemini or chatgpt at the time on multiple popular benchmarks