Mike Ravkine
Here's an example of a model that behaves perfectly well up to 8k: its entropy rises smoothly before it enters a struggle zone, collapses, finds a region of recovery, and finally falls down hard at the 16k wall.
Is your model implementation behaving badly like this?
Would you know if it was?
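If you want to check your own stack, here's a rough sketch of the measurement in plain numpy - how you pull the logits out of your runtime is up to you, and this is not the exact tooling behind the plot above:

```python
import numpy as np

def mean_token_entropy_bits(logits: np.ndarray) -> float:
    """Average next-token entropy (in bits) over a window of positions.
    `logits` has shape [positions, vocab_size]."""
    z = logits - logits.max(axis=-1, keepdims=True)          # numerically stable softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    ent = -(p * np.log2(np.clip(p, 1e-12, 1.0))).sum(axis=-1)
    return float(ent.mean())

# quick self-check on random logits; in practice, sweep your context in 1k windows
# and plot the mean entropy per window to look for the collapse / recovery / wall shape
rng = np.random.default_rng(0)
print(mean_token_entropy_bits(rng.normal(size=(16, 32000))))
```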
goal: understand how GGUF compression works - what exactly is being lost?
approach: quantize/dequantize some images and look at error maps
spent 80% of the time tracking down what turned out to be a data distribution assumption: real LLM weights are symmetric with a mean of 0, so our test image MUST retain these properties or the results turn into a kind of nonsense soup where Q5_1 beats Q8
with that issue solved, we have some fun results! from left to right:
- test pattern image (mean value is around 0.01)
- q8 error (almost nothing - some light banding in the gradients)
- q5km error (starting to see the 'blocks' around the circles)
- q4_0 error (this is why q4_1 is 'preferred')
- q3k error. q3k is a really interesting set of trade-offs: it does not have a block-offset, so it leans HARD into the 0-mean assumption - if you violate it locally, the results are BAD
- q2k error: q2k has a block-offset, so for certain patterns the errors are actually less than q3k (a rather counter-intuitive result)
looking at mxfp4, i-quants and the other stuff that's possible inside GGUF remains future work... aiming to clean up this repo and push it this week - feel free to ping me if you want to play sooner.
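If you want to poke at this before the repo lands, here's a minimal sketch of the per-block absmax idea behind Q8_0-style quantization - simplified, not the actual llama.cpp kernels, and the function name and toy "image" are mine:

```python
import numpy as np

def roundtrip_q8_0(x: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Quantize/dequantize with a per-block absmax scale and int8 values,
    roughly what Q8_0 does. Simplified sketch, not the llama.cpp kernel."""
    flat = x.reshape(-1, block_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0                                  # avoid /0 on all-zero blocks
    q = np.clip(np.round(flat / scale), -127, 127)
    return (q * scale).reshape(x.shape)

# toy zero-mean "image", standing in for LLM-weight-like data
img = np.random.default_rng(0).normal(0.0, 0.02, (256, 256)).astype(np.float32)
err = np.abs(img - roundtrip_q8_0(img))
print(f"mean abs error: {err.mean():.2e}  max abs error: {err.max():.2e}")
# visualize `err` with your favourite image library to get the error-map view
```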
564 tokens/sec on short 100-token sprints
96 tokens/sec on 8K-token marathons
TL;DR: You don't just run AI on AMD. You negotiate with it.
The hardware absolutely delivers. Spoiler alert: there is exactly ONE configuration in which vLLM + ROCm + Triton + PyTorch + drivers + Ubuntu kernel all work at the same time. Finding it required the patience of a saint.
Consumer AMD for AI inference is the ultimate "budget warrior" play: insane performance-per-euro, but you need hardcore technical skills that would make a senior sysadmin nod in quiet respect.
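If you want to sanity-check your own setup, here's a hedged sketch of measuring tokens/sec against a local vLLM OpenAI-compatible endpoint - the URL, model name and prompts are placeholders, not the benchmark harness behind the numbers above:

```python
import time, requests

URL = "http://localhost:8000/v1/completions"   # default vLLM OpenAI-compatible server

def bench(prompt: str, max_tokens: int) -> None:
    t0 = time.time()
    resp = requests.post(URL, json={
        "model": "your-model-name",            # placeholder
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }, timeout=3600).json()
    dt = time.time() - t0
    toks = resp["usage"]["completion_tokens"]
    print(f"{toks:5d} tokens in {dt:6.1f}s -> {toks / dt:6.1f} tok/s")

bench("Write a haiku about GPUs.", 100)              # short sprint
bench("Write a very long essay about GPUs.", 8000)   # long marathon
```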
We shrank the 1T model to 245GB (-62%) & retained ~85% of accuracy on Aider Polyglot. Run on >247GB RAM for fast inference.
We also collaborated with the Moonshot AI Kimi team on a system prompt fix! 🥰
Guide + fix details: https://docs.unsloth.ai/models/kimi-k2-thinking-how-to-run-locally
The way the coolers on these cards are designed is VERY unusual - this photo should come in the box! If you're having trouble in a closed case, see if you have space to add an intake on the bottom beside the PCIe edge - if all air is coming from the front like in this pic, the rear blower steals it all!
I am running 3x140mm 89CFM fans as intakes on the front, and the main trick sits directly underneath the in-blowers of those two 3090 FEs: a dedicated 140mm 140CFM fan blowing upwards to feed the intake blowers at the front/PCIe edge of these coolers. The cross-blower air passes through both cards and then has space to vent its heat on the right.
At 280W load the 'outside' card is bored, sitting at 50C with its blower fans at 50%, while the 'inside' card maintains 60-65C with its blowers closer to 70-80%.
The main thing to be careful about with the FE dual-blower coolers: do not try to pump air into the 'front' (PCIe side) of them! I see this configuration on many mining rigs, and it's fine for air-cooled cards, but the FEs actually have a blower venting out this PCIe-slot side, so you HAVE to feed them from either the rear or underneath.
When I had the 4-slot bridge installed across them, the rear-feed alone was sufficient.
what if we keep it simple: gzip the resulting text and take the length of the compressed stream... "compressed bytes of information per output token" becomes the KPI
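Something like this, as a rough sketch (the helper name, toy strings and token counts are mine, not the actual harness):

```python
import gzip

def compressed_bytes_per_token(text: str, n_output_tokens: int) -> float:
    """Hypothetical helper for the KPI above: gzip the model's output and report
    compressed bytes per generated token. Repetitive, low-information reasoning
    compresses well and drives this number down."""
    return len(gzip.compress(text.encode("utf-8"))) / max(n_output_tokens, 1)

# toy comparison, just to show the direction of the metric
loopy  = "I think the answer is 4. " * 40
varied = "First convert the units, then check the boundary case, then sum the parts."
print(compressed_bytes_per_token(loopy, 280))   # low -> little new information per token
print(compressed_bytes_per_token(varied, 16))   # higher -> denser information per token
```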
if we split across correct answers vs incorrect answers vs truncated answers and group by difficulty, a whole new world of analysis becomes not just possible but visually intuitive and almost trivial:
1) what is the model's overall reasoning efficiency? this is the slope of the scatterplot curve segments (there may be more than one...)
2) is the model able to apply more test-time compute towards more difficult variations of the task? the two on the left are not, the two on the right are.
3) when applying more test-time compute, is that compute useful? this is the curvature of the scatterplot trends - the two in the middle are 'losing their mojo': as answers get longer, the information content falls off
4) is the model applying multiple approaches to the task? (right) do those approaches change with difficulty?
5) are truncations happening because we don't have enough context budget (left), or because the model has lost its mind and gone into a repeat loop (middle two)? and does this happen across the board (middle left) or only when the problem is more difficult (middle right)?
would love to hear your feedback on this kind of analysis - is anyone doing similar work?
this approach generates 12 plots per model (one for each task), so it's quite a bit of data and i've been hesitant to publish it so far - consider this post a toe dip.
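For the curious, a hedged sketch of what one of those scatterplots could look like - the dataframe columns and numbers below are assumptions standing in for real eval output, not the actual data:

```python
import matplotlib.pyplot as plt
import pandas as pd

# made-up rows: one per model answer, with gzip size, token count, difficulty, outcome
df = pd.DataFrame({
    "output_tokens":    [120, 480, 900, 1500, 300, 2000],
    "compressed_bytes": [300, 900, 1400, 1800, 500, 2100],
    "difficulty":       [1, 1, 2, 3, 2, 3],
    "outcome":          ["correct", "correct", "incorrect", "truncated", "correct", "incorrect"],
})

fig, ax = plt.subplots()
for outcome, grp in df.groupby("outcome"):
    ax.scatter(grp["output_tokens"], grp["compressed_bytes"],
               s=20 + 20 * grp["difficulty"],   # bigger marker = harder variant
               label=outcome)
ax.set_xlabel("output tokens")
ax.set_ylabel("gzip-compressed bytes")          # slope of each trend ~ information per token
ax.legend()
plt.show()
```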

