A lot, but less than you’d think! Basically an RTX 3090/Threadripper system with a lot of RAM (192GB?).
With this framework, specifically: https://github.com/ikawrakow/ik_llama.cpp?tab=readme-ov-file
The “dense” part of the model can stay on the GPU while the experts are offloaded to the CPU, and the whole thing can be quantized to ~3 bits per weight on average, instead of the 8 bits of the full model.
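As a rough sanity check on those numbers, here’s a back-of-the-envelope sizing sketch. It assumes a DeepSeek-R1/V3-class MoE of ~671B total parameters (a figure not stated above) and ignores KV cache and runtime buffers:

```python
# Back-of-the-envelope weight sizing for a large MoE checkpoint.
# The 671B parameter count is an assumption (DeepSeek-V3/R1-class model);
# KV cache, activations, and runtime buffers are ignored.

def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB."""
    return n_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 671e9  # assumed total parameter count

for bpw in (8.0, 4.0, 3.0, 2.5):
    print(f"{bpw:3.1f} bits/weight -> ~{weight_gb(TOTAL_PARAMS, bpw):5.0f} GB of weights")
```

At ~3 bits per weight that still works out to roughly 250 GB of weights, which is why the experts have to live in system RAM and only the dense layers plus KV cache stay on the 24GB card.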
That kind of setup is just a hack for personal use, though. The intended way to run it is on a couple of H100 boxes, serving many, many users at once. LLMs run more efficiently when they serve requests in parallel, e.g. generating tokens for 4 users isn’t much slower than generating them for 2, and DeepSeek explicitly architected it to be really fast at scale. It is “lightweight” in that sense.
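To see why batching is nearly free during decoding, here’s a toy throughput model. The bandwidth, active-weight, and per-sequence cost numbers are made-up illustrative assumptions, not benchmarks; the point is only that the big weight read per decode step is shared across the whole batch:

```python
# Toy model: per-token decoding is dominated by streaming the active weights
# from memory, and that read is amortized over every sequence in the batch.
# All numbers are illustrative assumptions, not measurements.

def tokens_per_second(batch: int,
                      weight_bytes: float = 40e9,     # assumed active weights read per step
                      bandwidth: float = 2e12,        # assumed ~2 TB/s of HBM bandwidth
                      per_seq_cost_s: float = 1e-4):  # assumed extra cost per sequence (KV, compute)
    step_time = weight_bytes / bandwidth + batch * per_seq_cost_s
    return batch / step_time  # aggregate tokens generated per second

for b in (1, 2, 4, 8):
    print(f"batch {b}: ~{tokens_per_second(b):6.0f} tok/s total")
```

The step time barely grows with batch size, so aggregate throughput scales almost linearly with the number of users.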
…But if you have a “sane” system, it’s indeed a bit large. The best I can run on my 24GB VRAM system are 32B-49B dense models (like Qwen 3 or Nemotron), or 70B mixture-of-experts models (like the new Hunyuan 70B).
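For a quick sense of why that 24GB ceiling lands where it does, here’s a rough fit check. The bits-per-weight values are typical GGUF-style quant levels, and the ~2 GB overhead figure for KV cache and buffers is an illustrative assumption:

```python
# Rough fit check for dense models on a single 24 GB GPU.
# Parameter counts, quant levels, and the ~2 GB overhead are illustrative assumptions.

def fits_24gb(n_params_b: float, bits_per_weight: float, overhead_gb: float = 2.0) -> None:
    weights_gb = n_params_b * 1e9 * bits_per_weight / 8 / 1e9
    total_gb = weights_gb + overhead_gb  # overhead ~ KV cache + buffers (assumption)
    verdict = "fits" if total_gb <= 24 else "does not fit"
    print(f"{n_params_b:>4.0f}B @ {bits_per_weight:.1f} bpw: "
          f"{weights_gb:5.1f} GB weights + ~{overhead_gb:.0f} GB -> {verdict}")

fits_24gb(32, 4.5)   # ~18 GB of weights: comfortable
fits_24gb(49, 3.5)   # ~21 GB of weights: tight, short context
fits_24gb(70, 4.5)   # ~39 GB of weights: needs CPU offload or more GPUs
```

A dense 70B at a usable quant is already well past 24 GB, which is why anything bigger only works with tricks like the CPU offload described above.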