• ggtdbz@lemmy.dbzer0.com · 8 hours ago

    That model is over a terabyte; I don’t know why I thought it was lightweight. Not that any reporting on machine learning has been particularly good, but this isn’t what I expected at all.

    What can even run it?

    • brucethemoose@lemmy.world · edited 4 hours ago

      A lot, but less than you’d think! Basically an RTX 3090/Threadripper system with a lot of RAM (192GB?).

      With this framework, specifically: https://github.com/ikawrakow/ik_llama.cpp?tab=readme-ov-file

      The “dense” part of the model can stay on the GPU while the experts are offloaded to the CPU, and the whole thing can be quantized to ~3 bits per weight on average, instead of the 8 bits of the full model. Rough numbers below.
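      As a back-of-the-envelope sketch (the ~671B total / ~37B active parameter counts assume a DeepSeek-R1-class model, and pinning exactly the active parameters to the GPU is a simplification):

      ```python
      # Rough memory footprint for MoE offloading. Parameter counts
      # assume a DeepSeek-R1-class model (~671B total, ~37B active per
      # token); treating the active part as what stays on the GPU is a
      # simplification. Illustrative numbers, not measurements.

      TOTAL_PARAMS = 671e9   # assumed total parameter count
      GPU_PARAMS = 37e9      # assumed portion pinned to the GPU

      def gigabytes(params: float, bits_per_weight: float) -> float:
          """Convert a parameter count at a given quantization to GB."""
          return params * bits_per_weight / 8 / 1e9

      full = gigabytes(TOTAL_PARAMS, 8)    # the released 8-bit weights
      quant = gigabytes(TOTAL_PARAMS, 3)   # ~3-bit average quant
      gpu = gigabytes(GPU_PARAMS, 3)       # kept in VRAM
      cpu = quant - gpu                    # experts, offloaded to system RAM

      print(f"8-bit full model:    {full:6.0f} GB")   # ~671 GB
      print(f"~3-bit quant:        {quant:6.0f} GB")  # ~252 GB
      print(f"  on GPU:            {gpu:6.0f} GB")    # ~14 GB, fits a 3090
      print(f"  on CPU (experts):  {cpu:6.0f} GB")    # hence the big RAM
      ```

      The exact split depends on the quant and on which tensors you pin to the GPU.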

      That’s just a hack for personal use, though. The intended way to run it is on a couple of H100 boxes, serving many, many users at once. LLMs run more efficiently in parallel because token generation is mostly memory-bandwidth-bound, e.g. generating tokens for 4 users isn’t much slower than generating them for 2, and Deepseek explicitly architected it to be really fast at scale. It is “lightweight” in that sense. A toy model of the effect:
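      (The model-size and bandwidth numbers below are assumptions for illustration, not benchmarks.)

      ```python
      # Toy model of why batching is nearly free during decoding: every
      # decode step must stream the model weights through memory once,
      # no matter how many sequences share the batch (until the GPUs
      # become compute-bound). Assumed numbers, not benchmarks.

      WEIGHT_BYTES = 700e9   # assumed 8-bit model, ~700 GB of weights
      BANDWIDTH = 26.8e12    # assumed aggregate HBM bandwidth, 8x H100

      def total_tokens_per_second(batch_size: int) -> float:
          """One decode step streams the weights once and yields one
          token per sequence in the batch."""
          step_time = WEIGHT_BYTES / BANDWIDTH  # seconds per decode step
          return batch_size / step_time

      for batch in (1, 2, 4, 8):
          print(f"batch {batch}: {total_tokens_per_second(batch):6.1f} tok/s total")
      # Throughput scales ~linearly with batch size while per-user speed
      # stays roughly flat, which is what makes serving many users cheap.
      ```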

      …But if you have a “sane” system, it’s indeed a bit large. The best I can run on my 24GB-VRAM system are 32B–49B dense models (like Qwen 3 or Nemotron), or ~70B mixture-of-experts models (like the new Hunyuan 70B).
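
      Same back-of-the-envelope check for a 24GB card (the bits-per-weight and KV-cache figures are assumed typical values, not measurements of any specific quant):

      ```python
      # Rough check of what fits in 24 GB of VRAM. Bits-per-weight and
      # KV-cache overhead are assumed typical values, not measurements
      # of any specific quant.

      VRAM_GB = 24
      KV_CACHE_GB = 3   # assumed context/KV-cache overhead

      def weight_gb(params_b: float, bits_per_weight: float) -> float:
          """Weight memory in GB for a model of params_b billion params."""
          return params_b * bits_per_weight / 8

      for name, params_b, bpw in [
          ("32B dense @ ~4.5 bpw", 32, 4.5),
          ("49B dense @ ~3.0 bpw", 49, 3.0),
          ("70B dense @ ~4.5 bpw", 70, 4.5),   # for contrast: doesn't fit
      ]:
          total = weight_gb(params_b, bpw) + KV_CACHE_GB
          verdict = "fits" if total <= VRAM_GB else "does not fit"
          print(f"{name}: ~{total:.0f} GB -> {verdict} in {VRAM_GB} GB")
      # A ~70B MoE still works on this setup because most of its weights
      # are experts that can sit in system RAM, as described above.
      ```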