Qwen3-Coder: Agentic coding in the world
qwenlm.github.io
759 points by danielhanchen 4 days ago
I'm currently making 2bit to 8bit GGUFs for local deployment! Will be up in an hour or so at https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruc...
Also docs on running it in a 24GB GPU + 128 to 256GB of RAM here: https://docs.unsloth.ai/basics/qwen3-coder
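(Purely as an illustration of what "24GB GPU + lots of RAM" looks like in practice, here is a hedged sketch using llama-cpp-python. The filename, layer split and context size are placeholders, not the settings from the Unsloth docs - check the docs above for the actual recommended flags.)

    # Hedged sketch: loading a large GGUF with partial GPU offload via llama-cpp-python.
    # The filename, n_gpu_layers and n_ctx values are placeholders - tune to your VRAM/RAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL.gguf",  # placeholder filename
        n_ctx=65536,       # context window
        n_gpu_layers=20,   # offload whatever fits in ~24GB VRAM; the rest stays in system RAM
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Write a quicksort in Python."}],
        max_tokens=2048,   # output length, separate from the context window
    )
    print(out["choices"][0]["message"]["content"])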
Looks like the docs have a typo:
Recommended context: 65,536 tokens (can be increased)
That should be the recommended output length, as shown in the official docs: "Adequate Output Length: We recommend using an output length of 65,536 tokens for most queries, which is adequate for instruct models."
Oh thanks - so the output can be any length you like - I'm also making 1 million context length GGUFs! https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruc...
Do 2bit quantizations really work? All the ones I've seen/tried were completely broken even when 4bit+ quantizations worked perfectly. Even if it works for these extremely large models, is it really much better than using something slightly smaller on 4 or 5 bit quant?
Oh the Unsloth dynamic ones are not 2bit at all - they're a mixture of 2, 3, 4, 5, 6 and sometimes 8bit.
Important layers are kept in 8bit or 6bit; less important ones are left in 2bit! I talk more about it here: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs
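(To make the idea concrete, here's a toy sketch of a per-tensor bit-width plan - my own illustration, not Unsloth's actual recipe; the layer names and thresholds are made up.)

    # Toy sketch of a "dynamic" quant plan: one bit width per tensor instead of a
    # single global setting. Names and thresholds are invented for illustration.
    def plan_bits(layer_name: str, sensitivity: float) -> int:
        """Pick a bit width based on how much quantization hurts this tensor."""
        if "embed" in layer_name or "lm_head" in layer_name:
            return 8                 # embeddings / output head stay at high precision
        if sensitivity > 0.10:
            return 6                 # tensors where quantization error is large
        if sensitivity > 0.03:
            return 4
        return 2                     # everything else can take aggressive quantization

    # Sensitivities would normally come from a calibration pass over real prompts.
    example = {
        "model.embed_tokens.weight": 0.50,
        "model.layers.0.self_attn.q_proj.weight": 0.12,
        "model.layers.0.mlp.down_proj.weight": 0.02,
    }
    for name, s in example.items():
        print(f"{name}: {plan_bits(name, s)}bit")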
Not an AI researcher here, so this is probably common knowledge for people in this field, but I saw a video about quantization recently and wondered exactly about this: whether it's possible to compress a net by using more precision where it counts and less where it doesn't, and how one would go about deciding which parts count and which don't.
Great to know this is already a thing - I assume model "compression" is going to be the next hot topic.
Yes you're exactly thinking correctly! We shouldn't quantize a model naively to 2bit or 4bit, but we should do it smartly!
How do you pick which one should be 2bit, which one should be 4bit, etc.? Is this secret sauce, or something open?
Oh I wrote about it here: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs We might provide some scripts for them in the future!
Thanks! But I can't find any details on how you "intelligently adjust quantization for every possible layer" on that page. I assume this is a secret?
I am wondering whether different use cases might require different "intelligent quantization", e.g., quantization for an LLM doing financial analysis might differ from one doing code generation. I am currently doing a postdoc in this area. Interested in doing research together?
Oh we haven't published about it yet! I talk about it in bits and pieces - we might do a larger blog on it!
Yes, different use cases will be different - oh interesting! Sorry, I doubt I can be of much help in your research - I'm mainly an engineering guy, so less research focused!
How do you decide which layers are the important ones?
I wrote about it in rough terms in the blog and linked some papers! I also wrote about it here - https://unsloth.ai/blog/dynamic-4bit - one has to inspect the activation and weight quantization errors!
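(A rough sketch of what "inspect the activation and weight quantization errors" could mean in practice - my own illustration, not the Unsloth pipeline: fake-quantize each weight tensor, push calibration activations through, and rank layers by how much their outputs move.)

    # Sketch: rank layers by how much 2bit quantization perturbs their outputs
    # on calibration data. Simplified illustration only, not Unsloth's code.
    import torch

    def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
        """Symmetric round-to-nearest quantization, then dequantize back to float."""
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax
        return (w / scale).round().clamp(-qmax, qmax) * scale

    def layer_error(w: torch.Tensor, x: torch.Tensor, bits: int) -> float:
        """Relative change in a linear layer's output when its weights are quantized."""
        y_ref = x @ w.T
        y_q = x @ fake_quantize(w, bits).T
        return ((y_q - y_ref).norm() / y_ref.norm()).item()

    torch.manual_seed(0)
    x = torch.randn(32, 512)  # stand-in calibration activations (normally from real prompts)
    weights = {f"layer_{i}": torch.randn(512, 512) for i in range(4)}

    errors = {name: layer_error(w, x, bits=2) for name, w in weights.items()}
    for name, err in sorted(errors.items(), key=lambda kv: -kv[1]):
        print(f"{name}: 2bit output error = {err:.4f}")  # highest error => keep more bits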
So you are basically looking at "fMRI" of the "brain" while it's doing a wide range of things and cutting out the things that stay dark the most?
> The key reason to use Unsloth quants is because of our deep involvement in fixing critical bugs across major models
sounds convincing, eh ... /s
On a less cynical note, the approach does look interesting, but I'd also like to understand how and why it works, if it works at all.
Oh we actually fixed bugs! We fixed a few bugs in Gemma - see https://news.ycombinator.com/item?id=39671146, a gradient accumulation bug see https://news.ycombinator.com/item?id=41859037, Phi bugs, Llama bugs and more! See https://unsloth.ai/blog/reintroducing for more details!
What does your approach with dynamic weights have to do with those bugs? All those bugs seem unrelated to the technique.
Oh apologies I got confused - it's because when we calculate our dynamic quants, we have to do it on the fixed model!
In Phi 3, for example, the end-of-sentence token was wrong - if we calibrated with that, our quants would be calibrated incorrectly, since chatting with the model uses the actual correct token.
Another is Llama 4 - https://github.com/ggml-org/llama.cpp/pull/12889 in which I fixed a RoPE issue - if we didn't fix it first, then again the calibration process would be incorrect.
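(To make that dependency concrete, here's a toy illustration - token names and IDs are invented: the calibration statistics are computed over tokenized prompts, so a wrong special token or RoPE config changes what actually gets measured.)

    # Toy illustration of why calibration needs the fixed model: the token stream
    # used for calibration must match what real chat inference produces.
    # All token IDs below are invented for the example.
    def tokenize(prompt: str, eos_id: int) -> list[int]:
        fake_vocab = {"hello": 101, "world": 102}
        return [fake_vocab.get(tok, 0) for tok in prompt.split()] + [eos_id]

    CORRECT_EOS = 32000  # what real chat inference appends (made-up ID)
    BUGGY_EOS = 32007    # what a broken model config would append (made-up ID)

    prompt = "hello world"
    print(tokenize(prompt, CORRECT_EOS))  # calibration matches real usage
    print(tokenize(prompt, BUGGY_EOS))    # calibration drifts from real usage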
OK, so this is saying that your approach doesn't work without first applying those fixes to the vanilla models. What I'm trying to understand is the approach itself. Why and how does it work?