Reproducing DeepSeek's MHC: When Residual Connections Explode

taylorkolasinski.com

107 points by taykolasinski 19 hours ago


cpldcpu - 17 hours ago

Might be worth pointing out that this is not the first residual-connection innovation to make it into production.

Gemma 3n is also using a low-rank projection of the residual stream, called LAuReL. Google didn't publicize this much; I noticed it while poking around in the model file.

https://arxiv.org/pdf/2411.07501v3

https://old.reddit.com/r/LocalLLaMA/comments/1kuy45r/gemma_3...

Seems to be what they call LAuReL-LR in the paper, with D=2048 and R=64.
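For reference, here's my rough reading of LAuReL-LR as a PyTorch sketch (untested; the shapes come from the numbers above, and the init choices are my guesses, not from the paper):

    import torch
    import torch.nn as nn

    class LaurelLR(nn.Module):
        # Rough sketch: the plain residual x_next = f(x) + x becomes
        #   x_next = alpha * f(x) + (I + A @ B) @ x
        # with low-rank factors A (D x R) and B (R x D).
        def __init__(self, d_model: int = 2048, rank: int = 64):
            super().__init__()
            self.alpha = nn.Parameter(torch.ones(()))
            # Zero-init A so training starts from the plain residual
            # path (my guess, not from the paper).
            self.A = nn.Parameter(torch.zeros(d_model, rank))
            self.B = nn.Parameter(torch.randn(rank, d_model) * 0.02)

        def forward(self, x, fx):
            # x: residual stream input; fx: block output f(x); both (..., D)
            low_rank = (x @ self.A) @ self.B
            return self.alpha * fx + x + low_rank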

taykolasinski - 19 hours ago

OP here. I spent the last few days reproducing the mHC architecture from the recent DeepSeek paper (2512.24880).

Two key takeaways from the reproduction:

1. Unconstrained Hyper-Connections really do explode (7x amplification even at 10M scale). There's a toy sketch of the mechanism at the end of this comment.

2. I hit a nasty "stream persistence" bug where my tensors were the right shape, but the architecture was functionally broken.

This is Part 1 (10M scale). Part 2 (scaling to 1B on A100s) is coming later this week. Happy to answer questions about the implementation.
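For a feel of the explosion mechanism, here's a toy sketch (not my actual repro code; the printed numbers are illustrative, not the 7x figure from the post): compounding unconstrained residual-mixing matrices inflates the stream norm layer by layer, while stochastic mixing rows preserve it.

    import torch

    torch.manual_seed(0)
    n, layers = 4, 24

    # Unconstrained mixing: small perturbations around the identity
    # still have gain slightly above 1, and the gains compound.
    h = torch.ones(n)
    for _ in range(layers):
        H_res = torch.eye(n) + 0.2 * torch.randn(n, n)
        h = H_res @ h
    print(f"unconstrained gain: {h.norm() / n ** 0.5:.2f}x")

    # Row-stochastic mixing (rows non-negative, summing to 1) maps the
    # all-ones stream to itself, so the scale stays fixed at any depth.
    h = torch.ones(n)
    for _ in range(layers):
        H_res = torch.softmax(torch.randn(n, n), dim=-1)
        h = H_res @ h
    print(f"row-stochastic gain: {h.norm() / n ** 0.5:.2f}x")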

Scene_Cast2 - 17 hours ago

I implemented this for a toy 8M ViT-style model and got neutral results. This is just an anecdote and not representative; I think mHC will help at larger parameter counts and larger token counts.

AlexCoventry - 10 hours ago

What's the advantage of having multiple channels with separate residual connections? Why not just concatenate those channels, and do residual connections on the concatenated channel?
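To make the alternative concrete, something like this toy sketch (names made up):

    import torch

    # Flatten n streams of width d into one channel of width n * d and
    # use a single ordinary residual connection on it.
    n, d = 4, 256
    streams = torch.randn(n, d)

    x = streams.reshape(n * d)     # one concatenated channel
    f = torch.tanh                 # stand-in for the layer
    x_next = x + f(x)              # plain residual connection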

in-silico - 14 hours ago

Why can't you just leave H_res as the identity matrix (or just not use it at all)? In that case, the model is basically a ResNet again and you don't need to worry about exploding/vanishing gradients from H_res.

I would think that H_post and H_pre could cover the lost expressiveness.
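To make the question concrete, here's the update as I understand it (the shapes are my guesses, not lifted from the paper); with H_res = I, the residual path reduces to each stream carrying itself forward, ResNet-style:

    import torch

    n, d = 4, 256
    h = torch.randn(n, d)       # n parallel residual streams

    H_pre = torch.randn(1, n)   # mixes the n streams into one block input
    H_post = torch.randn(n, 1)  # distributes the block output back
    H_res = torch.eye(n)        # the simplification proposed above

    f = torch.tanh              # stand-in for the attention/MLP block
    h_next = H_post @ f(H_pre @ h) + H_res @ h   # shape (n, d)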

theschwa - 16 hours ago

Between the clear writing and the diagrams, this was a great write-up. I had actually skipped reading up on mHC since it sounded like it would take some time to grok, but this made it immediately approachable. I hope you do more write-ups like this in the future.

solarkraft - 18 hours ago

I’ve been wondering for a while: Why isn’t this architecture more common in other LLMs? The context efficiency is amazing, after all - doesn’t that translate to a lot of money at scale?

sbondaryev - 17 hours ago

Nice visualization of the residual connections. Is the animated SVG manually created or programmatically generated? What tools did you use?

john-titor - 11 hours ago

Great write-up. It's been a while since I've had the pleasure of reading a straightforward blog post about ML tricks that feel genuinely applicable to many use cases.