Discussion
LLM Neuroanatomy: How I Topped the AI Leaderboard Without Changing a Single Weight
dnhkng: Author here. I found that duplicating a specific block of 7 middle layers in Qwen2-72B, without modifying any weights, improved performance across all Open LLM Leaderboard benchmarks and took #1. As of 2026, the top 4 models on that leaderboard are still descendants.

The weird finding: single-layer duplication does nothing. Too few layers, nothing. Too many, and it gets worse. Only circuit-sized blocks of ~7 layers work. This suggests pretraining carves out discrete functional circuits in the layer stack that only work when preserved whole.

The whole thing was developed on 2x RTX 4090s in my basement. I'm now running current models (GLM-4.7, Qwen3.5, MiniMax M2.5) on a dual GH200 rig (see my other post). Code and new models coming soon.

Happy to answer questions.
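The duplication described here amounts to bookkeeping over layer indices: the weights are untouched, and one contiguous block is simply executed twice. A minimal sketch of that index arithmetic (the block position used below is hypothetical; the post only says ~7 middle layers of the stack):

```python
def duplicate_block(n_layers, start, length):
    """Return the layer execution order with one contiguous block repeated.

    No weights change: the duplicated layers reuse the originals' weights,
    the stack just runs them a second time in sequence.
    """
    order = list(range(n_layers))
    block = order[start:start + length]
    # Run layers 0..start+length-1, replay the block, then continue.
    return order[:start + length] + block + order[start + length:]

# Hypothetical example: an 80-layer stack with a 7-layer middle block repeated.
order = duplicate_block(80, 38, 7)
```

The resulting 87-entry order is the kind of layer map that passthrough-style model surgery tools consume when stitching the new, deeper model together.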
naasking: This layer duplication strikes me as a bit of a "poor man's" version of looped language models: https://ouro-llm.github.io/

Pretty cool though. LLM brain surgery.
dnhkng: Agreed, but one thing to note: I really think from the experiments that 'organs' (not sure what to term this) develop during massive pretraining. This also means that looping the entire model may actually not be efficient. Maybe a better way is [linear input section -> loop 1 -> linear section -> loop 2 -> linear section -> ... -> loop n -> linear output]?

This would give 'organs' space to develop.
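The alternation proposed in the bracketed layout can be sketched abstractly, treating layers as plain callables (this is a hypothetical structure to illustrate the idea, not the author's code):

```python
def run_alternating_stack(x, segments):
    """Run a stack that alternates linear sections and looped sections.

    segments: list of ("linear", [layer_fns]) or ("loop", [layer_fns], n_iters).
    Linear sections run once; loop sections re-apply their layers n_iters times,
    giving the in-between 'organs' a fixed, unlooped home in the stack.
    """
    for seg in segments:
        if seg[0] == "linear":
            for fn in seg[1]:
                x = fn(x)
        else:  # ("loop", fns, n_iters)
            _, fns, n_iters = seg
            for _ in range(n_iters):
                for fn in fns:
                    x = fn(x)
    return x

# Toy demo with arithmetic standing in for layers.
add1 = lambda x: x + 1
double = lambda x: x * 2
stack = [("linear", [add1]), ("loop", [double], 2), ("linear", [add1])]
```

Looping whole models is the degenerate case of a single `("loop", ...)` segment; the layout above instead pins dedicated linear sections before, between, and after the loops.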
hmokiguess: I really enjoyed reading this. I feel like generalists intuitively experience this exact thing so much throughout their lives because they must have this neuroanatomy you describe. There’s a certain geometry to knowledge that makes possible for this orthogonal movement and it is really fascinating to me. Thank you for publishing this, you made my day!
goodmythical: Isn't this similar to models that have a "double check the answer" pass? First pass runs your input through, second pass runs its output as input?

Just, in double-checking it presumably runs the entire stack, while you're trying to skip the translation steps and only double-check the logic?
dnhkng: Maybe, but the interesting thing for me is that this only works with specific 'chunks' of the transformer layer stack. More or fewer layers than the optimal block leads to worse performance.
patchnull: This lines up with what I have seen doing CKA (centered kernel alignment) analysis on transformer internals. The middle layers in most large models have surprisingly similar representations to their neighbors, so duplicating them is basically giving the model extra compute cycles in a region where it is already doing useful refinement without messing up the input/output encoding stages. Curious whether picking layers by representation similarity instead of just a contiguous block would do even better.
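For reference, the linear variant of CKA mentioned here is easy to compute between two layers' activation matrices. A sketch assuming NumPy, with activations shaped (n_samples, dim):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two activation matrices (n_samples x dim).

    Returns a value in [0, 1]; 1 means the representations are identical up to
    an orthogonal transform and isotropic scaling.
    """
    # Center each feature column before comparing.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    numerator = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
    return numerator / denominator
```

Applied per layer pair, this yields exactly the kind of layer-by-layer similarity heatmap the parent comment describes: blocks of mutually similar middle layers show up as bright squares along the diagonal.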
dnhkng: Have a look at the boundaries in the heatmaps. They are of course open to interpretation, but they suggest to me that the models develop 'organs' for processing different types of data, and without duplicating the 'whole organ' you don't get the benefits.

This is quite different from what you usually see via layer ablation experiments. Thoughts?
Balinares: The idea that there may be a cognitive lingua franca hiding in the layers is fascinating and gives me hope for a neat idea: pluggable knowledge banks.

MoE notwithstanding, a model trained on the whole Internet and a few hundred thousand stolen books carries way more knowledge than is actually needed for any given workflow. It would be great if we could ship slimmed-down models into which we'd plug the knowledge banks useful for today's work, and only those.

It would also mean that you could keep a model's knowledge fresh without retraining the whole of it.
rob_c: Very awesome writeup, glad to see someone with access to hw actually playing with this.

Hopefully the cost per GPU will drop soon and we'll see people properly play, but frankly the "middle section" layers 2(ish) to (n-1)(ish) of a model can be shuffled up/down and left/right and still perform well.

The fun one will be an LLM router for LLM layers, to apply the best reasoning to the best input so far, but frankly that would need the years and years of training that the author hints at.

The one that's still out of grasp is how to combine/manipulate per-layer k,v caches into a globally coherent state. i.e. if layers can be moved up/down, why can't the cached k,v be swapped/combined with different projections? Global k,v caches work, but they have to be _huge_ to prevent model collapse even on something as simple as owt.
priowise: Very cool build. I’m always curious with experiments like this — was the biggest bottleneck compute, data curation, or evaluation methodology?
user_7832: A 5 hour old account with a standard chatgpt reply? Seriously, try harder.
user_7832: Thanks for the post, really cool stuff you did!

Extra thanks for writing it in a readable and approachable way! I don't have much of a background in this topic, but still managed to understand about 70-80% of it :) You're a good writer.
afpx: Thank you so much for sharing this in a delightful blog post. One of the more enjoyable things I've read in a while. Very motivating!