[non record] Investigating the Tied-Embedding Bottleneck: Why Boundary Blocks Underperform and What It Means for 16MB Models #1546
Open
SPThole wants to merge 16 commits into openai:main
This started as a simple question — why does the last transformer block barely do anything? — and turned into a deeper investigation of how tied embeddings shape the entire capacity profile of a model, what the architecture is quietly spending parameters on to work around it, and where that leaves us for future improvements.
The experiment itself doesn't beat the baseline. It goes over budget. But the analysis behind it reveals structural patterns in the current architecture that we think the community should see.
The starting point: a really well-built model
The ImprovedParallelResiduals baseline (couldn't tag the PR here, I guess @msisovic closed it) (1.0744 BPB) is genuinely impressive. Depth recurrence on blocks 3-5 gives 17 virtual layers from 11 physical — 55% more depth at zero parameter cost. The parallel residual routing on the final 3 blocks lets attention and MLP specialize independently, and the model takes full advantage: block 8 routes 99% of its attention output into the MLP lane, which is a striking learned behavior. The U-net skip connections, per-head Q-gain scaling, per-dim residual mixing, CUTLASS EVT fusion for throughput, and legal score-first TTT with bigram hash embeddings — it all fits together tightly. At ~124 ms/step on 8xH100, this architecture is squeezing near-maximum value out of the 36M parameter budget.
Which is exactly why the remaining inefficiency is so visible when you look at the weights.
What the weight analysis revealed
We computed an "effective contribution" for each block — the product of each block's output projection norm and its learned scale factor — and found a pattern that's hard to unsee once you notice it.
The model's capacity follows a symmetric V-shape — strongest in the middle blocks, weakest at both boundaries:
The weakest block has barely a quarter of the capacity of the strongest. Blocks 0 and 10 together hold ~5.8M parameters but deliver only 9.8% of the model's total effective contribution. That's 16% of the parameters doing less than 10% of the work.
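The metric described above can be sketched in a few lines. This is an illustrative reconstruction, not the repo's code: the attribute names (`out_proj`, `scale`) and the toy scale values are assumptions, chosen only to show the shape of the computation and of a V-shaped profile.

```python
# Sketch of the "effective contribution" metric: the product of each
# block's output-projection Frobenius norm and its learned scale factor.
# Block structure (out_proj, scale) is illustrative, not the repo's.
import torch

def effective_contribution(blocks):
    """Per-block ||W_out||_F * |scale| — a rough proxy for how much
    each block actually injects into the residual stream."""
    return [
        block["out_proj"].norm().item() * abs(block["scale"].item())
        for block in blocks
    ]

# Toy example: a V-shaped profile like the one observed in the baseline
# (strong middle, weak boundaries). Scale values here are made up.
torch.manual_seed(0)
scales = [0.4, 0.7, 0.9, 1.0, 1.0, 1.0, 0.9, 0.8, 0.7, 0.5, 0.25]
blocks = [
    {"out_proj": torch.randn(512, 512), "scale": torch.tensor(s)}
    for s in scales
]
contrib = effective_contribution(blocks)
ratio = max(contrib) / min(contrib)  # capacity spread between blocks
```

With the toy scales above, the strongest-to-weakest ratio lands near 4x, in the same ballpark as the 3.7x spread reported for the baseline.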
Why this happens: the tied embedding tug-of-war
With tied embeddings, tok_emb.weight has to be two things at once — a good lookup table for input tokens and a good linear classifier for output predictions. These are fundamentally different jobs, and the compromise shows up at both ends of the network.

Block 0 receives this compromise representation directly and can't freely transform it — its attention heads show the highest Q-gain variance of any block (std=1.17 vs 0.29 in the middle), meaning the heads can't agree on what's even useful in this space. Block 10 has the opposite problem: it needs to produce representations that dot-product well with the embedding matrix for classification, so it suppresses its own MLP (scale=0.15, lowest of all blocks) to avoid creating features that would misalign with the vocabulary.
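The tug-of-war comes straight from how weight tying is wired. A minimal sketch, using the baseline's dimensions (8192 vocab, 512 model dim) but illustrative module names:

```python
# Weight tying: one matrix serves as both input lookup table and output
# classifier. Dimensions match the baseline; names are illustrative.
import torch
import torch.nn as nn

vocab, dim = 8192, 512
tok_emb = nn.Embedding(vocab, dim)
lm_head = nn.Linear(dim, vocab, bias=False)
lm_head.weight = tok_emb.weight  # tying: same tensor, two jobs

tokens = torch.randint(0, vocab, (1, 16))
x = tok_emb(tokens)      # job 1: lookup table feeding block 0
logits = lm_head(x)      # job 2: linear classifier after block 10
# Any gradient pressure to make rows better classifiers also reshapes
# the input representation block 0 receives — the tug-of-war above.
```

Because the two roles share one tensor, the optimizer cannot improve the classifier geometry without simultaneously perturbing the input space, which is exactly the compromise the boundary blocks absorb.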
The evidence runs deeper than just the scales:
Raw embedding dependency: Only blocks 0 and 10 significantly mix in the original embedding signal (x0_weight of +0.11 and +0.08). Every middle block has learned to ignore it entirely (x0_weight ≈ 0). The model maintains an x0 pathway through all 11 blocks, but 9 of them don't use it.
Skip gate suppression: The deepest U-net skip — connecting the first block's output region to the last block's input — has the lowest effective weight of any skip (0.06). The model learned to almost completely gate this off, because carrying vocabulary-space features from the beginning to the end of the network is actively harmful.
Dead parallel routing: Block 10's two-lane parallel routing has a lane 0 residual of -0.01. The model essentially kills an entire lane of the routing machinery it was given. The hardware is there, it just can't use it in the current constrained space.
What we tried
We tested embedding decoupling — setting embedding_dim smaller than model_dim to create learned projection layers that rotate the representation into a model-friendly basis before block 0 and back to vocabulary space after block 10.

We ran three configurations with different decoupling gaps:
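The decoupling wiring can be sketched as follows. This is a hedged reconstruction, not the actual patch: the names embed_proj and head_proj follow the prose, the embed_dim=384 case is one of the three configurations, and the exact placement of the projections in the real model may differ.

```python
# Embedding decoupling sketch: embedding_dim < model_dim, with learned
# projections in and out. Names follow the PR prose; details assumed.
import torch
import torch.nn as nn

vocab, embed_dim, model_dim = 8192, 384, 512
tok_emb = nn.Embedding(vocab, embed_dim)                  # 8192x384 lookup
embed_proj = nn.Linear(embed_dim, model_dim, bias=False)  # 384x512 = 196K
head_proj = nn.Linear(model_dim, embed_dim, bias=False)   # 512x384 = 196K

tokens = torch.randint(0, vocab, (1, 16))
x = embed_proj(tok_emb(tokens))   # rotate into a model-friendly basis
# ... the 11 transformer blocks operate in model_dim space ...
logits = head_proj(x) @ tok_emb.weight.T  # back to vocab space, still tied
```

The blocks now operate in a basis the embedding matrix never touches directly, which is what freed block 10's routing and MLP in the experiments.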
The architectural effect was clear and consistent across all three: decoupling activated block 10 (up to +51% effective contribution) and flattened the V-shape dramatically. The capacity distribution went from 3.7x to as low as 2.1x — the most uniform we observed in any configuration.
But the BPB went the wrong way every time.
Why it didn't translate to better BPB
Three things worked against us:
Parameter loss from smaller embeddings. Shrinking tok_emb from 8192×512 to 8192×384 saves 1.1M parameters, but the projection layers (embed_proj and head_proj) only add back ~400K. The net loss of 655K parameters — a 7.4% drop in total effective capacity — hurt more than activating the boundary blocks helped.
fp16 passthrough budget bloat. The projection matrices (196K params each) land above the 65K passthrough threshold in the quantization pipeline, so they're stored as fp16 instead of being GPTQ-quantized. That's 400-900 KB of uncompressible overhead in the artifact — enough to push the total over the 16MB budget.
The clip_sigmas trap. We tried using the headroom from the smaller embedding to tighten quantization clipping (12.85 → 12.0), expecting lower quantization error. But tighter clipping actually increases compressed size — the quantized values spread across more unique levels, and brotli can't compress them as well. The rate-distortion curve is unforgiving here: you can have less error or a smaller file, not both.
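The accounting in the first two points checks out with simple arithmetic. A back-of-envelope sketch (the 65K passthrough threshold and fp16 storage follow the prose; the quantization pipeline itself is assumed):

```python
# Back-of-envelope check of the parameter and artifact-size accounting.
vocab, old_dim, new_dim, model_dim = 8192, 512, 384, 512

saved = vocab * (old_dim - new_dim)   # 1,048,576 params saved (~1.1M)
added = 2 * new_dim * model_dim       # 393,216 params in two projections
net_loss = saved - added              # 655,360 — the 655K quoted above

proj_params = new_dim * model_dim     # 196,608 per projection matrix
passthrough_threshold = 65_536        # tensors above this stay fp16
assert proj_params > passthrough_threshold  # so both matrices bypass GPTQ

fp16_overhead_bytes = 2 * proj_params * 2   # two matrices, 2 bytes/param
# ~786 KB of effectively uncompressible fp16 in the artifact,
# inside the 400-900 KB range quoted (the range spans the three configs)
```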
This experiment's results
The final configuration combined the decoupling (embed=416) with extended parallel residuals (starting at block 7 instead of 8), tighter clip_sigmas (12.0), and TTT block freezing (blocks 0-1 frozen during test-time training).
The TTT gain was only -0.0005 BPB (vs baseline's -0.0012). Freezing blocks 0-1 during TTT, which we expected to focus gradient signal on the activated blocks, actually made adaptation less effective. The model apparently needs all blocks to participate in TTT even if their steady-state contribution is low.
What this tells us about simplification
Looking at the weight analysis, there's a surprising amount of architectural machinery in the current model that exists primarily to compensate for the tied-embedding constraint — not to improve modeling:
The x0 pathway. Every block receives the original embedding as a second input through the resid_mix parameter. Nine out of eleven blocks have learned to set this to approximately zero. The pathway threads a tensor through the entire forward pass that 82% of the model ignores. Removing it would save 11K parameters and a multiply-add per block per step — modest, but it's pure dead weight.

Skip gate suppression. The U-net skip connections are designed to help gradient flow between encoder and decoder halves. But the deepest skip (boundary-to-boundary) has been gated down to 0.06 effective weight. The model is spending gate parameters to suppress a connection that would be useful if the boundary blocks weren't constrained. In our decoupled experiments, this suppression disappeared — the skips were free to carry signal instead of blocking contamination.
Half-dead parallel routing. Block 10's two-lane parallel routing has a dead lane (residual = -0.01). The model was given routing machinery for two independent streams and chose to kill one entirely. When we decoupled the embedding, block 10's routing normalized to a balanced 50/50 split — the hardware always had the capacity, it just couldn't use it in vocabulary-aligned space.
The V-shape itself. If blocks 0 and 10 operated at even the average block's effectiveness (30.6 instead of 20.8 and 12.1), the model would gain the equivalent of 60% of a full middle block — for free, with no additional parameters.
Where this leads
The dimension-change approach to decoupling has a fundamental problem: it trades parameters for architectural freedom, and at this budget the parameters are worth more. But the analysis points to a cleaner path.
The projection matrices we learned turned out to be near-perfect orthogonal rotations — all singular values within 0.2% of 1.0. The model doesn't want to compress or expand the embedding; it just wants a different basis. This suggests a residual low-rank projection (x → x + B(Ax) with rank 32) could provide the same basis rotation at a fraction of the cost: 65K parameters, 128 KB fp16 passthrough, no embedding shrinkage, no dimension mismatch, no hash_embed bugs. It starts as identity (B=0) and learns the rotation during training.

If that works, it opens a simplification cascade: remove the x0 pathway (the decoupled model doesn't need it), let the skip gates operate freely (no more contamination to suppress), and let block 10 use both parallel lanes (no more vocabulary alignment constraint). The model becomes simpler and every block contributes.
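The proposed module is small enough to sketch in full. This is a minimal sketch under the stated assumptions (model_dim 512, rank 32, one copy at entry and one at exit); the class name is hypothetical:

```python
# Residual low-rank projection x -> x + B(A x), rank 32 at dim 512.
# B starts at zero, so the map starts as the exact identity. Two copies
# (entry + exit) total 2 * 2 * 512 * 32 = 65,536 params = 128 KB fp16,
# matching the budget quoted above. Class name is hypothetical.
import torch
import torch.nn as nn

class ResidualLowRankRotation(nn.Module):
    def __init__(self, dim=512, rank=32):
        super().__init__()
        self.A = nn.Linear(dim, rank, bias=False)  # down to rank
        self.B = nn.Linear(rank, dim, bias=False)  # back up to dim
        nn.init.zeros_(self.B.weight)              # identity at init

    def forward(self, x):
        return x + self.B(self.A(x))

entry, exit_ = ResidualLowRankRotation(), ResidualLowRankRotation()
x = torch.randn(1, 16, 512)
y = entry(x)  # exact no-op at init, since B(Ax) = 0
n_params = sum(p.numel() for m in (entry, exit_) for p in m.parameters())
```

Because both matrices sit below the 65K passthrough threshold individually (16,384 params each), they would also be eligible for quantization rather than forced into fp16, unlike the full projection matrices from the dimension-change approach.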
The broader takeaway
At 36M parameters and 16MB, structural constraints matter more than architectural cleverness. The tied embedding saves ~4M parameters but costs roughly 10% of effective capacity through the V-shape. Whether that's a net positive trade depends on what those 4M parameters would buy if spent on a second embedding matrix or wider middle layers — and that's a question worth the community's attention.
The three things we're most confident about from this investigation:
The bottleneck is real and measurable. Tied embeddings structurally constrain boundary blocks. It shows up in weight norms, learned scales, skip gate values, Q-gain distributions, and parallel routing lambdas — all independently pointing to the same story.
The 16MB budget is a rate-distortion problem. You can't reduce quantization error without growing the artifact. Lower clip_sigmas, fp16 passthrough, extra parameters — everything has a compression cost that must be weighed against the modeling benefit.
The x0 pathway, skip suppression, and dead parallel lanes are symptoms, not causes. They're the model's optimal adaptation to a structural constraint. Fixing the constraint resolves the symptoms; treating the symptoms individually doesn't fix the constraint.