[non record] Investigating the Tied-Embedding Bottleneck: Why Boundary Blocks Underperform and What It Means for 16MB Models#1546

Open
SPThole wants to merge 16 commits into openai:main from SPThole:non_record_6

Conversation


@SPThole SPThole commented Apr 11, 2026

This started as a simple question — why does the last transformer block barely do anything? — and turned into a deeper investigation of how tied embeddings shape the entire capacity profile of a model, what the architecture is quietly spending parameters on to work around it, and where that leaves us for future improvements.

The experiment itself doesn't beat the baseline. It goes over budget. But the analysis behind it reveals structural patterns in the current architecture that we think the community should see.

The starting point: a really well-built model

The ImprovedParallelResiduals baseline (couldn't tag the PR here, I guess, since @msisovic closed it) (1.0744 BPB) is genuinely impressive. Depth recurrence on blocks 3-5 gives 17 virtual layers from 11 physical — 55% more depth at zero parameter cost. The parallel residual routing on the final 3 blocks lets attention and MLP specialize independently, and the model takes full advantage: block 8 routes 99% of its attention output into the MLP lane, which is a striking learned behavior. The U-net skip connections, per-head Q-gain scaling, per-dim residual mixing, CUTLASS EVT fusion for throughput, and legal score-first TTT with bigram hash embeddings — it all fits together tightly. At ~124 ms/step on 8xH100, this architecture is squeezing near-maximum value out of the 36M parameter budget.

Which is exactly why the remaining inefficiency is so visible when you look at the weights.

What the weight analysis revealed

We computed an "effective contribution" for each block — the product of each block's output projection norm and its learned scale factor — and found a pattern that's hard to unsee once you notice it.
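As a sketch of the metric (with illustrative names — the repo's actual module and attribute names may differ), the per-block number is simply a projection norm scaled by the block's learned scalar:

```python
import numpy as np

def effective_contribution(out_proj_weight: np.ndarray, scale: float) -> float:
    """Effective contribution of a block: the Frobenius norm of its output
    projection times its learned scale factor, as defined above.
    (Names here are illustrative, not the repo's API.)"""
    return float(np.linalg.norm(out_proj_weight) * scale)

# Toy illustration: a block with a healthy projection but a suppressed
# scale (like block 10's MLP at scale=0.15) contributes little.
rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)) * 0.02
assert effective_contribution(w, 0.15) < effective_contribution(w, 1.0)
```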

The model's capacity follows a symmetric V-shape:

| Block | Effective contribution | What's happening |
| --- | --- | --- |
| Block 0 (first) | 20.8 | Stuck in the input embedding space |
| Block 5 (peak) | 44.1 | Free to develop rich MLP-dominant features |
| Block 10 (last) | 12.1 | Forced to align with the output vocabulary projection |

The weakest block has barely a quarter of the capacity of the strongest. Blocks 0 and 10 together hold ~5.8M parameters but deliver only 9.8% of the model's total effective contribution. That's 16% of the parameters doing less than 10% of the work.

Why this happens: the tied embedding tug-of-war

With tied embeddings, tok_emb.weight has to be two things at once — a good lookup table for input tokens and a good linear classifier for output predictions. These are fundamentally different jobs, and the compromise shows up at both ends of the network.

Block 0 receives this compromise representation directly and can't freely transform it — its attention heads show the highest Q-gain variance of any block (std=1.17 vs 0.29 in the middle), meaning the heads can't agree on what's even useful in this space. Block 10 has the opposite problem: it needs to produce representations that dot-product well with the embedding matrix for classification, so it suppresses its own MLP (scale=0.15, lowest of all blocks) to avoid creating features that would misalign with the vocabulary.

The evidence runs deeper than just the scales:

  • Raw embedding dependency: Only blocks 0 and 10 significantly mix in the original embedding signal (x0_weight of +0.11 and +0.08). Every middle block has learned to ignore it entirely (x0_weight ≈ 0). The model maintains an x0 pathway through all 11 blocks, but 9 of them don't use it.

  • Skip gate suppression: The deepest U-net skip — connecting the first block's output region to the last block's input — has the lowest effective weight of any skip (0.06). The model learned to almost completely gate this off, because carrying vocabulary-space features from the beginning to the end of the network is actively harmful.

  • Dead parallel routing: Block 10's two-lane parallel routing has a lane 0 residual of -0.01. The model essentially kills an entire lane of the routing machinery it was given. The hardware is there, it just can't use it in the current constrained space.

What we tried

We tested embedding decoupling — setting embedding_dim smaller than model_dim to create learned projection layers that rotate the representation into a model-friendly basis before block 0 and back to vocabulary space after block 10.
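A minimal sketch of the decoupled forward path (the names `embed_proj` and `head_proj` come from the discussion above; the wiring and initialization here are assumptions, not the repo's code):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, embed_dim, model_dim = 8192, 448, 512  # the embed=448 configuration

tok_emb = rng.standard_normal((vocab, embed_dim)) * 0.02       # tied table, now 8192x448
embed_proj = rng.standard_normal((embed_dim, model_dim)) * 0.02  # rotate into model basis
head_proj = rng.standard_normal((model_dim, embed_dim)) * 0.02   # rotate back to vocab basis

def forward(token_ids, blocks=lambda h: h):
    h = tok_emb[token_ids] @ embed_proj   # before block 0: into model-friendly basis
    h = blocks(h)                         # transformer blocks run at model_dim
    return (h @ head_proj) @ tok_emb.T    # after block 10: classify against tied table

logits = forward(np.array([1, 2, 3]))
assert logits.shape == (3, vocab)
```

The point of the two projections is that blocks 0 and 10 no longer touch the vocabulary-aligned space directly; the rotation layers absorb the alignment job.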

We ran three configurations with different decoupling gaps:

| Configuration | Block 10 eff. | Capacity spread (max/min) | Total eff. | Pre-quant BPB | Artifact |
| --- | --- | --- | --- | --- | --- |
| Baseline (embed=512) | 12.1 | 3.7x | 336.8 | 1.0821 | 15.98 MB |
| embed=384 (gap=128d) | 17.2 (+42%) | 2.3x | 311.8 (-7.4%) | 1.0950 | n/a |
| embed=448 (gap=64d) | 18.2 (+51%) | 2.2x | 320.4 (-4.9%) | 1.0877 | 16.28 MB |
| embed=416 + par_start=7 + clip=12.0 | 16.6 (+37%) | 2.1x | 297.0 (-11.8%) | 1.0915 | 16.44 MB |

The architectural effect was clear and consistent across all three: decoupling activated block 10 (up to +51% effective contribution) and flattened the V-shape dramatically. The capacity distribution went from 3.7x to as low as 2.1x — the most uniform we observed in any configuration.

But the BPB went the wrong way every time.

Why it didn't translate to better BPB

Three things worked against us:

Parameter loss from smaller embeddings. Shrinking tok_emb from 8192×512 to 8192×384 saves 1.1M parameters, but the projection layers (embed_proj and head_proj) only add back ~400K. The net loss of 655K parameters — a 7.4% drop in total effective capacity — hurt more than activating the boundary blocks helped.
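The accounting above checks out arithmetically (for the embed=384 case):

```python
# Back-of-envelope for the parameter trade described above.
vocab, model_dim, embed_dim = 8192, 512, 384

saved = vocab * model_dim - vocab * embed_dim          # shrinking the tied table
added = embed_dim * model_dim + model_dim * embed_dim  # embed_proj + head_proj (196K each)
net = saved - added

print(saved, added, net)  # 1048576 393216 655360: ~1.05M saved, ~393K added, ~655K net loss
```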

fp16 passthrough budget bloat. The projection matrices (196K params each) land above the 65K passthrough threshold in the quantization pipeline, so they're stored as fp16 instead of being GPTQ-quantized. That's 400-900 KB of uncompressible overhead in the artifact — enough to push the total over the 16MB budget.

The clip_sigmas trap. We tried using the headroom from the smaller embedding to tighten quantization clipping (12.85 → 12.0), expecting lower quantization error. But tighter clipping actually increases compressed size — the quantized values spread across more unique levels, and brotli can't compress them as well. The rate-distortion curve is unforgiving here: you can have less error or a smaller file, not both.
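The mechanism is easy to see with a toy uniform quantizer (this is not the repo's GPTQ pipeline, and the clip values are exaggerated to make the effect visible at 4 bits): tightening the clip spreads the quantized distribution over more unique levels, which raises entropy and hurts brotli.

```python
import numpy as np

def quantize(x: np.ndarray, clip_sigmas: float, bits: int = 4) -> np.ndarray:
    """Uniformly quantize x to 2**bits levels over [-clip*sigma, +clip*sigma].
    A toy stand-in for the real pipeline, just to illustrate the trade-off."""
    sigma = x.std()
    lo, hi = -clip_sigmas * sigma, clip_sigmas * sigma
    levels = 2 ** bits
    q = np.round((np.clip(x, lo, hi) - lo) / (hi - lo) * (levels - 1))
    return q.astype(np.int64)

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)

wide = quantize(w, 12.85)  # wide clip: mass piles into a few central levels
tight = quantize(w, 3.0)   # tight clip: values spread across many more levels

assert len(np.unique(tight)) > len(np.unique(wide))
```

Fewer occupied levels means a more skewed symbol distribution, which entropy coders like brotli exploit — hence less quantization error and a smaller file pull in opposite directions.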

This experiment's results

This final configuration combined the decoupling (embed=416) with extended parallel residuals (starting at block 7 instead of 8), tighter clip_sigmas (12.0), and TTT block freezing (blocks 0-1 frozen during test-time training).

| Eval stage | This experiment | Baseline |
| --- | --- | --- |
| Pre-quantization post-EMA | 1.0915 | 1.0821 |
| Quantized (basic) | 1.1013 | 1.0925 |
| Quantized sliding window | 1.0845 | 1.0756 |
| Quantized TTT (legal) | 1.0850 | 1.0744 |

| Metric | This experiment | Baseline |
| --- | --- | --- |
| Training steps | 4,727 | 4,733 |
| Artifact size | 16.44 MB (over budget; addressable) | 15.98 MB |
| Peak memory | 40,177 MiB | 39,948 MiB |

Per the table, TTT actually cost +0.0005 BPB here (1.0845 → 1.0850), versus the baseline's -0.0012 improvement. Freezing blocks 0-1 during TTT, which we expected to focus gradient signal on the activated blocks, actually made adaptation less effective. The model apparently needs all blocks to participate in TTT even if their steady-state contribution is low.

What this tells us about simplification

Looking at the weight analysis, there's a surprising amount of architectural machinery in the current model that exists primarily to compensate for the tied-embedding constraint — not to improve modeling:

The x0 pathway. Every block receives the original embedding as a second input through the resid_mix parameter. Nine out of eleven blocks have learned to set this to approximately zero. The pathway threads a tensor through the entire forward pass that 82% of the model ignores. Removing it would save 11K parameters and a multiply-add per block per step — modest, but it's pure dead weight.

Skip gate suppression. The U-net skip connections are designed to help gradient flow between encoder and decoder halves. But the deepest skip (boundary-to-boundary) has been gated down to 0.06 effective weight. The model is spending gate parameters to suppress a connection that would be useful if the boundary blocks weren't constrained. In our decoupled experiments, this suppression disappeared — the skips were free to carry signal instead of blocking contamination.

Half-dead parallel routing. Block 10's two-lane parallel routing has a dead lane (residual = -0.01). The model was given routing machinery for two independent streams and chose to kill one entirely. When we decoupled the embedding, block 10's routing normalized to a balanced 50/50 split — the hardware always had the capacity, it just couldn't use it in vocabulary-aligned space.

The V-shape itself. If blocks 0 and 10 operated at even the average block's effectiveness (30.6 instead of 20.8 and 12.1), the model would gain the equivalent of 60% of a full middle block — for free, with no additional parameters.
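The x0 pathway described above can be sketched as a per-block residual mix (the parameterization here is a guess for illustration; the repo's `resid_mix` may be shaped differently):

```python
import numpy as np

def block_input(h: np.ndarray, x0: np.ndarray, x0_weight: float) -> np.ndarray:
    """Mix the raw embedding x0 back into the residual stream h before a
    block -- a hypothetical sketch of the resid_mix/x0 pathway."""
    return h + x0_weight * x0

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 512))
x0 = rng.standard_normal((4, 512))

# Middle blocks learn x0_weight ~ 0, so for 9 of 11 blocks this
# multiply-add is a no-op that still costs compute every step:
assert np.allclose(block_input(h, x0, 0.0), h)
```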

Where this leads

The dimension-change approach to decoupling has a fundamental problem: it trades parameters for architectural freedom, and at this budget the parameters are worth more. But the analysis points to a cleaner path.

The projection matrices we learned turned out to be near-perfect orthogonal rotations — all singular values within 0.2% of 1.0. The model doesn't want to compress or expand the embedding; it just wants a different basis. This suggests a residual low-rank projection (x → x + B(Ax) with rank 32) could provide the same basis rotation at a fraction of the cost: 65K parameters, 128 KB fp16 passthrough, no embedding shrinkage, no dimension mismatch, no hash_embed bugs. It starts as identity (B=0) and learns the rotation during training.
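A minimal sketch of that residual low-rank rotation, assuming one rotation on the input side and one on the output side — which is the pairing that makes the parameter count line up with the 65K / 128 KB figures above (this is the proposal, not code from the repo):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, rank = 512, 32

class LowRankRotation:
    """x -> x + B(Ax): starts as the exact identity (B = 0) and can learn a
    near-orthogonal basis change during training, with no change to the
    embedding dimension. A hypothetical sketch of the proposal above."""
    def __init__(self, dim: int, rank: int):
        self.A = rng.standard_normal((rank, dim)) / np.sqrt(dim)
        self.B = np.zeros((dim, rank))  # zero-init => identity at the start

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x + (x @ self.A.T) @ self.B.T

proj = LowRankRotation(dim, rank)
x = rng.standard_normal((3, dim))
assert np.allclose(proj(x), x)  # identity at init: safe to drop in

# One rotation before block 0 plus one after block 10:
params = 2 * (2 * dim * rank)
assert params == 65536  # ~65K params, 128 KB at fp16
```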

If that works, it opens a simplification cascade: remove the x0 pathway (the decoupled model doesn't need it), let the skip gates operate freely (no more contamination to suppress), and let block 10 use both parallel lanes (no more vocabulary alignment constraint). The model becomes simpler and every block contributes.

The broader takeaway

At 36M parameters and 16MB, structural constraints matter more than architectural cleverness. The tied embedding saves ~4M parameters but costs roughly 10% of effective capacity through the V-shape. Whether that's a net positive trade depends on what those 4M parameters would buy if spent on a second embedding matrix or wider middle layers — and that's a question worth the community's attention.

The three things we're most confident about from this investigation:

  1. The bottleneck is real and measurable. Tied embeddings structurally constrain boundary blocks. It shows up in weight norms, learned scales, skip gate values, Q-gain distributions, and parallel routing lambdas — all independently pointing to the same story.

  2. The 16MB budget is a rate-distortion problem. You can't reduce quantization error without growing the artifact. Lower clip_sigmas, fp16 passthrough, extra parameters — everything has a compression cost that must be weighed against the modeling benefit.

  3. The x0 pathway, skip suppression, and dead parallel lanes are symptoms, not causes. They're the model's optimal adaptation to a structural constraint. Fixing the constraint resolves the symptoms; treating the symptoms individually doesn't fix the constraint.

@SPThole SPThole changed the title Investigating the Tied-Embedding Bottleneck: Why Boundary Blocks Underperform and What It Means for 16MB Models [non record]Investigating the Tied-Embedding Bottleneck: Why Boundary Blocks Underperform and What It Means for 16MB Models Apr 11, 2026
@SPThole SPThole changed the title [non record]Investigating the Tied-Embedding Bottleneck: Why Boundary Blocks Underperform and What It Means for 16MB Models [non record] Investigating the Tied-Embedding Bottleneck: Why Boundary Blocks Underperform and What It Means for 16MB Models Apr 11, 2026