Modern LLMs have become deeper, wider, and more compute-hungry, yet stacking more Transformer layers does not always translate into proportional gains. One reason is that standard residual connections aggregate layer outputs with fixed unit weights, so every layer inherits a uniform sum of everything that came before it. This can dilute earlier representations, amplify hidden-state magnitude, and make it harder for the network to selectively reuse the most useful intermediate features.
Instead of treating depth as a fixed additive recurrence, the Attention Residuals paper from the Kimi team lets each layer attend over earlier layer outputs using learned softmax weights. In this article, we’ll look at why standard residual aggregation becomes a bottleneck, how Attention Residuals works, why the block variant matters, and what the results actually suggest about scaling deeper language models.
If you’re keen to learn more about some of the ideas behind modern Transformer architectures, I recommend checking out the DataCamp course Transformer Models Tutorial in PyTorch.
The Problem: The Drowning Signal
Residual connections have been foundational to deep learning for years. In Transformers, the standard residual update is:
$$h_l = h_{l-1} + f_{l-1}(h_{l-1})$$
This helps gradients flow through very deep networks. But, residuals are not only a gradient trick. They also define how information is aggregated across depth. If we unroll the recurrence, we get:
$$h_l = h_1 + \sum_{i=1}^{l-1} f_i(h_i)$$
This means the hidden state at layer l is the embedding plus a uniformly weighted sum of all earlier layer outputs, so every contribution is effectively assigned the same weight.
That becomes a problem at scale. The paper shows that in PreNorm architectures, unweighted residual accumulation causes hidden-state magnitudes to grow with depth, roughly as O(L).
As the residual stream grows, earlier layer outputs become increasingly diluted inside a larger running sum. Once an early-layer signal is mixed into this accumulation, deeper layers cannot selectively recover it; they can only operate on the aggregated state.
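A minimal NumPy sketch makes this dilution concrete. We simulate a residual stream with random, roughly unit-norm layer outputs (a stand-in for real activations, not the paper's setup) and track both the stream's magnitude and how much of the final state still points in the direction of the first layer's output:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 512, 64  # hidden size and depth (illustrative values)

h = rng.standard_normal(d) / np.sqrt(d)  # embedding, roughly unit norm
norms, shares = [], []
first_output = None

for layer in range(L):
    f_out = rng.standard_normal(d) / np.sqrt(d)  # stand-in for f_l(h_l)
    if layer == 0:
        first_output = f_out
    h = h + f_out  # standard residual update: fixed weight 1.0
    norms.append(np.linalg.norm(h))
    # cosine similarity between the first layer's output and the running sum
    shares.append(abs(first_output @ h) /
                  (np.linalg.norm(first_output) * np.linalg.norm(h)))

print(f"stream norm after layer 1:  {norms[0]:.2f}")
print(f"stream norm after layer {L}: {norms[-1]:.2f}")
print(f"first-layer signal share, layer 1 vs layer {L}: "
      f"{shares[0]:.2f} -> {shares[-1]:.2f}")
```

The stream's norm grows with depth while the first layer's relative contribution shrinks toward noise level, which is exactly the "drowning" pattern described above.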
This leads to what the paper calls the “drowning signal” effect. A strong empirical signal of this inefficiency comes from layer-pruning studies, i.e., a substantial fraction of layers in trained models can often be removed with minimal impact on performance. This suggests that while we keep increasing depth, models lack an effective mechanism to fully utilize it. Instead of forming a hierarchical reasoning chain, the network behaves more like a redundant relay where early signals are progressively diluted.
There is also a forward-pass trade-off. As the residual stream grows, later layers may need to produce higher-magnitude outputs to meaningfully influence the accumulated state. The paper connects this to PreNorm behavior, where hidden-state magnitudes increase monotonically with depth under standard residual aggregation.
Residual connections help gradients flow, and the gradient with respect to an intermediate hidden state can be written as:
$$\frac{\partial \mathcal{L}}{\partial h_l} = \frac{\partial \mathcal{L}}{\partial h_L}\left(I + \frac{\partial}{\partial h_l}\sum_{i=l}^{L-1} f_i(h_i)\right)$$
The identity term here preserves a direct gradient path. But the residuals still force the forward aggregation path to treat every earlier layer with a fixed weight of 1.0. That is the structural limitation AttnRes tries to fix.
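The effect of that identity term can be checked numerically. In the sketch below (a toy setup, not the paper's model) each layer is a small linear map, so the Jacobian of the unrolled recurrence is a product of (I + W_i) factors; without the residual, it would be a product of W_i alone:

```python
import numpy as np

rng = np.random.default_rng(4)
d, L = 4, 6
# small linear "layers" f_i(h) = W_i h, illustrative scale
Ws = [0.05 * rng.standard_normal((d, d)) for _ in range(L)]

# With residuals: dh_L/dh_0 = prod_i (I + W_i)
J = np.eye(d)
for W in Ws:
    J = (np.eye(d) + W) @ J

# Without residuals: dh_L/dh_0 = prod_i W_i
no_residual = np.eye(d)
for W in Ws:
    no_residual = W @ no_residual

print(np.linalg.norm(J))            # stays O(1), close to the identity
print(np.linalg.norm(no_residual))  # collapses toward zero
```

The residual Jacobian keeps a healthy gradient path, but notice that nothing in it weights earlier layers differently in the forward pass; that is the fixed-weight limitation AttnRes addresses.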
Additive Residuals to Depth-Wise Attention
The paper’s most interesting conceptual move is time–depth duality. Residual connections compress information over depth, the way recurrent networks compress information over time. In sequence modeling, attention replaced recurrence by letting each position selectively access earlier positions. AttnRes applies the same transition to network depth. Instead of defining the next hidden state as a fixed sum over earlier layers, AttnRes lets each layer attend over previous layer outputs:
$$h_l = \sum_{i<l} \alpha_{i\to l}\, v_i$$
where the weights αi→l are softmax attention weights over depth and sum to 1. Those weights are computed as:
$$\alpha_{i\to l} = \frac{\exp\!\left(w_l^\top k_i\right)}{\sum_{j<l} \exp\!\left(w_l^\top k_j\right)}$$
with the paper using:
$$k_i = \mathrm{RMSNorm}(v_i)$$
Each layer gets a learned pseudo-query vector wl, and that query attends over keys and values built from earlier layer outputs. RMSNorm is applied to the keys so that layers with naturally larger output magnitudes do not dominate the softmax just because they are larger in scale.
The key shift in the paper is that the standard residuals are treated as a kind of depth-wise linear attention, while AttnRes upgrades that to depth-wise softmax attention. Instead of uniform accumulation, we get selective retrieval across depth.
A small but important implementation detail is initialization. The authors suggest that all pseudo-query vectors should be initialized to zero, which ensures that the initial attention weights are uniform, so the model begins training as an equal-weight average instead of a randomly biased attention mechanism.
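Putting the pieces together, here is a minimal NumPy sketch of one depth-wise attention step, under the assumptions above (RMSNorm-ed keys, a per-layer pseudo-query, zero initialization); function and variable names are illustrative, not the paper's exact parameterization:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # root-mean-square normalization, no learned scale for simplicity
    return x / np.sqrt(np.mean(x ** 2) + eps)

def attn_res_step(prev_outputs, w_l):
    """Depth-wise softmax attention over earlier layer outputs.

    prev_outputs: list of d-dim vectors (embedding + earlier layer outputs)
    w_l: learned pseudo-query vector for the current layer
    """
    keys = np.stack([rmsnorm(v) for v in prev_outputs])  # RMSNorm keys
    scores = keys @ w_l
    scores -= scores.max()                                # stable softmax
    alpha = np.exp(scores) / np.exp(scores).sum()         # weights sum to 1
    values = np.stack(prev_outputs)
    return alpha, alpha @ values                          # weighted aggregate

d = 8
rng = np.random.default_rng(1)
outputs = [rng.standard_normal(d) for _ in range(4)]

# Zero-initialized pseudo-query -> uniform weights -> equal-weight average
alpha0, _ = attn_res_step(outputs, np.zeros(d))
print(alpha0)

# A nonzero (trained) query produces selective, non-uniform weights
alpha1, h_next = attn_res_step(outputs, rng.standard_normal(d))
print(alpha1)
```

The zero-initialization detail is visible directly: with a zero query, every score is zero and the softmax is uniform, so training starts from the equal-weight behavior of (normalized) standard residuals.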

Figure 1: Overview of Attention Residuals (Attention Residuals paper)
Full Attention Residuals
In Full Attention Residuals, each layer attends over all previous layer outputs, allowing it to selectively aggregate information from across depth. This gives the model maximum flexibility, so that a deeper layer can emphasize its immediate predecessor, the original embedding, or any earlier layer if that is where the most useful signal resides.
In standard training, Full AttnRes’s memory overhead is smaller than it may first appear, because many layer outputs are already retained for backpropagation. But large-scale training changes the picture. Once you introduce activation recomputation and pipeline parallelism, those earlier outputs have to be explicitly preserved and communicated for later layers to attend to them, and that becomes expensive.
So Full AttnRes is the best way to understand the core idea, but it is not the version most teams would want to deploy at scale.
Block Attention Residuals
The paper introduced the idea of Block Attention Residuals to reduce memory and communication overhead. The model’s layers are partitioned into blocks such that within each block, outputs are combined using standard additive residual accumulation, but across blocks, the model performs attention over block-level summaries instead of every individual earlier layer.
If Bn is the set of layers in block n, then the block representation is given by:
$$b_n = \sum_{i \in B_n} f_i(h_i)$$
The model then attends over the embedding b0 = h1, earlier block summaries, and the current block’s partial sum as computation progresses. This reduces memory and communication from O(Ld) to O(Nd), where N is the number of blocks.
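A short sketch of the bookkeeping, assuming the block summaries are plain additive sums of the per-layer outputs within each block (the helper and its shapes are illustrative, not the paper's exact implementation):

```python
import numpy as np

def block_summaries(layer_outputs, block_size):
    """Group per-layer outputs into additive block summaries b_n.

    Within a block, outputs are summed (standard residual accumulation);
    across blocks, a layer attends over these summaries instead of every
    individual earlier layer output.
    """
    L, d = layer_outputs.shape
    n_blocks = (L + block_size - 1) // block_size
    summaries = np.zeros((n_blocks, d))
    for i in range(L):
        summaries[i // block_size] += layer_outputs[i]
    return summaries

rng = np.random.default_rng(2)
L, d, S = 16, 32, 4                       # 16 layers, block size 4
outs = rng.standard_normal((L, d))        # stand-ins for f_i(h_i)

b = block_summaries(outs, S)
print(b.shape)  # (4, 32): attention state shrinks from O(L*d) to O(N*d)
```

Nothing is lost in aggregate (the summaries sum to the same total as the raw outputs); what changes is the granularity at which depth-wise attention can select.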
The paper reports that using around eight blocks recovers most of the benefit of the full version, and the performance gap between Full AttnRes and Block AttnRes narrows as scale increases. Block sizes such as S = 2, 4, 8 all remain close to the full version, while much coarser groupings trend back toward baseline behavior.
The most useful takeaway from the paper is that you do not need full depth-wise attention over every layer to get most of the gain. While Block AttnRes makes depth-wise attention computationally feasible, deploying it efficiently at scale still requires careful systems design. The paper also introduces several optimizations that make AttnRes practical for real-world training and inference.

Figure 2: Cache-based pipeline communication example with 4 physical ranks and 2 virtual stages per rank, where hatched boxes denote the end of AttnRes blocks (Attention Residuals paper)
Here are three key system optimizations that make AttnRes a drop-in replacement for standard residuals:
- Two-phase computation: By decoupling the pseudo-query from the hidden state, we can batch queries for all layers within a block. Phase 1 computes inter-block attention in parallel, while Phase 2 handles sequential intra-block dependencies using an Online Softmax Merge. This reduces per-layer I/O significantly compared to prior multi-stream methods like mHC.
- Cross-stage caching: In pipeline parallelism, we eliminate redundant data transfers by caching previously received blocks locally. This provides a V× improvement (where V is the number of virtual stages as shown in Figure 2) in peak per-transition cost, allowing communication to be fully overlapped with computation.
- Memory-efficient prefilling: By sharding block representations across tensor-parallel devices, the memory footprint for long-context (128K+) sequences is reduced by orders of magnitude (e.g., from 15GB to 0.3GB per device).
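The "Online Softmax Merge" used in the two-phase computation relies on standard streaming-softmax algebra: two partial attention results, each carrying its running max, sum of exponentials, and weighted value sum, can be merged into the exact full result. A minimal sketch (variable names are illustrative):

```python
import numpy as np

def partial_state(scores, values):
    """Compute a partial softmax-attention state: (max, exp-sum, weighted sum)."""
    m = scores.max()
    e = np.exp(scores - m)
    return m, e.sum(), e @ values

def merge(state_a, state_b):
    """Online softmax merge of two partial states into one exact state."""
    m_a, s_a, o_a = state_a
    m_b, s_b, o_b = state_b
    m = max(m_a, m_b)                       # rebase both states to a common max
    s = s_a * np.exp(m_a - m) + s_b * np.exp(m_b - m)
    o = o_a * np.exp(m_a - m) + o_b * np.exp(m_b - m)
    return m, s, o

rng = np.random.default_rng(3)
scores, values = rng.standard_normal(6), rng.standard_normal((6, 4))

# Full softmax attention in one shot...
m, s, o = partial_state(scores, values)
full = o / s

# ...equals merging two independently computed halves
m2, s2, o2 = merge(partial_state(scores[:3], values[:3]),
                   partial_state(scores[3:], values[3:]))
print(np.allclose(full, o2 / s2))
```

This is what lets Phase 1 (inter-block attention, computed in parallel) and Phase 2 (sequential intra-block contributions) be combined without recomputing the softmax over everything.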
Results and Performance Analysis
The paper validates AttnRes at multiple levels: scaling laws, ablations, training dynamics, and downstream benchmarks.

Figure 3: Scaling law curves for Attention Residuals. (Attention Residuals paper)
The topline result comes from the scaling-law experiments. Across five model sizes, both Full AttnRes and Block AttnRes consistently achieve lower validation loss than the baseline. Based on the fitted curves, Block AttnRes reaches the same loss as a baseline trained with about 1.25× more compute, suggesting that AttnRes is not just better in theory but also more compute-efficient in practice.
On downstream tasks, AttnRes improves over the baseline across all evaluated benchmarks. Some of the biggest gains are on reasoning-intensive tasks, as shown in Table 1:

Table 1: Performance comparison of AttnRes with the baseline, both after the same pre-training recipe (Attention Residuals paper)
The gains are especially strong on multi-step reasoning, math, and code, which fits the paper’s core hypothesis that if later layers can selectively retrieve earlier representations instead of inheriting a blurred aggregate, compositional reasoning should improve.
Compared with the baseline, AttnRes shows lower validation loss throughout training, more bounded output magnitudes across depth, and more uniform gradient magnitudes across layers. The baseline suffers from the usual PreNorm dilution pattern, where hidden-state magnitudes grow monotonically with depth, while Block AttnRes produces a more controlled, bounded pattern thanks to selective aggregation at block boundaries.
There is also a broader architecture-level finding that, in a fixed-compute, fixed-parameter sweep, the optimal configuration for AttnRes shifts toward a deeper, narrower model compared with the baseline. This suggests that AttnRes may make additional depth more useful than it is under standard residual aggregation.
Conclusion
Residuals are often treated as a necessary training trick, i.e., an identity shortcut that keeps gradients alive. AttnRes suggests that this view is too narrow. Residual paths are also the mechanism by which information is routed across depth, and fixed additive accumulation may simply be too primitive for very deep models.
Attention Residuals are less about patching a known problem and more about upgrading a neglected design choice. Sequence modeling evolved from recurrence to attention because fixed recurrence was too restrictive. AttnRes argues that depth may now be ready for the same transition.
The model still shows strong locality, with layers often attending most to nearby predecessors, but it also learns nontrivial skip patterns, preserves persistent weight on the embedding, and maintains distinct behavior between pre-attention and pre-MLP layers.
The practical takeaway is not that all models should replace residuals with full depth-wise attention. Instead, it reframes residual aggregation as a core architectural design space.
Attention Residuals FAQs
How is AttnRes different from standard residual connections?
Standard residuals sum earlier layer outputs with fixed unit weights. AttnRes replaces that with learned softmax weighting over earlier layers, so a layer can selectively retrieve the most relevant prior representations.
What problem is AttnRes trying to solve?
AttnRes targets the limitations of uniform residual accumulation under depth, especially hidden-state growth, loss of selective access to earlier layers, and the broader PreNorm dilution effect.
What is the difference between Full AttnRes and Block AttnRes?
Full AttnRes attends over all previous layer outputs, while Block AttnRes groups layers into blocks and attends over block summaries and partial sums, reducing memory and communication from O(Ld) to O(Nd).
Does AttnRes increase inference latency?
Yes, but the paper reports that the optimized two-phase implementation keeps the end-to-end inference latency overhead under 2% on typical inference workloads.
What are the biggest reported gains?
In the large Kimi Linear experiment, AttnRes improves over baseline on all evaluated tasks, with strong gains on GPQA-Diamond, Math, and HumanEval benchmarks.

I am a Google Developers Expert in ML (Gen AI), a Kaggle 3x Expert, and a Women Techmakers Ambassador with 3+ years of experience in tech. I co-founded a health-tech startup in 2020 and am pursuing a master's in computer science at Georgia Tech, specializing in machine learning.


