Why build a bigger model when you can just loop twice for twice the power?

Modern language models can refine their reasoning by looping back through their own computation, repeatedly applying the same layers to polish an initial answer. In theory, this is powerful. A model that thinks twice should produce better code than one that thinks once. But practice has a stubborn cost: if you loop sequentially, each additional pass adds to both latency (time to first token) and memory usage (the KV-cache that stores keys and values for the tokens the model has already processed). For real-time applications, this trade-off is brutal. You gain refinement at the cost of responsiveness.

This tension sits at the heart of test-time computation scaling. The question isn’t whether models benefit from extra thinking, but how to let them think without breaking latency. Parallel Loop Transformers (PLT) were designed to solve exactly this problem: instead of looping sequentially, run all loops simultaneously on different hardware. Use position offsets to tell the model which iteration it’s in, and add gating mechanisms so the model can decide whether to use fresh computation or rely on what it already learned. In theory, this lets loop count become a design choice rather than something imposed by speed constraints. You could loop 10 times if it helped.

But should you? That question led researchers to train LoopCoder-v2, a family of 7-billion-parameter code models with loop counts ranging from one to five. What they found was counterintuitive enough to demand explanation: two loops was optimal. Three loops got worse. Not just a smaller gain, as diminishing returns would suggest, but an outright regression. The model produced worse code despite having more opportunity to refine it.

How parallel looping actually works

The standard approach to test-time scaling is straightforward: take a transformer layer and apply it repeatedly to the hidden states. Each pass refines the computation, pushing the model’s hidden states through the same learned operations again and again.
But if you do this sequentially, you’re blocked. Loop two can’t start until loop one finishes, so latency scales linearly with loop count. The KV-cache, which stores the key and value vectors for each position, grows proportionally too.

Parallel Loop Transformers invert this constraint by running all loops in parallel. The trick is telling the model which loop it’s in. This is where cross-loop position offsets (CLP) come in: instead of using the same position indices for every loop, shift them. Loop one uses positions 0, 1, 2, … Loop two uses positions N, N+1, N+2, …, shifted by an offset N. The model learns that different position ranges correspond to different refinement stages. Because loops run simultaneously, latency stays roughly constant regardless of how many you add. Memory cost still grows with loop count, but far more gently than under sequential looping.

The second component is shared-KV gated sliding-window attention (G-SWA). Instead of recomputing everything from scratch in each loop, per-position gates decide whether to use fresh computation from the current loop or rely on what was cached from loop one. This is subtle but crucial: it gives the model fine-grained control over when it refines versus when it reuses.

Together, these mechanisms solve the engineering problem: loop count is no longer constrained by latency. But solving the engineering problem creates a new scientific one. If you can loop 10 times at nearly the same cost as looping twice, why don’t you? That’s the question that defines the rest of the paper.

Overview of PLT loop-count selection. Left: standard sequential looping increases latency and KV-cache memory with the loop count.
Right: PLT uses cross-loop position offsets and shared-KV gated sliding-window attention to keep both costs nearly constant.

Empirical surprise

LoopCoder-v2 was trained from scratch on 18 trillion tokens across a family of models with one, two, three, four, and five loops. All models were the same size (7 billion parameters), trained on the same data, and evaluated on the same benchmarks: code generation, code reasoning, and agentic software engineering tasks including SWE-bench Verified and Multi-SWE.
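
To make the two mechanisms from the previous section concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names `loop_position_ids` and `gated_mix` are hypothetical, the offset N is assumed to default to the sequence length, the gate is assumed to be a sigmoid that blends fresh and cached values, and a single number per position stands in for a full hidden-state vector. The sliding-window attention itself is not modeled.

```python
import math


def loop_position_ids(seq_len, num_loops, offset=None):
    """Return one list of position ids per loop, each shifted by a loop offset.

    Assumption: when no offset is given, N is taken to be the sequence length,
    so loop 0 uses 0..N-1 and loop 1 uses N..2N-1.
    """
    if offset is None:
        offset = seq_len
    return [
        [loop * offset + t for t in range(seq_len)]
        for loop in range(num_loops)
    ]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_mix(fresh, cached, gate_logits):
    """Blend fresh and cached values position by position.

    Assumed form: out = g * fresh + (1 - g) * cached, with g = sigmoid(logit).
    A gate near 1 favors the current loop's computation; a gate near 0
    favors what was cached from the first loop.
    """
    if not (len(fresh) == len(cached) == len(gate_logits)):
        raise ValueError("fresh, cached and gate_logits must have equal length")
    mixed = []
    for f, c, z in zip(fresh, cached, gate_logits):
        g = sigmoid(z)
        mixed.append(g * f + (1.0 - g) * c)
    return mixed


if __name__ == "__main__":
    # Two loops over a 4-token sequence: the second loop's ids start at N = 4.
    print(loop_position_ids(4, 2))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
    # A gate logit of 0 gives an even blend; a large logit favors fresh values.
    print(gated_mix([1.0, 1.0], [0.0, 0.0], [0.0, 50.0]))
```

In a real model the gate logits would be learned and the blended quantities would be attention outputs rather than scalars; the sketch only shows how an offset can mark the loop index and how a per-position gate can trade reuse against refinement.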
