Xiaomi released MiMo-V2.5-Pro under an MIT license a few days ago, and the response has been quietly enthusiastic on r/LocalLLaMA but has barely registered elsewhere, Hacker News included. The phone-manufacturer-makes-LLM angle keeps tripping people up. MiMo-V2.5-Pro is a Mixture-of-Experts model with 1.02 trillion total parameters and 42 billion active per token, and it landed at 54 on the Artificial Analysis Intelligence Index – squarely in frontier territory. On Reddit, u/lendo93 reported that in their benchmark suite the model averages higher than Opus 4.6 on coding, reasoning, agentic work, and decision making.

About the model

The architecture is built around two ideas.

First, hybrid attention: 60 of the 70 layers use sliding-window attention with a window of 128 tokens, while only 10 layers run global attention, a 6:1 SWA-to-GA ratio. This cuts KV-cache storage by roughly 7x compared to a standard transformer, and it's how Xiaomi gets a usable 1M-token context window without the cache exploding (the arithmetic behind that figure is sketched below).

Second, multi-token prediction. Three lightweight MTP modules with dense FFNs predict ahead of the main token stream, and Xiaomi reports this triples inference output speed. The MTP modules are trained natively rather than bolted on as speculative decoding, so the speedup compounds with the long-context handling.
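To see where the roughly-7x figure comes from, here is a back-of-envelope sketch in Python. The layer split (60 sliding-window layers with a 128-token window, 10 global layers, 70 total) comes from the numbers above; head count, head dimension, and precision are left out because they cancel out of the ratio.

```python
# Back-of-envelope KV-cache comparison for the hybrid attention layout
# described above. Layer counts and window size are from the article;
# per-head sizes are omitted since they cancel in the ratio.

def kv_cache_entries_hybrid(context_len: int, layers_swa: int = 60,
                            layers_global: int = 10, window: int = 128) -> int:
    """(token, layer) entries kept in the KV cache for the hybrid layout."""
    swa = layers_swa * min(context_len, window)   # SWA layers keep only the window
    ga = layers_global * context_len              # global layers keep everything
    return swa + ga

def kv_cache_entries_dense(context_len: int, layers: int = 70) -> int:
    """Same count for a standard transformer where every layer is global."""
    return layers * context_len

for ctx in (8_000, 128_000, 1_000_000):
    ratio = kv_cache_entries_dense(ctx) / kv_cache_entries_hybrid(ctx)
    print(f"context={ctx:>9,}  dense/hybrid cache ratio = {ratio:.1f}x")

# At a 1M-token context the 60 sliding-window layers contribute a fixed
# 60 * 128 = 7,680 entries, so the ratio approaches 70/10 = 7x, matching
# the "roughly 7x" savings claimed for the hybrid layout.
```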
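The article doesn't spell out how the MTP modules are wired into decoding, so the following is a hypothetical, toy-scale sketch of the general draft-and-verify pattern that natively trained MTP heads enable. The function names, vocabulary size, and the deterministic stand-in "model" are invented for illustration; only the count of three draft heads comes from the article.

```python
# Hypothetical sketch of multi-token prediction (MTP) used draft-and-verify
# style. Xiaomi trains its three MTP modules jointly with the backbone; the
# details aren't reproduced here. A deterministic toy function stands in for
# both the main head and the MTP drafts, just to show the decode loop shape.

from typing import List, Tuple

VOCAB = 101  # toy vocabulary size

def main_head(context: List[int]) -> int:
    """Stand-in for the backbone's greedy next-token prediction."""
    return (sum(context) * 31 + len(context)) % VOCAB

def mtp_drafts(context: List[int], n_heads: int = 3) -> List[int]:
    """Stand-in for the three MTP modules: draft n_heads tokens ahead.

    Drafts usually match what the main head would produce, with an
    occasional injected disagreement to mimic an imperfect draft.
    """
    drafts, ctx = [], list(context)
    for i in range(n_heads):
        tok = main_head(ctx)
        if (len(ctx) + i) % 5 == 0:        # occasional wrong draft
            tok = (tok + 1) % VOCAB
        drafts.append(tok)
        ctx.append(tok)
    return drafts

def decode(prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """Greedy decoding with MTP drafting.

    Each step does one "real" forward pass for the next token, then accepts
    drafted tokens for as long as they agree with the main head. In a real
    system that verification is a single parallel forward pass; this toy
    simply re-calls the stand-in function, so only the step count (not the
    wall clock) illustrates the speedup. Output matches plain greedy decoding.
    """
    seq, steps = list(prompt), 0
    while len(seq) - len(prompt) < n_new:
        steps += 1
        seq.append(main_head(seq))          # the sequential forward pass
        for draft in mtp_drafts(seq):       # cheap look-ahead drafts
            if draft == main_head(seq):     # accept only exact agreement
                seq.append(draft)
            else:
                break
    return seq[: len(prompt) + n_new], steps

out, steps = decode(prompt=[1, 2, 3], n_new=32)
print(f"{len(out) - 3} tokens in {steps} decode steps (~{32 / steps:.1f}x fewer steps)")
```

Presumably the point of training the heads natively is that draft acceptance stays high enough for the look-ahead to pay off, which would be consistent with the roughly 3x output-speed figure Xiaomi reports.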