{"id":2441,"date":"2026-04-11T17:53:47","date_gmt":"2026-04-11T17:53:47","guid":{"rendered":"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/"},"modified":"2026-04-11T17:53:47","modified_gmt":"2026-04-11T17:53:47","slug":"googles-best-open-model-yet-has-a-memory-problem","status":"publish","type":"post","link":"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/","title":{"rendered":"Google\u2019s Best Open Model Yet Has a Memory Problem"},"content":{"rendered":"<p>Google DeepMind released Gemma 4 on Easter weekend, and the local AI community responded like it was Christmas. The family spans four sizes &#8211; E2B, E4B, 26B A4B (MoE), and 31B dense &#8211; with the 31B landing on Hugging Face under an Apache 2.0 license. That licensing change matters: previous Gemma releases used a custom Google license with usage restrictions, and Apache 2.0 removes that friction for commercial deployment.<\/p>\n<p>The benchmark numbers are good. The 31B scores 89.2% on AIME 2026 without tools, 80% on LiveCodeBench v6, and a Codeforces ELO of 2150. For comparison, Gemma 3 27B scored 110 on that same Codeforces benchmark. The smaller E2B model &#8211; which has only 2.3 billion effective parameters &#8211; comes close to Gemma 3 27B on MMLU Pro (60% vs 67.6%) and outperforms it on GPQA Diamond (43.4% vs 42.4%) and LiveCodeBench (44% vs 29.1%). Some users called it \u201cinsane\u201d &#8211; a fair reaction.<\/p>\n<h2>What the 31B actually does<\/h2>\n<p>The 31B is a dense model with 30.7B parameters, a 256K-token context window, and a hybrid attention mechanism that interleaves local sliding-window attention (1024-token window) with global attention layers. The final layer is always global. 
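That interleaved local\/global layout is the main lever on KV-cache memory. A rough sketch of the effect, with purely illustrative layer counts, head dimensions, and global-layer ratio (none of these numbers are Gemma 4's published config):

```python
# Illustrative sketch: KV-cache cost of interleaved sliding-window/global
# attention. All dimensions below are assumptions, not Gemma 4's real config.

def kv_cache_bytes(n_layers, global_every, context_len, window=1024,
                   n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Estimate KV-cache size when every `global_every`-th layer (and the
    final layer) is global and caches the full context, while the remaining
    layers are sliding-window and cache at most `window` tokens.
    Counts both K and V tensors at `bytes_per_elem` (2 = fp16/bf16)."""
    total = 0
    for layer in range(n_layers):
        is_global = (layer + 1) % global_every == 0 or layer == n_layers - 1
        cached = context_len if is_global else min(window, context_len)
        total += 2 * cached * n_kv_heads * head_dim * bytes_per_elem
    return total

# Hypothetical 48-layer model at 128K context:
full_global = kv_cache_bytes(48, 1, 128_000)   # every layer global
interleaved = kv_cache_bytes(48, 6, 128_000)   # one global layer in six
print(f"all-global:  {full_global / 2**30:.1f} GiB")
print(f"interleaved: {interleaved / 2**30:.1f} GiB")
```

With these made-up dimensions, the global layers dominate the total almost entirely; the sliding-window layers stay at a small fixed cost regardless of context length, which is why the ratio of global to local layers matters so much at 256K.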
For long-context tasks, global layers use unified Keys and Values with Proportional RoPE (p-RoPE), which is how Google gets memory efficiency at scale without completely tanking reasoning quality.<\/p>\n<p>Multimodal support covers text and images, with a 550M-parameter vision encoder. The model can process images at variable resolutions using a configurable token budget (70 to 1120 tokens per image) &#8211; lower budgets for speed on classification tasks, higher budgets for OCR and document parsing where fine-grained detail matters. The smaller E2B and E4B models additionally support audio input of up to 30 seconds, enabling single-model pipelines for voice applications.<\/p>\n<p><em>Benchmarks, from HuggingFace<\/em><\/p>\n<p>Thinking mode is built in and configurable. Include &lt;|think|&gt; in the system prompt to activate it; remove it to disable it. The model outputs its reasoning trace in &lt;|channel|&gt;thought [reasoning]&lt;|channel|&gt; blocks before the final answer. In multi-turn conversations, you strip the thinking content from history before the next user turn &#8211; thinking traces don\u2019t get passed back.<\/p>\n<p><em>From r\/LocalLLaMA<\/em><\/p>\n<p>Coding is a clear strength. The 31B\u2019s Codeforces ELO of 2150 is a significant jump from anything in the open-weight space at this size. On r\/LocalLLaMA, u\/DigiDecode_ posted a screenshot showing the 31B ranking above GLM-5 on LMSys, which landed with some force given GLM-5\u2019s reputation.<\/p>\n<h2>How to run it<\/h2>\n<p>The model is available on Hugging Face and loads through the standard Transformers interface. 
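The strip-the-thinking-trace step described above is plain history hygiene and needs no model to demonstrate. The &lt;|channel|&gt; delimiters and the helper below are assumptions for illustration, not Gemma 4's exact chat template:

```python
import re

# Hypothetical helper: remove reasoning traces from assistant turns before
# sending history back to the model. The <|channel|>thought ... <|channel|>
# delimiters are an assumption based on the thinking-trace format described
# in the article, not a confirmed template.
THINK_BLOCK = re.compile(r"<\|channel\|>thought.*?<\|channel\|>", re.DOTALL)

def strip_thinking(messages):
    """Return a copy of the chat history with thinking blocks removed from
    assistant messages; user and system turns pass through unchanged."""
    cleaned = []
    for msg in messages:
        if msg["role"] == "assistant":
            content = THINK_BLOCK.sub("", msg["content"]).strip()
            cleaned.append({"role": "assistant", "content": content})
        else:
            cleaned.append(msg)
    return cleaned

history = [
    {"role": "user", "content": "What is 17 * 23?"},
    {"role": "assistant",
     "content": "<|channel|>thought\n17 * 23 = 391<|channel|>The answer is 391."},
]
print(strip_thinking(history)[1]["content"])  # prints: The answer is 391.
```

The point is that only the final answer survives into the next turn; the reasoning trace is regenerated fresh each time rather than accumulating in context.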
For text inputs:<\/p>\n<pre>pip install -U transformers torch accelerate<\/pre>\n<pre>from transformers import AutoProcessor, AutoModelForCausalLM\n\nprocessor = AutoProcessor.from_pretrained(\"google\/gemma-4-31B-it\")\nmodel = AutoModelForCausalLM.from_pretrained(\"google\/gemma-4-31B-it\", dtype=\"auto\", device_map=\"auto\")<\/pre>\n<p>Use AutoModelForMultimodalLM instead if you\u2019re working with images or video (or audio on the E2B\/E4B variants). Recommended sampling parameters from Google: temperature=1.0, top_p=0.95, top_k=64. For thinking mode, pass enable_thinking=True to apply_chat_template and use processor.parse_response() to separate the thinking trace from the final answer.<\/p>\n<p>GGUF quantizations are available via Unsloth. NVIDIA also offers a free API endpoint at build.nvidia.com at 40 requests per minute, which is useful for evaluation before committing to local deployment.<\/p>\n<p>For local inference, Google\u2019s recommended config for llama.cpp is: --flash-attn on, --temp 1.0, --top-p 0.95, --top-k 64, --jinja. You\u2019ll want KV quantization unless you have unusual amounts of VRAM available.<\/p>\n<h2>The KV cache problem<\/h2>\n<p>This is where the reception gets complicated. The 31B has a massive KV cache footprint &#8211; a consequence of its multimodal architecture. On Reddit, users reported that on a 40GB VRAM card, the Q8 quantization (35GB) can\u2019t fit even a 2K context without also quantizing the KV cache to Q4. Qwen3.5-27B, by comparison, fits at full context without KV quantization on the same hardware. A llama.cpp update since release has improved this by properly implementing Sliding Window Attention, which reduces the fixed KV allocation significantly &#8211; but you need to re-download the Unsloth quants if you grabbed them at launch.<\/p>\n","protected":false},"excerpt":{"rendered":"<div>Gemma 4 31B&#8217;s 256K context window is real. 
So is the VRAM bill that comes with it.<\/div>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"site-container-style":"default","site-container-layout":"default","site-sidebar-layout":"default","disable-article-header":"default","disable-site-header":"default","disable-site-footer":"default","disable-content-area-spacing":"default","footnotes":""},"categories":[1],"tags":[3],"class_list":["post-2441","post","type-post","status-publish","format-standard","hentry","category-ai-and-ml","tag-ai"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.7 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Google\u2019s Best Open Model Yet Has a Memory Problem - Imperative Business Ventures Limited<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Google\u2019s Best Open Model Yet Has a Memory Problem - Imperative Business Ventures Limited\" \/>\n<meta property=\"og:description\" content=\"Gemma 4 31B&#039;s 256K context window is real. 
So is the VRAM bill that comes with it.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/\" \/>\n<meta property=\"og:site_name\" content=\"Imperative Business Ventures Limited\" \/>\n<meta property=\"article:published_time\" content=\"2026-04-11T17:53:47+00:00\" \/>\n<meta name=\"author\" content=\"admin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/\"},\"author\":{\"name\":\"admin\",\"@id\":\"https:\/\/blog.ibvl.in\/#\/schema\/person\/55b87b72a56b1bbe9295fe5ef7a20b02\"},\"headline\":\"Google\u2019s Best Open Model Yet Has a Memory Problem\",\"datePublished\":\"2026-04-11T17:53:47+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/\"},\"wordCount\":677,\"keywords\":[\"AI\"],\"articleSection\":[\"AI and ML\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/\",\"url\":\"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/\",\"name\":\"Google\u2019s Best Open Model Yet Has a Memory Problem - Imperative Business Ventures 
Limited\",\"isPartOf\":{\"@id\":\"https:\/\/blog.ibvl.in\/#website\"},\"datePublished\":\"2026-04-11T17:53:47+00:00\",\"author\":{\"@id\":\"https:\/\/blog.ibvl.in\/#\/schema\/person\/55b87b72a56b1bbe9295fe5ef7a20b02\"},\"breadcrumb\":{\"@id\":\"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/blog.ibvl.in\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Google\u2019s Best Open Model Yet Has a Memory Problem\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/blog.ibvl.in\/#website\",\"url\":\"https:\/\/blog.ibvl.in\/\",\"name\":\"Imperative Business Ventures 
Limited\",\"description\":\"\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/blog.ibvl.in\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/blog.ibvl.in\/#\/schema\/person\/55b87b72a56b1bbe9295fe5ef7a20b02\",\"name\":\"admin\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.ibvl.in\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/4d20b2cd313e4417a599678e950e6fb7d4dfa178a72f2b769335a08aaa615aa9?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/4d20b2cd313e4417a599678e950e6fb7d4dfa178a72f2b769335a08aaa615aa9?s=96&d=mm&r=g\",\"caption\":\"admin\"},\"sameAs\":[\"https:\/\/blog.ibvl.in\"],\"url\":\"https:\/\/blog.ibvl.in\/index.php\/author\/admin_hcbs9yw6\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Google\u2019s Best Open Model Yet Has a Memory Problem - Imperative Business Ventures Limited","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/","og_locale":"en_US","og_type":"article","og_title":"Google\u2019s Best Open Model Yet Has a Memory Problem - Imperative Business Ventures Limited","og_description":"Gemma 4 31B's 256K context window is real. So is the VRAM bill that comes with it.","og_url":"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/","og_site_name":"Imperative Business Ventures Limited","article_published_time":"2026-04-11T17:53:47+00:00","author":"admin","twitter_card":"summary_large_image","twitter_misc":{"Written by":"admin","Est. 
reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/#article","isPartOf":{"@id":"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/"},"author":{"name":"admin","@id":"https:\/\/blog.ibvl.in\/#\/schema\/person\/55b87b72a56b1bbe9295fe5ef7a20b02"},"headline":"Google\u2019s Best Open Model Yet Has a Memory Problem","datePublished":"2026-04-11T17:53:47+00:00","mainEntityOfPage":{"@id":"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/"},"wordCount":677,"keywords":["AI"],"articleSection":["AI and ML"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/","url":"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/","name":"Google\u2019s Best Open Model Yet Has a Memory Problem - Imperative Business Ventures Limited","isPartOf":{"@id":"https:\/\/blog.ibvl.in\/#website"},"datePublished":"2026-04-11T17:53:47+00:00","author":{"@id":"https:\/\/blog.ibvl.in\/#\/schema\/person\/55b87b72a56b1bbe9295fe5ef7a20b02"},"breadcrumb":{"@id":"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/blog.ibvl.in\/index.php\/2026\/04\/11\/googles-best-open-model-yet-has-a-memory-problem\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/blog.ibvl.in\/"},{"@type":"ListItem","position":2,"name":"Google\u2019s Best Open Model Yet Has a Memory 
Problem"}]},{"@type":"WebSite","@id":"https:\/\/blog.ibvl.in\/#website","url":"https:\/\/blog.ibvl.in\/","name":"Imperative Business Ventures Limited","description":"","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blog.ibvl.in\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/blog.ibvl.in\/#\/schema\/person\/55b87b72a56b1bbe9295fe5ef7a20b02","name":"admin","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog.ibvl.in\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/4d20b2cd313e4417a599678e950e6fb7d4dfa178a72f2b769335a08aaa615aa9?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/4d20b2cd313e4417a599678e950e6fb7d4dfa178a72f2b769335a08aaa615aa9?s=96&d=mm&r=g","caption":"admin"},"sameAs":["https:\/\/blog.ibvl.in"],"url":"https:\/\/blog.ibvl.in\/index.php\/author\/admin_hcbs9yw6\/"}]}},"_links":{"self":[{"href":"https:\/\/blog.ibvl.in\/index.php\/wp-json\/wp\/v2\/posts\/2441","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.ibvl.in\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.ibvl.in\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.ibvl.in\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.ibvl.in\/index.php\/wp-json\/wp\/v2\/comments?post=2441"}],"version-history":[{"count":0,"href":"https:\/\/blog.ibvl.in\/index.php\/wp-json\/wp\/v2\/posts\/2441\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.ibvl.in\/index.php\/wp-json\/wp\/v2\/media?parent=2441"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.ibvl.in\/index.php\/wp-json\/wp\/v2\/categories?post=2441"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.ibvl.in\/index.php\/wp-jso
n\/wp\/v2\/tags?post=2441"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}