Mr Elokusa
Article

Gemma 4: Free, Local, and Good Enough to Matter

I built a product chatbot using Gemma 4. A user lands on a vitamins page, asks "do you have vegan friendly options," and the model checks the available product data and answers. From there they can add to cart. Simple concept. The whole inference layer costs nothing. That last part is the story.



By Elokusa Zondi : April 2026






The License Is the Feature


Gemma 1, 2, and 3 shipped under Google's custom terms. They looked open but weren't. The acceptable use clauses were broad enough to give enterprise legal teams genuine pause, defining what constitutes "harm to minors" or "attacks on critical infrastructure" is not something a compliance review can easily pin down. There was also a clause prohibiting the use of model outputs to train or improve competing models, which created a data lock-in problem for anyone doing fine-tuning or synthetic data generation.

Gemma 4 ships under Apache 2.0. Clean, standard, OSI-approved. No royalties, no revenue thresholds, no ambiguous acceptable use definitions that could be enforced retroactively. You can fine-tune it on proprietary data, wrap it in a commercial product, and sell that product globally without owing Google anything.

For an enterprise this removes the "legal tax" the months of auditing that bespoke AI licenses typically require. For a startup it means your core infrastructure is owned, not rented.

That shift from rented intelligence to owned infrastructure is what makes Gemma 4 a different conversation from every previous open-weight release.




What I Actually Built


The POC is a product recommender chatbot. The model receives category and product information contextually, depending on which page the user is on, and handles natural language queries against that data.

"Do you have vegan friendly vitamins?" It checks. It answers. The user can act on the response directly.

Implementation was straightforward. The API is clean. I did not need to engineer the prompts heavily to get useful outputs. For a POC this came together fast.

One thing caught me out early.




The Reasoning Leakage Problem


When I first integrated the API the outputs were messy. The model was thinking out loud, drafting plans, checking its own logic, generating internal monologue before surfacing a final answer. In a chat interface that looks broken.

I initially assumed this was a skill issue on my end. It wasn't.

Gemma 4 has native thinking built in. When thinking mode is enabled, the model generates internal chain-of-thought tokens before producing the visible response. This is architectural, not accidental it is designed to improve output quality on complex tasks by letting the model verify its own reasoning before committing to an answer.

The problem is that in a production chatbot, users should never see the thinking chain. You need a sanitization layer. In the Hugging Face transformers library, processor.parse_response() handles this automatically. In Ollama you can pass think: false to disable it entirely, though this comes at a quality cost on complex queries. For most conversational use cases, product recommendations, customer service, FAQ, disabling thinking or stripping it at the middleware layer is the right call.

Once I understood what was happening it took twenty minutes to fix. But if you're building a production interface and you don't know this going in, your first demo is going to look broken.




The Architecture, in Plain English


Three things make Gemma 4 technically interesting beyond the benchmarks.

Mixture of Experts (26B MoE variant)

The 26B MoE model contains roughly 25 billion total parameters but only activates 3.8 billion for any given inference pass. The feed-forward layers are divided into 128 tiny experts. A routing network assigns each token to a specific subset. This means the compute cost per token matches a 4 billion parameter dense model, while the knowledge capacity reflects a 26 billion parameter system.

Practically: it fits on 16GB VRAM and runs at over 30 tokens per second on consumer hardware. That is fast enough for real-time agentic workflows.

Per-Layer Embeddings (E2B and E4B variants)

Standard transformers embed a token once at the input and carry that representation through the network. Gemma 4's smaller models inject a secondary embedding signal at every decoder layer. Each layer gets a fresh positional and semantic anchor rather than relying on the original signal to survive thirty layers of processing.

For edge deployment, phones, Raspberry Pi, NVIDIA Jetson, this keeps spatial and temporal coherence intact as the model processes vision and audio inputs. The E2B model runs with an effective memory footprint under 1.5GB.

Native thinking

As described above, reasoning as a first-class API surface rather than a prompt engineering trick. The model can generate over 4,000 tokens of internal reasoning before producing the final answer. For coding, architecture review, or complex multi-step logic, this is genuinely useful. For conversational product queries, strip it.




How It Benchmarks


The 31B Dense variant ranks third globally on the Arena AI leaderboard. It scores 84.3% on GPQA Diamond (expert-level scientific reasoning) and 80% on LiveCodeBench v6. For context, the previous generation Gemma 3 27B scored 42.4% on GPQA Diamond. That is not a small jump.

Against frontier models: Claude Opus and GPT-5 still lead on deep reasoning, creative nuance, and tasks requiring multi-faceted judgment. The gap is real, roughly 10 to 15 percent on the hardest tasks. For a product chatbot, that gap is irrelevant. For a legal document analysis tool or a complex coding copilot, it matters.

The honest position: Gemma 4 is not trying to replace frontier models. It is the model you reach for when API costs are a constraint, when data privacy matters, or when you need something running locally with zero third-party dependency.




The Cost Argument


Consider a production system serving 10,000 active users, 50 interactions each per month, averaging 1,000 input tokens and 250 output tokens per interaction. That is roughly 625 million tokens monthly.

On Claude 3.5 Sonnet API pricing, that bill runs approximately R70,000 per month. On self-hosted Gemma 4 running on cloud GPU instances, the same workload costs around R22,000 in infrastructure. On owned hardware, the cost drops to electricity and depreciation, roughly R900 per month.

For a startup watching unit economics, that difference is existential. For an enterprise processing sensitive data, the cost argument is secondary to the privacy argument.




The Data Sovereignty Angle


This connects directly to something I wrote about recently in the context of South Africa's Draft National AI Policy.

Most functional AI tools available today — GPT, Claude, Gemini — process data on servers outside South Africa. Under Section 72 of POPIA, cross-border transfers of personal information require strict compliance, often demanding proof that the destination country meets adequate data protection standards or that explicit consent has been obtained from data subjects.

A South African law firm, healthcare practice, or financial advisor using a US-based AI API is, on most interpretations of POPIA, in a grey area at best. Running Gemma 4 locally eliminates the problem entirely. The data never leaves the building. No third-party cloud provider processes the prompts. No foreign model is trained on South African personal information.

This is the honest answer to the sovereignty question raised in the policy. You cannot simultaneously require data sovereignty and mandate that businesses use only foreign-hosted models. Gemma 4 — and open-weight models generally — are the technical resolution to that contradiction.




Where It Falls Short


For completeness.

Complex multi-step reasoning gaps compared to frontier models are real. If you are building something that requires extreme reasoning depth, high creative nuance, or ultra-long context beyond 256K tokens, Gemma 4 is not the right choice.

The 26B MoE and 31B Dense variants require 16GB and 20GB of VRAM respectively, even with 4-bit quantization. An 8GB consumer GPU will not run the flagship models. This limits local deployment to high-end workstations or Apple Silicon machines with unified memory.

Fine-tuning the MoE variant requires care. Because only a subset of experts activates per token, full fine-tuning risks expert collapse certain experts stop receiving routing and the model degrades. LoRA is the recommended approach.




The Practical Conclusion


Gemma 4 is the first open-weight model I would actually put in a production environment for the right use case.

The licensing is clean. The efficiency is real. The cost economics are compelling. And for anyone building in a jurisdiction with data sovereignty requirements which is most of the world the ability to run locally is not a nice-to-have, it is a compliance requirement.

It will not replace Claude or GPT-5 for the tasks that genuinely need them. The right architecture is a hybrid stack: Gemma 4 as the workhorse for high-volume, sensitive, and edge-based tasks, with frontier models reserved for the subset of queries that require premium reasoning depth.

That stack is now accessible to any developer with a capable workstation and an afternoon.




Elokusa Zondi is an AI and Automation Engineer based in Cape Town. He builds production AI systems and writes about what actually works.

Model weights: Hugging Face | Ollama | Google AI Studio

Discussion

(0)

Join the first cohort. Unlock badges.

Sign In to Comment