Post

This article was created collaboratively by the author and AI. The author provided the core content, and AI helped organize, polish, and improve readability. This English version was also translated by AI from the Chinese version, so minor wording differences may exist.

After the release of Gemma 4, I tested it locally using Ollama and LM Studio, focusing on the smaller E2B and E4B versions. The core conclusion is: extremely low VRAM usage, outstanding Japanese capability among similar small models, and a very good local translation experience.

🚀 Why Is It Best Suited for Japanese Users?

The E2B and E4B models consume minimal VRAM; even the E2B can run on mobile phones, making it sufficient for basic tasks.

Most open-source small models currently available are led by Chinese teams. However, their Japanese processing capabilities generally fall short: MiniMax has no dedicated Japanese offering, while GLM and Kimi often occasionally mix in Chinese characters when translating Japanese content. Gemma 4 avoids this issue entirely. For the target user group (Japanese users), it represents the best current option.

💻 Primary Use Cases for Small Models

The applications of small models are relatively niche. Simple tasks can still be done manually, and complex problems remain beyond their scope. For me personally, the biggest use case right now is translation.

🛠️ Practical Workflow Analysis

Translation Plugin Choice

I switched from “Immersive Translate” to “Read Frog.” Immersive Translate had too much commercial residue and bloat; Read Frog offers a much smoother open-source experience, although it occasionally encounters small bugs.

Comparison of Translation Models

Google previously released TranslateGemma (a dedicated translation model for 4B), but its drawback was inaccurate recognition between Simplified and Traditional Chinese when translating to Chinese, resulting in poor quality. Using Gemma 4 eliminates this issue almost entirely.

However, such translation plugins have a fixed limitation: they can only translate by “snippets” and cannot understand full-page context, making contextual errors inevitable—this fact must be accepted.

Controlling Reasoning (System Prompt)

Gemma 4 supports toggling the reasoning process using system prompts. When performing translation tasks, it is recommended to close the reasoning switch. This prevents extraneous thought steps from being outputted, ensuring maximum speed; conversely, opening it is ideal for casual conversation or deep analysis.

📊 Performance Data Reference

I tested the dynamically quantized version by Unsloth:

  • RTX 4080: E4B runs at approximately 110 tokens/s
  • M5 MacBook Pro: E2B runs at 60+ tokens/s

⚠️ Critical Deployment Considerations (Key Supplement)

I found an issue: when using Unsloth’s dynamic quantization version, executing a tools call within tool-calling flows like those in Claude Code often fails to execute properly. Therefore, to ensure optimal compatibility and stability, it is strongly recommended that users prioritize deploying and testing with the official or officially recommended quantized versions provided by Ollama and LM Studio.

Final Takeaway

Gemma 4 small models are outstanding in VRAM footprint, Japanese accuracy, and translation quality, which LMArena rankings confirm. If you seek a local deployable, reliable, and high-quality model for Japanese language tasks and translation, Gemma 4 is currently the premier choice.

This post is licensed under CC BY 4.0 by the author.