Ever wanted to run a GPT-4V-class multimodal AI on your actual phone? MiniCPM-o cracks the code with a 9B-parameter model that rivals Gemini 2.5 Flash on vision and speech tasks while running locally. The killer feature isn't just the size efficiency: it's full-duplex streaming, where the AI can see through your camera, listen to audio, and speak back simultaneously, with none of the three blocking the others. Think real-time video analysis with natural conversation, not the clunky turn-based chat you're used to.
The numbers back up the hype: MiniCPM-V 4.0 (the 4B vision variant) actually outperforms GPT-4.1-mini on OpenCompass benchmarks, while the 9B omni model handles end-to-end speech without separate TTS/STT pipelines. They’ve built proper tooling too – a WebRTC demo runs on MacBooks via their llama.cpp-omni framework, and the voice cloning works in bilingual mode. With 23K+ stars and active development, this isn’t vaporware.
If you’re building mobile AI apps or want to experiment with multimodal interactions without API costs, this is your entry point. The cookbook repo shows real implementations, and the Discord community is surprisingly active and technically deep.
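If you want a feel for the API before cloning the cookbook, a minimal sketch of the Hugging Face route follows. The checkpoint id `openbmb/MiniCPM-o-2_6` and the `.chat()` call are assumptions drawn from OpenBMB's model cards, not verified against this release; check the repo README for the current interface. The lightweight part (assembling the multimodal message payload) runs as-is; the model download is left commented out.

```python
# Hedged sketch: querying a MiniCPM-o checkpoint via Hugging Face transformers.
# The model id and the .chat() signature are assumptions; consult the
# OpenBMB/MiniCPM-o README for the interface your version actually ships.

def build_msgs(question: str, image=None) -> list[dict]:
    """Assemble the role/content payload MiniCPM-V-style chat() expects:
    content is a list mixing images (PIL.Image in real use) and text."""
    content = ([image] if image is not None else []) + [question]
    return [{"role": "user", "content": content}]

if __name__ == "__main__":
    # The heavy part stays commented so the sketch runs without a GPU:
    # import torch
    # from transformers import AutoModel, AutoTokenizer
    # model = AutoModel.from_pretrained(
    #     "openbmb/MiniCPM-o-2_6", trust_remote_code=True,
    #     torch_dtype=torch.bfloat16).eval().cuda()
    # tokenizer = AutoTokenizer.from_pretrained(
    #     "openbmb/MiniCPM-o-2_6", trust_remote_code=True)
    # answer = model.chat(msgs=build_msgs("Describe this image.", img),
    #                     tokenizer=tokenizer)
    print(build_msgs("Describe this image."))
```

For on-device or MacBook inference, the repo's llama.cpp path is the lighter option; the Python route above trades memory for convenience.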
⭐ Stars: 23542
💻 Language: Python
🔗 Repository: OpenBMB/MiniCPM-o