Tired of sending your data to OpenAI or waiting for cloud APIs? llama.cpp lets you run Llama, Mistral, Gemma, and dozens of other language models directly on your machine - on a plain CPU if that’s all you have, no NVIDIA tax required. With 95k+ stars, this isn’t just another AI wrapper; it’s the de facto standard for local LLM inference, powering everything from VS Code extensions to production servers.

What makes it special? Pure C/C++ means serious performance - we’re talking 7B-parameter models running at conversational speeds on a MacBook. The project supports quantized models (think 4-bit precision) that cut weight memory by roughly 75% versus 16-bit, with only a modest quality hit. Plus you get an OpenAI-compatible API server, so your existing client code works unchanged. The ecosystem is massive: official Hugging Face integration, Docker images, package-manager support, and even a WebUI that just landed.
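That 75% figure is just bit-width arithmetic. A back-of-the-envelope sketch (assuming a 16-bit baseline and ignoring the few percent of overhead that real GGUF quantization formats add for per-block scale factors):

```python
# Rough memory math for quantized model weights.
# Assumes fp16 (16-bit) baseline; real quant formats carry small
# per-block scale overhead that is ignored here.

def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed to hold the weights alone, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

params_7b = 7e9
fp16 = weight_memory_gb(params_7b, 16)  # baseline
q4 = weight_memory_gb(params_7b, 4)     # 4-bit quantized

print(f"fp16:    {fp16:.1f} GB")          # 14.0 GB
print(f"4-bit:   {q4:.1f} GB")            # 3.5 GB
print(f"savings: {1 - q4 / fp16:.0%}")    # 75%
```

Which is why a 7B model that would need a 16 GB+ GPU in fp16 fits comfortably in the RAM of an ordinary laptop once quantized.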

Whether you’re building privacy-focused apps, prototyping offline, or just want to experiment without API costs, this is your entry point. Installation takes minutes via brew/Docker, and you can literally download and run models directly from Hugging Face with a single command. The active development and corporate backing (NVIDIA collaboration) mean this isn’t going anywhere.
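As a sketch of that quick-start path (the model repo below is an example; the `-hf` flag accepts any GGUF repo on Hugging Face):

```shell
# Install the CLI and server (macOS/Linux via Homebrew).
brew install llama.cpp

# Download a GGUF model from Hugging Face and chat with it in one command.
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Or serve it with an OpenAI-compatible API (default: http://localhost:8080),
# then point your existing OpenAI client code at that base URL.
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```

The first `-hf` invocation downloads and caches the model, so subsequent runs start immediately.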


⭐ Stars: 95570
💻 Language: C++
🔗 Repository: ggml-org/llama.cpp