Ever tried building RAG with images, audio, or video? You’ve probably hit the wall where most frameworks just… don’t. RAG-Anything solves this by being genuinely multi-modal from the ground up. While others bolt on image support as an afterthought, this framework treats text, images, audio, and video as first-class citizens in your knowledge base.
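To make "first-class citizens" concrete: a multimodal RAG framework routes every format through a modality-specific extractor before anything is indexed, so an image or recording ends up as queryable chunks right alongside text. Here is a minimal, purely illustrative sketch of that idea — the names below are hypothetical and are not RAG-Anything's actual API:

```python
from pathlib import Path

# Hypothetical sketch: route each file type to a modality-specific
# extractor so every format lands in one shared knowledge base.
# None of these names come from RAG-Anything's real API.
EXTRACTORS = {}

def extractor(*exts):
    """Register a handler for the given file extensions."""
    def register(fn):
        for ext in exts:
            EXTRACTORS[ext] = fn
        return fn
    return register

@extractor(".txt", ".md")
def extract_text(path):
    # Plain documents pass through as text chunks.
    return [{"modality": "text", "content": Path(path).stem}]

@extractor(".png", ".jpg")
def extract_image(path):
    # A real system would caption the image with a vision model.
    return [{"modality": "image", "content": f"caption of {path}"}]

@extractor(".wav", ".mp4")
def extract_av(path):
    # A real system would transcribe the audio or video track.
    return [{"modality": "audio/video", "content": f"transcript of {path}"}]

def ingest(paths):
    """Build a flat list of chunks, whatever the source formats."""
    chunks = []
    for p in paths:
        handler = EXTRACTORS.get(Path(p).suffix.lower())
        if handler:
            chunks.extend(handler(p))
    return chunks
```

Calling `ingest(["report.md", "chart.png", "meeting.wav"])` yields one text, one image, and one audio/video chunk — conceptually what happens before a multimodal framework builds its index.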

Built on LightRAG, it inherits that project's retrieval performance while adding the flexibility to mix formats freely. The 13.5k stars aren't just hype - developers are discovering they can finally build RAG systems that mirror how humans actually consume information: through multiple senses and formats. It ships with ready-to-use components, proper documentation, and even supports installation via the fast uv package manager.

Whether you’re building a document analysis system that needs to understand charts and diagrams, or creating an AI assistant that can process meeting recordings and slides together, this is your fastest path from idea to working prototype. The active Discord community and bilingual docs show this isn’t just another research project - it’s production-ready tooling.


⭐ Stars: 13541
💻 Language: Python
🔗 Repository: HKUDS/RAG-Anything