llama.cpp
SHA-256High-performance LLM inference engine in C/C++ with minimal dependencies, supporting quantized models (1.5–8 bit) and diverse hardware (Apple Silicon, CUDA, Vulkan, etc.).
Smart Download
Download Download Version
vb9222 · 383.9 MB
Lightweight, pure C/C++ LLM inference with minimal setup and top performance on any hardware.
Core Features
- Pure C/C++ implementation with no external dependencies
- Supports 1.5-bit to 8-bit integer quantization for low VRAM usage
- Runs on Apple Silicon (NEON/Metal), x86 (AVX/AVX2/AVX512), NVIDIA (CUDA), AMD (HIP), Vulkan, and SYCL
- Compatible with dozens of model architectures via GGUF format
- Both CLI client and OpenAI-compatible API server included
What It Can't Do
- •Models must be in GGUF format; older tools may not support latest specs. 2. Heavy quantization (< 3-bit) may noticeably degrade output quality. 3. First launch downloads the model from Hugging Face (requires internet).
Use Cases
- Run local LLMs on personal laptops or edge devices without internet
- Embed LLM inference into custom applications (desktop, mobile, server)
- Batch text generation, translation, summarization with low cost
llama.cpp is a pure C/C++ implementation for running large language models (LLMs) on local devices. It requires no heavy frameworks (PyTorch, TensorFlow) and works out‑of‑the‑box on Apple Silicon, x86 (AVX/AVX2/AVX512), RISC‑V, NVIDIA (CUDA), AMD (HIP), and Intel/AMD GPUs (Vulkan, SYCL). Key innovation: ultra‑efficient integer quantization from 1.5‑bit to 8‑bit, drastically reducing memory usage while retaining acceptable accuracy. It supports dozens of architectures (LLaMA, Mistral, Qwen, Gemma, DeepSeek, etc.) and provides both a CLI (`llama-cli`) and an OpenAI‑compatible API server (`llama-server`). Compared to Ollama or LM Studio, llama.cpp is more stripped‑down – no background daemon, no rigid UI – making it perfect for developers who want to integrate LLM inference into their own applications or scripts.
Tags
Getting Started
Download installer
Click the button above to download the installer for your system
Install the software
Double-click the downloaded installer and follow the prompts
Download a prebuilt binary from GitHub Releases or install via brew/nix/winget
Obtain a GGUF model file (e.g., `ggml-org/gemma-3-1b-it-GGUF` from Hugging Face)
Run `llama-cli -m model.gguf` to chat, or `llama-server -m model.gguf` to start an OpenAI-compatible API
- Download a prebuilt binary from GitHub Releases or install via brew/nix/winget
- Obtain a GGUF model file (e.g., `ggml-org/gemma-3-1b-it-GGUF` from Hugging Face)
- Run `llama-cli -m model.gguf` to chat, or `llama-server -m model.gguf` to start an OpenAI-compatible API
SHA-256 checksum verified
Checksum extracted from GitHub official Release page
SHA256 Checksum
f96935e7e385e3b2d0189239077c10fe8fd7e95690fea4afec455b1b6c7e3f18This checksum is extracted from the GitHub Release page. Verify file integrity after download.
All SHA-256 checksums on this platform are extracted from the project's official GitHub Release page, without any modification. You can independently verify them on the GitHub Releases page.
Open Source Transparency
View GitHub SourceUninstall Info
If installed via brew: `brew uninstall llama.cpp`. Via nix: `nix profile remove llama.cpp`. For manual install, delete the executable and `~/.cache/llama.cpp`.
No Extra Dependencies
Ready to use after download. No additional runtime required.
Having issues? Check the FAQ below
4 FAQs
Similar Projects
ollama
Ollama lets you download, run, and manage large language models locally. One command, multiple platforms, endless possibilities.
Chatbox
Chatbox Community Edition is an open-source desktop client for interacting with multiple large language models. It supports OpenAI (ChatGPT), Azure OpenAI, Claude, Google Gemini Pro, Ollama (local models like Llama 2, Mistral), and ChatGLM-6B. All your chat data is stored locally on your device, ensuring privacy and preventing data loss. The app features a clean, ergonomic UI with dark mode, keyboard shortcuts, streaming replies, and full Markdown/LaTeX rendering with code highlighting. It also includes a prompt library, message quoting, and team collaboration for sharing API resources. Available on Windows, macOS, Linux, Web, iOS, and Android. The community edition is fully functional but may lack some advanced features from the pro version.
opencv
OpenCV is an open-source computer vision and machine learning library with over 2500 optimized algorithms for real-time image and video analysis.