llama.cpp
SHA-256High-performance LLM inference engine in C/C++ with minimal dependencies, supporting quantized models (1.5–8 bit) and diverse hardware (Apple Silicon, CUDA, Vulkan, etc.).
Smart Download
Download Download Version
vb9222 · 383.9 MB
Lightweight, pure C/C++ LLM inference with minimal setup and top performance on any hardware.
Core Features
- Pure C/C++ implementation with no external dependencies
- Supports 1.5-bit to 8-bit integer quantization for low VRAM usage
- Runs on Apple Silicon (NEON/Metal), x86 (AVX/AVX2/AVX512), NVIDIA (CUDA), AMD (HIP), Vulkan, and SYCL
- Compatible with dozens of model architectures via GGUF format
- Both CLI client and OpenAI-compatible API server included
What It Can't Do
- •Models must be in GGUF format; older tools may not support latest specs. 2. Heavy quantization (< 3-bit) may noticeably degrade output quality. 3. First launch downloads the model from Hugging Face (requires internet).
Use Cases
- Run local LLMs on personal laptops or edge devices without internet
- Embed LLM inference into custom applications (desktop, mobile, server)
- Batch text generation, translation, summarization with low cost
Detailed Introduction
llama.cpp is a pure C/C++ implementation for running large language models (LLMs) on local devices. It requires no heavy frameworks (PyTorch, TensorFlow) and works out‑of‑the‑box on Apple Silicon, x86 (AVX/AVX2/AVX512), RISC‑V, NVIDIA (CUDA), AMD (HIP), and Intel/AMD GPUs (Vulkan, SYCL). Key innovation: ultra‑efficient integer quantization from 1.5‑bit to 8‑bit, drastically reducing memory usage while retaining acceptable accuracy. It supports dozens of architectures (LLaMA, Mistral, Qwen, Gemma, DeepSeek, etc.) and provides both a CLI (`llama-cli`) and an OpenAI‑compatible API server (`llama-server`). Compared to Ollama or LM Studio, llama.cpp is more stripped‑down – no background daemon, no rigid UI – making it perfect for developers who want to integrate LLM inference into their own applications or scripts.
Tags
Getting Started
Download installer
Click the button above to download the installer for your system
Install the software
Double-click the downloaded installer and follow the prompts
Download a prebuilt binary from GitHub Releases or install via brew/nix/winget
Obtain a GGUF model file (e.g., `ggml-org/gemma-3-1b-it-GGUF` from Hugging Face)
Run `llama-cli -m model.gguf` to chat, or `llama-server -m model.gguf` to start an OpenAI-compatible API
- Download a prebuilt binary from GitHub Releases or install via brew/nix/winget
- Obtain a GGUF model file (e.g., `ggml-org/gemma-3-1b-it-GGUF` from Hugging Face)
- Run `llama-cli -m model.gguf` to chat, or `llama-server -m model.gguf` to start an OpenAI-compatible API
Latest Release Notes
<details open>
hexagon: add support for TRI op (#22822)
* Hexagon: TRI HVX Kernel addition to ggml hexagon HTP ops and context
* addressed PR review comments for TRI op
* hexagon: clang format
* hex-unary: remove merge conflict markers
* hex-ggml: remove duplicate op cases (merge conflict)
* hex-ggml: fix editor config errors
---------
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
</details>
**macOS/iOS:**
- [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-macos-arm64.tar.gz)
- [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-macos-arm64-kleidiai.tar.gz)
- [macOS Intel (x64)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-macos-x64.tar.gz)
- [iOS XCFramework](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-xcframework.zip)
**Linux:**
- [Ubuntu x64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-x64.tar.gz)
- [Ubuntu arm64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-arm64.tar.gz)
- [Ubuntu s390x (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-s390x.tar.gz)
- [Ubuntu x64 (Vulkan)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-vulkan-x64.tar.gz)
- [Ubuntu arm64 (Vulkan)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-vulkan-arm64.tar.gz)
- [Ubuntu x64 (ROCm 7.2)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-rocm-7.2-x64.tar.gz)
- [Ubuntu x64 (OpenVINO)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-openvino-2026.0-x64.tar.gz)
- [Ubuntu x64 (SYCL FP32)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-sycl-fp32-x64.tar.gz)
- [Ubuntu x64 (SYCL FP16)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-sycl-fp16-x64.tar.gz)
**Android:**
- [Android arm64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-android-arm64.tar.gz)
**Windows:**
- [Windows x64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-cpu-x64.zip)
- [Windows arm64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-cpu-arm64.zip)
- [Windows x64 (CUDA 12)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-cuda-12.4-x64.zip) - [CUDA 12.4 DLLs](https://github.com/ggml-org/llama.cpp/releases/download/b9222/cudart-llama-bin-win-cuda-12.4-x64.zip)
- [Windows x64 (CUDA 13)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-cuda-13.1-x64.zip) - [CUDA 13.1 DLLs](https://github.com/ggml-org/llama.cpp/releases/download/b9222/cudart-llama-bin-win-cuda-13.1-x64.zip)
- [Windows x64 (Vulkan)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-vulkan-x64.zip)
- [Windows x64 (SYCL)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-sycl-x64.zip)
- [Windows x64 (HIP)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-hip-radeon-x64.zip)
**openEuler:**
- [openEuler x86 (310p)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-310p-openEuler-x86.tar.gz)
- [openEuler x86 (910b, ACL Graph)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-910b-openEuler-x86-aclgraph.tar.gz)
- [openEuler aarch64 (310p)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-310p-openEuler-aarch64.tar.gz)
- [openEuler aarch64 (910b, ACL Graph)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-910b-openEuler-aarch64-aclgraph.tar.gz)
SHA-256 checksum verified
Checksum extracted from GitHub official Release page
SHA256 Checksum
f96935e7e385e3b2d0189239077c10fe8fd7e95690fea4afec455b1b6c7e3f18This checksum is extracted from the GitHub Release page. Verify file integrity after download.
All SHA-256 checksums on this platform are extracted from the project's official GitHub Release page, without any modification. You can independently verify them on the GitHub Releases page.
Open Source Transparency
View GitHub SourceUninstall Info
If installed via brew: `brew uninstall llama.cpp`. Via nix: `nix profile remove llama.cpp`. For manual install, delete the executable and `~/.cache/llama.cpp`.
No Extra Dependencies
Ready to use after download. No additional runtime required.
Similar Projects
ollama
Ollama lets you download, run, and manage large language models locally. One command, multiple platforms, endless possibilities.
Chatbox
Chatbox Community Edition is an open-source desktop client for interacting with multiple large language models. It supports OpenAI (ChatGPT), Azure OpenAI, Claude, Google Gemini Pro, Ollama (local models like Llama 2, Mistral), and ChatGLM-6B. All your chat data is stored locally on your device, ensuring privacy and preventing data loss. The app features a clean, ergonomic UI with dark mode, keyboard shortcuts, streaming replies, and full Markdown/LaTeX rendering with code highlighting. It also includes a prompt library, message quoting, and team collaboration for sharing API resources. Available on Windows, macOS, Linux, Web, iOS, and Android. The community edition is fully functional but may lack some advanced features from the pro version.
AnythingLLM
Chat with your docs, use AI agents, multi-user support, runs locally with zero setup.