llama.cpp
SHA-256纯 C/C++ 的高性能大模型推理引擎,支持低比特量化与多种硬件(Apple Silicon、CUDA、Vulkan 等),轻量可嵌入。
本地运行大语言模型的最轻量引擎,不用装 PyTorch,省内存!
核心功能
- 纯 C/C++ 实现,零依赖,可直接嵌入到各种应用中
- 支持 1.5 至 8 比特整数量化,显存占用极低
- 多后端:Apple Silicon、x86、NVIDIA、AMD、Vulkan、SYCL
- 兼容数十种模型格式(GGUF),覆盖主流开源大模型
- 提供命令行推理和 OpenAI 兼容的 API 服务器
避坑指南
- •模型必须为 GGUF 格式,部分旧版本工具不支持最新 GGUF;2. 量化模型(尤其 2-bit 以下)会损失部分推理质量,需要根据任务平衡速度与效果;3. 首次运行时会从 Hugging Face 下载模型,需保证网络畅通。
适用场景
- 在个人电脑上运行 7B~70B 参数的大模型,无网络延迟
- 将 LLM 推理集成到桌面、移动或服务器软件中
- 批量处理文本生成、翻译、摘要等任务,低成本部署
详细介绍
llama.cpp 是一个纯 C/C++ 实现的大语言模型推理引擎,不需要安装 PyTorch 或 TensorFlow 等重型框架。它原生支持 Apple Silicon、x86(AVX/AVX2/AVX512)、RISC‑V、NVIDIA(CUDA)、AMD(HIP)以及 Vulkan/SYCL 后端。核心亮点是极高效的整数量化(1.5 比特到 8 比特),大幅降低显存占用,同时保持不错的效果。它兼容数十种模型架构(如 LLaMA、Mistral、Qwen、Gemma、DeepSeek 等),并提供命令行工具 `llama-cli` 和兼容 OpenAI 的 API 服务器 `llama-server`。相比 Ollama 或 LM Studio,llama.cpp 更轻量、无后台常驻进程、无固定界面,非常适合开发者将其嵌入自己的应用或脚本中。
标签
快速上手
安装软件
双击下载的安装程序,按提示完成安装
从 GitHub Releases 下载适合你系统的预编译包,或通过 brew/nix/winget 安装
准备一个 GGUF 格式的模型文件(可从 Hugging Face 直接下载,如 `ggml-org/gemma-3-1b-it-GGUF`)
打开终端,运行 `llama-cli -m 模型路径.gguf` 开始对话;或运行 `llama-server -m 模型路径.gguf` 启动 API 服务器
- 从 GitHub Releases 下载适合你系统的预编译包,或通过 brew/nix/winget 安装
- 准备一个 GGUF 格式的模型文件(可从 Hugging Face 直接下载,如 `ggml-org/gemma-3-1b-it-GGUF`)
- 打开终端,运行 `llama-cli -m 模型路径.gguf` 开始对话;或运行 `llama-server -m 模型路径.gguf` 启动 API 服务器
最新更新
<details open>
hexagon: add support for TRI op (#22822)
* Hexagon: TRI HVX Kernel addition to ggml hexagon HTP ops and context
* addressed PR review comments for TRI op
* hexagon: clang format
* hex-unary: remove merge conflict markers
* hex-ggml: remove duplicate op cases (merge conflict)
* hex-ggml: fix editor config errors
---------
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
</details>
**macOS/iOS:**
- [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-macos-arm64.tar.gz)
- [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-macos-arm64-kleidiai.tar.gz)
- [macOS Intel (x64)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-macos-x64.tar.gz)
- [iOS XCFramework](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-xcframework.zip)
**Linux:**
- [Ubuntu x64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-x64.tar.gz)
- [Ubuntu arm64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-arm64.tar.gz)
- [Ubuntu s390x (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-s390x.tar.gz)
- [Ubuntu x64 (Vulkan)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-vulkan-x64.tar.gz)
- [Ubuntu arm64 (Vulkan)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-vulkan-arm64.tar.gz)
- [Ubuntu x64 (ROCm 7.2)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-rocm-7.2-x64.tar.gz)
- [Ubuntu x64 (OpenVINO)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-openvino-2026.0-x64.tar.gz)
- [Ubuntu x64 (SYCL FP32)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-sycl-fp32-x64.tar.gz)
- [Ubuntu x64 (SYCL FP16)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-sycl-fp16-x64.tar.gz)
**Android:**
- [Android arm64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-android-arm64.tar.gz)
**Windows:**
- [Windows x64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-cpu-x64.zip)
- [Windows arm64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-cpu-arm64.zip)
- [Windows x64 (CUDA 12)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-cuda-12.4-x64.zip) - [CUDA 12.4 DLLs](https://github.com/ggml-org/llama.cpp/releases/download/b9222/cudart-llama-bin-win-cuda-12.4-x64.zip)
- [Windows x64 (CUDA 13)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-cuda-13.1-x64.zip) - [CUDA 13.1 DLLs](https://github.com/ggml-org/llama.cpp/releases/download/b9222/cudart-llama-bin-win-cuda-13.1-x64.zip)
- [Windows x64 (Vulkan)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-vulkan-x64.zip)
- [Windows x64 (SYCL)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-sycl-x64.zip)
- [Windows x64 (HIP)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-hip-radeon-x64.zip)
**openEuler:**
- [openEuler x86 (310p)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-310p-openEuler-x86.tar.gz)
- [openEuler x86 (910b, ACL Graph)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-910b-openEuler-x86-aclgraph.tar.gz)
- [openEuler aarch64 (310p)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-310p-openEuler-aarch64.tar.gz)
- [openEuler aarch64 (910b, ACL Graph)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-910b-openEuler-aarch64-aclgraph.tar.gz)
已提供 SHA-256 校验码,下载后可自行核对文件完整性
该校验码提取自 GitHub 官方 Release 页面
SHA256 校验码
f96935e7e385e3b2d0189239077c10fe8fd7e95690fea4afec455b1b6c7e3f18该校验码提取自 GitHub Release 页面,下载后请自行核对文件完整性
本平台所有 SHA-256 校验码均提取自项目在 GitHub 官方 Release 页面发布的文件,未做任何修改。你可以通过 GitHub Releases 页面自行验证。
开源透明
查看 GitHub 源码卸载说明
若通过 brew 安装则 `brew uninstall llama.cpp`;通过 nix 安装则 `nix profile remove llama.cpp`;手动下载的包直接删除可执行文件和 `~/.cache/llama.cpp` 缓存目录即可。
无额外依赖
下载后即可直接使用,无需安装其他运行环境