OpenSource-Hub
L

llama.cpp

SHA-256
111.2k stars·AI Productivity·SHA-256 checksum verified

High-performance LLM inference engine in C/C++ with minimal dependencies, supporting quantized models (1.5–8 bit) and diverse hardware (Apple Silicon, CUDA, Vulkan, etc.).

Smart Download

Download Download Version

vb9222 · 383.9 MB

Lightweight, pure C/C++ LLM inference with minimal setup and top performance on any hardware.

Core Features

  • Pure C/C++ implementation with no external dependencies
  • Supports 1.5-bit to 8-bit integer quantization for low VRAM usage
  • Runs on Apple Silicon (NEON/Metal), x86 (AVX/AVX2/AVX512), NVIDIA (CUDA), AMD (HIP), Vulkan, and SYCL
  • Compatible with dozens of model architectures via GGUF format
  • Both CLI client and OpenAI-compatible API server included

What It Can't Do

  • Models must be in GGUF format; older tools may not support latest specs. 2. Heavy quantization (< 3-bit) may noticeably degrade output quality. 3. First launch downloads the model from Hugging Face (requires internet).

Use Cases

  • Run local LLMs on personal laptops or edge devices without internet
  • Embed LLM inference into custom applications (desktop, mobile, server)
  • Batch text generation, translation, summarization with low cost

Detailed Introduction

llama.cpp is a pure C/C++ implementation for running large language models (LLMs) on local devices. It requires no heavy frameworks (PyTorch, TensorFlow) and works out‑of‑the‑box on Apple Silicon, x86 (AVX/AVX2/AVX512), RISC‑V, NVIDIA (CUDA), AMD (HIP), and Intel/AMD GPUs (Vulkan, SYCL). Key innovation: ultra‑efficient integer quantization from 1.5‑bit to 8‑bit, drastically reducing memory usage while retaining acceptable accuracy. It supports dozens of architectures (LLaMA, Mistral, Qwen, Gemma, DeepSeek, etc.) and provides both a CLI (`llama-cli`) and an OpenAI‑compatible API server (`llama-server`). Compared to Ollama or LM Studio, llama.cpp is more stripped‑down – no background daemon, no rigid UI – making it perfect for developers who want to integrate LLM inference into their own applications or scripts.

Tags

llminferencec++quantizationggufapple-silicongpulocal-ai

Getting Started

1

Download installer

Click the button above to download the installer for your system

2

Install the software

Double-click the downloaded installer and follow the prompts

3

Download a prebuilt binary from GitHub Releases or install via brew/nix/winget

4

Obtain a GGUF model file (e.g., `ggml-org/gemma-3-1b-it-GGUF` from Hugging Face)

5

Run `llama-cli -m model.gguf` to chat, or `llama-server -m model.gguf` to start an OpenAI-compatible API

Install Guide
  1. Download a prebuilt binary from GitHub Releases or install via brew/nix/winget
  2. Obtain a GGUF model file (e.g., `ggml-org/gemma-3-1b-it-GGUF` from Hugging Face)
  3. Run `llama-cli -m model.gguf` to chat, or `llama-server -m model.gguf` to start an OpenAI-compatible API

Latest Release Notes

<details open>

hexagon: add support for TRI op (#22822)

* Hexagon: TRI HVX Kernel addition to ggml hexagon HTP ops and context

* addressed PR review comments for TRI op

* hexagon: clang format

* hex-unary: remove merge conflict markers

* hex-ggml: remove duplicate op cases (merge conflict)

* hex-ggml: fix editor config errors

---------

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>

</details>

**macOS/iOS:**

- [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-macos-arm64.tar.gz)

- [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-macos-arm64-kleidiai.tar.gz)

- [macOS Intel (x64)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-macos-x64.tar.gz)

- [iOS XCFramework](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-xcframework.zip)

**Linux:**

- [Ubuntu x64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-x64.tar.gz)

- [Ubuntu arm64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-arm64.tar.gz)

- [Ubuntu s390x (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-s390x.tar.gz)

- [Ubuntu x64 (Vulkan)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-vulkan-x64.tar.gz)

- [Ubuntu arm64 (Vulkan)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-vulkan-arm64.tar.gz)

- [Ubuntu x64 (ROCm 7.2)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-rocm-7.2-x64.tar.gz)

- [Ubuntu x64 (OpenVINO)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-openvino-2026.0-x64.tar.gz)

- [Ubuntu x64 (SYCL FP32)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-sycl-fp32-x64.tar.gz)

- [Ubuntu x64 (SYCL FP16)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-ubuntu-sycl-fp16-x64.tar.gz)

**Android:**

- [Android arm64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-android-arm64.tar.gz)

**Windows:**

- [Windows x64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-cpu-x64.zip)

- [Windows arm64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-cpu-arm64.zip)

- [Windows x64 (CUDA 12)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-cuda-12.4-x64.zip) - [CUDA 12.4 DLLs](https://github.com/ggml-org/llama.cpp/releases/download/b9222/cudart-llama-bin-win-cuda-12.4-x64.zip)

- [Windows x64 (CUDA 13)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-cuda-13.1-x64.zip) - [CUDA 13.1 DLLs](https://github.com/ggml-org/llama.cpp/releases/download/b9222/cudart-llama-bin-win-cuda-13.1-x64.zip)

- [Windows x64 (Vulkan)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-vulkan-x64.zip)

- [Windows x64 (SYCL)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-sycl-x64.zip)

- [Windows x64 (HIP)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-win-hip-radeon-x64.zip)

**openEuler:**

- [openEuler x86 (310p)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-310p-openEuler-x86.tar.gz)

- [openEuler x86 (910b, ACL Graph)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-910b-openEuler-x86-aclgraph.tar.gz)

- [openEuler aarch64 (310p)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-310p-openEuler-aarch64.tar.gz)

- [openEuler aarch64 (910b, ACL Graph)](https://github.com/ggml-org/llama.cpp/releases/download/b9222/llama-b9222-bin-910b-openEuler-aarch64-aclgraph.tar.gz)

File Integrity

SHA-256 checksum verified

Checksum extracted from GitHub official Release page

SHA256 Checksum

f96935e7e385e3b2d0189239077c10fe8fd7e95690fea4afec455b1b6c7e3f18

This checksum is extracted from the GitHub Release page. Verify file integrity after download.

All SHA-256 checksums on this platform are extracted from the project's official GitHub Release page, without any modification. You can independently verify them on the GitHub Releases page.

Open Source Transparency

View GitHub Source
Environment Guide

Uninstall Info

If installed via brew: `brew uninstall llama.cpp`. Via nix: `nix profile remove llama.cpp`. For manual install, delete the executable and `~/.cache/llama.cpp`.

No Extra Dependencies

Ready to use after download. No additional runtime required.

Project Info
LicenseMIT
Last Updated2026-05-19T06:14:00Z
GitHub RepositoryOfficial Website

Similar Projects