OpenSource-Hub
L

llama.cpp

SHA-256
111.2k stars·AI Productivity·SHA-256 checksum verified

High-performance LLM inference engine in C/C++ with minimal dependencies, supporting quantized models (1.5–8 bit) and diverse hardware (Apple Silicon, CUDA, Vulkan, etc.).

Smart Download

Download Download Version

vb9222 · 383.9 MB

Lightweight, pure C/C++ LLM inference with minimal setup and top performance on any hardware.

Core Features

  • Pure C/C++ implementation with no external dependencies
  • Supports 1.5-bit to 8-bit integer quantization for low VRAM usage
  • Runs on Apple Silicon (NEON/Metal), x86 (AVX/AVX2/AVX512), NVIDIA (CUDA), AMD (HIP), Vulkan, and SYCL
  • Compatible with dozens of model architectures via GGUF format
  • Both CLI client and OpenAI-compatible API server included

What It Can't Do

  • Models must be in GGUF format; older tools may not support latest specs. 2. Heavy quantization (< 3-bit) may noticeably degrade output quality. 3. First launch downloads the model from Hugging Face (requires internet).

Use Cases

  • Run local LLMs on personal laptops or edge devices without internet
  • Embed LLM inference into custom applications (desktop, mobile, server)
  • Batch text generation, translation, summarization with low cost

llama.cpp is a pure C/C++ implementation for running large language models (LLMs) on local devices. It requires no heavy frameworks (PyTorch, TensorFlow) and works out‑of‑the‑box on Apple Silicon, x86 (AVX/AVX2/AVX512), RISC‑V, NVIDIA (CUDA), AMD (HIP), and Intel/AMD GPUs (Vulkan, SYCL). Key innovation: ultra‑efficient integer quantization from 1.5‑bit to 8‑bit, drastically reducing memory usage while retaining acceptable accuracy. It supports dozens of architectures (LLaMA, Mistral, Qwen, Gemma, DeepSeek, etc.) and provides both a CLI (`llama-cli`) and an OpenAI‑compatible API server (`llama-server`). Compared to Ollama or LM Studio, llama.cpp is more stripped‑down – no background daemon, no rigid UI – making it perfect for developers who want to integrate LLM inference into their own applications or scripts.

Tags

llminferencec++quantizationggufapple-silicongpulocal-ai

Getting Started

1

Download installer

Click the button above to download the installer for your system

2

Install the software

Double-click the downloaded installer and follow the prompts

3

Download a prebuilt binary from GitHub Releases or install via brew/nix/winget

4

Obtain a GGUF model file (e.g., `ggml-org/gemma-3-1b-it-GGUF` from Hugging Face)

5

Run `llama-cli -m model.gguf` to chat, or `llama-server -m model.gguf` to start an OpenAI-compatible API

Install Guide
  1. Download a prebuilt binary from GitHub Releases or install via brew/nix/winget
  2. Obtain a GGUF model file (e.g., `ggml-org/gemma-3-1b-it-GGUF` from Hugging Face)
  3. Run `llama-cli -m model.gguf` to chat, or `llama-server -m model.gguf` to start an OpenAI-compatible API
File Integrity

SHA-256 checksum verified

Checksum extracted from GitHub official Release page

SHA256 Checksum

f96935e7e385e3b2d0189239077c10fe8fd7e95690fea4afec455b1b6c7e3f18

This checksum is extracted from the GitHub Release page. Verify file integrity after download.

All SHA-256 checksums on this platform are extracted from the project's official GitHub Release page, without any modification. You can independently verify them on the GitHub Releases page.

Open Source Transparency

View GitHub Source
Environment Guide

Uninstall Info

If installed via brew: `brew uninstall llama.cpp`. Via nix: `nix profile remove llama.cpp`. For manual install, delete the executable and `~/.cache/llama.cpp`.

No Extra Dependencies

Ready to use after download. No additional runtime required.

Project Info
LicenseMIT
Last Updated2026-06-26 07:00:33
GitHub RepositoryOfficial Website

Having issues? Check the FAQ below

4 FAQs

Similar Projects