LMCache

Name: LMCache
Author: LMCache

8.7k stars · AI Productivity · SHA-256 checksum verified

A vendor-neutral KV cache management layer that accelerates LLM inference by making KV caches persistent, reusable, and observable across engines, reducing TTFT and improving throughput.

Version: v0.4.7 · 12.7 MB

Core Features

Engine-independent deployment: KV cache survives inference engine crashes, with no fate-sharing
Tiered offloading and reuse: Offload KV cache to CPU, local disk, Redis, etc., and reuse across requests and sessions
Non-prefix KV reuse: Reuse cached blocks at arbitrary positions via CacheBlend, beyond prefix caching limitations
Production observability: Exposes Kubernetes metrics, token-level cache hit counts, and request-level performance data
Pluggable backends: Supports CPU, SSD, Redis/Valkey, S3, Mooncake, GDS, and RDMA/TCP transport

What It Can't Do

• The default backend uses CPU memory; make sure the host has enough RAM for your workload
• Non-prefix reuse (CacheBlend) may increase computational overhead; evaluate it for your use case
• Uninstalling does not delete cached data; remove the storage directories manually if needed
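
A rough per-token estimate helps when sizing RAM for the CPU backend. The sketch below uses the standard KV cache size formula (one key tensor and one value tensor per layer); the model dimensions are illustrative assumptions, not values taken from LMCache, and real usage also depends on the engine and backend overhead.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: K and V tensors for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_gib(num_tokens: int, **model_dims) -> float:
    """Total KV cache size in GiB for num_tokens cached tokens."""
    return num_tokens * kv_bytes_per_token(**model_dims) / 2**30


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, fp16 (2 bytes).
dims = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)
per_token = kv_bytes_per_token(**dims)   # 131072 bytes, i.e. 128 KiB per token
total = kv_gib(1_000_000, **dims)        # roughly 122 GiB for one million tokens
print(per_token, round(total, 1))
```

Under these assumed dimensions, caching one million tokens would need about 122 GiB, which is why the RAM budget should be checked before relying on the default backend.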

Use Cases

Multi-turn agentic workloads and long-context conversations, reducing repeated prefill computation
Retrieval-Augmented Generation (RAG) systems, caching the KV data of retrieved documents to lower TTFT

Detailed Introduction

LMCache is a KV cache management layer that speeds up LLM inference by turning the temporary KV caches an engine produces into persistent, reusable state. Unlike the built-in KV caches of engines such as vLLM or SGLang, LMCache runs independently of the engine, so cached data can be reused across crashes, sessions, and multiple serving engines. It supports tiered offloading (CPU, SSD, Redis, etc.), non-prefix KV reuse via CacheBlend, and production-level observability (Kubernetes metrics, token-level cache hits). With pluggable storage backends and multi-node P2P sharing, it reduces time-to-first-token (TTFT) and improves throughput for long-context agentic tasks, multi-turn conversations, and RAG workloads. Because the design is vendor-neutral, you can switch inference engines or storage vendors without losing cached data.
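
The idea of tiered offloading with prefix reuse can be illustrated with a toy model. The sketch below is not LMCache's implementation: the `TieredStore` class, the helper functions, the chunk size, and the string placeholders standing in for KV tensors are all hypothetical. It only shows how chunk keys derived from a token prefix let a later request find cached chunks, and how an evicted chunk moves to a colder tier instead of being dropped.

```python
import hashlib
from collections import OrderedDict


def chunk_keys(tokens, chunk_size=4):
    """Yield one key per full chunk. Each key hashes the whole prefix up to
    and including that chunk, so equal keys imply equal prefixes."""
    h = hashlib.sha256()
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        h.update(repr(tokens[start:start + chunk_size]).encode())
        yield h.hexdigest()


class TieredStore:
    """Toy two-tier store: a small LRU hot tier that spills to a cold tier."""

    def __init__(self, hot_capacity):
        self.hot = OrderedDict()
        self.cold = {}
        self.hot_capacity = hot_capacity

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            old_key, old_val = self.hot.popitem(last=False)  # least recently used
            self.cold[old_key] = old_val                      # offload, do not drop

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        if key in self.cold:
            value = self.cold.pop(key)
            self.put(key, value)  # promote back to the hot tier
            return value
        return None


def store_prefix(store, tokens, chunk_size=4):
    for i, key in enumerate(chunk_keys(tokens, chunk_size)):
        store.put(key, f"kv-chunk-{i}")  # placeholder for real KV tensors


def cached_prefix_tokens(store, tokens, chunk_size=4):
    """Count leading tokens whose KV chunks are already stored."""
    hit = 0
    for key in chunk_keys(tokens, chunk_size):
        if store.get(key) is None:
            break
        hit += chunk_size
    return hit


store = TieredStore(hot_capacity=2)
store_prefix(store, list(range(12)))          # 3 chunks; one spills to the cold tier
second = list(range(8)) + [99, 98, 97, 96]    # shares the first 8 tokens
print(cached_prefix_tokens(store, second))    # 8
```

In this toy run the second request reuses two of its three chunks, one of which is served from the cold tier, so only the last four tokens would need a fresh prefill.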

Getting Started

Install via pip: pip install lmcache

Configure a storage backend (e.g., CPU memory) and enable the LMCache plugin in your serving engine (vLLM, SGLang)

Start the inference service; LMCache then manages the KV cache transparently
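
For vLLM, the configuration and launch steps might look like the sketch below, which only builds and prints a launch command rather than running it. The `--kv-transfer-config` flag, the `LMCacheConnectorV1` connector name, and the `LMCACHE_*` environment variables are assumptions based on commonly documented LMCache and vLLM usage and may differ between versions; the model name is a placeholder. Check the documentation for the versions you have installed.

```python
import json
import shlex

# Assumed LMCache settings for a CPU-memory backend (names may vary by version).
lmcache_env = {
    "LMCACHE_CHUNK_SIZE": "256",          # tokens per cached chunk
    "LMCACHE_LOCAL_CPU": "True",          # keep KV cache in CPU memory
    "LMCACHE_MAX_LOCAL_CPU_SIZE": "5.0",  # CPU cache budget in GB
}

# Assumed vLLM option that routes KV cache through the LMCache connector.
kv_transfer_config = {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}

model = "your-org/your-model"  # placeholder, replace with a real model id
argv = ["vllm", "serve", model,
        "--kv-transfer-config", json.dumps(kv_transfer_config)]

# Render a shell-safe command line with the environment variables in front.
env_prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in lmcache_env.items())
command = f"{env_prefix} {shlex.join(argv)}"
print(command)
```

Building the command in a script keeps the backend settings in one place, which makes it easier to switch to another storage backend later.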

File Integrity

SHA-256 Checksum

a8d251fa10e8e8e0df91eeef056d473929f38ac7ad8d771c6fbe656da228ca89

This checksum is extracted from the GitHub Release page. Verify file integrity after download.

All SHA-256 checksums on this platform are extracted from the project's official GitHub Release page, without any modification. You can independently verify them on the GitHub Releases page.
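
Independent verification can be done with a few lines of standard-library Python. This is a minimal sketch: the expected value is the checksum listed above, while the function names and the file path in the usage note are placeholders.

```python
import hashlib
import hmac

EXPECTED_SHA256 = "a8d251fa10e8e8e0df91eeef056d473929f38ac7ad8d771c6fbe656da228ca89"


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so large downloads are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA-256 matches the published checksum."""
    return hmac.compare_digest(sha256_of(path), expected.lower())
```

Call `verify("path/to/downloaded-file")` with the actual location of the download; a result of `False` means the file differs from the published release and should not be installed.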

Environment Guide

Uninstall Info

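Run pip uninstall lmcache. Uninstalling does not delete cached data or configuration files; remove the storage directories and any configuration files manually if they are no longer needed.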

Dependencies

LMCache is installed as a Python package, so it needs a Python environment with pip and a supported serving engine (vLLM, SGLang).

Project Info

License: Apache-2.0

Last Updated: 2026-06-13T06:25:29Z

Similar Projects

LocalAI

LocalAI is the open-source AI engine to run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required. Drop-in API compatibility with OpenAI, Anthropic, and ElevenLabs.

daily_stock_analysis

An open-source AI stock analysis system for A/H/US markets that generates daily decision dashboards and pushes them to WeChat Work, Feishu, Telegram, Discord, Slack, or email. Deploy via GitHub Actions for free.

ollama

Ollama lets you download, run, and manage large language models locally. One command, multiple platforms, endless possibilities.