LMCache

8.7k stars · AI Productivity · SHA-256 checksum verified

A vendor-neutral KV cache management layer that accelerates LLM inference by making KV caches persistent, reusable, and observable across engines, reducing TTFT and improving throughput.

Core Features

  • Engine-independent deployment: KV cache survives inference engine crashes, with no fate-sharing
  • Tiered offloading and reuse: Offload KV cache to CPU, local disk, Redis, etc., and reuse across requests and sessions
  • Non-prefix KV reuse: Reuse cached blocks at arbitrary positions via CacheBlend, beyond prefix caching limitations
  • Production observability: Exposes Kubernetes metrics, token-level cache hits, and request-level performance
  • Pluggable backends: Supports CPU, SSD, Redis/Valkey, S3, Mooncake, GDS, and RDMA/TCP transport
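
To make the difference between prefix caching and non-prefix reuse concrete, here is a minimal Python sketch of chunked prefix keys. It is an illustration only, not LMCache's actual code: the chunk size, the key scheme, and the function names are hypothetical, and CacheBlend itself is not implemented here.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical; real deployments use much larger chunks


def chunk_keys(tokens: list[int], chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Return one key per full chunk; each key depends on all preceding tokens."""
    keys = []
    prev = ""
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        payload = prev + "|" + ",".join(map(str, chunk))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(prev)
    return keys


def reusable_prefix_tokens(tokens: list[int], store: set[str],
                           chunk_size: int = CHUNK_SIZE) -> int:
    """Count leading tokens whose chunk keys are already in the store."""
    hits = 0
    for key in chunk_keys(tokens, chunk_size):
        if key not in store:
            break  # prefix matching stops at the first miss
        hits += 1
    return hits * chunk_size


cached = set(chunk_keys(list(range(12))))
# Same prefix plus new tokens: all three cached chunks are reusable.
print(reusable_prefix_tokens(list(range(12)) + [99, 100], cached))
# One changed token at the start: every chained key changes, nothing is reusable.
print(reusable_prefix_tokens([7] + list(range(1, 12)), cached))
```

Because each key is chained to everything before it, a single change early in the prompt invalidates all later chunks. Non-prefix reuse targets that limitation.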

What It Can't Do

  • Default backend uses CPU memory; ensure sufficient RAM for your workload
  • Non-prefix reuse (CacheBlend) may increase computational overhead; evaluate for your use case
  • Uninstallation does not delete cached data; manually remove storage directories if needed
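
For the RAM sizing point above, a rough estimate can be worked out from the model shape. The sketch below uses the standard per-token KV size for a transformer with grouped KV heads; the model dimensions are hypothetical placeholders, so substitute the values from your own model's configuration.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_element: int) -> int:
    """Keys and values (factor 2) for every layer and KV head, for one token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_bytes_total(tokens: int, **shape: int) -> int:
    """Total KV cache size for the given number of cached tokens."""
    return tokens * kv_bytes_per_token(**shape)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, fp16 (2 bytes).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2)

per_token = kv_bytes_per_token(**shape)       # 131072 bytes = 128 KiB
total = kv_bytes_total(32 * 1024, **shape)    # 32k cached tokens
print(f"{per_token / 1024:.0f} KiB per token, {total / 1024**3:.1f} GiB for 32k tokens")
```

Under these assumptions a single 32k-token context occupies about 4 GiB, so the CPU backend's memory budget should be set with the expected number of concurrently cached contexts in mind.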

Use Cases

  • Multi-turn agentic workloads and long-context conversations, reducing repeated prefill computation
  • Retrieval-Augmented Generation (RAG) systems, reusing the KV cache of retrieved documents to lower TTFT

Detailed Introduction

LMCache is a KV cache management layer that speeds up LLM inference by turning temporary KV caches into persistent, reusable state. Unlike the built-in KV caches in frameworks such as vLLM or SGLang, LMCache runs independently of the engine, so caches can be reused across crashes, sessions, and multiple serving engines. It supports tiered offloading (CPU, SSD, Redis, etc.), non-prefix KV reuse via CacheBlend, and production-level observability (Kubernetes metrics, token-level cache hits). With pluggable storage backends and multi-node P2P sharing, it reduces time-to-first-token (TTFT) and improves throughput for long-context agentic tasks, multi-turn conversations, and RAG workloads. Its vendor-neutral design lets you switch inference engines and storage vendors without losing cached data.
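
The tiered offloading idea can be sketched as a small two-level cache. This is a conceptual illustration under simple assumptions (LRU eviction, in-memory dictionaries standing in for CPU and disk), not a description of LMCache's internals; the class and tier names are hypothetical.

```python
from collections import OrderedDict


class Tier:
    """One storage level with a fixed capacity, kept in LRU order."""

    def __init__(self, name: str, capacity: int):
        self.name = name
        self.capacity = capacity
        self.items = OrderedDict()


class TieredCache:
    """Tiers are ordered fastest first; evictions spill to the next tier."""

    def __init__(self, tiers: list[Tier]):
        self.tiers = tiers

    def put(self, key, value, level: int = 0) -> None:
        if level >= len(self.tiers):
            return  # fell off the slowest tier: the entry is dropped
        tier = self.tiers[level]
        tier.items[key] = value
        tier.items.move_to_end(key)
        while len(tier.items) > tier.capacity:
            old_key, old_value = tier.items.popitem(last=False)
            self.put(old_key, old_value, level + 1)  # offload the LRU entry

    def get(self, key):
        """Return (value, tier_name) and promote the entry to the fastest tier."""
        for tier in self.tiers:
            if key in tier.items:
                value = tier.items.pop(key)
                self.put(key, value)
                return value, tier.name
        return None, None


cache = TieredCache([Tier("cpu", 2), Tier("disk", 4)])
for chunk_id in ("a", "b", "c"):
    cache.put(chunk_id, f"kv-{chunk_id}")
print(cache.get("a"))  # found in the slower tier, then promoted
print(cache.get("a"))  # now served from the fast tier
```

A real deployment would replace the dictionaries with actual backends (CPU memory, SSD, Redis, and so on) and move tensors rather than strings, but the lookup and spill order follows the same pattern.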

Tags

LLM · KV Cache · Inference Acceleration · Cache Management · AI Infrastructure

Getting Started

  1. Install via pip: pip install lmcache
  2. Configure a storage backend (e.g., CPU memory) and enable the LMCache plugin in your serving engine (vLLM or SGLang)
  3. Start the inference service; LMCache then manages the KV cache transparently
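
As a rough illustration of the configuration step with vLLM, the Python sketch below assembles environment variables and a launch command without running anything. The variable names (LMCACHE_LOCAL_CPU, LMCACHE_MAX_LOCAL_CPU_SIZE, LMCACHE_CHUNK_SIZE), the connector name LMCacheConnectorV1, and the --kv-transfer-config flag are assumptions not stated on this page; check them against the LMCache and vLLM documentation for your installed versions. The model name is a placeholder.

```python
import json
import shlex


def build_launch(model: str, cpu_cache_gb: float, chunk_size: int = 256):
    """Return (env, cmd) for a vLLM server with a CPU-memory LMCache backend.

    All setting names below are assumptions to verify against the docs.
    """
    env = {
        "LMCACHE_LOCAL_CPU": "True",                      # assumed: enable CPU backend
        "LMCACHE_MAX_LOCAL_CPU_SIZE": str(cpu_cache_gb),  # assumed: size in GB
        "LMCACHE_CHUNK_SIZE": str(chunk_size),            # assumed: tokens per chunk
    }
    kv_config = {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}
    cmd = ["vllm", "serve", model, "--kv-transfer-config", json.dumps(kv_config)]
    return env, cmd


env, cmd = build_launch("your-org/your-model", cpu_cache_gb=5.0)
# Print a shell-ready line for review instead of launching the server.
print(" ".join(f"{k}={v}" for k, v in env.items()), shlex.join(cmd))
```

Printing the command rather than executing it makes it easy to review the settings first; to launch from Python, the same values could be passed to subprocess.run with the environment merged into os.environ.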

File Integrity

SHA-256 checksum verified

Checksum extracted from the project's official GitHub Release page

SHA256 Checksum

a8d251fa10e8e8e0df91eeef056d473929f38ac7ad8d771c6fbe656da228ca89

Verify file integrity after download. All SHA-256 checksums on this platform are copied unmodified from each project's official GitHub Release page, where you can verify them independently.
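
One way to check a downloaded file against the published checksum is a short script using Python's standard hashlib module. This is a minimal sketch: the file path is a placeholder you supply, and the expected value is the checksum listed above.

```python
import hashlib
import hmac

# Checksum published on this page.
EXPECTED = "a8d251fa10e8e8e0df91eeef056d473929f38ac7ad8d771c6fbe656da228ca89"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB blocks so large downloads do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path: str, expected: str = EXPECTED) -> bool:
    """Compare the file's digest with the expected hex string."""
    return hmac.compare_digest(sha256_of(path), expected.strip().lower())
```

Calling matches("path/to/downloaded-file") returns True only when the file's digest equals the published value; on a mismatch, discard the file and download it again.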

Environment Guide

Uninstall Info

Run pip uninstall lmcache. Uninstalling does not delete cached data, so manually remove any storage directories and configuration files if present.

Dependencies

Installs with pip, so a Python environment is required. To be useful, LMCache also needs a supported serving engine (vLLM or SGLang).

Project Info
License: Apache-2.0
Last Updated: 2026-06-13T06:25:29Z