LMCache
A vendor-neutral KV cache management layer that accelerates LLM inference by making KV caches persistent, reusable, and observable across engines, reducing TTFT and improving throughput.
v0.4.7 · 12.7 MB
A vendor-neutral KV cache layer that makes LLM caches persistent, reusable, and observable, speeding up inference across engines.
Core Features
- Engine-independent deployment: KV cache survives inference engine crashes, with no fate-sharing
- Tiered offloading and reuse: Offload KV cache to CPU, local disk, Redis, etc., and reuse across requests and sessions
- Non-prefix KV reuse: Reuse cached blocks at arbitrary positions via CacheBlend, beyond prefix caching limitations
- Production observability: Exposes Kubernetes metrics, token-level cache hits, request-level performance
- Pluggable backends: Supports CPU, SSD, Redis/Valkey, S3, Mooncake, GDS, and RDMA/TCP transport
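The tiered offloading idea in the list above can be illustrated with a toy two-tier store. This is only a sketch of the general technique (least-recently-used demotion to a slower tier, promotion on hit); the class and method names are hypothetical and are not LMCache's API, and plain dictionaries stand in for CPU memory and disk/Redis.

```python
from collections import OrderedDict
from typing import Dict, Optional


class TieredKVStore:
    """Toy two-tier store: a small fast tier backed by a larger slow tier."""

    def __init__(self, fast_capacity: int) -> None:
        self.fast_capacity = fast_capacity
        # Stands in for CPU memory; ordered from least to most recently used.
        self.fast: "OrderedDict[str, bytes]" = OrderedDict()
        # Stands in for local disk or a remote store such as Redis.
        self.slow: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self.fast[key] = value
        self.fast.move_to_end(key)
        # Demote least recently used entries instead of discarding them.
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value

    def get(self, key: str) -> Optional[bytes]:
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key]
        if key in self.slow:
            # Promote on hit so hot entries migrate back to the fast tier.
            value = self.slow.pop(key)
            self.put(key, value)
            return value
        return None
```

A real backend would also bound the slow tier and move tensors rather than byte strings; the point here is only that eviction from the fast tier becomes a demotion rather than a loss.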
What It Can't Do
- Default backend uses CPU memory; ensure sufficient RAM for your workload
- Non-prefix reuse (CacheBlend) may increase computational overhead; evaluate for your use case
- Uninstallation does not delete cached data; manually remove storage directories if needed
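To judge whether RAM is sufficient for the CPU backend, a rough estimate of KV cache size per token helps. The sketch below uses the standard formula for a transformer with one key and one value tensor per layer; the model dimensions in the usage note are illustrative assumptions, not values taken from LMCache, and real deployments add metadata and allocator overhead on top.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # The factor 2 covers one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def tokens_that_fit(ram_gib: float, bytes_per_token: int) -> int:
    """Upper bound on cached tokens for a given RAM budget in GiB."""
    return int(ram_gib * 1024 ** 3) // bytes_per_token
```

For example, with 32 layers, 8 KV heads, a head dimension of 128, and 2-byte (fp16/bf16) values, each token takes 131072 bytes (128 KiB), so a 16 GiB budget holds at most 131072 tokens.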
Use Cases
- Multi-turn agentic workloads and long-context conversations, reducing repeated prefill computation
- Retrieval-Augmented Generation (RAG) systems, caching the KV state of retrieved documents to lower TTFT
Detailed Introduction
LMCache is a KV cache management layer that speeds up LLM inference by turning otherwise temporary KV caches into persistent, reusable state. Unlike the built-in KV cache in engines such as vLLM or SGLang, LMCache runs independently of the engine, so cached entries can be reused across crashes, sessions, and multiple serving engines. It supports tiered offloading (CPU, SSD, Redis, etc.), non-prefix KV reuse via CacheBlend, and production-level observability (Kubernetes metrics, token-level cache hits). With pluggable storage backends and multi-node P2P sharing, it reduces time-to-first-token (TTFT) and improves throughput for long-context agentic tasks, multi-turn conversations, and RAG workloads. Because the design is vendor-neutral, you can switch inference engines or storage vendors without losing cached data.
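To see why plain prefix caching has the limitation that non-prefix reuse is meant to address, consider a minimal sketch of chunked prefix keys. It assumes a scheme in which each chunk's key is a hash of every token up to the end of that chunk; the function names and chunk size are hypothetical and this is not presented as LMCache's actual keying scheme.

```python
import hashlib
from typing import List, Sequence, Set


def prefix_chunk_keys(tokens: Sequence[int], chunk_size: int = 4) -> List[str]:
    """One key per full chunk; each key depends on all tokens before it."""
    keys: List[str] = []
    running = hashlib.sha256()
    full_length = len(tokens) - len(tokens) % chunk_size  # drop the partial tail
    for start in range(0, full_length, chunk_size):
        chunk = tokens[start:start + chunk_size]
        running.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(running.copy().hexdigest())
    return keys


def matched_prefix_tokens(cached_keys: Set[str], tokens: Sequence[int],
                          chunk_size: int = 4) -> int:
    """Number of leading tokens whose KV could be reused from the cache."""
    hit = 0
    for key in prefix_chunk_keys(tokens, chunk_size):
        if key not in cached_keys:
            break  # the first miss ends the reusable prefix
        hit += chunk_size
    return hit
```

Under this scheme, a request that shares its first chunk with a cached request reuses that chunk, but changing even the first token invalidates every later key, even if the rest of the text is identical. Reusing blocks at arbitrary positions requires a different mechanism, which is the role the page attributes to CacheBlend.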
Getting Started
Install the software
- Install via pip: pip install lmcache
- Configure a storage backend (e.g., CPU memory) and enable the LMCache plugin in your serving engine (vLLM, SGLang)
- Start the inference service; LMCache manages the KV cache transparently
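The steps above might look roughly like the following for vLLM with CPU offloading. Treat this as a sketch under assumptions: the environment variable names and the `LMCacheConnectorV1` connector are as recalled from the LMCache and vLLM quickstart material and may differ in the version you install, the helper function names are hypothetical, and the default values are placeholders. Check the documentation for your installed versions before relying on any of it.

```python
import os
from typing import Dict


def lmcache_cpu_env(chunk_size: int = 256, max_cpu_gib: float = 5.0) -> Dict[str, str]:
    """Environment settings for a CPU-memory backend (names are assumptions)."""
    return {
        "LMCACHE_CHUNK_SIZE": str(chunk_size),           # tokens per cached chunk
        "LMCACHE_LOCAL_CPU": "True",                     # enable the CPU backend
        "LMCACHE_MAX_LOCAL_CPU_SIZE": str(max_cpu_gib),  # CPU budget in GiB
    }


def start_vllm_with_lmcache(model: str):
    """Start a vLLM engine with the LMCache connector enabled (assumed API)."""
    # Imported lazily so this file still loads where vLLM is not installed.
    from vllm import LLM
    from vllm.config import KVTransferConfig

    os.environ.update(lmcache_cpu_env())
    transfer_config = KVTransferConfig(
        kv_connector="LMCacheConnectorV1",  # assumed connector name
        kv_role="kv_both",                  # both store and load KV
    )
    return LLM(model=model, kv_transfer_config=transfer_config)
```

Setting the environment before the engine is created matters in this sketch, because the connector is expected to read its configuration at startup.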
SHA-256 checksum verified
Checksum extracted from GitHub official Release page
SHA256 Checksum
a8d251fa10e8e8e0df91eeef056d473929f38ac7ad8d771c6fbe656da228ca89
This checksum is extracted from the GitHub Release page. Verify file integrity after download.
All SHA-256 checksums on this platform are extracted from the project's official GitHub Release page, without any modification. You can independently verify them on the GitHub Releases page.
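One way to verify the downloaded file against the published checksum is with Python's standard `hashlib` module. This is a minimal sketch; the function names are illustrative, and the file path you pass in depends on where your download was saved.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_bytes: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_bytes), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_checksum(path: str, expected_hex: str) -> bool:
    """True if the file's SHA-256 equals the published hex digest."""
    expected = expected_hex.strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)
```

Call `matches_checksum` with the path to the downloaded file and the checksum string shown above; a `False` result means the file differs from the one published on the release page and should not be installed.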
Open Source Transparency
View GitHub Source
Uninstall Info
Run pip uninstall lmcache. Manually remove any configuration files if present.
No Extra Dependencies
Ready to use after download. No additional runtime required.
Similar Projects
LocalAI
LocalAI is the open-source AI engine to run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required. Drop-in API compatibility with OpenAI, Anthropic, and ElevenLabs.
daily_stock_analysis
An open-source AI stock analysis system for A/H/US markets that generates daily decision dashboards and pushes them to WeChat Work, Feishu, Telegram, Discord, Slack, or email. Deploy via GitHub Actions for free.
ollama
Ollama lets you download, run, and manage large language models locally. One command, multiple platforms, endless possibilities.