𝐍𝐞𝐰 𝐢𝐧 𝐋𝐌𝐂𝐚𝐜𝐡𝐞: 𝐋𝟐 𝐚𝐝𝐚𝐩𝐭𝐞𝐫 𝐛𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤 𝐂𝐋𝐈. You can now benchmark the throughput of an L2 cache adapter’s base operations (store / lookup / load) directly, without starting an inference engine or an LMCache MP server. The command only requires the adapter’s backing storage to be reachable, which makes it easier to test and compare L2 backends before plugging them into a full serving workflow. Try it with the L2 backend that best fits your workflow, whether that’s local filesystem, Redis, S3, or any other adapter. Read more and start testing: https://lnkd.in/gqu_ydcQ #AI #inference #LMCache #KVCache
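As a rough illustration of what benchmarking an adapter's three base operations in isolation can look like, here is a minimal sketch. It is not the LMCache CLI or its adapter API: `InMemoryAdapter`, `benchmark_adapter`, and every parameter below are hypothetical stand-ins, with a plain dict playing the role of the backing storage.

```python
import time
from typing import Callable, Dict, Optional


class InMemoryAdapter:
    """Hypothetical L2 adapter; a dict stands in for the real backing store."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def store(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def lookup(self, key: str) -> bool:
        return key in self._data

    def load(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


def benchmark_adapter(
    adapter: InMemoryAdapter, num_chunks: int = 256, chunk_bytes: int = 64 * 1024
) -> Dict[str, float]:
    """Time store / lookup / load over num_chunks keys; return operations per second."""
    payload = bytes(chunk_bytes)  # zero-filled dummy KV chunk
    keys = [f"chunk-{i}" for i in range(num_chunks)]
    results: Dict[str, float] = {}

    def timed(name: str, fn: Callable[[str], object]) -> None:
        start = time.perf_counter()
        for key in keys:
            fn(key)
        # Guard against a zero interval on very coarse clocks.
        elapsed = max(time.perf_counter() - start, 1e-9)
        results[name] = num_chunks / elapsed

    timed("store", lambda k: adapter.store(k, payload))
    timed("lookup", adapter.lookup)
    timed("load", adapter.load)
    return results


if __name__ == "__main__":
    for op, ops_per_s in benchmark_adapter(InMemoryAdapter()).items():
        print(f"{op:>6}: {ops_per_s:,.0f} ops/s")
```

Swapping `InMemoryAdapter` for a class that talks to a filesystem, Redis, or S3 would give the same three numbers per backend, which is the kind of side-by-side comparison the post describes.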
LMCache Lab
Software Development
Chicago, IL 3,984 followers
Open-Source for 10X better LLM inference w. vLLM Production Stack + LMCache
About us
Open-source large-scale LLM serving solutions to democratize LLM Inference.
- Website: https://github.com/LMCache/LMCache
- Industry: Software Development
- Company size: 201-500 employees
- Headquarters: Chicago, IL
- Type: Nonprofit
Updates
-
Quantization delays HBM pressure, but does not remove it. KV cache offloading still matters! AMD ATOM + LMCache CPU offload showed 2.4× lower median TTFT and 59% more completed requests than HBM-only serving on real multi-turn agentic workloads. https://lnkd.in/gW8Te8CQ
When your inference baseline is already optimized, does KV cache offloading still help?
We tested this with AMD ATOM (the high-performance vLLM plugin built on AITER kernels) combined with LMCache on MI300X. The workload: 739 real Claude Code agentic traces against MiniMax-M2.5 (456B MoE, FP8) on 2× MI300X.
The short answer: yes. Even with ATOM's FP8 KV cache halving memory usage, HBM still fills up under real agentic workloads.
Key findings:
→ ATOM + LMCache CPU: 2.4× lower median TTFT, 59% more completed requests vs HBM-only (32 users, 100K context)
→ Adding NVMe as L3: p95 TTFT drops another 41%, and the long tail compresses dramatically
→ Backend matters: the default LMCache PyPI wheel uses a Python fallback on ROCm that made it 1.7× slower than baseline. Source-building with BUILD_WITH_HIP=1 is mandatory.
This is a sequel to our previous LMCache benchmark (vanilla vLLM). The progression (stock vLLM → ATOM → ATOM + CPU offload → ATOM + CPU + NVMe) shows that each tier unlocks capacity the previous one couldn't reach.
Full writeup with architecture, configs, and charts: https://lnkd.in/gNZDiuFn
#AMD #ROCm #vLLM #KVCache #LMCache #ATOM #AgenticAI
-
-
Congrats to Tensormesh for the funding! Tensormesh is among the major contributors to #LMCache. The investment from CoreWeave, NVIDIA and AMD (among others) testifies to the important role LMCache plays in AI infra today and tomorrow. BTW, Tensormesh is hiring engineers (full-time, part-time or spare-time) to work on LMCache! Shoot an email to hiring@tensormesh.ai if you are interested.
Today, we’re excited to announce that Tensormesh has raised $20M in new funding from investors including AMD Ventures, CoreWeave, NVentures (NVIDIA), Valley Capital Partners, and Laude Ventures, bringing our total funding to $24.5M. Alongside this milestone, Tensormesh Inference is now generally available. As AI applications move into production, inference costs are becoming harder to ignore. Agentic workflows repeatedly reprocess the same prompts, context, conversation history, and tool definitions, driving up API costs on work that has already been done. Tensormesh helps eliminate that waste with caching-accelerated inference. Built on the team’s work behind LMCache Lab, Tensormesh Inference helps AI application teams reuse computed KV cache state, reducing redundant computation, improving latency, and lowering API costs by up to 10x. We’re also introducing $0 cached input tokens across all Tensormesh serverless deployments, so teams only pay when input tokens need to be processed, not when they can be served from cache. We’re grateful to our investors, customers, advisors, and open-source community for supporting our mission to make AI inference faster, more efficient, and more transparent. Read the full announcement to learn how Tensormesh is redefining the economics of AI inference. Press Release: https://lnkd.in/gCzhDnVR
-
-
𝐂𝐚𝐥𝐥𝐢𝐧𝐠 𝐚𝐥𝐥 𝐧𝐨𝐧-𝐂𝐔𝐃𝐀 𝐮𝐬𝐞𝐫𝐬: 𝐋𝐌𝐂𝐚𝐜𝐡𝐞 𝐌𝐏 𝐦𝐨𝐝𝐞 𝐧𝐨𝐰 𝐫𝐞𝐚𝐜𝐡𝐞𝐬 𝐛𝐞𝐲𝐨𝐧𝐝 𝐂𝐔𝐃𝐀! On non-CUDA devices, LMCache MP can now use ZMQ (instead of CUDA IPC) to send the KV bytes.
LMCache MP mode normally moves KV data through CUDA IPC, which is not available on non-CUDA devices. To remove that limitation, community contributor 𝐡𝐥𝐢𝐧𝟗𝟗 added a 𝐧𝐨𝐧-𝐂𝐔𝐃𝐀 transfer path for CPU, XPU, HPU, and other non-CUDA environments. On these devices, the worker sends the actual KV bytes over the message queue instead:
𝑔𝑎𝑡ℎ𝑒𝑟 𝑝𝑎𝑔𝑒𝑑 𝐾𝑉 -> 𝐶𝑃𝑈 𝑐ℎ𝑢𝑛𝑘𝑠 -> 𝑠𝑒𝑟𝑖𝑎𝑙𝑖𝑧𝑒 𝑤𝑖𝑡ℎ 𝑝𝑖𝑐𝑘𝑙𝑒 -> 𝑠𝑒𝑛𝑑 𝑏𝑦𝑡𝑒𝑠 𝑜𝑣𝑒𝑟 𝑍𝑀𝑄 -> 𝑑𝑒𝑠𝑒𝑟𝑖𝑎𝑙𝑖𝑧𝑒 𝑜𝑛 𝑡ℎ𝑒 𝑠𝑒𝑟𝑣𝑒𝑟 -> 𝑤𝑟𝑖𝑡𝑒 𝑡𝑜 𝐿1
On CUDA devices, LMCache continues to use the existing CUDA IPC path, where the worker sends a lightweight handle and the server reads the worker’s GPU memory directly:
𝑤𝑜𝑟𝑘𝑒𝑟 𝑝𝑎𝑔𝑒𝑑 𝐾𝑉 (𝐺𝑃𝑈) -> 𝐿𝑀𝐶𝑎𝑐ℎ𝑒 𝑟𝑒𝑎𝑑𝑠 𝑣𝑖𝑎 𝐶𝑈𝐷𝐴 𝐼𝑃𝐶 -> 𝐺𝑃𝑈 𝑠𝑡𝑎𝑔𝑖𝑛𝑔 𝑏𝑢𝑓𝑓𝑒𝑟 -> 𝐿1 𝑐𝑎𝑐ℎ𝑒 (𝐶𝑃𝑈 𝑅𝐴𝑀)
In both paths, ZMQ serves as the control channel and carries messages such as REGISTER, PREPARE_STORE, and COMMIT_STORE. Compared with the CUDA path, the non-CUDA path adds two CPU-side copies, but 𝐞𝐱𝐭𝐞𝐧𝐝𝐬 𝐌𝐏 𝐦𝐨𝐝𝐞 𝐭𝐨 𝐧𝐨𝐧-𝐂𝐔𝐃𝐀 𝐞𝐧𝐯𝐢𝐫𝐨𝐧𝐦𝐞𝐧𝐭𝐬.
#AI #inference #LMCache #KVCache
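The non-CUDA flow described in this post (gather, serialize, send, deserialize, write to L1) can be sketched with the standard library alone. This is a toy model under loud assumptions, not LMCache code: a `queue.Queue` stands in for the ZMQ socket, the `KV_BYTES` message type and all function names are hypothetical, and pages are a few bytes instead of real KV tensors.

```python
import pickle
import queue
from typing import Dict, List


def gather_paged_kv(paged_kv: Dict[int, bytes], page_ids: List[int]) -> bytes:
    """Gather the (possibly non-contiguous) pages of one chunk into one CPU buffer."""
    return b"".join(paged_kv[page_id] for page_id in page_ids)


def worker_send(
    channel: "queue.Queue[bytes]",
    paged_kv: Dict[int, bytes],
    chunk_key: str,
    page_ids: List[int],
) -> None:
    """Worker side: gather pages, pickle the chunk, push the bytes onto the channel."""
    chunk = gather_paged_kv(paged_kv, page_ids)
    # "KV_BYTES" is a made-up message type used only in this sketch.
    message = pickle.dumps({"type": "KV_BYTES", "key": chunk_key, "kv": chunk})
    channel.put(message)  # a real deployment would send this over ZMQ


def server_receive(channel: "queue.Queue[bytes]", l1_cache: Dict[str, bytes]) -> None:
    """Server side: pop one message, unpickle it, write the chunk into L1."""
    # Unpickling is only safe when the sender is trusted, as a local worker is here.
    message = pickle.loads(channel.get())
    if message["type"] == "KV_BYTES":
        l1_cache[message["key"]] = message["kv"]


if __name__ == "__main__":
    pages = {0: b"aaaa", 1: b"bbbb", 2: b"cccc"}
    channel: "queue.Queue[bytes]" = queue.Queue()
    l1: Dict[str, bytes] = {}
    worker_send(channel, pages, "chunk-0", [2, 0])
    server_receive(channel, l1)
    print(l1)
```

The contrast with the CUDA path is that the payload itself travels through the channel here, whereas with CUDA IPC only a small handle would be sent.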
-
-
New blog: 𝐖𝐡𝐞𝐧 𝐎𝐩𝐞𝐧 𝐒𝐨𝐮𝐫𝐜𝐞 𝐌𝐞𝐞𝐭𝐬 𝐎𝐩𝐞𝐧 𝐒𝐨𝐮𝐫𝐜𝐞 — 𝐀 𝐉𝐨𝐢𝐧𝐭 𝐄𝐟𝐟𝐨𝐫𝐭 𝐁𝐞𝐭𝐰𝐞𝐞𝐧 𝐋𝐌𝐂𝐚𝐜𝐡𝐞 𝐚𝐧𝐝 𝐌𝐨𝐨𝐧𝐜𝐚𝐤𝐞 The story starts with the LMCache community building the foundation: the native connector framework, dynamic plugin loading, and the MooncakeStore L2 plugin path for MP mode. The Mooncake community then helped optimize the RDMA path step by step, adding L1 memory preregistration, batch operations, and dedicated worker lanes for different cache operations. Under Mooncake RDMA, 𝐭𝐡𝐢𝐬 𝐰𝐨𝐫𝐤𝐞𝐫-𝐥𝐚𝐧𝐞 𝐝𝐞𝐬𝐢𝐠𝐧 𝐫𝐞𝐝𝐮𝐜𝐞𝐝 𝐥𝐨𝐨𝐤𝐮𝐩 𝐩𝟗𝟗 𝐟𝐫𝐨𝐦 𝟏𝟔.𝟖 𝐦𝐬 𝐭𝐨 𝟎.𝟒𝟖 𝐦𝐬! This was not a one-sided integration. LMCache brought the MP framework and native connector abstraction and Mooncake brought deep storage and RDMA expertise. Together, the two communities built a stronger L2 KV cache integration for distributed LLM inference systems. Huge thanks to maobaolong, fangchizheng, chunxiaozheng, and everyone in both communities who helped make this happen! Read the full story in English and Chinese: https://lnkd.in/gXuXmND3 https://lnkd.in/gSn_3KqH #AI #inference #LMCache #KVCache
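The post does not describe how the worker lanes are implemented, so the following is only a sketch of the general idea: give each cache operation type its own small pool so that a fast lookup never queues behind a slow store. `LaneDispatcher` and `demo` are hypothetical names, and thread pools are an assumption, not the Mooncake or LMCache design.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class LaneDispatcher:
    """Route each cache operation type to its own thread pool (a 'lane')."""

    def __init__(self, workers_per_lane: Dict[str, int]) -> None:
        self._lanes = {
            op: ThreadPoolExecutor(max_workers=n, thread_name_prefix=f"{op}-lane")
            for op, n in workers_per_lane.items()
        }

    def submit(self, op: str, fn: Callable, *args) -> Future:
        return self._lanes[op].submit(fn, *args)

    def shutdown(self) -> None:
        for lane in self._lanes.values():
            lane.shutdown(wait=True)


def demo() -> float:
    """Return how long a lookup takes while a stuck store occupies the store lane."""
    dispatcher = LaneDispatcher({"store": 1, "lookup": 1, "load": 1})
    release_store = threading.Event()
    # This store blocks until we release it, simulating a slow bulk write.
    store_future = dispatcher.submit("store", release_store.wait)

    start = time.perf_counter()
    lookup_future = dispatcher.submit("lookup", lambda key: key in {"chunk-0"}, "chunk-0")
    found = lookup_future.result(timeout=5)  # completes although the store is stuck
    latency = time.perf_counter() - start

    release_store.set()
    store_future.result(timeout=5)
    dispatcher.shutdown()
    assert found
    return latency


if __name__ == "__main__":
    print(f"lookup finished in {demo() * 1e3:.2f} ms while a store was in flight")
```

With a single shared one-worker pool, the same lookup would have waited for the store to finish, which is the head-of-line blocking that separate lanes avoid.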
-
PD Disaggregation unleashed! The new async PDBackend in LMCache is now much more efficient.
In Prefill-Decode Disaggregation, a single LLM request is split across two types of nodes. A prefill node reads the prompt and produces the KV cache, while a decode node consumes that KV cache to generate tokens. The KV cache needs to move from the prefill node to the decode node over the network, typically through RDMA. In LMCache, the component responsible for moving these KV chunks is called the PDBackend.
Before the asynchronous PDBackend, LMCache’s prefill workers sent KV cache chunks one at a time and waited for each transfer to finish before continuing. This worked for simple cases, but under chunked prefill, where a long prompt is split into multiple KV transfers, concurrent requests could deadlock.
The new fully asynchronous PDBackend moves KV transfer off the critical path. Instead of blocking on each network transfer, the prefill worker can hand off KV chunks in the background and continue processing the next prompt. On the receiver side, LMCache also reserves enough buffer space for the whole request before the transfer starts, so each admitted request has enough room to finish.
This update is a great effort from the LMCache community. As Prefill-Decode Disaggregation becomes more widely used, improvements like async PDBackend are essential for making KV cache transfer more reliable and scalable. Thank you to everyone in the LMCache community who helped shape, review, and harden this update!
#AI #inference #LMCache #KVCache
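The two ideas in this post, handing chunks to a background sender and reserving receiver space for the whole request up front, can be sketched with `asyncio`. Everything here is an illustrative assumption rather than the PDBackend implementation: the `Receiver` class, the `None` shutdown sentinel, the byte sizes, and the choice to reject (instead of queue) a request that does not fit.

```python
import asyncio
from typing import Dict, List, Optional, Tuple


class Receiver:
    """Hypothetical decode-side buffer: admit a request only if all of it fits."""

    def __init__(self, capacity_bytes: int) -> None:
        self.free = capacity_bytes
        self.chunks: Dict[str, List[bytes]] = {}

    def reserve(self, request_id: str, total_bytes: int) -> bool:
        if total_bytes > self.free:
            return False
        self.free -= total_bytes
        self.chunks[request_id] = []
        return True

    async def receive(self, request_id: str, chunk: bytes) -> None:
        await asyncio.sleep(0)  # stands in for the network transfer
        self.chunks[request_id].append(chunk)


async def transfer_loop(outbox: asyncio.Queue, receiver: Receiver) -> None:
    """Background sender: drain the outbox until the None sentinel arrives."""
    while True:
        item: Optional[Tuple[str, bytes]] = await outbox.get()
        if item is None:
            return
        await receiver.receive(*item)


async def prefill(
    request_id: str, chunks: List[bytes], outbox: asyncio.Queue, receiver: Receiver
) -> bool:
    """Reserve space for the whole request, then hand chunks off without waiting."""
    if not receiver.reserve(request_id, sum(len(c) for c in chunks)):
        return False  # not admitted; a real system might retry later
    for chunk in chunks:
        outbox.put_nowait((request_id, chunk))
    return True


async def main() -> Tuple[bool, bool, Dict[str, List[bytes]]]:
    receiver = Receiver(capacity_bytes=10)
    outbox: asyncio.Queue = asyncio.Queue()
    sender = asyncio.create_task(transfer_loop(outbox, receiver))
    admitted_a = await prefill("req-a", [b"aaaa", b"aa"], outbox, receiver)  # 6 bytes
    admitted_b = await prefill("req-b", [b"bbbbbb"], outbox, receiver)  # only 4 free
    outbox.put_nowait(None)
    await sender
    return admitted_a, admitted_b, receiver.chunks


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because `req-b` is never partially admitted, it cannot hold buffer space that `req-a` needs to finish, which is the property the up-front reservation is meant to provide.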
-
-
𝐓𝐡𝐞 𝐎𝐩𝐞𝐧𝐀𝐈-𝐜𝐨𝐦𝐩𝐚𝐭𝐢𝐛𝐥𝐞 𝐀𝐏𝐈 𝐢𝐬 𝐛𝐞𝐜𝐨𝐦𝐢𝐧𝐠 𝐭𝐡𝐞 𝐈𝐏𝐯𝟒 𝐨𝐟 𝐋𝐋𝐌 𝐬𝐲𝐬𝐭𝐞𝐦𝐬. ⏳ At the top: agents, RAG, chatbots, tools, and workflows. In the middle: the OpenAI-compatible API. Below: routing, batching, scheduling, KV cache, inference engines, and hardware. It's the familiar network-layering pattern, where IPv4 acts as the narrow waist between everything above and below it. This shared interface lets applications run across providers while inference backends optimize underneath. But the narrow waist also creates a blind spot. Once modern LLM applications cross the API boundary, much of their structure becomes just a sequence of tokens. That's why the next wave of LLM infrastructure may depend on what happens below the API: better scheduling, better cache reuse, and better AI-native memory systems. 📖 Read more in Junchen Jiang's new blog, "𝐎𝐩𝐞𝐧𝐀𝐈 𝐀𝐏𝐈 𝐈𝐬 𝐭𝐡𝐞 𝐍𝐞𝐰 𝐈𝐏𝐯𝟒": https://lnkd.in/gxB_qpnU #AI #Inference #LMCache #KVCache #network
-
𝐋𝐌𝐂𝐚𝐜𝐡𝐞 𝐯𝟎.𝟒.𝟓 𝐢𝐬 𝐨𝐮𝐭! 🎉 This release was a massive community effort. A huge shoutout to our 34 contributors who shipped 119 commits! Whether you submitted a PR, reported an issue, or joined the discussions, thank you for making this possible! 🎉
𝐖𝐡𝐚𝐭'𝐬 𝐍𝐞𝐰:
🔹 DeepSeek V4 support
🔹 TensorRT-LLM integration
🔹 Hardware support for AMD ROCm & Intel HPU
🔹 SERDE support for pluggable KV cache transformations
⚠️ 𝐇𝐞𝐚𝐝𝐬-𝐮𝐩: Our default CUDA wheel is now cu13 (cu12.9 is still available).
📖 𝐑𝐞𝐚𝐝 𝐭𝐡𝐞 𝐟𝐮𝐥𝐥 𝐫𝐞𝐥𝐞𝐚𝐬𝐞 𝐧𝐨𝐭𝐞𝐬: https://lnkd.in/gHVqZuRt
Thank you for being such an incredible community. We can't wait to see what we build together next! #AI #LLM #Inference #LMCache
-
-
𝐋𝐌𝐂𝐚𝐜𝐡𝐞 𝐧𝐨𝐰 𝐬𝐮𝐩𝐩𝐨𝐫𝐭𝐬 𝐓𝐞𝐧𝐬𝐨𝐫𝐑𝐓-𝐋𝐋𝐌, alongside vLLM and SGLang! 🎉 With this integration, TensorRT-LLM can use LMCache for KV cache lookup, retrieve, and store during the request lifecycle. In our recommended 𝐦𝐮𝐥𝐭𝐢𝐩𝐫𝐨𝐜𝐞𝐬𝐬 (𝐌𝐏) 𝐦𝐨𝐝𝐞, the engines talk to a standalone LMCache server, enabling shared KV cache management across multiple TRT-LLM workers on the same node. The main engineering difference is TensorRT-LLM’s KV memory layout. Unlike vLLM and SGLang, which commonly expose KV cache in a layer-oriented layout, TensorRT-LLM packs multiple layers within shared KV cache blocks for efficient GPU access. LMCache now understands this packed layout and can efficiently read and write TensorRT-LLM KV cache. This brings LMCache’s KV reuse and multi-tier cache capabilities to TensorRT-LLM, connecting it to LMCache’s broad ecosystem. Start here: https://lnkd.in/gTM5i5yV Explore validated recipes for models deployed with TensorRT-LLM and help us expand coverage — a great starting PR for new contributors: https://lnkd.in/geXP5zzB #AI #LLM #Inference #LMCache #NVIDIA
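To make the layout difference concrete, here is a toy index calculation contrasting a layer-oriented layout with a block-packed one. The two formulas are simplified assumptions for illustration; they are not the actual vLLM, SGLang, or TensorRT-LLM memory layouts, and all function names are hypothetical.

```python
def layer_major_slot(layer: int, block: int, num_blocks: int) -> int:
    """Layer-oriented layout: each layer owns a contiguous run of num_blocks slots."""
    return layer * num_blocks + block


def packed_block_slot(layer: int, block: int, num_layers: int) -> int:
    """Packed layout: each block holds the slices of all num_layers layers together."""
    return block * num_layers + layer


def packed_to_layer_major(slot: int, num_layers: int, num_blocks: int) -> int:
    """Translate a slot index in the packed layout to the layer-oriented layout."""
    block, layer = divmod(slot, num_layers)
    return layer_major_slot(layer, block, num_blocks)


if __name__ == "__main__":
    num_layers, num_blocks = 3, 4
    for block in range(num_blocks):
        for layer in range(num_layers):
            packed = packed_block_slot(layer, block, num_layers)
            flat = packed_to_layer_major(packed, num_layers, num_blocks)
            print(f"layer={layer} block={block}: packed slot {packed} -> layer-major slot {flat}")
```

A cache layer that reads or writes KV data has to apply the right mapping for the engine it is attached to; otherwise it would copy bytes that belong to a different layer or block.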
-
-
LMCache now supports DeepSeek V4 in multi-process (MP) mode! Huge thanks to the community and everyone who helped make this happen. 🎉
So what makes DeepSeek V4 different? Most models LMCache supports today use a uniform KV cache layout across layers. For example, Llama, Qwen, and GPT-OSS typically use the same block_size across the model. DeepSeek V4 is different. It uses multiple KV cache formats in the same model (illustrative example; actual numbers may differ):
Layers 1–50: block_size = 16 (regular attention, uncompressed)
Layers 51–58: block_size = 4 (compressor caches)
Layers 59–60: block_size = 1 (indexer caches)
You can find more details on this in our technical deep dive blog: https://lnkd.in/gKfrhNc7
On the LMCache side, MP mode now tracks the KV layout per layer group instead of assuming one packed format across all layers. This lets LMCache read and transfer KV cache from the correct memory locations, while models with uniform layouts continue to use the original fast path.
If you are running DSV4 on vLLM and want to use LMCache for KV reuse or offloading, the MP path is ready to try! #AI #LLM #Inference #LMCache #DeepSeek
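Tracking the layout per layer group, with a fast path for uniform models, might look roughly like the sketch below. It reuses the illustrative layer ranges from the post (which the post itself says may differ from the real numbers); `LayerGroup` and `KVLayout` are hypothetical names and not LMCache classes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LayerGroup:
    first_layer: int  # inclusive, 1-based as in the post's example
    last_layer: int   # inclusive
    block_size: int
    kind: str


class KVLayout:
    """Per-layer-group layout table with a shortcut for uniform models."""

    def __init__(self, groups: List[LayerGroup]) -> None:
        self.groups = sorted(groups, key=lambda g: g.first_layer)
        sizes = {g.block_size for g in self.groups}
        # If every group shares one block size, remember it and skip the search.
        self.uniform_block_size: Optional[int] = sizes.pop() if len(sizes) == 1 else None

    def block_size_for(self, layer: int) -> int:
        if self.uniform_block_size is not None:
            return self.uniform_block_size  # fast path: no per-layer search
        for group in self.groups:
            if group.first_layer <= layer <= group.last_layer:
                return group.block_size
        raise KeyError(f"layer {layer} is not covered by any layer group")


if __name__ == "__main__":
    dsv4_like = KVLayout([
        LayerGroup(1, 50, 16, "regular attention"),
        LayerGroup(51, 58, 4, "compressor cache"),
        LayerGroup(59, 60, 1, "indexer cache"),
    ])
    for layer in (1, 50, 51, 58, 59, 60):
        print(f"layer {layer}: block_size = {dsv4_like.block_size_for(layer)}")
```

A model such as Llama would be described by a single group, so `uniform_block_size` is set and every query returns immediately without scanning groups.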
-