vLLM

Software Development

An open-source, high-throughput, and memory-efficient inference and serving engine for LLMs.

About us

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.

Website
https://github.com/vllm-project/vllm
Industry
Software Development
Company size
51-200 employees
Type
Nonprofit


Updates

  • vLLM

    Love this: a community contributor built vLLM Playground to make inference visible, interactive, and experiment-friendly. From visual config toggles to automatic command generation, from GPU/M-chip support to GuideLLM benchmarking + LLMCompressor integration — it brings the whole vLLM lifecycle into one unified UX. Huge kudos to micyang for this thoughtful, polished contribution. 🔗 https://lnkd.in/eMSCp_pW

  • vLLM

    Running multi-node vLLM on Ray can be complicated: different roles, env vars, and SSH glue to keep things together. The new `ray symmetric-run` command lets you run the same entrypoint on every node while Ray handles cluster startup, coordination, and teardown for you. Deep dive + examples: https://lnkd.in/g4sBV_ai

    Richard Liaw, Anyscale:

    Ray and vLLM have worked closely together to improve the interactive development experience for large models! Spinning up multi-node vLLM with Ray in interactive environments can be tedious, requiring users to juggle separate commands for different nodes and breaking the “single symmetric entrypoint” mental model that many users expect. Ray now has a new command: ray symmetric-run. It launches the same entrypoint command on every node in a Ray cluster, which makes it easy to spawn vLLM servers for multi-node models on HPC setups or when using parallel SSH tools like mpssh. Check out the blog: https://lnkd.in/gniPWzge Thanks to Kaichao You for the collaboration!

  • vLLM

    🇲🇾 vLLM Malaysia Day is 5 days away!
    2 Dec 2025, 📍 ILHAM Tower, Kuala Lumpur
    We are bringing the vLLM and LMCache community together with Embedded LLM, AMD, Red Hat, and WEKA to advance open, production-grade AI across ASEAN.
    The lineup:
    - The State of vLLM & LMCache: insights straight from vLLM maintainer Tun Jian Tan (Embedded LLM)
    - Hardware optimization: high-performance serving on ROCm with Seung Rok Jung (AI/Compute Architect, AMD)
    - Production stories: vLLM Semantic Router with Cheng Bin Tham (Red Hat) and Breaking the Memory Wall with Ronald Pereira (WEKA)
    - Deep-tech panel: leaders from Amazon Web Services (AWS), Malaysia Digital Economy Corporation (MDEC), 500 Global, and Foong Chee Mun (CEO, YTL AI Labs)
    - National AI: a look at MaLLaM and Malaysia's LLM ecosystem with Khalil Nooh (CEO, Mesolitica)
    Whether you are optimizing inference or building sovereign models, this is the place to be.
    🎟 Secure your spot: https://lnkd.in/gwFgvMyH

  • vLLM

    FP8 RL on consumer GPUs just got a boost 🔥 Thrilled to team up with Daniel Han from Unsloth AI and TorchAO to bring FP8 GRPO to vLLM: ~1.4× faster RL inference, 60% less VRAM, 12× longer context, and Qwen3-1.7B fitting in 5GB VRAM.

    Daniel Han, Co-founder @ Unsloth AI:

    You can now run FP8 reinforcement learning on consumer GPUs! ⚡ DeepSeek-R1 demonstrated the power of FP8 GRPO. Now you can try it at home on just a 5GB GPU with Unsloth AI.
    • Qwen3-14B FP8 GRPO works on 24GB VRAM; Qwen3-1.7B works on 5GB.
    • We collaborated with PyTorch TorchAO to make Unsloth FP8 RL inference via vLLM ~1.4× faster than FP16.
    • Unsloth uses 60% less VRAM and enables 12× longer context vs. other implementations.
    • Works on NVIDIA GeForce RTX 40 and 50 series GPUs, as well as H100, B200, etc.
    ⭐ Blog: https://lnkd.in/gC7-fpx8
    Qwen3-8B FP8 GRPO Colab notebook: https://lnkd.in/gn7rpUp6
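    A minimal sketch of what this setup looks like in code, assuming recent unsloth, trl, and datasets installs: fast_inference=True routes GRPO rollouts through vLLM, while the model size, prompt dataset, reward function, and training arguments below are illustrative placeholders, and the FP8 switch itself is documented in the linked blog and notebook rather than shown here.

      # Import unsloth first so it can patch transformers/trl.
      from unsloth import FastLanguageModel
      from datasets import Dataset
      from trl import GRPOConfig, GRPOTrainer

      # Load the policy model; fast_inference=True uses vLLM for rollout generation.
      model, tokenizer = FastLanguageModel.from_pretrained(
          model_name="Qwen/Qwen3-1.7B",   # illustrative; the blog covers 1.7B-14B configs
          max_seq_length=2048,
          fast_inference=True,
          gpu_memory_utilization=0.7,
      )

      # Toy prompts and reward (prefer shorter completions) to keep the sketch self-contained.
      train_dataset = Dataset.from_dict(
          {"prompt": ["Explain KV caching in one sentence."] * 32}
      )

      def reward_short(completions, **kwargs):
          return [-float(len(c)) for c in completions]

      trainer = GRPOTrainer(
          model=model,
          processing_class=tokenizer,
          reward_funcs=[reward_short],
          args=GRPOConfig(
              output_dir="grpo-sketch",
              num_generations=4,
              per_device_train_batch_size=4,
              max_completion_length=128,
              max_steps=10,
          ),
          train_dataset=train_dataset,
      )
      trainer.train()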

  • vLLM

    We’re seeing a noticeable rise in compact, high-quality OCR models across the open-source ecosystem — a promising direction for real-world document understanding, edge deployment, and multimodal AI pipelines. Tencent’s HunyuanOCR is a strong example of this trend: a 1B-parameter end-to-end OCR model delivering SOTA results on OCRBench and OmniDocBench while covering a wide range of practical tasks — from text spotting (street view, handwriting, art text) to tables, formulas, video subtitles, and multilingual photo translation. To support this growing class of lightweight vision models, vLLM now offers Day-0 support for HunyuanOCR, enabling developers to run it efficiently out of the box with familiar vLLM APIs.
    Documentation: https://lnkd.in/eizE_g6S
    GitHub: https://lnkd.in/eCFriuam
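    As a rough illustration of “familiar vLLM APIs”, here is a minimal offline-inference sketch. The model id, image URL, and prompt wording are assumptions for illustration only; check the linked documentation for the canonical id and any model-specific prompt format.

      from vllm import LLM, SamplingParams

      # Model id is an assumption; use the id given in the linked vLLM docs.
      llm = LLM(model="tencent/HunyuanOCR", trust_remote_code=True)

      messages = [
          {
              "role": "user",
              "content": [
                  {"type": "image_url",
                   "image_url": {"url": "https://example.com/sample_receipt.png"}},
                  {"type": "text",
                   "text": "Extract all text in this image, preserving the reading order."},
              ],
          }
      ]

      # Deterministic decoding is typical for OCR-style extraction.
      outputs = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=1024))
      print(outputs[0].outputs[0].text)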

  • vLLM

    🚀 vLLM Talent Pool is open!
    As LLM adoption accelerates, vLLM has become the mainstream inference engine used across major cloud providers (AWS, Google Cloud, Azure, Alibaba Cloud, ByteDance, Tencent, Baidu…) and leading model labs (DeepSeek, Moonshot, Qwen…). To meet the strong demand from top companies, the vLLM community is now collecting resumes year-round and helping with referrals (internships & full-time).
    If you have experience in any of the following areas, we’d love to hear from you:
    • RL frameworks & algorithms for LLMs
    • Tool calling, MCP, Harmony format, OpenAI/Anthropic API
    • Structured output / constrained decoding
    • High-performance kernels: attention, GEMM, sampling, sorting
    • CUTLASS / CuTe DSL / TileLang
    • Distributed systems: Ray, multiprocessing
    • vLLM + Kubernetes
    • Tensor / expert / context parallelism
    • NCCL, DeepEP, NVSHMEM, RDMA, NVLink
    • Prefill/decode separation, KV-cache transport
    • Speculative decoding (EAGLE, MTP, …)
    • MoE optimization
    • KV-cache memory management (hybrid models, prefix caching)
    • Multimodal inference (audio/image/video/text)
    • LoRA
    • Rust / Go / C++ / Python serving stacks
    • Attention mechanisms (MLA, MQA, SWA, linear attention)
    • Position encodings (RoPE, mRoPE)
    • Model architectures (DeepSeek, Qwen, etc.)
    • Embedding model support
    • torch.compile integration
    …or any other LLM inference engineering experience.
    Bonus points if you have:
    • Implemented core features in vLLM
    • Contributed to vLLM integrations (verl, OpenRLHF, Unsloth, LlamaFactory…)
    • Written widely shared technical blogs on vLLM
    💰 Compensation: highly competitive, with no upper limit for exceptional inference engineers.
    📍 Locations: major cities in the US (SF Bay Area, etc.) and major cities in China (Beijing / Shanghai / Shenzhen / Guangzhou / Chengdu…)
    📨 Apply: send your resume to talentpool@vllm.ai (sending your resume means you agree to share it with partner companies).
    🌱 Join the vLLM community:
    • Slack: apply at http://slack.vllm.ai
    • Chinese community (WeChat): add vllm_project with your name & affiliation
    Let’s build easy, fast, and cheap LLM serving for everyone — together! ⚡

  • vLLM

    🚀 Red Hat AI has just open-sourced a full suite of high-quality speculator models, and the integration with vLLM makes this a meaningful step forward for speculative decoding across the ecosystem. With Speculators + vLLM, developers now get a clean, standardized path from draft models all the way to real production workloads: no custom glue code, no retraining of full models, and measurable speedups in real applications.
    What Red Hat AI released:
    • Speculator models for Llama, Qwen, and gpt-oss
    • 1.5×–2.5× speedups in typical workloads, with peaks above 4×
    • A unified interface that standardizes algorithms, draft models, and configuration
    • Full, reproducible training workflows
    • Seamless deployment through vllm serve <model_stub>
    • GuideLLM benchmarking tools for realistic latency / throughput evaluation
    • A roadmap focused on stronger verifiers and scalable speculator training
    This collaboration helps move speculative decoding from research prototype to a first-class production technique for accelerating inference. Excited to see what the community builds on top of this. The future of open-source inference keeps getting faster and more accessible.
    Speculators repo: https://lnkd.in/euPNU2pM
    Blog: https://lnkd.in/e5uZc6SB
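    To make the deployment path concrete: once vllm serve <model_stub> is running (with <model_stub> replaced by an actual Speculators checkpoint from the linked repo), the server exposes the usual OpenAI-compatible API, so nothing changes on the application side beyond pointing a standard client at it. A minimal sketch, assuming the default port:

      from openai import OpenAI

      # Point the standard OpenAI client at the local vLLM server started with:
      #   vllm serve <model_stub>
      client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

      response = client.chat.completions.create(
          model="<model_stub>",  # placeholder: the same stub passed to vllm serve
          messages=[{"role": "user",
                     "content": "Summarize speculative decoding in two sentences."}],
          max_tokens=128,
      )
      print(response.choices[0].message.content)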

  • vLLM

    🎉 vLLM v0.11.2 is out! This release focuses on things the community cares about most — smoother scaling, more predictable performance, and wider model support. 1,456 commits from 449 contributors (184 new!) made this possible. 💛
    Here are a few improvements you'll feel in real workloads 👇
    1️⃣ More predictable performance: batch-invariant torch.compile + a sturdier async scheduler → steadier latency under mixed and bursty workloads.
    2️⃣ Easier distributed setups: a stronger scheduler + KV ecosystem → prefix cache, connectors, and multi-node flows become more reliable.
    3️⃣ Free speedups on newer GPUs: DeepGEMM & FlashInfer improvements → better throughput on Hopper & Blackwell with zero code changes.
    4️⃣ Simpler client integrations: Anthropic-style /v1/messages → more clients and tools “just work” with vllm serve.
    5️⃣ Wider model support, fewer edge cases: fixes across MoE, multimodal, quantization, CPU/ROCm, and transformers backends → more models behave consistently out of the box.
    There’s a lot more under the hood — see the full release notes for all fixes, models, and performance updates: https://lnkd.in/e8NrF8yW
    Thank you to everyone who filed issues, reviewed PRs, ran benchmarks, and helped shape this release. vLLM grows because the community does. Easy, fast, and cheap LLM serving for everyone. 🩵
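    A small sketch of item 4️⃣: pointing the anthropic Python SDK at a local vllm serve instance that exposes the Anthropic-style /v1/messages endpoint. The model name and port are placeholders, and how much of the Messages API a given client exercises determines whether it works unmodified.

      import anthropic

      # Point the Anthropic SDK at a local vLLM server (vllm serve <your-model>).
      client = anthropic.Anthropic(base_url="http://localhost:8000", api_key="EMPTY")

      message = client.messages.create(
          model="your-served-model",   # placeholder for the model vllm serve loaded
          max_tokens=256,
          messages=[{"role": "user", "content": "Give me one tip for reducing time to first token."}],
      )
      print(message.content[0].text)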

  • vLLM

    Need to customize vLLM? Don't fork it. 🔌
    vLLM's plugin system lets you inject surgical modifications without maintaining a fork or monkey-patching entire modules. Blog by Dhruvil Bhatt from AWS SageMaker 👇
    Why plugins > forks:
    • vLLM releases every 2 weeks with 100s of PRs merged
    • Forks require constant rebasing & conflict resolution
    • Monkey patches break on every vLLM upgrade
    How it works:
    • Use VLLMPatch[TargetClass] for precise, class-level mods
    • Register via the vllm.general_plugins entry point
    • Control patches with env vars (VLLM_CUSTOM_PATCHES)
    • Version-guard with the @min_vllm_version decorator
    Example: add priority scheduling to vLLM's scheduler in ~20 lines. One Docker image serves multiple models with different patches enabled via environment variables. The plugin loads in ALL vLLM processes (main, workers, GPU/CPU) before any inference starts, ensuring consistent behavior across distributed setups.
    Read the full implementation guide with code examples: https://lnkd.in/e4U_xeFa
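    For a feel of the mechanism, here is a minimal sketch of a general plugin. The package and module names are hypothetical, the patch target is a deliberately harmless toy, and the VLLMPatch / @min_vllm_version helpers referenced above are part of the linked blog rather than vLLM itself, so only the raw entry-point + env-var pattern is shown.

      # my_vllm_patches/__init__.py -- shipped as its own installable package,
      # with the entry point declared in pyproject.toml:
      #
      #   [project.entry-points."vllm.general_plugins"]
      #   my_patches = "my_vllm_patches:register"

      import logging
      import os

      logger = logging.getLogger("my_vllm_patches")

      def register() -> None:
          """Called by vLLM in every process (main and workers) before inference starts."""
          enabled = os.environ.get("VLLM_CUSTOM_PATCHES", "")
          if not enabled:
              return  # patches stay opt-in, toggled per container via the env var

          # Toy patch target: wrap the public LLM constructor with a log line.
          # A real patch (e.g. the blog's priority-scheduling example) would
          # target an internal class instead.
          from vllm import LLM

          original_init = LLM.__init__

          def patched_init(self, *args, **kwargs):
              logger.info("Custom vLLM patches enabled: %s", enabled)
              original_init(self, *args, **kwargs)

          LLM.__init__ = patched_init

    With the package installed into the same image, setting VLLM_CUSTOM_PATCHES at container start decides whether the patch activates, which is what lets one Docker image serve multiple models with different patches.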

  • vLLM

    🚀 Docker Model Runner now integrates vLLM! High-throughput LLM inference is now available with the same Docker workflow devs already use.
    ▸ Native safetensors support
    ▸ Automatic routing between llama.cpp (GGUF) and vLLM (safetensors)
    ▸ Works from laptops → clusters, with one CLI
    Bringing easy, fast, and affordable LLM serving to the Docker ecosystem.
    #vLLM #Docker #AIInfra #ModelRunner #OpenSource #AI

    If there's one refrain from me, it's "we've got more coming..." From yesterday's share on Docker Model Runner, here's a big power-up we've launched today: Docker is the only technology that lets you easily switch local inference providers without changing your workflow (or logic). It's actually magically easy to now use llama.cpp on your laptop and vLLM on "big" platforms (e.g., NVIDIA-based systems), based on what model you want to load. And... we've got more coming... https://lnkd.in/gkwHHCPg
