🦃 FastFlowLM v0.9.21 — Thanksgiving LLaMA Turbocharge Release
Happy Thanksgiving!
Today we’re dropping one of our biggest speed upgrades ever for LLaMA and DeepSeek models (our first batch of models) — just in time for the holiday break. Fire up your Ryzen™ AI NPU and enjoy some seriously boosted performance. 🔥
🔄 1. Quantization Upgrade
- All models migrated from AWQ to Q4_1
- Better LLM accuracy and quality.
⚡ 2. Massive Decoding Speedup
- `llama3.2:1b`: ~50% faster decoding, reaching 66 tps
- `llama3.2:3b`: ~40% faster decoding, reaching 28 tps
- `llama3.1:8b`: ~40% faster decoding, reaching 13 tps
- `deepseek-r1:8b`: ~40% faster decoding, reaching 13 tps
🚀 3. Prefill Phase Optimized
- Slight improvements to prefill speed for all of the above models, especially impactful for large-context initialization.
🎙️ 4. Standalone Whisper ASR Server
You can now serve Whisper (OpenAI’s ASR model) as a standalone model for speech transcription — or pair it with GPU LLMs in a hybrid pipeline.
Use either:
```
flm serve -a 1
```
or
```
flm serve --asr 1
```
This release wraps up a bundle of performance gifts for LLaMA models on FastFlowLM.
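For reference, here's a minimal Python sketch of the hybrid pattern. It assumes the standalone ASR server exposes the same OpenAI-compatible v1/audio/transcriptions endpoint and default port (52625) described in the v0.9.14 notes; the GPU LLM endpoint and model name are placeholders.

```python
from openai import OpenAI

# 1) Transcribe locally with the standalone FLM Whisper server.
#    Port and model name follow the defaults from the v0.9.14 notes (assumption).
asr = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

with open("meeting.mp3", "rb") as f:
    transcript = asr.audio.transcriptions.create(model="whisper-v3", file=f).text

# 2) Hand the transcript to a GPU-hosted LLM via any OpenAI-compatible server.
#    The endpoint and model name here are hypothetical placeholders.
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

summary = llm.chat.completions.create(
    model="your-gpu-model",
    messages=[{"role": "user", "content": f"Summarize this transcript:\n{transcript}"}],
)
print(summary.choices[0].message.content)
```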
Thank you for being part of the FastFlowLM journey — and happy Thanksgiving! 🦃🔥
🚀 FastFlowLM v0.9.20 — Massive Decoding Speed Boosts for GPT-OSS & Gemma3
FastFlowLM v0.9.20 introduces substantial performance improvements across multiple model families, with special focus on decoding efficiency.
⚡ Performance Improvements
🔸 1. GPT-OSS Models
- Decoding speed of `gpt-oss:20b` and `gpt-oss-safeguard:20b` now reaches ~19 tokens/sec and is over 60% faster at 1K context length.
🔸 2. Gemma3 Models
- `gemma3:4b`: reaching ~19 tokens/sec, an over 20% decoding speed boost at 1K context length
- `gemma3:1b`: reaching ~43 tokens/sec
- `gemma3:270m`: reaching ~79 tokens/sec (note: this model is experimental)
This release is focused on raw speed—making FastFlowLM even more efficient for both high-capacity and portable deployments.
🐛 Qwen3-VL-Instruct Bug Fix
Qwen3VL-IT embedding bug resolved
- Fixed an issue where the Qwen3VL-IT weights were not correctly uploaded to Hugging Face.
🚀 FastFlowLM v0.9.18 — Major Prefill Boost for GPT-OSS & Full LFM2 Integration
✨ What’s New
🔥 1. Significant Prefill Acceleration for GPT-OSS Models
FastFlowLM’s execution pipeline has undergone deep optimization, resulting in significant improvements to prefill performance — including faster time to first token (TTFT) and more efficient long-context ingestion.
Currently, these enhancements apply to:
- gpt-oss:20b
- gpt-oss-safeguard:20b
⬇️ A model redownload is required.
💧 2. Full Support for LFM2-1.2B (Liquid Foundation Model)
FastFlowLM now supports the LFM2-1.2B model from LiquidAI: the first hybrid LLM architecture on the AMD Ryzen™ AI NPU, powered by FLM's highly efficient linear attention engine.
Try it:
- CLI mode:
```
flm run lfm2:1.2b
```
- Serve mode:
```
flm serve lfm2:1.2b
```
This release delivers a major prefill speed boost for GPT-OSS models and introduces full support for the LFM2-1.2B Liquid Foundation Model.
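Once the server is up, the model can be queried programmatically like any other FLM model. Below is a minimal streaming sketch, assuming the OpenAI-compatible endpoint on the default port (52625) and that the served model id matches the CLI name:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:52625/v1",  # default FLM port since v0.9.12
    api_key="flm",                         # dummy key, no auth required
)

# Stream tokens from the served LFM2 model; the model id "lfm2:1.2b"
# is assumed to match the CLI name.
stream = client.chat.completions.create(
    model="lfm2:1.2b",
    messages=[{"role": "user", "content": "Explain linear attention in two sentences."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```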
🚀 FastFlowLM v0.9.17 — Faster Vision Understanding for Qwen3VL, Improved Memory Management for GPT-OSS, and New OpenAI Safeguard Model
FastFlowLM v0.9.17 brings performance boosts to vision models, improves compatibility for systems with limited RAM, and adds a safer GPT-OSS variant.
✨ What’s New
⚡ 1. Faster Vision Understanding for Qwen3VL
- Speedup for image embedding with Qwen3VL models.
- Requires redownloading the model.
🧠 2. GPT-OSS Optimization for Low-RAM Machines
- Improved memory management for `gpt-oss:20b`, increasing the chance it runs on a 32 GB system (note: the NPU can access <50% of total RAM).
- Speedup for decoding at long context lengths.
- Driver version higher than `32.0.203.304` is required.
- Requires redownloading the model.
🛡️ 3. New Model: gpt-oss-safeguard:20b
- A safety-enhanced, instruction (prompt)-tuned variant of `gpt-oss:20b`.
- Optimized for assistant-style tasks with safer responses.
- Run with:
```
flm run gpt-oss-sg:20b
```
⚠️ 4. API Focus: OpenAI-Compatible
- Discontinuing Ollama API support.
- Focusing on the OpenAI-compatible API.
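If your tooling previously targeted the Ollama endpoints, the OpenAI-compatible chat endpoint is the supported path going forward. A minimal sketch, assuming the default port (52625) and that the model id matches the CLI name:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:52625/v1",  # FastFlowLM's OpenAI-compatible endpoint
    api_key="flm",                         # dummy key, no auth required
)

resp = client.chat.completions.create(
    model="gpt-oss:20b",  # assumed to match the CLI model name
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize what changed in this release."},
    ],
)
print(resp.choices[0].message.content)
```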
This release enhances speed and system flexibility, while expanding FastFlowLM’s model lineup with secure, large-scale assistants.
🚀 FastFlowLM v0.9.16 — Qwen3-VL arrives!
✨ What’s included
🖼️ New vision model: Qwen3-VL-4B-Instruct
- Fully offline on AMD Ryzen™ AI NPU
- Optimized for lightweight, fast, vision-capable edge inference
Try it in two ways 👇
💻 CLI
Run the model:
```
flm run qwen3vl-it:4b
```
Then type:
```
/input <image_path> <prompt>
```
🌐 In serve mode
Start the flm server first:
```
flm serve qwen3vl-it:4b
```
Then interact with Open WebUI or any OpenAI-compatible client!
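For clients other than Open WebUI, image prompts would go through the OpenAI-compatible chat endpoint. The sketch below is assumption-heavy: it assumes FLM accepts the standard OpenAI image_url content format (base64 data URL) and that the served model id matches the CLI name.

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

# Encode a local image as a base64 data URL (standard OpenAI vision format;
# whether FLM accepts this exact request shape is an assumption).
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="qwen3vl-it:4b",  # assumed to match the CLI model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```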
🌟 Summary
FastFlowLM v0.9.16 adds Qwen3-VL-4B-Instruct with full offline vision on AMD Ryzen AI — faster, smaller, and ready for edge.
🚀 FastFlowLM v0.9.15: New Embedding Capabilities & API Integration
🔎 1. New Model: EmbeddingGemma-300m
The first Embedding model on FLM:
- Runs fully offline on AMD Ryzen™ AI NPU
- Supports chunk sizes up to 2048 tokens
Try it out:
- Start flm in server mode with embedding model enabled:
```
flm serve gemma3:4b --embed 1  # Load embedding model in background, with concurrent LLM loading (gemma3:4b).
```
⚠️ Note: Embedding model is not allowed in CLI mode.
🌐 2. OpenAI-Compatible Embedding API: v1/embeddings
FastFlowLM now supports the OpenAI v1/embeddings endpoint, making it easy to integrate the embedding model into any OpenAI-compatible client or UI.
How to use:
- Start FLM server with embedding model enabled:
```
# serve
flm serve gemma3:4b --embed 1  # Load embedding model in background, with concurrent LLM loading (gemma3:4b).
```
- Send file(s) to:
```
POST /v1/embeddings
```
via any OpenAI client or Open WebUI.
Examples: Test in OpenAI client
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:52625/v1",  # FastFlowLM's local API endpoint
    api_key="flm",  # Dummy key (FastFlowLM doesn't require authentication)
)

resp = client.embeddings.create(
    model="embed-gemma",
    input="Hi, everyone!"
)

print(resp.data[0].embedding)
```
Example: Open WebUI
- Follow the Open WebUI setup guide.
- In the bottom-left corner, click the User icon, then select Settings.
- In the bottom panel, open Admin Settings.
- In the left sidebar, navigate to Documents.
- Set Embedding Model Engine to OpenAI.
- Enter:
  - API Base URL: `http://host.docker.internal:52625/v1`
  - API KEY: `flm` (any value works)
  - Embedding Model: `embed-gemma:300m`
- Save the settings.
- Follow the RAG + FastFlowLM example to launch your local, private database with RAG, all powered by FLM.
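Outside Open WebUI, the same endpoint can also back simple retrieval directly. A minimal sketch ranking documents by cosine similarity, assuming the endpoint accepts batched input like the upstream OpenAI API:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

def embed(texts):
    # Batched input is assumed to work as in the upstream OpenAI API.
    resp = client.embeddings.create(model="embed-gemma", input=texts)
    return [d.embedding for d in resp.data]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

docs = ["FLM runs LLMs on the Ryzen AI NPU.", "Thanksgiving turkey recipes."]
doc_vecs = embed(docs)
query_vec = embed(["How do I run a model on the NPU?"])[0]

# Rank documents by similarity to the query.
ranked = sorted(zip(docs, doc_vecs), key=lambda p: cosine(query_vec, p[1]), reverse=True)
print(ranked[0][0])
```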
🙏 Acknowledgement
Special thanks to julienM77 for contributing the message normalizer that improves handling of corrupted user messages.
🌟 Summary
FastFlowLM v0.9.15 introduces offline embeddings with EmbeddingGemma and support for OpenAI's v1/embeddings API. Just start the server with --embed 1 and you're ready to build local, private, and intelligent applications.
🚀 FastFlowLM v0.9.14: New ASR Capabilities & API Integration
✨ What’s New
🎙️ 1. New Model: whisper-large-v3-turbo (by OpenAI)
The first Automatic Speech Recognition (ASR) model on FLM:
- Runs fully offline on AMD Ryzen™ AI NPU
- Multilingual audio recognition
- Supports MP3, WAV, OGG and M4A formats
- Lightweight footprint — only 900MB memory
Try it out:
- Start flm in CLI mode with ASR enabled:
```
# CLI
flm run gemma3:4b --asr 1  # Load the ASR model (whisper-v3:turbo) in the background, with concurrent LLM loading (gemma3:4b).
```
- Type (replace the example path with your audio file path):
```
/input "path\to\audio_sample.mp3" summarize it
```
🌐 2. OpenAI-Compatible ASR API: v1/audio/transcriptions
FastFlowLM now supports the OpenAI v1/audio/transcriptions endpoint — making it easy to integrate ASR into any OpenAI-compatible client or UI.
How to use:
- Start your FLM server with ASR enabled:
```
# serve
flm serve gemma3:4b --asr 1  # Load the ASR model (whisper-v3:turbo) in the background, with concurrent LLM loading (gemma3:4b).
```
- Send audio to:
```
POST /v1/audio/transcriptions
```
via any OpenAI client or Open WebUI.
Examples: OpenAI client
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:52625/v1",  # FastFlowLM's local API endpoint
    api_key="flm",  # Dummy key (FastFlowLM doesn't require authentication)
)

with open("audio.mp3", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="whisper-v3",
        file=f,
    )

print(resp.text)
```
Example: Open WebUI
1. Follow Open WebUI setup guide.
2. In the bottom-left corner, click the User icon, then select Settings.
3. In the bottom panel, open Admin Settings.
4. In the left sidebar, navigate to Audio.
5. Set Speech-to-Text Engine to OpenAI.
6. Enter:
   - API Base URL: `http://host.docker.internal:52625/v1`
   - API KEY: `flm` (any value works)
   - STT Model: `whisper-large-v3-turbo` (type in the model name; it can be different)
7. Save the setting.
8. You're ready to upload audio files! (Choose an LLM to load and use concurrently)
🌟 Summary
FastFlowLM v0.9.14 introduces offline ASR with whisper-large-v3-turbo and support for OpenAI’s v1/audio/transcriptions API, making speech-to-text integration seamless across clients and WebUI. Just start the server with --asr 1 and you're ready to transcribe.
⚡ FastFlowLM v0.9.13 — GPT-OSS Speed Boost, Harmony Chat Template, and Ollama `/api/show` Support
FastFlowLM v0.9.13 delivers performance improvements, OpenAI Harmony support, and greater compatibility with Ollama-style endpoints.
🚀 Highlights
🔹 1. GPT-OSS:20b Prefill Speed Boost
- Up to 20% faster prefill for `gpt-oss:20b`
- Reduces latency for long prompts and improves responsiveness
- v0.9.13 auto-downloads the updated model from Hugging Face
🔹 2. OpenAI Harmony Integration
- Added support for OpenAI’s Harmony Chat Template
- Enables richer multi-turn dialogue and structured generation capabilities
🔹 3. /api/show Endpoint (Ollama Compatibility)
- FastFlowLM now supports `/api/show`
- Compatible with tools that rely on Ollama API conventions
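For a quick manual check of the endpoint, here is a minimal sketch following the Ollama convention; the request field and response shape are assumptions based on Ollama's API, not FLM-specific documentation:

```python
import requests  # pip install requests

# POST /api/show with the model name, per the Ollama API convention.
# The field name ("model"; older Ollama clients use "name") is an assumption.
resp = requests.post(
    "http://localhost:52625/api/show",
    json={"model": "llama3.2:1b"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```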
🔹 4. /v1/completions (legacy OpenAI API) is now supported
- OpenAI does not recommend it for new applications
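For older integrations, a minimal sketch of the legacy call through the OpenAI Python client (model name assumed to match the CLI name); prefer the chat endpoint for new apps:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

# Legacy text-completion request; OpenAI recommends the chat endpoint for new apps.
resp = client.completions.create(
    model="llama3.2:1b",  # assumed to match the CLI model name
    prompt="FastFlowLM runs language models on the Ryzen AI NPU because",
    max_tokens=64,
)
print(resp.choices[0].text)
```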
⚙️ Optional Performance Tip
🆕 New NPU Driver Available
Upgrading to the latest AMD Ryzen AI NPU driver (v32.0.203.304) may boost prefill and decoding performance by an additional 5–10%.
📄 Download Link
Note: You will need an AMD account to access the driver.
This release makes FastFlowLM faster, more OpenAI-compatible, and ready for broader integration into multimodal workflows and edge deployments.
🚀 FastFlowLM v0.9.12 — GPT-OSS 20B, CORS Defaults, Cleaner Lists
✨ What’s New
🧠 1. New Model: gpt-oss:20b
Introducing the first-ever MoE (Mixture of Experts) model to run natively on AMD Ryzen™ AI NPUs, and also the first MoE model released by FastFlowLM (FLM).
gpt-oss:20b is a fast, open-source MoE model by OpenAI — powered by FLM’s NPU-native MoE engine with MXFP4 support, delivering high throughput and power efficiency optimized for AMD NPUs.
- Runs fully offline on AMD Ryzen™ AI NPU
- Supports reasoning effort controls in both CLI and Server mode
Try it:
```
# CLI
flm run gpt-oss:20b

# Server
flm serve gpt-oss:20b
```
Set reasoning effort (CLI):
```
# CLI
flm run gpt-oss:20b
/set r-eff medium
```
📝 NOTE
- Memory Requirements
  ⚠️ Note: Running `gpt-oss:20b` may need a system with > 32 GB RAM. The model itself uses ~15.1 GB of memory in FLM, and there is an internal cap (~15.6 GB) on NPU memory allocation enforced by AMD/Microsoft, which makes only about half of the total system RAM available to the NPU. On 32 GB machines it sometimes works and sometimes does not, so we recommend more RAM for a smooth experience.
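Since reasoning effort controls are available in both CLI and Server mode, here is a hedged sketch of setting it over the API; the request field name is an assumption that mirrors the CLI's /set r-eff control, passed through the OpenAI client's extra_body:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

resp = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Outline a 3-step plan to benchmark decoding speed."}],
    # Field name is an assumption; mirrors the CLI's /set r-eff control.
    extra_body={"reasoning_effort": "medium"},
)
print(resp.choices[0].message.content)
```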
🌐 2. Cross-Origin Resource Sharing (CORS)
CORS lets browser apps hosted on a different origin call your FLM server safely.
- Enable CORS:
```
flm serve --cors 1
```
- Disable CORS:
```
flm serve --cors 0
```
⚠️ Default: CORS is enabled.
🔒 Security tip: Disable CORS (or restrict at your proxy) if your server is exposed beyond localhost.
🔌 3. Default Server Port Change
The default port has moved from 11434 → 52625 to reduce conflicts.
Check or override anytime:
```
# Show current effective port
flm port

# Use a custom port for this session only
flm serve llama3.2:1b --port 8000
flm serve llama3.2:1b -p 8000
```
💡 `--port` (or `-p`) affects this run only and does not change your system defaults.
📃 4. Improved `flm list`
Cleaner output, filters, and a quiet mode.
Common uses:
```
# Default view (pretty, with icons)
flm list

# Quiet view (no emoji / minimal)
flm list --quiet

# Show everything
flm list --filter all --quiet

# Only models already installed
flm list --filter installed --quiet

# Only models not yet installed
flm list --filter not-installed --quiet
```
🌟 Summary
This release supports gpt-oss:20b, a state-of-the-art MoE model, enables CORS by default (toggleable), changes the default server port to 52625, and improves `flm list` with quieter, filterable output.