
Releases: FastFlowLM/FastFlowLM

🦃 FastFlowLM v0.9.21 — Thanksgiving LLaMA Turbocharge Release

24 Nov 13:05


Happy Thanksgiving!

Today we’re dropping one of our biggest speed upgrades ever for LLaMA and DeepSeek models (our first batch of models) — just in time for the holiday break. Fire up your Ryzen™ AI NPU and enjoy some seriously boosted performance. 🔥


🔄 1. Quantization Upgrade

  • All models migrated from AWQ to Q4_1
  • Improves LLM accuracy and output quality.

⚡ 2. Massive Decoding Speedup

  • llama3.2:1b: ~50% faster decoding, reaching 66 tps
  • llama3.2:3b: ~40% faster decoding, reaching 28 tps
  • llama3.1:8b: ~40% faster decoding, reaching 13 tps
  • deepseek-r1:8b: ~40% faster decoding, reaching 13 tps

🚀 3. Prefill Phase Optimized

  • Slight improvements to prefill speed for all of the models above, especially impactful for large-context initialization.

🎙️ 4. Standalone Whisper ASR Server

You can now serve Whisper (OpenAI’s ASR model) as a standalone model for speech transcription — or pair it with GPU LLMs in a hybrid pipeline.

Use either:

flm serve -a 1

or

flm serve --asr 1
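As a sketch of the hybrid pipeline mentioned above, the standalone Whisper endpoint can feed a transcript to a GPU-hosted LLM. This is a minimal example under assumptions: FLM on the default port 52625, the whisper-v3 model name from the v0.9.14 notes, and a placeholder OpenAI-compatible GPU endpoint and model name.

from openai import OpenAI

# FLM serving Whisper standalone on the NPU (default port assumed: 52625)
asr = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

# Placeholder: any GPU-hosted LLM exposing an OpenAI-compatible endpoint
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Step 1: transcribe the audio locally
with open("meeting.mp3", "rb") as f:  # placeholder audio file
    transcript = asr.audio.transcriptions.create(model="whisper-v3", file=f).text

# Step 2: hand the transcript to the GPU LLM
resp = llm.chat.completions.create(
    model="your-gpu-model",  # placeholder model name
    messages=[{"role": "user", "content": f"Summarize this transcript:\n{transcript}"}],
)
print(resp.choices[0].message.content)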

This release wraps up a bundle of performance gifts for LLaMA models on FastFlowLM.

Thank you for being part of the FastFlowLM journey — and happy Thanksgiving! 🦃🔥

🚀 FastFlowLM v0.9.20 — Massive Decoding Speed Boosts for GPT-OSS & Gemma3

20 Nov 16:04
88615a7


FastFlowLM v0.9.20 introduces substantial performance improvements across multiple model families, with special focus on decoding efficiency.


⚡ Performance Improvements

🔸 1. GPT-OSS Models

  • Decoding speed of gpt-oss:20b and gpt-oss-safeguard:20b reaches ~19 tokens/sec, over 60% faster at 1K context length.

🔸 2. Gemma3 Models

  • gemma3:4b reaches ~19 tokens/sec, a decoding speed boost of over 20% at 1K context length.
  • gemma3:1b reaches ~43 tokens/sec.
  • gemma3:270m reaches ~79 tokens/sec (note: this model is experimental).

This release is focused on raw speed—making FastFlowLM even more efficient for both high-capacity and portable deployments.

🐛 Fixes Qwen3-VL-Instruct Bug

14 Nov 22:10
35bce23


Qwen3VL-IT embedding bug resolved

  • Fixed an issue where the Qwen3VL-IT weights were not correctly uploaded to Hugging Face.

🚀 FastFlowLM v0.9.18 — Major Prefill Boost for GPT-OSS & Full LFM2 Integration

13 Nov 17:45
15292b0


✨ What’s New

🔥 1. Significant Prefill Acceleration for GPT-OSS Models

FastFlowLM’s execution pipeline has undergone deep optimization, resulting in significant improvements to prefill performance — including faster time to first token (TTFT) and more efficient long-context ingestion.

Currently, these enhancements apply to:

  • gpt-oss:20b
  • gpt-oss-safeguard:20b

⬇️ A model redownload is required.


💧 2. Full Support for LFM2-1.2B (Liquid Foundation Model)

FastFlowLM now supports LFM2-1.2B (LiquidAI), the first hybrid LLM architecture on the AMD Ryzen™ AI NPU, powered by FLM's highly efficient linear attention engine.

Try it:

  1. CLI mode
flm run lfm2:1.2b
  2. Serve mode
flm serve lfm2:1.2b

This release delivers a major prefill speed boost for GPT-OSS models and introduces full support for the LFM2-1.2B Liquid Foundation Model.

🚀 FastFlowLM v0.9.17 — Faster Vision Understanding for Qwen3VL, Improved Memory Management for GPT-OSS, and New OpenAI Safeguard Model

06 Nov 14:49
7bac50f


FastFlowLM v0.9.17 brings performance boosts to vision models, improves compatibility for systems with limited RAM, and adds a safer GPT-OSS variant.


✨ What’s New

⚡ 1. Faster Vision Understanding for Qwen3VL

  • Speedup for image embedding with Qwen3VL models.
  • Requires redownloading the model.

🧠 2. GPT-OSS Optimization for Low-RAM Machines

  • Improved memory management for gpt-oss:20b, increasing the chance of running on a 32 GB system (note: the NPU can access <50% of total RAM).
  • Speedup for decoding at long context lengths.
  • Requires a driver version higher than 32.0.203.304.
  • Requires redownloading the model.

🛡️ 3. New Model: gpt-oss-safeguard:20b

  • A safety-enhanced, instruction (prompt)-tuned variant of gpt-oss:20b.
  • Optimized for assistant-style tasks with safer responses.
  • Run with:
    flm run gpt-oss-sg:20b

⚠️ 4. API Focus: OpenAI-Compatible

  • Discontinuing Ollama API support.
  • Focusing on the OpenAI-compatible API.
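For reference, a minimal chat call against the OpenAI-compatible endpoint could look like the sketch below, assuming the default port 52625 and a model already loaded via flm serve.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:52625/v1",  # FastFlowLM's local API endpoint
    api_key="flm",  # Dummy key (FastFlowLM doesn't require authentication)
)

resp = client.chat.completions.create(
    model="gpt-oss:20b",  # any served model works, e.g. gemma3:4b
    messages=[{"role": "user", "content": "Give me one sentence about NPUs."}],
)
print(resp.choices[0].message.content)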

This release enhances speed and system flexibility, while expanding FastFlowLM’s model lineup with secure, large-scale assistants.

🚀 FastFlowLM v0.9.16 — Qwen3-VL arrives!

30 Oct 14:20
fecd8b1


✨ What’s included

🖼️ New vision model: Qwen3-VL-4B-Instruct

  • Fully offline on AMD Ryzen™ AI NPU
  • Optimized for lightweight, fast, vision-capable edge inference

Try it in two ways 👇

💻 CLI

Run model

flm run qwen3vl-it:4b

Then type

/input <image_path> <prompt>

🌐 In serve mode

Start flm server first:

flm serve qwen3vl-it:4b

and then interact with Open WebUI or any OpenAI-compatible client!
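For clients other than Open WebUI, a hedged sketch of an image request is shown below; it assumes the server accepts OpenAI-style image_url content parts (base64 data URL), which is not spelled out in these notes.

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

# Encode a local image as a base64 data URL (placeholder path)
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="qwen3vl-it:4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)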


🌟 Summary

FastFlowLM v0.9.16 adds Qwen3-VL-4B-Instruct with full offline vision on AMD Ryzen AI — faster, smaller, and ready for edge.

🚀 FastFlowLM v0.9.15: New Embedding Capabilities & API Integration

24 Oct 22:17
d5f340a


🔎 1. New Model: EmbeddingGemma-300m

The first Embedding model on FLM:

  • Runs fully offline on AMD Ryzen™ AI NPU
  • Supports chunk sizes up to 2048 tokens

Try it out:

  1. Start flm in server mode with embedding model enabled:
flm serve gemma3:4b --embed 1 # Load embedding model in background, with concurrent LLM loading (gemma3:4b).

⚠️ Note: Embedding model is not allowed in CLI mode.


🌐 2. OpenAI-Compatible Embedding API: v1/embeddings

FastFlowLM now supports the OpenAI v1/embeddings endpoint, making it easy to integrate the embedding model into any OpenAI-compatible client or UI.

How to use:

  1. Start FLM server with embedding model enabled:
# serve
flm serve gemma3:4b --embed 1 # Load embedding model in background, with concurrent LLM loading (gemma3:4b).
  2. Send file(s) to:
    POST /v1/embeddings
    via any OpenAI client or Open WebUI.

Examples: Test in OpenAI client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:52625/v1",  # FastFlowLM's local API endpoint
    api_key="flm",  # Dummy key (FastFlowLM doesn't require authentication)
)

resp = client.embeddings.create(
    model="embed-gemma",
    input="Hi, everyone!",
)

print(resp.data[0].embedding)

Example: Open WebUI

  1. Follow the Open WebUI setup guide.
  2. In the bottom-left corner, click User icon, then select Settings.
  3. In the bottom panel, open Admin Settings.
  4. In the left sidebar, navigate to Documents.
  5. Set Embedding Model Engine to OpenAI.
  6. Enter:
    -- API Base URL: http://host.docker.internal:52625/v1
    -- API KEY: flm (any value works)
    -- Embedding Model: embed-gemma:300m
  7. Save the setting.
  8. Follow the RAG + FastFlowLM example to launch your Local Private Database with RAG all powered by FLM.

🙏 Acknowledgement

Special thanks to julienM77 for contributing the message normalizer that improves handling of corrupted user messages.


🌟 Summary

FastFlowLM v0.9.15 introduces offline embedding with EmbeddingGemma-300m and support for OpenAI's v1/embeddings API. Just start the server with --embed 1 and you're ready to build local, private, and intelligent applications.

🚀 FastFlowLM v0.9.14: New ASR Capabilities & API Integration

17 Oct 20:53
fbefef7


✨ What’s New

🎙️ 1. New Model: whisper-large-v3-turbo (by OpenAI)

The first Automatic Speech Recognition (ASR) model on FLM:

  • Runs fully offline on AMD Ryzen™ AI NPU
  • Multilingual audio recognition
  • Supports MP3, WAV, OGG and M4A formats
  • Lightweight footprint — only 900MB memory

Try it out:

  1. Start flm in CLI mode with ASR enabled:
# CLI
flm run gemma3:4b --asr 1 # Load the ASR model (whisper-v3:turbo) in the background, with concurrent LLM loading (gemma3:4b).
  2. Type (replace the path below with your audio file path):
/input "path\to\audio_sample.mp3" summarize it

🌐 2. OpenAI-Compatible ASR API: v1/audio/transcriptions API

FastFlowLM now supports the OpenAI v1/audio/transcriptions endpoint — making it easy to integrate ASR into any OpenAI-compatible client or UI.

How to use:

  1. Start your FLM server with ASR enabled:
# serve
flm serve gemma3:4b --asr 1 # Load the ASR model (whisper-v3:turbo) in the background, with concurrent LLM loading (gemma3:4b).
  2. Send audio to:
    POST /v1/audio/transcriptions
    via any OpenAI client or Open WebUI.

Examples: OpenAI client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:52625/v1",  # FastFlowLM's local API endpoint
    api_key="flm",  # Dummy key (FastFlowLM doesn’t require authentication)
)

with open("audio.mp3", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="whisper-v3",
        file=f,
    )
    print(resp.text)

Example: Open WebUI
1. Follow the Open WebUI setup guide.
2. In the bottom-left corner, click User icon, then select Settings.
3. In the bottom panel, open Admin Settings.
4. In the left sidebar, navigate to Audio.
5. Set Speech-to-Text Engine to OpenAI.
6. Enter:
-- API Base URL: http://host.docker.internal:52625/v1
-- API KEY: flm (any value works)
-- STT Model: whisper-large-v3-turbo (type in the model name; can be different)
7. Save the setting.
8. You're ready to upload audio files! (Choose an LLM to load and use concurrently)


🌟 Summary

FastFlowLM v0.9.14 introduces offline ASR with whisper-large-v3-turbo and support for OpenAI’s v1/audio/transcriptions API, making speech-to-text integration seamless across clients and WebUI. Just start the server with --asr 1 and you're ready to transcribe.

⚡ FastFlowLM v0.9.13 — GPT-OSS Speed Boost, Harmony Chat Template, and Ollama `/api/show` Support

10 Oct 18:06
ad40d17


FastFlowLM v0.9.13 delivers performance improvements, OpenAI Harmony support, and greater compatibility with Ollama-style endpoints.


🚀 Highlights

🔹 1. GPT-OSS:20b Prefill Speed Boost

  • Up to 20% faster prefill for gpt-oss:20b
  • Reduces latency for long prompts and improves responsiveness
  • v0.9.13 auto-downloads the updated model from HuggingFace.

🔹 2. OpenAI Harmony Integration

  • Added support for OpenAI’s Harmony Chat Template
  • Enables richer multi-turn dialogue and structured generation capabilities

🔹 3. /api/show Endpoint (Ollama Compatibility)

  • FastFlowLM now supports /api/show
  • Compatible with tools that rely on Ollama API conventions
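A quick way to exercise the new endpoint is sketched below; it assumes FLM mirrors Ollama's convention of POSTing a JSON body that names the model (check the FLM docs for the exact schema).

import requests

# Query model metadata via the Ollama-style /api/show endpoint.
# Assumption: the request body follows Ollama's {"model": ...} convention.
resp = requests.post(
    "http://localhost:52625/api/show",
    json={"model": "gpt-oss:20b"},
)
print(resp.json())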

🔹 4. /v1/completions (legacy OpenAI API) is now supported

  • OpenAI does not recommend it for new applications.
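For legacy clients, a minimal sketch (assuming the default port and a served model):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

# Legacy-style completion: a raw prompt instead of chat messages
resp = client.completions.create(
    model="gpt-oss:20b",
    prompt="Write a haiku about NPUs.",
    max_tokens=64,
)
print(resp.choices[0].text)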

⚙️ Optional Performance Tip

🆕 New NPU Driver Available
Upgrading to the latest AMD Ryzen AI NPU driver (v32.0.203.304) may boost prefill and decoding performance by an additional 5–10%.
📄 Download Link

Note: You will need an AMD account to access the driver.


This release makes FastFlowLM faster, more OpenAI-compatible, and ready for broader integration into multimodal workflows and edge deployments.

🚀 FastFlowLM v0.9.12 — GPT-OSS 20B, CORS Defaults, Cleaner Lists

02 Oct 16:28
68d7192


✨ What’s New

🧠 1. New Model: gpt-oss:20b

Introducing the first-ever MoE (Mixture of Experts) model to run natively on AMD Ryzen™ AI NPUs, and also the first MoE model released by FastFlowLM (FLM).

gpt-oss:20b is a fast, open-source MoE model by OpenAI — powered by FLM’s NPU-native MoE engine with MXFP4 support, delivering high throughput and power efficiency optimized for AMD NPUs.

  • Runs fully offline on AMD Ryzen™ AI NPU
  • Supports reasoning effort controls in both CLI and Server mode

Try it:

# CLI
flm run gpt-oss:20b
# Server
flm serve gpt-oss:20b

Set reasoning effort (CLI):

# CLI
flm run gpt-oss:20b
/set r-eff medium
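Set reasoning effort (Server mode): the sketch below is an assumption, passing an OpenAI-style reasoning_effort field through the chat completions API; the exact server-side parameter is not documented in these notes.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

# Assumption: the server honors an OpenAI-style "reasoning_effort" field.
resp = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Explain mixture-of-experts in two sentences."}],
    extra_body={"reasoning_effort": "medium"},
)
print(resp.choices[0].message.content)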

📝 NOTE

  • Memory Requirements
    ⚠️ Note: Running gpt-oss:20b may need a system with > 32 GB RAM. The model itself uses ~15.1 GB of memory in FLM, and there is an internal cap (~15.6 GB) on NPU memory allocation enforced by AMD/Microsoft, which makes only about half of the total system RAM available to the NPU. On 32 GB machines it may or may not work, so we recommend more RAM for a smooth experience.

🌐 2. Cross-Origin Resource Sharing (CORS)

CORS lets browser apps hosted on a different origin call your FLM server safely.

  • Enable CORS
flm serve --cors 1
  • Disable CORS
flm serve --cors 0

⚠️ Default: CORS is enabled.
🔒 Security tip: Disable CORS (or restrict at your proxy) if your server is exposed beyond localhost.


🔌 3. Default Server Port Change

The default port has moved from 11434 to 52625 to reduce conflicts.

Check or override anytime:

# Show current effective port
flm port

# Use a custom port for this session only
flm serve llama3.2:1b --port 8000
flm serve llama3.2:1b -p 8000

💡 --port (or -p) affects this run only and does not change your system defaults.


📃 4. Improved flm list

Cleaner output, filters, and a quiet mode.

Common uses:

# Default view (pretty, with icons)
flm list

# Quiet view (no emoji / minimal)
flm list --quiet

# Show everything
flm list --filter all --quiet

# Only models already installed
flm list --filter installed --quiet

# Only models not yet installed
flm list --filter not-installed --quiet

🌟 Summary

This release adds gpt-oss:20b, a state-of-the-art MoE model, enables CORS by default (toggleable), changes the default server port to 52625, and improves flm list with quieter, filterable output.