
Releases: FastFlowLM/FastFlowLM

🦃 FastFlowLM v0.9.21 — Thanksgiving LLaMA Turbocharge Release

24 Nov 13:05


Happy Thanksgiving!

Today we’re dropping one of our biggest speed upgrades ever for LLaMA and DeepSeek models (our first batch of models) — just in time for the holiday break. Fire up your Ryzen™ AI NPU and enjoy some seriously boosted performance. 🔥


🔄 1. Quantization Upgrade

  • All models migrated from AWQ to Q4_1
  • Improves LLM accuracy and output quality.

⚡ 2. Massive Decoding Speedup

  • llama3.2:1b: ~50% faster decoding, reaching 66 tps
  • llama3.2:3b: ~40% faster decoding, reaching 28 tps
  • llama3.1:8b: ~40% faster decoding, reaching 13 tps
  • deepseek-r1:8b: ~40% faster decoding, reaching 13 tps

🚀 3. Prefill Phase Optimized

  • Slight improvements to prefill speed for all of the models above, especially impactful for large-context initialization.

🎙️ 4. Standalone Whisper ASR Server

You can now serve Whisper (OpenAI’s ASR model) as a standalone model for speech transcription — or pair it with GPU LLMs in a hybrid pipeline.

Use either:

flm serve -a 1

or

flm serve --asr 1
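As a sketch of the hybrid pipeline mentioned above, the standalone Whisper endpoint can feed a transcript to a GPU-hosted LLM. This is a minimal example under assumptions: FLM on the default port 52625, the whisper-v3 model name from the v0.9.14 notes, and a placeholder OpenAI-compatible GPU endpoint and model name.

from openai import OpenAI

# FLM serving Whisper standalone on the NPU (default port assumed: 52625)
asr = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

# Placeholder: any GPU-hosted LLM exposing an OpenAI-compatible endpoint
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Step 1: transcribe the audio locally
with open("meeting.mp3", "rb") as f:  # placeholder audio file
    transcript = asr.audio.transcriptions.create(model="whisper-v3", file=f).text

# Step 2: hand the transcript to the GPU LLM
resp = llm.chat.completions.create(
    model="your-gpu-model",  # placeholder model name
    messages=[{"role": "user", "content": f"Summarize this transcript:\n{transcript}"}],
)
print(resp.choices[0].message.content)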

This release wraps up a bundle of performance gifts for LLaMA models on FastFlowLM.

Thank you for being part of the FastFlowLM journey — and happy Thanksgiving! 🦃🔥

🚀 FastFlowLM v0.9.20 — Massive Decoding Speed Boosts for GPT-OSS & Gemma3

20 Nov 16:04
88615a7


FastFlowLM v0.9.20 introduces substantial performance improvements across multiple model families, with special focus on decoding efficiency.


⚡ Performance Improvements

🔸 1. GPT-OSS Models

  • Decoding speed of gpt-oss:20b and gpt-oss-safeguard:20b reaches ~19 tokens/sec, over 60% faster at 1K context length.

🔸 2. Gemma3 Models

  • gemma3:4b reaches ~19 tokens/sec, a decoding speed boost of over 20% at 1K context length.
  • gemma3:1b reaches ~43 tokens/sec.
  • gemma3:270m reaches ~79 tokens/sec (note: this model is experimental).

This release is focused on raw speed—making FastFlowLM even more efficient for both high-capacity and portable deployments.

🐛 Fixes Qwen3-VL-Instruct Bug

14 Nov 22:10
35bce23


Qwen3VL-IT embedding bug resolved

  • Fixed an issue where the Qwen3VL-IT weights were not correctly uploaded to Hugging Face.

🚀 FastFlowLM v0.9.18 — Major Prefill Boost for GPT-OSS & Full LFM2 Integration

13 Nov 17:45
15292b0


✨ What’s New

🔥 1. Significant Prefill Acceleration for GPT-OSS Models

FastFlowLM’s execution pipeline has undergone deep optimization, resulting in significant improvements to prefill performance — including faster time to first token (TTFT) and more efficient long-context ingestion.

Currently, these enhancements apply to:

  • gpt-oss:20b
  • gpt-oss-safeguard:20b

⬇️ A model redownload is required.


💧 2. Full Support for LFM2-1.2B (Liquid Foundation Model)

FastFlowLM now supports LFM2-1.2B (LiquidAI), the first hybrid LLM architecture on the AMD Ryzen™ AI NPU, powered by FLM's highly efficient linear attention engine.

Try it:

  1. CLI mode
flm run lfm2:1.2b
  2. Serve mode
flm serve lfm2:1.2b

This release delivers a major prefill speed boost for GPT-OSS models and introduces full support for the LFM2-1.2B Liquid Foundation Model.

🚀 FastFlowLM v0.9.17 — Faster Vision Understanding for Qwen3VL, Improved Memory Management for GPT-OSS, and New OpenAI Safeguard Model

06 Nov 14:49
7bac50f


FastFlowLM v0.9.17 brings performance boosts to vision models, improves compatibility for systems with limited RAM, and adds a safer GPT-OSS variant.


✨ What’s New

⚡ 1. Faster Vision Understanding for Qwen3VL

  • Speedup for image embedding with Qwen3VL models.
  • Requires redownloading the model.

🧠 2. GPT-OSS Optimization for Low-RAM Machines

  • Improved memory management for gpt-oss:20b, increasing the chance of running on a 32 GB system (note: the NPU can access <50% of total RAM).
  • Speedup for decoding at long context lengths.
  • Requires a driver version higher than 32.0.203.304.
  • Requires redownloading the model.

🛡️ 3. New Model: gpt-oss-safeguard:20b

  • A safety-enhanced, instruction (prompt)-tuned variant of gpt-oss:20b.
  • Optimized for assistant-style tasks with safer responses.
  • Run with:
    flm run gpt-oss-sg:20b

⚠️ 4. API Focus: OpenAI-Compatible

  • Discontinuing Ollama API support.
  • Focusing on the OpenAI-compatible API.
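For reference, a minimal chat call against the OpenAI-compatible endpoint could look like the sketch below, assuming the default port 52625 and a model already loaded via flm serve.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:52625/v1",  # FastFlowLM's local API endpoint
    api_key="flm",  # Dummy key (FastFlowLM doesn't require authentication)
)

resp = client.chat.completions.create(
    model="gpt-oss:20b",  # any served model works, e.g. gemma3:4b
    messages=[{"role": "user", "content": "Give me one sentence about NPUs."}],
)
print(resp.choices[0].message.content)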

This release enhances speed and system flexibility, while expanding FastFlowLM’s model lineup with secure, large-scale assistants.

🚀 FastFlowLM v0.9.16 — Qwen3-VL arrives!

30 Oct 14:20
fecd8b1


✨ What’s included

🖼️ New vision model: Qwen3-VL-4B-Instruct

  • Fully offline on AMD Ryzen™ AI NPU
  • Optimized for lightweight, fast, vision-capable edge inference

Try it in two ways 👇

💻 CLI

Run model

flm run qwen3vl-it:4b

Then type

/input <image_path> <prompt>

🌐 In serve mode

Start flm server first:

flm serve qwen3vl-it:4b

and then interact with Open WebUI or any OpenAI-compatible client!
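For clients other than Open WebUI, a hedged sketch of an image request is shown below; it assumes the server accepts OpenAI-style image_url content parts (base64 data URL), which is not spelled out in these notes.

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

# Encode a local image as a base64 data URL (placeholder path)
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="qwen3vl-it:4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)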


🌟 Summary

FastFlowLM v0.9.16 adds Qwen3-VL-4B-Instruct with full offline vision on AMD Ryzen AI — faster, smaller, and ready for edge.

🚀 FastFlowLM v0.9.15: New Embedding Capabilities & API Integration

24 Oct 22:17
d5f340a


🔎 1. New Model: EmbeddingGemma-300m

The first Embedding model on FLM:

  • Runs fully offline on AMD Ryzen™ AI NPU
  • Supports chunk sizes up to 2048 tokens

Try it out:

  1. Start flm in server mode with embedding model enabled:
flm serve gemma3:4b --embed 1 # Load embedding model in background, with concurrent LLM loading (gemma3:4b).

⚠️ Note: Embedding model is not allowed in CLI mode.


🌐 2. OpenAI-Compatible Embedding API: v1/embeddings

FastFlowLM now supports the OpenAI v1/embeddings endpoint, making it easy to integrate the embedding model into any OpenAI-compatible client or UI.

How to use:

  1. Start FLM server with embedding model enabled:
# serve
flm serve gemma3:4b --embed 1 # Load embedding model in background, with concurrent LLM loading (gemma3:4b).
  2. Send file(s) to:
    POST /v1/embeddings
    via any OpenAI client or Open WebUI.

Examples: Test in OpenAI client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:52625/v1",  # FastFlowLM's local API endpoint
    api_key="flm",  # Dummy key (FastFlowLM doesn't require authentication)
)

resp = client.embeddings.create(
    model="embed-gemma",
    input="Hi, everyone!",
)

print(resp.data[0].embedding)

Example: Open WebUI

  1. Follow the Open WebUI setup guide.
  2. In the bottom-left corner, click User icon, then select Settings.
  3. In the bottom panel, open Admin Settings.
  4. In the left sidebar, navigate to Documents.
  5. Set Embedding Model Engine to OpenAI.
  6. Enter:
    -- API Base URL: http://host.docker.internal:52625/v1
    -- API KEY: flm (any value works)
    -- Embedding Model: embed-gemma:300m
  7. Save the setting.
  8. Follow the RAG + FastFlowLM example to launch your Local Private Database with RAG all powered by FLM.

🙏 Acknowledgement

Special thanks to julienM77 for contributing the message normalizer that improves handling of corrupted user messages.


🌟 Summary

FastFlowLM v0.9.15 introduces offline embedding with EmbeddingGemma-300m and support for OpenAI's v1/embeddings API. Just start the server with --embed 1 and you're ready to build local, private, and intelligent applications.

🚀 FastFlowLM v0.9.14: New ASR Capabilities & API Integration

17 Oct 20:53
fbefef7


✨ What’s New

🎙️ 1. New Model: whisper-large-v3-turbo (by OpenAI)

The first Automatic Speech Recognition (ASR) model on FLM:

  • Runs fully offline on AMD Ryzen™ AI NPU
  • Multilingual audio recognition
  • Supports MP3, WAV, OGG and M4A formats
  • Lightweight footprint — only 900MB memory

Try it out:

  1. Start flm in CLI mode with ASR enabled:
# CLI
flm run gemma3:4b --asr 1 # Load the ASR model (whisper-v3:turbo) in the background, with concurrent LLM loading (gemma3:4b).
  2. Type (replace the path below with your audio file path):
/input "path\to\audio_sample.mp3" summarize it

🌐 2. OpenAI-Compatible ASR API: v1/audio/transcriptions API

FastFlowLM now supports the OpenAI v1/audio/transcriptions endpoint — making it easy to integrate ASR into any OpenAI-compatible client or UI.

How to use:

  1. Start your FLM server with ASR enabled:
# serve
flm serve gemma3:4b --asr 1 # Load the ASR model (whisper-v3:turbo) in the background, with concurrent LLM loading (gemma3:4b).
  2. Send audio to:
    POST /v1/audio/transcriptions
    via any OpenAI client or Open WebUI.

Examples: OpenAI client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:52625/v1",  # FastFlowLM's local API endpoint
    api_key="flm",  # Dummy key (FastFlowLM doesn’t require authentication)
)

with open("audio.mp3", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="whisper-v3",
        file=f,
    )
    print(resp.text)

Example: Open WebUI
1. Follow the Open WebUI setup guide.
2. In the bottom-left corner, click User icon, then select Settings.
3. In the bottom panel, open Admin Settings.
4. In the left sidebar, navigate to Audio.
5. Set Speech-to-Text Engine to OpenAI.
6. Enter:
-- API Base URL: http://host.docker.internal:52625/v1
-- API KEY: flm (any value works)
-- STT Model: whisper-large-v3-turbo (type in the model name; can be different)
7. Save the setting.
8. You're ready to upload audio files! (Choose an LLM to load and use concurrently)


🌟 Summary

FastFlowLM v0.9.14 introduces offline ASR with whisper-large-v3-turbo and support for OpenAI’s v1/audio/transcriptions API, making speech-to-text integration seamless across clients and WebUI. Just start the server with --asr 1 and you're ready to transcribe.

⚡ FastFlowLM v0.9.13 — GPT-OSS Speed Boost, Harmony Chat Template, and Ollama `/api/show` Support

10 Oct 18:06
ad40d17


FastFlowLM v0.9.13 delivers performance improvements, OpenAI Harmony support, and greater compatibility with Ollama-style endpoints.


🚀 Highlights

🔹 1. GPT-OSS:20b Prefill Speed Boost

  • Up to 20% faster prefill for gpt-oss:20b
  • Reduces latency for long prompts and improves responsiveness
  • v0.9.13 auto-downloads the updated model from HuggingFace.

🔹 2. OpenAI Harmony Integration

  • Added support for OpenAI’s Harmony Chat Template
  • Enables richer multi-turn dialogue and structured generation capabilities

🔹 3. /api/show Endpoint (Ollama Compatibility)

  • FastFlowLM now supports /api/show
  • Compatible with tools that rely on Ollama API conventions
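A quick way to exercise the new endpoint is sketched below; it assumes FLM mirrors Ollama's convention of POSTing a JSON body that names the model (check the FLM docs for the exact schema).

import requests

# Query model metadata via the Ollama-style /api/show endpoint.
# Assumption: the request body follows Ollama's {"model": ...} convention.
resp = requests.post(
    "http://localhost:52625/api/show",
    json={"model": "gpt-oss:20b"},
)
print(resp.json())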

🔹 4. /v1/completions (legacy OpenAI API) is now supported

  • OpenAI does not recommend it for new applications.
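For legacy clients, a minimal sketch (assuming the default port and a served model):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

# Legacy-style completion: a raw prompt instead of chat messages
resp = client.completions.create(
    model="gpt-oss:20b",
    prompt="Write a haiku about NPUs.",
    max_tokens=64,
)
print(resp.choices[0].text)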

⚙️ Optional Performance Tip

🆕 New NPU Driver Available
Upgrading to the latest AMD Ryzen AI NPU driver (v32.0.203.304) may boost prefill and decoding performance by an additional 5–10%.
📄 Download Link

Note: You will need an AMD account to access the driver.


This release makes FastFlowLM faster, more OpenAI-compatible, and ready for broader integration into multimodal workflows and edge deployments.

🚀 FastFlowLM v0.9.12 — GPT-OSS 20B, CORS Defaults, Cleaner Lists

02 Oct 16:28
68d7192


✨ What’s New

🧠 1. New Model: gpt-oss:20b

Introducing the first-ever MoE (Mixture of Experts) model to run natively on AMD Ryzen™ AI NPUs, and also the first MoE model released by FastFlowLM (FLM).

gpt-oss:20b is a fast, open-source MoE model by OpenAI — powered by FLM’s NPU-native MoE engine with MXFP4 support, delivering high throughput and power efficiency optimized for AMD NPUs.

  • Runs fully offline on AMD Ryzen™ AI NPU
  • Supports reasoning effort controls in both CLI and Server mode

Try it:

# CLI
flm run gpt-oss:20b
# Server
flm serve gpt-oss:20b

Set reasoning effort (CLI):

# CLI
flm run gpt-oss:20b
/set r-eff medium
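Set reasoning effort (Server mode): the sketch below is an assumption, passing an OpenAI-style reasoning_effort field through the chat completions API; the exact server-side parameter is not documented in these notes.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

# Assumption: the server honors an OpenAI-style "reasoning_effort" field.
resp = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Explain mixture-of-experts in two sentences."}],
    extra_body={"reasoning_effort": "medium"},
)
print(resp.choices[0].message.content)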

📝 NOTE

  • Memory Requirements
    ⚠️ Note: Running gpt-oss:20b may need a system with > 32 GB RAM. The model itself uses ~15.1 GB of memory in FLM, and there is an internal cap (~15.6 GB) on NPU memory allocation enforced by AMD/Microsoft, which makes only about half of the total system RAM available to the NPU. On 32 GB machines it may or may not work, so we recommend more RAM for a smooth experience.

🌐 2. Cross-Origin Resource Sharing (CORS)

CORS lets browser apps hosted on a different origin call your FLM server safely.

  • Enable CORS
flm serve --cors 1
  • Disable CORS
flm serve --cors 0

⚠️ Default: CORS is enabled.
🔒 Security tip: Disable CORS (or restrict at your proxy) if your server is exposed beyond localhost.


🔌 3. Default Server Port Change

The default port has moved from 11434 to 52625 to reduce conflicts.

Check or override anytime:

# Show current effective port
flm port

# Use a custom port for this session only
flm serve llama3.2:1b --port 8000
flm serve llama3.2:1b -p 8000

💡 --port (or -p) affects this run only and does not change your system defaults.


📃 4. Improved flm list

Cleaner output, filters, and a quiet mode.

Common uses:

# Default view (pretty, with icons)
flm list

# Quiet view (no emoji / minimal)
flm list --quiet

# Show everything
flm list --filter all --quiet

# Only models already installed
flm list --filter installed --quiet

# Only models not yet installed
flm list --filter not-installed --quiet

🌟 Summary

This release adds gpt-oss:20b, a state-of-the-art MoE model, enables CORS by default (toggleable), changes the default server port to 52625, and improves flm list with quieter, filterable output.