This program generates images from text prompts (and optionally from other images) using the FLUX.2-klein-4B model from Black Forest Labs. It can be used as a library as well, and is implemented entirely in C, with zero external dependencies beyond the C standard library. MPS and BLAS acceleration are optional but recommended.
```sh
# Build (choose your backend)
make mps           # Apple Silicon (fastest)
# or: make blas    # Intel Mac / Linux with OpenBLAS
# or: make generic # Pure C, no dependencies

# Download the model (~16GB) - pick one:
./download_model.sh   # using curl
# or: pip install huggingface_hub && python download_model.py

# Generate an image
./flux -d flux-klein-model -p "A woman wearing sunglasses" -o output.png
```

That's it. No Python runtime or CUDA toolkit required at inference time.
Generated with: `./flux -d flux-klein-model -p "A picture of a woman in 1960 America. Sunglasses. ASA 400 film. Black and White." -W 512 -H 512 -o woman.png`

Generated with: `./flux -i antirez.png -o antirez_to_drawing.png -p "make it a drawing" -d flux-klein-model`
- Zero dependencies: Pure C implementation, works standalone. BLAS optional for ~30x speedup (Apple Accelerate on macOS, OpenBLAS on Linux)
- Metal GPU acceleration: Automatic on Apple Silicon Macs. Performance matches PyTorch's optimized MPS pipeline
- Runs where Python can't: Memory-mapped weights (default) enable inference on 8GB RAM systems where the Python ML stack cannot run FLUX.2 at all
- Text-to-image: Generate images from text prompts
- Image-to-image: Transform existing images guided by prompts
- Multi-reference: Combine multiple reference images (e.g., `-i car.png -i beach.png` for "car on beach")
- Integrated text encoder: Qwen3-4B encoder built-in, no external embedding computation needed
- Memory efficient: Automatic encoder release after encoding (~8GB freed)
- Memory-mapped weights: Enabled by default. Reduces peak memory from ~16GB to ~4-5GB. Fastest mode on MPS; BLAS users with plenty of RAM may prefer `--no-mmap` for faster inference
- Size-independent seeds: Same seed produces similar compositions at different resolutions. Explore at 256×256, then render at 512×512 with the same seed
- Terminal image display: Watch the resulting image without leaving your terminal (Ghostty, Kitty, or iTerm2)
Display generated images directly in your terminal with `--show`, or watch the denoising process step-by-step with `--show-steps`:
```sh
# Display final image in terminal (auto-detects Kitty/Ghostty/iTerm2)
./flux -d flux-klein-model -p "a cute robot" -o robot.png --show

# Display each denoising step (slower, but interesting to watch)
./flux -d flux-klein-model -p "a cute robot" -o robot.png --show-steps
```

Requires a terminal supporting the Kitty graphics protocol (such as Kitty or Ghostty), or iTerm2. Terminal type is auto-detected from environment variables.
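For reference, this kind of detection can be done from the environment alone. Below is a minimal, hypothetical sketch; the exact variables and values this program checks are an assumption, not taken from its source:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical terminal detection: Kitty and Ghostty implement the Kitty
 * graphics protocol; iTerm2 uses its own inline-image escape sequence. */
static const char *detect_graphics_protocol(void) {
    const char *term = getenv("TERM");
    const char *prog = getenv("TERM_PROGRAM");
    if (term && strstr(term, "kitty"))   return "kitty";
    if (term && strstr(term, "ghostty")) return "kitty";  /* same protocol */
    if (prog && strstr(prog, "iTerm"))   return "iterm2";
    return "none";
}

int main(void) {
    printf("graphics protocol: %s\n", detect_graphics_protocol());
    return 0;
}
```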
```sh
./flux -d flux-klein-model -p "A fluffy orange cat sitting on a windowsill" -o cat.png
```

Transform an existing image based on a prompt:
```sh
./flux -d flux-klein-model -p "oil painting style" -i photo.png -o painting.png
```

FLUX.2 uses in-context conditioning for image-to-image generation. Unlike traditional approaches that add noise to the input image, FLUX.2 passes the reference image as additional tokens that the model can attend to during generation. This means:
- The model "sees" your input image and uses it as a reference
- The prompt describes what you want the output to look like
- Results tend to preserve the composition while applying the described transformation
Tips for good results:
- Use prompts that describe the desired output, not instructions
- Good: `"oil painting of a woman with sunglasses, impressionist style"`
- Less good: `"make it an oil painting"` (instructional prompts may work less well)
Super Resolution: Since the reference image can be a different size than the output, you can use img2img for upscaling:
```sh
./flux -d flux-klein-model -i small.png -W 1024 -H 1024 -o big.png -p "Create an exact copy of the input image."
```

The model will generate a higher-resolution version while preserving the composition and details of the input.
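The same upscaling trick works through the library API described later in this README; here is a minimal sketch using the documented `flux_img2img` call, setting the output size independently of the input:

```c
#include "flux.h"
#include <stdio.h>

/* Upscale via img2img: the reference image can be smaller than the output. */
int main(void) {
    flux_ctx *ctx = flux_load_dir("flux-klein-model");
    if (!ctx) { fprintf(stderr, "%s\n", flux_get_error()); return 1; }

    flux_image *small = flux_image_load("small.png");
    if (!small) { flux_free(ctx); return 1; }

    flux_params params = FLUX_PARAMS_DEFAULT;
    params.width = 1024;   /* output resolution larger than the input */
    params.height = 1024;

    flux_image *big = flux_img2img(ctx, "Create an exact copy of the input image.",
                                   small, &params);
    flux_image_free(small);
    if (big) {
        flux_image_save(big, "big.png");
        flux_image_free(big);
    }
    flux_free(ctx);
    return 0;
}
```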
Combine elements from multiple reference images:
```sh
./flux -d flux-klein-model -i car.png -i beach.png -p "a sports car on the beach" -o result.png
```

Each reference image is encoded separately and passed to the transformer with different positional embeddings (T=10, T=20, T=30, ...). The model attends to all references during generation, allowing it to combine elements from each.
Example:
- Reference 1: A red sports car
- Reference 2: A tropical beach with palm trees
- Prompt: "combine the two images"
- Result: A red sports car on a tropical beach
You can specify up to 16 reference images with multiple `-i` flags. The prompt guides how the references are combined.
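To make the T-offset scheme described above concrete, here is a toy sketch of how each reference's tokens could receive distinct (T, H, W) position ids. The layout is illustrative only; the real representation lives inside the transformer:

```c
#include <stdio.h>

/* Toy (T, H, W) position ids: reference r sits at T = 10*(r+1), and its
 * tokens span the latent grid in H and W. Illustrative, not library code. */
int main(void) {
    int num_refs = 2, grid = 2;  /* tiny 2x2 latent grid per reference */
    for (int r = 0; r < num_refs; r++) {
        int t = 10 * (r + 1);    /* T = 10, 20, 30, ... */
        for (int h = 0; h < grid; h++)
            for (int w = 0; w < grid; w++)
                printf("ref %d token: (T=%d, H=%d, W=%d)\n", r, t, h, w);
    }
    return 0;
}
```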
Start without `-p` to enter interactive mode:
```sh
./flux -d flux-klein-model
```

Generate images by typing prompts. Each image gets a `$N` reference ID:
```
flux> a red sports car
Done -> /tmp/flux-.../image-0001.png (ref $0)
flux> a tropical beach
Done -> /tmp/flux-.../image-0002.png (ref $1)
flux> $0 $1 combine them
Generating 256x256 (multi-ref, 2 images)...
Done -> /tmp/flux-.../image-0003.png (ref $2)
```
Prompt syntax:
- `prompt` - text-to-image
- `512x512 prompt` - set size inline
- `$ prompt` - img2img with last image
- `$N prompt` - img2img with reference $N
- `$0 $3 prompt` - multi-reference (combine images)
Commands: `!help`, `!save`, `!load`, `!seed`, `!size`, `!steps`, `!explore`, `!show`, `!quit`
Required:

```
-d, --dir PATH          Path to model directory
-p, --prompt TEXT       Text prompt for generation
-o, --output PATH       Output image path (.png or .ppm)
```

Generation options:

```
-W, --width N           Output width in pixels (default: 256)
-H, --height N          Output height in pixels (default: 256)
-s, --steps N           Sampling steps (default: 4)
-S, --seed N            Random seed for reproducibility
```

Image-to-image options:

```
-i, --input PATH        Reference image (can be specified multiple times)
```

Output options:

```
-q, --quiet             Silent mode, no output
-v, --verbose           Show detailed config and timing info
--show                  Display image in terminal (auto-detects Kitty/Ghostty/iTerm2)
--show-steps            Display each denoising step (slower)
```

Other options:

```
-m, --mmap              Memory-mapped weights (default, fastest on MPS)
--no-mmap               Disable mmap, load all weights upfront
-e, --embeddings PATH   Load pre-computed text embeddings (advanced)
-h, --help              Show help
```
The seed is always printed to stderr, even when random:
```
$ ./flux -d flux-klein-model -p "a landscape" -o out.png
Seed: 1705612345
...
Saving... out.png 256x256 (0.1s)
```
To reproduce the same image, use the printed seed:
```
$ ./flux -d flux-klein-model -p "a landscape" -o out.png -S 1705612345
```
Generated PNG images include metadata with the seed and model information, so you can always recover the seed even if you didn't save the terminal output:
```sh
# Using exiftool
exiftool image.png | grep flux

# Using Python/PIL
python3 -c "from PIL import Image; print(Image.open('image.png').info)"

# Using ImageMagick
identify -verbose image.png | grep -A1 "Properties:"
```

The following metadata fields are stored:

- `flux:seed` - The random seed used for generation
- `flux:model` - The model name (FLUX.2-klein-4B)
- `Software` - Program identifier
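From the library, the write side of this is `flux_image_save_with_seed` (listed in the API reference below); a minimal sketch, assuming you already hold a generated image and the seed you used:

```c
#include "flux.h"
#include <stdint.h>

/* Embed the seed in the PNG metadata so it survives even if the terminal
 * output is lost. img and seed come from your generation code. */
int save_with_metadata(const flux_image *img, int64_t seed) {
    return flux_image_save_with_seed(img, "image.png", seed); /* 0 on success */
}
```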
Choose a backend when building:
```sh
make          # Show available backends
make generic  # Pure C, no dependencies (slow)
make blas     # BLAS acceleration (~30x faster)
make mps      # Apple Silicon Metal GPU (fastest, macOS only)
```

Recommended:

- macOS Apple Silicon: `make mps`
- macOS Intel: `make blas`
- Linux with OpenBLAS: `make blas`
- Linux without OpenBLAS: `make generic`
For `make blas` on Linux, install OpenBLAS first:
```sh
# Ubuntu/Debian
sudo apt install libopenblas-dev

# Fedora
sudo dnf install openblas-devel
```

Other targets:

```sh
make clean  # Clean build artifacts
make info   # Show available backends for this platform
```

Run the test suite to verify your build produces correct output:
```sh
make test        # Run all 3 tests
make test-quick  # Run only the quick 64x64 test
```

The tests compare generated images against reference images in `test_vectors/`. A test passes if the maximum pixel difference is within tolerance (to allow for minor floating-point variations across platforms); a sketch of reproducing that check follows the test-case table below.
Test cases:
| Test | Size | Steps | Purpose |
|---|---|---|---|
| Quick | 64×64 | 2 | Fast txt2img sanity check |
| Full | 512×512 | 4 | Validates txt2img at larger resolution |
| img2img | 256×256 | 4 | Validates image-to-image transformation |
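The pass criterion is easy to reproduce by hand. Here is a minimal sketch using the library's image loader; note it assumes `flux_image` exposes its RGB bytes through a `data` field, which is an assumption (only `width` and `height` appear in this README):

```c
#include <stdio.h>
#include <stdlib.h>
#include "flux.h"

/* Report the maximum per-channel pixel difference between two images,
 * the quantity the test tolerance is described in terms of. */
int main(int argc, char **argv) {
    if (argc != 3) { fprintf(stderr, "usage: %s a.png b.png\n", argv[0]); return 1; }
    flux_image *a = flux_image_load(argv[1]);
    flux_image *b = flux_image_load(argv[2]);
    if (!a || !b || a->width != b->width || a->height != b->height) return 1;
    int maxdiff = 0;
    size_t n = (size_t)a->width * a->height * 3;      /* assuming packed RGB */
    for (size_t i = 0; i < n; i++) {
        int d = abs((int)a->data[i] - (int)b->data[i]); /* `data` is assumed */
        if (d > maxdiff) maxdiff = d;
    }
    printf("max pixel diff: %d\n", maxdiff);
    flux_image_free(a);
    flux_image_free(b);
    return 0;
}
```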
You can also run the test script directly for more options:
```sh
python3 run_test.py --help
python3 run_test.py --quick       # Quick test only
python3 run_test.py --flux-binary ./flux --model-dir /path/to/model
```

Download the model weights (~16GB) from HuggingFace using one of these methods:
Option 1: Shell script (requires curl)

```sh
./download_model.sh
```

Option 2: Python script (requires huggingface_hub)

```sh
pip install huggingface_hub
python download_model.py
```

Both download the same files to `./flux-klein-model`:
- VAE (~300MB)
- Transformer (~4GB)
- Qwen3-4B Text Encoder (~8GB)
- Tokenizer
Benchmarks on Apple M3 Max (128GB RAM), generating a 4-step image.
The MPS implementation matches PyTorch's optimized MPS pipeline, and is noticeably faster at small image sizes.
| Size | C (MPS) | PyTorch (MPS) |
|---|---|---|
| 256x256 | 5.6s | 11s |
| 512x512 | 9.1s | 13s |
| 1024x1024 | 26s | 25s |
Notes:
- All times measured as wall clock, including model loading, no warmup. PyTorch times exclude library import overhead (~5-10s) to be fair.
- The C BLAS backend (CPU) is not shown.
- The `make generic` backend (pure C, no BLAS) is approximately 30x slower than BLAS and not included in benchmarks.
- The fastest Metal implementation remains the Draw Things app, which can produce a 1024x1024 image in just 14 seconds!
Maximum resolution: 1792x1792 pixels. The model produces good results up to this size; beyond this resolution image quality degrades significantly (this is a model limitation, not an implementation issue).
Minimum resolution: 64x64 pixels.
Dimensions should be multiples of 16 (the VAE's 8x spatial downsampling combined with the transformer's 2x2 latent patching).
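If you compute sizes programmatically, a small helper can clamp and snap a requested dimension to these constraints (an illustrative helper, not part of the library API):

```c
#include <stdio.h>

/* Clamp to [64, 1792] and round down to a multiple of 16. */
static int snap_dim(int px) {
    if (px < 64) px = 64;
    if (px > 1792) px = 1792;
    return px - (px % 16);
}

int main(void) {
    printf("%d %d %d\n", snap_dim(500), snap_dim(50), snap_dim(2000));
    /* prints: 496 64 1792 */
    return 0;
}
```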
FLUX.2-klein-4B is a rectified flow transformer optimized for fast inference:
| Component | Architecture |
|---|---|
| Transformer | 5 double blocks + 20 single blocks, 3072 hidden dim, 24 attention heads |
| VAE | AutoencoderKL, 128 latent channels, 8x spatial compression |
| Text Encoder | Qwen3-4B, 36 layers, 2560 hidden dim |
Inference steps: This is a distilled model that produces good results with exactly 4 sampling steps.
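For intuition, rectified-flow sampling is just a few Euler steps along a predicted velocity field. The toy sketch below uses a stand-in velocity function and a made-up schedule; the names and dynamics are hypothetical, not this project's code:

```c
#include <stdio.h>

#define LATENT_LEN 4  /* toy size; real latents are far larger */

/* Stand-in for the transformer's velocity prediction at time t. */
static void predict_velocity(const float *latent, float t, float *v) {
    (void)t;
    for (int i = 0; i < LATENT_LEN; i++) v[i] = -latent[i]; /* toy dynamics */
}

int main(void) {
    float latent[LATENT_LEN] = {1.0f, -0.5f, 0.25f, 2.0f};
    float v[LATENT_LEN];
    const float sigma[5] = {1.0f, 0.75f, 0.5f, 0.25f, 0.0f}; /* toy schedule */
    for (int step = 0; step < 4; step++) {  /* klein is distilled for 4 steps */
        predict_velocity(latent, sigma[step], v);
        float dt = sigma[step + 1] - sigma[step];
        for (int i = 0; i < LATENT_LEN; i++)
            latent[i] += dt * v[i];  /* Euler step along the flow */
    }
    printf("final latent[0] = %f\n", latent[0]);
    return 0;
}
```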
With mmap (default):
| Phase | Memory |
|---|---|
| Text encoding | ~2GB (layers loaded on-demand) |
| Diffusion | ~1-2GB (blocks loaded on-demand) |
| Peak | ~4-5GB |
With `--no-mmap` (all weights in RAM):
| Phase | Memory |
|---|---|
| Text encoding | ~8GB (encoder weights) |
| Diffusion | ~8GB (transformer ~4GB + VAE ~300MB + activations) |
| Peak | ~16GB (if encoder not released) |
The text encoder is automatically released after encoding, reducing peak memory during diffusion. If you generate multiple images with different prompts, the encoder reloads automatically.
Memory-mapped weight loading is enabled by default. Use `--no-mmap` to disable it and load all weights upfront.
```sh
./flux -d flux-klein-model -p "A cat" -o cat.png             # mmap (default)
./flux -d flux-klein-model -p "A cat" -o cat.png --no-mmap   # load all upfront
```

How it works: Instead of loading all model weights into RAM upfront, mmap keeps the safetensors files memory-mapped and loads weights on-demand:
- Text encoder (Qwen3): Each of the 36 transformer layers (~400MB each) is loaded, processed, and immediately freed. Only ~2GB stays resident instead of ~8GB.
- Denoising transformer: Each of the 5 double-blocks (~300MB) and 20 single-blocks (~150MB) is loaded on-demand and freed after use. Only ~200MB of shared weights stays resident instead of ~4GB.
This reduces peak memory from ~16GB to ~4-5GB, making inference possible on 16GB RAM systems where the Python ML stack cannot run FLUX.2 at all.
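The underlying mechanism is ordinary POSIX mmap. The sketch below shows the pattern in isolation (simplified; the real loader also parses the safetensors header to locate each tensor's byte range):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a weights file read-only. Pages fault in on first access and can be
 * evicted under memory pressure, which keeps the resident set small. */
int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s weights.safetensors\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    const uint8_t *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after close */
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    printf("mapped %lld bytes; first byte: %u\n",
           (long long)st.st_size, (unsigned)base[0]);
    munmap((void *)base, st.st_size);
    return 0;
}
```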
Performance varies by backend:

- MPS (Apple Silicon): mmap is the fastest mode. The model stores weights in bf16 format, and MPS uses them directly via zero-copy pointers into the memory-mapped region. No conversion overhead, and the kernel handles paging efficiently.
- BLAS (CPU): mmap is slightly slower but uses much less RAM. BLAS requires f32 weights, so each block must be converted from bf16→f32 on every step (25 blocks × 4 steps = 100 conversions). With `--no-mmap`, this conversion happens once at startup (see the bf16 sketch after this list). Recommendation: If you have 32GB+ RAM and use BLAS, try `--no-mmap` for faster inference. If RAM is limited, mmap lets you run at all.
- Generic (pure C): Same tradeoffs as BLAS, but slower overall.
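For context on the bf16→f32 cost: bf16 is simply the top 16 bits of an IEEE-754 float, so per-element conversion is a shift, and the overhead is dominated by memory traffic. A minimal sketch:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Expand bf16 (the high half of a float's bit pattern) to f32. */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void) {
    uint16_t one = 0x3F80;             /* bf16 encoding of 1.0 */
    printf("%f\n", bf16_to_f32(one));  /* prints 1.000000 */
    return 0;
}
```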
The library can be integrated into your own C/C++ projects. Link against `libflux.a` and include `flux.h`.
Here's a complete program that generates an image from a text prompt:
#include "flux.h"
#include <stdio.h>
int main(void) {
/* Load the model. This loads VAE, transformer, and text encoder. */
flux_ctx *ctx = flux_load_dir("flux-klein-model");
if (!ctx) {
fprintf(stderr, "Failed to load model: %s\n", flux_get_error());
return 1;
}
/* Configure generation parameters. Start with defaults and customize. */
flux_params params = FLUX_PARAMS_DEFAULT;
params.width = 512;
params.height = 512;
params.seed = 42; /* Use -1 for random seed */
/* Generate the image. This handles text encoding, diffusion, and VAE decode. */
flux_image *img = flux_generate(ctx, "A fluffy orange cat in a sunbeam", ¶ms);
if (!img) {
fprintf(stderr, "Generation failed: %s\n", flux_get_error());
flux_free(ctx);
return 1;
}
/* Save to file. Format is determined by extension (.png or .ppm). */
flux_image_save(img, "cat.png");
printf("Saved cat.png (%dx%d)\n", img->width, img->height);
/* Clean up */
flux_image_free(img);
flux_free(ctx);
return 0;
}Compile with:
gcc -o myapp myapp.c -L. -lflux -lm -framework Accelerate # macOS
gcc -o myapp myapp.c -L. -lflux -lm -lopenblas # LinuxTransform an existing image guided by a text prompt using in-context conditioning:
#include "flux.h"
#include <stdio.h>
int main(void) {
flux_ctx *ctx = flux_load_dir("flux-klein-model");
if (!ctx) return 1;
/* Load the input image */
flux_image *photo = flux_image_load("photo.png");
if (!photo) {
fprintf(stderr, "Failed to load image\n");
flux_free(ctx);
return 1;
}
/* Set up parameters. Output size defaults to input size. */
flux_params params = FLUX_PARAMS_DEFAULT;
params.seed = 123;
/* Transform the image - describe the desired output */
flux_image *painting = flux_img2img(ctx, "oil painting of the scene, impressionist style",
photo, ¶ms);
flux_image_free(photo); /* Done with input */
if (!painting) {
fprintf(stderr, "Transformation failed: %s\n", flux_get_error());
flux_free(ctx);
return 1;
}
flux_image_save(painting, "painting.png");
printf("Saved painting.png\n");
flux_image_free(painting);
flux_free(ctx);
return 0;
}
### Generating Multiple Images
When generating multiple images with different seeds but the same prompt, you can avoid reloading the text encoder:
```c
flux_ctx *ctx = flux_load_dir("flux-klein-model");
flux_params params = FLUX_PARAMS_DEFAULT;
params.width = 256;
params.height = 256;

/* Generate 5 variations with different seeds */
for (int i = 0; i < 5; i++) {
    flux_set_seed(1000 + i);
    flux_image *img = flux_generate(ctx, "A mountain landscape at sunset", &params);
    char filename[64];
    snprintf(filename, sizeof(filename), "landscape_%d.png", i);
    flux_image_save(img, filename);
    flux_image_free(img);
}
flux_free(ctx);
```

Note: The text encoder (~8GB) is automatically released after the first generation to save memory. It reloads automatically if you use a different prompt.
All functions that can fail return NULL on error. Use `flux_get_error()` to get a description:
```c
flux_ctx *ctx = flux_load_dir("nonexistent-model");
if (!ctx) {
    fprintf(stderr, "Error: %s\n", flux_get_error());
    /* Prints something like: "Failed to load VAE - cannot generate images" */
    return 1;
}
```

Core functions:
```c
flux_ctx *flux_load_dir(const char *model_dir);  /* Load model, returns NULL on error */
void flux_free(flux_ctx *ctx);                   /* Free all resources */
flux_image *flux_generate(flux_ctx *ctx, const char *prompt, const flux_params *params);
flux_image *flux_img2img(flux_ctx *ctx, const char *prompt, const flux_image *input,
                         const flux_params *params);
```

Image handling:

```c
flux_image *flux_image_load(const char *path);                /* Load PNG or PPM */
int flux_image_save(const flux_image *img, const char *path); /* 0=success, -1=error */
int flux_image_save_with_seed(const flux_image *img, const char *path, int64_t seed); /* Save with metadata */
flux_image *flux_image_resize(const flux_image *img, int new_w, int new_h);
void flux_image_free(flux_image *img);
```

Utilities:

```c
void flux_set_seed(int64_t seed);               /* Set RNG seed for reproducibility */
const char *flux_get_error(void);               /* Get last error message */
void flux_release_text_encoder(flux_ctx *ctx);  /* Manually free ~8GB (optional) */
```

```c
typedef struct {
    int width;      /* Output width in pixels (default: 256) */
    int height;     /* Output height in pixels (default: 256) */
    int num_steps;  /* Denoising steps, use 4 for klein (default: 4) */
    int64_t seed;   /* Random seed, -1 for random (default: -1) */
} flux_params;

/* Initialize with sensible defaults */
#define FLUX_PARAMS_DEFAULT { 256, 256, 4, -1 }
```

When debugging img2img issues, the `--debug-py` flag allows you to run the C implementation with exact inputs saved from a Python reference script. This isolates whether differences are due to input preparation (VAE encoding, text encoding, noise generation) or the transformer itself.
Setup:
- Set up the Python environment:

```sh
python -m venv flux_env
source flux_env/bin/activate
pip install torch diffusers transformers safetensors einops huggingface_hub
```

- Clone the flux2 reference (for the model class):

```sh
git clone https://github.com/black-forest-labs/flux flux2
```

- Run the Python debug script to save inputs:

```sh
python debug/debug_img2img_compare.py
```

This saves to /tmp/:

  - `py_noise.bin` - Initial noise tensor
  - `py_ref_latent.bin` - VAE-encoded reference image
  - `py_text_emb.bin` - Text embeddings from Qwen3

- Run C with the same inputs:

```sh
./flux -d flux-klein-model --debug-py -W 256 -H 256 --steps 4 -o /tmp/c_debug.png
```

- Compare the outputs visually or numerically.
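For the numeric comparison, a small utility that reports the maximum absolute difference between two dumps can help; a minimal sketch, assuming the .bin files are raw f32 arrays of equal length (that format is an assumption):

```c
#include <math.h>
#include <stdio.h>

/* Max absolute elementwise difference between two raw f32 dumps. */
int main(int argc, char **argv) {
    if (argc != 3) { fprintf(stderr, "usage: %s a.bin b.bin\n", argv[0]); return 1; }
    FILE *fa = fopen(argv[1], "rb"), *fb = fopen(argv[2], "rb");
    if (!fa || !fb) { perror("fopen"); return 1; }
    float a, b, maxdiff = 0.0f;
    while (fread(&a, sizeof a, 1, fa) == 1 && fread(&b, sizeof b, 1, fb) == 1) {
        float d = fabsf(a - b);
        if (d > maxdiff) maxdiff = d;
    }
    printf("max abs diff: %g\n", maxdiff);
    fclose(fa);
    fclose(fb);
    return 0;
}
```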
What this helps diagnose:
- If C and Python produce identical outputs with identical inputs, any differences in normal operation are due to input preparation (VAE, text encoder, RNG)
- If outputs differ even with identical inputs, the issue is in the transformer or sampling implementation
The `debug/` directory contains Python scripts for comparing C and Python implementations:

- `debug_img2img_compare.py` - Full img2img comparison with step-by-step statistics
- `debug_rope_img2img.py` - Verify RoPE position encoding matches between C and Python
MIT


