This program generates images from text prompts (and optionally from other images) using the FLUX.2-klein-4B model from Black Forest Labs. It can be used as a library as well, and is implemented entirely in C, with zero external dependencies beyond the C standard library. MPS and BLAS acceleration are optional but recommended.
```sh
# Build (choose your backend)
make mps           # Apple Silicon (fastest)
# or: make blas    # Intel Mac / Linux with OpenBLAS
# or: make generic # Pure C, no dependencies

# Download the model (~16GB) - pick one:
./download_model.sh   # using curl
# or: pip install huggingface_hub && python download_model.py

# Generate an image
./flux -d flux-klein-model -p "A woman wearing sunglasses" -o output.png
```

That's it. No Python runtime or CUDA toolkit required at inference time.
Generated with: `./flux -d flux-klein-model -p "A picture of a woman in 1960 America. Sunglasses. ASA 400 film. Black and White." -W 512 -H 512 -o woman.png`

Generated with: `./flux -i antirez.png -o antirez_to_drawing.png -p "make it a drawing" -d flux-klein-model`
- Zero dependencies: Pure C implementation, works standalone. BLAS optional for ~30x speedup (Apple Accelerate on macOS, OpenBLAS on Linux)
- Metal GPU acceleration: Automatic on Apple Silicon Macs. Performance matches PyTorch's optimized MPS pipeline
- Runs where Python can't: Memory-mapped weights (default) enable inference on 8GB RAM systems where the Python ML stack cannot run FLUX.2 at all
- Text-to-image: Generate images from text prompts
- Image-to-image: Transform existing images guided by prompts
- Multi-reference: Combine multiple reference images (e.g., `-i car.png -i beach.png` for "car on beach")
- Integrated text encoder: Qwen3-4B encoder built-in, no external embedding computation needed
- Memory efficient: Automatic encoder release after encoding (~8GB freed)
- Memory-mapped weights: Enabled by default. Reduces peak memory from ~16GB to ~4-5GB. Fastest mode on MPS; BLAS users with plenty of RAM may prefer `--no-mmap` for faster inference
- Size-independent seeds: Same seed produces similar compositions at different resolutions. Explore at 256×256, then render at 512×512 with the same seed
- Terminal image display: Watch the resulting image without leaving your terminal (Ghostty, Kitty, or iTerm2)
Display generated images directly in your terminal with `--show`, or watch the denoising process step-by-step with `--show-steps`:
```sh
# Display final image in terminal (auto-detects Kitty/Ghostty/iTerm2)
./flux -d flux-klein-model -p "a cute robot" -o robot.png --show

# Display each denoising step (slower, but interesting to watch)
./flux -d flux-klein-model -p "a cute robot" -o robot.png --show-steps
```

Requires a terminal supporting the Kitty graphics protocol (such as Kitty or Ghostty), or iTerm2. Terminal type is auto-detected from environment variables.
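For reference, this kind of detection can be done from the environment alone. Below is a minimal, hypothetical sketch; the exact variables and values this program checks are an assumption, not taken from its source:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical terminal detection: Kitty and Ghostty implement the Kitty
 * graphics protocol; iTerm2 uses its own inline-image escape sequence. */
static const char *detect_graphics_protocol(void) {
    const char *term = getenv("TERM");
    const char *prog = getenv("TERM_PROGRAM");
    if (term && strstr(term, "kitty"))   return "kitty";
    if (term && strstr(term, "ghostty")) return "kitty";  /* same protocol */
    if (prog && strstr(prog, "iTerm"))   return "iterm2";
    return "none";
}

int main(void) {
    printf("graphics protocol: %s\n", detect_graphics_protocol());
    return 0;
}
```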
```sh
./flux -d flux-klein-model -p "A fluffy orange cat sitting on a windowsill" -o cat.png
```

Transform an existing image based on a prompt:
```sh
./flux -d flux-klein-model -p "oil painting style" -i photo.png -o painting.png
```

FLUX.2 uses in-context conditioning for image-to-image generation. Unlike traditional approaches that add noise to the input image, FLUX.2 passes the reference image as additional tokens that the model can attend to during generation. This means:
- The model "sees" your input image and uses it as a reference
- The prompt describes what you want the output to look like
- Results tend to preserve the composition while applying the described transformation
Tips for good results:
- Use prompts that describe the desired output, not instructions
- Good: `"oil painting of a woman with sunglasses, impressionist style"`
- Less good: `"make it an oil painting"` (instructional prompts may work less well)
Super Resolution: Since the reference image can be a different size than the output, you can use img2img for upscaling:
```sh
./flux -d flux-klein-model -i small.png -W 1024 -H 1024 -o big.png -p "Create an exact copy of the input image."
```

The model will generate a higher-resolution version while preserving the composition and details of the input.
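The same upscaling trick works through the library API described later in this README; here is a minimal sketch using the documented `flux_img2img` call, setting the output size independently of the input:

```c
#include "flux.h"
#include <stdio.h>

/* Upscale via img2img: the reference image can be smaller than the output. */
int main(void) {
    flux_ctx *ctx = flux_load_dir("flux-klein-model");
    if (!ctx) { fprintf(stderr, "%s\n", flux_get_error()); return 1; }

    flux_image *small = flux_image_load("small.png");
    if (!small) { flux_free(ctx); return 1; }

    flux_params params = FLUX_PARAMS_DEFAULT;
    params.width = 1024;   /* output resolution larger than the input */
    params.height = 1024;

    flux_image *big = flux_img2img(ctx, "Create an exact copy of the input image.",
                                   small, &params);
    flux_image_free(small);
    if (big) {
        flux_image_save(big, "big.png");
        flux_image_free(big);
    }
    flux_free(ctx);
    return 0;
}
```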
Combine elements from multiple reference images:
```sh
./flux -d flux-klein-model -i car.png -i beach.png -p "a sports car on the beach" -o result.png
```

Each reference image is encoded separately and passed to the transformer with different positional embeddings (T=10, T=20, T=30, ...). The model attends to all references during generation, allowing it to combine elements from each.
Example:
- Reference 1: A red sports car
- Reference 2: A tropical beach with palm trees
- Prompt: "combine the two images"
- Result: A red sports car on a tropical beach
You can specify up to 16 reference images with multiple `-i` flags. The prompt guides how the references are combined.
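To make the T-offset scheme described above concrete, here is a toy sketch of how each reference's tokens could receive distinct (T, H, W) position ids. The layout is illustrative only; the real representation lives inside the transformer:

```c
#include <stdio.h>

/* Toy (T, H, W) position ids: reference r sits at T = 10*(r+1), and its
 * tokens span the latent grid in H and W. Illustrative, not library code. */
int main(void) {
    int num_refs = 2, grid = 2;  /* tiny 2x2 latent grid per reference */
    for (int r = 0; r < num_refs; r++) {
        int t = 10 * (r + 1);    /* T = 10, 20, 30, ... */
        for (int h = 0; h < grid; h++)
            for (int w = 0; w < grid; w++)
                printf("ref %d token: (T=%d, H=%d, W=%d)\n", r, t, h, w);
    }
    return 0;
}
```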
Start without `-p` to enter interactive mode:
```sh
./flux -d flux-klein-model
```

Generate images by typing prompts. Each image gets a `$N` reference ID:
```
flux> a red sports car
Done -> /tmp/flux-.../image-0001.png (ref $0)
flux> a tropical beach
Done -> /tmp/flux-.../image-0002.png (ref $1)
flux> $0 $1 combine them
Generating 256x256 (multi-ref, 2 images)...
Done -> /tmp/flux-.../image-0003.png (ref $2)
```
Prompt syntax:
- `prompt` - text-to-image
- `512x512 prompt` - set size inline
- `$ prompt` - img2img with last image
- `$N prompt` - img2img with reference $N
- `$0 $3 prompt` - multi-reference (combine images)
Commands: `!help`, `!save`, `!load`, `!seed`, `!size`, `!steps`, `!explore`, `!show`, `!quit`
Required:

```
-d, --dir PATH          Path to model directory
-p, --prompt TEXT       Text prompt for generation
-o, --output PATH       Output image path (.png or .ppm)
```

Generation options:

```
-W, --width N           Output width in pixels (default: 256)
-H, --height N          Output height in pixels (default: 256)
-s, --steps N           Sampling steps (default: 4)
-S, --seed N            Random seed for reproducibility
```

Image-to-image options:

```
-i, --input PATH        Reference image (can be specified multiple times)
```

Output options:

```
-q, --quiet             Silent mode, no output
-v, --verbose           Show detailed config and timing info
--show                  Display image in terminal (auto-detects Kitty/Ghostty/iTerm2)
--show-steps            Display each denoising step (slower)
```

Other options:

```
-m, --mmap              Memory-mapped weights (default, fastest on MPS)
--no-mmap               Disable mmap, load all weights upfront
-e, --embeddings PATH   Load pre-computed text embeddings (advanced)
-h, --help              Show help
```
The seed is always printed to stderr, even when random:
```
$ ./flux -d flux-klein-model -p "a landscape" -o out.png
Seed: 1705612345
...
Saving... out.png 256x256 (0.1s)
```
To reproduce the same image, use the printed seed:
```
$ ./flux -d flux-klein-model -p "a landscape" -o out.png -S 1705612345
```
Generated PNG images include metadata with the seed and model information, so you can always recover the seed even if you didn't save the terminal output:
```sh
# Using exiftool
exiftool image.png | grep flux

# Using Python/PIL
python3 -c "from PIL import Image; print(Image.open('image.png').info)"

# Using ImageMagick
identify -verbose image.png | grep -A1 "Properties:"
```

The following metadata fields are stored:

- `flux:seed` - The random seed used for generation
- `flux:model` - The model name (FLUX.2-klein-4B)
- `Software` - Program identifier
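From the library, the write side of this is `flux_image_save_with_seed` (listed in the API reference below); a minimal sketch, assuming you already hold a generated image and the seed you used:

```c
#include "flux.h"
#include <stdint.h>

/* Embed the seed in the PNG metadata so it survives even if the terminal
 * output is lost. img and seed come from your generation code. */
int save_with_metadata(const flux_image *img, int64_t seed) {
    return flux_image_save_with_seed(img, "image.png", seed); /* 0 on success */
}
```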
Choose a backend when building:
```sh
make          # Show available backends
make generic  # Pure C, no dependencies (slow)
make blas     # BLAS acceleration (~30x faster)
make mps      # Apple Silicon Metal GPU (fastest, macOS only)
```

Recommended:

- macOS Apple Silicon: `make mps`
- macOS Intel: `make blas`
- Linux with OpenBLAS: `make blas`
- Linux without OpenBLAS: `make generic`
For `make blas` on Linux, install OpenBLAS first:
```sh
# Ubuntu/Debian
sudo apt install libopenblas-dev

# Fedora
sudo dnf install openblas-devel
```

Other targets:

```sh
make clean  # Clean build artifacts
make info   # Show available backends for this platform
```

Run the test suite to verify your build produces correct output:
```sh
make test        # Run all 3 tests
make test-quick  # Run only the quick 64x64 test
```

The tests compare generated images against reference images in `test_vectors/`. A test passes if the maximum pixel difference is within tolerance (to allow for minor floating-point variations across platforms); a sketch of reproducing that check follows the test-case table below.
Test cases:
| Test | Size | Steps | Purpose |
|---|---|---|---|
| Quick | 64×64 | 2 | Fast txt2img sanity check |
| Full | 512×512 | 4 | Validates txt2img at larger resolution |
| img2img | 256×256 | 4 | Validates image-to-image transformation |
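The pass criterion is easy to reproduce by hand. Here is a minimal sketch using the library's image loader; note it assumes `flux_image` exposes its RGB bytes through a `data` field, which is an assumption (only `width` and `height` appear in this README):

```c
#include <stdio.h>
#include <stdlib.h>
#include "flux.h"

/* Report the maximum per-channel pixel difference between two images,
 * the quantity the test tolerance is described in terms of. */
int main(int argc, char **argv) {
    if (argc != 3) { fprintf(stderr, "usage: %s a.png b.png\n", argv[0]); return 1; }
    flux_image *a = flux_image_load(argv[1]);
    flux_image *b = flux_image_load(argv[2]);
    if (!a || !b || a->width != b->width || a->height != b->height) return 1;
    int maxdiff = 0;
    size_t n = (size_t)a->width * a->height * 3;      /* assuming packed RGB */
    for (size_t i = 0; i < n; i++) {
        int d = abs((int)a->data[i] - (int)b->data[i]); /* `data` is assumed */
        if (d > maxdiff) maxdiff = d;
    }
    printf("max pixel diff: %d\n", maxdiff);
    flux_image_free(a);
    flux_image_free(b);
    return 0;
}
```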
You can also run the test script directly for more options:
```sh
python3 run_test.py --help
python3 run_test.py --quick       # Quick test only
python3 run_test.py --flux-binary ./flux --model-dir /path/to/model
```

Download the model weights (~16GB) from HuggingFace using one of these methods:
Option 1: Shell script (requires curl)

```sh
./download_model.sh
```

Option 2: Python script (requires huggingface_hub)

```sh
pip install huggingface_hub
python download_model.py
```

Both download the same files to `./flux-klein-model`:
- VAE (~300MB)
- Transformer (~4GB)
- Qwen3-4B Text Encoder (~8GB)
- Tokenizer
Benchmarks on Apple M3 Max (128GB RAM), generating a 4-step image.
The MPS implementation matches PyTorch's optimized MPS pipeline, and is noticeably faster at small image sizes.
| Size | C (MPS) | PyTorch (MPS) |
|---|---|---|
| 256x256 | 5.6s | 11s |
| 512x512 | 9.1s | 13s |
| 1024x1024 | 26s | 25s |
Notes:
- All times measured as wall clock, including model loading, no warmup. PyTorch times exclude library import overhead (~5-10s) to be fair.
- The C BLAS backend (CPU) is not shown.
- The `make generic` backend (pure C, no BLAS) is approximately 30x slower than BLAS and not included in benchmarks.
- The fastest Metal implementation remains the Draw Things app, which can produce a 1024x1024 image in just 14 seconds!
Maximum resolution: 1792x1792 pixels. The model produces good results up to this size; beyond this resolution image quality degrades significantly (this is a model limitation, not an implementation issue).
Minimum resolution: 64x64 pixels.
Dimensions should be multiples of 16 (the VAE's 8x spatial downsampling combined with the transformer's 2x2 latent patching).
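If you compute sizes programmatically, a small helper can clamp and snap a requested dimension to these constraints (an illustrative helper, not part of the library API):

```c
#include <stdio.h>

/* Clamp to [64, 1792] and round down to a multiple of 16. */
static int snap_dim(int px) {
    if (px < 64) px = 64;
    if (px > 1792) px = 1792;
    return px - (px % 16);
}

int main(void) {
    printf("%d %d %d\n", snap_dim(500), snap_dim(50), snap_dim(2000));
    /* prints: 496 64 1792 */
    return 0;
}
```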
FLUX.2-klein-4B is a rectified flow transformer optimized for fast inference:
| Component | Architecture |
|---|---|
| Transformer | 5 double blocks + 20 single blocks, 3072 hidden dim, 24 attention heads |
| VAE | AutoencoderKL, 128 latent channels, 8x spatial compression |
| Text Encoder | Qwen3-4B, 36 layers, 2560 hidden dim |
Inference steps: This is a distilled model that produces good results with exactly 4 sampling steps.
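For intuition, rectified-flow sampling is just a few Euler steps along a predicted velocity field. The toy sketch below uses a stand-in velocity function and a made-up schedule; the names and dynamics are hypothetical, not this project's code:

```c
#include <stdio.h>

#define LATENT_LEN 4  /* toy size; real latents are far larger */

/* Stand-in for the transformer's velocity prediction at time t. */
static void predict_velocity(const float *latent, float t, float *v) {
    (void)t;
    for (int i = 0; i < LATENT_LEN; i++) v[i] = -latent[i]; /* toy dynamics */
}

int main(void) {
    float latent[LATENT_LEN] = {1.0f, -0.5f, 0.25f, 2.0f};
    float v[LATENT_LEN];
    const float sigma[5] = {1.0f, 0.75f, 0.5f, 0.25f, 0.0f}; /* toy schedule */
    for (int step = 0; step < 4; step++) {  /* klein is distilled for 4 steps */
        predict_velocity(latent, sigma[step], v);
        float dt = sigma[step + 1] - sigma[step];
        for (int i = 0; i < LATENT_LEN; i++)
            latent[i] += dt * v[i];  /* Euler step along the flow */
    }
    printf("final latent[0] = %f\n", latent[0]);
    return 0;
}
```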
With mmap (default):
| Phase | Memory |
|---|---|
| Text encoding | ~2GB (layers loaded on-demand) |
| Diffusion | ~1-2GB (blocks loaded on-demand) |
| Peak | ~4-5GB |
With `--no-mmap` (all weights in RAM):
| Phase | Memory |
|---|---|
| Text encoding | ~8GB (encoder weights) |
| Diffusion | ~8GB (transformer ~4GB + VAE ~300MB + activations) |
| Peak | ~16GB (if encoder not released) |
The text encoder is automatically released after encoding, reducing peak memory during diffusion. If you generate multiple images with different prompts, the encoder reloads automatically.
Memory-mapped weight loading is enabled by default. Use `--no-mmap` to disable it and load all weights upfront.
```sh
./flux -d flux-klein-model -p "A cat" -o cat.png             # mmap (default)
./flux -d flux-klein-model -p "A cat" -o cat.png --no-mmap   # load all upfront
```

How it works: Instead of loading all model weights into RAM upfront, mmap keeps the safetensors files memory-mapped and loads weights on-demand:
- Text encoder (Qwen3): Each of the 36 transformer layers (~400MB each) is loaded, processed, and immediately freed. Only ~2GB stays resident instead of ~8GB.
- Denoising transformer: Each of the 5 double-blocks (~300MB) and 20 single-blocks (~150MB) is loaded on-demand and freed after use. Only ~200MB of shared weights stays resident instead of ~4GB.
This reduces peak memory from ~16GB to ~4-5GB, making inference possible on 16GB RAM systems where the Python ML stack cannot run FLUX.2 at all.
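The underlying mechanism is ordinary POSIX mmap. The sketch below shows the pattern in isolation (simplified; the real loader also parses the safetensors header to locate each tensor's byte range):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a weights file read-only. Pages fault in on first access and can be
 * evicted under memory pressure, which keeps the resident set small. */
int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s weights.safetensors\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    const uint8_t *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after close */
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    printf("mapped %lld bytes; first byte: %u\n",
           (long long)st.st_size, (unsigned)base[0]);
    munmap((void *)base, st.st_size);
    return 0;
}
```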
Performance varies by backend:

- MPS (Apple Silicon): mmap is the fastest mode. The model stores weights in bf16 format, and MPS uses them directly via zero-copy pointers into the memory-mapped region. No conversion overhead, and the kernel handles paging efficiently.
- BLAS (CPU): mmap is slightly slower but uses much less RAM. BLAS requires f32 weights, so each block must be converted from bf16→f32 on every step (25 blocks × 4 steps = 100 conversions). With `--no-mmap`, this conversion happens once at startup (see the bf16 sketch after this list). Recommendation: If you have 32GB+ RAM and use BLAS, try `--no-mmap` for faster inference. If RAM is limited, mmap lets you run at all.
- Generic (pure C): Same tradeoffs as BLAS, but slower overall.
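For context on the bf16→f32 cost: bf16 is simply the top 16 bits of an IEEE-754 float, so per-element conversion is a shift, and the overhead is dominated by memory traffic. A minimal sketch:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Expand bf16 (the high half of a float's bit pattern) to f32. */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void) {
    uint16_t one = 0x3F80;             /* bf16 encoding of 1.0 */
    printf("%f\n", bf16_to_f32(one));  /* prints 1.000000 */
    return 0;
}
```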
The library can be integrated into your own C/C++ projects. Link against `libflux.a` and include `flux.h`.
Here's a complete program that generates an image from a text prompt:
#include "flux.h"
#include <stdio.h>
int main(void) {
/* Load the model. This loads VAE, transformer, and text encoder. */
flux_ctx *ctx = flux_load_dir("flux-klein-model");
if (!ctx) {
fprintf(stderr, "Failed to load model: %s\n", flux_get_error());
return 1;
}
/* Configure generation parameters. Start with defaults and customize. */
flux_params params = FLUX_PARAMS_DEFAULT;
params.width = 512;
params.height = 512;
params.seed = 42; /* Use -1 for random seed */
/* Generate the image. This handles text encoding, diffusion, and VAE decode. */
flux_image *img = flux_generate(ctx, "A fluffy orange cat in a sunbeam", ¶ms);
if (!img) {
fprintf(stderr, "Generation failed: %s\n", flux_get_error());
flux_free(ctx);
return 1;
}
/* Save to file. Format is determined by extension (.png or .ppm). */
flux_image_save(img, "cat.png");
printf("Saved cat.png (%dx%d)\n", img->width, img->height);
/* Clean up */
flux_image_free(img);
flux_free(ctx);
return 0;
}Compile with:
gcc -o myapp myapp.c -L. -lflux -lm -framework Accelerate # macOS
gcc -o myapp myapp.c -L. -lflux -lm -lopenblas # LinuxTransform an existing image guided by a text prompt using in-context conditioning:
#include "flux.h"
#include <stdio.h>
int main(void) {
flux_ctx *ctx = flux_load_dir("flux-klein-model");
if (!ctx) return 1;
/* Load the input image */
flux_image *photo = flux_image_load("photo.png");
if (!photo) {
fprintf(stderr, "Failed to load image\n");
flux_free(ctx);
return 1;
}
/* Set up parameters. Output size defaults to input size. */
flux_params params = FLUX_PARAMS_DEFAULT;
params.seed = 123;
/* Transform the image - describe the desired output */
flux_image *painting = flux_img2img(ctx, "oil painting of the scene, impressionist style",
photo, ¶ms);
flux_image_free(photo); /* Done with input */
if (!painting) {
fprintf(stderr, "Transformation failed: %s\n", flux_get_error());
flux_free(ctx);
return 1;
}
flux_image_save(painting, "painting.png");
printf("Saved painting.png\n");
flux_image_free(painting);
flux_free(ctx);
return 0;
}
### Generating Multiple Images
When generating multiple images with different seeds but the same prompt, you can avoid reloading the text encoder:
```c
flux_ctx *ctx = flux_load_dir("flux-klein-model");
flux_params params = FLUX_PARAMS_DEFAULT;
params.width = 256;
params.height = 256;

/* Generate 5 variations with different seeds */
for (int i = 0; i < 5; i++) {
    flux_set_seed(1000 + i);
    flux_image *img = flux_generate(ctx, "A mountain landscape at sunset", &params);
    char filename[64];
    snprintf(filename, sizeof(filename), "landscape_%d.png", i);
    flux_image_save(img, filename);
    flux_image_free(img);
}
flux_free(ctx);
```

Note: The text encoder (~8GB) is automatically released after the first generation to save memory. It reloads automatically if you use a different prompt.
All functions that can fail return NULL on error. Use `flux_get_error()` to get a description:
```c
flux_ctx *ctx = flux_load_dir("nonexistent-model");
if (!ctx) {
    fprintf(stderr, "Error: %s\n", flux_get_error());
    /* Prints something like: "Failed to load VAE - cannot generate images" */
    return 1;
}
```

Core functions:
```c
flux_ctx *flux_load_dir(const char *model_dir);  /* Load model, returns NULL on error */
void flux_free(flux_ctx *ctx);                   /* Free all resources */
flux_image *flux_generate(flux_ctx *ctx, const char *prompt, const flux_params *params);
flux_image *flux_img2img(flux_ctx *ctx, const char *prompt, const flux_image *input,
                         const flux_params *params);
```

Image handling:

```c
flux_image *flux_image_load(const char *path);                /* Load PNG or PPM */
int flux_image_save(const flux_image *img, const char *path); /* 0=success, -1=error */
int flux_image_save_with_seed(const flux_image *img, const char *path, int64_t seed); /* Save with metadata */
flux_image *flux_image_resize(const flux_image *img, int new_w, int new_h);
void flux_image_free(flux_image *img);
```

Utilities:

```c
void flux_set_seed(int64_t seed);               /* Set RNG seed for reproducibility */
const char *flux_get_error(void);               /* Get last error message */
void flux_release_text_encoder(flux_ctx *ctx);  /* Manually free ~8GB (optional) */
```

```c
typedef struct {
    int width;      /* Output width in pixels (default: 256) */
    int height;     /* Output height in pixels (default: 256) */
    int num_steps;  /* Denoising steps, use 4 for klein (default: 4) */
    int64_t seed;   /* Random seed, -1 for random (default: -1) */
} flux_params;

/* Initialize with sensible defaults */
#define FLUX_PARAMS_DEFAULT { 256, 256, 4, -1 }
```

When debugging img2img issues, the `--debug-py` flag allows you to run the C implementation with exact inputs saved from a Python reference script. This isolates whether differences are due to input preparation (VAE encoding, text encoding, noise generation) or the transformer itself.
Setup:
- Set up the Python environment:

```sh
python -m venv flux_env
source flux_env/bin/activate
pip install torch diffusers transformers safetensors einops huggingface_hub
```

- Clone the flux2 reference (for the model class):

```sh
git clone https://github.com/black-forest-labs/flux flux2
```

- Run the Python debug script to save inputs:

```sh
python debug/debug_img2img_compare.py
```

This saves to /tmp/:

  - `py_noise.bin` - Initial noise tensor
  - `py_ref_latent.bin` - VAE-encoded reference image
  - `py_text_emb.bin` - Text embeddings from Qwen3

- Run C with the same inputs:

```sh
./flux -d flux-klein-model --debug-py -W 256 -H 256 --steps 4 -o /tmp/c_debug.png
```

- Compare the outputs visually or numerically.
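For the numeric comparison, a small utility that reports the maximum absolute difference between two dumps can help; a minimal sketch, assuming the .bin files are raw f32 arrays of equal length (that format is an assumption):

```c
#include <math.h>
#include <stdio.h>

/* Max absolute elementwise difference between two raw f32 dumps. */
int main(int argc, char **argv) {
    if (argc != 3) { fprintf(stderr, "usage: %s a.bin b.bin\n", argv[0]); return 1; }
    FILE *fa = fopen(argv[1], "rb"), *fb = fopen(argv[2], "rb");
    if (!fa || !fb) { perror("fopen"); return 1; }
    float a, b, maxdiff = 0.0f;
    while (fread(&a, sizeof a, 1, fa) == 1 && fread(&b, sizeof b, 1, fb) == 1) {
        float d = fabsf(a - b);
        if (d > maxdiff) maxdiff = d;
    }
    printf("max abs diff: %g\n", maxdiff);
    fclose(fa);
    fclose(fb);
    return 0;
}
```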
What this helps diagnose:
- If C and Python produce identical outputs with identical inputs, any differences in normal operation are due to input preparation (VAE, text encoder, RNG)
- If outputs differ even with identical inputs, the issue is in the transformer or sampling implementation
The `debug/` directory contains Python scripts for comparing C and Python implementations:

- `debug_img2img_compare.py` - Full img2img comparison with step-by-step statistics
- `debug_rope_img2img.py` - Verify RoPE position encoding matches between C and Python
MIT


