A local, no-server client for generating embeddings, designed for semantic search and vector comparison in AI systems such as RAG (Retrieval-Augmented Generation).

I wrote a Python class that runs Qwen3-Embedding-0.6B directly: no API calls, no embedding server, no extra moving parts.

Why? In most RAG or semantic search projects, I had two options:
1. Pay for external APIs (latency + cost + internet dependency)
2. Spin up a dedicated embedding server (overhead for small-to-medium projects)

I wanted something dead simple that just works.

What this client does:
- Auto-detects GPU, CPU, or Apple Silicon (MPS)
- FP16 and 4-bit quantization support
- Built-in caching for repeated texts
- Batched inference (3-5x faster than one-by-one)
- Async interface for FastAPI
- Cosine similarity and ranking helpers out of the box

The result? ~75% less memory with 4-bit, and batch processing that actually scales.

I've put the code on GitHub (link in comments). Would love your thoughts, not just on bugs, but on the architecture itself:
- Is batching + caching the right trade-off for most RAG workloads?
- Could the async interface be simplified further?
- Any obvious optimization I'm missing for CPU-only deployments?

Genuinely looking for a second pair of eyes from someone who's done this before.

#python #nlp #LLM #SemanticSearch #rag #embeddings #opensource #AI
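For readers who want to see the shape of the caching, batching, and ranking features listed in the post, here is a minimal sketch in plain Python. It is not the code from the repository: `toy_embed_batch`, `CachedBatchEmbedder`, `cosine_similarity`, and `rank` are hypothetical names, and the toy embedder is a hash-based stand-in for the real model call.

```python
import hashlib
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def toy_embed_batch(texts: Sequence[str]) -> List[Vector]:
    """Hypothetical stand-in for the model: deterministic 8-dim vectors from a hash."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vectors.append([byte / 255.0 for byte in digest[:8]])
    return vectors


class CachedBatchEmbedder:
    """Embeds texts in batches and skips texts that are already cached."""

    def __init__(self, embed_batch: Callable[[Sequence[str]], List[Vector]],
                 batch_size: int = 32) -> None:
        self.embed_batch = embed_batch
        self.batch_size = batch_size
        self.cache: Dict[str, Vector] = {}
        self.model_calls = 0  # counts batch calls, useful for checking cache hits

    def embed(self, texts: Sequence[str]) -> List[Vector]:
        # Deduplicate while preserving order, then keep only uncached texts.
        missing = [t for t in dict.fromkeys(texts) if t not in self.cache]
        for start in range(0, len(missing), self.batch_size):
            chunk = missing[start:start + self.batch_size]
            self.model_calls += 1
            for text, vector in zip(chunk, self.embed_batch(chunk)):
                self.cache[text] = vector
        # Return vectors in the same order as the input, duplicates included.
        return [self.cache[t] for t in texts]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query: Vector, docs: Sequence[Vector]) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query, d)) for i, d in enumerate(docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In a real client the cache would probably need a size limit (for example an LRU policy), since an unbounded dict grows with every new text.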
Batching plus caching is the right baseline, but for embedding workloads the bigger lever is async prefetch with a bounded queue: you can hide ~80% of inference latency on a CPU-only path by keeping the queue warm during chunk extraction. On 4-bit Qwen3-Embedding-0.6B specifically, watch out for a shift in cosine-similarity scale after quantization; we had to recalibrate our retrieval threshold cutoffs after switching from FP16. Did you sanity-check downstream NDCG between FP16 and 4-bit?
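To make the bounded-queue suggestion concrete, here is a rough sketch using only `asyncio` from the standard library (Python 3.9+ for `asyncio.to_thread`). The names `embed_with_prefetch` and `fake_embed_batch` are made up for illustration, and the fake embedder stands in for the real model call; nothing here is taken from the linked repository.

```python
import asyncio
from typing import Callable, Iterable, List, Optional, Sequence

Vector = List[float]


def fake_embed_batch(batch: Sequence[str]) -> List[Vector]:
    """Hypothetical stand-in for a blocking model call: one number per text."""
    return [[float(len(text))] for text in batch]


async def embed_with_prefetch(
    chunks: Iterable[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 16,
    max_pending: int = 4,
) -> List[Vector]:
    """Overlap chunk production with embedding via a bounded queue."""
    # maxsize bounds memory: the producer waits once max_pending batches are queued.
    queue: "asyncio.Queue[Optional[List[str]]]" = asyncio.Queue(maxsize=max_pending)

    async def producer() -> None:
        batch: List[str] = []
        for chunk in chunks:
            batch.append(chunk)
            if len(batch) == batch_size:
                await queue.put(batch)  # suspends here when the queue is full
                batch = []
        if batch:
            await queue.put(batch)  # flush the final partial batch
        await queue.put(None)  # sentinel: no more batches

    async def consumer() -> List[Vector]:
        results: List[Vector] = []
        while True:
            batch = await queue.get()
            if batch is None:
                return results
            # Run the blocking model call in a worker thread so the loop stays free.
            results.extend(await asyncio.to_thread(embed_batch, batch))

    _, vectors = await asyncio.gather(producer(), consumer())
    return vectors
```

Usage would look like `asyncio.run(embed_with_prefetch(chunks, fake_embed_batch))`. Note that the producer here iterates a plain synchronous iterable; if chunk extraction itself is slow (PDF parsing, for instance), it would also need to run off the event loop for the overlap to pay off.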
https://github.com/asgrdev/embedding_generator