Local Embedding Client for RAG and Semantic Search

This title was summarized by AI from the post below.

3w Edited

A local, no-server client for generating embeddings, designed for semantic search and vector comparison in AI systems such as RAG (Retrieval-Augmented Generation).

I wrote a Python class that runs Qwen3-Embedding-0.6B directly: no API calls, no embedding server, no extra moving parts.

Why? In most RAG or semantic search projects, I had two options:
1. Pay for external APIs (latency + cost + internet dependency)
2. Spin up a dedicated embedding server (overhead for small-to-medium projects)

I wanted something dead simple that just works.

What this client does:
- Auto-detects GPU, CPU, or Apple Silicon (MPS)
- FP16 and 4-bit quantization support
- Built-in caching for repeated texts
- Batched inference (3-5x faster than one-by-one)
- Async interface for FastAPI
- Cosine similarity and ranking helpers out of the box

The result? ~75% less memory with 4-bit, and batch processing that actually scales.

I've put the code on GitHub (link in comments). Would love your thoughts, not just on bugs, but on the architecture itself:
- Is batching + caching the right trade-off for most RAG workloads?
- Could the async interface be simplified further?
- Any obvious optimization I'm missing for CPU-only deployments?

Genuinely looking for a second pair of eyes from someone who's done this before.

#python #nlp #LLM #SemanticSearch #rag #embeddings #opensource #AI
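The caching, batching, and ranking features listed in the post can be sketched roughly as follows. This is a minimal illustration and not the code from the repository: `toy_embed_batch` is a hypothetical stand-in for the Qwen3-Embedding-0.6B forward pass, and `CachedBatchEmbedder`, `cosine_similarity`, and `rank` are names assumed for this example.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def toy_embed_batch(texts: Sequence[str], dim: int = 16) -> List[Vector]:
    """Hypothetical stand-in for the model: hashes words into a small count vector."""
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for word in text.lower().split():
            vec[sum(ord(c) for c in word) % dim] += 1.0
        vectors.append(vec)
    return vectors


class CachedBatchEmbedder:
    """Embeds only texts not seen before, in fixed-size batches."""

    def __init__(self, embed_batch: Callable[[Sequence[str]], List[Vector]],
                 batch_size: int = 32) -> None:
        self.embed_batch = embed_batch
        self.batch_size = batch_size
        self.cache: Dict[str, Vector] = {}
        self.batches_run = 0  # counts calls to the underlying model

    def embed(self, texts: Sequence[str]) -> List[Vector]:
        # Deduplicate uncached texts while preserving their first-seen order.
        missing = list(dict.fromkeys(t for t in texts if t not in self.cache))
        for start in range(0, len(missing), self.batch_size):
            chunk = missing[start:start + self.batch_size]
            for text, vec in zip(chunk, self.embed_batch(chunk)):
                self.cache[text] = vec
            self.batches_run += 1
        return [self.cache[t] for t in texts]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # zero vectors get similarity 0.0


def rank(query_vec: Vector, doc_vecs: Sequence[Vector]) -> List[Tuple[int, float]]:
    """Returns (document index, score) pairs, best match first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Keying the cache on the exact text keeps the sketch simple; whether that trade-off suits a given RAG workload depends on how often identical chunks recur, which is the question the post raises.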

3 Comments

Asghar Asghari 3w

https://github.com/asgrdev/embedding_generator

Aditya Sakpal 3w

Batching plus caching is the right baseline, but for embedding workloads the bigger lever is async prefetch with a bounded queue — you can hide ~80% of inference latency on a CPU-only path by keeping the queue warm during chunk extraction. On 4-bit Qwen3-Embedding-0.6B specifically, watch out for cosine-similarity scale shift after quantization; we had to recalibrate threshold cutoffs on retrieval after switching from FP16. Did you sanity-check downstream NDCG between FP16 and 4-bit?
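The bounded-queue prefetch idea in the comment above can be sketched with `asyncio.Queue`. This is a rough illustration under stated assumptions: `fake_embed`, `extract_chunks`, `embed_worker`, and `run_pipeline` are hypothetical names, the "embedding" is just the chunk length, and the sketch does not attempt to reproduce the ~80% latency figure. In a real CPU-bound path the model call would likely be handed to a thread or process pool (for example via `loop.run_in_executor`) so the event loop stays free.

```python
import asyncio
from typing import List, Optional, Sequence


async def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    """Hypothetical stand-in for model inference; yields control like real I/O would."""
    await asyncio.sleep(0)
    return [[float(len(text))] for text in batch]


async def extract_chunks(queue: "asyncio.Queue[Optional[str]]",
                         documents: Sequence[str]) -> None:
    """Producer: splits documents and blocks when the queue is full (backpressure)."""
    for doc in documents:
        for chunk in doc.split(". "):
            await queue.put(chunk)
    await queue.put(None)  # sentinel: no more chunks


async def embed_worker(queue: "asyncio.Queue[Optional[str]]",
                       batch_size: int) -> List[List[float]]:
    """Consumer: drains the queue into batches and embeds each batch."""
    results: List[List[float]] = []
    batch: List[str] = []
    while True:
        chunk = await queue.get()
        if chunk is None:
            break
        batch.append(chunk)
        if len(batch) == batch_size:
            results.extend(await fake_embed(batch))
            batch = []
    if batch:  # flush the final partial batch
        results.extend(await fake_embed(batch))
    return results


async def run_pipeline(documents: Sequence[str], queue_size: int = 4,
                       batch_size: int = 2) -> List[List[float]]:
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue(maxsize=queue_size)
    producer = asyncio.create_task(extract_chunks(queue, documents))
    results = await embed_worker(queue, batch_size)
    await producer
    return results
```

The `maxsize` bound is what keeps memory flat: chunk extraction can run ahead of inference, but only by `queue_size` items.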


More Relevant Posts

khushal khare
2w Edited
Most AI applications send every request to the largest LLM. That increases cost, latency, and unnecessary compute usage for simple tasks.

So I built an ML-based LLM Router that dynamically selects the most suitable model for each query.

The system uses:
• sentence embeddings
• an ML classifier
• confidence-based routing
• an LLM judge fallback for uncertain decisions

Architecture Flow:
Query → Embedding Model → ML Classifier → Confidence Check → Model Routing

Routing Logic:
→ Small LLM for lightweight/simple tasks
→ Large LLM for complex or high-risk reasoning
→ LLM judge fallback when classifier confidence is low

I also added production-oriented safeguards:
• API key authentication
• prompt injection filtering
• environment-based secret management
• input validation and request filtering

For observability and evaluation, the router tracks:
• selected model
• routing confidence
• latency
• routing strategy
• evaluation results
• Prometheus metrics

Tech Stack: Python, FastAPI, Pydantic, Sentence Transformers, Scikit-learn, Groq API, Llama 3, Prometheus, Docker, Pandas

One interesting part of the project was moving from an earlier rule-based routing approach to an ML-based classifier architecture, making the routing system more adaptive and scalable.

The goal was simple: use the right model for the right task instead of sending everything to the biggest model.

GitHub: https://lnkd.in/dAP2Cr8E

#AI #ArtificialIntelligence #LLM #LLMOps #GenerativeAI #MachineLearning #MLOps #FastAPI #Python #DeepLearning #DataScience #SystemDesign #AIEngineering #SoftwareEngineering #BackendDevelopment #CloudComputing #Docker #Prometheus #OpenSource #TechInnovation #AgenticAI #MLEngineering #NLP #LargeLanguageModels #AIInfrastructure #ScikitLearn #Pydantic #APIs #TechProjects #Developer
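The confidence-based routing with a judge fallback described in this post might look roughly like the sketch below. It is not the router from the linked repository: `route_query`, `toy_classify`, `toy_judge`, and the 0.75 threshold are all assumptions made for illustration, and the toy classifier simply treats longer queries as harder.

```python
from typing import Callable, Dict, Tuple


def route_query(query: str,
                classify: Callable[[str], Dict[str, float]],
                judge: Callable[[str], str],
                threshold: float = 0.75) -> Tuple[str, str]:
    """Returns (selected model label, routing strategy used)."""
    probabilities = classify(query)
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence >= threshold:
        return label, "classifier"
    # Classifier is unsure: defer to the (more expensive) judge.
    return judge(query), "judge_fallback"


def toy_classify(query: str) -> Dict[str, float]:
    """Hypothetical classifier: more words means more likely to need the large model."""
    p_large = min(len(query.split()) / 10.0, 1.0)
    return {"small": 1.0 - p_large, "large": p_large}


def toy_judge(query: str) -> str:
    """Hypothetical LLM judge: always escalates uncertain queries."""
    return "large"
```

Returning the strategy alongside the label makes it easy to log how often the fallback fires, which is one of the metrics the post says the router tracks.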
S. Pratham
4w
I built a tool that reads a video with AI so you don't have to.

drop any video file into it, and it gives you a full written breakdown. not a transcript. an actual analysis: what was said, what was shown, key moments, the tone, who was speaking, a chronological narrative. the kind of notes you'd write if you watched it three times and paid close attention.

and if you have a specific question, just pass it as a prompt. "what tools were mentioned?" "summarize the main argument." "what products appear in this video?" it'll answer it with evidence pulled from the video itself.

here's how you can use it right now: run a video through CLI and get a multi-page summary. import it as a Python module inside your own project. or if your machine can't handle it, there's a Google Colab notebook linked below: free T4 GPU, 16GB VRAM, no setup needed.

some things worth knowing: it doesn't sample frames every N seconds like most tools do. it detects actual scene changes using histogram and SSIM comparison, so it only picks the frames that actually matter. it profiles your GPU at runtime and calculates its own batch size. everything runs locally: no API keys, no cloud, nothing leaves your device.

i built it because i wanted to understand what's inside a video without watching it. turns out that's a genuinely hard problem to solve cleanly.

repo: https://lnkd.in/dmey5a6c
colab: https://lnkd.in/djfj26Yq

#buildinpublic #python #AI #opensource #MachineLearning

S. Pratham
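The histogram half of the scene-change detection mentioned in this post can be illustrated with a small sketch. Everything here is an assumption for illustration: frames are modelled as flat lists of 8-bit grayscale pixels, `gray_histogram`, `histogram_distance`, and `select_keyframes` are hypothetical names, the 0.5 threshold is arbitrary, and the SSIM comparison the post also uses is left out.

```python
from typing import List, Sequence


def gray_histogram(frame: Sequence[int], bins: int = 8) -> List[float]:
    """Normalized histogram of 8-bit grayscale pixel values."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [count / total for count in counts]


def histogram_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two normalized histograms (ranges from 0.0 to 2.0)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select_keyframes(frames: Sequence[Sequence[int]],
                     threshold: float = 0.5) -> List[int]:
    """Keeps the first frame plus every frame whose histogram jumps past the threshold."""
    if not frames:
        return []
    keyframes = [0]
    previous = gray_histogram(frames[0])
    for index in range(1, len(frames)):
        current = gray_histogram(frames[index])
        if histogram_distance(previous, current) > threshold:
            keyframes.append(index)
        previous = current
    return keyframes
```

Compared with sampling every N seconds, this keeps one frame per visually distinct segment, so a long static shot contributes a single frame.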
Omotayo Aina
1mo
885,768 parameters. Trained in 9.7 minutes. val_loss = 1.295

That is what a nano-scale character-level model on TinyStories looks like when you train it in JAX.

Over the last few weeks I ported Andrej Karpathy's NanoChat architecture from PyTorch to JAX/Flax NNX as part of the AI GDE TPU Sprint 2026 with Google TPU Research Cloud compute. The repo is about 12,400 lines across source and scripts.

What JAX gets right:
XLA compiles once. The first training step costs about 35 seconds. After that, steady-state runs at 290ms per step with zero Python interpreter overhead between steps. The same code runs on GPU and TPU without any device-specific branches. Pure functions make experiment reproducibility a non-issue: same seed, same result, on the same hardware.

What JAX gets wrong:
No vLLM. No Flash Attention 3. Debugging inside jit means reading XLA stack traces instead of Python tracebacks. The PyTorch ecosystem is simply deeper for production inference right now.

What the project covers:
• Five model presets from nano (885K params, confirmed from training) to xlarge
• Five NanoChat-specific architecture components: logit softcap, Muon optimizer with Newton-Schulz orthogonalization, Value Embeddings, Smear/Backout token mixing, and depth-aware initialization
• Chinchilla-style power law fitting across model sizes and compute budgets
• A working streaming chat UI served via FastAPI with Server-Sent Events

The full post walks through each component with side-by-side PyTorch and JAX code, the training numbers, and an honest verdict on when to use JAX for research versus when PyTorch is still the right call.

Code: https://lnkd.in/etaA9JNz
Post: https://lnkd.in/e_hNgjWf

#TPUSprint #AIGDE #JAX #MachineLearning #Python #AI #GoogleCloud

I Rebuilt Karpathy's NanoChat in JAX. Here's What XLA Gets Right and What It Gets Dead Wrong. dev.to

1 Comment
Tulasi prasad
1mo
Just pushed a RAG deep-dive repository to GitHub, and I want to break down exactly what's inside.

Retrieval-Augmented Generation is one of the most practical skills in GenAI right now. Instead of hoping an LLM "knows" the answer, RAG makes it retrieve the right context first, then respond. Less hallucination. More accuracy.

So I built a modular repo that walks through every core component using LangChain:
- Document Loaders: ingest data from PDFs, CSVs, web pages, text files & directories
- Text Splitters: chunk documents intelligently with overlap for better context continuity
- Vector Stores: embed chunks and store them in FAISS / Chroma for similarity search
- Retrievers: query smarter using MultiQuery, ContextualCompression, ParentDocument & SelfQuery retrievers

Each module is a standalone Jupyter notebook, so you can pick up any stage and run it independently.

Stack: LangChain · FAISS · Chroma · HuggingFace Embeddings · OpenAI · Python

This is part of my ongoing GenAI learning roadmap, covering Transformers → Prompt Engineering → RAG → Fine-tuning.

Repo link: https://lnkd.in/gJ4YvTiu

#RAG #LangChain #GenerativeAI #LLM #MachineLearning #NLP #Python #OpenToWork #DataScience #AIEngineering LangChain
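Chunking with overlap, as mentioned for the text splitters in this post, can be shown with a plain character-window sketch. This is not LangChain's splitter: `split_with_overlap` is a hypothetical helper that ignores sentence and paragraph boundaries, which real splitters usually try to respect.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Slides a fixed-size window over the text, repeating `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

The repeated characters at each boundary are what give neighbouring chunks shared context, at the cost of embedding some text twice.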
PRITAM DAS
3w
I’ve been diving deep into the architecture that powers the modern AI revolution. Today, I finished implementing a GPT-style Transformer Block from scratch using PyTorch!

Understanding the "how" behind Large Language Models (LLMs) is one thing, but coding the components manually really brings the math to life.

What’s inside my implementation?
- Multi-Head Self-Attention: Implementing the Scaled Dot-Product Attention mechanism to allow the model to focus on different parts of the input sequence simultaneously.
- Residual Connections: Essential for training deep networks by creating "highways" for gradient flow, preventing the vanishing gradient problem.
- Pre-Norm Architecture: Using LayerNorm before the attention and feed-forward blocks (the GPT-2/3 standard) to ensure training stability.
- Causal Masking: Using a triangular matrix (tril) to ensure the model only looks at past tokens, which is crucial for generative tasks.

The Tech Stack
- Language: Python
- Framework: PyTorch
- Concept: Attention is All You Need (Vaswani et al.)

It’s one thing to use an API, but building the blocks yourself gives you a much deeper appreciation for the engineering hurdles behind models like GPT-4 and Llama.

Next step: Putting these blocks together into a full Decoder-only model and training it on some text data!

#MachineLearning #DeepLearning #PyTorch #AI #GenerativeAI #Transformer #LLM #CodingJourney #Python
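Scaled dot-product attention with a causal mask, the core of the block described in this post, can be written out in plain Python to make the arithmetic visible. This is a single-head sketch on nested lists, not the author's PyTorch code; `softmax` and `causal_attention` are names assumed here, and a PyTorch version would typically build the mask with `torch.tril` instead of the explicit `j > i` check.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax; -inf entries come out as exactly 0.0."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q: Matrix, k: Matrix, v: Matrix) -> Tuple[Matrix, Matrix]:
    """Single-head attention where position i may only attend to positions <= i."""
    d_k = len(q[0])
    outputs: Matrix = []
    weights: Matrix = []
    for i, q_i in enumerate(q):
        scores = []
        for j, k_j in enumerate(k):
            if j > i:
                scores.append(float("-inf"))  # future token: masked out
            else:
                dot = sum(a * b for a, b in zip(q_i, k_j))
                scores.append(dot / math.sqrt(d_k))
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(row[j] * v[j][c] for j in range(len(v)))
                        for c in range(len(v[0]))])
    return outputs, weights
```

Because the first position can only see itself, its attention weight on its own token is 1 and its output equals its own value vector, which is a quick sanity check for any implementation.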
Sujeet Hole
3w
Built an AI backend that thinks before it responds. Not just a wrapper around an LLM: it actually classifies your intent first, then decides what to do.

Send it "2 + 2" → math engine handles it
Send it "explain recursion" → explanation pipeline kicks in
Send it "how's your day" → conversational flow takes over

All routing is handled by a LangGraph state machine. No if-else spaghetti.

Stack:
→ FastAPI for the API layer
→ LangGraph for intent routing
→ Groq (LLaMA 3.3-70b) for LLM inference
→ PostgreSQL for storing chat history per user

Hardest part? Getting the intent classifier to not hallucinate random words and break the router.

Next: JWT auth + persistent memory across sessions.

GitHub: https://lnkd.in/dFq_nkVR

#FastAPI #LangGraph #Python #BackendDevelopment #AI #OpenToWork
Maksym Begal
2w
Why PyTorch Lightning is the Gold Standard for Modern Deep Learning

If you’ve ever found yourself drowning in hundreds of lines of boilerplate code (manually managing GPU transfers, checkpoints, and logging), you know the struggle of a DL Engineer. PyTorch Lightning was built to solve exactly that.

Unlike raw PyTorch, Lightning doesn’t change your logic; it structures it. Here is why it has become an essential tool in my workflow:

Key Advantages That Change the Game:
• Separation of Concerns: You focus on the "Science" (model architecture and data in the LightningModule), while Lightning handles the "Engineering" (training loops, validation, and testing).
• Seamless Scalability: Moving from CPU to multi-GPU or TPU clusters requires changing just one line in the Trainer. No more manual .to(device) calls.
• Built-in Best Practices:
  • Automated logging (TensorBoard, WandB, MLFlow).
  • Native support for Precision (FP16/BF16) and Gradient Accumulation.
  • Easy integration of Early Stopping and Checkpointing.
• Readability & Reproducibility: Because the code follows a strict structure, your projects become instantly readable for teammates and production-ready from day one.
• Robust Ecosystem: It’s more than a library; it’s a foundation for tools like Lightning Flash and Fabric, making the path from research to production much shorter.

The Bottom Line: PyTorch Lightning makes your code cleaner, your development faster, and your models more scalable. It is the perfect bridge between an experimental prototype and a production-grade solution.

What’s your take: do you prefer the total control of raw PyTorch or the streamlined efficiency of Lightning? Let’s discuss in the comments!

#MachineLearning #DeepLearning #PyTorch #PyTorchLightning #AI #DataScience #Python #MLOps #ComputerVision #SoftwareEngineering
Corbenic AI

3w
Announcing Merlin: The Zero-Cost Deduplication Engine for LLM Inference

Today, Corbenic AI is releasing the technical validation for Merlin, our deterministic, byte-exact deduplication engine built specifically to optimize large language model context windows.

LLM infrastructure is currently plagued by compute bloat. Merlin solves this at the bare-metal layer by removing redundant context before it ever hits the model, drastically cutting inference costs and reducing "Time To First Token" (TTFT).

We didn't build a complex middleware or a Python wrapper. We built a deeply optimized, zero-dependency C++ engine designed for enterprise scale.

The Architecture & Benchmarks:
⚡ 1.10µs Latency: Mathematical invisibility. Processing time is orders of magnitude below typical inference budgets.
📦 3.8MB Footprint: A statically linked C++ binary that drops seamlessly next to your inference proxy.
🚀 190 GB/s Throughput: Built on a shared-nothing arena architecture to hit physical hardware limits.
🛡️ 100% Lossless: Byte-exact accuracy with zero measurable quality degradation in the LLM output.

Corbenic AI is officially transitioning from technical validation into enterprise pilots. If you are scaling AI infrastructure and need to optimize your prefill compute, we are ready to deploy.

Read the full technical methodology and benchmark data in our published preprint:
👉 https://lnkd.in/e8_KhBj5

#CorbenicAI #AIInfrastructure #MachineLearning #LLMs #MLOps #Cplusplus #DeepTech

Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference zenodo.org
Fimijoba Micheal Oladokun
1w
A year ago, fine-tuning a language model required expensive cloud GPU infrastructure and deep expertise in distributed training. Today you can do it on a consumer GPU in an afternoon.

LoRA and QLoRA changed the economics entirely. Instead of updating billions of parameters, you train small adapter matrices that approximate the needed weight updates. A 7B parameter model that would require 80GB of GPU memory for full fine-tuning runs comfortably in 8GB with QLoRA.

Here is what the workflow actually looks like.

Dataset preparation is where most of the work happens and where most fine-tuning projects succeed or fail. Quality matters far more than quantity. Five hundred carefully curated instruction-response pairs consistently outperform five thousand noisy ones. The format matters too. Modern models expect chat-formatted data with system, user, and assistant turns. Malformed examples produce subtly broken models that are difficult to diagnose.

The LoRA configuration controls the tradeoff between expressiveness and efficiency. Rank 16 handles most tasks well. Higher ranks add capacity at the cost of memory. The target modules, typically the attention projection layers, determine which parts of the model are adapted. The Hugging Face PEFT library handles all of this in a few lines of configuration.

Training with the SFT Trainer from the trl library takes the complexity out of the training loop. Gradient accumulation simulates larger batch sizes on limited hardware. Three epochs is usually enough. Watch the validation loss. If it stops improving before three epochs, your dataset may be too small.

Evaluation is where most fine-tuning guides stop too early. Validation loss tells you training progressed. It does not tell you whether the model does what you need. Test on held-out examples the model has never seen. Compare outputs side by side with the base model. For classification tasks, calculate F1. For generation tasks, evaluate manually. The improvement should be obvious, not marginal.

The use cases where fine-tuning genuinely adds value are consistent output format, domain-specific writing style, and task specialization. For factual knowledge injection, RAG is almost always the better tool.

Have you fine-tuned a model locally? What was the hardest part of the process?

Read the full post here: https://lnkd.in/ezTrTdcD

#MachineLearning #LLM #FineTuning #LocalAI #Python #HuggingFace #LoRA #DataScience #AIEngineering #OpenSource
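The "small adapter matrices" point can be made concrete with a little arithmetic. LoRA replaces the update to a frozen weight matrix of shape d_out × d_in with a product of two low-rank factors, one of shape d_out × r and one of shape r × d_in, so the trainable count per adapted matrix is r × (d_in + d_out). The sketch below assumes a 4096 × 4096 attention projection purely as an illustrative size; the function names are hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the full weight matrix that LoRA leaves frozen."""
    return d_in * d_out


HIDDEN = 4096  # assumed projection size, for illustration only
RANK = 16      # the rank the post says handles most tasks

adapter = lora_trainable_params(HIDDEN, HIDDEN, RANK)
frozen = full_params(HIDDEN, HIDDEN)
fraction = adapter / frozen  # share of the matrix's parameters that is trained

print(f"adapter={adapter:,} frozen={frozen:,} fraction={fraction:.4%}")
```

Doubling the rank doubles the adapter size, which is the capacity-versus-memory trade-off the post describes.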

How to Fine-Tune a Small Language Model Locally https://codewithfimi.com
Lakshay Arora, PhD
3w Edited
I built RepliCode, a multi-agent system that turns research papers into verified, executable Python code.

The problem with existing paper-to-code tools: they generate plausible-looking code and judge quality with another LLM. Nobody checks if the code actually runs.

RepliCode closes the loop with three specialized agents orchestrated via LangGraph:
→ Planner Agent: reads the paper once, extracts structure: file plan, key equations, numerical claims, dependencies
→ Coder Agent: generates code from retrieved chunks, executes in a sandbox, reads real tracebacks, and self-debugs iteratively
→ Critic Agent: independently reviews the final code against the paper and flags any deviations

The pipeline: PDF → parse → chunk → hybrid retrieval (vector + BM25 + RRF) → plan → generate → execute → refine on errors → critic verification → evaluation.

Tested on "Attention Is All You Need" (2017) and TabReason (2025) by RBC Borealis, a paper the model has never seen:
→ Complex queries (full Transformer) self-debug in up to 5 iterations
→ Critic reports: Faithful across all tests
→ Works pretty well on papers published recently (still some quirks are there :P)

The key insight: execution is the only ground truth. Existing tools stop at "looks right." RepliCode stops at "runs correctly and an independent agent verified it matches the paper."

Live demo: https://lnkd.in/e9TKZqvg
Project GitHub link: https://lnkd.in/dd4wRmmG

This is Phase A: prompting-based, GPT-4o-powered. Phase B will be more challenging: it aims to train a small open-weights model using RL, where the reward is simply: did the code run and reproduce the paper's numbers? Planning to share more on that soon.

Meanwhile, I’d love honest feedback. What would you improve? What's missing? Drop a comment or DM me.

I’d also like to acknowledge Krish Naik Sir. A lot of the practical intuition behind building systems like this came from learning through his YouTube content.

#MachineLearning #LLM #RAG #MultiAgent #LangGraph #MLEngineering #AI #PapertoCode #GenerativeAI #ArtificialIntelligence
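The RRF step in the hybrid retrieval stage (vector + BM25 + RRF) refers to Reciprocal Rank Fusion, where each document scores the sum of 1 / (k + rank) over the rankings it appears in. The sketch below is a generic illustration, not RepliCode's implementation; `reciprocal_rank_fusion` is a name assumed here, and k = 60 is a commonly used default rather than a value taken from the post.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuses several ranked lists of document ids into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + position)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical retriever outputs, for illustration only.
vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Because only ranks are used, the vector and BM25 scores never need to be put on a common scale, which is the usual reason for picking RRF in hybrid retrieval.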
