Spent about $3.50 on a single RTX 4090 to figure out why recent papers on hybrid AR + diffusion language models keep contradicting each other. Some report that adding an autoregressive planner improves diffusion-model reasoning. Others report that it degrades it. Both are right, in different projections.

Starting observation: on LLaDA-8B, prepending the literal string "Plan: " to a GSM8K question costs 8pp of accuracy. No plan content. Just the word.

That single number forced a decomposition. Hybrid AR/DDLM reasoning fails along at least three orthogonal axes:

1. Interface-format brittleness: how much accuracy drops from any plan-shaped scaffold, content-free or otherwise.
2. Planner-content trust: how much the model uses upstream plan content once the prefix shape is absorbed.
3. Sampling-diversity preservation: whether fine-tuning collapses or expands the stochastic branches that consensus mechanisms rely on.

A small (r=8) prefix-robustness LoRA flattens axis 1 from 8pp damage to within 1pp. Axis 2 turns out to be capacity-dependent, in opposite directions across planner sizes, something previously unmeasured. Axis 3 unexpectedly expanded under format-augmented training rather than collapsing, the inverse of the standard encoder-collapse story.

The consensus-distillation track was the most instructive part. A late-block LoRA designed to distill majority-vote into a single forward pass plateaued at 70.5% across a 3.25x capacity bump. Looked like architectural impossibility. It wasn't: two design errors were masking each other. Fixing both recovered accuracy to 79%, within sampling error of target.

Generalizable lesson: parameter-efficient distillation of sampling-based inference mechanisms requires the surgery to match the temporal structure of the original mechanism. A plateau across capacity is not, by itself, evidence the distillation is impossible.

Workshop-scope, not main-conference. Single seed, N=200, GSM8K only. Limitations flagged honestly in the appendix.
Total compute under $4. If you work on hybrid AR/diffusion or parameter-efficient distillation, especially if you've seen similar prefix-shape damage on other DDLMs, I'd be interested to compare notes. #MachineLearning #LLM #DiffusionModels
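For a rough sense of what "within sampling error" means at N=200, here is a minimal sketch of the binomial standard error of an accuracy estimate. It assumes each GSM8K item is an independent pass/fail trial, which is a simplification the post does not state, and the function name is made up for illustration.

```python
import math

def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n independent items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

# Figures quoted in the post: N=200, plateau at 70.5%, recovered 79%.
N = 200
for acc in (0.705, 0.79):
    se = accuracy_standard_error(acc, N)
    # 1.96 standard errors is an approximate 95% interval (normal approximation).
    print(f"acc={acc:.3f}  se={se * 100:.1f}pp  ~95% interval=+/-{1.96 * se * 100:.1f}pp")
```

Under that assumption one standard error is roughly 3pp at this sample size, so gaps of a point or two are hard to separate from noise without more items or more seeds.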
More Relevant Posts
-
DeepSeek V4 Flash on 2X RTX Pro 6000 at 80+ tps (+110% throughput)

Every --speculative-config mtp running on the popular DeepSeek-V4-Flash quant has been a no-op. I noticed last week. Here's what I found, and what +110% decode throughput looks like.

In the DeepSeek-V4 modeling code, this regex lives in the model class:

_keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"]

That tells Hugging Face transformers to silently drop, at load time, any weight whose name starts with mtp. or has mtp. directly after a dot.

Result: 1,575 Multi-Token Prediction tensors disappeared on load. The community's GPTQ quantization pass never saw them. The published checkpoint shipped without them. And vLLM's speculative decoding feature, advertised as supported, ran with nothing to speculate from.

I retrofitted the MTP head back in. The work:

1. Spliced upstream's MTP block back into the checkpoint
2. Ran a dedicated GPTQ pass on 768 routed-expert tensors, matching the base's W4A16 INT4 group=128 format
3. Calibrated on 17,701 MTP forward dumps captured live from the base model running on real prompts (473k tokens of calibration data)
4. Wrote vLLM patches to handle six progressive load-time errors before the checkpoint would even open

Results on 2× RTX PRO 6000 Blackwell Max-Q:

- 53 → 85 tok/s at 524k context (+62%)
- ~111 tok/s at 128k context (+110%)

Quality benchmarks track the base quant; output is unaffected by design because the main model verifies every speculated draft.

The model and patches are public: https://lnkd.in/eMNSkJHj

#deepseek #nvidia #cuda #smi
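To see which names the quoted pattern catches, here is a minimal sketch that applies it to a handful of weight names. The names are invented for illustration, the helper is hypothetical, and it assumes the pattern is applied with search semantics on each key, which is an assumption about the loader rather than something shown in the post.

```python
import re

# Pattern quoted in the post.
MTP_IGNORE = re.compile(r"(^|\.)mtp\..*")

def split_checkpoint_keys(keys):
    """Partition weight names into (kept, ignored) using re.search on each name."""
    kept, ignored = [], []
    for name in keys:
        (ignored if MTP_IGNORE.search(name) else kept).append(name)
    return kept, ignored

# Hypothetical weight names, invented for illustration only.
example_keys = [
    "model.layers.0.self_attn.q_proj.weight",
    "mtp.0.embed.weight",
    "model.mtp.0.mlp.experts.3.down_proj.weight",
    "model.xmtp.weight",
]
kept, ignored = split_checkpoint_keys(example_keys)
print("kept:", kept)
print("ignored:", ignored)
```

The pattern matches "mtp." at the start of a name or directly after a dot, so a nested "model.mtp." prefix is caught while an unrelated name such as "model.xmtp.weight" is not.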
-
The real takeaway is not “35B on a 3060”. It is how cheap local inference just got

A 35B MoE model running comfortably on a 12GB RTX 3060 is not just a hobbyist benchmark. It is a signal about where local AI is becoming practical.

The interesting part here is not that Qwen3.6-35B-A3B-MTP can run. The interesting part is the tradeoff profile:

- 32k context is realistic on 12GB VRAM
- Generation lands around 43–47 tokens/sec when tuned well
- q8 KV cache appears basically “free” on this setup
- MTP only adds about 2% over well-tuned plain decoding
- The real cliff is MoE offload: push `-ncmoe` too low and performance collapses

That last point matters. The win is not “use speculative decoding and everything becomes magic.” The win is that a carefully tuned plain decoding setup is already strong enough for daily coding work on consumer hardware.

For teams and developers, the practical takeaway is simple: You may not need a high-end GPU to get useful local coding assistance with large MoE models. But you do need to understand the constraint surface:

- VRAM headroom matters
- MoE placement matters
- KV cache choices matter
- Benchmarks from interactive runs can mislead
- More aggressive offload is not always faster

The best profile from the test was not the absolute fastest one. It was the one that kept 32k context, stayed stable, and still generated around 43 tokens/sec.

That is the operator lesson. Local AI is not becoming easier because the knobs disappeared. It is becoming useful because the knobs are finally worth turning.
-
I just opened my first contribution to vLLM: issue #43700.

The short version: bitsandbytes INT8 quantization at batch=1 is 4x slower than FP16 on NVIDIA L4, despite using 3.6x less memory. I traced it through CUDA profiling to dequantization overhead, which adds a separate memory movement step before each matmul.

Linear and matmul operations consume 34% of CUDA execution time in this workload. Attention consumes 0.2%. The workload is memory-bandwidth-bound on the linear path, so any extra memory movement in that path is expensive.

The regression disappears at batch=16 because the cost gets amortized across the batch. The problem is specific to small batch sizes, exactly the conditions in latency-sensitive production endpoints where people are most likely to reach for quantization to reduce memory footprint.

The full benchmark across six optimization techniques on 317,486 real ShareGPT prompts is open-sourced here: https://lnkd.in/e2K4GEzu

Issue: https://lnkd.in/eRirW8TQ
-
🚀 Just fine-tuned Qwen2-VL-2B to convert document images into structured Markdown, and the results are genuinely exciting!

For Assignment 05, I built an end-to-end vision-language training pipeline from scratch:

📄 Dataset: Paired document images + Markdown from the Nougat dataset
🤖 Model: Qwen2-VL-2B-Instruct with LoRA adapters (rank 16, alpha 32)
⚡ Training: Dual T4 GPUs on Kaggle, fp16 mixed precision, AdamW + cosine LR scheduler
📊 Evaluation: ROUGE-1/2/L metrics + zero-shot vs fine-tuned comparison

Key engineering wins:

✅ Custom ChatML dataset class with image validation
✅ Checkpoint resume + auto-push to Kaggle every 200 steps
✅ LoRA adapter merged & exported as a standalone deployable model

Fine-tuning a 2B multimodal model on consumer-grade GPUs with PEFT is the future of accessible AI research. 🔥

Full pipeline + code 👇
🔗 https://lnkd.in/dyRYy_3e

#GenerativeAI #MultimodalAI #FineTuning #LoRA #ComputerVision #Qwen2VL #DocumentAI #MachineLearning #DeepLearning #AIProjects #KaggleAI #PEFT
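To put "rank 16, alpha 32" in concrete terms, here is a small arithmetic sketch of what a LoRA adapter adds to a single linear layer. The layer width of 1536 is a hypothetical value chosen for illustration, not a figure from the post, and the helper names are made up.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer.

    The update is B @ A with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)

def lora_scaling(alpha: float, rank: int) -> float:
    """Conventional LoRA scale factor applied to the low-rank update B @ A."""
    return alpha / rank

RANK, ALPHA = 16, 32      # values from the post
D_IN = D_OUT = 1536       # hypothetical layer width, not from the post

full = D_IN * D_OUT
extra = lora_extra_params(D_IN, D_OUT, RANK)
print(f"full layer: {full:,} params, LoRA adds: {extra:,} ({100 * extra / full:.2f}%)")
print(f"update scale alpha/r = {lora_scaling(ALPHA, RANK)}")
```

With these assumed shapes the adapter trains about 2% of the layer's parameter count, which is the main reason this kind of run fits on small GPUs.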
-
Between 7% and 34% of the bits in every trained LLM are dead weight.

We measured Shannon entropy vs allocated bits across 30 open-weight models from 9 labs (0.6B to 1.4T params):

- BF16: 10.6 / 16 (66%)
- FP8: 6.5 / 8 (80%)
- MXFP4, NVFP4, INT4 (per-element + scales): ~93%

At byte-level formats the slack is all in the exponent: weight magnitudes cluster between 2⁻⁷ and 2⁻⁶ in every model we measured, and the distributions are universal: shift by the mean, rescale by the stddev, every model collapses onto one curve.

Sub-byte formats finally close most of the slack, but only by factoring the per-element exponent into per-block scales.

Full post: https://lnkd.in/e38QaPcW
In search of wasted bits: how much information do LLM weights carry? | Doubleword (blog.doubleword.ai)
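As a rough illustration of the kind of measurement described above, here is a minimal sketch that estimates the Shannon entropy of the 8-bit exponent field for a synthetic weight tensor. The Gaussian weights with standard deviation 0.02 are an assumption standing in for real model weights, the helper names are made up, and this is not the authors' measurement code.

```python
import math
import random
import struct
from collections import Counter

def shannon_entropy_bits(symbols) -> float:
    """Empirical Shannon entropy, in bits per symbol, of a stream of hashable symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def float32_exponent(x: float) -> int:
    """8-bit exponent field of x cast to float32 (truncating to BF16 keeps this field)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return (bits >> 23) & 0xFF

random.seed(0)
# Synthetic stand-in for a weight tensor: zero-mean Gaussian, std 0.02 (an assumption).
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]
exp_entropy = shannon_entropy_bits(float32_exponent(w) for w in weights)
print(f"exponent field: {exp_entropy:.2f} bits of entropy in 8 allocated bits")
```

Because the magnitudes concentrate in a few binades, only a handful of exponent values occur with any frequency, so the field carries far fewer than its 8 allocated bits. That is the slack the post attributes to the exponent.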
-
This is very, very cool. It means there is 7-34% of weight compressibility that you can get 'for free' in almost every single model, which directly translates to token throughput improvements and cost reductions. (More on this weight & KV cache compression research to come!) Source: Doubleword Inference lab
-
Why More Developers Are Switching to Local LLMs + Cursor AI in 2026 Dual RTX 4090 setups, private inference, zero monthly fees, and full code ownership. Here’s my complete guide on the best local LLMs and how to set up a powerful offline dev environment with Cursor. 👉 https://lnkd.in/d3f9fEct #CursorAI #LocalLLM #DeveloperTools #OfflineAI #AIForDevelopers
-
Multimodal embeddings and rerankers let you map text, images, audio and video into a shared space and score mixed‑modality pairs with the same Sentence Transformers API — so cross‑modal search and visual document retrieval become first‑class operations. Use an embedder for fast, precomputed retrieval and a multimodal CrossEncoder to rerank top‑k candidates for higher quality. Watch GPU/VRAM needs (VLMs can be large) and expect lower absolute cross‑modal scores due to the modality gap; rely on relative ordering. Practical implication: build scalable RAG or search pipelines by storing document embeddings (images/screenshots included) and applying a reranker only on the shortlisted items to balance latency and accuracy. How would you integrate multimodal reranking into your current retrieval stack? #multimodal #retrieval #sentenceTransformers #AIengineering Source: https://lnkd.in/gKQ-_Sug
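The retrieve-then-rerank pattern described above can be sketched without any model at all. In this minimal example the 2-D vectors stand in for stored multimodal embeddings and a lookup table stands in for a multimodal cross-encoder; both are made up for illustration, and none of this uses the Sentence Transformers API itself.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_then_rerank(query_vec, doc_vecs, rerank_score, top_k=3):
    """Stage 1: cosine shortlist over precomputed embeddings. Stage 2: rerank only the shortlist."""
    shortlist = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)[:top_k]
    return sorted(shortlist, key=rerank_score, reverse=True)

# Toy vectors standing in for stored embeddings of images and text (made up).
docs = {
    "invoice.png": [0.9, 0.1],
    "chart.png": [0.7, 0.7],
    "memo.txt": [0.1, 0.9],
    "photo.jpg": [-1.0, 0.0],
}
# Stand-in for a cross-encoder: any callable mapping a doc id to a relevance score for this query.
fake_reranker_scores = {"invoice.png": 0.2, "chart.png": 0.8, "memo.txt": 0.5, "photo.jpg": 0.9}

ranked = retrieve_then_rerank([1.0, 0.2], docs, fake_reranker_scores.get, top_k=2)
print(ranked)
```

Only the shortlisted items reach the expensive scorer, which is where the latency saving comes from. The flip side is that an item the reranker would have scored highly never gets considered if the embedder leaves it off the shortlist.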
-
Most people use AI models. Last week, I fine-tuned one locally on my own laptop.

I built and fine-tuned a coding-focused LLM using:

- Qwen3-0.6B
- QLoRA
- Hugging Face Transformers
- PEFT
- bitsandbytes
- RTX 3050 Laptop GPU

The interesting part wasn’t just the training. It was understanding how much real AI engineering happens BEFORE trainer.train().

Some things I learned during the process:

→ Raw datasets are messy
→ Chain-of-thought traces can actually hurt small-model fine-tuning
→ Data cleaning matters more than most people think
→ Quantization makes local AI genuinely practical
→ LoRA/QLoRA completely changed what consumer GPUs can do

I went through:

- CUDA setup
- VRAM optimization
- dataset preprocessing
- chat template formatting
- LoRA adapter training
- inference debugging
- Hugging Face deployment

The model was trained locally on Windows using an RTX 3050 and then published to Hugging Face.

One of the coolest moments was seeing the model’s behavior actually shift after fine-tuning: more structured responses, engineering-style outputs, and copilot-like formatting.

This project gave me a much deeper appreciation for:

- efficient fine-tuning
- edge AI
- open-source LLM ecosystems
- practical ML engineering workflows

We’re entering a phase where running and customizing AI models locally is becoming accessible to individual developers, not just large companies. And honestly, that future feels incredibly exciting.

Check It Out Yourself: https://lnkd.in/dZwfPDwn

#AI #MachineLearning #LLM #HuggingFace #QLoRA #OpenSource #GenerativeAI #Python #DeepLearning #LocalLLM #Transformers #AIEngineering
-
I assumed the hard part of local image generation was fitting the model into 8GB of VRAM. Squeezing Flux Schnell onto an RTX 4060, quantizing, managing memory swaps: that's the engineering problem I signed up for.

The actual hard part is getting useful output. An image is worth a thousand words, and that works against you. A human spots "something is off" in milliseconds. A slightly wrong shadow, a perspective that doesn't land, proportions that feel uncanny. Your eye is a brutal evaluator.

Then there's the prompting surface. Flux has two text encoders, CLIP and T5, each responding to different aspects of the prompt. SD3.5 Medium adds a second CLIP encoder on top of that. Three prompts, each controlling something different, each interacting in ways that aren't obvious. This is a multi-variable optimization problem where the feedback signal is your own visual intuition.

I tried Stable Diffusion 3.5 Medium first. The output quality wasn't there; it hallucinated. Flux Schnell on the same hardware was harder to fit, but gave me noticeably better results with fewer encoders to wrestle.

The biggest surprise: spatial language doesn't work the way you'd expect. "Cabin in front of a mountain" gives you unpredictable framing. "Cabin in top of the frame" works better. Objective, layout-relative instructions beat perspective-based ones. It reminded me of the GPT-3.5 days.

Text rendering is another known gap: it doesn't work well in either model, and both consistently double letters and words. Fortunately, game assets don't need text, so I can sidestep that entirely.

I'm still early in this. The prompting workflow takes more iteration than I expected, and I haven't cracked consistent output yet, but I'm getting there. Image generation prompt tips are welcome.

#LocalAI #GameDev #ImageGeneration #PromptEngineering #AIEngineering
-