Scaling LLM Reasoning Using Parallel Processing


Summary

Scaling LLM reasoning using parallel processing means making large language models (LLMs) faster and better at solving problems by running multiple tasks at the same time rather than one after another. By splitting reasoning tasks and processing them in parallel, organizations can get answers faster, handle bigger datasets, and tackle more complex challenges, often without moving to a larger model.

Adopt parallel techniques: Set up your LLM workflows to process batches of data simultaneously, which can dramatically reduce waiting times and allow for quicker results.
Use smarter training methods: Explore structured strategies, like Markovian reasoning or multi-query frameworks, to help your models think more efficiently and keep memory usage steady.
Pick the right model size: Choose LLMs and scaling approaches based on your task’s complexity, as smaller models with smart processing can sometimes outperform much larger ones.

Summarized by AI based on LinkedIn member posts

Eduard Parsadanyan

Guiding businesses to vertical AI productivity | Practical implementation strategist | Beyond AI hype | n8n & low-code expert

4,050 followers 1y
I cut my LLM processing time by 14x with one simple change

I just witnessed the power of parallel processing with LLMs, and the results are too good not to share. Last month, I posted a video demonstrating three approaches to running LLM calls efficiently in n8n. https://lnkd.in/exJ6SdrC

The third method – parallel calls to the same Basic LLM chain – delivered significant time savings on a real-world project. I needed to categorize a catalog of items with unstructured text descriptions, a perfect test case.

The numbers speak for themselves. Using Gemini 2.5 Flash, each individual request averaged 3 seconds (ranging from 2.5 to 8.5 seconds per item). Running these sequentially would have taken around 4 minutes. With parallel processing? The entire batch completed in under 18 seconds. That's almost a 14x speed improvement!

While my case was relatively small, imagine this same optimization applied to thousands of items. The difference between minutes and seconds adds up quickly at scale, potentially turning hour-long jobs into five-minute tasks.

Haven't implemented parallel processing for your LLM workflows yet? Watch the video I shared last month and grab the free workflow template. Your future self will thank you when that next big batch processing task lands on your desk.
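The post's workflow lives in n8n, so the following is only a rough Python sketch of the same idea: issue the per-item requests concurrently instead of one after another. The `categorize` function is a hypothetical stand-in that sleeps instead of calling a model, and the concurrency cap of 10 is an assumed value, not something the post specifies.

```python
import asyncio
import time


async def categorize(item: str) -> str:
    """Hypothetical stand-in for one LLM request; sleeps instead of calling an API."""
    await asyncio.sleep(0.05)  # simulated network + model latency
    return f"category-for:{item}"


async def run_sequential(items):
    # One request at a time: total time is roughly the sum of all latencies.
    return [await categorize(item) for item in items]


async def run_parallel(items, max_concurrency=10):
    # The semaphore caps in-flight requests (assumed limit) to respect rate limits.
    gate = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with gate:
            return await categorize(item)

    # gather() returns results in input order, regardless of completion order.
    return await asyncio.gather(*(guarded(item) for item in items))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


items = [f"item-{i}" for i in range(20)]
seq_result, seq_seconds = timed(run_sequential(items))
par_result, par_seconds = timed(run_parallel(items))
print(f"sequential: {seq_seconds:.2f}s, parallel: {par_seconds:.2f}s")
```

With real APIs the achievable speedup depends on provider rate limits and on the slowest request in the batch, so the cap is worth tuning per provider.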

Kuldeep Singh Sidhu

Senior Data Scientist @ Walmart | BITS Pilani

16,490 followers 8mo
Breaking: RAG-R1 Framework Revolutionizes How LLMs Handle External Knowledge

Researchers from AWorld Team and Inclusion AI have just released RAG-R1, a groundbreaking training framework that fundamentally changes how Large Language Models interact with external knowledge sources during reasoning.

The Core Innovation

Traditional RAG systems suffer from a critical bottleneck: they generate only a single search query when external retrieval is needed, leading to substantial inference time and limited knowledge acquisition. RAG-R1 solves this with multi-query parallelism - enabling models to generate up to three parallel search queries simultaneously.

Under the Hood Architecture

The framework operates through a sophisticated two-stage training process:

Stage 1: Format Learning SFT - The system generates samples integrating reasoning and search, segmented into four distinct categories. Models learn to respond in a "think-then-search" format using special tokens like <think>, <search>, and <answer> to structure their reasoning process.

Stage 2: Retrieval-Augmented RL - Employs Proximal Policy Optimization with outcome-based rewards to enhance reasoning capabilities. The system implements retrieval masked loss to prevent retrieved tokens from interfering with the model's inherent reasoning abilities.

Technical Breakthrough

The multi-query parallelism returns results in JSON format, clearly aligning search queries with retrieved documents. This approach reduces retrieval rounds by 11.1% while maintaining comparable time per retrieval operation.

Performance Impact

Testing on seven question-answering benchmarks using Qwen2.5-7B-Instruct as the backbone model showed remarkable results:
- Up to 13.2% improvement over strongest baselines
- Significant performance gains across both general QA and multi-hop reasoning tasks
- Excellent generalization across out-of-domain datasets

The framework addresses the fundamental challenge of LLMs generating hallucinated or outdated responses by enabling adaptive leverage of both internal and external knowledge during the reasoning process. This represents a significant step forward in making AI systems more reliable and grounded in real-world knowledge.
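To make the multi-query idea concrete, here is a small sketch, not the RAG-R1 implementation: it pulls up to three queries out of a `<search>` span, runs them concurrently, and returns JSON that pairs each query with its documents. The toy corpus, the `retrieve` function, and the use of `;` to separate queries are all assumptions made for illustration.

```python
import json
import re
from concurrent.futures import ThreadPoolExecutor

MAX_QUERIES = 3  # the post says the model emits up to three parallel queries

# Toy corpus standing in for a real retriever (hypothetical data).
CORPUS = {
    "capital of france": ["Paris is the capital of France."],
    "population of paris": ["Paris has roughly two million residents."],
}


def retrieve(query: str) -> list[str]:
    """Hypothetical retriever: exact-match lookup in the toy corpus."""
    return CORPUS.get(query.strip().lower(), [])


def extract_queries(model_output: str) -> list[str]:
    """Pull queries from a <search>...</search> span; ';' as separator is an assumption."""
    match = re.search(r"<search>(.*?)</search>", model_output, flags=re.DOTALL)
    if match is None:
        return []
    queries = [q.strip() for q in match.group(1).split(";") if q.strip()]
    return queries[:MAX_QUERIES]


def parallel_search(model_output: str) -> str:
    """Run all extracted queries concurrently and return query-aligned JSON."""
    queries = extract_queries(model_output)
    if not queries:
        return json.dumps([])
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        documents = list(pool.map(retrieve, queries))  # map() preserves query order
    # Pair each query with its own documents so the model can tell them apart.
    return json.dumps(
        [{"query": q, "documents": d} for q, d in zip(queries, documents)]
    )


output = (
    "<think>I need two facts.</think>"
    "<search>capital of France; population of Paris</search>"
)
print(parallel_search(output))
```

Issuing the queries in one round is what lets the number of retrieval rounds drop, since several facts arrive before the next reasoning step.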

Yuxiong He

Head of AI Research @ Snowflake | Distinguished Scientist | Foundation Models, Agents, and AI Systems

13,589 followers 2mo
🚀 Snowflake AI Research introduces Jacobi Forcing, a new training paradigm that transforms standard LLMs into native causal parallel decoders, achieving up to 4× wall-clock speedup while maintaining near-AR generation quality.

A core bottleneck in LLM inference is serial decoding: generating one token at a time. While diffusion LLMs enable parallelism, they rely on expensive non-causal post-training that departs from the autoregressive pretraining recipe, often degrading quality and breaking KV-cache optimizations.

Jacobi Forcing addresses this by training models on their own Jacobi decoding trajectories, gradually shifting autoregressive models into efficient parallel decoders while preserving their causal backbone.

Highlights:
⚡ 3.8× wall-clock speedup on coding and math benchmarks
📈 4.5× more tokens accepted per forward pass with multi-block decoding and rejection recycling
🏆 Near-AR generation quality, delivering a much better speed–quality tradeoff than diffusion LLMs
🔧 Compatible with existing KV-cache-based serving systems: no draft models or architectural changes required

📄 Paper: https://lnkd.in/gbSgdzjZ
💻 Code: https://lnkd.in/gxVas7sG
📝 Blog: https://lnkd.in/gSuZVmE7

Huge thanks to Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Zhijie Deng, and Hao Zhang for this outstanding work. Great collaboration among UC San Diego, Shanghai Jiao Tong University, and Snowflake.

#SnowflakeAI #SnowflakeAIResearch #LLM #LLMInference #OpenSource #AIResearch #ParallelDecoding #MachineLearning
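For readers unfamiliar with Jacobi decoding, the toy sketch below shows the underlying fixed-point iteration only, not the Jacobi Forcing training recipe. A whole block of tokens is guessed, every position is refreshed from the previous guess, and the loop stops when nothing changes; the result matches greedy token-by-token decoding. The `next_token` function is a made-up deterministic stand-in for a model.

```python
def next_token(context: tuple[int, ...]) -> int:
    """Toy deterministic 'model' (hypothetical): greedy next token for a context."""
    return (sum(context) * 31 + len(context)) % 50


def greedy_decode(prompt, n):
    """Standard autoregressive decoding: one token per model call."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_token(tuple(tokens)))
    return tokens[len(prompt):]


def jacobi_decode(prompt, n):
    """Refine an n-token block until it stops changing; return (block, iterations)."""
    block = [0] * n  # arbitrary initial guess for the whole block
    iterations = 0
    while True:
        iterations += 1
        # Every position is refreshed from the previous iterate. In a real model
        # this is one batched forward pass rather than a Python loop.
        new_block = [
            next_token(tuple(prompt) + tuple(block[:i])) for i in range(n)
        ]
        if new_block == block:  # fixed point reached
            return block, iterations
        block = new_block


prompt = [1, 2, 3]
block, iterations = jacobi_decode(prompt, 8)
print(block == greedy_decode(prompt, 8), iterations)
```

After iteration k the first k tokens are guaranteed correct, so the loop needs at most n + 1 passes. Any speedup comes from passes in which several tokens settle at once, which is the behavior the post says the training is designed to encourage.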

Ksenia Se

AI inferencer at Turing Post

7,076 followers 7mo
This week signaled a shift from brute reflection to more structured, reusable, and probabilistic reasoning – making computation count.

One of the top papers is "The Markovian Thinker" from Mila - Quebec Artificial Intelligence Institute & Microsoft. It lets LLMs reason with a fixed-size state, so compute per chunk stays the same no matter how long the reasoning chain gets. This makes RL linear-cost and memory-constant. The team introduced Delethink, an RL setup that trains models to be Markovian Thinkers by redesigning the environment in which they learn. In effect, it lets models reason across 96K tokens for an estimated 7 H100-months, versus 27 for LongCoT-RL.

Here is how it works:

1. Delethink structures reasoning into fixed-size chunks instead of one long, ever-growing chain of thought. After each chunk, it resets the context but carries over a short textual summary from the last chunk (written by the model itself) so it can continue reasoning smoothly.
2. This design puts a limit on how much context the model ever "sees" at once (say, 8K tokens), so:
- Compute cost grows linearly with thinking length (not quadratically)
- Memory stays constant since old tokens are dropped
- Models can reason for tens of thousands of tokens
3. Thanks to this, models become Markovian Thinkers able to:
- "think" in steps
- remember just enough
- reason indefinitely without blowing up compute

In terms of efficiency, Delethink is faster and cheaper than LongCoT-RL:
• One RL step of Delethink takes 215s vs. 249s for LongCoT-RL.
• It generates 8,500 vs. 6,000 tokens/sec on an H100 GPU.
• At test time, Delethink keeps improving even when reasoning far beyond its training length, solving problems that need 100K+ tokens of reasoning after training at 24K, whereas LongCoT-RL plateaus.

The Markovian Thinker reframes reasoning itself as a probabilistic state-transition process – bridging cognitive architectures and modern LLMs. By grounding reasoning in formal efficiency guarantees, it turns reflection into structured, reusable computation that scales linearly, stays memory-constant, and keeps thinking efficiently. And that's a new wave of reasoning.
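The chunk-and-reset loop described above can be sketched as follows. This is an illustration of the control flow only, not the Delethink training setup: `toy_generate` is a hypothetical stand-in for a model, lengths are counted in characters rather than tokens, and taking the tail of the previous chunk as the carryover is an assumption.

```python
import re
from typing import Callable


def markovian_think(
    question: str,
    generate: Callable[[str, int], str],
    chunk_budget: int = 8192,
    carryover_chars: int = 200,
    max_chunks: int = 12,
    stop_marker: str = "FINAL:",
) -> tuple[str, int]:
    """Reason in fixed-size chunks; return (last chunk, largest prompt length seen)."""
    carryover = ""
    largest_prompt = 0
    chunk = ""
    for _ in range(max_chunks):
        # The prompt is rebuilt from scratch: question plus short carryover only.
        prompt = question + "\n" + carryover
        largest_prompt = max(largest_prompt, len(prompt))
        chunk = generate(prompt, chunk_budget)
        if stop_marker in chunk:
            break
        # Older text is dropped; only the tail of this chunk survives the reset.
        carryover = chunk[-carryover_chars:]
    return chunk, largest_prompt


def toy_generate(prompt: str, budget: int) -> str:
    """Hypothetical model stub: continues a step counter found in the carryover."""
    steps = re.findall(r"step (\d+)", prompt)
    n = int(steps[-1]) + 1 if steps else 1
    text = "x" * 500 + f" step {n}"
    if n >= 5:
        text += " FINAL: done"
    return text[:budget]


question = "What is 17 * 24?"
answer, largest = markovian_think(question, toy_generate, carryover_chars=200)
print(answer[-20:], largest)
```

Because the prompt never grows beyond the question plus the carryover, the per-chunk cost stays bounded however many chunks are needed.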

Chris Fregly

Engineering and Product Leader (AWS, Databricks, Netflix), Investor, Advisor, Friend

42,332 followers 1y
TL;DR
🧠 Smaller LLMs outperform giants: A 1B LLM can surpass a 405B LLM on reasoning tasks like MATH-500 using compute-optimal Test-Time Scaling (TTS).
🚀 Efficiency boost: Smaller models achieve higher accuracy with 14.1× faster inference and 256× fewer FLOPs compared to larger models.
🔍 Key insight: TTS strategies depend on policy model size, Process Reward Models (PRMs), and problem difficulty.

Problems & Solutions
🛑 Problem 1: Lack of systematic analysis of how policy models, PRMs, and problem difficulty affect TTS.
✅ Solution: Introduced reward-aware compute-optimal TTS to dynamically adapt strategies.
🛑 Problem 2: PRMs struggled with out-of-distribution (OOD) responses and token-length bias.
✅ Solution: Implemented absolute difficulty thresholds and PRM-Vote aggregation to improve robustness.

Experiments & Setup
📚 Tasks: MATH-500 (500 problems) and AIME24 (advanced math challenges).
🤖 Models: Llama 3 (1B-405B), Qwen2.5 (0.5B-72B), and DeepSeek-R1 variants.
⚖️ Metrics: Pass@k, token efficiency, FLOPs comparison.
🔧 Ablations: PRM scoring methods (Min/Last/Avg) and voting strategies (Majority/PRM-Max/PRM-Vote).
💻 Hardware: 8×A100 GPU clusters for TTS experiments with beam width=4 and max tokens=8192.

Novel Insights
🧩 Policy model size matters: Best-of-N (BoN) works well for large models, while Beam Search and DVTS excel for smaller ones.
📉 PRM limitations: Observed over-criticism, error neglect, and token-length bias in PRMs, impacting TTS performance.
⚖️ Trade-off: TTS gains diminish as policy model size increases (e.g., 154.6% gain for 1B vs. 9.5% for 72B).

Improvements Over Prior Work
🚀 135× size gap: A 3B model outperforms a 405B model, improving the prior benchmark of 23×.
🔬 Enhanced PRMs: Qwen2.5-Math-PRM-72B enables 7B models to surpass o1 and DeepSeek-R1.
⏱️ Efficiency: 1B model + TTS achieves 256× fewer FLOPs compared to 405B CoT models.

Key Implementation Details
🔄 Reward-aware TTS: Integrated PRM scores into a Markov Decision Process (MDP) framework for dynamic scaling.
🌳 DVTS: Parallel subtree exploration for diverse reasoning paths.
📉 Absolute difficulty bins: Replaced quantile-based thresholds with fixed Pass@1 ranges (easy: 50%-100%, medium: 10%-50%, hard: 0%-10%).

Resources
Paper: Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (https://lnkd.in/g55ybikb)
🤖 Models: Llama-3.2-3B-Instruct (https://lnkd.in/gnQ3d87S), Qwen2.5-Math-PRM (https://lnkd.in/gk6gMqMw).
🔧 Framework: OpenR (https://lnkd.in/gCPxPR4H) for TTS pipelines.
📊 Datasets: MATH-500 (https://lnkd.in/g4jvAzsp), PRM800K (https://lnkd.in/gEb6XE3A).
🌐 Project Page: Compute-Optimal TTS (https://lnkd.in/gVutpamZ).
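The scoring and voting options named in the ablations can be sketched in a few lines. This is one plausible reading of those names, not code from the paper: each candidate is a final answer plus per-step PRM scores, Min/Last/Avg collapse the step scores into one number, and the three voting rules pick an answer from several sampled candidates. The sample candidates and the handling of bin boundaries are assumptions.

```python
from collections import Counter, defaultdict


def prm_score(step_scores, method="last"):
    """Collapse per-step PRM scores into one score for a whole solution."""
    if method == "min":
        return min(step_scores)
    if method == "last":
        return step_scores[-1]
    if method == "avg":
        return sum(step_scores) / len(step_scores)
    raise ValueError(f"unknown method: {method}")


def majority_vote(candidates):
    """Pick the most frequent final answer, ignoring PRM scores."""
    return Counter(answer for answer, _ in candidates).most_common(1)[0][0]


def prm_max(candidates, method="last"):
    """Pick the answer of the single highest-scoring candidate."""
    return max(candidates, key=lambda c: prm_score(c[1], method))[0]


def prm_vote(candidates, method="last"):
    """Sum scores over candidates that share an answer; pick the largest total."""
    totals = defaultdict(float)
    for answer, step_scores in candidates:
        totals[answer] += prm_score(step_scores, method)
    return max(totals, key=totals.get)


def difficulty_bin(pass_at_1):
    """Fixed Pass@1 ranges from the post; boundary handling is an assumption."""
    if pass_at_1 >= 0.5:
        return "easy"
    if pass_at_1 >= 0.1:
        return "medium"
    return "hard"


# Hypothetical sampled solutions: (final answer, per-step PRM scores).
candidates = [
    ("42", [0.9, 0.8, 0.6]),
    ("42", [0.7, 0.5, 0.55]),
    ("17", [0.9, 0.9, 0.95]),
    ("8", [0.4, 0.3, 0.2]),
]
print(majority_vote(candidates), prm_max(candidates), prm_vote(candidates))
```

In this example PRM-Max follows the single most confident candidate, while Majority and PRM-Vote both favor the answer that several candidates agree on.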

Sohrab Rahimi

Director, AI/ML Lead @ Google

23,837 followers 3mo
Multi-agent debate improves reasoning, but it does so by spending more compute at inference time. More agents, more rounds, more latency. The implicit belief is that stronger reasoning requires coordination at runtime. “AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent” from CMU, William and Mary, Georgia Tech, Amazon, and UBC questions that belief.

Today, teams either deploy a single model and accept its limits, or deploy a multi-agent system and accept the cost. AgentArk proposes a third path. Use multi-agent debate to generate better reasoning traces offline, then train a single model to absorb those patterns. At deployment, you run one model, not a committee.

The methodology has three steps:

1. They collect debate transcripts from several teacher agents solving the same problem. These transcripts include not only final answers but revisions and corrections.
2. They filter for correct solutions and keep multiple valid reasoning paths when they differ structurally.
3. They train a student model in stages. Basic fine-tuning teaches it to reproduce reasoning traces. Data augmentation exposes it to multiple correct solution paths. The strongest method adds a process reward model that scores each reasoning step. The student is then optimized to produce steps that are logically consistent, not just answers that look right.

Compared with how things are commonly done today, this is a shift from answer-level supervision to process-level supervision. Instead of training on input-output pairs alone, the model is trained on how to think through the problem.

Empirically, the distilled single model consistently outperforms the base single-agent model and approaches the accuracy of the full multi-agent debate, while retaining single-agent efficiency. Gains are strongest when the supervision targets reasoning steps rather than just final answers.

Two findings matter in practice:

First, the strength of the process reward model has more impact than simply increasing student size. A better evaluator transfers more reasoning skill.

Second, more trajectories are not automatically better. Adding large amounts of debate data without filtering does not guarantee improvement. High-quality corrective traces drive the gains.

I think this study is very practical. Multi-agent systems do not need to be your serving architecture. They can be your training engine. You pay the coordination cost once during training and deploy a single model that captures much of the reasoning benefit. This saves significant cost and reduces latency.

Paper: https://lnkd.in/ekuUsJQT
GitHub: https://lnkd.in/emxgRvzR
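As a loose illustration of the filtering step in the methodology (step 2), the sketch below keeps only transcripts with the correct final answer and drops ones whose reasoning has the same shape as a transcript already kept. The transcript layout and the idea of using the sequence of step operations as a "structure signature" are assumptions for this example; the post does not say how the paper measures structural difference.

```python
def filter_traces(transcripts, gold_answer):
    """Keep correct transcripts, at most one per distinct reasoning structure."""
    kept = []
    seen_signatures = set()
    for transcript in transcripts:
        if transcript["answer"] != gold_answer:
            continue  # drop incorrect solutions
        # Hypothetical structure signature: the ordered operation labels.
        signature = tuple(step["op"] for step in transcript["steps"])
        if signature in seen_signatures:
            continue  # same shape as a kept trace, so it adds no diversity
        seen_signatures.add(signature)
        kept.append(transcript)
    return kept


# Hypothetical debate outputs for one problem whose correct answer is "12".
transcripts = [
    {"agent": "A", "answer": "12",
     "steps": [{"op": "factor"}, {"op": "solve"}]},
    {"agent": "B", "answer": "12",
     "steps": [{"op": "factor"}, {"op": "solve"}]},
    {"agent": "C", "answer": "12",
     "steps": [{"op": "guess"}, {"op": "verify"}, {"op": "revise"}]},
    {"agent": "D", "answer": "15",
     "steps": [{"op": "factor"}, {"op": "solve"}]},
]
kept = filter_traces(transcripts, gold_answer="12")
print([t["agent"] for t in kept])
```

A filter of this kind reflects the post's second practical finding: unfiltered debate data does not guarantee improvement, so selection happens before any training.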

Ashutosh Hathidara

Senior ML Scientist @SAP AI | Machine Learning Researcher | Opensource Creator | Motion Graphics Designer

50,948 followers 5mo
LLMs can be trained to think in parallel, aggregate their thoughts, and then respond with a holistic understanding. The Native Parallel Reasoner (NPR) paper from NLCO Lab and BIGAI introduces a three-stage training process to accomplish exactly this, and it doesn't even require generating synthetic datasets beforehand.

📍 Stage 1: Take the base instruct model and run RL with Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) so that it learns to generate responses in the correct format. The model still generates text sequentially at this stage; what matters here is the format. Call the resulting model NPR-ZERO.

📍 Stage 2: First, take NPR-ZERO and generate parallel thought-answer pairs for each question. Then, filter the data to reject incorrectly generated pairs (based on format and accuracy). Finally, perform SFT on this data using parallel attention masking and positional embeddings. This self-distillation helps the model learn to generate in the presence of the new parallel attention masks and PEs. Call the resulting model NPR-BETA.

📍 Stage 3: Take NPR-BETA and run RL training with the newly introduced PAPO (Parallel-Aware Policy Optimization), a slight modification of DAPO for such parallel inference. The RL process looks fairly similar to GRPO at a high level, but the hero here is the parallel inference and thought aggregation.

The authors claim significant improvement on reasoning benchmarks. Kudos to the authors. The paper link is in the first comment. 👇

#AI #LLM #RL #MachineLearning #Reasoning
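The post does not spell out what the parallel attention mask and positional embeddings look like, so the sketch below shows one common construction under stated assumptions rather than the NPR design: tokens are tagged with a segment id (0 for shared text such as the prompt and the final aggregation, 1..k for parallel branches), a branch token may attend to shared tokens and its own branch but not to sibling branches, and all branches restart their positions from the same offset.

```python
def build_parallel_mask(segments):
    """mask[i][j] is True when token i may attend to token j (assumed rule)."""
    n = len(segments)
    return [
        [
            j <= i  # causal: never look ahead
            and (
                segments[i] == 0              # shared tokens see everything before them
                or segments[j] == 0           # every token sees shared tokens
                or segments[i] == segments[j] # a branch sees itself, not its siblings
            )
            for j in range(n)
        ]
        for i in range(n)
    ]


def build_position_ids(segments):
    """Branches share a starting position; shared text resumes after the longest branch."""
    positions = []
    branch_lengths = {}
    shared_next = 0   # next position for a shared token
    branch_start = 0  # position where every open branch begins
    for seg in segments:
        if seg == 0:
            ends = [branch_start + length for length in branch_lengths.values()]
            shared_next = max([shared_next] + ends)
            positions.append(shared_next)
            shared_next += 1
            branch_start = shared_next
            branch_lengths = {}
        else:
            offset = branch_lengths.get(seg, 0)
            positions.append(branch_start + offset)
            branch_lengths[seg] = offset + 1
    return positions


# Two shared prompt tokens, two branches of two tokens each, one aggregation token.
segments = [0, 0, 1, 1, 2, 2, 0]
mask = build_parallel_mask(segments)
print(build_position_ids(segments))
```

Under this rule the branches can be generated independently because none of them depends on a sibling, and the final aggregation token still sees all of them.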

Jim Fan

NVIDIA Director of AI & Distinguished Scientist. Co-Lead of Project GR00T (Humanoid Robotics) & GEAR Lab. Stanford Ph.D. OpenAI's first intern. Solving Physical AGI, one motor at a time.

241,517 followers 1y
OpenAI Strawberry (o1) is out! We are finally seeing the paradigm of inference-time scaling popularized and deployed in production. As Sutton said in the Bitter Lesson, there are only 2 techniques that scale indefinitely with compute: learning & search. It's time to shift focus to the latter.

1. You don't need a huge model to perform reasoning. Lots of parameters are dedicated to memorizing facts, in order to perform well in benchmarks like trivia QA. It is possible to factor out reasoning from knowledge, i.e. a small "reasoning core" that knows how to call tools like browser and code verifier. Pre-training compute may be decreased.

2. A huge amount of compute is shifted to serving inference instead of pre/post-training. LLMs are text-based simulators. By rolling out many possible strategies and scenarios in the simulator, the model will eventually converge to good solutions. The process is a well-studied problem like AlphaGo's Monte Carlo tree search (MCTS).

3. OpenAI must have figured out the inference scaling law a long time ago, which academia is just recently discovering. Two papers came out on arXiv a week apart last month:
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. Brown et al. find that DeepSeek-Coder increases from 15.9% with one sample to 56% with 250 samples on SWE-Bench, beating Sonnet-3.5.
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Snell et al. find that PaLM 2-S beats a 14x larger model on MATH with test-time search.

4. Productionizing o1 is much harder than nailing the academic benchmarks. For reasoning problems in the wild, how to decide when to stop searching? What's the reward function? Success criterion? When to call tools like code interpreter in the loop? How to factor in the compute cost of those CPU processes? Their research post didn't share much.

5. Strawberry easily becomes a data flywheel. If the answer is correct, the entire search trace becomes a mini dataset of training examples, which contain both positive and negative rewards. This in turn improves the reasoning core for future versions of GPT, similar to how AlphaGo’s value network (used to evaluate quality of each board position) improves as MCTS generates more and more refined training data.
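The repeated-sampling result in point 3 is usually reported as pass@k: the chance that at least one of k samples solves a problem. A minimal sketch of the standard unbiased estimator follows, assuming n samples were drawn per problem and c of them were correct; the example counts are made up and are not taken from either paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn from n with c correct, passes."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k samples include a success
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 250 samples drawn, 4 of them correct.
for k in (1, 10, 50, 250):
    print(k, round(pass_at_k(250, 4, k), 3))
```

The benchmark-level number is the average of this quantity over problems, which is why coverage can keep climbing with more samples even when the single-sample rate is low.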

