The "Lost in the Middle" problem is the biggest hidden trap in RAG.

I am currently building the retrieval pipeline for a financial reconciliation tool (Refinely), and I'm actively researching the best ways to optimize how the LLM processes context.

The issue I am planning for: when a vector database fetches a large number of relevant financial documents, LLMs often ignore critical data buried right in the middle of the context window.

To solve this, I am looking at moving beyond basic vector search and implementing a two-step retrieval architecture. My top two considerations right now:

1. Semantic Chunking: breaking documents down by logical meaning rather than strict character counts.
2. Adding a Cross-Encoder: using a reranker to score and re-order the retrieved chunks before they ever reach the LLM.

Vector search gets the documents to the door, but traditional information retrieval principles are what actually put the right data in front of the model.

For the senior AI engineers here: if you were building a high-accuracy pipeline for financial data, what approach gave you the best ROI for fixing middle-context loss?

#RAG #MachineLearning #Python #FastAPI #AIArchitecture
Optimizing LLM Context for Financial Reconciliation
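The two-step architecture described in the post can be sketched end to end in plain Python. This is a toy illustration, not Refinely code: `cosine` stands in for bi-encoder similarity against the vector DB, and `rerank_score` is a placeholder for a real cross-encoder scoring (query, chunk) pairs. The final reordering places the strongest chunks at the edges of the prompt, a common mitigation for lost-in-the-middle:

```python
import math

def cosine(a, b):
    # Plain cosine similarity; stands in for the vector DB's ANN search
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def two_step_retrieve(query_vec, index, rerank_score, recall_k=50, final_k=5):
    """Step 1: cheap high-recall vector search.
    Step 2: stronger reranker on the small candidate set, then
    reorder so the best chunks sit at the edges of the context."""
    # Step 1: recall by bi-encoder similarity
    candidates = sorted(
        index, key=lambda d: cosine(query_vec, d["vec"]), reverse=True
    )[:recall_k]
    # Step 2: rerank with the (hypothetical) cross-encoder score
    reranked = sorted(
        candidates, key=lambda d: rerank_score(d["text"]), reverse=True
    )[:final_k]
    # Interleave ranks outward-in: best first, second-best last,
    # weakest chunks end up in the middle where attention is weakest
    ordered = [None] * len(reranked)
    left, right = 0, len(reranked) - 1
    for i, doc in enumerate(reranked):
        if i % 2 == 0:
            ordered[left] = doc
            left += 1
        else:
            ordered[right] = doc
            right -= 1
    return ordered
```

In a real pipeline the rerank step would call a cross-encoder model on each (query, chunk) pair; everything else keeps the same shape.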
More Relevant Posts
-
Most RAG pipelines are slow. Not because of bad code — because of bad architecture choices.

Every time your agent can't figure out what the user wants, it calls the LLM to decide. That's 2-3 seconds gone before anything useful happens. Do it twice and you've lost the user. On my last project the agent was taking 9 seconds end-to-end. That's not a product, that's a loading screen.

Here's what actually fixed it:
→ Replaced LLM-based intent classification with a local all-MiniLM-L6-v2 model
→ Moved document embeddings fully local — pgvector running on the same machine, no API round trips
→ LLM now only touches two nodes: SQL generation and final answer synthesis

Result: 9 seconds down to ~2-3 seconds. Same quality. A fraction of the API cost.

The agent still handles everything — 11 nodes, two pipelines (RAG + SQL), intent routing, policy checks, query validation. Nothing was cut. Just the unnecessary LLM calls were removed.

If your AI agent feels slow, the first question to ask isn't "which model should I use" — it's "which calls actually need a model at all?"

Full project on GitHub if you want to see the architecture 👇
https://lnkd.in/dSW76mta

#AI #RAG #AgenticAI #LLM #Python #FastAPI #BuildInPublic #GenAI #MachineLearning #SoftwareEngineering
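The local intent-routing idea can be sketched with toy numbers. The intent names and centroid values below are invented for illustration; in practice each centroid would be the mean all-MiniLM-L6-v2 embedding of example utterances for that intent, and the threshold would be tuned on held-out queries:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical 3-d intent centroids; real ones are 384-d MiniLM means
INTENT_CENTROIDS = {
    "sql_query": [0.9, 0.1, 0.0],
    "doc_search": [0.1, 0.9, 0.1],
    "smalltalk": [0.0, 0.1, 0.9],
}

def route_intent(query_vec, threshold=0.7):
    """Classify intent locally in microseconds -- no LLM round trip.
    Only escalate to the LLM when the local classifier is unsure."""
    best_intent, best_sim = max(
        ((name, cosine(query_vec, c)) for name, c in INTENT_CENTROIDS.items()),
        key=lambda t: t[1],
    )
    return best_intent if best_sim >= threshold else "llm_fallback"
```

The design point is the fallback branch: the cheap classifier handles the confident majority of queries, and the 2-3 second LLM call is only paid on ambiguous ones.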
-
I got tired of LLMs giving an essay when I just wanted code, so here is my solution.

I fine-tuned Qwen2.5-Coder-14B for data science code generation - DataSci-Coder. It is a 14.8-billion-parameter model trained with QLoRA on 10,795 curated code examples in under 2 hours on a single free-to-use L40S GPU.

Fine-Tuned vs Base Model Results:
- 93.3% instruction compliance vs 91.4%
- 10/10 code-only output compliance vs 6/10
- 12/12 complex tasks completed successfully (Bayesian inference, GANs, VAEs, survival analysis, stacking ensembles, etc.)

The best part: when told "respond with code only," the base model adds explanatory text 40% of the time, while my fine-tuned model follows the instruction 100% of the time!

Setup and READMEs available at:
GitHub: https://lnkd.in/e77bN4N7
HuggingFace: https://lnkd.in/eqjbeJns

#ML #OpenSource #LLM Qwen
-
#Day150 of Consistency
Week 22 of Data Structures & Algorithms, System Design
Progress Update (23 Mar 2026 → 30 Mar 2026)

This week was focused on strengthening advanced problem-solving patterns and connecting them with real interview thinking, especially around binary search, DP, graphs, and backtracking. Still following the same rule: one problem a day, but complete clarity.

Problems practiced:

Capacity to Ship Packages Within D Days (LC 1011)
• Binary search on answer
• Monotonic condition + simulation

House Robber II (LC 213)
• DP with circular constraint
• Breaking into two linear subproblems

Implement Trie (LC 208)
• Prefix tree design
• Efficient string storage and lookup

Shortest Path in Binary Matrix (LC 1091)
• BFS for shortest path
• 8-direction traversal

Number of Islands (LC 200)
• DFS on grid
• Connected components intuition

Remove K Digits (LC 402)
• Monotonic stack + greedy
• Minimizing the number by removing digits

Combination Sum (LC 39)
• Backtracking + pruning
• Decision tree exploration

Big takeaway this week:
Binary Search on Answer → solving optimization problems efficiently
DP → handling constraints like circular dependencies
Trie → optimizing prefix-based operations
BFS/DFS → grid problems as graph traversal
Monotonic Stack → greedy removal with structure
Backtracking → exploring combinations with pruning

Alongside DSA, I also continued working on System Design fundamentals:
• Practiced thinking in scalable flows and constraints
• Focused on breaking problems into components and trade-offs

Also continuing my Generative AI project using LangChain, LangGraph, and Vector DB, exploring LLM orchestration using the Google Gemini API.

The focus remains unchanged:
Clarity > Speed
Patterns > Memorization
Consistency > Motivation

GitHub – https://lnkd.in/d-BbV_nh

#DSA #LeetCode #Java #ProblemSolving #Consistency #SystemDesign #Graph #DynamicProgramming #Backtracking #BinarySearch #SoftwareEngineering #GenerativeAI #LangChain
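The first pattern in the list, binary search on the answer, is worth seeing concretely. A minimal solution to LC 1011 (my own sketch, not the author's code): the capacity is bounded below by the heaviest package and above by the total weight, and the feasibility check is a greedy simulation, which is monotonic in capacity.

```python
def ship_within_days(weights, days):
    """LC 1011: smallest capacity so all packages ship within `days` days."""
    def days_needed(cap):
        # Greedy simulation: open a new day when the next package overflows
        used, load = 1, 0
        for w in weights:
            if load + w > cap:
                used, load = used + 1, 0
            load += w
        return used

    lo, hi = max(weights), sum(weights)  # tight capacity bounds
    while lo < hi:
        mid = (lo + hi) // 2
        if days_needed(mid) <= days:     # feasible -> try smaller capacity
            hi = mid
        else:                            # infeasible -> need more capacity
            lo = mid + 1
    return lo
```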
-
👻 I finished automating my market analysis process to optimize the time spent on quantitative research and systematize the search for alphas when restructuring my portfolio.

My main objective was to abandon pure linear prediction. Instead, I built an asynchronous Directed Acyclic Graph (DAG) that prioritizes topological risk management and actively mitigates execution friction in the market.

1️⃣ Microservices and Data Architecture:
To ensure mathematical inference does not block the analytical interface, the system operates completely decoupled:
🏗️ UI & Orchestration: Reactive frontend in Streamlit, communicating with an API Gateway in FastAPI.
🏗️ Asynchronous Processing: Celery workers handling heavy computation in the background.
🏗️ Analytical Data Warehouse: DuckDB for columnar persistence.
The system ingests OHLCV data via the Alpaca API, automatically adjusting splits and dividends before tensors reach the model.

2️⃣ The Quantitative Engine (4-Phase DAG):
The entire pipeline is evaluated through rigorous Walk-Forward (Out-of-Sample) analysis to prevent information leakage (lookahead bias).
📑 Phase A (Feature Engineering): Fractional differentiation to achieve stationarity while preserving series memory, followed by PCA to isolate orthogonal risk factors.
📑 Phase B (Directional Alpha): XGBoost models trained with a custom loss function (BCE + L2 turnover penalty). The algorithm mathematically penalizes excessive rotation, protecting capital from the bid-ask spread and commissions.
📑 Phase C (Defense and Regimes): Hierarchical Risk Parity (HRP) to structure a topological risk mitigation base, operating in parallel with a Hidden Markov Model (HMM) that classifies macroeconomic volatility regimes.
📑 Phase D (Bayesian Fusion): Mathematical integration via Black-Litterman. The model takes HRP as the conservative prior and XGBoost predictions (filtered by a sigmoid function) as directional views, generating the optimal weights to execute in the Order Management System (OMS).

I will be sharing more technical details about the design of this ecosystem. Open to feedback, debate, and connections with colleagues in financial engineering and quantitative development.

#QuantitativeFinance #MachineLearning #Python #XGBoost #AlgorithmicTrading #SoftwareEngineering #DataEngineering
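The walk-forward evaluation mentioned above is easy to get subtly wrong. A minimal splitter that guarantees each test window lies strictly after its training window, so no future bar ever leaks into fitting (an illustrative sketch, not this system's implementation):

```python
def walk_forward_splits(n, train_size, test_size, step=None):
    """Yield (train_idx, test_idx) index windows over n time-ordered bars.
    The test window always starts where the train window ends, so the
    model is only ever evaluated out-of-sample."""
    step = step or test_size  # default: non-overlapping test windows
    start = 0
    while start + train_size + test_size <= n:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += step  # roll the whole window forward
```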
-
Why did my zero-shot snow forecasting model fall apart on the biggest storm days? ❄️

When building the inference engine for PowDay.AI, I started with Chronos-2 for zero-shot time-series forecasting. It hit 79% accuracy—impressive, but far short of my >90% MVP target. It turns out foundation models struggle with the extreme, non-linear atmospheric rivers of the Sierra Nevada.

To fix this, I'm pivoting to fine-tuning. But extracting 10 years of historical NOAA GRIB2 files from AWS cold storage to align with local telemetry became a massive data engineering hurdle.

I wrote a deep dive on hitting the "79% Wall," engineering a highly defensive Python ingestion pipeline to survive S3 throttling, and the reality of preparing data for foundation models.

Read the full breakdown here: https://lnkd.in/gReU_KmU

#MachineLearning #DataEngineering #Python #TimeSeries #AI
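The kind of defensive ingestion loop the post alludes to, retrying throttled fetches with exponential backoff and jitter, might look like the sketch below. `ThrottledError` and the injected `fetch` are stand-ins invented here, not real boto3 APIs or PowDay code:

```python
import random
import time

class ThrottledError(Exception):
    """Stand-in for an S3 SlowDown / 503 throttling response."""

def fetch_with_backoff(fetch, key, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Retry a flaky fetch with exponential backoff and full jitter.
    `sleep` is injectable so tests don't actually wait."""
    for attempt in range(max_retries):
        try:
            return fetch(key)
        except ThrottledError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Full jitter spreads retries out and avoids synchronized storms
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The jitter matters: many workers backing off on the same schedule just re-create the throttling spike a few seconds later.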
-
Published a small benchmark paper as a side project — LORE (LLM OCR Robustness Evaluation). It evaluates how well LLMs extract structured data from corrupted OCR text, across 1,200 samples, 3 document domains, and 4 noise tiers. Tested on Llama 3.2, Phi 3.5, Qwen 2.5 7B, and Llama 3.3 70B.

Two things stood out:

1. Field F1 is a misleading metric here. Models scoring 0.82–0.95 on F1 diverge dramatically on exact match (0.42–0.66) and correction gain (-0.88 to +0.49). F1 just checks if the right field names are present, not whether the values are correct.

2. Instruction-tuned models handle this task noticeably better. Qwen 2.5 7B, despite being smaller than some competitors, achieved near-perfect schema validity (99.6%), the lowest hallucination rate among local models, and was the only sub-70B model with positive correction gain, suggesting that how a model is fine-tuned matters as much as size for structured extraction tasks.

Paper: https://lnkd.in/g9JVcs6u
GitHub: https://lnkd.in/g76dpdya
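The gap between field F1 and exact match is easy to demonstrate. A toy version of the two metrics, my own simplified reading of what the post describes rather than the paper's actual evaluation code:

```python
def field_f1(pred, gold):
    """F1 over field *names* only -- blind to whether values are right."""
    p, g = set(pred), set(gold)
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

def exact_match(pred, gold):
    """Fraction of gold fields whose predicted *value* is exactly right."""
    return sum(pred.get(k) == v for k, v in gold.items()) / len(gold)
```

A prediction with every field name present but a transposed digit in one value scores a perfect field F1 while exact match drops, which is precisely the failure mode that matters for OCR correction.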
-
RAG Optimization: Decoupling Retrieval Units from Context Units 🧠⚙️

Most RAG (Retrieval-Augmented Generation) pipelines fail because they use a 1:1 ratio for indexing and generation. If you index small chunks, the LLM loses context. If you index large chunks, the vector search gets "noisy" due to semantic dilution.

The solution? Parent-Child Hierarchical Indexing.

The System Design:
We move away from flat vector stores and toward a Recursive Metadata Reference model.
✅ Hierarchical Document Expansion: We shard the source document into Parent Nodes (~2048 tokens).
✅ Granular Sub-sampling: Each Parent is further decomposed into multiple Child Nodes (~128 tokens) with a 10-15% overlap.
✅ Asymmetric Vector Mapping: Only the Child Nodes are embedded and stored in the Vector DB (e.g., Pinecone, Milvus, Weaviate).
✅ The "Pointer" Resolution: Each Child Node contains a metadata key, a parent_id, pointing to its source Parent Node stored in a separate Document Store (like MongoDB or an in-memory dict).

The Retrieval Logic:
1. The Search: The user query hits the Vector DB. The system identifies the Top-K Child Nodes based on cosine similarity.
2. The Swap (The "Secret Sauce"): Instead of feeding these tiny child snippets to the LLM, the retriever performs a Metadata Lookup. It fetches the full Parent Nodes associated with those children.
3. The Context Injection: The LLM receives the high-density, high-context Parent text.

Why this kills "Naive RAG":
✅ Precision: Small child chunks act as "high-signal" anchors for the retriever.
✅ Recall: Parent chunks provide the "Global Context" needed to prevent hallucinations and "Lost in the Middle" errors.
✅ Efficiency: You get the low-latency benefits of small vector searches with the reasoning power of large context windows.

The takeaway: In production-grade AI, the retrieval unit should rarely be the same as the generation unit. 🛠️

#GenerativeAI #RAG #SystemDesign #LLMOps #VectorDatabase #LangChain #MachineLearning #PythonDev
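The retrieval logic above condenses into a small sketch. This is an illustrative in-memory version, not a production retriever: `cosine` stands in for the vector DB's similarity search, and a plain dict plays the role of the document store keyed by parent_id.

```python
import math

def cosine(a, b):
    # Stand-in for the vector DB's similarity metric
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class ParentChildRetriever:
    """Index small child chunks for precision; return their parents for context."""
    def __init__(self):
        self.doc_store = {}  # parent_id -> full parent text (the "MongoDB")
        self.vectors = []    # (child_vec, parent_id) -- only children are embedded

    def add_parent(self, parent_id, parent_text, child_vecs):
        self.doc_store[parent_id] = parent_text
        for vec in child_vecs:              # one vector per ~128-token child
            self.vectors.append((vec, parent_id))

    def retrieve(self, query_vec, top_k=2):
        ranked = sorted(
            self.vectors, key=lambda v: cosine(query_vec, v[0]), reverse=True
        )
        parents, seen = [], set()
        for _, pid in ranked:               # the "swap": child hit -> parent text
            if pid not in seen:
                seen.add(pid)
                parents.append(self.doc_store[pid])
            if len(parents) == top_k:
                break
        return parents
```

Deduplicating on parent_id matters: several sibling children often match the same query, and without it the same parent would fill multiple context slots.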
-
Why are you using a $2.00 reasoning model to solve a $0.01 logic problem?

I built a Claude Agent SDK workflow to run daily pipeline monitoring checks. It queried a database, checked thresholds, evaluated alerts, and generated a narrative report.

33 tool calls. 300K tokens. 7.5 minutes. $1.50 per run. $45/month.

Then I rebuilt it as a deterministic LLM pipeline. Same queries, same thresholds, same report — but the execution path is fixed in Python. The LLM only gets called once at the end to narrate the results.

Cost: $0.01 per run. $0.30/month.

The agent version wasn't smarter. It was just more expensive. It spent 60% of its token budget "thinking" through database connection retries that a try-except block handles in three lines.

Here's the distinction that matters:

If the input is structured and the rules are known, you don't need the model to decide what to do next. You need it to interpret what happened. That's a deterministic pipeline with an LLM at the end — not an agent loop.

If the input is high-variance and the path depends on what you find — "check every log for patterns, find what's declining, tell me why" — that's when an agent earns its cost. The model decides what to query next based on what it sees. That judgment is worth $3-5.

The trap is confusing autonomy with automation. An agent loop where you've predefined every query, every threshold, and every output format isn't autonomous. It's a $1.50 for-loop with a reasoning tax. Infrastructure becomes brittle when you replace logic with probability.

The takeaway: Before you reach for an agent SDK, ask who drives — you or the model? If you already know the path, code it. Call the LLM once at the end. Save agent loops for when the decision tree doesn't exist until the data is revealed.

#AI #SoftwareEngineering #AgenticAI #DataEngineering #BuildInPublic
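The "deterministic pipeline with an LLM at the end" shape fits in a few lines. A hedged sketch, with `fetch_metrics` and `narrate` as injected stand-ins for the real database query and the single LLM call; none of this is the author's actual code:

```python
def run_monitoring(fetch_metrics, narrate, thresholds):
    """Deterministic pipeline: fixed checks in Python, one LLM call at the end.
    The control flow never depends on model output -- the model only
    interprets what already happened."""
    metrics = fetch_metrics()
    # Threshold checks are plain logic, not tool calls
    alerts = [
        f"{name}: {metrics[name]} breaches threshold {limit}"
        for name, limit in thresholds.items()
        if metrics.get(name, 0) > limit
    ]
    # The single LLM call: narrate results, don't decide next steps
    report = narrate({"metrics": metrics, "alerts": alerts})
    return alerts, report
```

Because both collaborators are injected, the whole pipeline is unit-testable with stubs, which is exactly what an agent loop driven by model output is not.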
-
Yes! LLM agents are awesome at generating deterministic code to speed up business processes quickly and reliably!