Building a production-grade LLM service from scratch: Days 2 & 3 🧵

This week I moved past AI demos and started treating LLM integration as real infrastructure. Two endpoints, and a lot of decisions about what "production-ready" actually requires.

Day 2 - A properly structured FastAPI service

Wrapped the Gemini API in a /chat endpoint, with the focus on doing it correctly rather than just getting output:
→ Async-first design (async def + await) so the server handles concurrent load without blocking
→ Pydantic models validating every request and response, so invalid input never reaches business logic
→ pydantic-settings for configuration - secrets in environment variables, never hardcoded
→ A clean, maintainable layout: app/, config.py, routes/, services/, schemas/

The endpoint returning a response was the easy part. The structure that makes it maintainable and testable was the real work.

Day 3 - Streaming, and handling failure gracefully

An 8-second wait for a full response feels broken; streaming the same response feels responsive - even though total latency is identical. That UX gap is worth engineering for.

Built /chat/stream using Server-Sent Events and FastAPI's StreamingResponse, with three event types:
→ delta - text chunks as they're generated
→ usage - token counts on completion
→ error - upstream failures surfaced cleanly within the stream

I also tested the failure path most demos ignore: what happens when a client disconnects mid-stream? Left unhandled, the server keeps generating billable tokens for a client that's no longer listening. The fix - if await request.is_disconnected(): break - stops the server from consuming the stream, which lets cancellation propagate down to the SDK's connection to Google, so tokens that were never generated are not billed. At scale (100k requests/day, ~10% disconnect rate, 500-token responses), that's a meaningful cost saving.

Finally, I built a lightweight JS demo to experience the result from the user's side - a useful reminder that latency is as much about perception as raw speed.
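A minimal sketch of the streaming flow described above, written against the standard library only so it runs without FastAPI installed. The names sse, stream_chat and fake_upstream are assumptions for illustration, and the chunk counter is only a stand-in for real token counts; this is not the author's code. In a FastAPI route, a generator like this would typically be handed to StreamingResponse with the text/event-stream media type, with request.is_disconnected passed in as the disconnect check.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable


def sse(event: str, data: dict) -> str:
    """Format one Server-Sent Event frame: event name, JSON payload, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def stream_chat(
    upstream: AsyncIterator[str],
    is_disconnected: Callable[[], Awaitable[bool]],
) -> AsyncIterator[str]:
    """Yield delta/usage/error frames, stopping early once the client is gone."""
    output_chunks = 0  # hypothetical stand-in for the provider's token counts
    try:
        async for text in upstream:
            if await is_disconnected():
                break  # stop pulling from upstream; no usage frame is sent
            output_chunks += 1
            yield sse("delta", {"text": text})
        else:
            # Runs only when the upstream finished without a break.
            yield sse("usage", {"output_chunks": output_chunks})
    except Exception as exc:
        # Surface upstream failures inside the stream instead of dropping it.
        yield sse("error", {"message": str(exc)})
    finally:
        # Close the upstream iterator so its connection can be released.
        aclose = getattr(upstream, "aclose", None)
        if aclose is not None:
            await aclose()


async def fake_upstream() -> AsyncIterator[str]:
    """Hypothetical model stream used only for this demo."""
    for piece in ["Hello", " ", "world"]:
        yield piece


async def collect(frames: AsyncIterator[str]) -> list:
    """Drain an async iterator of frames into a list."""
    return [frame async for frame in frames]
```

Whether stopping early actually reduces billing depends on the provider cancelling generation when the connection closes, so it is worth verifying against usage reports rather than assuming.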
Two days, two endpoints, and a clearer sense of the gap between a working prototype and a production service. #AI #MachineLearning #Python #FastAPI #LLM #SoftwareEngineering #BackendDevelopment
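A rough worked version of the cost estimate quoted in the post (100k requests/day, ~10% disconnect rate, 500-token responses). The function name and the assumption that a disconnect happens, on average, halfway through a response are illustrative and not from the post; the result is a token count, to be multiplied by whatever output-token price applies.

```python
def tokens_not_generated_per_day(
    requests_per_day: int,
    disconnect_percent: int,
    response_tokens: int,
    avg_fraction_remaining: float,
) -> int:
    """Estimate tokens saved per day by cancelling generation on disconnect."""
    disconnects = requests_per_day * disconnect_percent // 100
    return int(disconnects * response_tokens * avg_fraction_remaining)


# Figures from the post, plus an assumed 50% of each response still ungenerated.
saved = tokens_not_generated_per_day(100_000, 10, 500, 0.5)
print(saved)  # 2500000
```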
Building a Production-Ready LLM Service with FastAPI
More Relevant Posts
-
From "What happened?" to "Oh, I see!" 🚀

We've all been there: a sea of red text in the terminal and no clear path forward. That's what inspired me to build Log Detective.

I wanted to create a tool that acts as a bridge between raw terminal data and actual solutions. By combining a modern web interface with the power of the GitHub Copilot CLI, Log Detective identifies patterns in your logs and tells you exactly what's going wrong.

What I learned building this:
1. How to effectively pipe CLI outputs into a web application.
2. Deepening my expertise in TypeScript for both frontend and backend logic.
3. Designing an interface that makes technical debugging feel intuitive.

I'm really proud of how this turned out. I'd love to hear your thoughts on how you're using AI to improve your daily workflow!

link: https://lnkd.in/epbR-Vhk https://lnkd.in/eYFJWN_A

#BuildingInPublic #SoftwareEngineer #AI #LogDetective #TypeScript #TechCommunity
-
Documenting a 10-package React library is harder than building one.

Tour Kit spans 150+ exported APIs across 10 packages, each with its own hooks, components, and TypeScript types. The core alone has 12 hooks and 30+ type exports.

We benchmarked the getting-started path: 7 minutes in Vite, 9 minutes in Next.js. That "time to first working tour" metric has been the most useful quality signal for our documentation.

Three patterns that made monorepo docs navigable:
1. Unified search across all packages (Orama, client-side, zero API calls)
2. 200+ cross-package links connecting 60+ doc pages
3. Package-scoped navigation mirroring the repo structure

We also generate /llms.txt files so AI tools give accurate answers about our API. 88% of companies now use AI in documentation workflows (McKinsey Q4 2025). Making docs machine-readable isn't optional anymore.

Full documentation hub: https://lnkd.in/gzKtkQmw

#react #typescript #opensource #webdevelopment #documentation
-
𝗗𝗶𝘁𝗰𝗵𝗶𝗻𝗴 𝘁𝗵𝗲 𝗰𝗵𝗮𝘁𝗯𝗼𝘁 𝘄𝗿𝗮𝗽𝗽𝗲𝗿. 𝗞𝗼𝘇𝗮𝗸𝗘𝘆𝗲 𝗢𝗦 𝗶𝘀 𝗻𝗼𝘄 𝗮 𝗱𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗲𝗱, 𝗺𝘂𝗹𝘁𝗶-𝗮𝗴𝗲𝗻𝘁 𝘄𝗼𝗿𝗸𝘀𝗽𝗮𝗰𝗲 (𝗣𝗵𝗮𝘀𝗲 𝟭𝟱.𝟮). The exact engineering debt I cleared in this sprint to make concurrent AI streams stable: ⚡ 𝗦𝗦𝗘 𝗗𝗲𝗺𝘂𝗹𝘁𝗶𝗽𝗹𝗲𝘅𝗶𝗻𝗴: To prevent three concurrent agent streams (Chef, Auditor, Architect) from locking the Vue Virtual DOM, I wrote a custom client that demultiplexes chunks directly into isolated shallowRef buffers. High-frequency updates route directly to hardware-accelerated CSS animations. 🧠 𝗧𝗿𝗶-𝗦𝘁𝗼𝗿𝗲 𝗠𝗲𝗺𝗼𝗿𝘆 & `/𝗸𝗶𝗻𝗲𝗰`: Session context is no longer a flat string. Triggering the `/kinec` wrap-up command kicks off FastAPI BackgroundTasks to extract behavioral traits and index them into ChromaDB and an associative NetworkX graph. 📍 𝗨𝗜 𝗖𝗼𝗼𝗿𝗱𝗶𝗻𝗮𝘁𝗲 𝗣𝗲𝗿𝘀𝗶𝘀𝘁𝗲𝗻𝗰𝗲: Fixed layout amnesia. Widget states, absolute positions (useDraggable.js), and FSM engine logic sync with PostgreSQL in real-time and hydrate strictly from localStorage on initialization. The underlying repository architecture has scaled to 388 nodes and 558 edges, validated via an automated CLI codebase audit. 🔗 Full architecture and multi-agent HUD logic are live on GitHub (link in bio). #KozakEyeOS #SystemArchitecture #AppliedAI #MultiAgentSystems #Python #FastAPI #Vue3 #GraphRAG #SSEStreaming #WebDevelopment #BuildWithGemini
-
I built an AI code review agent in ~120 lines of TypeScript. No framework. No LangChain. Just the Anthropic SDK.

Here's what I learned about how agents actually work.

Most people think of AI as a request-response loop: you prompt, it replies. An agent is different in exactly three ways:
1. Tools: it can call functions you define (read a file, run a command, query an API)
2. A loop: after using a tool, it sees the result and decides what to do next
3. A goal: it keeps going until it reaches a terminal state, not until the context window fills up

The core pattern is embarrassingly simple:

    // Pseudocode: claude() and execute() stand for your own helpers.
    while (true) {
      const response = await claude(messages)
      // Record the assistant turn so the model sees its own tool requests.
      messages.push(response)
      if (response.stop_reason === 'end_turn') break
      for (const toolCall of response.tool_calls) {
        const result = execute(toolCall)
        messages.push(result) // feed the tool result back in
      }
    }

That's the entire agent architecture. Everything else is just engineering on top of this loop.

The tricky parts in production:
→ Token budget: messages accumulate fast on large codebases
→ Max iterations guard: agents can loop unexpectedly
→ Tool descriptions matter more than you'd think - they're part of the prompt

Full tutorial with code: https://lnkd.in/d8gVn9Q8

#AI #TypeScript #ClaudeCode #AnthropicSDK #SoftwareArchitecture #AIAgents #WebDevelopment
-
If you work with AI coding agents across multiple editors, you have probably noticed the same thing I have: task context is the weakest link.

The agent does not remember what you asked it last session. You switch from Cursor to Claude Code and lose the thread. Important constraints (do not touch this file, defer this for later) get buried in chat history and ignored on the next run.

I built Dockyard to fix this for myself. It is a local MCP server that gives agents a structured work-order queue. Each order is a markdown file with required sections: Objective, Steps, Files, Do Not Touch, Notes. The agent reads the queue, picks an order, does the work, marks it complete. Everything lives under ~/.dockyard, outside your git repo, on your machine only.

One install command wires up MCP and skills for Cursor, Codex, Claude Code, VS Code, Windsurf, Gemini, and Zed. From there you can just say something like "use dockyard and gh cli to get issue 43 and make appropriately sized work orders", and the agent handles the rest.

Built in Node and TypeScript.

    npm install -g dockyard-mcp
    dockyard --help

https://lnkd.in/eA-FMshb

#MCP #AIAgents #DeveloperTools #OpenSource #LocalFirst
-
𝐇𝐨𝐰 𝐈 𝐬𝐥𝐚𝐬𝐡𝐞𝐝 𝐦𝐲 𝐀𝐈 𝐂𝐨𝐝𝐢𝐧𝐠 𝐜𝐨𝐬𝐭𝐬 𝐛𝐲 50% 𝐮𝐬𝐢𝐧𝐠 𝐊𝐧𝐨𝐰𝐥𝐞𝐝𝐠𝐞 𝐆𝐫𝐚𝐩𝐡𝐬 📉🤖

Working in a monorepo with 650+ files and 400k+ lines of code, I hit a wall. Even the most advanced AI agents started getting "lost" in the architecture, leading to hallucinations and massive token bills.

The solution? 𝐆𝐫𝐚𝐩𝐡𝐢𝐟𝐲. Instead of letting my AI assistant search through files blindly, I integrated a local knowledge graph that maps the entire system's DNA.

The "Graphify Bench" results:
🚀 𝗗𝗶𝘀𝗰𝗼𝘃𝗲𝗿𝘆 𝗦𝗽𝗲𝗲𝗱: Instant architectural awareness across Bun and React.
💰 𝗧𝗼𝗸𝗲𝗻 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆: Explaining a complex Auth flow dropped from ~15.7k tokens to just ~7.5k tokens.
🧠 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗔𝗰𝗰𝘂𝗿𝗮𝗰𝘆: No more "searching for files that don't exist." The AI now follows extracted and inferred edges.

Why this matters: in a world of "infinite" context windows, we often forget that Context != Intelligence. By structuring our codebase into a graph, we give the AI a map instead of a flashlight.

The result? I'm spending less time debugging "where is this defined?" and more time building core features and dashboards.

Github: https://lnkd.in/ggahrjG8

#AI #WebDev #BunJS #React #Graphify #SoftwareArchitecture #Productivity #Monorepo
-
🤖 Built a full-stack AI chatbot with dual LLM provider support: switch between Groq (LLaMA-3) and OpenAI (GPT-4o) at runtime without restarting the app.

Here's what's inside 👇

⚙️ Architecture:
• FastAPI REST backend handles all model routing
• LangChain agent framework wired to Tavily for live web search
• Streamlit frontend for a smooth chat UI
• Provider-agnostic design: swapping LLMs is a config change, not a rewrite

🧠 Key learning: Building abstracted LLM pipelines that aren't locked to one provider is a critical production skill.

🔗 https://lnkd.in/gyq_JBmE

Drop a question below if you're working with LangChain or FastAPI! 👇

#LLM #LangChain #FastAPI #OpenAI #Groq #Python #BuildInPublic #MachineLearning
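One way the provider-agnostic idea above could look, as a standard-library sketch. ChatProvider, EchoProvider and ProviderRouter are hypothetical names, and EchoProvider is a stub rather than a real Groq or OpenAI client; the post's own implementation uses LangChain and may be structured quite differently.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


class ChatProvider(Protocol):
    """Anything that can turn a prompt into a reply."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoProvider:
    """Stub standing in for a real client; it labels and echoes the prompt."""

    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class ProviderRouter:
    """Routes calls to the active provider; switching needs no restart."""

    def __init__(self, providers: Dict[str, ChatProvider], default: str) -> None:
        if default not in providers:
            raise KeyError(f"unknown provider: {default}")
        self._providers = providers
        self.active = default

    def switch(self, name: str) -> None:
        if name not in self._providers:
            raise KeyError(f"unknown provider: {name}")
        self.active = name

    def complete(self, prompt: str) -> str:
        return self._providers[self.active].complete(prompt)


router = ProviderRouter(
    {"groq": EchoProvider("groq"), "openai": EchoProvider("openai")},
    default="groq",
)
router.switch("openai")  # runtime switch, e.g. driven by a config value
```

Callers depend only on the complete method, so adding a provider means registering one more object rather than touching route code.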
-
Built an AI-powered Appointment Booking Assistant that actually manages conversations instead of just responding to prompts. 🚀

This project combines conversational AI with real backend workflows, from collecting booking details to creating Google Calendar events with Meet links automatically.

Some things I focused on while building it:
• Designed a multi-step conversational flow using LangGraph
• Built a full-stack architecture with Next.js + FastAPI
• Implemented persistent chat sessions and booking state management
• Added conflict checking for appointment slots
• Integrated Google Calendar API for real scheduling automation
• Structured the backend with scalable service-based architecture

Tech Stack: FastAPI • LangGraph • LangChain • Groq LLM • SQLAlchemy • SQLite • TypeScript • Next.js 15 • Tailwind CSS

One of the most interesting parts was building the orchestration logic for handling state transitions like:
GREET → SERVICE_SELECTION → DETAILS_COLLECTION → CONFIRMATION → BOOKED

Working on projects like this keeps reminding me that AI applications become far more powerful when they are connected with real workflows, system design, and backend engineering, not just UI wrappers around APIs.

GitHub Repository: https://lnkd.in/g2i_EwvP

#ArtificialIntelligence #GenerativeAI #MachineLearning #LangChain #LangGraph #FastAPI #NextJS #Python #FullStackDevelopment #LLM #SoftwareEngineering #AIEngineering
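The state sequence quoted above can be sketched as a plain transition table. This is a standard-library illustration only: the post builds its flow with LangGraph, and BookingState, NEXT_STATE and advance are hypothetical names, with only the linear happy path from the post modelled.

```python
from enum import Enum, auto


class BookingState(Enum):
    GREET = auto()
    SERVICE_SELECTION = auto()
    DETAILS_COLLECTION = auto()
    CONFIRMATION = auto()
    BOOKED = auto()


# Allowed forward transitions, in the order given in the post.
NEXT_STATE = {
    BookingState.GREET: BookingState.SERVICE_SELECTION,
    BookingState.SERVICE_SELECTION: BookingState.DETAILS_COLLECTION,
    BookingState.DETAILS_COLLECTION: BookingState.CONFIRMATION,
    BookingState.CONFIRMATION: BookingState.BOOKED,
}


def advance(state: BookingState) -> BookingState:
    """Move to the next state, rejecting transitions out of a terminal state."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.name} is terminal; no further transition")
    return NEXT_STATE[state]


state = BookingState.GREET
while state in NEXT_STATE:
    state = advance(state)
print(state.name)  # BOOKED
```

Keeping transitions in one table makes illegal jumps (for example, GREET straight to BOOKED) impossible by construction, which is the kind of guarantee a graph-based orchestrator also aims to provide.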
-
I had a problem.

I love Claude Code. The vast plugin ecosystem, hooks, the `/usage` panel, the spinner that says "Canoodling…" while it greps your repo. The whole UX is dialed in.

I also love OpenCode. It's the open-source coding agent that talks to 75+ providers via the Vercel AI SDK - Anthropic, OpenAI, Google, Bedrock, OpenRouter, local llama.cpp, anything OpenAI-shaped. No vendor lock-in. MIT license.

But every time I jumped between them, I missed half the experience. Claude Code's polish OR OpenCode's freedom. Never both.

So I created the love child of both. Introducing OpenCode X.

What it does on top of upstream OpenCode:
→ Reads your existing ~/.claude/settings.json hooks and ~/.claude/plugins/installed_plugins.json. Same events, same env vars. Zero migration. Run your favorite Claude Code plugins (caveman, rtk, context-mode etc.) natively.
→ Spinner verbs, /usage cost panel, push-to-background chord - the Claude Code UX bits that make the TUI feel alive.
→ Tokens consumed by every tool call, shown transparently in the TUI.
→ Goal system: /goal <objective> and the agent loops autonomously, calls goal_complete with evidence when done. Configurable turn + token cap.
→ Tool output compression: a cheap model pre-compresses big tool outputs (3 templates: extract / summarize / filter) before they hit your expensive cloud model. 30-60% token savings on bash-heavy work.
→ Persistent + session memory. Cache stability via stable-prefix prompt split. 3-tier context safety net.
→ Doom loop detection, safe parallel tool calls within a single LLM round-trip.
→ MIT, no telemetry, bring your own keys.

See the full comparison here: https://lnkd.in/dBiKf35F

Would love feedback from anyone running multi-provider setups, especially if you've hit the "Claude Code is great but Anthropic-only" wall.

#opensource #developertools #opencode #claudecode #codingagents
-
AI tools save me real time. They also waste it if you're not careful. Here's where the line actually falls:

1. Scaffolding and boilerplate
Cursor handles the repetitive setup fast. New API routes, test skeletons, config files. I'm not writing that by hand anymore.

2. Refactoring familiar patterns
Extracting hooks, splitting components, renaming across files. Copilot gets it right most of the time. ✓ Good diff, merge, move on.

3. Edge cases and domain logic
This is where both tools fall apart. They don't know your data contracts, your validation rules, your business constraints. Generated code looks right. It isn't.

4. Cross-file context
Anything requiring understanding beyond the open file is a gamble. I always run it locally before trusting it.

The tool is fast. The verification is still on you.

Where do you draw the line between trusting generated output and reviewing it yourself?

#SoftwareEngineering #TypeScript #AIEngineering #BackendDevelopment #ReactJS