You're a #CTO. Your board asks: "What's our ROI on AI coding tools?" Your answer: "40% of our code is AI-generated!" They respond: "So what? Are we shipping faster? Are customers happier?"

Most CTOs are measuring AI impact completely wrong. Here's what some are tracking:
- Percentage of AI-generated code
- Developer hours saved per week
- Lines of code produced
- AI tool adoption rates

These metrics are like measuring how fast your assembly line workers attach parts while ignoring whether your cars actually start.

Here's what you SHOULD measure instead:
1. Delivered business value
2. Customer cycle time
3. Development throughput
4. Quality and reliability
5. Total cost of delivery (not just development)
6. Team satisfaction

Software development isn't a typing competition; it's a complex system. If AI makes your developers 30% faster but your deployment takes 2 weeks and QA adds another week, your customer delivery improves by maybe 7%. You've sped up the wrong part.

The solution: A/B test your teams. Give half your teams AI tools and measure business outcomes over 2-3 release cycles. Track what customers actually experience, not how much developers produce.

Companies that measure business impact from AI will pull ahead. Those measuring vanity metrics will wonder why their expensive tools aren't moving the needle.

Stop measuring how much code AI generates. Start measuring how much faster you deliver value to customers.

What are you actually measuring? And is it moving your business forward?

-> Follow me for more about building great tech organizations at scale. More insights in my book "All Hands on Tech"
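The "30% faster developers, roughly 7% faster delivery" claim in the post above can be checked with a few lines of arithmetic. This is only a sketch: the post gives 2 weeks of deployment and 1 week of QA but not the development time, so the 1 week of development used here is an assumption, as is reading "30% faster" as "30% less development time".

```python
def cycle_time_improvement(dev_weeks, other_weeks, dev_time_reduction):
    """Fraction by which end-to-end delivery time shrinks when only development speeds up."""
    before = dev_weeks + other_weeks
    after = dev_weeks * (1 - dev_time_reduction) + other_weeks
    return (before - after) / before


# From the post: 2 weeks deployment + 1 week QA = 3 weeks outside development.
# Assumed (not in the post): development itself takes 1 week.
improvement = cycle_time_improvement(
    dev_weeks=1.0, other_weeks=3.0, dev_time_reduction=0.30
)
print(f"End-to-end delivery improves by about {improvement:.1%}")
```

Under these assumptions the result lands near the post's "maybe 7%". The larger the share of the cycle spent outside development, the smaller the end-to-end gain from faster coding.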
Optimizing Technology Spending
Explore top LinkedIn content from expert professionals.
-
-
Last quarter, my AI inference costs hit $100,000 annualized. I started small. Six months earlier, I was spending $200 a month on Claude. Then I added three agent subscriptions : Codex, Gemini, & Claude Code. I was paying $600 a month. Next I started using AI to transform my todo list into my done list, increasing tasks to 31 per day. $92 daily inference invoices started arriving. Then $400 per month on browser agents. Within two quarters, my inference spend grew from $7,200 to $43,000 to over $100,000 run rate. So I migrated to an open source model. It took a weekend. The key was building the right testing loops : I had six months of historical task data, so I could replay requests through the new model & hill-climb to parity with AI agents working through the night. By Sunday evening, they performed identically. At 12% of the cost. I’m not the only one paying attention to this cost. Technology companies are adding a fourth component to engineering compensation : salary, bonus, options, & inference costs. Levels.fyi pegs the 75th percentile software engineer salary at $375k. Add $100k in inference & the fully loaded cost is $475k. That’s 21% in tokens. The question CFOs will pose : what am I getting for all this inference spend? Can I do it cheaper? If the metric for a new cloud is gross profit per GPU hour, the employee equivalent is : productive work per dollar of inference. For me, the answer is 31 tasks a day at $12k annually. The engineer still burning $100k? They’d better be 8x more productive! Will you be paid in tokens? In 2026, you likely will start to be.
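The "testing loop" described in the post above (replaying historical tasks through a new model and iterating until it reaches parity) might look roughly like the sketch below. Everything here is hypothetical: the `HistoricalTask` record, the exact-match comparison, and the stub model are illustrations, not the author's actual setup. A real harness would likely use a task-specific scorer rather than string equality.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class HistoricalTask:
    prompt: str
    reference_output: str  # output the incumbent model produced for this prompt


def normalize(text: str) -> str:
    """Collapse case and whitespace so trivial formatting differences are not failures."""
    return " ".join(text.lower().split())


def replay(
    tasks: List[HistoricalTask], candidate: Callable[[str], str]
) -> Tuple[float, List[HistoricalTask]]:
    """Run every logged prompt through the candidate; return parity rate and mismatches."""
    failures = [
        t for t in tasks
        if normalize(candidate(t.prompt)) != normalize(t.reference_output)
    ]
    rate = 1.0 - len(failures) / len(tasks) if tasks else 0.0
    return rate, failures


# Stand-in for a real model call (hypothetical): a lookup table with one wrong answer.
def stub_candidate(prompt: str) -> str:
    return {"capital of France": "paris", "2 + 2": "5"}.get(prompt, "")


history = [
    HistoricalTask("capital of France", "Paris"),
    HistoricalTask("2 + 2", "4"),
]
rate, failures = replay(history, stub_candidate)
print(f"parity: {rate:.0%}, still failing: {[t.prompt for t in failures]}")
```

The list of failures is what drives the hill-climbing: adjust prompts or configuration, replay again, and stop once the parity rate is acceptable.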
-
You don't need a 2 trillion parameter model to tell you the capital of France is Paris. Be smart and route between a panel of models according to query difficulty and model specialty!

New paper proposes a framework to train a router that routes queries to the appropriate LLM to optimize the trade-off b/w cost vs. performance.

Overview:
Model inference cost varies significantly. Per one million output tokens: Llama-3-70b ($1) vs. GPT-4-0613 ($60), Haiku ($1.25) vs. Opus ($75)

The RouteLLM paper proposes a router training framework based on human preference data and augmentation techniques, demonstrating over 2x cost saving on widely used benchmarks.

They define the problem as having to choose between two classes of models:
(1) strong models - produce high quality responses but at a high cost (GPT-4o, Claude3.5)
(2) weak models - relatively lower quality and lower cost (Mixtral8x7B, Llama3-8b)

A good router requires a deep understanding of the question’s complexity as well as the strengths and weaknesses of the available LLMs.

Explore different routing approaches:
- Similarity-weighted (SW) ranking
- Matrix factorization
- BERT query classifier
- Causal LLM query classifier

Neat Ideas to Build From:
- Users can collect a small amount of in-domain data to improve performance for their specific use cases via dataset augmentation.
- Can expand this problem from routing between a strong and weak LLM to a multiclass model routing approach where we have specialist models (language vision model, function calling model etc.)
- Larger framework controlled by a router - imagine a system of 15-20 tuned small models and the router as the n+1'th model responsible for picking the LLM that will handle a particular query at inference time.
- MoA architectures: Routing to different architectures of a Mixture of Agents would be a cool idea as well. Depending on the query you decide how many proposers there should be, how many layers in the mixture, what the aggregate models should be etc.
- Route based caching: If you get redundant queries that are slightly different, then route the query+previous answer to a small model for light rewriting instead of regenerating the answer
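To make the strong-versus-weak routing idea above concrete, here is a minimal sketch. It is not RouteLLM's method: the paper trains its routers on preference data, whereas `toy_difficulty` below is a hand-written placeholder heuristic, and the threshold value is an arbitrary assumption. Only the two per-million-token prices come from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_million_output_tokens: float


# Prices quoted in the post above.
STRONG = Model("gpt-4-0613", 60.0)
WEAK = Model("llama-3-70b", 1.0)


def toy_difficulty(query: str) -> float:
    """Placeholder score in [0, 1]; a real router would be a trained classifier."""
    hard_markers = ("prove", "derive", "debug", "step by step")
    score = min(len(query.split()) / 50, 1.0)  # longer queries count as harder
    if any(marker in query.lower() for marker in hard_markers):
        score = max(score, 0.8)
    return score


def route(query: str, threshold: float = 0.5) -> Model:
    """Send the query to the strong model only when the score clears the threshold."""
    return STRONG if toy_difficulty(query) >= threshold else WEAK


def output_cost_usd(model: Model, output_tokens: int) -> float:
    return output_tokens / 1_000_000 * model.usd_per_million_output_tokens


for q in ["What is the capital of France?",
          "Prove that the sum of two even numbers is even."]:
    chosen = route(q)
    print(f"{q!r} -> {chosen.name} (${output_cost_usd(chosen, 500):.4f} per 500 tokens)")
```

Raising the threshold sends more traffic to the cheap model and lowers cost at the risk of lower quality. That threshold is the cost-versus-performance dial the post describes.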
-
Over the last 18 months, the FinOps Foundation has seen a dramatic shift in the scope of spending that #FinOps practices manage beyond public cloud. We first explored this anticipated shift in the second edition of Cloud FinOps (pg. 401) where we shared a vision for how we expected the scope of FinOps to expand: to a world where FinOps practices are integrating costs beyond public cloud – from SaaS, to licensing, datacenter, and private cloud – for a more complete picture of cost to drive value-based decision-making across a broader scope of spending. In recent surveys, we are seeing upwards of 70% of practitioners now extending their practice beyond public cloud to other types of technology spend. To reflect this reality, the FinOps Foundation Technical Advisory Council has approved a new element in the FinOps Framework to capture the segments associated with the different types of technology cost and usage data FinOps Practitioners are managing: FinOps Scope. Read more in the new Insights article on the expanded scope of FinOps: https://lnkd.in/gPH3vQEn In some cases, especially for companies “born in the cloud,” FinOps teams are the only technology cost management team in the organization. In other cases, FinOps Practitioners are working alongside Allied Personas (ITAM/ITSM/ITFM/TBM/SAM). But in all cases, FinOps’ success in managing cloud spending has the business asking “Can FinOps keep doing what you’re doing for cloud, AND also do it for X?” While other disciplines report on cost at a chargeback level, they do this for a monthly and quarterly roll-up of financial reporting at the general ledger level. FinOps, by contrast, is leveraging extremely granular cost and usage data at levels for all stakeholders, from engineering, to architecture, to product, to finance, and to executives, enabling them to: - Make information available outside of traditional silos to empower Personas across the organization, beyond Leadership – not just the CFO and CIO. 
- Enable timely decision-making about technology investment choices in “fixed” and variable Scopes. - Enable collaboration between technology and business teams at the engineering and product level. - Enable Cost Aware Product Decisions by bringing cost considerations earlier into the product development lifecycle. - Optimize, modernize, and automate to create consistency and iteratively improve technology usage and cost. Applying FinOps Capabilities to additional Scopes of spending gives businesses more comprehensive visibility into their technology costs. The goal for organizations is to understand and optimize the cost of offering each individual product or service. The first step is to get complete visibility into the cost of a product or service by pulling together all types of costs associated with delivering it... Read more in the new Insights article on the expanded scope of FinOps: https://lnkd.in/gPH3vQEn
-
If you’re an AI engineer trying to optimize your LLMs for inference, here’s a quick guide for you 👇

Efficient inference isn’t just about faster hardware, it’s a multi-layered design problem. From how you compress prompts to how your memory is managed across GPUs, everything impacts latency, throughput, and cost.

Here’s a structured taxonomy of inference-time optimizations for LLMs:

1. Data-Level Optimization
Reduce redundant tokens and unnecessary output computation.
→ Input Compression:
- Prompt Pruning, remove irrelevant history or system tokens
- Prompt Summarization, use model-generated summaries as input
- Soft Prompt Compression, encode static context using embeddings
- RAG, replace long prompts with retrieved documents plus compact queries
→ Output Organization:
- Pre-structure output to reduce decoding time and minimize sampling steps

2. Model-Level Optimization
(a) Efficient Structure Design
→ Efficient FFN Design, use gated or sparsely-activated FFNs (e.g., SwiGLU)
→ Efficient Attention, FlashAttention, linear attention, or sliding window for long context
→ Transformer Alternatives, e.g., Mamba, Reformer for memory-efficient decoding
→ Multi/Group-Query Attention, share keys/values across heads to reduce KV cache size
→ Low-Complexity Attention, replace full softmax with approximations (e.g., Linformer)
(b) Model Compression
→ Quantization:
- Post-Training, no retraining needed
- Quantization-Aware Training, better accuracy, especially <8-bit
→ Sparsification:
- Weight Pruning, Sparse Attention
→ Structure Optimization:
- Neural Architecture Search, Structure Factorization
→ Knowledge Distillation:
- White-box, student learns internal states
- Black-box, student mimics output logits
→ Dynamic Inference, adaptive early exits or skipping blocks based on input complexity

3. System-Level Optimization
(a) Inference Engine
→ Graph & Operator Optimization, use ONNX, TensorRT, BetterTransformer for op fusion
→ Speculative Decoding, use a smaller model to draft tokens, validate with full model
→ Memory Management, KV cache reuse, paging strategies (e.g., PagedAttention in vLLM)
(b) Serving System
→ Batching, group requests with similar lengths for throughput gains
→ Scheduling, token-level preemption (e.g., TGI, vLLM schedulers)
→ Distributed Systems, use tensor, pipeline, or model parallelism to scale across GPUs

My Two Cents 🫰
→ Always benchmark end-to-end latency, not just token decode speed
→ For production, 8-bit or 4-bit quantized models with MQA and PagedAttention give the best price/performance
→ If using long context (>64k), consider sliding attention plus RAG, not full dense memory
→ Use speculative decoding and batching for chat applications with high concurrency
→ LLM inference is a systems problem. Optimizing it requires thinking holistically, from tokens to tensors to threads.

Image inspo: A Survey on Efficient Inference for Large Language Models
----
Follow me (Aishwarya Srinivasan) for more AI insights!
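One item in the taxonomy above, sharing keys/values across heads to shrink the KV cache, is easy to quantify. The sketch below uses the standard size estimate (two tensors, keys and values, per layer per token). The model dimensions are hypothetical and chosen only to make the numbers round. They do not describe any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and every token."""
    # Leading factor 2: one tensor for keys plus one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model: 32 layers, head dimension 128, fp16 cache, 4096-token context.
config = dict(num_layers=32, head_dim=128, seq_len=4096)

variants = [
    ("multi-head attention (32 KV heads)", 32),
    ("grouped-query attention (8 KV heads)", 8),
    ("multi-query attention (1 KV head)", 1),
]
for label, kv_heads in variants:
    mib = kv_cache_bytes(num_kv_heads=kv_heads, **config) / 2**20
    print(f"{label}: {mib:.0f} MiB per sequence")
```

Because the cache scales linearly with the number of KV heads, cutting 32 heads to 8 cuts per-sequence memory by 4x, which leaves room for larger batches on the same GPU.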
-
Look at this fascinating chart from the Bureau of Labor Statistics tracking price changes from 2000 to 2022. It’s striking how certain goods and services have soared in cost—like hospital services, college tuition, and textbooks—while items such as TVs, toys, and software have become dramatically more affordable. This divergence often results from how easily technology can boost productivity. Consumer electronics, for example, benefit from rapid innovation and economies of scale. By contrast, labor-intensive sectors like healthcare and education have been harder to automate, causing costs to balloon over time. Artificial intelligence stands to change this dynamic. Machine learning and other AI tools can: 1. Automate Repetitive Tasks: From diagnostic screenings in healthcare to administrative work in higher education, AI has the potential to free up human time for high-impact tasks. 2. Enhance Efficiency: Data-driven insights can reduce waste, optimize operations, and drive down expenses—particularly in service-heavy industries. 3. Expand Access: AI-powered solutions (telehealth, online courses, intelligent tutoring systems) might increase supply and improve affordability for services that have traditionally been expensive and difficult to scale. Implications for Leaders and Professionals: • Opportunity to Innovate: As AI adoption grows in cost-heavy sectors, organizations that embrace it strategically can deliver higher-quality services at lower prices. • Skill Shifts: Tasks in project management, data analysis, and AI oversight will become even more critical to ensuring that technology actually improves outcomes rather than just cutting costs. • Future Competition: Startups and incumbents alike will be racing to apply AI in these traditionally high-cost areas, creating a competitive edge for first movers. Ultimately, charts like this remind us of how unevenly technology affects costs—and how AI offers new ways to tackle price inflation in essential services. 
If we harness it responsibly, we just might help bend those red lines back downward…
-
Measuring ROI in AI: What Success Really Looks Like in Enterprises I get asked this question a lot lately: “What does ROI in AI actually look like?” Not in theory. Not in a board slide. But in real enterprises trying to make this work. Here’s the uncomfortable truth: Most companies are measuring AI ROI the wrong way. They’re asking: “How many hours did Copilot save?” “Did this chatbot reduce headcount?” “Is the model cheaper than before?” That’s like judging the success of electricity by asking 👉 “How many candles did it replace?” What AI ROI isn’t AI ROI is not: A single number A one‑quarter metric A cost‑cutting exercise Or a model accuracy score Those are inputs. Not outcomes. What AI ROI actually looks like From what I’ve seen across enterprises, real AI ROI shows up in 3 quieter but more powerful ways: 1️⃣ Work changes - before cost does The first signal isn’t savings. It’s work that stops needing to happen. Example: A procurement team doesn’t “save 2 hours per report.” They stop writing reports altogether - because decisions are auto‑prepared. That’s not productivity. That’s workflow elimination. 2️⃣ Decisions get faster - and safer AI ROI often shows up as decision velocity with guardrails. Think of it like: Going from asking 10 people for opinions… to getting a grounded recommendation in minutes - with sources. When leaders trust the output and understand why it said what it said, adoption sticks. 3️⃣ Capability compounds over time This is the part most ROI models miss. AI value compounds. Month 1: A pilot works Month 3: Teams reuse patterns Month 6: Agents start orchestrating work Month 12: The organization operates differently Measuring AI ROI too early is like judging a gym membership after week one. A better question to ask Instead of “What’s the ROI of this AI tool?”, try asking: What work will disappear? What decisions will move faster? What capabilities will compound over time? And… what new risks are now controlled automatically? 
If you can answer those, the financial ROI usually follows. AI success isn’t about doing the same things cheaper. It’s about doing different things entirely. For those asking how enterprises are actually measuring AI success (beyond time saved), a few Microsoft perspectives worth exploring in comments 👇 Curious - how are you measuring AI success in your organization today? ***************************************************************************** Ranjani Mani #reviewswithranjani #Technology | #Books | #BeingBetter
-
Are you proactively managing your Source-to-Pay technology costs? If not, why let savings from smart Procurement slip away due to outdated technology or suboptimal use?

S2P technology plays a central role in cost management, yet many companies lack a strategic approach to continuously assess and optimise their tech stack. Companies can adopt Bain & Co’s "Reduce, Replace, and Rethink" model to continuously evaluate their technology infrastructure and costs, ensuring a more optimised and sustainable cost profile.

Here is the model in action for Source to Pay technology cost optimisation:

▪️ Reduce to recover 10 to 20% of costs through short-term actions such as
- adjusting licenses to match actual usage and adoption patterns
- discontinuing features or functionalities that add little value
- switching off modules where business capabilities have not yet caught up
Avoid over-licensing by matching user access to actual needs, ensuring modules align with Procurement’s readiness.

▪️ Replace to yield 20 to 30% of savings by
- transitioning to cost-optimal, flexible solutions and getting out of lock-ins
- switching subscription models when premium offerings are unnecessary
- consolidating overlapping tools that offer similar features
For example, merge multiple eSourcing tools into a primary platform and adopt tender-based pricing for niche auction needs. This helps align the cost profile of your Source to Pay technology with actual needs.

▪️ Rethink to realise up to 40% cost optimisation by
- reimagining the architecture with a modular, composable design
- automating and orchestrating processes and integrating new digital tools
- reevaluating the mix of best-of-breed solutions vs integrated suites
A new Procurement strategy requires a fresh look at the S2P tech stack to ensure it adapts and supports growth cost-effectively, while offering flexibility through additional digital levers like AI and automation.

Optimising S2P technology is a continuous journey, not a one-time effort, especially with contractual commitments, sunk costs, and change management challenges. Rather than following IT preferences and standards, it’s about keeping technology fresh and aligned with business needs as they evolve.

❓ How do you manage your S2P technology to adapt to changing business needs while maintaining cost efficiency?
-
After optimizing costs for many AI systems, I've developed a systematic approach that consistently delivers cost reductions of 60-80%. Here's my playbook, in order of least to most effort:

Step 1: Optimizing Inference Throughput
Start here for the biggest wins with least effort. Enabling caching (LiteLLM (YC W23), Zilliz) and strategic batch processing can reduce costs by a lot with very little effort. I have seen teams cut costs by half simply by implementing caching and batching requests that don't require real-time results.

Step 2: Maximizing Token Efficiency
This can give you an additional 50% cost savings. Prompt engineering, automated compression (ScaleDown), and structured outputs can cut token usage without sacrificing quality. Small changes in how you craft prompts can lead to massive savings at scale.

Step 3: Model Orchestration
Use routers and cascades to send prompts to the cheapest and most effective model for that prompt (OpenRouter, Martian). Why use GPT-4 for simple classification when GPT-3.5 will do? Smart routing ensures you're not overpaying for intelligence you don't need.

Step 4: Self-Hosting
I only suggest self-hosting for teams at scale because of the complexities involved. This requires more technical investment upfront but pays dividends for high-volume applications.

The key is tackling these layers systematically. Most teams jump straight to self-hosting or model switching, but the real savings come from optimizing throughput and token efficiency first.

What's your experience with AI cost optimization?
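The playbook's percentages stack multiplicatively rather than additively, which is worth seeing in numbers. The sketch below assumes each step's saving applies to whatever spend remains after the previous step and that the steps are independent. Both are simplifying assumptions, not claims from the post.

```python
def combined_savings(step_savings):
    """Total fraction saved when each step trims the spend left by the previous one."""
    remaining = 1.0
    for saving in step_savings:
        remaining *= 1.0 - saving
    return 1.0 - remaining


# From the post: caching and batching can halve costs (Step 1),
# and token efficiency can add another 50% saving (Step 2).
total = combined_savings([0.50, 0.50])
print(f"Steps 1 and 2 together: {total:.0%} saved")
```

Two 50% steps give 75% overall, not 100%. That result is consistent with the 60-80% range quoted at the top of the post.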
-
Caching Architecture Is the New Backbone of LLM Systems

Performance, cost, and latency all depend on it.

If your LLM bill is rising every month, you’re not alone.
More usage
More tokens
More cost

But here’s the catch. Most of that compute is repeated work.
Same prompts
Same context
Same patterns
And we recompute everything...every time...

This is where inference caching changes the game.
Not a new model
Not a new architecture
Just smarter reuse

There are three layers that matter:

1. KV Caching
- Happens inside the model
- Stores attention states during generation
- Prevents recomputing tokens within a request
You’re already using it. You just don’t see it.

2. Prefix Caching
- Extends this across requests
- If your system prompt or reference context is constant, process it once → reuse it
Simple rule: static content at the top, dynamic content at the end.
High impact. Almost zero effort.

3. Semantic Caching
- This is where things get interesting
- Store past queries and responses
- Retrieve based on meaning, not exact match
In many cases, you can skip the LLM call entirely. Massive cost savings for support bots, FAQs, repeated queries.

The real power comes from layering them:
- KV runs by default
- Prefix reduces repeated context cost
- Semantic avoids calls altogether

Most teams focus on model quality. But in production, efficiency is what scales. Because in real systems: the cheapest token is the one you never generate.
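The semantic caching layer described above can be sketched in a few dozen lines. This is a toy: the bag-of-words `embed` function stands in for a real embedding model, the 0.8 similarity threshold is an arbitrary assumption, and `fake_llm` is a placeholder for an actual model call.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (query_vector, response) pairs

    def get(self, query):
        """Return the stored response for the most similar past query, if close enough."""
        vector = embed(query)
        best = max(self.entries, key=lambda e: cosine(vector, e[0]), default=None)
        if best is not None and cosine(vector, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query, response):
        self.entries.append((embed(query), response))


def answer(query, cache, llm):
    """Serve from the cache when possible; otherwise call the model and store the result."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    response = llm(query)
    cache.put(query, response)
    return response


llm_calls = []

def fake_llm(query):  # placeholder for a real model call
    llm_calls.append(query)
    return f"answer to: {query}"

cache = SemanticCache()
first = answer("how do i reset my password", cache, fake_llm)
second = answer("how do i reset my password please", cache, fake_llm)  # near-duplicate
third = answer("what is your refund policy", cache, fake_llm)  # unrelated query
print(f"model calls made: {len(llm_calls)} for 3 queries")
```

The linear scan over entries is fine for a demo. A production system would typically store vectors in a vector index and would need to tune the threshold carefully, since a threshold that is too loose returns wrong answers from the cache.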