India's Speech to Text Challenge: Why It's Far from Solved

Speech to Text is NOT a solved problem for India.

Siri launched in 2011. Alexa in 2014. Google Assistant in 2016. We've moved on to agents, reasoning models, and whispers of AGI. So why am I building a Speech to Text company in 2026? Because for Indian languages it is NOT solved. Not even close.

Let's look at the data. The Government of India runs Bhashini, an excellent initiative that benchmarks ASR models and acts as custodian for language AI in India. Bhashini has a public leaderboard (https://lnkd.in/gS8NHnJU). Here are the Word Error Rates of the BEST available models:

Hindi: 27%
Gujarati: 47%
Bengali: 49%
Urdu: 52%
Telugu: 54%
Marathi: 58%
Odia: 59%
Punjabi: 64%
Sanskrit: 64%
Malayalam: 75%
Tamil: 79%

Read that again. The BEST model for Tamil gets roughly 4 out of every 5 words wrong. For Hindi, our most resourced language, 1 in 4 words is wrong.

These aren't from some obscure lab. These are from the Government of India's own platform, using models from AI4Bharat (which led to the birth of Sarvam AI) and others. Even the best Indian models are far from production-grade.

You can only draw two conclusions:
1. Bhashini isn't being taken seriously by model developers, who see no value in being benchmarked by an independent government agency.
2. The problem is real, and by NOT formally benchmarking, companies keep it under wraps.
I suspect both are true.

Now let's talk cost. The math for a 200-seat call center (insurance tele-screening, bank KYC, or a government helpline):
→ 200 agents × ~5 hrs talk time/day × 22 working days
→ 22,000 hours of audio per month
→ At the cheapest rate of ₹0.50/min: ₹6.6 Lakhs/month, ~₹79 Lakhs/year

Just for transcription that is 50-75% accurate for most Indian languages. That's not a line item. That's a crater in your Opex budget. For an MSME or a district-level government department, the unit economics simply don't work.

So the problem is two-fold:
1. Accuracy: far from production-grade for real Indian speech.
2. Cost: prohibitive for the organisations that need it most.

But there's a deeper issue nobody is talking about. It's not the number of languages. Everyone knows India has 22 scheduled languages and 1,600+ mother tongues. The real issue is buried in a deceptively simple question: What IS an Indian language?

Think about your last phone call. Your last WhatsApp voice note. Were you speaking "Hindi"? Were you speaking "Tamil"? Or something else entirely, something no model is trained for?

I'll leave you with that question. And I'll share my answer, and what we're building to solve it, in my next post.

Department of Telecommunications, Ministry of Communication & IT (INDIA), BHASHINI - (Digital India BHASHINI Division)
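For readers unfamiliar with the metric: Word Error Rate is conventionally computed as the word-level edit distance (substitutions + insertions + deletions) between a reference transcript and the model's hypothesis, divided by the reference word count. A minimal sketch, together with the back-of-envelope call-center arithmetic from the post. The example sentences are invented for illustration; they are not Bhashini data.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# A 27% WER (the Hindi figure above) means roughly 1 word in 4 is wrong.
# Hypothetical reference vs. ASR output: 2 errors in 8 words -> 25% WER.
ref = "please confirm my policy number before we proceed"
hyp = "please confirm me policy number before he proceed"
print(f"WER = {wer(ref, hyp):.0%}")  # → WER = 25%

# The cost arithmetic from the post, verbatim assumptions:
minutes_per_month = 200 * 5 * 22 * 60       # 200 agents, 5 hrs/day, 22 days
monthly_cost = minutes_per_month * 0.50     # cheapest rate, ₹0.50/min
print(monthly_cost, monthly_cost * 12)      # → 660000.0 7920000.0  (₹6.6L/mo, ~₹79L/yr)
```

Note that WER counts edit operations, so it can exceed 100% when the hypothesis is longer than the reference; "4 out of 5 words wrong" for a 79% WER is therefore an approximation, not an exact word count.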

Sarvam and gnani.ai - tagging you as the leaders. I know you have great STTs; is it not time to get them benchmarked?

Thank you all for the comments and engagement. I am not a content creator; rather, I am trying to get to the bottom of what prevents modern AI from understanding Indian voice accurately. Your comments are encouraging, and they highlight the fact that the problem is genuine, hard, and not resolved. Even though the GoI is trying its level best with Bhashini, and research at IIT Madras and IISc Bangalore is ongoing, something very fundamental is broken. Your comments help us understand that yes, Language AI in India is broken; it needs acknowledgment of the problem and a holistic solution (not photo ops and big announcements) built on serious research and honesty: where are we today? I do suspect the solution lies in the genesis of our languages and how they are spoken. Linguists can perhaps comment more.


💯 What most people miss is that 5 Indian languages are among the top 20 spoken languages in the world, including Hindi, which is the 3rd most spoken. In my experience traveling around the world and working with people across many countries, communication language and accents are among the top determinants of trust. Trust is the foundation of trade and prosperity. Agentic commerce success will depend on communication and protocols. As the population of India approaches 18% of the world population, you have certainly targeted a pivotal issue.

As you have hinted towards the end, the problem lies in the data, and that problem arises from a lack of understanding of how languages (and varieties) work. Unfortunately, most of the datasets are collected by vendors who have little understanding of how languages work and almost no training in linguistic fieldwork and language data collection (which is absolutely different from demographic data collection and running social surveys). We also still have no idea how to separate science from politics. The broad, "official" government position on languages is very much socio-political, and when we start by taking it as absolute scientific truth (for example, that Hindi-speaking states speak only Hindi, that Tamil is spoken only in Tamil Nadu, or Bengali only in Bengal), our datasets drift further from the reality of how languages are spoken, and our models further from their potential.

Appreciate this post — STT for Indian languages is definitely not a solved problem. While we aren’t comparable in scale to large global players, we’re a smaller team focused specifically on Indian language call center use cases. In production deployments, our customers have been satisfied with both accuracy and cost. Our early models were domain-focused, but as we expand into more general scenarios (live speech, TV news, conversational audio), we’re seeing competitive performance within those contexts. There’s still a long journey ahead, and we’re continuously improving. From day one, we’ve also prioritized efficient on-device/mobile models (for low-powered devices), and in those real-world scenarios, we’re seeing strong results. The problem isn’t solved, but steady, focused progress is happening. Open benchmarking and healthy competition will move the ecosystem forward.

Saurabh Vajpayee, this is such an important problem to highlight. 👏 The gap between “global AI progress” and “real Indian speech” is massive, and the Bhashini numbers make it impossible to ignore. Accuracy + cost + code-mixing reality… that’s the real triangle. You’re right: the question isn’t how many languages India has. It’s what we actually speak in the wild. Excited to see what you’re building to solve it.

That's true. We are still waiting for a better-optimized, scalable, and accurate answer to this problem. Anxiously awaiting your next post.


I was a QA for a TTS model and faced many challenges fixing it for Indian languages. I felt that the transcriptions generated by models are not accurate because of matras and their pronunciation. Hence I feel the lexicons need to be fixed first.
