India's Speech to Text Challenge: Why It's Far from Solved

Speech to Text is NOT a solved problem for India.

Siri launched in 2011. Alexa in 2014. Google Assistant in 2016. We've moved on to agents, reasoning models, and whispers of AGI. So why am I building a Speech to Text company in 2026? Because for Indian languages it is NOT solved. Not even close.

Let's look at the data. The Government of India runs Bhashini, an excellent initiative that benchmarks ASR models and acts as custodian for language AI in India. Bhashini has a public leaderboard (https://lnkd.in/gS8NHnJU). Here are the Word Error Rates of the BEST available models:

Hindi: 27%
Gujarati: 47%
Bengali: 49%
Urdu: 52%
Telugu: 54%
Marathi: 58%
Odia: 59%
Punjabi: 64%
Sanskrit: 64%
Malayalam: 75%
Tamil: 79%

Read that again. The BEST model for Tamil gets roughly 4 out of every 5 words wrong. For Hindi, our most resourced language, 1 in 4 words is wrong.

These aren't from some obscure lab. These are from the Government of India's own platform, using models from AI4Bharat (which led to the birth of Sarvam AI) and others. Even the best Indian models are far from production-grade.

You can only draw two conclusions:
1. Bhashini isn't being taken seriously by model developers, who see no value in being benchmarked by an independent government agency.
2. The problem is real, and by NOT formally benchmarking, companies keep it under wraps.
I suspect both are true.

Now let's talk cost. The math for a 200-seat call center (insurance tele-screening, bank KYC, or a government helpline):
→ 200 agents × ~5 hrs talk time/day × 22 working days
→ 22,000 hours of audio per month
→ At the cheapest rate of ₹0.50/min: ₹6.6 Lakhs/month, ~₹79 Lakhs/year

Just for transcription that is 50-75% accurate for most Indian languages. That's not a line item. That's a crater in your Opex budget. For an MSME or a district-level government department, the unit economics simply don't work.

So the problem is two-fold:
1. Accuracy: far from production-grade for real Indian speech.
2. Cost: prohibitive for the organisations that need it most.

But there's a deeper issue nobody is talking about. It's not the number of languages. Everyone knows India has 22 scheduled languages and 1,600+ mother tongues. The real issue is buried in a deceptively simple question: What IS an Indian language?

Think about your last phone call. Your last WhatsApp voice note. Were you speaking "Hindi"? Were you speaking "Tamil"? Or something else entirely, something no model is trained for?

I'll leave you with that question. And I'll share my answer, and what we're building to solve it, in my next post.

Department of Telecommunications, Ministry of Communication & IT (INDIA), BHASHINI - (Digital India BHASHINI Division)
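For readers unfamiliar with the metric: Word Error Rate is conventionally computed as the word-level edit distance (substitutions + insertions + deletions) between a reference transcript and the model's hypothesis, divided by the reference word count. A minimal sketch, together with the back-of-envelope call-center arithmetic from the post. The example sentences are invented for illustration; they are not Bhashini data.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# A 27% WER (the Hindi figure above) means roughly 1 word in 4 is wrong.
# Hypothetical reference vs. ASR output: 2 errors in 8 words -> 25% WER.
ref = "please confirm my policy number before we proceed"
hyp = "please confirm me policy number before he proceed"
print(f"WER = {wer(ref, hyp):.0%}")  # → WER = 25%

# The cost arithmetic from the post, verbatim assumptions:
minutes_per_month = 200 * 5 * 22 * 60       # 200 agents, 5 hrs/day, 22 days
monthly_cost = minutes_per_month * 0.50     # cheapest rate, ₹0.50/min
print(monthly_cost, monthly_cost * 12)      # → 660000.0 7920000.0  (₹6.6L/mo, ~₹79L/yr)
```

Note that WER counts edit operations, so it can exceed 100% when the hypothesis is longer than the reference; "4 out of 5 words wrong" for a 79% WER is therefore an approximation, not an exact word count.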

Sarvam and gnani.ai - tagging you as the leaders. I know you have great STTs; is it not time to get them benchmarked?

Thank you all for the comments and engagement. I am not a content creator; rather, I am trying to get to the bottom of what prevents modern AI from understanding Indian voice accurately. Your comments are encouraging, and they highlight the fact that the problem is genuine, hard, and not resolved. Even though the GoI is trying its level best with Bhashini, and research at IIT Madras and IISc Bangalore is ongoing, something very fundamental is broken. Your comments help us understand that yes, Language AI in India is broken; it needs acknowledgment of the problem and a holistic solution (not photo ops and big announcements) built on serious research and honesty: where are we today? I do suspect the solution lies in the genesis of our languages and how they are spoken. Linguists can perhaps comment more.


💯 What most people miss is that 5 Indian languages are among the top 20 spoken languages in the world, including Hindi, which is the 3rd most spoken. In my experience traveling around the world and working with people across many countries, communication language and accents are among the top determinants of trust. Trust is the foundation of trade and prosperity. Agentic commerce success will depend on communication and protocols. As the population of India approaches 18% of the world population, you have certainly targeted a pivotal issue.

As you have hinted towards the end, the problem lies in the data, and that problem arises from a lack of understanding of how languages (and varieties) work. Unfortunately, most of the datasets are collected by vendors who have little understanding of how languages work and almost no training in linguistic fieldwork and language data collection (which is absolutely different from demographic data collection and running social surveys). We also still have no idea how to separate science from politics. The broad, "official" government position on languages is very much socio-political, and when we start by taking it as absolute scientific truth (for example, that Hindi-speaking states speak only Hindi, that Tamil is spoken only in Tamil Nadu, or Bengali only in Bengal), our datasets drift further from the reality of how languages are spoken, and our models further from their potential.

Appreciate this post — STT for Indian languages is definitely not a solved problem. While we aren’t comparable in scale to large global players, we’re a smaller team focused specifically on Indian language call center use cases. In production deployments, our customers have been satisfied with both accuracy and cost. Our early models were domain-focused, but as we expand into more general scenarios (live speech, TV news, conversational audio), we’re seeing competitive performance within those contexts. There’s still a long journey ahead, and we’re continuously improving. From day one, we’ve also prioritized efficient on-device/mobile models (for low-powered devices), and in those real-world scenarios, we’re seeing strong results. The problem isn’t solved, but steady, focused progress is happening. Open benchmarking and healthy competition will move the ecosystem forward.

Saurabh Vajpayee, this is such an important problem to highlight. 👏 The gap between “global AI progress” and “real Indian speech” is massive, and the Bhashini numbers make it impossible to ignore. Accuracy + cost + code-mixing reality… that’s the real triangle. You’re right: the question isn’t how many languages India has. It’s what we actually speak in the wild. Excited to see what you’re building to solve it.

That's true. We are still waiting for a better-optimized, scalable, and accurate answer to this problem. Anxiously awaiting your next post.


I was a QA for a TTS model and faced many challenges fixing it for Indian languages. I felt that the transcriptions generated by models are not accurate because of matras and their pronunciation. Hence I feel the lexicons need to be fixed first.
