Speech Recognition Innovations
Explore top LinkedIn content from expert professionals.
Summary
Speech recognition innovations are transforming the way humans interact with computers by enabling machines to accurately interpret spoken language and even silent neural signals. These advances include non-invasive wearables, brain-computer interfaces, and AI models that can decipher speech in real time, restoring voices for those who have lost the ability to speak and powering smarter voice agents for everyday tasks.
- Adopt accessible solutions: Consider using the latest non-invasive wearables or open-source voice agents to empower individuals with speech impairments and support fast, natural communication.
- Explore brain-computer advancements: Stay updated on new neuroprosthesis technologies, which allow direct translation of brain signals into speech, offering hope for those with paralysis or severe speech loss.
- Experiment with multimodal AI tools: Test emerging AI models that combine speech, text, and visual data to improve device interactions, support multilingual users, and enable secure, real-time communication.
-
VoiceTextBlender introduces a novel approach to augmenting LLMs with speech capabilities through single-stage joint speech-text supervised fine-tuning. The researchers from Carnegie Mellon and NVIDIA have developed a more efficient way to create models that can handle both speech and text without compromising performance in either modality.

The team's 3B-parameter model demonstrates superior performance compared to previous 7B and 13B SpeechLMs across various speech benchmarks whilst preserving the original text-only capabilities, addressing the critical challenge of catastrophic forgetting that has plagued earlier attempts.

Their technical approach employs LoRA adaptation of the LLM backbone, combining text-only SFT data with three distinct types of speech-related data: multilingual ASR/AST, speech-based question answering, and an innovative mixed-modal interleaving dataset created by applying TTS to randomly selected sentences from the text SFT data.

What's particularly impressive is the model's emergent ability to handle multi-turn, mixed-modal conversations despite being trained only on single-turn speech interactions. The system can process user input in pure speech, pure text, or any combination, showing strong generalisation to unseen prompts and tasks.

The researchers have committed to publicly releasing their data generation scripts, training code, and pre-trained model weights, which should significantly advance research in this rapidly evolving field of speech language models.

Paper: https://lnkd.in/dutRcaAA
Authors: Yifan Peng, Krishna C. Puvvada, Zhehuai Chen, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, Boris Ginsburg
#SpeechLM #MultimodalAI #SpeechAI
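To make the recipe concrete, here is a minimal sketch of LoRA adaptation for joint speech-text SFT, written with Hugging Face transformers and peft. The backbone checkpoint, LoRA hyperparameters, and mixing ratios are illustrative assumptions rather than the authors' published configuration, and the speech encoder plus its projector into the LLM embedding space are omitted.

```python
# Minimal sketch of LoRA-adapted joint speech-text SFT (illustrative only;
# not the VoiceTextBlender authors' exact recipe).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Any ~3B causal LM backbone; the checkpoint name is a placeholder.
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

# Freeze the backbone; train only low-rank adapters on attention projections.
lora_cfg = LoraConfig(r=16, lora_alpha=32,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(llm, lora_cfg)
model.print_trainable_parameters()  # only the LoRA weights update

# A joint batch mixes the four data types the post lists; the proportions
# below are placeholders, not the paper's actual sampling ratios.
data_mix = {
    "text_sft": 0.4,      # original text-only SFT data
    "asr_ast": 0.3,       # multilingual ASR / speech translation
    "speech_qa": 0.2,     # spoken question answering
    "interleaved": 0.1,   # TTS applied to random sentences of text SFT
}
```

Because the backbone stays frozen and only small adapters train on a mixture that still contains text-only SFT data, the original text behaviour is anchored while the speech tasks are learned, which is the mechanism behind the reported resistance to catastrophic forgetting.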
-
Voice agents are having their moment in 2025: an open-source breakthrough just redefined real-time multimodal AI by slashing interaction latency to 1.5 seconds, challenging the recently released proprietary real-time APIs from OpenAI and Google.

VITA-1.5, the latest iteration of the open-source interactive omni-multimodal LLM, brings three major improvements that push the boundaries of multimodal AI:
(1) Speed transformation: end-to-end speech interaction latency cut from 4 seconds to 1.5 seconds, enabling true real-time conversations
(2) Speech processing leap: Word Error Rate reduced from 18.4 to 7.5, rivaling specialized speech models
(3) Multimodal excellence: average performance across MME, MMBench, and MathVista boosted from 59.8 to 70.8 while maintaining robust vision-language capabilities

One novel method from the paper is VITA's progressive training strategy, which allows speech integration without compromising other multimodal capabilities, a persistent challenge in the field. Image understanding performance drops by only 0.5 points while the model gains an entirely new modality.

As we move towards agentic AI systems that must process and respond to multiple input streams in real time, VITA-1.5's combination of reduced latency and high accuracy across modalities sets a new standard for what's possible in open-source AI. This release signals a shift in the multimodal AI landscape, demonstrating that open-source alternatives can compete with proprietary solutions in the race for real-time, multi-sensory AI interactions.

VITA-1.5: https://lnkd.in/gj7pd77P
More tools, open-source models, and APIs for building voice agents in my recent AI Tidbits post: https://lnkd.in/g9ebbfX3
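The core of progressive training is simple to state in code: modules trained in earlier stages are frozen while the new modality's path learns. The sketch below is a hypothetical, stripped-down illustration of that staging, with placeholder modules that are not VITA-1.5's real architecture.

```python
# Hypothetical sketch of progressive (staged) modality training: stage-1
# modules are frozen while the stage-2 speech path trains, preserving
# existing capabilities. Module shapes/names are placeholders.
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Exclude a module from gradient updates."""
    for p in module.parameters():
        p.requires_grad = False

class OmniModel(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.vision_adapter = nn.Linear(1024, dim)  # stage 1: vision-language
        self.speech_adapter = nn.Linear(80, dim)    # stage 2: speech input
        self.llm = nn.Linear(dim, dim)              # stand-in for the LLM

model = OmniModel()

# Stage 2: train only the speech path. Freezing the vision path is why
# image understanding barely moves (the post reports a 0.5-point drop).
freeze(model.vision_adapter)
freeze(model.llm)
trainable = [p for p in model.parameters() if p.requires_grad]  # speech only
```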
-
376,000 ALS patients type 10 words per minute. MIT just gave them normal speech speed. No sound. No surgery. Just seven sensors reading jaw signals.

Arnav Kapur and his team built AlterEgo with one mission: empower people with ALS and oral cancer, not replace them. Their wearable reads the signals your brain sends to silent speech muscles, with 92% accuracy and a half-second response time.

The cost breakthrough that matters:
↳ Neuralink surgery: $30,000-$100,000+
↳ Brain implants: infection risks, select trials only
↳ Current ALS devices: $1,500-$8,000 robotic voices
↳ AlterEgo target: same price, your actual voice

Think about that. No drilling into skulls like Synchron or UC Davis implants. No $100,000 medical bills. Just electrodes on your jaw detecting the same signals you use when reading silently.

Traditional assistive reality:
↳ Eye-tracking: 10 exhausting words per minute
↳ Brain surgery: $100,000+ with infection risks
↳ Robotic voices destroying identity
↳ Most patients priced out entirely

AlterEgo reality:
↳ Think naturally, speak instantly
↳ Non-invasive wearable design
↳ Your voice preserved digitally
↳ First responders using it for silent comms

But here's what stopped me cold: the same device restoring voices to ALS patients is being tested for secure translation, silent note-taking, and emergency response teams. One innovation serving different needs with high impact.

Consumer EEG headsets cost $100-$1,000 but can't handle real speech. Medical BCIs require brain surgery. AlterEgo sits between: medical-grade accuracy without medical risks.

The multiplication effect:
1 voice preserved = independence restored
100 patients reconnected = isolation broken
1,000 using AlterEgo = new communication standard
At scale = surgery becomes obsolete

From MIT lab to human trials. From $100,000 brain implants to accessible wearables. From "I need surgery to speak" to "I just need to think."

Kapur's team chose technology that empowers rather than replaces human ability. Because 376,000 people with ALS and oral cancer deserve their own voice, not a robot's.

Follow me, Dr. Martha Boeckenfeld, for innovations that restore human dignity without invasion.
♻️ Share if everyone deserves to keep their voice.
-
A high-performance speech neuroprosthesis, developed by Stanford researchers, decodes attempted speech directly from brain activity, restoring a voice to individuals who have lost the ability to speak.

Key findings:
📍 Rapid and naturalistic decoding: the system translated neural signals into real-time text at 62 words per minute, nearly 3.5× faster than prior BCI systems. This speed brings decoded communication closer to everyday conversation, offering a major leap in usability and responsiveness.
📍 Robust phoneme mapping and vocabulary range: impressively, the neuroprosthesis operated with a 125,000-word vocabulary, the largest ever used in a speech BCI, while maintaining semantic accuracy. Neural representations of phonemes remained intact even years after speech loss, suggesting the brain's motor-speech pathways are more persistent than previously assumed.
📍 Rethinking the neural basis of speech: while traditional models emphasize Broca's area, this study found that area 6v was more predictive of speech intention. Furthermore, the system successfully decoded both spoken and silently mouthed words, demonstrating that silent articulation retains a reliable neural signature, crucial for fatigue-free, discreet communication.

By Willett et al., Nature, 2023: https://rdcu.be/eyFkC

Implication: this work marks a major milestone for brain-computer interfaces, bridging neuroscience and assistive technology to restore speech, and reshaping our understanding of the brain's language architecture.

#BrainComputerInterface #Neuroprosthetics #SpeechNeuroprosthesis #Neuroscience #Stanford #ALS #Neurotech #BCI
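To give a feel for how phoneme decoding reaches a large vocabulary, here is a toy illustration: frame-level phoneme posteriors are collapsed CTC-style and matched against a pronunciation lexicon. The actual Willett et al. system uses a recurrent decoder with a language model over the 125,000-word vocabulary; the phone set, lexicon, and greedy decoding below are deliberately simplified stand-ins.

```python
# Toy lexicon-constrained phoneme decoding (a sketch, not the paper's method).
import numpy as np

PHONES = ["HH", "EH", "L", "OW", "_"]  # tiny phone set; "_" = blank
LEXICON = {"hello": ["HH", "EH", "L", "OW"]}  # pronunciation dictionary stub

def greedy_phones(posteriors: np.ndarray) -> list[str]:
    """Collapse per-frame phone posteriors to a phone sequence (CTC-style):
    drop repeats and blanks."""
    ids = posteriors.argmax(axis=1)
    out, prev = [], None
    for i in ids:
        if i != prev and PHONES[i] != "_":
            out.append(PHONES[i])
        prev = i
    return out

def match_word(phones: list[str]) -> str | None:
    """Look the phone sequence up in the lexicon; a real system would
    instead score word hypotheses with a language model."""
    return next((w for w, p in LEXICON.items() if p == phones), None)
```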
-
Breakthrough: BCI + AI = instant mind-to-speech conversion. A new device can detect words and turn them into speech within three seconds.

📍 The researchers used deep learning RNN-T models to achieve fluent, large-vocabulary speech synthesis, with neural decoding running in 80-ms increments.

In the study, the participant Ann had lost her ability to speak after a stroke 18 years ago. Researchers placed a paper-thin rectangle containing 253 electrodes on the surface of her speech sensorimotor cortex to record the activity of thousands of neurons.

They even personalized the synthetic voice: using AI on recordings from her wedding video, they made the synthetic voice sound like Ann's own voice from before her injury.

❗ The result: before, a single sentence took more than 20 seconds; now, 47-90 words per minute.

"Our framework also successfully generalized to other silent-speech interfaces, including single-unit recordings and electromyography. Our findings introduce a speech-neuroprosthetic paradigm to restore naturalistic spoken communication to people with paralysis."

Huge congratulations to the authors of this work! Just WOW.
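The 80-ms increment is the key to the latency gain: instead of waiting for a whole sentence, the decoder emits output chunk by chunk. Below is a minimal sketch of that streaming loop under stated assumptions; the feature rate and the per-chunk decoder are stubs, not the study's actual RNN-T.

```python
# Sketch of streaming neural decoding in fixed 80 ms increments, the
# cadence the post describes. `step_fn` stands in for one RNN-T decode
# step; the feature rate is a placeholder.
import numpy as np

FEATURE_HZ = 1000                  # assumed neural feature rate (Hz)
CHUNK = int(0.080 * FEATURE_HZ)    # 80 ms of features per decoding step

def stream_decode(features: np.ndarray, step_fn) -> str:
    """Feed fixed-size chunks to the decoder and emit text incrementally,
    so output starts within seconds instead of after the full utterance."""
    text = ""
    for start in range(0, len(features), CHUNK):
        chunk = features[start:start + CHUNK]
        text += step_fn(chunk)     # partial tokens appear as she "speaks"
    return text
```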
-
Machines used to hear us. Now they start to understand us.

Google's new model, Speech-to-Retrieval (S2R), skips transcription. It listens for meaning, not words.

✦ Old loop: Voice → Text → Search → Error → Frustration
✦ New loop: Voice → Intention → Retrieval → Result

No more "Scream" becoming "screen." No more brittle text layers between thought and answer.

⧉ This is more than an upgrade. It's a paradigm shift from speech as input to speech as understanding. Humans speak in nuance. Machines finally start to respond in kind.

What this means for us:
› Search becomes semantic.
› Interfaces become invisible.
› Conversation becomes computation.

When the system no longer asks what you said but starts inferring what you meant, the interface dissolves and intent becomes the command.

ツ The next frontier is not recognition. It's understanding. What happens when your system listens and truly knows what you mean?
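A hedged sketch of the dual-encoder idea behind speech-to-retrieval: an audio encoder and a document encoder share one embedding space, so search becomes nearest-neighbour over intent vectors instead of over a transcript. Both encoders below are random stubs for illustration; Google has not published S2R in this form, and the function names are mine.

```python
# Dual-encoder retrieval sketch: audio query -> embedding -> nearest docs.
import numpy as np

DIM = 512

def embed_audio(waveform: np.ndarray) -> np.ndarray:
    """Stub audio-intent encoder (a real one is a trained network)."""
    rng = np.random.default_rng(int(abs(waveform.sum()) * 1000) % 2**32)
    return rng.standard_normal(DIM)

def embed_doc(text: str) -> np.ndarray:
    """Stub document encoder keyed on the text."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(DIM)

def retrieve(waveform: np.ndarray, docs: list[str], k: int = 3) -> list[str]:
    """Rank documents by cosine similarity to the spoken query's embedding;
    no transcription step ever happens."""
    q = embed_audio(waveform)
    q /= np.linalg.norm(q)
    scored = []
    for d in docs:
        v = embed_doc(d)
        scored.append((float(q @ v / np.linalg.norm(v)), d))
    return [d for _, d in sorted(scored, reverse=True)[:k]]
```

The design point is that recognition errors like "Scream" → "screen" can no longer propagate: there is no intermediate text to get wrong, only a similarity ranking over meanings.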
-
Jan 26, 2026: Can we "hear" a disease before we see it?

Think about the last time you had a cold. You could hear it in your voice before you even felt that first fever, right? We've always known sound carries clues about our health. As a recent The Times of India feature highlights, AI is now giving us the "ears" to actually decode these subtle acoustic signatures. We are seeing a massive shift where a simple smartphone recording can screen for everything from TB and Parkinson's to thyroid disorders and even malignancy.

What's most exciting isn't just the tech, it's the equity and accessibility it creates. In rural India, where the nearest specialist might be a day's journey away, tools like Swaasa® by Salcit Technologies, AyuSynk.ai, and Ai Health Highway are changing the game. They're turning a standard consultation into a high-tech diagnostic session, bridging the gap between primary care and life-saving intervention.

Why and how this matters:
- Detection before symptoms: subtle changes in pitch, cadence, and pause patterns can signal issues like Parkinson's or cognitive decline long before physical symptoms appear.
- High-precision screening: modern studies are hitting remarkable benchmarks, such as 97% accuracy for Parkinson's detection using conversational speech and strong AUROC scores for TB screening.
- Mental health insights: AI can now detect the "flat" or monotone speech patterns associated with depression and the rapid word pace common in anxiety.
- Empowering frontline workers: devices like AiSteth and AyuSynk allow ASHA workers and rural clinics to record and share digital heart/lung sounds with remote specialists, enabling life-saving referrals.
- Beyond the ear: while doctors may struggle to catch every subtle vocal shift, AI trained on massive datasets can identify early warning signs of dementia at a fraction of the cost of traditional blood tests.
(Quoted: Dr. V. Mohan, diabetes specialist.)

The takeaway? AI isn't replacing the stethoscope; it's supercharging it. PMC systematic reviews and Canary Speech studies show that the goal is to build objective, non-invasive "digital biomarkers" that make healthcare proactive rather than reactive. The future of diagnostics isn't just digital, it's beyond digital.

ARTPARK Indian Institute of Science (IISc) Nihar Desai Rohit Satish Raghu Dharmaraju Indian Council of Medical Research (ICMR) ICMR-National Institute for Research in Digital health and Data Science Startup India Dr (Maj) Satish S Jeevannavar NIMHANS, Bangalore
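As a concrete illustration of the "pitch, cadence, and pause patterns" mentioned in the list above, here is a minimal feature-extraction sketch with librosa. The file name and thresholds are hypothetical, and the two features are far simpler than what clinical tools like Swaasa actually model.

```python
# Illustrative extraction of two simple vocal "digital biomarkers":
# pitch statistics and pause ratio. Thresholds and file are placeholders.
import librosa
import numpy as np

y, sr = librosa.load("voice_sample.wav", sr=16000)  # hypothetical recording

# Fundamental-frequency (pitch) track via probabilistic YIN.
f0, voiced_flag, voiced_probs = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
pitch_mean = np.nanmean(f0)   # average pitch over voiced frames
pitch_sd = np.nanstd(f0)      # pitch variability (monotony indicator)

# Pause pattern: share of the recording that is silence.
intervals = librosa.effects.split(y, top_db=30)     # non-silent spans
speech_time = sum(int(e - s) for s, e in intervals) / sr
pause_ratio = 1 - speech_time / (len(y) / sr)

print(f"mean F0 {pitch_mean:.1f} Hz, F0 sd {pitch_sd:.1f}, "
      f"pauses {pause_ratio:.0%}")
```

Features like these would then feed a trained classifier; the screening accuracies the post cites come from models over many such acoustic measures, not from any single one.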
-
🧵 This week in conversational AI. The week reinforced a clear theme: voice AI is entering its scale phase, where reliability, latency, and control really matter. Here's the recap 👇

Deepgram's latest funding is highlighted by The Wall Street Journal, valuing the company at $1.3B. Real-time voice APIs are officially core infrastructure.

ElevenLabs drops Scribe v2 + Scribe v2 Realtime, delivering sub-150ms transcription across 90 languages with ~93%+ accuracy. This is the latency threshold where voice stops feeling like software and starts feeling human.

VoiceRun raises a $5.5M seed and launches a full-stack, code-first voice AI platform for enterprises. Control, observability, and reliability are becoming non-negotiable as voice agents graduate to production.

OpenAI releases "AI as a Healthcare Ally," showing how millions of Americans are already using ChatGPT to navigate a broken healthcare system. Conversational AI is emerging as a critical layer for access, clarity, and patient empowerment.

Parloa announces a $350M Series D at a $3B valuation, just seven months after its Series C, led by General Catalyst. The company is accelerating global growth, expanding its AI Agent Management Platform, and launching the Parloa Promise, a strong signal that enterprise-grade, responsible AI is scaling fast.

Krisp launches webhooks for its AI Meeting Assistant, letting transcripts, notes, and action items flow directly into internal tools. Voice → structured data → action, without friction.

NVIDIA releases Nemotron Speech ASR, an open-source model hitting ~24ms median transcription time with massive concurrency on H100s. Real-time voice at scale just became far more accessible.

SoundHound AI x Richtech Robotics partner to bring conversational voice AI into robotic food service. Voice continues to emerge as the interface between humans, machines, and real-world transactions.

🚀 Big week for conversational AI. What did we miss?
-
Innovation happens when AI meets neuroscience and lives are changed.

After surviving a stroke, Ann was unable to speak or express herself for 18 years. Today, that has changed. Researchers at the University of California, San Francisco developed a groundbreaking brain-computer interface that reads neural signals directly from her brain and decodes them in real time. Powered by AI, those signals are transformed into text, synthetic speech, and even facial expressions through a digital avatar that reflects how Ann would naturally communicate.

There's no typing. No word selection. Ann simply thinks about what she wants to say, and the system converts those thoughts into full sentences, complete with emotional expression like smiling or raised eyebrows.

This is one of the first demonstrations of a brain implant restoring both voice and emotional expression together. It marks a major step forward in using AI models to map brain activity to language and movement. The future of communication for people with severe paralysis is no longer theoretical. It's here, and it's profoundly human.

#AI #Neuroscience #Innovation #BrainComputerInterface #HealthcareAI #HumanCenteredAI #medtech #device #ml #human #life #world
-
WHY AI: Four years ago, Casey Harrell sang his last bedtime nursery rhyme to his daughter. Now, in an experiment that surpassed expectations, implants in his brain were able to recognize words he tried to speak, and A.I. helped produce sounds that came close to matching his true voice.

"The key innovation was putting more arrays, with very precise targeting, into the speechiest parts of the brain we can find," said Sergey Stavisky, a neuroscientist at the University of California, Davis, who helped lead the study.

By day two of their initial working sessions together, the machine was ranging across an available vocabulary of 125,000 words with 90% accuracy and, for the first time, producing sentences of Mr. Harrell's own making. The device spoke them in a voice remarkably like his own, too: using podcast interviews and other old recordings, the researchers had created a deepfake of Mr. Harrell's pre-A.L.S. voice.

But beyond the tech itself, what has changed for Casey?
• The new AI persona awakened parts of him that had long lain dormant. He started small talk and bantering again. Just as speaking a foreign language can enable people to express otherwise buried parts of their personalities, his decoder gave him back old elements of himself, even if they had become slightly changed in transit.
• He could now tell Aya, his 5-year-old daughter, that he loved her. She, in turn, shared more with him, knowing that she would understand her father's responses.
• Visiting health workers who once seemed to take his impaired speech to mean he was stupid or hard of hearing (he is neither) now speak at normal volumes and touch him more carefully.
• He could reach back out to old friends who had drifted away, and who he worried were too ashamed to get back in touch. He could "connect with them in a way that meets them where they are at" rather than on wordless terrain.
• Casey describes working more productively and independently since the surgery, and says it is a source of pride.

Give AI the right job, and the right things can happen. 🚀