ElevenLabs says that their new voice models are as good as humans. I still think without us, they’re nothing.

To get AI voices out of the uncanny valley, the best tool we have right now isn’t a better prompt. It’s a human performance.

Speech-to-Speech (STS) workflows allow an actor to perform a read, then map an AI voice over top of it. This method IS better than straight Text-to-Speech because it has something real to imitate. A weird pause. A slight crack in the voice. Somebody landing a joke half a beat later than expected. Tiny things people don’t consciously hear, but absolutely notice.

In that setup, the human is the actor. The AI voice is just the costume.

But there’s still a problem. As soon as the audio hits the model, the system often starts smoothing things out. It’s trying to create clean, mathematically stable sound, which means it will flatten the stuff that makes performance interesting in the first place. A whisper loses some texture. A shout loses some edge. Fewer peaks and valleys means a less dynamic read overall. So we can input a great performance, but still receive a less-than-great version of it.

Then there’s the studio reality of all this: If we like 99% of a voice actor’s read but need one word slightly harder, softer, slower, faster...that’s usually a five-second fix. With current STS workflows, this becomes a ridiculous process. If one word sounds weird, or the model interpreted something strangely, now we have to either

• re-record and hope it reacts differently this time or
• open Pro Tools and surgically manipulate waveforms

Which brings us to the larger point: People talk about "realistic AI performances", but a lot of the time it’s just a human performance wearing a mask. AI can change the timbre, accent, age, gender, whatever. But it still needs a human performance underneath it.

There’s an Ipsos/Syracuse study that highlighted this. AI voices can hold attention pretty well. But human-voiced ads over-indexed on short-term sales by 11 points while AI voices under-indexed by 5. Turns out humans still enjoy listening to humans. I am not surprised.

If you’re trying to make people feel something, start with a human. Collaborating with voice actors is still superior, both creatively and commercially. And it’s far more interesting than talking to a machine.
Very useful information. Thanks
Wow! Great insights here, including one that I hadn't considered: the smoothing out of the imperfections, imperfections that are actually strengths. Funnily, working at a Starbucks with a drive-thru, I have had customers act surprised at the window, saying "I thought I was talking to AI!" I guess I am AI that hasn't been smoothed out yet. LOL!
100%. 👍🏻
Keep spreading the word!!
For anyone interested, the Ipsos/Syracuse research is here :-) https://newhouse.syracuse.edu/news/ai-ads-are-almost-indistinguishable-from-human-made-work-they-just-dont-perform-as-well/