How AI for lip dubbing could change the film industry

By Burt Helm



In Hollywood, big money is getting lost in translation.








Sure, the global entertainment business is synced up like never before. Marvel blockbusters captivate audiences in China. Korean directors score one coup after another in the U.S. Streaming development executives now scour foreign markets to bring home the next Squid Game, Lupin, and Money Heist. And Western entertainment companies are pouring money into so-called localization efforts to ensure the sun never sets on Spider-Man. Disney upped its localization spending to $33 billion in 2022, according to Variety, a 32% increase. Streamers now include options for subtitles and audio in multiple languages, even in old and niche entertainment.


But even as companies invest in quality script translations and better performances by voice actors, dubbed entertainment often still looks as cheesy as old kung fu films and Mr. Ed, turning audiences off. No matter how good the sound is, it seems wrong. Lips don’t lie. 


“The lips are always, always the last piece that nobody’s solved for,” says Jonathan Bronfman, cofounder and CEO of the visual effects company Monsters Aliens Robots Zombies (MARZ).




Earlier this year, Bronfman’s company unveiled a technology called LipDub AI, which digitally manipulates actors’ facial expressions to match spoken words in foreign languages. The technology promises to achieve an extraordinary level of realism and fluency, learning to make actors’ lips match the language and the performers. Marlon Brando will mumble in Mandarin, Jim Carrey will gesticulate in German, and Arnold Schwarzenegger’s English . . . well. AI is making more progress every day.



In the beginning, lip-dubbing technology was a crude joke: Schwarzenegger screaming at a late-night TV host through the superimposed lips of another man (“I AM HEE-AH TO SAVE CALIFORNIA!”). But the promise of new AI-driven software means that global audiences may be laughing with such technology, not at it, as well as crying, cheering, and loving performances where actors deftly deliver lines in any of hundreds of languages, whether or not the performers have ever uttered a word in those tongues.


LipDub’s technology is an evolution of an open-source AI model known as Wav2Lip, first released in 2020 by researchers at Hyderabad’s International Institute of Information Technology. Designed initially to synchronize lip movements in videos with specific audio tracks, it analyzes the input audio’s phonetic elements to identify different speech sounds. In parallel, it processes the video, focusing on the speaker’s face, especially the lip area. Wav2Lip uses deep learning models to understand the facial structure and predict corresponding lip movements. The technology combines audio analysis with video data to generate accurate lip synchronization. This results in a video where the lip movements match the spoken words in the audio track, enhancing realism for applications like movie dubbing, video conferencing, or animated characters. 





Adapting such technology into a valuable product for the film and advertising industries presented MARZ researchers with intricate challenges. The varying elements of movie production, such as changes in lighting and camera angles, along with scenes featuring multiple actors or several faces, demanded careful consideration. The presence of beards or the appearance of lips from different angles added to the complexity. A significant hurdle emerged when the AI initially failed to differentiate between speakers and non-speakers. This resulted in scenes where every character’s lips moved in sync with a single spoken line.


“Early on, we had to put black boxes over the faces we didn’t want speaking,” says Matt Panousis, MARZ’s cofounder and chief operating officer. “It’s one thing to do this in a simple video clip. It’s another to upload a whole movie.” 


While Hollywood clients demand hyperrealism from lip-dubbing software, amateur users are happy to experiment with less sophisticated tech. Plenty of other software companies (HeyGen, ElevenLabs) offer fast, free-to-use apps that translate short clips of video and audio with results that are still mind-bogglingly real.




 




MARZ, an AI-enabled visual effects (VFX) studio, was founded in 2018 and remains focused on professional users. The Toronto-based company has developed a reputation for delivering high-quality VFX for television, contributing to notable projects like Marvel’s WandaVision, HBO’s Watchmen, and Netflix’s The Umbrella Academy. The company has grown from 45 employees in 2019 to 80. More than 50 employees are dedicated to machine learning, says Bronfman, work that resulted in both LipDub and a product called Vanity, an AI-enabled “digital makeup tool” that “air-brushes” away wrinkles and other age-related imperfections from actors’ mugs.


So far, the company is using the LipDub AI technology in-house for its existing visual effects clients, including Apple TV. In the months to come, MARZ plans to release a fully automated software tool aimed at video professionals who are already accustomed to software like Adobe Premiere and Final Cut.


The future of AI in Hollywood is still being determined, of course. The SAG-AFTRA union and Hollywood studios are hashing out the nascent technology’s role in productions, and the need for actors’ explicit consent will complicate the deals necessary for LipDub and other tech to be of use. And President Biden recently issued an executive order seeking to curb misuse of deepfakes, even pshawing at an ersatz semblance of himself at the event. Lips do lie, after all.




If LipDub AI and similar technologies thrive, they could expand the reach of both foreign and domestic films, benefiting creators worldwide. That could represent a pivotal shift in the business of pop culture: Studios and streamers will need to act less like importer/exporters, simply slapping new audio over the original actors’ words and shipping it off to foreign consumers, and more like collectors and curators of authentic global culture, finding talent and stories with universal human appeal.







Fast Company
