Zelensky can up his game in translation to win more to his cause

Submitted by brad on Wed, 2022-03-16 15:53

OK, this is probably the last thing on Zelensky's mind right now, which is full of more pressing issues. I watched him speak to parliament and to the US Congress using the standard technique of a simultaneous translator. And he got a huge round of applause because his cause is dire and just.

I often give talks through simultaneous translators, who are skilled people who can listen in one language and talk in another at the same time. It's amazing they can do this, but even so a lot gets lost in translation, in particular the emotion, because people hear you well after you speak. Humour doesn't work at all.

I have experimented with some techniques to fix this, and I think they can be improved to help everybody, but especially Ukrainians. In particular, some simple video delay tricks can synchronize the translation of even live speakers with what they say, either in text or voice. In the future, the systems will be able to speak in your own voice, and even include your emotion as they do it.

I once gave a talk and was followed on stage by none other than Nicolas Sarkozy. (Long before the corruption scandal.) The audience spoke English, and Sarkozy opened in English but then said he would do better in French. We put on headsets and he did something amazing -- he had complete rapport with the audience. They laughed at his jokes. This never happens through translation. Sure, he is charismatic, as you need to be to become President, but there was a secret. I took off one headphone and started listening to him in French, which as a Canadian I can do though not perfectly. I realized the translator was sometimes saying things just as he said them, even once or twice before. This was his own personal French-to-English translator who traveled with him -- something you can do when you are a former President. She knew all his routines, all his common themes and phrases. She knew when he went off on some topic what he was likely to say, and said it. She could still handle improvisation of course, though not as perfectly, but well. It worked.

We can't always have that, but as speech-to-text-to-translation became better, I developed a tool for giving remote video talks. I delay my own video and audio to the remote location by 1-2 seconds, but I feed the translation subtitles into my video immediately. As a result, the subtitles come out right as I am saying things, sometimes even before. If you pay attention during a subtitled movie, the subtitles appear on screen before the actor says the line. This generates the best emotional transfer because you know what the actor is saying as you hear and see them speak their own language.

With my trick, I could even answer questions from the audience; I just have a small delay before I answer. It seems like magic -- I am obviously live, having a conversation, yet I have subtitles like a recorded show.
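The delay idea can be sketched in a few lines. This is only an illustration, not the actual tool described above: the `Frame` and `DelayLine` names are made up here, and a plain string stands in for real audio/video data. Frames from the speaker are held for a fixed delay, while subtitles would bypass the buffer and be shown as soon as they are produced.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Frame:
    timestamp: float  # capture time, in seconds
    payload: str      # stand-in for real audio/video data (assumption)


class DelayLine:
    """Hold frames and release each one `delay` seconds after capture."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self._queue: Deque[Frame] = deque()

    def push(self, frame: Frame) -> None:
        # Frames are assumed to arrive in capture order.
        self._queue.append(frame)

    def pop_ready(self, now: float) -> List[Frame]:
        # Release every frame whose delay has fully elapsed at time `now`.
        ready = []
        while self._queue and self._queue[0].timestamp + self.delay <= now:
            ready.append(self._queue.popleft())
        return ready


# Hypothetical usage: a 2-second delay on the speaker's video.
line = DelayLine(delay=2.0)
line.push(Frame(0.0, "frame-a"))
line.push(Frame(1.0, "frame-b"))
early = line.pop_ready(now=1.5)   # nothing is old enough yet
later = line.pop_ready(now=2.0)   # frame-a is released
```

The subtitle path needs no buffer at all: whatever the translation produces goes straight to the screen, so it lands roughly when the delayed speaker appears to say it.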

Well, almost. Machine translation has errors (especially if it has to translate fast) and it's not timed perfectly. You could build (and somebody should build) a translation system that knows when the source word was said in the source language, figures out the translation and then times them together. It could even put up word bubbles right as one says the words. (There are limits on this as of course the grammars of different languages are different and sometimes the words come out in different orders, but the translation software knows that.)
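As a rough sketch of that timing step: assume the speech recognizer gives a timestamp for each source word, and the translator gives, for each translated word, the index of the source word it came from. Both inputs are assumptions for illustration, and `schedule_bubbles` is a made-up name, not part of any real system. Each translated word is then scheduled for the moment the delayed speaker is seen saying the matching source word.

```python
from typing import List, Tuple


def schedule_bubbles(
    source_times: List[float],  # when each source word was spoken (seconds)
    translated: List[str],      # translated words, in target-language order
    alignment: List[int],       # alignment[i] = source index for translated[i]
    delay: float,               # delay applied to the speaker's video
) -> List[Tuple[float, str]]:
    """Return (show_at, word) pairs on the audience's clock."""
    return [
        (source_times[alignment[i]] + delay, word)
        for i, word in enumerate(translated)
    ]


# French "la maison blanche" spoken at 0.0, 0.5 and 1.0 seconds.
# English word order differs, so "white" points at the third source word.
bubbles = schedule_bubbles(
    source_times=[0.0, 0.5, 1.0],
    translated=["the", "white", "house"],
    alignment=[0, 2, 1],
    delay=1.5,
)
```

Here "white" is scheduled after "house" because the speaker says "blanche" last. A real system would have to decide whether to show bubbles in spoken order like this or to hold them back and reveal them in reading order.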

This is doable today. The longer the delay that is acceptable, the better the translation can get. It can get even better if you can pay human translators to watch and correct the translation. The more delay you can have, the better job they can do. Give it enough seconds and they could do a perfect job, producing perfectly synchronized subtitles in live presentations. Simultaneous translation is always done with two translators because it is very hard work, and people take turns to reduce the stress. In a system where the machine quickly translates and the humans fix it, you could put multiple humans on it as well to improve their speed and quality. With a high budget (like an address to Congress) you could do it perfectly.

You can make this even better if it is a rehearsed talk. With that, the humans can fix the translation errors during the rehearsal. They can train the translator to say "No, this is how you translate that." Even the jokes can be translated. Of course, the live talk will differ a bit from the rehearsal (or prior versions of the talk) but it would still be very, very good.

You can also put a speaking simultaneous translator through the same delay trick I have used. Delay the real speaker and let the audience hear the translation immediately, so the two match more closely. A control room could even dial the time difference up and down to keep it matching (this is easy now with video tools).
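The "dial" could also be automated. The sketch below is one possible approach, not something described above: it assumes somebody (or something) can measure how far the translator is currently running behind the speaker, and it nudges the video delay toward that lag a little at a time so the picture never jumps. The function name and parameters are hypothetical.

```python
def update_delay(
    current_delay: float,    # delay currently applied to the speaker (seconds)
    observed_lag: float,     # how far behind the translator is right now
    smoothing: float = 0.2,  # fraction of the gap to close per update
    max_delay: float = 5.0,  # never delay the speaker more than this
) -> float:
    """Move the video delay part of the way toward the translator's lag."""
    target = min(max(observed_lag, 0.0), max_delay)
    return current_delay + smoothing * (target - current_delay)


# Translator is 3 seconds behind, video is delayed 2 seconds:
# close half the gap on this update.
new_delay = update_delay(2.0, 3.0, smoothing=0.5)
```

A small `smoothing` value keeps the adjustment gentle; changing the delay too quickly would mean visibly speeding up or slowing down the speaker's video.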

It can also make sense to use a long delay on the "speech" part of an address, to make it perfect (or even to pre-record) and use this technique with a shorter delay, but less perfectly, for Q&A.

Of course, for Zelensky, another trick, since he speaks broken English, would be to have his translators prepare a translation of his talk for him so he can read it in his own voice, and only use the translator in Q&A.

The future

In the future, I imagine going much further. I imagine the system learning how to speak with your voice in the other language, but only mildly accented. I imagine it noting the emphasis and emotion from the original speech and putting it into the translated, synthetic speech. And then I imagine it doing one of two things with the video:

Dubbing the new synthetic speech right onto your video (using the delay) while tweaking the speed of that video to make your mouth movements time as well as possible to the new phonemes being generated for you.
Using deepfake techniques, again relying on the delay, to modify your mouth movements to precisely match the new phonemes, until you see a convincingly real dub. Your facial expressions would be retained as needed.

There would be a bit of uncanny valley, of course, but it could revolutionize speeches to people in different languages and re-establish the connection. Sarkozy was able to do it with far less.