How to do a low bandwidth, retinal resolution video call

Not everybody loves video calls, but there are times when they are great. I like them with family, and I try to insist on them when negotiating, because body language is important. So I've watched as we've increased the quality and ease of use.

The ultimate goals would be "retinal" resolution -- where the resolution surpasses your eye -- along with high dynamic range, stereo, light field, telepresence mobility and VR/AR with headset image removal. Eventually we'll be able to make a video call or telepresence experience so good it's a little hard to tell from actually being there. This will affect how much we fly for business meetings and travel within our towns, improve life for bedridden and low-mobility people, and more.

Here's a proposal for how to provide that very high or retinal resolution without needing hundreds of megabits of high quality bandwidth.

Many people have observed that the human eye is high resolution only in the center of attention, known as the fovea centralis. If you make a display that's sharp where a person is looking, and blurry out at the edges, the eye won't notice -- until, of course, it quickly darts to another section of the image and the tunnel vision becomes apparent.

Decades ago, people designing flight simulators combined "gaze tracking" -- detecting in real time where a person is looking -- with the foveal concept, so that the simulator only rendered the scene in high resolution where the pilot's eyes were. In those days in particular, rendering a whole immersive scene at high resolution wasn't possible. Even today it's a bit expensive. The trick is that you have to be fast -- when the eye darts to a new location, you have to render it at high resolution within milliseconds, or we notice. Of course, to an outside viewer, such a system looks crazy, and with today's technology, it's still challenging to make it work.
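
To make the foveal idea concrete, here's a small sketch (in Python, with made-up constants) of how an encoder might assign resolution to tiles of the frame based on how far they are from the tracked gaze point. The falloff is only illustrative, not a measured model of visual acuity.

```python
import math

def tile_scale(tile_center_deg, gaze_deg, full_res_radius_deg=2.0, falloff=0.35):
    """Return a resolution scale factor (1.0 = full resolution) for one tile.

    tile_center_deg, gaze_deg: (x, y) positions in degrees of visual angle.
    full_res_radius_deg: eccentricity kept at full resolution (roughly the fovea).
    falloff: how quickly resolution drops per degree beyond that -- an
        illustrative constant, not a measured acuity curve.
    """
    ecc = math.dist(tile_center_deg, gaze_deg)
    if ecc <= full_res_radius_deg:
        return 1.0
    # Resolution falls off smoothly with eccentricity outside the fovea.
    return 1.0 / (1.0 + falloff * (ecc - full_res_radius_deg))

# Example: a tile 20 degrees from the gaze point gets ~14% of full resolution.
print(tile_scale((20.0, 0.0), (0.0, 0.0)))
```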

With a video call, it's even more challenging. If a person moves their eyes (or, in AR/VR, their head) and you need a high resolution stream of the new point of attention, it can take a long time -- perhaps hundreds of milliseconds -- to send that signal to the remote camera, have it adjust the feed, and then get the new feed back to you. The user will inevitably see their new target as blurry for far too long. It would still be workable, but it will not be comfortable or seem real. For VR video conferencing it's an issue even for people turning their head. For now, a high resolution remote VR experience would probably require sending a half-sphere of full resolution video. The delay is probably tolerable only when the person turns their head far enough to look behind them.
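
To see why, here's a back-of-envelope latency budget. Every number in it is an assumption picked for illustration, not a measurement of any real system.

```python
# Illustrative latency budget for re-aiming a remote foveal stream.
# All figures are assumed, plausible values -- not measurements.
latency_ms = {
    "detect saccade and send new target upstream": 30,
    "one-way network transit (counted twice)": 40,
    "remote camera re-aims and re-encodes the foveal stream": 30,
    "decode and display the first new high-res frame": 20,
}
total = sum(latency_ms.values()) + latency_ms["one-way network transit (counted twice)"]
print(f"~{total} ms before the new gaze target sharpens")  # ~160 ms in this sketch

# The eye expects a sharp image within a few tens of milliseconds of a
# saccade landing, so a patch that stays blurry for ~160 ms is noticeable.
```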

An opposite approach being taken for low bandwidth video is the use of "avatars" -- animated characters of the other speaker, driven by motion capture at the other end. You've seen such characters in movies: Sméagol, the blue Na'vi of Avatar, and perhaps the young Jeff Bridges (acted by the older Jeff Bridges) in Tron: Legacy. Cartoon avatars are preferred because of what we call the uncanny valley -- people notice flaws in attempts at total realism but forgive them in cartoonish renderings. But we are now able to do moderately decent realistic renderings, and this is slowly improving.

My thought is to combine foveal video with animated avatars for brief moments after saccades and then gently blend them towards the true image when it arrives. Here's how.

  1. The remote camera will send video with resolution increasing toward the foveal attention point. It will also scan the entire scene and capture all motion of the face and body, probably using 3D scanning techniques like time-of-flight or structured light. It will also use spare background bandwidth to keep updating a static model of the people in the scene and of the room.
  2. Upon a saccade, the viewer's display will immediately (within milliseconds) combine the blurry image of the new target with the motion-capture data and the face model it has received, and render a generated view of the new target. It will also transmit the new target to the remote.
  3. The remote, when receiving the new target, will now switch the primary video stream to a foveal density video of it.
  4. When the new video stream starts arriving, the viewer's display will blend the two, creating a plausible transition between the rendered scene and the real scene, gradually correcting any differences until the video is 100% real (a sketch of this viewer-side logic follows the list).
  5. In addition, both systems will be making predictions about what the likely target of next attention is. We tend to focus our eyes on certain places, notably the mouth and eyes, so there are some places that are more likely to be looked at next. Some portion of the spare bandwidth would be allocated to also sending those at higher resolution -- either full resolution if possible, or with better resolution to improve the quality of the animated rendering.
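
Here is a minimal sketch of the viewer-side logic in the steps above. The component names (stream, avatar, display and their methods) are hypothetical placeholders for whatever capture, rendering and transport layers a real system would use; only the control flow matters.

```python
import time

class FoveatedCallViewer:
    """Viewer-side control loop; collaborator objects are hypothetical."""

    def __init__(self, stream, avatar, display, blend_seconds=0.3):
        self.stream = stream        # foveal video link to the remote end
        self.avatar = avatar        # local renderer driven by mocap + face model
        self.display = display
        self.blend_seconds = blend_seconds  # longer = less jarring, but "uncanny" longer
        self.blend_start = None

    def on_saccade(self, new_target):
        # Step 2: immediately show a generated view of the new target, built
        # from the blurry wide view plus the motion-capture and face-model data.
        self.avatar.set_region_of_interest(new_target)
        self.display.show(self.avatar.render(self.stream.latest_blurry_frame()))
        # Step 3 trigger: ask the remote to re-aim its high-resolution stream.
        self.stream.request_foveal_target(new_target)
        self.blend_start = None

    def on_frame(self):
        frame = self.stream.latest_frame()
        if not self.stream.foveal_target_confirmed():
            # Real high-res video hasn't arrived yet; keep showing the rendering.
            self.display.show(self.avatar.render(frame))
            return
        # Step 4: cross-fade from the generated view to the real video.
        if self.blend_start is None:
            self.blend_start = time.monotonic()
        alpha = min(1.0, (time.monotonic() - self.blend_start) / self.blend_seconds)
        self.display.show(self.display.blend(self.avatar.render(frame), frame, alpha))
```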

The animated rendering will, today, be both slightly wrong and subject to the uncanny valley problem. My hope is that if it is short-lived enough, it will be less noticeable, or at least not that bothersome. It will be possible to trade off how long you take to blend the generated video over to the real video: the longer you take, the less jarring any error correction will be, but the longer the image remains "uncanny."

There are about 100 million photoreceptors in the eye, but only about a million nerve fibers carrying their signals out. It would still be expensive to deliver full resolution at the attention spot and the most likely next spots, but it's much less bandwidth than sending the whole scene at that quality. Even if full resolution is not delivered, much better resolution can be offered.
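
As a rough, purely illustrative comparison (every constant below is an assumption):

```python
# Back-of-envelope arithmetic only; every constant is an assumption.
pixels_per_degree = 60          # roughly "retinal" density at the fovea
full_field_deg2 = 90 * 70       # a generous flat viewing window, in square degrees
foveal_patch_deg2 = 10 * 10     # full-resolution patch around the gaze point
predicted_patches = 2           # e.g. eyes and mouth, also sent at higher resolution

full_scene_pixels = full_field_deg2 * pixels_per_degree ** 2
foveal_pixels = foveal_patch_deg2 * pixels_per_degree ** 2 * (1 + predicted_patches)

print(f"whole scene at retinal density: {full_scene_pixels / 1e6:.0f} Mpixels per frame")
print(f"foveal + predicted patches:     {foveal_pixels / 1e6:.1f} Mpixels per frame")
print(f"ratio: ~{full_scene_pixels / foveal_pixels:.0f}x less to send at full quality")
# The low-resolution periphery, mocap data and model updates add some bandwidth
# back, but far less than streaming the whole scene at retinal resolution.
```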

Stereo and simulated 3D

You can also do this in stereo to provide 3D. Another interesting approach, done at CMU, is called pseudo-3D; I recommend you check out the video. That system captures the background and moves the flat image of the head against it as the viewer moves their own head. The result looks surprisingly good.

However, with specialized hardware you can do true stereo, and it will improve realism. The problem is that, short of autostereoscopic displays (which still never look that good), you need the viewer to wear at least polarized glasses. That makes things a little less natural, but see below.

The light field

The other way we pick up 3D cues (along with parallax and stereo) is through focus depth. A truly real image offers a light field -- where light from things at different distances converges appropriately. The background should be slightly out of focus when you look at the face, and the focus should shift when you re-aim your eyes. Magic Leap claims it will provide a light field in its augmented reality glasses, which will be interesting to see.

You can also cheat this with the accurate gaze tracking described above. If you know where the viewer is looking, you can generally predict where their eyes are focused, and adjust the flat image's depth of field accordingly.
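
Here's a minimal sketch of that cheat, assuming you already have a per-pixel depth map of the scene; the function name and constants are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaze_depth_of_field(image, depth, gaze_rc, max_blur=6.0, depth_tolerance=0.15):
    """Blur the image so that objects near the depth under the gaze stay sharp.

    image: HxWx3 float array; depth: HxW array of scene depth (e.g. meters).
    gaze_rc: (row, col) of the tracked gaze point on the screen.
    max_blur: strongest Gaussian sigma applied to far-from-focus regions.
    depth_tolerance: relative depth difference kept fully sharp.
    """
    focus_depth = depth[gaze_rc]
    # Per-pixel blur weight grows with relative distance from the focal depth.
    rel = np.abs(depth - focus_depth) / np.maximum(focus_depth, 1e-6)
    weight = np.clip((rel - depth_tolerance) / (1.0 - depth_tolerance), 0.0, 1.0)

    blurred = gaussian_filter(image, sigma=(max_blur, max_blur, 0))
    # Cross-fade per pixel between the sharp and fully blurred versions.
    w = weight[..., None]
    return (1.0 - w) * image + w * blurred
```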

High dynamic range

To make a scene truly realistic, you want more than the 8 bits of intensity per channel of today's computer displays. Fortunately, the new generation of displays is starting to offer more dynamic range, so this is going to come largely for free.

Moving your head

One very difficult challenge is making an image seem real when you move your head. Full-motion VR and AR systems have to do this, but it's really only possible with rendered, animated scenes, not with video. If we sit face to face and I move my head to the left, not only does your head move against the background (as the pseudo-3D above offers), but I see it from a different angle: more of the right side of your head and less of the left.

To do this in a real presence system, you would need to move the camera at the other end, which would take a very long time and be unacceptable. It is acceptable if I have a telepresence robot that I manually steer, because I am used to the delay in seeing how it moves.

It might be possible to once again render a realistic avatar while the camera moves, though here you must sustain it for much longer. One alternative is an array of cameras: software can combine the images from two cameras to produce the likely image that would have been shot from a point in between them. Again, this has flaws. You could move the point of view with no latency at the sending side, but moving it without latency at the receiving side would mean sending multiple camera feeds (though highly compressed, because of their redundancy).
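
Here's a minimal sketch of that view interpolation, assuming rectified cameras and a precomputed disparity map. A real system would also warp from the second camera and fill in the disoccluded holes, which this illustration skips.

```python
import numpy as np

def synthesize_view(left, disparity, t):
    """Forward-warp the left camera image toward a virtual viewpoint.

    left: HxWx3 image from the left camera.
    disparity: HxW horizontal disparity (pixels) of each left-image pixel
        relative to the right camera (assumed precomputed by stereo matching).
    t: 0.0 = left camera viewpoint, 1.0 = right camera viewpoint.
    """
    h, w, _ = left.shape
    out = np.zeros_like(left)
    # A point at column x in the left view appears near x - t * disparity
    # in the virtual view (rectified cameras, positive disparity).
    rows = np.repeat(np.arange(h)[:, None], w, axis=1)
    cols = np.clip(np.round(np.arange(w)[None, :] - t * disparity).astype(int), 0, w - 1)
    out[rows, cols] = left
    return out  # disocclusions remain as holes; a second camera would fill them
```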

This is one reason people advocate all-avatar-all-the-time conferencing. Aside from its super-low bandwidth, it offers the freedom to move your head, and of course your eyes.

AR/VR and glasses removal

The approaches described can apply to monitor-based conferencing, particularly room-based or table-based systems that simulate two people sitting across a table from one another. They can also be used in AR/VR.

The problem, of course, is that people doing AR/VR are wearing big bulky glasses on their heads. Nobody wants to have a video call with that. This means you need software able to "remove" the headset from the image. To do that, you would need a camera inside the headset looking at the eyes, and software to paint that view onto the image of the head. You would need to fill in the places the camera can't see and correct the lighting and angles. It's a very tall order. It is a bit easier if all you want to do is remove 3D glasses, or AR glasses which still let you see the user's eyes, even though they are wearing a strange contraption.
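
The final compositing step, at least, is simple; the hard parts are producing the headset mask and the reconstructed eye region in the first place. A minimal sketch, with those inputs assumed to come from earlier (unshown) stages:

```python
import numpy as np

def paint_eyes_over_headset(frame, headset_mask, rendered_eye_region):
    """Composite a reconstructed eye region over the headset in a video frame.

    frame: HxWx3 float image of the person wearing the headset.
    headset_mask: HxW values in [0, 1], 1 where the headset covers the face
        (assumed to come from a separate segmentation step).
    rendered_eye_region: HxWx3 image of the eyes and surrounding skin, built
        from the internal eye camera plus the stored face model, already relit
        and reprojected to this frame's viewpoint -- the hard part, not shown.
    """
    alpha = headset_mask[..., None]
    # Plain alpha composite; a soft-edged mask avoids a visible seam.
    return (1.0 - alpha) * frame + alpha * rendered_eye_region
```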

Gaze meeting

Another big problem in videoconferences is gaze meeting. The camera is not in the screen; it's usually above it. That means when I look at your eyes, I appear to be looking down below your chin, or worse (a rough calculation after this list shows how big the error is). Many solutions have been tried for this, including:

  • Remapping the eyes, moving them up just a bit so they appear to be looking at the right place. The problem is that the uncanny valley is strongest around the eyes; we are really quite sensitive to the eyes looking right. This can be improved by having cameras left and right, or top and bottom.
  • Special monitors that can have a camera behind them
  • Teleprompter style -- the use of half-silvered glass with the monitor reflected in it, and the camera behind it. This is the easiest, but it's very bulky.
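
As a rough illustration of how big the mismatch is (the numbers are assumptions, not measurements):

```python
import math

# Rough geometry of the gaze-mismatch error; both figures are assumed.
camera_offset_cm = 12     # camera this far above where the eyes appear on screen
viewing_distance_cm = 55  # typical laptop viewing distance

error_deg = math.degrees(math.atan2(camera_offset_cm, viewing_distance_cm))
print(f"apparent downward gaze error: ~{error_deg:.0f} degrees")  # ~12 degrees here
# We are very sensitive to gaze direction, so an offset of this size clearly
# reads as "looking below my chin" rather than into my eyes.
```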

Robotic telepresence

I have written before about where telepresence robots are going. Things are quite different if you want to do more than sit across the table from somebody. It will be wonderful to combine mobility with this sense of reality, along with at least AR glasses to make me feel more present in the remote environment. Foveal techniques as outlined above really only work for one viewer, or perhaps two or three in a fancy system, so it may make more sense to present a good avatar on the telepresence robot's screen, allowing the driver to wear VR or AR glasses.

Adding it all up

You can see from all of the above why others are hoping instead for really good avatar-based systems. They have another advantage, of course: your avatar can be better looking than you, or at the very least "cleaned up" -- a presentation of you at your best even if the real you is sitting there unshaven in pajamas. This is one of the things people don't like about videoconferencing: having to be presentable for every call, and having people see what they are doing in the room (like reading e-mail or eating). On the other hand, many companies like that videoconferencing forces people to be "in the meeting," because others can see when they are not. We all know how distracted remote audio-only attendees get.

As a result, we'll see the different technologies racing to see which can solve the problem best and at the lowest cost:

  1. Bandwidth just getting better and cheaper -- if gigabit connectivity becomes common, we won't need most of these tricks.
  2. Avatar-based calls will not seem fully real, but they might get very good, and they need almost no bandwidth.
  3. Tricks like the ones I describe above can bring us something quite good with much less bandwidth -- but with fancy hardware and software required.

Comments

This is an area where deep learning GANs could play a role in interpolating the data, and a policy-gradient model could be trained to predict various movements.

This might be a bit more compact if the camera used the mirror and the screen were viewed directly. You might have to compensate the patch of the display that is covered by the camera mirror so it appears to have the same brightness.
