For years I’ve been saying that video conferencing as-we-know-it will never get beyond the sub-5% penetration of the business world where it has been stuck for the last 20 years – let alone reach the mass market. (For my reasons in some detail, see my slides from 2008.)

This (see video below) is the most encouraging thing I’ve seen in years. It’s far from production-ready, but this is the most practical way of really solving the all-important issues of knowing who-is-looking-at-whom, pointing, and eye contact that I’ve yet seen. (I don’t think telepresence systems do a sufficiently good job, even ignoring their cost. And this solution is cheap.)

Wait for the end where he steps into the view – until then I thought it was a single frame; it’s not – it’s done in real time on cheap hardware.

The core idea here – if you haven’t figured it out already – is to have two or more cameras looking at the same scene from different angles. To make things easier, he’s also got two Microsoft Kinects (one for each camera) directly reporting depth information. With a little geometry he can figure out the relative position of each object in the scene, and make a “virtual camera” that produces an image as it would be seen from any viewpoint between the cameras.
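
To make that geometry concrete, here is a minimal sketch of the reprojection step for a single pixel. It assumes an idealized pinhole camera and a virtual camera that is only shifted sideways (same orientation, same lens). The `Pinhole` class, the function names, and the numbers in the usage note are illustrative assumptions, not taken from the demo's actual code.

```python
from dataclasses import dataclass


@dataclass
class Pinhole:
    """Idealized pinhole camera: focal length and image centre, in pixels."""
    f: float
    cx: float
    cy: float


def unproject(cam, u, v, depth_m):
    """Turn pixel (u, v) plus its measured depth into a 3-D point (metres)
    in the real camera's own coordinate frame."""
    x = (u - cam.cx) * depth_m / cam.f
    y = (v - cam.cy) * depth_m / cam.f
    return (x, y, depth_m)


def project(cam, point):
    """Project a 3-D point in a camera's frame back to pixel coordinates."""
    x, y, z = point
    return (cam.f * x / z + cam.cx, cam.f * y / z + cam.cy)


def reproject(cam, u, v, depth_m, shift_x_m):
    """Where a pixel from the real camera lands in a virtual camera that sits
    shift_x_m metres to the right of it (hypothetical simplified setup)."""
    x, y, z = unproject(cam, u, v, depth_m)
    # Moving the camera right is the same as moving the scene left.
    return project(cam, (x - shift_x_m, y, z))
```

With `Pinhole(f=500, cx=320, cy=240)` and a virtual camera 10 cm to the right, the centre pixel moves about 25 pixels if the object is 2 m away and about 50 pixels if it is 1 m away. That depth-dependent shift is the parallax the virtual view has to reproduce. A real system would also have to deal with camera rotation, occlusions, and holes where neither camera saw the surface.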

I mentioned this video to some friends on a mailing list, and got a couple of responses questioning the whole idea that there is a “problem” with current video conferencing technology. Supposedly lots of people already use video telephony – there is Skype video, Google, NetMeeting, AIM, Polycom, Tandberg, telepresence systems, and the iPhone 4’s front-facing camera, so isn’t practical video communication already here?

Of course. I use video calling too – for example to chat with my son in college. But it’s very – deeply – unsatisfying in many ways, and works even worse in multi-point situations (3 or more sites in the call) and when there is more than one person at each site. And I think that is limiting use of the technology to a tiny fraction of what it would be if video communication worked the way people expect.

Consider the best-case situation of a simple point-to-point video call between two people – two sites, one person on each end. In many ways today’s video is a good improvement over ordinary telephony. The biggest problem is lack of eye contact, because the camera viewpoint is not co-located with the image of the far-end person’s eyes on the display.

Look at George Jetson and Mr. Spacely, below. They most definitely have eye contact. This interaction between boss and employee wouldn’t be the same without it. But we, the viewers, don’t have eye contact with either of them. And both of them know it.

We also expect, looking at the scene, that Mr. Spacely (on the screen) has peripheral vision – he could tell if we were present in the room with George. We feel he could look at us if he wanted to.

This is how people expect, and want, video communication to work. The artist who drew this scene knows all this without being told. But this is not how today’s video works.

George Jetson & Mr. Spacely (The Jetsons, Hanna-Barbera, date unknown)

Eye contact is a profoundly important non-verbal part of human social communication. Our brains are hardwired to know when we’re being looked in the eyes (even at amazing distances). Some people think human eyes have evolved not just to see, but also to be seen by other people. Lots of emotional content is communicated this way – dominance/submission, aggression/surrender, flirting, challenge, respect, belief/unbelief, etc. Without eye contact, video communication often feels too intimate and uncomfortable; because we can’t tell how the other person is looking at us, we have to assume the “worst case” to some extent. I think this is why video today, to the extent it is used for communication, is mostly used with our intimate friends and family members, where there is a lot of trust, and not with strangers.

Consider Jane Jetson, below, chatting away on the videophone with a girlfriend on a beach somewhere. We do this all the time with audio-only mobile phones, so people expect that they ought to be able to do the same with a videophone. This is reflected in fiction. So, look – where is the camera on the beach? Where is the display on the beach? It certainly isn’t in the hand of Jane’s friend – we can see both her hands. It’s not on the beach blanket. Where is it? Jane is sitting off to one side of the display; not in front. If she were using a contemporary video conferencing system, what would her friend on the beach see? Where is the camera on Jane’s end? Would Jane even be in the field of view? Look at Jane’s eyes – she’s looking at the picture of her friend on the screen, not at a camera. Yet they seem to have eye contact. How can this be? (Answer: With today’s equipment, it can’t.)

The Jetsons (Hanna-Barbera, 1962)

Things get much worse when you move to multi-point communication – where there are 3 or more sites in the call, and usually more than one person at each site. Then, you can’t tell who is addressing whom, who is paying attention, etc.

My perspective on this comes from having worked in the video conferencing industry for 15 years (not anymore, and I’m glad of that). That industry has been selling conferencing equipment to large organizations since the mid 1980s. The gear was usually sold on the basis of the costs to be saved from reduced travel – which mostly didn’t happen. Despite the fact that almost every organization bigger than a dozen people or so has at least one conference room for meetings, fewer than 2% of those conference rooms, worldwide, are equipped for video conferencing even today.

This state of affairs is very different from what anyone in the industry expected in the 90s.

All the things that the Jetsons artist – and typical consumers – so easily expect to “just work”, don’t work at all in the kind of video systems we build today. And that is a large part of why video communication is not nearly as successful or widespread as it might be.

These are not the only problems with today’s video systems (for a longer list, see my slides), and virtual camera technology won’t solve all of them. But it may solve some of the most important ones, and it’s a great start. Here’s another demo from the same guy:

This guy – Oliver Kreylos at UC Davis, judging by his web page – is by no means the first to play with these ideas. I saw a demo of something very similar at Fraunhofer in Berlin around 2005, and recall seeing papers describing the same thing posted on the walls of the MIT AI lab way back in the 1980s. (Back then they were doing it with still images only – the processing power to do video in real time didn’t yet exist.)

What is new is the ability to do it in a practical way, in real-time, with inexpensive hardware. He’s using Microsoft Kinects – which directly output depth maps – to vastly simplify the computation required and, I think, improve the accuracy of the model. Obviously there is a fair amount of engineering work still needed to go from this demo to a salable conferencing system. But I think all the key principles are at this point well understood – nothing lies in the way but some hard work.

To my many friends still in the video conferencing industry – watch out for this. It won’t come from your traditional competitors – well-established companies usually don’t innovate beyond minor improvements on what they already do. For something really new, they wait for somebody else to do it first, then they respond. Some small start-up, without people invested in the old way of doing things (or a desperate established firm) will probably do it first. (Why not you?)

Suppose you want to build a classical multipoint video conferencing system (VCS) – you have 3 or more sites, each with multiple people present, for example around a conference table. I think you can use this technology to make a conferencing system that feels natural and allows for real eye contact, pointing, and many of the other things that are missing from today’s VCS and “telepresence” systems.

How would such a system work?

All you need to do is send 2 or 3 video streams plus the depth data. Then each receiver can generate a virtual camera viewpoint anywhere between the cameras, so each viewer can see from a unique viewpoint.

Then if you co-locate the virtual camera positions with the actual (relative) display positions, you have real eye-contact and pointing that (should) work.

And if you have a 3D display, it shouldn’t be too hard to even have depth. (But I think it’ll work pretty well even with regular 2D displays.)

You need to send to the far-end:

  • Each of the camera video streams (time synchronized). Compressed, of course. There might be more than 2.
  • The depth information from the Kinects (or any other means of getting the depth – you could figure this out directly from the video using parallax, but I think it will be easier and more accurate to use something like the Kinect).
  • The relative locations and view angles of the cameras. (I think.) These probably have to be quite accurate. (It might be possible to calibrate the system by looking at test targets or something…)
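
As a rough sketch of the list above, the per-frame payload from one site might be modelled like this. All class and field names are hypothetical, the 5 ms tolerance is an arbitrary placeholder, and a real system would carry continuous compressed streams rather than one blob per frame.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CameraFeed:
    """One camera plus its depth sensor, for a single captured frame."""
    position_m: Tuple[float, float, float]        # camera location in the room
    yaw_pitch_roll_deg: Tuple[float, float, float]  # camera view angles
    video_frame: bytes                            # compressed colour frame
    depth_frame: bytes                            # matching compressed depth map
    capture_time_ms: int                          # capture timestamp


@dataclass
class SitePacket:
    """Everything one site sends to the far end for one moment in time."""
    site_id: str
    feeds: List[CameraFeed]

    def is_synchronized(self, tolerance_ms: int = 5) -> bool:
        """True if all feeds were captured within tolerance_ms of each other."""
        if not self.feeds:
            return True
        times = [feed.capture_time_ms for feed in self.feeds]
        return max(times) - min(times) <= tolerance_ms
```

The camera positions and angles would change rarely, so in practice they could be sent once at call setup (or after recalibration) rather than with every frame.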

With that information, the far-end can reconstruct a 2-D view from any virtual point between the cameras. (Or a 3-D view – just create two 2-D views, one for each eye. But then you’ll need a 3-D display to show it; I’m not sure if that’s really necessary for a good experience.)

In a practical system, you also need to exchange (among all parties) the location, orientation, and maybe size of the displays. Then for each of those display locations, you generate a virtual viewpoint (virtual camera) located at the same place as the display. If you can figure out where the eyes of each person are shown on each display (shouldn’t be hard – consumer digicams all do this now), then you can locate the virtual camera even more accurately where the eyes are (just putting the camera in the middle of the display probably isn’t accurate enough).
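
Placing the virtual camera at the on-screen eyes is a small coordinate conversion. The sketch below assumes the display is a flat rectangle whose bottom-left corner, edge directions, and physical size are known in room coordinates, and that some face detector (not shown) reports the eye position as fractions of the screen width and height. The function name and parameters are hypothetical.

```python
def eye_position_in_room(origin, right, up, width_m, height_m, eye_u, eye_v):
    """Room coordinates of a point on a flat display.

    origin   -- room coordinates (metres) of the display's bottom-left corner
    right/up -- unit vectors along the display's bottom and left edges
    width_m, height_m -- physical size of the visible picture
    eye_u, eye_v -- eye location as fractions (0..1) across and up the picture
    """
    return tuple(
        o + r * width_m * eye_u + u * height_m * eye_v
        for o, r, u in zip(origin, right, up)
    )
```

For a display 1 m wide and 0.5 m tall, eyes shown three quarters of the way up the picture sit 12.5 cm above the centre of the screen. A virtual camera left at the centre would be off by that much, which is the kind of error the paragraph above is worried about.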

This is entirely practical – 2x or 3x the bit rate of current video calls is no problem on modern networks. I think it’s probably more efficient, in bandwidth terms, to send the original video streams and depth data from each site (compressed, of course, probably with something clever like H.264 SVC) than to construct a 3-D model at the transmitting site and send that in real time, or to render the virtual camera views for each display at the transmitting site (since you’d need a unique virtual view for each far-end display). Of course you can do either of those if you want to, and the result is equivalent. A mature system could probably exploit the redundancy between the various camera views and depth information to get even better compression – so you might not need even 2x the bandwidth of existing video technology.
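
The trade-off between sending source streams and sending pre-rendered views can be put into a back-of-the-envelope calculation. The function names are hypothetical and the bit rates used in the note below are made-up placeholders, not measurements; the only point is how the two options scale.

```python
def uplink_kbps_sources(cameras, video_kbps, depth_kbps):
    """Uplink needed to send every camera's video plus its depth stream."""
    return cameras * (video_kbps + depth_kbps)


def uplink_kbps_rendered(far_end_displays, video_kbps):
    """Uplink needed to send one pre-rendered view per far-end display."""
    return far_end_displays * video_kbps


def break_even_displays(cameras, video_kbps, depth_kbps):
    """Number of far-end displays at which both options cost the same."""
    return cameras * (video_kbps + depth_kbps) / video_kbps
```

Assuming two cameras at 1000 kbps each and depth streams costing a quarter of that, sending the sources takes 2500 kbps no matter how many displays are watching, while rendered views cost 1000 kbps per far-end display. Under those assumed numbers, sending the sources wins once there are three or more far-end displays, which is the normal case in a multipoint call.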

Simple two-person point-to-point calls are an obvious subset of this.

There are alternative ways to use virtual cameras for conferencing – for example you could make people into avatars in a VR environment, similar to what Second Life and Teleplace have been doing. I don’t think turning people into avatars is going to feel very natural or comfortable, but maybe one day when subtle facial expressions can be picked up that will become interesting. More plausibly in my view, you could extract a 3-D model of each far-end person (a 3-D image, not a cartoon) and put them into a common virtual environment. That might work better – there isn’t any “uncanny valley” for virtual conference rooms (unlike avatars).

As always, comments are welcome.

P.S. – A side-rant on mobile phone based video telephony:

Mobile phones such as the iPhone 4 are (again) appearing with front-facing cameras meant for video telephony. Phone vendors think (correctly) that lots of customers like the idea of video telephony on their mobiles. Exactly as the dozens of firms that made videophones (see my slides) thought, correctly, that consumers like the idea of video telephony.

I fully agree that consumers like the idea. I’ve been saying that they don’t like the reality when they try it. Not enough to use it beyond a small niche of applications.

Such phones have been around for a long time – I recall trying out cellphone 2-way video, with a front facing camera, in the late 90s. (I was heavily involved in drafting the technical standard for such things, H.324.) In Europe at least, there was a period of 2 years or so in which virtually all high-end phones had front-facing cameras and videotelephone abilities. These flopped with a thick, resounding thud, just as I predict the iPhone 4’s videophone mode will.

Mobile phone video has special problems beyond the ones I’ve mentioned. First, the phone is usually handheld, which means the image is very, very shaky. Aside from the effect on the viewer of the shaky video, this does really bad things to the video compression (subsequent frames are no longer nearly as similar as they would be with a fixed camera). Second, the phone is normally held very close to the face. This results in a viewpoint far closer than a normal conversational distance, which gives a geometrically distorted image – things closer to the camera (noses and chins) look larger than things further away (ears). This is extremely unflattering – it is why portrait photographers use long lenses. Third, cellphones are very often used while the user is mobile (walking around). The requirement to stare at the screen has obvious problems which result in minor accidents (walking into parking meters, etc.).
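
The distortion from a close viewpoint follows from simple pinhole geometry: apparent size is inversely proportional to distance from the camera. The sketch below compares how much larger the nose is drawn than the ears; the 10 cm nose-to-ear depth and the two viewing distances in the note are rough assumptions for illustration only.

```python
def nose_to_ear_magnification(camera_to_nose_m, nose_to_ear_depth_m=0.10):
    """How much larger the nose is rendered relative to the ears, for a
    pinhole camera. 1.0 would mean no perspective exaggeration at all."""
    camera_to_ear_m = camera_to_nose_m + nose_to_ear_depth_m
    return camera_to_ear_m / camera_to_nose_m
```

At an arm's-length phone distance of 0.3 m, the nose comes out about 33% larger relative to the ears; at a conversational 1.5 m the exaggeration drops to under 7%.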

None of the above problems apply if the phone is set in a stationary position on a desk, at a normal distance. But that’s not how most people want to use something like the iPhone 4.