Archive for category Video telecommunication

Webcam with privacy shutter & mute

Another business idea – this time a simple one.

Today the BBC has an article about “webcam hijackers” who blackmail people (mostly young women) with remote access trojans that grab webcam video.

I don’t get it. Why do almost no webcams come with a physical privacy shutter – just a sliding piece of plastic?

The right way to do this is a knob that turns a cam and slides a shutter in front of the lens.

It should have four positions:

  • Off – shutter blocks lens, cam opens a switch that electrically disconnects the microphone
  • Audio only – shutter blocks lens, cam closes mic switch
  • Video only – shutter away from lens, cam opens mic switch
  • On – shutter away from lens, cam closes mic switch

How hard is that, really?

 

Hey: VCs often say that good ideas are a dime a dozen. Mine go even cheaper than that. If you use this idea to make money, I’d like 0.5%. Of the equity in your company, or the profits. Or something. If that’s too much (or too little), fine – whatever you think is fair. This is a request for a gift, or reward – it is not a legal obligation. You’re free to use this idea and pay me nothing. If you can live with yourself that way.

Numpy == goodness! (deinterlacing video in Numpy)

A few posts back I was trying to get Linux to record and play video at the same time.

I gave up on that, but got it working under Windows with Python; I’ll post the source for that here at some point.

A big part of the solution was OpenCV, PyGame and Numpy.

I’m hardly the first to say it, but I’m excited – Numpy is goodness!

My (stupid) video capture device grabs both interlaced fields of SDTV and composes them into a single frame. So instead of getting clean 720×240 at 60 Hz (sampling every other line, interlaced), you get 720×480 at 30 Hz with horrible herringbone interlace artifacts if there is any movement in the scene.

The artifacts were really annoying, so I found a way to get Numpy to copy all the even-numbered lines on top of the odd-numbered lines to get rid of the interlace artifacts:

img[1::2] = img[::2]

That’s it – one line of code. And it’s fast (machine speed)! My laptop does it easily in real time. And I learned enough to do it after just 20 minutes reading the Numpy tutorial.

Then, I decided I could do better – instead of just doubling the even lines, I could interpolate between them to produce the odd-numbered lines as the average of the even-numbered lines (above and below):

img[1:-1:2] = img[0:-2:2]/2 + img[2::2]/2

It works great! Numpy == goodness!
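To see exactly what those slice assignments do, here's a toy worked example – a six-line, one-pixel-wide "frame" where the even lines hold one field and the odd lines hold the other. (I've written the averaging with `//` to stay in integer arithmetic; the `/2 + /2` form in the post gives the same values once NumPy casts back to uint8.)

```python
import numpy as np

# A toy 6-line "frame": even lines are one field (10, 20, 30),
# odd lines are the other field (99s standing in for garbage).
img = np.array([10, 99, 20, 99, 30, 99], dtype=np.uint8)

# Line doubling: copy each even line over the odd line below it.
doubled = img.copy()
doubled[1::2] = doubled[::2]
# doubled is now [10, 10, 20, 20, 30, 30]

# Interpolation: each interior odd line becomes the average of the
# even lines above and below it.
interp = img.copy()
interp[1:-1:2] = interp[0:-2:2] // 2 + interp[2::2] // 2
# interp is now [10, 15, 20, 25, 30, 99]
# (the last line keeps its old value – it has no neighbor below)
```

The same slicing works unchanged on a real 480×720×3 frame, because the slice applies to the first axis (the scan lines) and broadcasts over the rest.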

 


PS: Yes, I know I’m still throwing away half the information (the odd numbered lines); if you know a better way, please post a comment. Also, I’d like to record audio too, but OpenCV doesn’t seem to support that – if I get that working, I’ll post here.

Play & record video at the same time in Linux

I didn’t think this would be so difficult.

All I want to do is play live video on my netbook (from /dev/video1:input=1:norm=NTSC) and record it at the same time. Without introducing lag.

mplayer plays the video fine (no noticeable lag).

mencoder records it fine.

The mplayer FAQ says you can do it this way:

mencoder tv:// -tv driver=v4l:width=324:height=248:outfmt=rgb24:device=/dev/video0:adevice=hw.1,0 -oac mp3lame -lameopts cbr:br=128 -flip -ovc lavc -lavcopts threads=2 -o >( tee filename.avi | mplayer -)

But that doesn’t work.

You can’t record and play at the same time because there is only one /dev/video1 device, and once either mencoder or mplayer is using it, the device is “busy” to any other program that wants to read the video stream.

I spent lots of time with mplayer, mencoder, ffmpeg, avconv, and vlc; as far as I can tell none of them can do it, directly or indirectly. There are ways that work if you don’t mind 200 or 300 ms of extra latency over mplayer alone. But I’m doing an FPV teleoperation thing, and that’s too much latency for remote control.

I found a way that sort of works. Here’s a bash script (works in Linux Mint 15, which is like Ubuntu):

#!/bin/bash
mplayer tv:// -tv device=/dev/video1:input=1:norm=NTSC -fs&
outfile=$(date +%Y-%m-%d-%H%M%S)$1.mp4
avconv -f x11grab -s 800x600 -i :0.0+112,0 -b 10M -vcodec mpeg4 $outfile

This works by running mplayer to send the live video to the screen (full screen), then running avconv at the same time to grab the video back off the display (-f x11grab) and encode it. It doesn’t add latency, but grabbing video off the display is slow – I end up with around 10 fps instead of 30.

There must be some straightforward way to “tee” /dev/video1 into two virtual devices, so both mplayer and mencoder can read them at the same time (without one of them complaining that the device is “busy”). But I haven’t found anybody who knows how. I even asked on Stack Overflow and have exactly zero responses after a day.

(If you know how, please post a comment!)


Addendum for Linux newbies (like me):

After you put the script in file “video.sh”, you have to:

chmod +x video.sh # to make it executable (just the first time), then

./video.sh # to run the script (each time you want to run it)

You’ll probably want to tweak the script, so you should know that I’m using a KWorld USB2800D USB video capture device, which puts the composite video on input=1 (the default input=0 is for S-Video) and requires you to do norm=NTSC or it’ll assume the source is PAL.

-fs makes mplayer show the video fullscreen. Since I’m doing this on my Samsung N130 netbook with a 1024×600 screen, the 4:3 video is the 800×600 pixels in the middle of the screen (starting at (1024-800)/2 = 112 pixels from the left).

Also, many thanks to Compn on the #mplayer IRC for trying really hard to help with this.


Update 2013-11-02:

I haven’t given up on this, so I’ll use this space to record progress (or non-progress).

I started a Stack Exchange thread on this.

On IRC I was told that VLC can do this. I got as far as getting it to display the video at 720×96 (yes ninety-six) resolution, with a lot of lag (the source is VGA, 640×480).  Googling about it, it seems the resolution problem is probably fixable with VLC, but the lag isn’t.  So I gave up on that.

The most promising approaches at the moment seem to be:

  1. This page about ffmpeg, which gives ways to create multiple outputs from a single video input device – exactly what I need. But I haven’t found any way to get ffmpeg to read from input=1:norm=NTSC (as mplayer can).
  2. This thread on Stack Exchange, which seems to describe ways to “tee” the video from one device into 2 (or more) other devices – one using V4L2VD, the other using v4l2loopback. I haven’t figured out how to get either working.

Update 2013-11-03:

Pygame has the ability to read and display video streams, but ‘nrp’ (one of the developers of pygame) told me on IRC that he never implemented reading from anything other than the default input 0 (zero). He suggested that the info needed to update the pygame code to do that is here, and the source code is here. I’m not really up for doing that myself, but maybe somebody else will (I posted this bug on it, per nrp’s suggestion).

Another idea I had was to just buy a different USB video capture device, that works with the default input 0 and norm. So far I haven’t found one that does that.

But I’ve got two new leads (they’re what the next update is about):


Update 2013-11-03 #2:

I think I made a sort of breakthrough.

v4l2-ctl can be used to control the video4linux2 driver after the app that reads the video stream has started. So even if the app mis-configures /dev/video1, once the app is running you can configure it properly.

The magic word for me is:

v4l2-ctl -d 1 -i 1 -s ntsc

That sets /dev/video1 (-d 1) to input 1 (-i 1) and NTSC (-s ntsc).

Not only that, but I (finally) found out how to get avconv to configure video4linux2 correctly (and maybe also for ffmpeg).

For avconv, “-channel n” sets the input channel, and “-standard NTSC” sets NTSC mode.  I think the equivalents in ffmpeg are “-vc n” and “-tvstd ntsc” respectively, but I haven’t tried those yet.

But this works:

avplay -f video4linux2 -standard NTSC -channel 1 /dev/video1

Now I can try to ‘tee’ the output from /dev/video1….


Update 2014-06-01:

I gave up on this, but eventually got it working in Python with Windows (see this post); maybe that method will also work in Linux (I haven’t tried it).

For what it’s worth, this guy claims he has it working this way:

vlc -vvv v4l2:///dev/video1:input=1:norm=PAL-I:width=720:height=576 --input-slave=alsa://plughw:1,0 --v4l2-standard=PAL_I --sout '#duplicate{dst=display,dst="transcode{vcodec=mp4v,acodec=mpga,vb=800,ab=128}:std{access=file,mux=mp4,dst=test.mp4}"}'

I’m doubtful (esp. re latency), but you could try it.

Why videophones always fail and what to do about it

A couple of weeks ago I was in Seattle and presented a slightly updated version of my “Past, Present, and Future of Video Telephony” talk at Microsoft Research Redmond.

The folks at Microsoft were nice enough to post a good-quality video of the talk on their website for public viewing – if you’re interested at all in my (somewhat controversial) views on the matter, have a look.

Link to video presentation

Email me if you want a copy of the slides or video clip.

While I’m on the subject, an interesting article in the June 2011 IEEE Spectrum describes how people respond to eye contact and mimicking of their gestures (the authors used avatars to generate simulated eye contact). As I’ve said elsewhere, I don’t think avatars are a likely solution to the video telecommunication problem (at least with the current state of the art), but the idea of manipulating eye contact deliberately in a video conference is interesting – and these studies seem to provide evidence it might be effective.

The future of video conferencing?

For years I’ve been saying that video conferencing as-we-know-it will never get beyond the sub-5% penetration of the business world that it’s been in for the last 20 years – let alone the mass-market. (For my reasons in some detail, see my slides from 2008).

This (see video below) is the most encouraging thing I’ve seen in years. It’s far from production-ready, but this is the most practical way of really solving the all-important issues of knowing who-is-looking-at-whom, pointing, and eye contact that I’ve yet seen. (I don’t think telepresence systems do a sufficiently good job, even ignoring their cost. And this solution is cheap.)

Wait for the end where he steps into the view – until then I thought it was a single frame; it’s not – it’s done in real time on cheap hardware.


The core idea here – if you haven’t figured it out already – is to have two or more cameras looking at the same scene from different angles. To make things easier, he’s also got two Microsoft Kinects (one for each camera) directly reporting depth information. With a little geometry he can figure out the relative position of each object in the scene, and make a “virtual camera” that produces an image as it would be seen from any viewpoint between the cameras.
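The geometry involved is simple enough to sketch in a few lines. Here’s a toy version assuming an ideal pinhole camera with made-up intrinsics (focal length f, principal point cx, cy) – a real system also has to handle calibration, rotation between cameras, occlusion, and hole-filling:

```python
import numpy as np

f, cx, cy = 500.0, 320.0, 240.0   # assumed pinhole intrinsics

def unproject(u, v, z):
    """Pixel (u, v) with known depth z -> 3-D point in camera coordinates."""
    return np.array([(u - cx) * z / f, (v - cy) * z / f, z])

def project(p):
    """3-D point -> pixel coordinates in the (virtual) camera."""
    x, y, z = p
    return np.array([f * x / z + cx, f * y / z + cy])

# A point the real camera sees at pixel (400, 240), 2 m away
# (this is the depth the Kinect hands you directly).
p = unproject(400.0, 240.0, 2.0)

# Virtual camera: same orientation, shifted 0.1 m to the right.
# (A full system applies a rotation too: p_virtual = R @ p + t.)
t = np.array([0.1, 0.0, 0.0])
p_virtual = p - t

uv = project(p_virtual)
# The point lands further left in the virtual view – exactly what you
# expect when the camera moves right.
```

Do that for every pixel (with depth from the Kinect), resolve which surfaces occlude which, and you have the virtual-camera image.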

I mentioned this video to some friends on a mailing list, and got a couple of responses questioning the whole idea that there is a “problem” with current video conferencing technology. Supposedly lots of people already use video telephony – there is Skype video, Google, NetMeeting, AIM, Polycom, Tandberg, telepresence systems, and the iPhone 4’s front-facing camera, so isn’t practical video communication already here?

Of course. I use video calling too – for example to chat with my son in college. But it’s very – deeply – unsatisfying in many ways, and works even worse in multi-point situations (3 or more sites in the call) and when there is more than one person at each site. And I think that is limiting use of the technology to a tiny fraction of what it would be if video communication worked the way people expect.

Consider the best-case situation of a simple point-to-point video call between two people – two sites, one person on each end. In many ways today’s video is a good improvement over ordinary telephony. The biggest problem is lack of eye contact, because the camera viewpoint is not co-located with the image of the far-end person’s eyes on the display.

Look at George Jetson and Mr. Spacely, below. They most definitely have eye contact. This interaction between boss and employee wouldn’t be the same without it. But we, the viewer, don’t have eye contact with either of them. And both of them know it. 

We also expect, looking at the scene, that Mr. Spacely (on the screen) has peripheral vision – he could tell if we were present in the room with George. We feel he could look at us if he wanted to.

This is how people expect, and want, video communication to work. The artist who drew this scene knows all this without being told. But this is not how today’s video works.

George Jetson & Mr. Spacely (The Jetsons, Hanna Barbera, date unknown)

Eye contact is a profoundly important non-verbal part of human social communication. Our brains are hardwired to know when we’re being looked in the eyes (even at amazing distances). Some people think human eyes have evolved not just to see, but also to be seen by other people. Lots of emotional content is communicated this way – dominance/submission, aggression/surrender, flirting, challenge, respect, belief/unbelief, etc. Without eye contact, video communication often feels too intimate and uncomfortable; because we can’t tell how the other person is looking at us, we have to assume the “worst case” to some extent. I think this is why video today, to the extent it is used for communication, is mostly used with our intimate friends and family members, where there is a lot of trust, and not with strangers.

Consider Jane Jetson, below, chatting away on the videophone with a girlfriend on a beach somewhere. We do this all the time with audio-only mobile phones, so people expect that they ought to be able to do the same with a videophone. This is reflected in fiction. So, look – where is the camera on the beach? Where is the display on the beach? It certainly isn’t in the hand of Jane’s friend – we can see both her hands. It’s not on the beach blanket. Where is it? Jane is sitting off to one side of the display; not in front. If she were using a contemporary video conferencing system, what would her friend on the beach see? Where is the camera on Jane’s end? Would Jane even be in the field of view? Look at Jane’s eyes – she’s looking at the picture of her friend on the screen, not at a camera. Yet they seem to have eye contact. How can this be? (Answer: With today’s equipment, it can’t.)

The Jetsons (Hanna-Barbera, 1962)

Things get much worse when you move to multi-point communication – where there are 3 or more sites in the call, and usually more than one person at each site. Then, you can’t tell who is addressing whom, who is paying attention, etc.

My perspective on this comes from having worked in the video conferencing industry for 15 years (not anymore, and I’m glad of that). That industry has been selling conferencing equipment to large organizations since the mid 1980s. The gear was usually sold on the basis of the costs to be saved from reduced travel – which mostly didn’t happen. Despite the fact that almost every organization bigger than a dozen people or so has at least one conference room for meetings, less than 2% of those conference rooms, worldwide, are equipped for video conferencing even today.

This state of affairs is very different from what anyone in the industry expected in the 90s.

All the things that the Jetsons artist – and typical consumers – so easily expect to “just work”, don’t work at all in the kind of video systems we build today. And that is a large part of why video communication is not nearly as successful or widespread as it might be.

These are not the only problems with today’s video systems (for a longer list, see my slides), and virtual camera technology won’t solve all of them. But it may solve some of the most important ones, and it’s a great start. Here’s another demo from the same guy:

This guy – Oliver Kreylos at UC Davis, judging by his web page – is by no means the first to play with these ideas – I saw a demo of something very similar at Fraunhofer in Berlin around 2005, and recall seeing papers describing the same thing posted on the walls of the MIT AI lab way back in the 1980s (back then they were doing it with still images only – the processing power to do video in real-time didn’t yet exist).

What is new is the ability to do it in a practical way, in real-time, with inexpensive hardware. He’s using Microsoft Kinects – which directly output depth maps – to vastly simplify the computation required and, I think, improve the accuracy of the model. Obviously there is a fair amount of engineering work still needed to go from this demo to a salable conferencing system. But I think all the key principles are at this point well understood – nothing lies in the way but some hard work.

To my many friends still in the video conferencing industry – watch out for this. It won’t come from your traditional competitors – well-established companies usually don’t innovate other than minor improvements on what they already do. For something really new, they wait for somebody else to do it first, then they respond. Some small start-up, without people invested in the old way of doing things (or a desperate established firm) will probably do it first. (Why not you?)

Suppose you want to build a classical multipoint video conferencing system (VCS) – you have 3 or more sites, each with multiple people present, for example around a conference table. I think you can use this technology to make a conferencing system that feels natural and allows for real eye contact, pointing, and many of the other things that are missing from today’s VCS and “telepresence” systems.

How would such a system work?

All you need to do is send 2 or 3 video streams plus the depth data. Then each receiver can generate a virtual camera viewpoint anywhere between the cameras, so each viewer can see from a unique viewpoint.

Then if you co-locate the virtual camera positions with the actual (relative) display positions, you have real eye-contact and pointing that (should) work.

And if you have a 3D display, it shouldn’t be too hard to even have depth. (But I think it’ll work pretty well even with regular 2D displays.)

You need to send to the far-end:

  • Each of the camera video streams (time synchronized). Compressed, of course. There might be more than 2.
  • The depth information from the Kinects (or any other means of getting the depth – you could figure this out directly from the video using parallax, but I think it will be easier and more accurate to use something like the Kinect).
  • The relative locations and view angles of the cameras. (I think.) These probably have to be quite accurate. (It might be possible to calibrate the system by looking at test targets or something…)

With that information, the far-end can reconstruct a 2-D view from any virtual point between the cameras. (Or a 3-D view – just create two 2-D views, one for each eye. But then you’ll need a 3-D display to show it; I’m not sure if that’s really necessary for a good experience.)

In a practical system, you also need to exchange (among all parties) the location, orientation, and maybe size of the displays. Then for each of those display locations, you generate a virtual viewpoint (virtual camera) located at the same place as the display. If you can figure out where the eyes of each person are shown on each display (shouldn’t be hard – consumer digicams all do this now), then you can locate the virtual camera even more accurately, right where the eyes are (just putting the camera in the middle of the display probably isn’t accurate enough).

This is entirely practical – 2x or 3x the bit rate of current video calls is no problem on modern networks. In bandwidth terms, I think it’s more efficient to send the original video streams and depth data from each site (compressed, of course, probably with something clever like H.264 SVC) than to construct a 3-D model at the transmitting site and send that in real time, or to render the virtual camera views for each display at the transmitting site (you’d need a unique virtual view for each far-end display). But of course you can do either of those if you want – the result is equivalent. A mature system could probably exploit the redundancy between the various camera views and depth information to get even better compression – so you might not need even 2x the bandwidth of existing video technology.

Simple two-person point-to-point calls are an obvious subset of this.

There are alternative ways to use virtual cameras for conferencing – for example you could make people into avatars in a VR environment, similar to what Second Life and Teleplace have been doing. I don’t think turning people into avatars is going to feel very natural or comfortable, but maybe one day when subtle facial expressions can be picked up that will become interesting. More plausibly in my view, you could extract a 3-D model of each far-end person (a 3-D image, not a cartoon) and put them into a common virtual environment. That might work better – there isn’t any “uncanny valley” for virtual conference rooms (unlike avatars).

As always, comments are welcome.

P.S. – A side-rant on mobile phone based video telephony:

Mobile phones such as the iPhone 4 are (again) appearing with front-facing cameras meant for video telephony. Phone vendors think (correctly) that lots of customers like the idea of video telephony on their mobiles. Exactly as the dozens of firms that made videophones (see my slides) thought, correctly, that consumers like the idea of video telephony.

I fully agree that consumers like the idea. I’ve been saying that they don’t like the reality when they try it. Not enough to use it beyond a small niche of applications.

Such phones have been around for a long time – I recall trying out cellphone 2-way video, with a front facing camera, in the late 90s. (I was heavily involved in drafting the technical standard for such things, H.324.) In Europe at least, there was a period of 2 years or so in which virtually all high-end phones had front-facing cameras and videotelephone abilities. These flopped with a thick, resounding thud, just as I predict the iPhone 4’s videophone mode will.

Mobile phone video has special problems beyond the ones I’ve mentioned. First, the phone is usually handheld, which means the image is very, very shaky. Aside from the effect on the viewer of the shaky video, this does really bad things to the video compression (subsequent frames are no longer nearly as similar as they would be with a fixed camera). Second, the phone is normally held very close to the face. This results in a viewpoint far closer than a normal conversational distance, which gives a geometrically distorted image – things closer to the camera (noses and chins) look larger than things further away (ears). This is extremely unflattering – it is why portrait photographers use long lenses. Third, cellphones are very often used while the user is mobile (walking around). The requirement to stare at the screen has obvious problems which result in minor accidents (walking into parking meters, etc.).
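The “big nose” effect is just perspective arithmetic: apparent size scales as 1/distance from the camera. With some assumed numbers – say the nose sits about 10 cm closer to the camera than the ears – the effect at phone-holding distance is dramatic:

```python
# Apparent size scales as 1/distance from the camera.
# Assumption: the nose is ~10 cm closer to the camera than the ears.
def nose_to_ear_ratio(camera_to_ears_cm):
    nose = camera_to_ears_cm - 10.0
    return camera_to_ears_cm / nose   # how much bigger the nose looks

close = nose_to_ear_ratio(30.0)     # handheld phone, ~30 cm away
normal = nose_to_ear_ratio(120.0)   # conversational distance, ~1.2 m

# close  -> 1.5   (the nose looks 50% too big)
# normal -> ~1.09 (barely noticeable)
```

This is the same reason portrait photographers back up and use a long lens: more distance flattens the ratio between near and far features.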

None of the above problems apply if the phone is set in a stationary position on a desk, at a normal distance. But that’s not how most people want to use something like the iPhone 4.

The Past, Present, and Future of Video Telecommunication

Last week I presented a talk on “The Past, Present, and Future of Video Telecommunication” at the 2008 IMTC Fall Forum.

1927 Videophone

I spent 14 years in the video conferencing industry (1993-2006), first at PictureTel and then at Polycom.  For most of that time I did standardization work – I was a rapporteur in the ITU, went to IETF meetings, etc.

The talk is my view on why video telecommunication (videophones, picturephones, video conferencing, etc.) has never taken off in the mass market, despite lots of consumer enthusiasm and lots of investment.

In brief, the quality of experience offered by these products doesn’t come close to what consumers expect and are promised.

Manufacturers keep expecting that being able to see people on a screen is enough to make it “feel like being there”, but it isn’t.  (To be fair, consumers expect this too, and are disappointed when they try it.)

In my opinion, this situation will remain the same until products are offered that really do provide a feeling of “being there”, or at least more so.

Recently, “telepresence” systems (Cisco, HP, Polycom, Tandberg) have taken a step in this direction.  These systems are better, but still far from what I think is needed.  They are also impossibly expensive for consumer use.

The slides make the case in more detail than I care to go into in this post.

1964 Picturephone

Update March 2009: I’ve found it difficult to succinctly express just what it is that’s missing from the video communication experience. The issue is not the quality of video or audio – a long telephone call feels far more intimate than a video call, despite 3 kHz audio bandwidth and no video at all (the reason for this is given in the presentation).  In the slides I talk a lot about eye contact and the importance of knowing who is looking at whom.

Pointing finger

But to put it another way, I’ll make a prediction: No system for video conferencing / video telephony / telepresence will enjoy mass-market adoption until it is possible to point (with a finger) at a particular person or thing on the far-end. To clarify that, I mean the ability for someone in Boston to point at a particular person in Geneva, and for everyone present in Geneva to be aware who or what is being pointed at.

You’re welcome to use the slides as you like, but please do credit me as the source.

2008-11-imtc-sfo.ppt 5.8 Mbytes – PowerPoint slides only

2008-11-imtc-sfo.zip 21 Mbytes – ZIP file (slides & video clips)

(for the ZIP file version, you’ll need to have VLC installed in “C:\Program Files\VideoLAN\VLC” for the clips to play).

Not much like being there
