The failure of the pan-tilt camera in video calls

This year, we stayed with Kathryn's family for the holidays, so I attended dinner in my own mother's home via Skype. Once again, the technology was frustrating. And it need not be.

There are many things that could be better. For those of us who Skype regularly, we don't understand that there is still hassle for those not used to it. Setting up a good videoconferencing setup is still work. As I have found is always the case in a group-to-solos videoconference, the group folks do not care nearly as much about the conference as the remote solos, so a fundamental rule of design here is that if the remotes can do something, they should be the ones doing it, since they care the most. If there is to be UI, leave the UI to the remotes (who are sitting at computers and care) and not to the meeting room locals. Many systems get this exactly backwards -- they imagine the meeting room is the "master" and thus has the complex UI.

In this family setting, however, the clearest problem for me is that no camera can show the whole room. It's like sitting at the table unable to move your head, with blinders on. You can't really be part of the group. You also have to be away from the table so everybody there can see you, since screens are only visible over a limited viewing angle.

One clear answer to this is the pan/tilt camera, which is to say a webcam with servo motors that allow it to look around. This technology is very cheap -- you'll find pan/tilt IP security cameras online for $30 or less, and there are even some low priced Chinese made pan/tilt webcams out there -- I just picked another up for $20. I also have the Logitech Orbit AF. This was once a top of the line HD webcam, and still is very good, but Logitech no longer makes it. Logitech also makes the BCC950 -- a $200 conference room pan/tilt webcam which has extremely good HD quality and a built-in hardware compressor for 1080p video that is superb with Skype. We have one of these, and it advertises "remote control" but in fact all that means is there is an infrared remote the people in the room can use to steer the camera. In our meetings, nobody ever uses this remote for the reason I specify above -- the people in the room aren't the motivated ones.

This is compounded by the fact that the old method -- the audio conference speakerphone -- has a reasonably well understood UI. Dial the conference bridge, enter a code, and let the remotes handle their own calling in. Anything more complex than that gets pushback -- no matter how much better it is.

Sadly, in spite of the Orbit AF being one of Skype's first officially blessed cameras, none of the consumer videoconferencing systems offers a means for a remote party to control the camera. The motorized functions are marketed for use in "face tracking" where the camera follows you as you move around the room -- a feature almost nobody turns on -- or for traditional "webcam" or security camera use where you have one-way video streaming. In many cases, there are only Windows drivers to control the pan/tilt function, and Mac users are left out. (Linux users of course have worked out how to do it but Linux is only sparsely used for videoconferencing.)

Getting some remote control going

To make this work you need something seamless, which means integrated with your video tool, be it Skype, Hangout, gotomeeting or any of the others. It needs to be something where the remotes, as part of their window, get a camera control tool. There's really little need to worry about contention between the different remotes -- they all have the same goal and will quickly appoint who is to steer the camera. A simple anti-contention tool that gives exclusive control for a few seconds after a press would stop any fights steering it back and forth.
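As a rough sketch of that anti-contention idea, here is one way the timed exclusive control could work. Everything in it is an assumption made for illustration: the `SteeringLock` name, the 3-second hold and the string remote IDs do not come from any existing conferencing tool.

```python
import time
from typing import Callable, Optional


class SteeringLock:
    """Gives one remote exclusive camera control for a few seconds after each press."""

    def __init__(self, hold_seconds: float = 3.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.hold_seconds = hold_seconds   # hypothetical default; tune to taste
        self._clock = clock                # injectable so the logic can be tested
        self._owner: Optional[str] = None
        self._expires_at = 0.0

    def try_press(self, remote_id: str) -> bool:
        """Return True if this remote may steer now, False if another remote holds control."""
        now = self._clock()
        held_by_other = (
            self._owner is not None
            and self._owner != remote_id
            and now < self._expires_at
        )
        if held_by_other:
            return False
        # Either nobody holds control, the hold expired, or the owner pressed again:
        # (re)start the exclusive window for this remote.
        self._owner = remote_id
        self._expires_at = now + self.hold_seconds
        return True
```

Each accepted press renews the window, so whoever is actively steering keeps control, and anyone else gets in as soon as the steering stops for a few seconds.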

There are some solutions, but they are too cumbersome for regular use. One small developer has a program called Telerobo for the Logitech Orbit. This program is a Windows binary, and requires the conference computer be on Windows and the single remote operator's computer also run Windows. The former requirement makes some sense as there are no official drivers other than for Windows. The program would be more useful if it used a web control rather than a remote binary client -- in my environment I am often on video calls where none of the remotes run Windows.

For the lovely BCC950 camera, Logitech doesn't even provide much in the way of drivers for Windows. Another coder built a cloud remote for the BCC950, but it's just a hack, not a finished product, and is again Windows only.

It's a kludge, but another way you can sometimes get remote control is to bring up the pan/tilt/zoom control on the local PC, and then share that computer's screen with one of the remotes using a tool like VNC. Then the remote user can click onto the conference room PC and steer the camera. Obviously this is non-trivial to set up and maintain. Some of the cameras, meant for use as streaming security cams, also offer internal web interfaces that can, with work, be exported to an outsider (especially over a VPN) allowing remote steering. Of course the Linux drivers for the Orbit AF have this ability, but again it's all more setup than you can expect from casual conference room users.

A final possible kludge is to use an "infrared over IP" box (used for home automation) to permit remote control of a camera like the BCC950 with its IR remote control.

Nothing approaches the interface that should be, which is plug and play. There should be a standard API on all the OSs for pan/tilt/zoom and all the conference tools should support it. The remote users should get a new control on their screens when in calls to such devices. For security reasons, you may want to have the people with the PTZ camera confirm permission to steer the camera, either permanently (in a conference room) or case-by-case in your home. (I will admit to having done video conferences with a nice shirt and just my underwear when at home, and would not want people to aim the camera down without permission. Admit it, you've done this too.)

High end

Of course there are high end (multi-thousand dollar) video room setups that have nice expensive pan/tilt cameras, though even those see rare use. One nice one (that I am sure cost a fortune) for classrooms has a microphone and button at every desk in a classroom. Push the button and your audio is live and the camera zooms on you. But I'm not talking about what you can do at the high end.

Multiple Cameras

Another interesting approach is the use of multiple cameras in the meeting room. I have done this a few times with tools like Hangout and premium Skype which allow multi-way video calls. The typical meeting room is loaded with cameras that can join the conference -- there is one in every laptop, and one in most smartphones too. You must of course assure that only one device in the room does audio. (It would be a nice trick if the conference tools noticed two devices in the same room based on audio echo and identical source LAN and auto-muted newcomers. Or better still, regularly switch the microphone audio to whichever device is hearing the speaker best.)
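The "switch to whichever device is hearing the speaker best" idea can be sketched roughly as follows. This is only an illustration under loud assumptions: each device is assumed to hand over a short buffer of audio samples as floats, loudness (RMS) stands in for "hearing best", and the `pick_live_mic` name and the 1.5x switching ratio are invented here.

```python
import math
from typing import Dict, Optional, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of one audio buffer (0.0 for an empty buffer)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def pick_live_mic(frames: Dict[str, Sequence[float]],
                  current: Optional[str],
                  switch_ratio: float = 1.5) -> Optional[str]:
    """Choose which in-room device should be the one live microphone.

    frames maps a device name to its latest audio buffer. The current device
    is kept unless another one is clearly louder, which avoids flipping back
    and forth between two devices that hear the speaker about equally well.
    """
    levels = {device: rms(buf) for device, buf in frames.items()}
    if not levels:
        return current
    loudest = max(levels, key=levels.get)
    if current not in levels:
        return loudest                      # no live mic yet, or it left the call
    if levels[loudest] > switch_ratio * levels[current]:
        return loudest                      # clearly better, so switch
    return current                          # otherwise stay put
```

Every device other than the returned one would stay muted, which also takes care of the rule that only one device in the room does audio.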

You can use multi-camera in several ways. Most simply, you can just make sure you now cover the whole room, and let the remotes switch which camera they want to watch. It can also be very nice to combine a steered/zooming camera with a wide-view camera. The wide-view camera, mounted high on the wall, gives a sense of the whole room, and then the steered camera can do a close-up of whoever is speaking.

Better, but harder to do, is to have all the laptops in the room joined with the conference, but not transmitting. You then would like to set it up so that anybody in the room who is in front of a laptop can click to become the video and audio sender (or one of the video senders -- the wide view should still be available.) With permission, it would even be good if one of the remotes could do this switching. Automatic switching is not super practical because people might be sitting next to a laptop but not in its camera view.

Yet another multi-camera approach would be to build conference room webcams with a panoramic view, simply by mounting 3 webcams together. This could be done manually -- webcams are very cheap -- or eventually vendors could sell a camera that has 3 or more cameras in it. While a seamless blend would be sweet, it's not really needed. The simple ability to switch quickly between the cameras by the remote viewer would be nice, and faster than using the pan motor. Of course, if bandwidth is available, it would be nice to just see all the cameras and get the wide view. A lower-bandwidth alternative would be to send the selected camera at full bandwidth and send very blurry "peripheral vision" quality video from the other cameras. Here, if a camera were fixed, it could use stereo audio to identify where the speaker is and switch to the best camera automatically -- but you can still let the remotes turn that off or supersede it if it's not working well.
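Here is a crude sketch of the stereo-audio switching, assuming a fixed 3-camera cluster labelled left, center and right and a stereo microphone sitting with it. It only compares loudness between the two channels, which is a stand-in for real speaker localization (that would more likely use arrival-time differences). The function name, the camera labels and the 0.2 threshold are all made up for illustration.

```python
import math
from typing import Optional, Sequence


def _rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of one audio channel buffer."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def camera_for_speaker(left_channel: Sequence[float],
                       right_channel: Sequence[float],
                       balance_threshold: float = 0.2,
                       manual_override: Optional[str] = None) -> str:
    """Pick "left", "center" or "right" from the stereo level balance.

    manual_override lets a remote viewer supersede the automatic choice.
    """
    if manual_override is not None:
        return manual_override
    left, right = _rms(left_channel), _rms(right_channel)
    total = left + right
    if total == 0.0:
        return "center"                    # silence: fall back to the middle view
    balance = (right - left) / total       # -1.0 is all left, +1.0 is all right
    if balance > balance_threshold:
        return "right"
    if balance < -balance_threshold:
        return "left"
    return "center"
```

Keeping the override as a plain argument means the remotes can always turn the automation off, which matters more than getting the automatic guess perfect.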

Telepresence robot

I have 4 different friends who have started telepresence robot companies, all at very different price points. If I have 4 friends who have done this, expect a huge raft of companies to be out with products this year. These are the ultimate expression of giving control to the solo remote, who now can steer not just their view, but also their screen, and even move around in flat buildings. The better ones are pretty nice experiences, though it's still an open question of whether the world will embrace this.

Robots are expensive, but the hardware for Pan/Tilt or multi-camera is cheap and available. The failure has been in getting the programming in order and doing the user interface right. Is it that people just don't care about this, or are we just waiting for the iPhone moment, when somebody makes it seamless enough that people realize they wanted it?


One of the simplest solutions I could imagine would be a motorized pan/tilt mount designed to hold a smartphone. It could be controlled via Bluetooth from an app running on the smartphone. At the other end the remote person downloads and installs the same app on their smartphone or tablet, which provides them the video feed and controls for the pan/tilt mount.

The various multi-user functions you write about could be written into the app.

The advantage here is that you don't need to buy a special camera, have it plugged into a computer in the conference room or wherever, deal with drivers, etc. It should also be fairly inexpensive, since it's relying on a smartphone for the processor and camera. Smartphone/tablet app platforms also have superior one-tap software update mechanisms.

One disadvantage is that its video stream would have to go through the custom app, so it wouldn't immediately integrate into Skype or other video software. The developer of the pan-tilt hardware mount could release an API, though, and if it's popular enough it could get integrated into the apps for Skype/etc. Alternatively, the camera in the pan-tilt mount could just be a Skype camera, with a nearby smartphone just to control the mount over Bluetooth.

Another potential disadvantage is that remote people might not have smartphones or tablets to install the needed app. There could still be a web page using something like Adobe Flash to receive the video and control the camera, though (WebRTC has not received wide enough adoption for it to be viable for this).

Even better than the above, the pan/tilt mount could just have a standard tripod camera mount on top, and it'd come with a tripod mounting smartphone clamp, like the Glif. Surely there already exist motorized pan/tilt mounts for tripods that can be controlled by a smartphone over Bluetooth? I'm pretty sure I've seen some of these at Macworld Expo, maybe starting even as far back as three years ago. Add a Glif to one of these mounts and the hardware's already available. The controller apps for these mounts would need to allow remote control over the Internet, but that's such an obvious feature after they've been enabled for local smartphone control via Bluetooth.

Additionally, there are inexpensive clip-on wide-angle lenses for smartphones, like the Olloclip.

"For those of us who Skype regularly, we don’t understand that there is still hassle for those not used to it."

I think the key here is leveraging the much simpler smartphone/tablet platforms, rather than dealing with all the legacy crud that's built up on desktop systems over the years.

In a similar vein I actually prefer to use my smartphone or tablet for Skype. Then it becomes just another phone or video call, and I don't have to deal with headsets, camera connections/drivers/permissions/positioning/etc.

The main value of the pan/tilt is the big room. My consumer application -- Christmas dinner -- is a variant of this. But a smartphone would never cut it as the conferencing station in a big room, and even a tablet wouldn't. In particular, in a big room you need speakers which can put out some decent output.

Since the meeting room needs a big screen and nice speakers, I don't see much value in the phone, except as an adjunct member of the call. If the PC will have the wide-view camera and the phone sits there in a mount able to spin around, now you're talking. Connecting the phone to the PC is another thing to go wrong.

It really has to be seamless. It has to be built into skype/hangout/etc. or they need a plugin architecture which allows 3rd parties to put something in them that looks like it's built in, and for which remote install is just a click. Skype had such an API but is deprecating it.

I agree that a webrtc or flash client for the remotes is a good idea. Skype had a reason not to -- skype sends video peer to peer and flash/webrtc must be relayed via a server. But hangout and gotomeeting and the rest that are not peer to peer have no excuse not to support that.

But anyway, if you are going to have a conference PC, then I see no reason not to just have a pan/tilt webcam on it. They are not expensive, really they are not. Probably cheaper than the smartphone mount, and with a better camera. My lament is that we had such cameras, and we still have them (though Logitech discontinued the best consumer one) but there is no software to let them do what they were made to do.

You talk about both a "family setting" and conference rooms in business.

For a business conference room, presumably they should have IT people who can set these things up properly. Yet, they still don't. Or the tools (needed drivers, client software, etc.) just aren't available.

In a family setting it's even harder to get over the hurdle of the complications of the technology.

When looking at the tech world as a whole, it's clear that the consumer mobile space is where the biggest advances are happening in making new technology usable and adopted by the broadest number of people.

It is more likely that the people who will "get it right" with this will come at it from a consumer mobile direction. It may start out as something better suited to family settings, where there are no dedicated IT specialists, but then it takes very little to expand from there into corporate usage. Witness the explosion of consumer driven mobile tech in the corporate world over the past few years.

As for your specific objections: Smartphones all have headphone jacks, and can have speakers plugged in to them. Even better, though, they have Bluetooth audio, which doesn't need any physical connections to be sent to a larger, more audible speaker.

For the remote person's representation in a conference room there could be a computer screen with the Adobe Flash client running on it showing their video full-screen. Or the device in the pan/tilt mount could be a 10" tablet with the remote person's video on it in full-screen. (The downside with using a tablet is that its front-facing camera isn't going to be as good as its rear camera. At least on Apple hardware, though, the front-facing cameras have better low light sensitivity by using larger pixels.)

"It has to be built into skype/hangout/etc."

Unless the problem is solved much better by someone coming up with an alternate, disruptive solution. Something that has no driver or configuration hassles and "just works" on all devices (smartphones, tablets, and on PCs in a web interface), using their own protocol, could obviate the advantages of any existing systems. A parallel to this is the explosion in usage of WhatsApp and Line for messaging, in the face of the established players of AOL, MSN, Skype, Yahoo IM, Facebook Messenger, GTalk, etc.

"if you are going to have a conference PC, then I see no reason not to just have a pan/tilt webcam on it. ... we had such cameras, and we still have them ... but there is no software to let them do what they were made to do."

And there's your answer. Legacy software hassle. Drivers. Permissions. Arcane configuration. Cross-platform support.

One of the biggest reasons for the ongoing disruptions of legacy PCs by mobile platforms is that they do away with or minimize most of those issues. It's easier and faster to write software for mobile platforms. The hardware is better integrated. It's easier for the end user to set it up and get it running. Keeping the software up to date is easier and faster. It's more likely that it's going to just keep working when a major new version of the device's OS comes out, rather than needing updated drivers and client software.

It would be foolish to dismiss all this.

"low priced Chinese made pan/tilt webcams out there — I just picked another up for $20."


and quid pro quo, amazon has pogoplugs for $20!

I see two styles out there. I found the clearlinks one on ebay for $20. Another, a ball-style one that may have the same chip inside, will also turn up if you search for "motion tracking camera" or "head tracking camera" as well as "pan-tilt webcam" on ebay, amazon and other sites.

I could find no documentation on the protocol to control the camera, and Macs and Linux don't recognize it at all, unfortunately.

1) I'm frustrated by this problem too.

2) Apparently the term of art for what you want is "Far-end Camera Control".

I've used SIP-based teleconference solutions in professional settings where the software provided far-end camera control and it makes a huge difference. I can't, however, figure out for the life of me if this is actually standardized in SIP, or what systems out there might or might not support it...

Ah! The protocol seems to have been partially standardized at least:

Ah, and Cisco has patented the idea of controlling the far end camera entirely. Congratulations, this technology is totally dead for the next 18 years.

There's tons of prior art on the basic concepts, so it would fall to the specifics of what Cisco wrote here. All remote presence robots feature "far end camera control" and high end conferencing systems have had it for a while. Some of the earliest webcams of the 90s had it, but not in videoconferencing. In fact, that's part of what is frustrating -- it's been around a long time, but the one place it's hard to find it is videoconferencing.

Ignoring the patent question for now...

The Logitech cameras have far-end camera control given some crappy Windows-only proprietary software, but, more to the point, they appear to be fully USB interfaced devices. Intercepting USB controls is easy -- I've done it in the past -- so if someone had open source conferencing software in existence, coming up with a driver to integrate this stuff would not be hard either.

Part of the problem, though, is that there isn't much open source conferencing software out there. (I've had a great deal of trouble trying to set up my lab's conference room with a non-proprietary conferencing solution -- so much that I basically gave up a long time ago.) However, if someone had such a solution, the rest might be easy enough.

One of the curses about conferencing software is that what matters most is often what the other person has on their end, and you don't control that. I mean, you can encourage people to download something, and it had better be free, but the proprietary tools like skype and hangout are free. But it's also hard to convince the other person to learn a new tool, no matter how easy it is to download.

You mentioned the idea of having multiple cameras in the remote location and being able to switch between them from your location.

I have developed a long distance application for piano/keyboard instruction and performance called Internet MIDI. We use the Skype API to tell Skype to switch between available cameras, both locally and remotely.

Internet MIDI will thus enable you to switch between as many as 9 cameras. You can set up any key or pedal on your MIDI keyboard to be the trigger for this purpose. For example, on a digital piano with 88 notes and 3 pedals, you may find that the middle pedal or the lowest note is not going to be used musically and is a convenient trigger for this purpose.
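A minimal sketch of that kind of trigger logic, which is not Internet MIDI's actual code and makes no Skype API calls: it watches raw 3-byte MIDI messages and steps to the next camera when the chosen pedal or key is pressed. It assumes the middle pedal sends the standard sostenuto controller (66) and that the lowest key of an 88-note keyboard is note 21. The `CameraCycler` class and its `("cc", n)` / `("note", n)` trigger format are hypothetical.

```python
from typing import Optional, Tuple

NOTE_ON = 0x90          # MIDI status nibble for note-on
CONTROL_CHANGE = 0xB0   # MIDI status nibble for control change
SOSTENUTO_CC = 66       # controller number usually sent by the middle pedal
LOWEST_NOTE_88 = 21     # A0, the lowest key on an 88-note keyboard


class CameraCycler:
    """Steps through cameras each time the configured MIDI trigger is pressed."""

    def __init__(self, camera_count: int = 9,
                 trigger: Tuple[str, int] = ("cc", SOSTENUTO_CC)) -> None:
        if camera_count < 1:
            raise ValueError("need at least one camera")
        self.camera_count = camera_count
        self.trigger = trigger
        self.current = 0

    def _is_trigger(self, message: bytes) -> bool:
        if len(message) != 3:
            return False
        status, data1, data2 = message
        kind = status & 0xF0                  # drop the channel number
        if self.trigger == ("cc", data1):
            return kind == CONTROL_CHANGE and data2 >= 64   # pedal pressed
        if self.trigger == ("note", data1):
            return kind == NOTE_ON and data2 > 0            # velocity 0 is a release
        return False

    def handle(self, message: bytes) -> Optional[int]:
        """Return the new camera index if this message was the trigger, else None."""
        if not self._is_trigger(message):
            return None
        self.current = (self.current + 1) % self.camera_count
        return self.current
```

The returned index would then be handed to whatever actually switches the video input. A continuous (half-pedal) controller could send several values of 64 or more during one press, so a real version would want to ignore repeats until the pedal is released.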

Skype announced the demise of the API over a year ago, however it lives on. Recent changes to Skype for Mac and Skype for Windows Desktop have caused us to make some changes to how we interact with the API (and we will be releasing updated software soon). Interestingly, the Skype API works better today than it has ever worked in the past. Of course you have to understand how the API works, and it was never documented very well.

Regrettably, Skype for Modern Windows does not support the API.

The problem, of course, is that nobody wants to develop a product to an API that has been declared deprecated or soon to die. Even though they kept it alive it's hard to bet your product on that.

Where is the best documentation on how to use the API for switching video (or audio) inputs?
