comma.ai's neural network car and the hot new technology in robocars
Perhaps the world's most exciting new technology today is deep neural networks, in particular the convolutional neural networks used in "Deep Learning." These networks are conquering some of the most well known problems in artificial intelligence and pattern matching, and since their development just a few years ago, milestones in AI have been falling as computer systems that match or surpass human capability have been demonstrated. Playing Go is just the most recent famous example.
This is particularly true in image recognition. Over the past several years, neural network systems have gotten better than humans at problems like recognizing street signs in camera images, and have even beaten radiologists at identifying cancers in medical scans.
These networks are having their effect on robocar development. They are allowing significant progress in the use of vision systems for robotics and driving, making that progress much faster than expected. Two years ago, I declared that the time when vision systems would be good enough to build a safe robocar without LIDAR was still fairly far away. That day has not yet arrived, but it is definitely closer, and it's much harder to say it won't be soon. At the same time, LIDAR and other sensors are improving and dropping in price. Quanergy (to whom I am an advisor) plans to ship $250 8-line LIDARS this year, and $100 high resolution LIDARS in the next couple of years.
Deep neural networks are a primary tool of MobilEye, the Jerusalem company which makes camera systems and machine-vision ASICs for the ADAS (Advanced Driver Assistance Systems) market. Its chip is the one used in Tesla's autopilot; Tesla claims it has done a great deal of its own custom development, while MobilEye claims the important magic sauce is still mostly theirs. NVIDIA has made a big push into the robocar market by promoting their high end GPUs as the supercomputing tool cars will need to run these networks well. The two companies disagree, of course, on whether GPUs or ASICs are the best tool for this -- more on that later.
In comes comma.ai
In February, I rode in an experimental car that took this idea to the extreme. The small startup comma.ai, led by iPhone hacker George Hotz, got some press by building, in a short amount of time, an autopilot similar in capability to many others from car companies. In January, I wrote an introduction to their approach, including how they used quick hacking of the car's network bus to simplify having the computer control the car. They did it with CNNs, and almost entirely with CNNs. Their car feeds the images from a camera into the network, and out from the network come commands to adjust the steering and speed to keep a car in its lane. As such, there is very little traditional code in the system, just the neural network and a bit of control logic.
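To make "just the neural network and a bit of control logic" concrete, here is a minimal sketch of one control cycle. It is not comma.ai's code: the names Command, drive_step and stub_network, and the limits of 2 degrees per cycle and 30 m/s, are all hypothetical, and the trained CNN is replaced by a stand-in function.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Command:
    steering_deg: float  # steering angle to request
    speed_mps: float     # target speed to request


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def drive_step(
    frame: Sequence[float],
    network: Callable[[Sequence[float]], Tuple[float, float]],
    previous: Command,
    max_steer_step_deg: float = 2.0,   # assumed limit, for illustration only
    max_speed_mps: float = 30.0,       # assumed limit, for illustration only
) -> Command:
    """One control cycle: camera frame in, bounded steering and speed out."""
    raw_steering, raw_speed = network(frame)
    # The only hand-written logic: limit how far steering may move in one
    # cycle and keep the requested speed inside a fixed range.
    steering = clamp(
        raw_steering,
        previous.steering_deg - max_steer_step_deg,
        previous.steering_deg + max_steer_step_deg,
    )
    speed = clamp(raw_speed, 0.0, max_speed_mps)
    return Command(steering, speed)


def stub_network(frame: Sequence[float]) -> Tuple[float, float]:
    # Hypothetical placeholder for the trained CNN.
    return 10.0, 50.0


command = drive_step([0.0] * 4, stub_network, Command(0.0, 0.0))
```

Everything that decides how to drive lives inside the network argument; the surrounding code only bounds what the network asks for.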
Here's a video of the car taking us for a drive:
The network is built instead by training it. They drive the car around, and the car learns from the humans driving it what to do when it sees things in the field of view. To help in this training, they also give the car a LIDAR which provides an accurate 3D scan of the environment to more absolutely detect the presence of cars and other users of the road. By letting the network know during training that "there is really something there at these coordinates," the network can learn how to tell the same thing from just the camera images. When it is time to drive, the network does not get the LIDAR data; however, it does produce outputs of where it thinks the other cars are, allowing developers to test how well it is seeing things.
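One common way to express this kind of training is a loss with two terms: one that rewards matching the human driver, and one that rewards predicting what the LIDAR saw. The sketch below assumes that structure; the source does not say how comma.ai actually combines the two signals, and the function names and the aux_weight value are hypothetical.

```python
def mse(predicted, target):
    """Mean squared error between two equal-length sequences of numbers."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def training_loss(pred_controls, human_controls,
                  pred_obstacles, lidar_obstacles, aux_weight=0.5):
    # Imitation term: how far the network's steering/speed outputs are
    # from what the human driver actually did on this frame.
    imitation = mse(pred_controls, human_controls)
    # Auxiliary term: how far the network's guess at obstacle positions
    # (made from the camera image alone) is from the LIDAR measurement.
    auxiliary = mse(pred_obstacles, lidar_obstacles)
    # aux_weight is an assumed tuning knob, not a value from the article.
    return imitation + aux_weight * auxiliary


loss = training_loss([1.0, 2.0], [1.0, 4.0], [0.0], [2.0])
```

The LIDAR values appear only as training targets. At drive time the network still emits its obstacle estimates, which is what lets developers compare them against reality.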
This approach is both interesting and frightening. It allows the development of a credible autopilot, but at the same time, the developers have minimal information about how it works, and can never truly understand why it is making the decisions it does. If it makes an error, they will generally not know why it made the error, though they can give it more training data until it no longer makes the error. (They can also replay all other scenarios for which they have recorded data to make sure no new errors are made with the new training data.)
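The replay check in that last parenthetical can be sketched as a simple regression test over recorded scenarios. This is an illustration only: the scenario format (name, frame, human steering), the tolerance, and the two toy models are all hypothetical.

```python
def new_failures(scenarios, old_model, new_model, tolerance=0.1):
    """Return names of recorded scenarios the old model handled but the new one does not."""
    regressions = []
    for name, frame, human_steering in scenarios:
        old_ok = abs(old_model(frame) - human_steering) <= tolerance
        new_ok = abs(new_model(frame) - human_steering) <= tolerance
        if old_ok and not new_ok:
            regressions.append(name)
    return regressions


# Toy data: each "frame" is just an index into a table of model outputs.
recorded = [("straight", 0, 0.0), ("curve", 1, 0.5)]
old_model = lambda frame: [0.0, 0.5][frame]
new_model = lambda frame: [0.0, 0.9][frame]  # retraining broke the curve

regressions = new_failures(recorded, old_model, new_model)
```

A check like this only covers situations that were recorded. It says nothing about why a regression appeared, which is the frightening part.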
You need a lot of training data. To drive using vision, you have to segment the 2D images that cameras provide into independent objects to create a 3D map of what you see. You can use parallax, relative motion and two cameras to do that, as humans do, but you also need to see patterns in the shapes to classify them when you see them from every distance and angle, and in every lighting condition. That's been one of the big challenges for vision -- you must understand things in sunlight, in diffuse light, in night-time (illuminated by headlights or other lights), when the sun is pointed directly into your camera, and when complex objects like trees are casting moving shadows on the desired objects. You must tell pavement from lane markers from shoulders from debris from other road users in all those situations.
And you must do it so well that you only make a dangerous mistake perhaps once every million miles. A million miles is perhaps 2.7 billion frames of video. Fortunately you don't have to be that good. It's OK to not perceive an object every so often. It's OK to miss it in one frame if you get it the next. Indeed, you as a human are probably looking at multiple frames at once to track motion, and so you must just make sure you don't miss something for more like a couple of seconds. Human brains, good as they are, can't figure out everything in an arbitrary still photograph, but they do a good job at spotting everything in a second or two of video.
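The 2.7 billion figure can be reproduced with simple arithmetic. The sketch below assumes an average speed of 40 mph and a 30 frames-per-second camera; the text states neither number, but together they give exactly that total. It also counts how many two-second windows the same distance contains, taking "a couple of seconds" as two.

```python
def frames_for_distance(miles, avg_speed_mph, frames_per_second):
    """Number of video frames captured while driving the given distance."""
    hours = miles / avg_speed_mph
    seconds = hours * 3600
    return seconds * frames_per_second


# Assumed values: 40 mph average speed, 30 fps camera.
frames = frames_for_distance(1_000_000, 40, 30)

# If an object only has to be caught somewhere within a two-second window,
# the number of independent chances to fail is much smaller.
window_seconds = 2
windows = frames / (window_seconds * 30)
```

Under these assumptions, a million miles is 2.7 billion frames but only 45 million two-second windows, which is why per-frame perfection is not required.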
For now, most people are not willing to have the neural network be the entire system as comma.ai wishes. This approach is probably capable of producing an autopilot that requires human supervision; it is less clear whether it is a good path to a vehicle capable of unmanned or unsupervised operation. If you do have a camera, however, the neural networks are going to do a lot to help you understand what the camera sees.
More LIDAR vs. vision
It may surprise some people to know that Google's early generation cars barely made use of their cameras. The LIDAR is so useful that mostly the camera was there to see things like the difference between red and green lights (LIDAR doesn't see lights and colour). Since then, however, Google has established itself as one of the world leaders in neural network technology, so it's a likely supposition that Google has made strong internal efforts to do "sensor fusion" of the LIDAR, cameras and other sensors, and is probably very good at using neural networks to assist their vehicle. Other companies, such as Zoox and Daimler, have shown good skill at fusing camera and LIDAR information.
In 2013, I published an article on the contrast between using LIDAR and cameras. In the article, I pointed out that LIDARS work today and are assured to get cheaper, while vision does not yet work well enough and needs a breakthrough to get the job done. While we have not yet crossed that threshold, new neural network technology holds the promise of being the technology to make the leap.
One of LIDAR's flaws is that it generally is low resolution. As such, while it is very unlikely to not sense an obstacle in front of the car, it might have trouble figuring out just what the obstacle is. Fusion of LIDAR and camera with CNNs will make those systems much better at this, and knowing what things are means making better predictions about where they will go in the future.
Cameras might also help with LIDAR's other limitation -- the approximate 100m range of near infrared LIDAR for dark objects (such as black cars). At highway speed, you want more than 100m of range. Vision doesn't have a range limit (except at night), though the further things are from you, the more resolution you need in your image, at least in the areas you are paying attention to (mostly the road far ahead). The more resolution, the more CPU it takes to run the vision processing. That's why MobilEye's newer units feature 3 fields of view. One is a "telephoto" view for seeing things further away directly in front, and there are two wider views to see things closer and more to the sides. This is a good strategy for making use of vision. It's even better if, knowing the curvature of the road ahead, you can focus your attention only on the road, and not waste pixels or processing on the things to the side of the road.
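Two back-of-envelope calculations illustrate the points above: how much road a car needs to stop from highway speed, and how few pixels a distant car occupies in a wide-angle image compared with a telephoto one. This is a sketch using textbook kinematics and a pinhole camera model. The specific inputs (120 km/h, 0.5 s system latency, 5 m/s² braking, a 2 m wide car, a 1000 pixel wide image, 90 and 30 degree fields of view) are assumptions chosen for illustration, not figures from MobilEye or any LIDAR vendor.

```python
import math


def stopping_distance_m(speed_kmh, latency_s, decel_mps2):
    """Distance covered during system latency plus braking to a stop."""
    speed_mps = speed_kmh * 1000 / 3600
    return speed_mps * latency_s + speed_mps ** 2 / (2 * decel_mps2)


def pixels_across(object_width_m, distance_m, image_width_px, hfov_deg):
    """Approximate width in pixels of an object centered in a pinhole camera's view."""
    scene_width_m = 2 * distance_m * math.tan(math.radians(hfov_deg) / 2)
    return image_width_px * object_width_m / scene_width_m


# Assumed highway case: 120 km/h, 0.5 s latency, 5 m/s^2 braking.
needed_range_m = stopping_distance_m(120, 0.5, 5.0)

# Assumed 2 m wide car seen 100 m ahead in a 1000 px wide image.
wide_px = pixels_across(2.0, 100.0, 1000, 90.0)
tele_px = pixels_across(2.0, 100.0, 1000, 30.0)
```

With these assumed numbers the stopping distance comes to roughly 128 m, more than the 100 m range quoted for dark objects. The same car covers about 10 pixels in the wide view and several times that in the narrow one, which is the case for a dedicated telephoto field of view.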
Some people hope vision systems will get good enough to make a car that, like a human, can drive any road without a map. Mostly this technique is applied to very simple roads, like highways, which are all similar and easy to understand. However, this is an example of the wrong type of thinking about AI. Just as airplanes do not fly by flapping their wings like birds, robotic systems should not necessarily be aimed at doing what humans do the way that we do it. A vision system that can classify everything in its view sufficiently well to drive without a map is also a system that's very useful in building a map with less or no human assistance. The map-making system can do its work with the benefit of a car driving over the territory more than once, and with the benefit of as much cloud supercomputer time as is necessary to do the best job. It would be foolish to throw away the great things a map can give you for the false goal of driving unknown roads with no data. Of course, cars need some ability to drive when the real world has changed from their map, but that's a fairly rare event (which only happens to the very first car to encounter the change) and so it is not necessary that the vehicle be quite as capable in those situations -- or if it has a human driver on board, it need not be capable at all.
Where this approach (probably) falls down
I wrote in January about how testing is the real blocking problem in robocars -- that while there are many challenges in getting robocars working, one of the biggest is proving (to yourself and others) that you have really done it.
Neural networks face a problem here because it's harder to know that they're working. You don't know why they are working, you can only measure their performance. You can re-test your network on all your old sensor data but you have a hard time being sure that the latest training you have given it won't create a problem that wasn't there before in some new situation.
On the other hand, traditional systems are so complex that it is difficult to judge their performance too. If the test is, "Drive one million km with fewer than 2 incidents, including every complex situation imagined in the simulator or recorded in the real world," then it could be that the regimen is the same no matter how the system makes its decisions.
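Such a regimen can be written down without any reference to how the system works inside. The sketch below encodes the example test from the previous paragraph; the function name, argument names, and scenario labels are hypothetical.

```python
def passes_regimen(km_driven, incidents, scenarios_passed, scenarios_required,
                   min_km=1_000_000, max_incidents=1):
    """Black-box acceptance check: distance, incident count, scenario coverage."""
    # Every required scenario must be among those the system passed.
    covered = set(scenarios_required) <= set(scenarios_passed)
    # "Fewer than 2 incidents" is the same as at most 1.
    return km_driven >= min_km and incidents <= max_incidents and covered


required = {"merge", "cut-in", "debris"}
result_ok = passes_regimen(1_200_000, 1, {"merge", "cut-in", "debris", "fog"}, required)
result_bad = passes_regimen(1_200_000, 1, {"merge", "cut-in"}, required)
```

Nothing in the check asks whether the decisions came from a neural network or from hand-written code.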
Strange legal benefit
Neural networks may also face a bizarre -- I might even say perverse -- advantage in the legal system. When there is an accident there will be a scramble to figure out why. In particular the plaintiff's lawyers will be keen to show some negligence on the part of the developers.
With traditional code, you might discover the cause was a classic old-style bug, like the famous off-by-one error or any other such problem. You will see the cause of the bug (and fix it), but the plaintiff might now be able to claim that the programmer, or the QA process, was negligent in some way.
With a neural network, there is no traditional code. If the network makes an error, we won't know a lot about why, and so there is less likely to be a particular negligent human or negligent act, unless the court decides the whole idea of using the neural network is negligent.
That's a more complex question, but generally if a team is following established good practices, it's harder to find negligence. Negligence isn't just any sort of mistake; it's a mistake that good and diligent people should have avoided, but didn't because they got careless.
The perverse factor here is that knowing less about how your system works may make it less likely somebody can claim you were careless.