comma.ai's neural network car and the hot new technology in robocars

Perhaps the world's most exciting new technology today is the deep neural network, in particular the convolutional neural networks behind "deep learning." These networks are conquering some of the best-known problems in artificial intelligence and pattern matching, and since their development just a few years ago, milestones in AI have been falling as computer systems that match or surpass human capability have been demonstrated. Playing Go is just the most recent famous example.

This is particularly true in image recognition. Over the past several years, neural network systems have gotten better than humans at problems like recognizing street signs in camera images, and have even beaten radiologists at identifying cancers in medical scans.

These networks are having their effect on robocar development. They are enabling significant progress in the use of vision systems for robotics and driving, progress that is coming much faster than expected. Two years ago, I declared that the time when vision systems would be good enough to build a safe robocar without LIDAR was still fairly far away. That day has not yet arrived, but it is definitely closer, and it's much harder to say it won't be soon. At the same time, LIDAR and other sensors are improving and dropping in price. Quanergy (to whom I am an advisor) plans to ship $250 8-line LIDARs this year, and $100 high resolution LIDARs in the next couple of years.

Deep neural networks are a primary tool of MobilEye, the Jerusalem company which makes camera systems and machine-vision ASICs for the ADAS (Advanced Driver Assistance Systems) market. This is the chip used in Tesla's autopilot; Tesla claims it has done a great deal of its own custom development, while MobilEye claims the important magic sauce is still mostly theirs. NVIDIA has made a big push into the robocar market by promoting its high end GPUs as the supercomputing tool cars will need to run these networks well. The two companies disagree, of course, on whether GPUs or ASICs are the best tool for this -- more on that later.

In comes comma.ai

In February, I rode in an experimental car that took this idea to the extreme. The small startup comma.ai, led by iPhone hacker George Hotz, got some press by building, in a short amount of time, an autopilot similar in capability to many offered by car companies. In January, I wrote an introduction to their approach, including how they used quick hacking of the car's network bus to simplify having the computer control the car. They did it with CNNs, and almost entirely with CNNs. Their car feeds the images from a camera into the network, and out of the network come commands to adjust the steering and speed to keep the car in its lane. As such, there is very little traditional code in the system, just the neural network and a bit of control logic.
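To make that concrete, here is a minimal sketch in Python (PyTorch) of what such an end-to-end "pixels in, commands out" network might look like. The layer sizes and the two-value output head are illustrative assumptions on my part, not comma.ai's actual architecture:

```python
# A minimal sketch of an end-to-end driving network: camera frame in,
# steering and speed commands out. Purely illustrative, not comma.ai's code.
import torch
import torch.nn as nn

class DrivingNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional layers extract visual features from the frame.
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # A small fully connected head regresses the control commands.
        self.head = nn.Sequential(
            nn.LazyLinear(100), nn.ReLU(),
            nn.Linear(100, 2),   # [steering_angle, speed_adjustment]
        )

    def forward(self, frame):  # frame: (batch, 3, H, W)
        return self.head(self.features(frame))
```

There is no hand-written lane-finding or car-detection logic anywhere in such a system; everything between the pixels and the commands is learned weights.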

Here's a video of the car taking us for a drive:

Instead, the network is built by training it. They drive the car around, and the network learns from the humans driving it what to do when it sees things in the field of view. To help in this training, they also give the car a LIDAR, which provides an accurate 3D scan of the environment to more absolutely detect the presence of cars and other users of the road. By letting the network know during training that "there is really something there at these coordinates," the network can learn how to tell the same thing from just the camera images. When it is time to drive, the network does not get the LIDAR data; however, it does produce outputs of where it thinks the other cars are, allowing developers to test how well it is seeing things.
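In training terms, the human's driving is the main supervision signal, and the LIDAR serves as ground truth for an auxiliary output that exists only to force the network to learn 3D structure from 2D images. A hedged sketch of one training step under those assumptions (the model interface, batch fields, and loss weighting are all mine, not comma.ai's):

```python
# Sketch of training with an auxiliary LIDAR-supervised output head.
# Assumes model(frames) returns (controls, car_positions).
import torch
import torch.nn.functional as F

def training_step(model, batch, optimizer, aux_weight=0.5):
    optimizer.zero_grad()
    controls, car_positions = model(batch["camera_frames"])

    # Main task: imitate the human driver's steering/speed commands.
    control_loss = F.mse_loss(controls, batch["human_controls"])

    # Auxiliary task: predict where the LIDAR said the cars were.
    # At drive time the LIDAR is absent; only this output head remains,
    # letting developers inspect how well the network "sees" from camera alone.
    aux_loss = F.mse_loss(car_positions, batch["lidar_car_positions"])

    loss = control_loss + aux_weight * aux_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```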

This approach is both interesting and frightening. It allows the development of a credible autopilot, but at the same time, the developers have minimal information about how it works, and can never truly understand why it is making the decisions it does. If it makes an error, they will generally not know why it made the error, though they can give it more training data until it no longer makes the error. (They can also replay all the other scenarios for which they have recorded data to make sure no new errors are made with the new training data.)

You need a lot of training data. To drive using vision, you have to segment the 2D images that cameras provide into independent objects to create a 3D map of what you see. You can use parallax, relative motion and two cameras to do that, as humans do, but you also need to see patterns in the shapes to classify them when you see them from every distance and angle, and in every lighting condition. That's been one of the big challenges for vision -- you must understand things in sunlight, in diffuse light, at night (illuminated by headlights or other lights), when the sun is pointed directly into your camera, and when complex objects like trees are casting moving shadows on the objects you care about. You must tell pavement from lane markers from shoulders from debris from other road users in all those situations.

And you must do it so well that you only make a dangerous mistake perhaps once every million miles. A million miles is perhaps 2.7 billion frames of video. Fortunately you don't have to be that good. It's OK to fail to perceive an object every so often. It's OK to miss it in one frame if you get it the next. Indeed, you as a human are probably looking at multiple frames at once to track motion, so the real requirement is that you not miss something for more than a couple of seconds. Human brains, good as they are, can't figure out everything in an arbitrary still photograph, but they do a good job at spotting everything in a second or two of video.
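For the curious, that frame count works out if you assume roughly a 40 mph average speed and 30 frames per second (both assumptions of mine; the article doesn't state them):

```python
# Back-of-the-envelope check of the "2.7 billion frames" figure.
miles = 1_000_000
hours = miles / 40          # 25,000 hours of driving at 40 mph average
seconds = hours * 3600      # 90 million seconds
frames = seconds * 30       # at 30 fps: 2.7 billion frames
print(f"{frames:.1e}")      # -> 2.7e+09
```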

For now, most people are not willing to have the neural network be the entire system, as comma.ai wishes. This approach is probably capable of producing an autopilot that requires human supervision; it is less clear whether it is a good path to a vehicle capable of unmanned or unsupervised operation. If you do have a camera, however, the neural networks are going to do a lot to help you understand what the camera sees.

More LIDAR vs. vision

It may surprise some people to know that Google's early generation cars barely made use of their cameras. The LIDAR is so useful that the camera was mostly there to see things like the difference between red and green lights (LIDAR doesn't see lights or colour). Since then, however, Google has established itself as one of the world leaders in neural network technology, so it's a likely supposition that Google has made strong internal efforts to do "sensor fusion" of the LIDAR, cameras and other sensors, and is probably very good at using neural networks to assist its vehicles. Other companies, such as Zoox and Daimler, have shown good skill at fusing camera and LIDAR information.

In 2013, I published an article on the contrast between using LIDAR and cameras. In the article, I pointed out that LIDAR works today and is assured to get cheaper, while vision does not yet work well enough, and needs a breakthrough to get the job done. While we have not yet crossed that threshold, new neural network technology holds the promise of being the technology to make the leap.

One of LIDAR's flaws is that it is generally low resolution. As such, while it is very unlikely to miss an obstacle in front of the car, it might have trouble figuring out just what the obstacle is. Fusion of LIDAR and camera with CNNs will make those systems much better at this, and knowing what things are means making better predictions about where they will go in the future.
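One simple fusion scheme (an illustrative assumption on my part, not any particular company's pipeline) is to project the LIDAR points into the camera frame and feed the resulting depth image to the CNN as an extra input channel alongside RGB:

```python
# Early fusion sketch: sparse LIDAR depth stacked as a 4th channel.
import torch
import torch.nn as nn

def fuse_inputs(rgb, lidar_depth):
    # rgb: (batch, 3, H, W); lidar_depth: (batch, 1, H, W), zeros where
    # no LIDAR return projected into that pixel.
    return torch.cat([rgb, lidar_depth], dim=1)   # (batch, 4, H, W)

# The classifier sees shape from LIDAR and texture/colour from the
# camera, which helps it decide *what* a detected obstacle is.
classifier = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=3, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 10),   # e.g. 10 obstacle classes (illustrative)
)
```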

Cameras might also help with LIDAR's other limitation -- the approximately 100m range of near-infrared LIDAR for dark objects (such as black cars). At highway speed, you want more than 100m of range. Vision doesn't have a range limit (except at night), though the further things are from you, the more resolution you need in your image, at least in the areas you are paying attention to (mostly the road far ahead). The more resolution, the more CPU it takes to run the vision processing. That's why MobilEye's newer units feature 3 fields of view. One is a "telephoto" view for seeing things further away directly in front, and there are two wider views to see things closer and more to the sides. This is a good strategy for making use of vision. It's even better if, knowing the curvature of the road ahead, you can focus your attention only on the road, and not waste pixels or processing on the things to the side of the road.
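A rough sketch of that attention idea, under my own assumptions (that you already know where the distant road projects into the image, and that the crop sizes shown are sensible):

```python
# Mimicking multiple fields of view from one camera: a narrow full-
# resolution crop for the far road, a downsampled wide view for the
# near field. Crop sizes and the far_center_x input are illustrative.
import torch.nn.functional as F

def fields_of_view(frame, far_center_x):
    _, _, h, w = frame.shape
    # Narrow "telephoto" crop around the distant road, full resolution.
    x0 = max(0, min(w - w // 4, far_center_x - w // 8))
    tele = frame[:, :, : h // 2, x0 : x0 + w // 4]
    # Wide view downsampled: nearby objects appear large, so fewer
    # pixels per object are needed to recognize them.
    wide = F.interpolate(frame, scale_factor=0.25,
                         mode="bilinear", align_corners=False)
    return tele, wide
```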

Some people hope vision systems will get good enough to make a car that, like a human, can drive any road without a map. Mostly this technique is applied to very simple roads, like highways, which are all similar and easy to understand. However, this is an example of the wrong type of thinking about AI. Just as airplanes do not fly by flapping their wings like birds, robotic systems should not necessarily be aimed at doing what humans do the way that we do it. A vision system that can classify everything in its view sufficiently well to drive without a map is also a system that's very useful in building a map with little or no human assistance. The map-making system can do its work with the benefit of a car driving over the territory more than once, and with the benefit of as much cloud supercomputer time as is necessary to do the best job. It would be foolish to throw away the great things a map can give you for the false goal of driving unknown roads with no data. Of course, cars need some ability to drive when the real world has changed from their map, but that's a fairly rare event (which only happens to the very first car to encounter the change), and so it is not necessary that the vehicle be quite as capable in those situations -- or if it has a human driver on board, it need not be capable at all.

Where this approach (probably) falls down

I wrote in January about how testing is the real blocking problem in robocars -- that while there are many challenges in getting robocars working, one of the biggest is proving (to yourself and others) that you have really done it.

Neural networks face a problem here because it's harder to know that they're working. You don't know why they are working; you can only measure their performance. You can re-test your network on all your old sensor data, but you have a hard time being sure that the latest training you have given it won't create a problem that wasn't there before in some new situation.
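The re-testing half of that is at least mechanizable. A hedged sketch of what such a regression harness might look like (the scenario format, the comparison against a previously approved model, and the tolerance are all my assumptions):

```python
# Replay every logged scenario through the retrained network and flag
# any case where it diverges from the previously approved model.
import torch

def regression_check(new_model, old_model, logged_scenarios, tol=0.05):
    regressions = []
    new_model.eval(); old_model.eval()
    with torch.no_grad():
        for scenario in logged_scenarios:
            new_out = new_model(scenario["frames"])
            old_out = old_model(scenario["frames"])
            # A large control divergence on a known-good scenario means
            # the new training may have broken something.
            if (new_out - old_out).abs().max() > tol:
                regressions.append(scenario["id"])
    return regressions
```

What no harness can do, of course, is cover the new situations you have never recorded, which is exactly the worry above.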

On the other hand, traditional systems are so complex that it is difficult to judge their performance too. If the test is, "Drive one million km with less than 2 incidents, including every complex situation imagined in the simulator or recorded in the real world," then the testing regimen could be the same no matter how the system makes its decisions.

Strange legal benefit

Neural networks may also face a bizarre -- I might even say perverse -- advantage in the legal system. When there is an accident, there will be a scramble to figure out why. In particular, the plaintiff's lawyers will be keen to show some negligence on the part of the developers.

With traditional code, you might discover the cause was a classic old-style bug, like the famous off-by-one error or any other such problem. You will see the cause of the bug (and fix it), and it might then be possible to claim that the programmer, or the QA process, was negligent in some way.

With a neural network, there is no traditional code. If the network makes an error, we won't know a lot about why, and so there is less likely to be a particular negligent human or negligent act, unless the court decides the whole idea of using the neural network is negligent.

That's a more complex question, but generally, if a team is following established good practices, it's harder to find negligence. Negligence isn't just any sort of mistake; it's a mistake that good and diligent people should have avoided, but didn't because they got careless.

The perverse factor here is that knowing less about how your system works may make it less likely somebody can claim you were careless.

Comments

The six main advantages of Reductionism – a category that includes self-driving cars that use programming and rules rather than neural networks – are:

- Optimality (best answer)
- Completeness (all answers)
- Repeatability (same answers every time)
- Parsimony (find answers in the most efficient way)
- Transparency (you can understand the process of arriving at the answer)
- Scrutability (you understand the given answers)

When moving to Holistic technologies – systems that do their own Reduction to Models they can think about, based on unrestricted ("whole") input such as raw video feeds – you lose ALL of those.
You lose them even when switching to Genetic Algorithms, which are close to the simplest Holistic systems we can program (there are about six simpler Model Free Methods below Evolution).

When using Neural Networks you gain the benefits of Holism:

- Saliency (knowing what matters)
- Reduction, Abstraction, Induction, and Abduction (discovering higher level concepts from lower level concepts by discarding the irrelevant)
- Generality (the ability to solve problems in any problem domain by learning the new domain)
- Robustness (by knowing what matters and using multiple redundant viewpoints; the opposite to AI brittleness of old)
- Self-repair (by continuous learning)
- Novelty (learning is a creative act since you need to invent ways that new knowledge fits in with existing knowledge)
- Scale-free-ness (decent Holistic systems can do these things in constant time; just because you learn more by reading doesn't mean you read slower as you get older)
- Understanding (the ability to provide solutions to problems the programmers or users do not themselves Understand)

In summary, Holistic systems mainly provide Understanding. Understanding of traffic situations in the case of cars. Understanding of language, locomotion, and the world at large down the road with better AI.
And the fact that we have to give up all of the benefits of Reductionism to get Understanding is what has kept AI back for 60 years... because all AI researchers wanted to be Scientific in everything they do.
In order to understand the Models of AIs they had to build them themselves. And this is where Neural Networks differ, because they can do their own Reduction to Models.
The uncomfortable feeling of "but we don't know why it does what it does" comes mainly from still thinking of automotive AIs as computers rather than Understanding Machines.

If there is a court case over a traffic accident today we might just say "The defendant wasn't paying enough attention". That's going to – one day – be a valid argument when a self-driving car causes an accident.
Why insist on going beyond that? It might be impossible.

And here's why: the main advantage of deep neural networks is that, like humans, Deep Learning systems get better with practice. Even when first trained, Hotz's systems were pretty good, and that's usually an indication that a Deep Learning system is going to crush that problem, since practice only makes them better.

By comparison, heuristics (Reductionist hard-coded rules) actually show more errors with more data, because you find more edge cases that break the heuristic. And on complex real-world data, simple statistical machine-learning methods max out the ability of their limited parameters to adjust effectively to the variations of reality. The neural net has so much internal complexity that you can feed it massive amounts of data, and the more you feed it, the better it gets. This is why we are seeing them succeed so well in applications, like YouTube and Facebook, where they can be fed enormous data sets.

Now, to be fair, highway driving is relatively easy compared to local-road driving. Mercedes had a self-driving car that could drive the Autobahn, using vision only, in 1994! (See this article I wrote about it.) But Deep Learning is conquering problems in many areas of AI, and this is exactly the kind of problem it is great at. For starters, if you had a good Reductionist (heuristic) system, you could start by using it to teach the Deep Learning system, so the Deep Learning system can only get better. But the reverse is untrue - a Reductionist system can't learn from a Deep Learning system (or at all; a programmer has to code a new rule).

But I think that an even bigger factor, that Monica alluded to, is Robustness. I think what we'll find is that carmakers switch over to Deep Learning systems specifically because, when faced with a strange situation, they do something generally reasonable, where a heuristic-based system can just totally fail and give up. This is the major problem you have alluded to in earlier articles, with solving the last 1% of weird driving cases.

Can we certify Deep Neural Networks? Sure. We train them to expert level and freeze them. Then we put the car through road tests and show it performs above human professional driver level. I can see regulatory agencies stamping approval on that - it's the same way they would test a human driver, so the model is familiar.

So put me down for a bet on Hotz' approach.

Deep Learning certainly has potential, but I doubt that having an individual vehicle produce all results is the right answer. Any system presented with a new or corner-case scenario may make a good or bad decision and learn from it. However, transferring this new learning to other vehicles is a bigger challenge. If vehicles can learn new solutions in real time (assuming they made a good decision and didn't crash), then in a city with millions of vehicles they will diverge quickly. Curating the results and bringing all the vehicles to parity is a real challenge.

It may be simpler to provide limited self-learning in the vehicle (all vehicles learn the same data/decision tree) to get predictability, and use a higher Deep Thinking solution to manage all the traffic (think IBM Watson level for a city). This central authority would manage overall traffic flow and eliminate most of the corner cases that might arise by swarm management. All vehicles have to be connected anyway; why not use this to its best advantage?

Deep neural networks do indeed have impressive ability to learn. They learn well from supervised training and are even capable of unsupervised learning, which allows this to scale up. But self-driving car systems are so valuable and so important that they will also "learn" in a manner most software does not get to.

Because safety is so critical, developers of self-driving software will be constantly on the lookout for mistakes. Any mistake will cause an immediate effort to fix it, whether that is done through traditional bug-fixing or by retraining a network. The new code must then go through QA, and the QA phase will be the most expensive part of the fix. With the networks, you will be able to verify that the exact problem does not recur, as well as similar problems which you can create in a simulator. With the traditional code, you will also do that, but you will have the additional satisfaction of knowing about your specific fix.
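A hedged sketch of what verifying a fix against the exact case plus simulator-style variants might look like (the perturbations shown and the pass criterion are illustrative assumptions, not anyone's actual QA pipeline):

```python
# After retraining on a failure, check the exact case and many
# perturbed variants of it before accepting the new network.
import torch

def verify_fix(model, failure_frames, expected_controls, n_variants=100):
    model.eval()
    with torch.no_grad():
        cases = [failure_frames]
        # Cheap stand-ins for simulator variation: brightness jitter
        # and sensor noise applied to the recorded failure case.
        for _ in range(n_variants):
            jitter = failure_frames * torch.empty(1).uniform_(0.7, 1.3)
            cases.append(jitter + 0.02 * torch.randn_like(jitter))
        return all(
            (model(c) - expected_controls).abs().max() < 0.1 for c in cases
        )
```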

But in either case the real cost is the QA, not the fix. And in fact the QA for traditional code might be cheaper, because if you find a traditional bug in the code and the logs show the bug caused the error, it can be much faster for you to be 100% sure you fixed it. Longer to fix the bug, but faster to be sure you fixed it.

The neural network inside the head of a teenager working on her driver's license has had sixteen years of training in navigation and ten years of reading to apply to the little handbook about rules of the road and traffic signs, not to mention being hardwired for learning how to learn since birth. The autodriving car is unlikely to learn that handbook with any amount of following along with a human instructor doing the actual driving, even with tens of thousands of instructors substituting for actual programming.

That pure artificial neural nets can drive cars is much like the circus bear that rides a motorcycle. The remarkable thing is not how well it does it, but that it does it at all.
