The computer science world and the game show world are abuzz over the showdown between IBM’s “Watson” question-answering system and the two best human players ever to play Jeopardy. The first game has aired, with a crushing victory by Watson (despite a tie after the first half of the game).
Tomorrow’s outcome is not in doubt. IBM would not have declared itself ready for the contest without being confident it would win, and it would not be putting out all this advertising about the contest if it had lost. What’s interesting is how they did it and what else they will be able to do with it.
Dealing with a general question has long been one of the hard problems in AI research. Watson isn’t quite there yet, but it has managed a great deal with a combination of algorithmic parsing and understanding plus machine learning trained on prior Jeopardy games. That’s a must, because Jeopardy “answers” (clues) are often written in obfuscated styles, full of puns and idioms, exactly the sorts of things most natural language systems have had a very hard time with.
Watson’s problem is almost entirely understanding the question. Looking up obscure facts is not nearly so hard if you have a copy of Wikipedia and other databases on hand, particularly ones parsed with other state-of-the-art natural language systems, which is what I presume they have. In fact, one would predict that Watson would do best on the hardest $2,000 clues, because these are usually hard because they refer to obscure knowledge, not because the clue is harder to understand. I expect that an evaluation of its results may show that its performance on hard questions is not much worse than on easy ones. (The main thing that would make easy questions easier would be the large number of articles in its database confirming the answer, presumably boosting its confidence.) However, my intuition may be wrong here, in that most of Watson’s problems came on the high-value questions.
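As a toy illustration of that parenthetical (this is my speculation about the shape of the effect, not IBM’s actual scoring method): confidence could rise with the number of independent passages corroborating a candidate answer, with diminishing returns per passage.

```python
# Toy sketch (my speculation, not IBM's actual method): confidence in a
# candidate answer grows with the number of independent supporting passages.
def evidence_confidence(supporting_passages: int, k: float = 0.5) -> float:
    """Map a count of corroborating passages to a confidence in [0, 1).

    Each additional passage adds less than the one before (diminishing
    returns), so easy facts with many confirming articles saturate near
    1.0 while obscure ones stay low.
    """
    return 1.0 - (1.0 - k) ** supporting_passages

print(evidence_confidence(1))  # 0.5    -- one source: barely at threshold
print(evidence_confidence(5))  # ~0.97  -- well-covered "easy" fact
```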
Its confidence is important. If it does not feel confident, it doesn’t buzz in. And it has a serious advantage at buzzing in: you can’t buzz in right away on this game, and when all the players are walking encyclopedias, as the two human champions and Watson are, buzzing in is a large part of the game. In fact, a fairer game, which Watson might not do as well at, would randomly choose which of the players who buzz in within the first few tenths of a second gets to answer, eliminating any reaction-time advantage. Watson gets the clues as text, which is also a bit unfair, unless it is given them one word at a time at human reading speed. It could do OCR on the screen, but chances are it would read faster than the humans. Its confidence numbers and results are extremely impressive. One reason it doesn’t buzz in is that, even with 3,000 cores, it takes 2-6 seconds to answer a question.
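A minimal sketch of the buzz rule as described: buzz only when the answer arrives in time and confidence clears the threshold. The 50% threshold is mentioned later in this post; the window length and all names here are illustrative placeholders, not Watson’s internals.

```python
# Minimal sketch of the buzz decision described above. The 0.5 threshold
# comes from later in the post; the timing numbers are illustrative only.
def should_buzz(confidence: float, compute_seconds: float,
                buzz_window_seconds: float = 5.0,
                threshold: float = 0.5) -> bool:
    """Buzz only if the answer arrived in time AND confidence clears the bar."""
    answer_ready = compute_seconds <= buzz_window_seconds
    return answer_ready and confidence >= threshold

print(should_buzz(confidence=0.92, compute_seconds=3.1))  # True: fast and sure
print(should_buzz(confidence=0.30, compute_seconds=2.0))  # False: not confident
print(should_buzz(confidence=0.95, compute_seconds=6.0))  # False: too slow
```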
Indeed, a totally fair contest would not have a buzzing-in time competition at all; it would simply allow all players who buzz in to answer, and gain or lose points based on their answers. (Answers would need to be given in parallel.)
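Here is how that fairer rule might score a round, as a sketch; the names, values, and structure are mine, not part of any real proposal.

```python
# Sketch of the "everyone who buzzes answers in parallel" rule proposed
# above. All names and values here are illustrative.
def score_round(answers: dict[str, str], correct: str,
                value: int) -> dict[str, int]:
    """Every player who buzzed submits an answer; each gains or loses the
    clue's value independently, so reaction time never decides the round."""
    return {player: (value if given == correct else -value)
            for player, given in answers.items()}

# One $800 clue where two of the three contestants buzzed in:
print(score_round({"Watson": "Toronto", "Ken": "Chicago"},
                  correct="Chicago", value=800))
# {'Watson': -800, 'Ken': 800}
```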
Watson’s coders know by now that they probably should have coded it to receive the wrong answers given by other contestants. In one instance it repeated a wrong answer, and in another it said “What is leg?” after Jennings had incorrectly answered “What is missing an arm?” on a clue about an Olympic athlete. The host declared Watson right, but the judges reversed that ruling, saying the answer would have been acceptable from a human following up on the wrong answer, but was wrong without that context. This was edited out. Also edited out were four crashes by Watson that made the game take four hours instead of 30 minutes.
Another error I saw Watson make, in the trials rather than in what has aired so far, was declining a request to be more specific about an answer. Watson was programmed to give minimalist answers, which the host will often accept as correct, so why take a risk? If the host doesn’t think you said enough, he asks for a more specific answer. Watson sometimes replied, “I can be no more specific.” From a pure gameplay standpoint, that’s like saying, “I admit I am wrong.” For points, one should say the best longer phrase containing the one-word answer, because it just might be right; a refusal is certain to be ruled wrong, so any nonzero chance of the longer phrase being correct is worth taking. It does carry a larger chance of looking really stupid, though; see below for thoughts on that.
The shows also contain total love-fest pieces about IBM, which make me amazed that IBM is not listed as a sponsor, other than perhaps in the name “The IBM Challenge.” I am sure Jeopardy is getting great ratings (just having its two champions back would do that on its own, but this will do even more), but I have to wonder whether any other money is flowing.
Being an idiot savant
Watson doesn’t really understand the Jeopardy clues, at least not as a human does. Like so many AI breakthroughs, this result comes from figuring out a way to attack the problem that differs from the method humans use. As a result, Watson sometimes puts out “idiot” answers that are nonsense from a human perspective. The team cut back a lot on this by only having it answer when it has 50% confidence or higher, and in fact for most of its answers it has very impressive confidence numbers. But sometimes it gives such an answer. To the consternation of the Watson team, it did this on the Final Jeopardy clue, where it answered “Toronto” in the category “U.S. Cities.”
The Watson team is pleased that, since the software knew it did not understand the category very well, it made only a small bet. (The small bet is wise anyway with such a large lead, even in a two-game total-point match. This also seems like the sort of category that should be easy to understand, but it is so terse that confidence might not be high.) They explain that it did understand the category (just not with high confidence) but failed to rule out Toronto as a U.S. city, because there are cities named Toronto in the USA, and the big Toronto has MLB and NBA teams, making it look like a U.S. city from a database point of view. Without specific code to exclude Canadian cities from a search for U.S. cities, Watson was going to include Toronto as a possible answer. They have not explained why it thought Toronto, whose two airports are known as “Pearson” and “Island,” matched the clues about a WWII hero and a WWII battle, but somehow it did. (The airport everybody calls Island has the official but rarely used name of Billy Bishop, who was a WWI hero.)
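To illustrate the failure mode the team describes, here is a toy reconstruction of my own (not Watson’s actual candidate scoring): a generator that rates “U.S.-city-ness” from database associations like league membership will rank big Toronto highly unless nationality is checked explicitly.

```python
# Toy illustration of the failure mode described above; this is my own
# reconstruction, not Watson's actual code. Facts are simplified.
CITY_FACTS = {
    # city: (country, has_mlb_team, has_nba_team)
    "Toronto":       ("Canada", True,  True),
    "Chicago":       ("USA",    True,  True),
    "Toronto, Ohio": ("USA",    False, False),
}

def us_city_score(city: str, require_usa: bool) -> float:
    """Score how 'U.S.-city-like' a candidate looks from database features."""
    country, mlb, nba = CITY_FACTS[city]
    if require_usa and country != "USA":
        return 0.0  # the missing check: explicitly exclude non-U.S. cities
    return 0.4 * mlb + 0.4 * nba + 0.2 * (country == "USA")

# Without the explicit country filter, big Toronto looks plausibly American:
print(us_city_score("Toronto", require_usa=False))  # 0.8
print(us_city_score("Toronto", require_usa=True))   # 0.0
```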
What was more interesting to me was the reaction of the public, as judged by articles and blog posts and comments. Watson’s crushing performance was a bit frightening to people who have a hidden fear that the robot overlords are coming. While the stupid answers were expected by computer scientists, to the public they seem to have burst a balloon. They reveal Watson as fallible, and in a deep way that allows humans to feel they remain superior. At the same time, they also reduce the public’s trust in using a system like Watson to do disease diagnosis, something IBM touted in its videos.
A computer vs. computer contest
While we probably won’t see another computer vs. human contest like this, it does seem that it might be nice to arrange regular computer-vs-computer quiz show tournaments. While these contests might include buzzing in, they could simply be scored on the total for all questions the computer feels confident enough to answer. A buzzing-in contest would test the speed of your code (good) but also your hardware (mostly a matter of who had the most money: bad).
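Scoring such a buzzer-free tournament might look like this sketch; the threshold, data shapes, and names are mine, chosen just to make the proposed rule concrete.

```python
# Sketch of the buzzer-free tournament scoring floated above: each program
# answers every clue it is confident about, and totals are compared directly.
def tournament_score(attempts: list[tuple[float, bool, int]],
                     threshold: float = 0.5) -> int:
    """attempts: (confidence, answered_correctly, clue_value) per clue.
    Clues below the confidence threshold are skipped at no cost."""
    total = 0
    for confidence, correct, value in attempts:
        if confidence >= threshold:
            total += value if correct else -value
    return total

print(tournament_score([(0.9, True, 400), (0.3, False, 800), (0.6, False, 200)]))
# 400 (skipped) - 200 = 200
```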
Now that IBM has shown how to do it, I suspect that other teams will also be able to go further in making quiz-show programs or related programs. Over time, we might see people creating more and more difficult, more “human” questions to stump the programs and make the contest harder every year.
Eventually we would also see the programs doing their own speech recognition of the clues, and, since it’s not that hard to do even today, presenting a human-looking, post-uncanny-valley avatar with realistic speech.
I joked that, for fun, they should have started off with six fake categories like “Famous robots” and “Solutions to large systems of linear equations” that are designed for a computer to win. (Watson actually would probably not know how to solve the systems of equations unless it had special heuristics for doing so.) But they could also throw in six topics like “Translating poetry” or “Complex metaphors” that are really aimed at humans.
And when you move the problem into the medical diagnosis space, as IBM suggests, there already is such a contest being planned: the Tricorder X-Prize, which, when finalized, may combine the challenge of building a cheap medical sensor with AI software that can diagnose better than human doctors.
I have written some further comments after game 2.
No doubt I will have some more thoughts after the final tomorrow, though I will be heading to the EFF 21st anniversary party in San Francisco; join me there.