Watson, come here, I want you

The computer science world and the game show world are abuzz over the showdown between IBM's "Watson" question-answering system and the best human players ever to play the game Jeopardy. The first game has aired, with a crushing victory by Watson (in spite of a tie after the first half of the game.)

Tomorrow's outcome is not in doubt. IBM would not have declared itself ready for the contest without being confident it would win, and they wouldn't be putting all the advertising out about the contest if they had lost. What's interesting is how they did it and what else they will be able to do with it.

Dealing with a general question has long been one of the hard problems in AI research. Watson isn't quite there yet, but it has managed a great deal by combining algorithmic parsing and understanding with machine learning based on prior Jeopardy games. That's a must, because Jeopardy "answers" (clues) are often written in obfuscated styles, full of puns and idioms -- exactly the sorts of things natural language systems have always had a very hard time with.

Watson's problem is almost entirely understanding the question. Looking up obscure facts is not nearly so hard if you have a copy of Wikipedia and other databases on hand, particularly ones parsed with other state-of-the-art natural language systems, which is what I presume they have. In fact, one would predict that Watson would do best on the hardest $2,000 questions, because those are usually hard in that they refer to obscure knowledge, not in that the question is harder to understand. I expect that an evaluation of its results may show that its performance on hard questions is not much worse than on easy ones. (The main thing that makes easy questions easier would be the large number of articles in its database confirming the answer, presumably boosting its confidence in that answer.) However, my intuition may be wrong here, in that most of Watson's problems came on the high-value questions.

Its confidence is important. If it does not feel confident, it doesn't buzz in. And it has a serious advantage at buzzing in, since you can't buzz in right away on this game, and if you're an encyclopedia like the two human champions and Watson, buzzing in is a large part of the game. In fact, a fairer game, which Watson might not do as well at, would randomly choose which of the players who buzz in within the first few tenths of a second gets to answer the question, eliminating any reaction-time advantage. Watson gets the questions as text, which is also a bit unfair, unless it is given them one word at a time at human reading speed. It could do OCR on the screen, but chances are it would read faster than the humans. Its confidence numbers and results are extremely impressive. One reason it doesn't buzz in is that even with 3,000 cores it takes 2-6 seconds to answer a question.

Indeed, a totally fair contest would have no buzzing-in time competition at all, and would simply allow all players who buzz in to answer and gain or lose points based on their answers. (Answers would need to be given in parallel.)
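To make the proposed rule concrete, here is a minimal sketch of how such a round might be scored. The function name, the buzz window, and the data shapes are all my own assumptions, not anything from the show or from IBM:

```python
def fair_round(buzzes, answers, correct, value, window=0.3):
    """Score one clue under the 'everyone who buzzes answers' rule.

    buzzes:  dict player -> buzz time in seconds (None if no buzz)
    answers: dict player -> that player's sealed, parallel response
    correct: the accepted response
    value:   dollar value of the clue
    window:  hypothetical cutoff for a timely buzz, in seconds
    Returns dict player -> score change for this clue.
    """
    scores = {}
    for player, t in buzzes.items():
        if t is None or t > window:
            scores[player] = 0          # no timely buzz: no risk, no reward
        elif answers.get(player) == correct:
            scores[player] = value      # right answer gains the value
        else:
            scores[player] = -value     # wrong answer loses it
    return scores
```

Under this rule, reaction time stops mattering entirely: every player who commits within the window is graded on accuracy alone.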

Watson's coders know by now that they probably should have coded it to hear the wrong answers given by other contestants. In one instance it repeated a wrong answer, and in another it said "What is leg?" after Jennings had incorrectly answered "What is missing an arm?" in a question about an Olympic athlete. The host declared Watson right, but the judges reversed that ruling: the answer would have been correct from a human following up on the earlier wrong answer, but without that context it was wrong. This was edited out. Also edited out were 4 crashes by Watson that made the game take 4 hours instead of 30 minutes.

It did not happen in what has aired so far, but in the trials, another error I saw Watson make was declining a request to be more specific about an answer. Watson was programmed to give minimalist answers, which the host will often accept as correct -- so why take a risk? If the host doesn't think you said enough, he asks for a more specific answer. Watson sometimes said "I can be no more specific." From a pure gameplay standpoint, that's like saying, "I admit I am wrong." For points, one should say the best longer phrase containing the one-word answer, because it just might be right -- though it has a larger chance of looking really stupid. See below for thoughts on that.

The shows also contain total love-fest pieces about IBM which make me amazed that IBM is not listed as a sponsor for the shows, other than perhaps in the name "The IBM Challenge." I am sure Jeopardy is getting great ratings (just having their two champs back would do that on its own but this will be even more) but I have to wonder if any other money is flowing.

Being an idiot savant

Watson doesn't really understand the Jeopardy clues, at least not as a human does. Like so many AI breakthroughs, this result comes from figuring out another way to attack the problem different from the method humans use. As a result, Watson sometimes puts out answers that are nonsense "idiot" answers from a human perspective. They cut back a lot on this by only having it answer when it has 50% confidence or higher, and in fact for most of its answers it has very impressive confidence numbers. But sometimes it gives such an answer. To the consternation of the Watson team, it did this on the Final Jeopardy clue, where it answered "Toronto" in the category "U.S. Cities."

The Watson team are pleased that, since the software knew it did not understand the category very well, it made only a small bet. (The small bet is wise anyway with such a large lead, even in a 2-game total match. This also seems like the sort of category that should be easy to understand, but it is so terse that confidence might not be high.) They explain that it did understand the category (just not with high confidence) but failed to grasp that Toronto is not a U.S. city, because there are cities named Toronto in the USA and the big Toronto has MLB and NBA teams, making it seem like a U.S. city from a database view. Without specific code to exclude Canadian cities from a search for U.S. cities, Watson was going to include Toronto as a possible answer. They have not explained why it thought Toronto, whose two airports are known as "Pearson" and "Island," matched the clues about a WWII hero and battle, but somehow it did. (The airport everybody calls Island has the official but rarely used name of Billy Bishop, who was a WWI hero.)

What was more interesting to me was the reaction of the public, as judged by articles and blog posts and comments. Watson's crushing performance was a bit frightening to people who have a hidden fear that the robot overlords are coming. While the stupid answers were expected by computer scientists, to the public they seem to have burst a balloon. They reveal Watson as fallible, and in a deep way that allows humans to feel they remain superior. At the same time, they also reduce the public's trust in using a system like Watson to do disease diagnosis, something IBM touted in its videos.

A computer vs. computer contest

While we probably won't see another computer vs. human contest like this, it does seem that it might be nice to arrange regular computer-vs-computer quiz show tournaments. While these contests might include buzzing in, they could be simply scored on a total score for all questions the computer feels confident enough to answer. A buzzing-in contest would test the speed of your code (good) but also your hardware (mostly who had the most money -- bad.)

Now that IBM has shown how to do it, I suspect that other teams will also be able to go further in making quiz show programs or related programs. Over time, we might see people creating more and more difficult, more "human" questions to stump the programs and make the contest harder every year.

Eventually we would also see the programs relying only on speech recognition and, since it's not that hard to do even today, presenting a human-looking, post-uncanny-valley avatar with realistic speech.

I joked that for fun, they should have pretended to start off with six fake categories like "Famous robots" and "solutions to large systems of linear equations" that are designed for a computer to win. (Watson actually would probably not know how to solve the systems of equations unless it had special heuristics to do so.) But they could also throw in six topics like "Translating poetry" or "Complex metaphors" that are really aimed at humans.

And when you move the problem into the medical diagnosis space as IBM suggests, there already is such a contest being planned, namely the Tricorder X-Prize which, when finalized, may combine the challenge of making a cheap medical sensor with AI software that can diagnose better than human doctors.

I have written some further comments after game 2.

No doubt I will have some more thoughts after the final tomorrow, though I will be heading to the EFF 21st anniversary party in San Francisco -- join me there.


I recall an interview with Ken Jennings shortly after his record run. He admitted that after the first 5 or 10 games he had a distinct advantage over other players. First, his nerves had settled; second, he was used to the audience; and he had learned the rhythm of the game, almost able to predict when Alex would finish the question, allowing him to push the button a fraction of a second faster. In some ways, this illustrates the advantage Watson has.

I have been impressed by Watson's performance. Even his low-confidence answers are correct much of the time. Something else to consider: would a human who has had four years to practice and memorize facts, backed by a team of scientists, fare better than the two current contestants?

Toronto's airports are "Pearson" and "Island".
Prime Minister Lester B. Pearson: WWI pilot, and an active Canadian diplomat in WWII.
"Island" airport: There are thousands of 'island' battles.

But this is a distant 2nd to Chicago's "O'Hare" and "Midway".
A thrown loss to make Watson seem more human, perhaps?

Uhm, for one, Toronto has more than two airports.

And for another, what is commonly known as Toronto Island Airport is actually called Billy Bishop Toronto City Airport.

Who is Billy Bishop? A WWI flying ace.

IBM's blog points out that Watson knows that the clue categories only sometimes loosely point to the correct answer. Also, "What US city" wasn't in the question so that decreased weighting of US cities.


I did forget about the Billy Bishop name, but as you say, he was a WWI hero, and that means Watson has to be mapping Pearson to a battle, which is much harder to do than mapping Midway.

While there are many small airports around Toronto, only the Island and Pearson (also known as Malton to us old folks, because it is not in Toronto proper) have serious commercial service -- oops, it looks like Buttonville does have a couple of flights. Downsview might get considered, but again I'm not sure why it matched the clues.

Obviously something about Toronto matched the clues (though poorly), Chicago got an even worse score, and the category understander decided Toronto fit "U.S. Cities." As I said, for computer scientists this is not a surprising mistake -- in fact, what is surprising is that it did not make more mistakes like this.

Great comments as usual, Brad. Your comments on computer vs. computer challenges reminded me of an ars technica article I read recently on computer vs. computer StarCraft contests. Worth reading if you haven't seen it: http://arstechnica.com/gaming/news/2011/01/skynet-meets-the-swarm-how-the-berkeley-overmind-won-the-2010-starcraft-ai-competition.ars

Admittedly, American audiences might not find StarCraft as compelling as Jeopardy (in S. Korea it's another thing altogether, from what I understand...they love it), but the idea to me is just fascinating.

IBM might do better in terms of publicity/advertising if they narrowly lost this contest so that they could come back next year and win and then double up on the coverage.

One presentation I saw from the Watson team said they had reached the level where they beat the top players 65% of the time. That seems like a low number at which to go ahead, and I suspect that Watson is now better than that. It's easy to compare Watson with Jennings on answer accuracy, and I suppose it's not too hard to see how much better it is at button pushing.

I have not seen how Watson is told that it can press the button. If a human is telling Watson, then it is like that human pressing the button. If the game's computer which accepts presses is telling Watson, it has a solid edge.

Watson has a programmed confidence threshold (that bar across the 3 answers as seen on Jeopardy). Any answer Watson comes up with that passes that bar means it will try to buzz in first. If it's in an instance where it has to guess, it chooses the top answer.

As for buzzing in itself, Watson is connected to an actual mechanical apparatus that pushes a button exactly like any other player's.
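The threshold behavior described above can be sketched in a few lines. This is a hypothetical illustration of the decision rule only (pick the top-confidence candidate, buzz only if it clears the bar); the actual Watson pipeline is of course vastly more elaborate:

```python
def decide_buzz(candidates, threshold=0.5):
    """Decide whether to buzz in on a clue.

    candidates: list of (answer, confidence) pairs, confidence in [0, 1].
    threshold:  the on-screen confidence bar (0.5 per the article).
    Returns (should_buzz, best_answer). If forced to guess anyway
    (e.g. Final Jeopardy), the caller uses best_answer regardless.
    """
    if not candidates:
        return (False, None)
    # Take the highest-confidence candidate answer.
    answer, conf = max(candidates, key=lambda pair: pair[1])
    # Buzz only if it clears the confidence bar.
    return (conf >= threshold, answer)
```

This also matches the Final Jeopardy "Toronto" incident: a low-confidence top answer that would never trigger a buzz in regular play still gets written down when an answer is mandatory.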

"Watson gets the questions as text, which is also a bit unfair, unless it is given them one word a time at human reading speed."

One of the IBM puff pieces claimed that's exactly what happens - Watson is given the words in text in real time at the same pace that Alex reads them. Though it wasn't clear what mechanism accomplishes this.

Hmm. I would doubt it is at the same pace as he reads them. Watson needs 2-6 seconds to answer a question, and you can't buzz in until Alex finishes reading, so you need to be confident you know the answer before then. Since the clues appear on the board, and anybody can read text faster than they can listen to Trebek, the right thing to do would be to feed the words at the speed of a fast reader. Alternately you could let Watson OCR them from the screen, but it would do that faster than a human can read -- much faster.

Now, part of the demonstration is to show what Watson can do better than people, and so one could argue that the OCR approach is fair. OCR would get few errors if given the video feed of the puzzle screen, so it might be simpler just to time-trial the OCR and figure out how long it takes and give Watson the text in that mode.

Speech recognition would be way too slow, Watson would have to buzz in very early and just hope it is right. Since it is right over 90% of the time, that would actually be a winning strategy, but they appear to have it evaluate confidence before buzzing.

The way the buzzer works is that if you buzz in too early, you must wait 0.2 seconds before you can buzz again, and if somebody else had better timing, they will beat you. We have to understand just how Watson gets the signal about when to buzz. One method that I think would make sense would be for Watson to have the text, and also to listen to Trebek, figuring out with speech recognition when he finishes. It would do pretty well at that, since it doesn't have to recognize the last word, just its approximate duration and the silence after it. But the question is, what enables the buzzer? Is a J! staffer listening to Alex and enabling it? Does Alex himself push a button to enable the buzzer? Since it presumably is a human, the trick is to time that human and push close to that moment. Since you can't do that exactly, all players will win some of the time. If Watson is getting an electronic "OK to push now" signal, it has an unfair advantage.

Dunno how Watson was getting the signal, but the buzzers are unlocked by a J! staffer other than Alex Trebek hitting a button. I know that past J! multi-game players have said there was a difference in rhythm when different staffers were responsible for this in their games.

"The shows also contain total love-fest pieces about IBM"

It's my understanding the Corporation for Public Broadcasting and its pseudo-science show Nova broadcast an hour-long commercial for IBM under the guise of "news" and "education."

But it was nothing like this. The Nova episode was like most other such shows, with footage shot and edited by Nova. Jeopardy contained a number of IBM-made films about how great IBM is and how they plan to use Watson for business, etc. They really felt very different in style.

Did anyone keep track of Watson's percentage of correct answers out of the total number of questions? I didn't, although I suppose I could go back and look at the recording.

I also found it compelling that Watson would essentially say, "I don't know," when his confidence is low. That's almost as important as getting the correct answer.

