February 29, 2008

EP 6.4: Statistical AI

by Noah Wardrip-Fruin · 6:10 am

We can see, in Minstrel, symptoms of a much larger problem, one that Turner, alone, could have done little to address. By the late 1980s it was clear that AI systems in general were not living up to the expectations that had been created over the three previous decades. Many successful systems had been built — by both “neats” and “scruffies” — but all of these worked on very small sets of data. Based on these successes, significant funding had been dedicated to attempting to scale up to larger, more real-world amounts of data. But these attempts failed, perhaps most spectacularly in the once high-flying area of “expert systems.” The methods of AI had produced, rather than operational simulations of intelligence, a panoply of idiosyncratic encodings of researchers’ beliefs about parts of human intelligence — without any means of compensating for the non-simulation of the rest of human intelligence. Guy Steele and Richard Gabriel, in their history of the Lisp programming language (1993, 30), note that by 1988 the term “AI winter” had been introduced to describe the growing backlash and resulting loss of funding for many AI projects. In this vacuum, assisted by steadily increasing processing power, a new form of AI began to rise in prominence.

As some readers of this book no doubt remember, in the early 1990s the World Wide Web was largely human-organized. The “What’s New” page at NCSA provided regular links to new web resources (much like one of today’s blogs) and Yahoo — still on servers at Stanford University — maintained a growing hierarchical directory, to which pages were added by hand. Many individuals maintained hotlists, or pages of links, that were the definitive sources for finding the web’s best information on certain topics. Much as pre-Dewey librarians had done before them, and AI researchers more recently, these editors of web resources organized knowledge according to their own ideas (often no more than intuitions) of what belonged with what. Web page “meta tags” helped authors express what they felt were the relevant keywords for their contributions, and page titles were a very important factor in determining which results came back from a search. Generally, finding the resources with “cool shades” next to them in the Yahoo directory was a better route to information than using the clumsy search engines.

But this began to change, and quickly, as search engines improved. This improvement came not by engines coming to better “understand” the pages they indexed (better interpreting document contents, as traditional AI researchers had attempted) but by counting. For example, the “PageRank” algorithm that rocketed Google to success operates by counting the links that interconnect web pages. As the Google website explains:

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page’s value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages “important.” (Google)

Of course, describing this process as “democratic” is a rhetorical choice on the part of Google’s marketing department. When the web’s most “important” pages are almost all owned by media and technology conglomerates, the power to bless viewpoints on widely-discussed topics (e.g., national politics) with attention via PageRank lies in much the same hands as the power to grant attention does in non-web media. If democracy were the goal, one can imagine weighting Google’s statistical model somewhat differently. Nevertheless, even if it is possible to debate whether PageRank is democratic, there is no debating that it is statistical. It is based on counting information about pages (links in and out) and interpreting the resulting numbers. We now take for granted the ability of a computer program to find useful documents on a subject that interests us — though it would have seemed an amazing feat of AI not that many years ago.
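The counting at PageRank’s core can be captured in a few lines. What follows is a minimal sketch of the standard power-iteration formulation, with an invented four-page link graph and the damping factor of 0.85 from the original PageRank paper; it illustrates the technique, and is in no way Google’s production system.

```python
# Minimal PageRank sketch: pages are ranked purely by counting links,
# with no "understanding" of their contents. The four-page link graph
# is invented; the damping factor 0.85 follows the original paper.
links = {
    "A": ["B", "C"],  # page A links to pages B and C
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}  # start all pages equal
damping = 0.85

for _ in range(50):  # power iteration: repeat until ranks stabilize
    new_rank = {}
    for p in pages:
        # Sum the "votes" from every page that links to p. Each vote is
        # weighted by the voter's own rank (links from "important" pages
        # count for more) and divided among all the links the voter casts.
        votes = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * votes
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Note that nothing here inspects what any page says; the entire computation is arithmetic over the link structure.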

The success of another high-profile web business is also partially tied to a statistical AI technique: in this case, Amazon’s recommendation system. Amazon doesn’t decide which books and movies to recommend to its customers by deeply analyzing the content of each item it sells and “noticing” similarities between them. Nor does it simply follow rules that dictate recommending, for example, other books by the same author. Instead, it looks at the numbers: the numbers of people who bought an item and also bought another item, or who looked at a product page and also looked at another product page (Linden, Smith, and York, 2003). It is these types of statistical correlations that let Amazon know to (as of this writing) recommend the first season of the television show Veronica Mars to those who visit the page for the show Wonderfalls — even though there is no overlap in the shows’ creators or stars, one is set in upstate New York and the other in Southern California, and the main character of one is a college-educated woman who works in a gift shop while the other is a mystery-solving high school student. Companies like Amazon build such statistical models not only in an attempt to improve sales of these items, but also because this type of information about products and their purchasers is a saleable commodity in itself.
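In the spirit of the item-to-item collaborative filtering that Linden, Smith, and York describe, here is a minimal sketch of this kind of counting. The purchase histories are invented, and the real system uses more refined similarity measures over vastly more data; the sketch only illustrates the principle that correlation, not content analysis, drives the recommendations.

```python
from collections import defaultdict
from itertools import combinations

# Invented purchase histories: each set is one customer's items.
purchases = [
    {"Veronica Mars S1", "Wonderfalls", "Dead Like Me"},
    {"Wonderfalls", "Veronica Mars S1"},
    {"Wonderfalls", "Firefly"},
    {"Veronica Mars S1", "Firefly"},
]

# Count how often each pair of items was bought by the same person.
co_counts = defaultdict(int)
for basket in purchases:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(item, k=2):
    """Items most often bought together with `item`: correlation, not content."""
    scores = {b: n for (a, b), n in co_counts.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("Wonderfalls"))  # e.g., ['Veronica Mars S1', 'Firefly']
```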

Looking beyond applications such as Amazon’s, statistical AI techniques can also engage with human expressions, not just recognize what correlates with them. Statistical techniques have improved computer speech recognition to the point where automatic dictation and automated phone systems (with their frustrations) are a practical reality. And statistical techniques don’t just help computers recognize what words people speak; they can also analyze patterns of words. The most successful email spam filters, for example, are built using statistical AI techniques.
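Many of the best-known spam filters are naive Bayes classifiers: they count how often each word appears in known spam and known legitimate mail, then score new messages by combining those counts as probabilities. A minimal sketch, with an invented four-message training set and add-one smoothing (real filters train on thousands of messages):

```python
import math
from collections import Counter

# Invented four-message training set; real filters train on thousands.
spam = ["cheap pills buy now", "buy cheap watches now"]
ham = ["meeting notes attached", "lunch notes for the meeting"]

spam_words = Counter(w for msg in spam for w in msg.split())
ham_words = Counter(w for msg in ham for w in msg.split())
vocab = set(spam_words) | set(ham_words)

def log_prob(words, counts, n_msgs):
    # Naive Bayes: log prior plus summed log-likelihoods, with add-one
    # smoothing so unseen words don't zero out the whole probability.
    lp = math.log(n_msgs / (len(spam) + len(ham)))
    total = sum(counts.values())
    for w in words:
        lp += math.log((counts[w] + 1) / (total + len(vocab)))
    return lp

def classify(message):
    words = message.split()
    spam_score = log_prob(words, spam_words, len(spam))
    ham_score = log_prob(words, ham_words, len(ham))
    return "spam" if spam_score > ham_score else "ham"

print(classify("buy pills now"))        # spam
print(classify("notes for the lunch"))  # ham
```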

The rise of statistical AI presents a significant challenge to those who believe that AI systems should be structured based on theories of human intelligence. The successes of statistical techniques grow each day — and yet no one believes that I choose which media to recommend to friends using the types of counting involved in Amazon’s recommendation system. At the same time, statistical techniques have severe limitations, which are not always acknowledged.

Limits of statistical AI

In the U.S., our most substantial public conversation about the limits of statistical artificial intelligence arose in response to the Total Information Awareness (TIA) initiative. Though often discussed as a response to the terrorist attacks of September 11th, 2001, the TIA effort was first proposed by the Defense Advanced Research Projects Agency two years before — and it was the continuation of a vision with a significantly longer history. But it was not until after the 2001 attacks that TIA gained serious momentum of its own, after acquiring major funding and a high-profile director: John Poindexter, best known for less-than-truthful testimony to Congress about the Iran-Contra scandal that plagued President Reagan’s administration.

The TIA vision is an example of dystopian magical thinking. It is dystopian in that, rather than identifying potential criminals and placing them under surveillance, it imagines putting every citizen under surveillance. It is magical thinking in that it imagines that statistical AI techniques are capable of sifting through the vast flood of information this would produce in order to identify good leads for anti-terrorism investigators. This demonstrates a fundamental misunderstanding of how these techniques operate. One of the world’s most respected computing organizations — the Association for Computing Machinery — put it this way in a letter to the Senate Armed Services Committee:

Any type of statistical analysis inevitably results in some number of false positives — in this case incorrectly labeling someone as a potential terrorist. As the entire population would be subjected to TIA surveillance, even a small percentage of false positives would result in a large number of law-abiding Americans being mistakenly labeled.

For example, suppose the system has a 99.9% accuracy rate. We believe that having only 0.1% of records being misclassified as belonging to potential terrorists would be an unachievable goal in practice. However, if records for everyone in the U.S. were processed monthly, even this unlikely low rate of false positives could result in as many as 3 million citizens being wrongly identified each year. More realistic assumptions about the percentage of false positives would drive the number even higher. (Simons and Spafford, 2003)
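The arithmetic behind this warning is straightforward to reproduce. A back-of-the-envelope calculation, assuming (my figures, for illustration, not the letter’s exact ones) a U.S. population of roughly 290 million and monthly processing:

```python
# Back-of-the-envelope version of the ACM letter's estimate.
# Assumed figures (mine, for illustration): ~290 million U.S. residents,
# records processed monthly, and the letter's "unachievable" best-case
# false-positive rate of 0.1%.
population = 290_000_000
months_per_year = 12
false_positive_rate = 0.001

false_alarms = population * months_per_year * false_positive_rate
print(f"{false_alarms:,.0f} false alarms per year")  # 3,480,000

# Even counting each person only once, millions of law-abiding people
# would be wrongly labeled -- and a realistic false-positive rate
# would push the numbers far higher.
```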

False positives are fine in Google link lists and Amazon purchase recommendations, but in other areas their cost is much too high. In the area of surveillance, not only would they improperly cast suspicion on many people — the number of false leads would also be so high as to overwhelm any group tasked with investigating them. Further, the problem of false positives is only one of the serious limitations of statistical techniques. Another is the need for large amounts of well-defined data in order to develop the system’s model. The amount of information available on terrorist activity is very small by comparison with that used to build effective statistical systems (e.g., Google’s collection of web pages containing links) — and the patterns sought (e.g., determining whether a conversation is somehow “suspicious”) are extremely ill-defined compared with statistical AI’s success stories (e.g., determining which pages a web page links to). Security expert Bruce Schneier estimates that even an “unrealistically accurate” TIA system would generate 1 billion false alarms for every real plot identified (2006). Further, there is the problem of false negatives. Even with all these billions of alarms, the system would fail to go off at times when it should.

In 2003, after public outcry, Poindexter resigned, and it seemed that TIA had been dealt a death blow when its funding was removed. Then, in 2005, The New York Times reported that the Bush administration had authorized a far-reaching NSA surveillance program that included warrantless electronic eavesdropping on telephone calls and e-mails of U.S. citizens, and a series of further revelations followed (Williams, 2006). If not TIA in name, this was clearly the same dystopian vision at work. Further, other manifestations of magical thinking continue to generate proposals for utterly inappropriate attempts to employ statistical techniques. In the case of surveillance, even if ineffective for their stated purpose, these efforts still produce databases and monitoring systems, each vulnerable to compromise and improper use.

More generally, to understand what statistical techniques can accomplish, we need to move beyond general principles. Though understanding the basic fact of false positives provides some basis for evaluating proposals for statistical techniques (the cost of false positives must be low), it is also important to foster an understanding of how statistical models are developed and what they represent. As I argued in the Introduction, I believe presenting legible examples, especially from the area of media authoring, can help in this effort. And, of course, such examples can also help one consider when statistical techniques might be useful to media authors.

N-grams

A statistical model used in a system for understanding or generating human language is often called a “language model.” As with most statistical models, the primary use of a language model is to predict the future (linguistic) behavior of a system based on a sample of its past (linguistic) behavior. Constructing such a model requires two efforts. The first is deciding what to pay attention to: what “features” of the past behavior will be treated statistically. The second is deciding how to make those features part of an operating statistical model.

One of the most popular approaches for building a language model is the concept of “n-grams” (also known as “Markov chains”). An n-gram language model is built by looking at a body of linguistic data and seeing what elements follow other elements. Looking at the letters in a body of English, q would quite often be followed by u. Looking at the word level, you would more often be followed by are than by banana. Looking at sets of two elements means examining “bigrams,” while sets of three are “trigrams.”

As this probably makes clear, in an n-gram model the features are often the word or two that appear before the current one. A simple model employing these features assigns each possible next word a likelihood based on how often it follows the preceding words within the sample data. In actuality, the models tend to be somewhat more complicated. This is in part because, even after training on millions of words, new pairs and triplets of words will appear regularly, even in data from the same source (say, the same newspaper from which the model was developed). This requires that the model be “smoothed” so that newly-appearing bigrams and trigrams won’t be assigned zero probabilities.
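As a concrete illustration, here is a minimal sketch of a word-level bigram model with add-one (“Laplace”) smoothing, the simplest smoothing scheme. The corpus is invented and tiny; real language models train on far more data with subtler smoothing.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; a real model trains on millions of words.
corpus = "you are here . you are not a banana . the banana is here .".split()

unigram_counts = Counter(corpus)
bigram_counts = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    bigram_counts[prev][cur] += 1

vocab = set(corpus)

def prob(cur, prev):
    # Add-one ("Laplace") smoothing: every bigram, seen or unseen, gets
    # a nonzero probability, so new word pairs don't break the model.
    return (bigram_counts[prev][cur] + 1) / (unigram_counts[prev] + len(vocab))

print(prob("are", "you"))     # high: "you are" appears twice
print(prob("banana", "you"))  # low, but not zero, thanks to smoothing
```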

Such models have many potential uses. For example, dictation software may find certain words in a sentence ambiguous. But an n-gram model can use the words that precede and follow ambiguous words to improve the chances of correct recognition. Certainly “ate the banana” is a much more likely trigram than “ate the bandana.”
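A sketch of that disambiguation, using trigram counts over an invented toy corpus to compare the two possible hearings:

```python
from collections import Counter

# Toy disambiguation: which trigram is likelier given the context?
corpus = ("he ate the banana . she ate the banana . "
          "he wore the bandana . she ate the apple .").split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
contexts = Counter(zip(corpus, corpus[1:]))

def p_next(w1, w2, w3):
    # Conditional probability of w3 following "w1 w2" in the corpus.
    return trigrams[(w1, w2, w3)] / contexts[(w1, w2)]

print(p_next("ate", "the", "banana"))   # 2/3: the likelier hearing
print(p_next("ate", "the", "bandana"))  # 0/3: reject this hypothesis
```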

On the other hand, an n-gram collection doesn’t “understand” anything about the language it models. It would work just the same way on a collection of musical notes, or any other serially-ordered data. And while, as a model, it does well at capturing coherence from word-to-word, it does poorly at capturing any higher-level structure. This becomes quite apparent in the various attempts to use n-grams to generate language, rather than simply understand it. The foundational example, here, is Claude Shannon’s n-gram generated sentence from his field-defining essay, “A Mathematical Theory of Communication” (1948):

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

This passage is more coherent than a random series of words, because each word follows quite naturally from the one preceding. This is no surprise, given that it was produced by flipping through a book, writing down a word, flipping through until that word was found again, writing down the word immediately following it, then flipping through until the just-written word was found, and so on — a word-level bigram. Longer chain lengths, of course, create longer stretches of local coherence. But the structure of sentences, paragraphs, and ideas is completely absent from the n-gram model.
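Shannon’s book-flipping procedure is easy to mechanize. Below is a minimal sketch of word-level bigram generation, seeded, fittingly, with the words of Shannon’s own sentence; as with his result, the output is locally coherent but has no higher-level structure.

```python
import random
from collections import Counter, defaultdict

# Shannon's book-flipping procedure, mechanized: after emitting a word,
# draw the next word according to how often it followed that word.
text = ("the head and in frontal attack on an english writer that the "
        "character of this point is therefore another method for the "
        "letters that the time of who ever told the problem for an "
        "unexpected").split()

followers = defaultdict(Counter)
for prev, cur in zip(text, text[1:]):
    followers[prev][cur] += 1

word = random.choice(text)
output = [word]
for _ in range(15):
    options = followers[word]
    if not options:  # dead end: the word was only seen at the end
        word = random.choice(text)
    else:
        word = random.choices(list(options), weights=list(options.values()))[0]
    output.append(word)

print(" ".join(output))  # locally coherent, globally structureless
```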

In other words, a statistical model is nothing magical. Like Google’s capturing of links from web pages, statistical models such as n-grams must be focused on well-defined features in consistent pools of data. Then, even once millions of patterns are captured, in any activity with significant variety (like human language), the model must be smoothed to account for the fact that previously-unseen patterns will continually appear. Even once this is accomplished, the model will only be appropriate for limited tasks — it will not be a robust artificial intelligence able to predict, identify, generate, and otherwise stand in for human knowledge of what it models.

The Restaurant Game

After moving to MIT, Jeff Orkin, who led the AI development for F.E.A.R., launched a project with an interesting goal — pursuing an idea from traditional AI using the tools of statistical AI. Specifically, his project The Restaurant Game seeks to develop what he characterizes as a modern version of the scruffy idea of a “restaurant script” (Orkin, 2007, 22). His method is to statistically analyze the behavior of thousands of players in a freeform restaurant environment and, thereby, learn a much more flexible and detailed script of restaurant behavior than it would be possible to author by hand.

Orkin calls his effort a “game,” a term that resonates in two senses. First, he plans to use the results of his work to develop a game — and all those who interact with the freeform environment will be credited as designers of the eventual game. Second, by presenting the experience as a game, he is placing The Restaurant Game into the same category as a number of successful efforts to use simple game structures to encourage large numbers of people online to collaboratively develop data that is intractable for machines but too laborious for individual humans.

The best known of these data-collection games is The ESP Game, developed by Luis von Ahn and collaborators at Carnegie Mellon University (von Ahn et al., 2005). This game shows the same image, simultaneously, to two audience members who don’t know each other’s identity and cannot directly communicate. These two people each type words, as quickly as they can, that might describe the contents of the image, avoiding any listed forbidden words. As soon as one player types a word that the other player has already typed — creating a match in their descriptions — they score points and the next image appears, until the game time expires.

People find The ESP Game surprisingly fun, with some experiencing an exciting feeling of connection at the moments when they choose the same word as a partner. This has provided the motivation for people to generate quite a bit of data through such games. The data may be used to label online images (to increase accessibility for the visually impaired) or toward more computationally challenging ends (e.g., as well-formed training data for statistical approaches to computer vision). Quite a few other games along these lines exist, and they differ in their particulars, but all of them use simple scenarios to generate very specific data points.

By comparison, The Restaurant Game is immensely free-form and ambitious. Each person who signs in plays a male customer or female waitress. Players can chat freely with each other — not trying to guess the same word, or taking part in any other game-like activity, but simply playing their roles. The environment has many objects with which the players can interact, including food and drink that can be consumed, a register that can be used to pay bills (but can also be picked up and perhaps stolen), menus, furniture, flowers on the tables, a variety of kitchen implements, and so on. There are also two non-player characters: a bartender who provides drinks and a cook who provides food.

Compared with the real world, The Restaurant Game is certainly a microworld — but it is also a vastly richer environment than that provided in previous data-collection games. Because it is a microworld, fewer actions are possible and those actions are unambiguous. The system knows exactly when one player passes another a menu, can tell exactly what words one player says to another, and can always tell whether all the food was gone before the waitress picked up the plate. On the other hand, the diversity of possible actions and sequences of actions makes the data collected by The Restaurant Game a challenge to interpret statistically.

Orkin’s approach is to focus on the physical actions that take place, looking at the ordering of actions and the relationships between them. Actions are clustered based on their supports, an approach that discovers certain similarities (all foods are eaten the same way) as well as preconditions built into the microworld (eating requires the presence of food) and also social expectations (almost all players eat while seated). Orkin’s analysis prunes away the most infrequent actions and sequences, building up what he calls a “plan network” — a structure representing the normal pathways of actions in The Restaurant Game’s microworld. The language from player interactions is arranged, in the model, in terms of the physical actions between which it took place.
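To make the shape of such a structure concrete, here is a loose sketch under invented data and an invented threshold (an illustration of the general technique, not Orkin’s actual code): count the action-to-action transitions observed across sessions, prune the rare ones, and keep the survivors as a directed graph of “normal” pathways.

```python
from collections import Counter

# Invented logs of per-session action sequences; Orkin's real data
# comes from thousands of Restaurant Game play sessions.
sessions = [
    ["sit", "read_menu", "order", "eat", "pay", "leave"],
    ["sit", "read_menu", "order", "eat", "pay", "leave"],
    ["sit", "order", "eat", "pay", "leave"],
    ["sit", "read_menu", "order", "eat", "leave"],  # dine-and-dash
]

transitions = Counter()
for actions in sessions:
    for a, b in zip(actions, actions[1:]):
        transitions[(a, b)] += 1

# Prune infrequent transitions, keeping only the "normal" pathways.
MIN_COUNT = 2
plan_network = {pair for pair, n in transitions.items() if n >= MIN_COUNT}

def is_typical(actions):
    """Flag a session if any step leaves the learned pathways."""
    return all((a, b) in plan_network for a, b in zip(actions, actions[1:]))

print(is_typical(["sit", "read_menu", "order", "eat", "pay", "leave"]))  # True
print(is_typical(["sit", "eat", "order", "pay", "leave"]))               # False
```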

The resulting plan networks have the strengths and failures one would expect. Some of these are very specific to the microworld. Orkin demonstrates this with a visualization of the waitress plan network, which captures the set of interface steps required to use the cash register to pay a bill — and also captures the path produced by a common player confusion about the register’s operations.

Other successes and failures are those common to statistical models. These were revealed in Orkin’s evaluation of the system, which used n-gram modeling as “a means of estimating the likelihood of new gameplay sessions, and quantitatively evaluating how well the learned model of restaurant behavior and language correlates with humans’ ” (69). Here we actually see a microworld version of the ideas behind Total Information Awareness — examining data from thousands of games, Orkin worked to build a statistical profile of the behavior, and then attempted to build a system that would automatically detect unusual behavior.
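A minimal sketch of this kind of evaluation, again with invented data and an invented cutoff: score each new session by its average log-likelihood under transition statistics learned from past sessions, and flag low scorers as unusual. (Orkin’s models also cover the language exchanged, which this sketch omits.)

```python
import math
from collections import Counter

# Toy sessions standing in for the thousands of logged games.
training = [
    ["sit", "order", "eat", "pay", "leave"],
    ["sit", "order", "eat", "pay", "leave"],
    ["sit", "order", "eat", "leave"],
]
counts = Counter()
totals = Counter()
vocab = {a for s in training for a in s}
for s in training:
    for a, b in zip(s, s[1:]):
        counts[(a, b)] += 1
        totals[a] += 1

def avg_log_likelihood(session):
    steps = list(zip(session, session[1:]))
    lp = 0.0
    for a, b in steps:
        # Add-one smoothing keeps unseen transitions from scoring -inf.
        lp += math.log((counts[(a, b)] + 1) / (totals[a] + len(vocab)))
    return lp / len(steps)

THRESHOLD = -1.5  # invented cutoff; would be tuned on held-out sessions
for s in (["sit", "order", "eat", "pay", "leave"],
          ["eat", "sit", "leave", "order", "pay"]):
    score = avg_log_likelihood(s)
    print(s, round(score, 2), "unusual" if score < THRESHOLD else "typical")
```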

This basic approach builds on some techniques that have proven very powerful in statistical AI, and Orkin reports some successes with it. For example, Orkin reports a play session in which a very typical set of actions took place. However, the system, correctly, rated the session as highly unlikely based on the language — the customer was trying to leave without paying and convince the waitress to come home with him. On the other hand, the system also responded in the same way to any conversation containing unusual vocabulary, even in games that humans rated as typical. As discussed earlier, this is the classic problem of false positives. Further, Orkin provides examples of sequences of physical actions — such as ordering beer and pie until they fill the restaurant, or sitting down, playfully refusing to order, and leaving — that the current version of the plan network is unable to detect as unusual, but are immediately strange to any human evaluator. This is the classic problem of false negatives.

Orkin had originally announced that a game created using the results from The Restaurant Game would be entered into the 2008 Independent Games Festival. But he had to let the deadline pass. In his thesis he writes that this provided a lesson in the “danger of promising the public a deliverable based on novel research by a specific date” (42). More generally, it provides a lesson about statistical AI. As an authoring tool, it requires large amounts of data that may not be available. Even if the significant effort needed to collect this data is expended, as with The Restaurant Game, for behavior with wide variation the results will be unpredictable and incomplete. Nevertheless, perhaps Orkin’s next step will find a way to make interacting with this model a fun, playful experience. In the meantime, the most significant contribution of The Restaurant Game is how it demonstrates that even in a microworld, stripped of most of the complexity of our own, statistical models produce many more false positives and false negatives than a TIA-style system could tolerate.