May 16, 2007

New Interactive Drama in the Works (Part 3): NLU Interfaces

by Andrew Stern · 9:10 pm

In this post I’ll make a case for natural language understanding interfaces in interactive drama and comedy. This is Part 3 of what’s becoming an intermittent developer-diary series about design and technology issues in play as we develop a new commercial interactive drama/comedy project.

The previous Part 2 post from last December asked and briefly answered several questions: how to achieve content richness for non-linear, real-time interactive stories; how to create satisfying agency; and briefly, how to find funding for this kind of work. Most of the discussion in the comments focused on business plans and funding, which impact the design and technology issues, because resources and time in the production schedule are needed to achieve the design and technology goals.

In this post the primary questions I’d like to address are:
What are the pros and cons of having an open-ended natural language interface for an interactive drama/comedy game?
Is natural language the right choice right now?

Related questions left over from the previous post include:
How well did the natural language interface work in Façade?
Can the failures of the natural language interface in Façade be overcome?

For the sake of this discussion, when I use the term “speak”, I mean it to be equivalent to “type”; for now I’m not going to distinguish between speaking into a microphone, which requires speech recognition, and having the player directly type their dialog, à la text chatting, which avoids the need for speech recognition.

As we’ve long argued, such as in our 2003 GDC paper (pdf), a key to making games about people is to give players the means to express themselves more deeply — their feelings, attitudes, thoughts and ideas, with some nuance and their own personality. And just as important, to have the game’s NPCs and drama manager truly listen, respond, and alter the events of the game in meaningful ways. Allow the player to say more, and have it matter.

We believe such rich player expression is fundamental to making game playing more meaningful for players. This is in contrast to what seems possible with purely physical game interactions such as running, jumping, shooting and manipulating objects in the gameworld. A richer level of expression requires language, and perhaps gesture as well.

But wait, is it true that language and gesture are required to expand the player’s range of expression? Couldn’t developers do more to have purely physical action translate into more meaningful expression, which joysticks and controllers already do a reasonable job supporting?

Perhaps. Having the player choose to go certain places, physically picking up and using objects, and so on, I think could be designed to have more meaning than what we’ve seen in games to date. As examples, the stealth-based sub-genre of action games such as Thief and Splinter Cell, and thought-provoking quests such as Ico and Shadow of the Colossus offer more nuance and meaning from their physical action than what you find in typical games where the player is merely trying to destroy enemies and survive. The Tale-of-Tales folks are taking this approach, I believe.

That said, there are still so many things a player cannot express through physical action alone — the particular way a player feels about someone or something; descriptions of one’s self and others; ideas, thoughts, desires, fears, beliefs, the list goes on and on. Language and gesture are truly necessary tools for meaningful expression — there’s just no getting around that.

But wait, in regards to expanding the range of expression — what about offering players limited but context-dependent choices? Aren’t there only a few things a player would typically want to say at any one time anyway, and there’s no need to offer a broad range of expression at all times?

In fact, as I’m sure you’re all familiar with, the status quo interface for allowing players to non-physically (verbally) express themselves in a game is to have them choose from a context-specific, multiple-choice list of pre-written lines of dialog. Multiple-choice lists lead to subsequent multiple-choice lists, with the overall “conversation” having a branching-tree structure.
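To make the branching-tree structure concrete, here is a minimal sketch in Python of how a menu-driven conversation is typically represented; the node names and lines of dialog are entirely hypothetical, not taken from any particular game.

```python
# A minimal sketch of a branching dialog tree; all lines and names are
# hypothetical, not from any particular game.

class DialogNode:
    def __init__(self, npc_line, choices=None):
        self.npc_line = npc_line      # what the NPC says at this node
        self.choices = choices or []  # list of (player_line, next_node) pairs

# Each choice's meaning is hardwired into the tree, so the game always
# "understands" the selection -- but only these few options exist.
root = DialogNode(
    "So... how do you think the party is going?",
    choices=[
        ("It's great!",            DialogNode("Really? You're too kind.")),
        ("Honestly? A bit tense.", DialogNode("Tense? What do you mean?")),
        ("I'd rather not say.",    DialogNode("Fine, forget I asked.")),
    ],
)

def play(node):
    """Walk the tree by prompting the player with a numbered menu."""
    while node.choices:
        print(node.npc_line)
        for i, (line, _) in enumerate(node.choices, 1):
            print(f"  {i}. {line}")
        pick = int(input("> ")) - 1
        node = node.choices[pick][1]
    print(node.npc_line)
```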

The primary strength of the context-dependent menu approach, in my view, is that each choice is understood by the game (since its meaning is hardwired into the conversation’s branching-tree structure) and that each choice will (or should) lead to a believable response. In games in general, it is very important that the player feel understood, and not become frustrated by giving input that is misunderstood or ignored by the game.

But there are many drawbacks with context-dependent menus that we’ve blogged about before. A major one is that the player is limited to speaking only what the game designers have offered, typically only 3 or 4 choices at any one moment. Yet isn’t that enough? If the choices are changing every time something happens, aren’t there only 3 or 4 things I, as the player, would care to do at any one time anyhow?

The short answer is, no. I hear that argument made from time to time, and it’s shortsighted. When we analyze the range of what players try to say in Façade by studying their stageplays, conduct further design experiments, and follow our own design intuition, it becomes clear that there are a dozen or more different ways players want to be able to react to a given situation at any one moment; add nuanced variations to each of those dozen or more ways, and the number of unique player moves required at any one moment for satisfying play is in the hundreds.

Another problem with dialog menus is that if the pre-written phrases are written in a specific style, they may not feel like the player’s own voice. Or if the phrases are written in a generic style, they may feel like bland and uninteresting things to say.

These issues combine to make dialog menus fall short of giving players the means to meaningfully express themselves. (It should be noted that limiting the player’s expressiveness in this way makes the game far easier for developers to implement.)

A second approach to player expression is to eschew the first-person conversational natural language that dialog menus offer, and instead supply the player a set of high-level commands, with parameters, expressed in the second-person, such as “yell angrily”, “ask joe about the banana”, or “flirt with Henry”; see more examples here. If the interface allows the player to freely combine any command with a range of characters, objects and adverbs, then the resulting range of player expression is far greater than the multiple-choice list approach.
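As a rough illustration of what such a command interface involves under the hood, here is a hedged sketch of parsing free combinations of commands and parameters; the verb, character, object and adverb lists are invented for the example, not taken from any shipped game.

```python
# A rough sketch of a second-person command interface: a fixed verb set
# combined freely with characters/objects/adverbs. All vocabulary here is
# hypothetical, chosen to match the examples in the text.

VERBS   = {"yell", "ask", "flirt", "hug", "ignore"}
PEOPLE  = {"joe", "henry", "joanne"}
OBJECTS = {"banana", "martini"}
ADVERBS = {"angrily", "playfully", "nervously"}

def parse_command(text):
    """Return (verb, target, adverb), or None if no known verb is present."""
    words = text.lower().replace("with", "").replace("about", "").split()
    verb   = next((w for w in words if w in VERBS), None)
    target = next((w for w in words if w in PEOPLE | OBJECTS), None)
    adverb = next((w for w in words if w in ADVERBS), None)
    return (verb, target, adverb) if verb else None

print(parse_command("yell angrily"))              # ('yell', None, 'angrily')
print(parse_command("ask joe about the banana"))  # ('ask', 'joe', None) -- the object is dropped by this naive sketch
print(parse_command("flirt with Henry"))          # ('flirt', 'henry', None)
```

Even this toy version shows why the formal range of expression is so much larger than a 3-or-4-item menu: every verb can in principle combine with every target and adverb.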

Yet a major drawback of commands is, obviously, they are not natural language. The NPCs are speaking to you in their own natural words, yet you’d be speaking or acting back to them with commands, and often somewhat generic ones. You don’t have your own voice, or the means to express yourself in your own individual way with your nuance and style. I argue that a command interface feels less meaningful than if you could speak naturally, in your own words.

But wait — couldn’t the non-naturalness of a command interface be overlooked if the player’s expanded range of expression results in a greater range of interesting and meaningful responses from the NPCs and game? That is, can we sacrifice naturalness for the holy grail — greater agency?

Well, typically, in games with command interfaces, few combinations of the broad range of commands and parameters are actually supported at any one moment of the game. In any one moment, many command-parameter combinations are deflected, ignored or rejected. So although a command interface is formally more expressive than a multiple-choice list of natural dialog, pragmatically, players often only have a small number of meaningful choices at any one moment — and we’re back to the problems of only offering players a few context-specific choices, with the added frustration that you don’t know exactly what those choices are.

ALL of this actually comes back to the biggest technical challenge for creating more meaningful games: content. Once you expand the interface to allow players to express more than a few choices at one time, in service of making games more meaningful to players, the game will need a much richer capability for meaningful responses! This requirement is perhaps even more challenging to implement than the natural language interface itself. I briefly addressed this fundamental issue in answers 1 and 2 in the previous post.

Getting back to command-based interfaces… Games using the command-based approach include text-based interactive fiction and the LucasArts adventure games, though in both cases most of the commands are not for expressing the player’s feelings, attitudes, thoughts or ideas, but for physical action, such as pick up, use, examine, walk, etc. (Please point out any game examples with commands that support more non-physical action! Fable?) A new twist on the command interface is Chris Crawford’s Deikto for Storytron, allowing players to construct more complex expressions, akin to a sentence diagram.

These command-based games have the non-natural-language problem described above, at a minimum, and each has the deflect/ignore/reject issue as well, to varying degrees (well, how Storytron performs in this regard is not yet known, since interactive stories built with it haven’t been released yet).

That brings us to a third way — to give the player a totally open-ended natural language interface, in which players can speak (or type) anything they want. In such an interface, players could be given the freedom to speak at any time, as in Façade, or in a turn-taking fashion, as in chatterbots like Eliza and ALICE.
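For readers who haven’t poked at an Eliza-style system, here is a tiny turn-taking sketch in that chatterbot spirit; the keyword patterns are purely illustrative, not taken from Eliza or ALICE themselves.

```python
# A tiny Eliza-style, turn-taking sketch: the player types anything,
# and the bot answers from keyword patterns. Patterns are hypothetical.

import re

RULES = [
    (r"\bi feel (.*)",       "Why do you feel {0}?"),
    (r"\bi think (.*)",      "What makes you think {0}?"),
    (r"\b(mother|father)\b", "Tell me more about your family."),
]

def respond(utterance):
    """Return the first matching canned response, or a catch-all prompt."""
    for pattern, template in RULES:
        m = re.search(pattern, utterance.lower())
        if m:
            return template.format(*m.groups())
    return "I see. Go on."   # fallback when nothing matches

print(respond("I feel nervous about the party"))
# -> "Why do you feel nervous about the party?"
```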

Finally we have an interface where the player is given the means to express themselves in their own words, without limitation! Hallelujah! With such an interface, new genres of games can be created, with interaction centered around the comedies and dramas of real life, such as the kinds of content we find in many of the most popular TV shows (Friends, Seinfeld, Desperate Housewives). These new games can be about subject matter that has mass appeal (versus appealing only to fans of the action genre); suddenly games could become appealing to all those people who have had little or no interest in them to date. Think about all the people you know who don’t yet play videogames — perhaps your female friends who love good movies and TV shows but find action and puzzle games boring or juvenile; perhaps your parents or older folks, who like sending email but hesitate to pick up a controller, even a Wiimote; perhaps friends, men or women, who want something more sophisticated in their entertainment, but need it to be short and sweet, because they’re busy people.

The problem with this vision, of course, is that an open natural-language interface means little if the game isn’t listening. In the previous post, Black and White developer and now Sims 3 lead AI engineer Richard Evans commented the following:

Natural language creates a huge gap between the player’s expectation and the reality – this inevitably creates disappointment. Use of natural language in Facade was (imo) over-ambition which served to lessen the perceived quality of the other areas where Facade really innovated. In an interactive product, you want to innovate in one or two core areas, and be massively risk-averse in the others.

You write “Façade attempts to allow the player to speak anything they like to the characters”. This is strictly speaking untrue – Facade allows you to *type* anything you like to the characters, but most of it isn’t *spoken* to them, because they mostly don’t understand what you are saying. It allows you, for a brief moment, to *think* you can speak anything you like, but when you realize the (inevitable) limitations of the natural-language comprehension, it leaves you feeling disappointed. Psychologically, you want your player’s “Ooo-Wow” feeling to slowly increase over time, not to be unsustainably high for the first 4 minutes, and then tail off.

(As a side note, Richard is equating the term “speak to” with “be understood”, whereas I am considering “speak” or “type” to be the same as “utter”, separate from “be understood”.)

Let’s break these problems down, and talk about ways to improve them. How well these problems can be addressed will, I believe, determine the viability of an open-ended natural language interface in a commercial product that people are willing to pay money for.

First let me characterize the severity of the language understanding problem in Façade. When studying traces of Façade players, we found that the NPCs:

respond in a satisfying way roughly a third of the time;
respond in a merely fair, passable way roughly another third of the time;
and respond in a frustrating, unsatisfying way for the remaining third.

For a commercial product, a ratio of 33% satisfying, 33% fair, 33% frustrating won’t fly. I’d suggest we need to improve that to at least 60% satisfying, 30% fair, 10% frustrating, for people to pay money for it. My guess is many players could live with 10% frustrating, given the benefits of the interface.

Let me further break down the unsatisfying responses you get when interacting with the NPCs in Façade into the ways that things can go awry. Note that when the player feels frustrated, there is no easy way for them to know which of these problems is occurring. When things go awry, it could be that:

You might assume that Problem 1, literal misunderstanding, is what is going wrong most of the time when playing Façade, but that’s not true. We haven’t measured it precisely, but when a response is unsatisfying in Façade, a misunderstanding error is responsible approximately a third of the time; therefore, overall, this error occurs for ~20% of all things the player types (33% of 66%).

While a 20% misunderstanding rate is not as high as some of you might have thought, it’s still pretty high and needs to be improved.

Yet content, problem 2b above, is an even bigger problem to solve; it is the root cause of the remaining two-thirds of unsatisfying responses in Façade, equivalent to ~40% of all player inputs. (Issues 2a and 2c above are design choices that can be as present or absent as the developer wishes, especially if 2b is solved.)
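To spell out the arithmetic behind those figures (using the rough one-third / two-thirds split just described, not precisely measured numbers):

```latex
\begin{align*}
\underbrace{0.66}_{\text{not-satisfying responses}} \times \tfrac{1}{3} &\approx 0.22 \approx 20\% \text{ of all player inputs (parser misunderstanding)} \\
0.66 \times \tfrac{2}{3} &\approx 0.44 \approx 40\% \text{ of all player inputs (insufficient content)}
\end{align*}
```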

Okay, let me summarize my analysis so far:

Dialog menus and command-based interfaces both fall short of giving players the means to meaningfully express themselves in their own voice.
An open-ended natural language interface offers that expressivity, but only if the game can understand and respond to enough of what players say.
In Façade, roughly a third of responses are satisfying, a third fair, and a third frustrating; a commercial product needs something closer to 60% satisfying, 30% fair, 10% frustrating.
Of the unsatisfying responses, parser misunderstanding accounts for roughly 20% of all player inputs, and insufficient content for roughly 40%.

Alright, now I’ll give the core of my argument for this post, which also addresses the “is this the right interface, right now?” question:

Even modest success in achieving the content and parsing requirements mentioned above would justify use of an open-ended natural language interface for interactive drama/comedy, because the bulk of the players the games would ultimately appeal to — the largely untapped market of TV and movie lovers who dislike current games — would forgive the interface’s shortcomings to get a chance to finally play games that interest them.

Thoughts?

I’ll spend the rest of this post talking about how to improve the parser misunderstanding problem, one of the two problems requiring at least modest solutions. (Regarding the content problem, again I’ll refer you to my brief response to this issue in answers 1 and 2 of the previous post; it warrants much further discussion.)

(Note that the Façade parser, as well as The Party‘s parser, actually translates the player’s natural text input into discourse acts, essentially parameterized commands just like the “yell angrily”, “ask joe about the banana”, or “flirt with Henry” examples that I excoriated earlier. So why not just have players directly enter discourse acts? Well, even if the player’s natural language does ultimately get turned into discourse acts under the hood, the process for the player of entering natural language is a lot more, well, natural. Also, a single natural language input, such as saying “martinis are gross” in Façade, could translate into several discourse acts at once, such as “refer-to drinks”, “criticize Trip” and “agree Grace”, especially if Grace had just said “maybe you’d prefer a mineral water?” It would be cumbersome and unintuitive to manually enter all three of those discourse acts simultaneously with a command-based interface.)
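Here is a hedged sketch of the kind of mapping just described: a single natural-language utterance translating into several parameterized discourse acts at once. The rule format, context keys and act names are invented for illustration; they are not the actual Façade or The Party rules.

```python
# A hedged sketch of mapping one natural-language utterance to several
# parameterized discourse acts, in the spirit of the "martinis are gross"
# example above. Rules, context keys and act names are hypothetical.

def to_discourse_acts(utterance, context):
    """Map surface text plus conversational context to a list of discourse acts."""
    text = utterance.lower()
    acts = []
    if "martini" in text or "drink" in text:
        acts.append(("refer-to", "drinks"))
    if "gross" in text or "hate" in text:
        # Trip just offered the drink, so disliking it lands on him...
        if context.get("last_offer_by") == "Trip":
            acts.append(("criticize", "Trip"))
        # ...and agrees with Grace if she just suggested an alternative.
        if context.get("last_suggestion_by") == "Grace":
            acts.append(("agree", "Grace"))
    return acts

context = {"last_offer_by": "Trip", "last_suggestion_by": "Grace"}
print(to_discourse_acts("martinis are gross", context))
# -> [('refer-to', 'drinks'), ('criticize', 'Trip'), ('agree', 'Grace')]
```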

On Façade, surprisingly little time was spent developing the parser itself, also known as the Phase I parsing rules (see pdf). We only spent about 8 person-weeks on that part of the engine. For The Party, significant resources will be applied to improving this — hopefully 9 person-months or more. The technical details of how we plan to improve this are too involved to go into here, but I anticipate significant improvements.

(I should reiterate that we do not need to solve the Turing Test here. We “only” need to understand what will be typically spoken in the context of the drama/comedy being played. The range of input required for a specific domain, such as a crazy cocktail party, is still very large, but is hopefully only a small fraction of what passing the Turing Test would require.)

In addition to technical improvements to the parser, there are design techniques we can implement to lessen these misunderstanding errors.

First, in Façade, we limited the number of words players can utter to about 8 at any one time. This limited the complexity of what players could enter, making the job of the parser easier, but still allowing significant expressivity for the player. Admittedly, this results in an asymmetric interface: the NPCs can speak more than 8 words at a time to you, but you can only speak 8 words back. Nonetheless, I think this was a reasonable tradeoff that players generally accepted, and we plan to stick with it for The Party. (Note that with a speech interface, there would be no way to limit the length of the player’s input.)
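The cap itself is trivial to enforce on a typed interface; something along these lines would do, where the exact limit and rejection message are just placeholders.

```python
MAX_WORDS = 8  # roughly the cap Façade used on player utterance length

def accept_utterance(text):
    """Reject overly long utterances before they ever reach the parser."""
    if len(text.split()) > MAX_WORDS:
        return False, "Try saying that in fewer words."
    return True, text

print(accept_utterance("hey, what are you doing later tonight"))                  # accepted (7 words)
print(accept_utterance("i was wondering if maybe you would like to get coffee"))  # rejected (11 words)
```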

A major new feature we could add to the interface for The Party is an optional real-time display of how the player’s words will be interpreted, viewable before the player hits enter and actually “speaks” the dialog. For example, imagine that the player, a man in this case, has typed “hey, what are you doing later tonight” to Joanne, one of the more attractive characters in the game, but hasn’t hit “enter” yet. Beneath the player’s text appear the parameterized discourse acts “flirt with Joanne; hint at rendezvous”.

This real-time display would tell the player how their words will be interpreted before the player truly speaks them, giving the player a chance to backspace and re-word what they are saying if they wish. As another example: if the player, a woman this time, says something more complicated to an NPC named Fred, like “you remind me of my favorite cousin”, the interpretation display might only read “express positive; refer-to family”, which is not all she meant to say — i.e., the parser couldn’t understand the deeper meaning of that expression. So she might backspace and re-type something more direct, like “i like you, you’re funny”, which would display “praise Fred; ally Fred”.
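Here is a sketch of what that optional display might look like in code, using a stand-in parser; the keyword rules and act names below are hypothetical placeholders for whatever NLU mapping the game actually runs.

```python
# A sketch of the proposed real-time interpretation display: as the player
# types, show how their words would be read before they "speak" them.

def parse_to_acts(text):
    """Hypothetical stand-in parser: surface keywords -> discourse acts."""
    text = text.lower()
    acts = []
    if "like you" in text or "funny" in text:
        acts += [("praise", "Fred"), ("ally", "Fred")]
    elif "cousin" in text or "family" in text:
        acts += [("express", "positive"), ("refer-to", "family")]
    return acts

def preview(typed_so_far):
    """What the optional on-screen display would show beneath the input box."""
    acts = parse_to_acts(typed_so_far)
    if not acts:
        return "(not understood yet)"
    return "; ".join(f"{verb} {param}" for verb, param in acts)

print(preview("you remind me of my favorite cousin"))
# -> "express positive; refer-to family"  (the shallow reading in the example)
print(preview("i like you, you're funny"))
# -> "praise Fred; ally Fred"
```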

Arguably, displaying the interpretation of the player’s dialog could reduce immersion and the naturalness of the interface. It could break the illusion (to the extent it exists) that the game is understanding more than it actually is. I think focus testing is required here to help us understand whether this feature helps more than it hurts. (Some players may prefer it; in any case, it would be an optional feature.)

Another major new feature we could add to The Party is rewind. If the player says something and then gets a reaction they don’t like, they could have the option to rewind the action to just before the last thing they spoke, or 10 seconds back in time, whichever is shorter. The ability to undo a misinterpretation by the parser could really help alleviate players’ frustration when it happens. (We’d only offer one level of undo; the player is not given the option to rewind again until they speak something new, or another 10 seconds has gone by.)
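Here is a simplified sketch of how the one-level rewind might be implemented, assuming the engine can snapshot its state (the GameState object is hypothetical); it restores the state captured just before the player’s last utterance, within roughly a 10-second window.

```python
import copy, time

# A rough sketch of the proposed one-level rewind: snapshot the (hypothetical)
# game state just before each player utterance, and allow a single undo back
# to that snapshot while it is still recent.

class RewindBuffer:
    WINDOW = 10.0  # seconds the undo stays available

    def __init__(self):
        self._snapshot = None
        self._when = 0.0
        self._armed = False

    def before_player_speaks(self, game_state):
        """Capture state right before the utterance; re-arms the single undo."""
        self._snapshot = copy.deepcopy(game_state)
        self._when = time.time()
        self._armed = True

    def try_rewind(self):
        """Return the saved state if the single undo is still available."""
        if not self._armed or time.time() - self._when > self.WINDOW:
            return None
        self._armed = False   # only one level of undo until the player speaks again
        return self._snapshot

# Usage (with a hypothetical GameState object):
# buffer.before_player_speaks(game_state)   # called as the player hits enter
# restored = buffer.try_rewind()            # called when the player asks to rewind
```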

Rewind would also affect gameplay itself, hopefully enhancing it; it would be a mini-version of the replay value in the overall game itself. It would be fun to say something, get a reaction, rewind, say something else, see a different reaction, rewind again, etc. (Overall, both Façade and The Party are designed to be replayed many times, to see how things can turn out differently each time, with The Party offering much more replay value than Façade.)

Finally, although this may seem antithetical to my previous arguments, we could also expose the discourse acts to the player, offering an optional command-based interface. While this would have all the drawbacks of command interfaces I mentioned earlier, it could appeal to hardcore gamers who prefer the tradeoff of a crystal-clear interface over naturalness. A drawback would be that players would probably rarely enter multiple discourse acts at once, diminishing the range of reactions they could activate compared to expressing multiple simultaneous discourse acts via natural language. But still, it may be a good option for those players, also worth focus testing.

In sum, I hope I’ve made a reasonable case for why an open-ended natural language interface could be viable for a commercial interactive drama/comedy project. Ultimately it relies on creating a moderately successful parser (e.g. with no greater than a 10% failure rate), and on the NPCs and drama manager being able to respond richly enough to do justice to the player’s now greater range of expressivity.