May 16, 2007

New Interactive Drama in the Works (Part 3): NLU Interfaces

by Andrew Stern · 9:10 pm

In this post I’ll make a case for natural language understanding interfaces in interactive drama and comedy. This is Part 3 of what’s becoming an intermittent developer-diary series about design and technology issues in play as we develop a new commercial interactive drama/comedy project.

The previous Part 2 post from last December asked and briefly answered several questions: how to achieve content richness for non-linear, real-time interactive stories; how to create satisfying agency; and briefly, how to find funding for this kind of work. Most of the discussion in the comments focused on business plans and funding, which impact the design and technology issues, because resources and time in the production schedule are needed to achieve the design and technology goals.

In this post the primary questions I’d like to address are:
What are the pros and cons of having an open-ended natural language interface for an interactive drama/comedy game?
Is natural language the right choice right now?

Related questions left over from the previous post include:
How well did the natural language interface work in Façade?
Can the failures of the natural language interface in Façade be overcome?

For the sake of this discussion, when I use the term “speak”, I mean it to be equivalent to “type”; for now I’m not going to distinguish between speaking into a microphone, which requires speech recognition, and having the player directly type their dialog, à la text chatting, which avoids the need for speech recognition.

As we’ve long argued, such as in our 2003 GDC paper (pdf), a key to making games about people is to give players the means to express themselves more deeply — their feelings, attitudes, thoughts and ideas, with some nuance and their own personality. And just as important, to have the game’s NPCs and drama manager truly listen, respond, and alter the events of the game in meaningful ways. Allow the player to say more, and have it matter.

We believe such rich player expression is fundamental to making game playing more meaningful for players. This is in contrast to what seems possible with purely physical game interactions such as running, jumping, shooting and manipulating objects in the gameworld. A richer level of expression requires language, and perhaps gesture as well.

But wait, is it true that language and gesture are required to expand the player’s range of expression? Couldn’t developers do more to have purely physical action translate into more meaningful expression, which joysticks and controllers already do a reasonable job supporting?

Perhaps. Having the player choose to go certain places, physically picking up and using objects, and so on, I think could be designed to have more meaning than what we’ve seen in games to date. As examples, the stealth-based sub-genre of action games such as Thief and Splinter Cell, and thought-provoking quests such as Ico and Shadow of the Colossus offer more nuance and meaning from their physical action than what you find in typical games where the player is merely trying to destroy enemies and survive. The Tale-of-Tales folks are taking this approach, I believe.

That said, there are still so many things a player cannot express through physical action alone — the particular way a player feels about someone or something; descriptions of one’s self and others; ideas, thoughts, desires, fears, beliefs, the list goes on and on. Language and gesture are truly necessary tools for meaningful expression — there’s just no getting around that.

But wait, in regards to expanding the range of expression — what about offering players limited but context-dependent choices? Aren’t there only a few things a player would typically want to say at any one time anyway, and there’s no need to offer a broad range of expression at all times?

In fact, as I’m sure you’re all familiar with, the status quo interface for allowing players to non-physically (verbally) express themselves in a game is to have them choose from a context-specific, multiple-choice list of pre-written lines of dialog. Multiple-choice lists lead to subsequent multiple-choice lists, with the overall “conversation” having a branching-tree structure.
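To make the branching-tree structure concrete, here is a minimal sketch in Python of how a menu-driven conversation is typically represented; the node names and lines of dialog are entirely hypothetical, not taken from any particular game.

```python
# A minimal sketch of a branching dialog tree; all lines and names are
# hypothetical, not from any particular game.

class DialogNode:
    def __init__(self, npc_line, choices=None):
        self.npc_line = npc_line      # what the NPC says at this node
        self.choices = choices or []  # list of (player_line, next_node) pairs

# Each choice's meaning is hardwired into the tree, so the game always
# "understands" the selection -- but only these few options exist.
root = DialogNode(
    "So... how do you think the party is going?",
    choices=[
        ("It's great!",            DialogNode("Really? You're too kind.")),
        ("Honestly? A bit tense.", DialogNode("Tense? What do you mean?")),
        ("I'd rather not say.",    DialogNode("Fine, forget I asked.")),
    ],
)

def play(node):
    """Walk the tree by prompting the player with a numbered menu."""
    while node.choices:
        print(node.npc_line)
        for i, (line, _) in enumerate(node.choices, 1):
            print(f"  {i}. {line}")
        pick = int(input("> ")) - 1
        node = node.choices[pick][1]
    print(node.npc_line)
```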

The primary strength of the context-dependent menu approach, in my view, is that each choice is understood by the game (since its meaning is hardwired into the conversation’s branching-tree structure) and that each choice will (or should) lead to a believable response. In games in general, it is very important that the player feel understood, and not become frustrated by giving input that is misunderstood or ignored by the game.

But there are many drawbacks with context-dependent menus that we’ve blogged about before. A major one is that the player is limited to speaking only what the game designers have offered, typically only 3 or 4 choices at any one moment. Yet isn’t that enough? If the choices are changing every time something happens, aren’t there only 3 or 4 things I, as the player, would care to do at any one time anyhow?

The short answer is, no. I hear that argument made from time to time, and it’s shortsighted. When we analyze the range of what players try to say in Façade by studying their stageplays, conduct further design experiments, and follow our own design intuition, it becomes clear that there are a dozen or more different ways players want to be able to react to a given situation at any one moment; add nuanced variations to each of those dozen or more ways, and the number of unique player moves required at any one moment for satisfying play is in the hundreds.

Another problem with dialog menus is that if the pre-written phrases are written in a specific style, they may not feel like the player’s own voice. Or if the phrases are written in a generic style, they may feel like bland and uninteresting things to say.

These issues combine to make dialog menus fall short of giving players the means to meaningfully express themselves. (It should be noted that limiting the player’s expressiveness in this way makes the game far easier for developers to implement.)

A second approach to player expression is to eschew the first-person conversational natural language that dialog menus offer, and instead supply the player a set of high-level commands, with parameters, expressed in the second-person, such as “yell angrily”, “ask joe about the banana”, or “flirt with Henry”; see more examples here. If the interface allows the player to freely combine any command with a range of characters, objects and adverbs, then the resulting range of player expression is far greater than the multiple-choice list approach.
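As a rough illustration of what such a command interface involves under the hood, here is a hedged sketch of parsing free combinations of commands and parameters; the verb, character, object and adverb lists are invented for the example, not taken from any shipped game.

```python
# A rough sketch of a second-person command interface: a fixed verb set
# combined freely with characters/objects/adverbs. All vocabulary here is
# hypothetical, chosen to match the examples in the text.

VERBS   = {"yell", "ask", "flirt", "hug", "ignore"}
PEOPLE  = {"joe", "henry", "joanne"}
OBJECTS = {"banana", "martini"}
ADVERBS = {"angrily", "playfully", "nervously"}

def parse_command(text):
    """Return (verb, target, adverb), or None if no known verb is present."""
    words = text.lower().replace("with", "").replace("about", "").split()
    verb   = next((w for w in words if w in VERBS), None)
    target = next((w for w in words if w in PEOPLE | OBJECTS), None)
    adverb = next((w for w in words if w in ADVERBS), None)
    return (verb, target, adverb) if verb else None

print(parse_command("yell angrily"))              # ('yell', None, 'angrily')
print(parse_command("ask joe about the banana"))  # ('ask', 'joe', None) -- the object is dropped by this naive sketch
print(parse_command("flirt with Henry"))          # ('flirt', 'henry', None)
```

Even this toy version shows why the formal range of expression is so much larger than a 3-or-4-item menu: every verb can in principle combine with every target and adverb.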

Yet a major drawback of commands is, obviously, they are not natural language. The NPCs are speaking to you in their own natural words, yet you’d be speaking or acting back to them with commands, and often somewhat generic ones. You don’t have your own voice, or the means to express yourself in your own individual way with your nuance and style. I argue that a command interface feels less meaningful than if you could speak naturally, in your own words.

But wait — couldn’t the non-naturalness of a command interface be overlooked if the player’s expanded range of expression results in a greater range of interesting and meaningful responses from the NPCs and game? That is, can we sacrifice naturalness for the holy grail — greater agency?

Well, typically, in games with command interfaces, few combinations of the broad range of commands and parameters are actually supported at any one moment of the game. In any one moment, many command-parameter combinations are deflected, ignored or rejected. So although a command interface is formally more expressive than a multiple-choice list of natural dialog, pragmatically, players often only have a small number of meaningful choices at any one moment — and we’re back to the problems of only offering players a few context-specific choices, with the added frustration that you don’t know exactly what those choices are.

ALL of this actually comes back to the biggest technical challenge for creating more meaningful games: content. Once you expand the interface to allow players to express more than a few choices at one time, in service of making games more meaningful to players, the game will need a much richer capability for meaningful responses! This requirement is perhaps even more challenging to implement than the natural language interface itself. I briefly addressed this fundamental issue in answers 1 and 2 in the previous post.

Getting back to command-based interfaces… Games using the command-based approach include text-based interactive fiction and the LucasArts adventure games, though in both cases most of the commands are not for expressing the player’s feelings, attitudes, thoughts or ideas, but for physical action, such as pick up, use, examine, walk, etc. (Please point out any game examples with commands that support more non-physical action! Fable?) A new twist on the command interface is Chris Crawford’s Deikto for Storytron, allowing players to construct more complex expressions, akin to a sentence diagram.

These command-based games have the non-natural-language problem described above, at a minimum, and each has the deflect/ignore/reject issue as well, to varying degrees (well, how Storytron performs in this regard is not yet known, since interactive stories built with it haven’t been released yet).

That brings us to a third way — to give the player a totally open-ended natural language interface, in which players can speak (or type) anything they want. In such an interface, players could be given the freedom to speak at any time, as in Façade, or in a turn-taking fashion, as in chatterbots like Eliza and ALICE.
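For readers who haven’t poked at an Eliza-style system, here is a tiny turn-taking sketch in that chatterbot spirit; the keyword patterns are purely illustrative, not taken from Eliza or ALICE themselves.

```python
# A tiny Eliza-style, turn-taking sketch: the player types anything,
# and the bot answers from keyword patterns. Patterns are hypothetical.

import re

RULES = [
    (r"\bi feel (.*)",       "Why do you feel {0}?"),
    (r"\bi think (.*)",      "What makes you think {0}?"),
    (r"\b(mother|father)\b", "Tell me more about your family."),
]

def respond(utterance):
    """Return the first matching canned response, or a catch-all prompt."""
    for pattern, template in RULES:
        m = re.search(pattern, utterance.lower())
        if m:
            return template.format(*m.groups())
    return "I see. Go on."   # fallback when nothing matches

print(respond("I feel nervous about the party"))
# -> "Why do you feel nervous about the party?"
```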

Finally we have an interface where the player is given the means to express themselves in their own words, without limitation! Hallelujah! With such an interface, new genres of games can be created, with interaction centered around the comedies and dramas of real life, such as the kinds of content we find in many of the most popular TV shows (Friends, Seinfeld, Desperate Housewives). These new games can be about subject matter that has mass appeal (versus appealing only to fans of the action genre); suddenly games could become appealing to all those people who have had little or no interest in them to date. Think about all the people you know who don’t yet play videogames — perhaps your female friends who love good movies and TV shows but find action and puzzle games boring or juvenile; perhaps your parents or older folks, who like sending email but hesitate to pick up a controller, even a Wiimote; perhaps friends, men or women, who want something more sophisticated in their entertainment, but need it to be short and sweet, because they’re busy people.

The problem with this vision, of course, is that an open natural-language interface means little if the game isn’t listening. In the previous post, Black and White developer and now Sims 3 lead AI engineer Richard Evans commented the following:

Natural language creates a huge gap between the player’s expectation and the reality – this inevitably creates disappointment. Use of natural language in Facade was (imo) over-ambition which served to lessen the perceived quality of the other areas where Facade really innovated. In an interactive product, you want to innovate in one or two core areas, and be massively risk-averse in the others.

You write “Façade attempts to allow the player to speak anything they like to the characters”. This is strictly speaking untrue – Facade allows you to *type* anything you like to the characters, but most of it isn’t *spoken* to them, because they mostly don’t understand what you are saying. It allows you, for a brief moment, to *think* you can speak anything you like, but when you realize the (inevitable) limitations of the natural-language comprehension, it leaves you feeling disappointed. Psychologically, you want your player’s “Ooo-Wow” feeling to slowly increase over time, not to be unsustainably high for the first 4 minutes, and then tail off.

(As a side note, Richard is equating the term “speak to” with “be understood”, whereas I am considering “speak” or “type” to be the same as “utter”, separate from “be understood”.)

Let’s break these problems down, and talk about ways to improve them. How well these problems can be addressed will, I believe, determine the viability of an open-ended natural language interface in a commercial product that people are willing to pay money for.

First let me characterize the severity of the language understanding problem in Façade. When studying traces of Façade players, we found that the NPCs:

respond in a satisfying way roughly a third of the time;
respond in a merely fair, passable way roughly another third of the time;
and respond in a frustrating, unsatisfying way for the remaining third.

For a commercial product, a ratio of 33% satisfying, 33% fair, 33% frustrating won’t fly. I’d suggest we need to improve that to at least 60% satisfying, 30% fair, 10% frustrating, for people to pay money for it. My guess is many players could live with 10% frustrating, given the benefits of the interface.

Let me further break down the unsatisfying responses you get when interacting with the NPCs in Façade into the ways that things can go awry. Note that when the player feels frustrated, there is no easy way for them to know which of these problems is occurring. When things go awry, it could be that:

You might assume that Problem 1, literal misunderstanding, is what is going wrong most of the time when playing Façade, but that’s not true. We haven’t measured it precisely, but when a response is unsatisfying in Façade, a misunderstanding error is responsible approximately a third of the time; therefore, overall, this error occurs for ~20% of all things the player types (33% of 66%).

While a 20% misunderstanding rate is not as high as some of you might have thought, it’s still pretty high and needs to be improved.

Yet content, problem 2b above, is an even bigger problem to solve; it is the root cause of the remaining two-thirds of unsatisfying responses in Façade, equivalent to ~40% of all player inputs. (Issues 2a and 2c above are design choices that can be as present or absent as the developer wishes, especially if 2b is solved.)
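To spell out the arithmetic behind those figures (using the rough one-third / two-thirds split just described, not precisely measured numbers):

```latex
\begin{align*}
\underbrace{0.66}_{\text{not-satisfying responses}} \times \tfrac{1}{3} &\approx 0.22 \approx 20\% \text{ of all player inputs (parser misunderstanding)} \\
0.66 \times \tfrac{2}{3} &\approx 0.44 \approx 40\% \text{ of all player inputs (insufficient content)}
\end{align*}
```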

Okay, let me summarize my analysis so far:

Dialog menus and command-based interfaces both fall short of giving players the means to meaningfully express themselves in their own voice.
An open-ended natural language interface offers that expressivity, but only if the game can understand and respond to enough of what players say.
In Façade, roughly a third of responses are satisfying, a third fair, and a third frustrating; a commercial product needs something closer to 60% satisfying, 30% fair, 10% frustrating.
Of the unsatisfying responses, parser misunderstanding accounts for roughly 20% of all player inputs, and insufficient content for roughly 40%.

Alright, now I’ll give the core of my argument for this post, which also addresses the “is this the right interface, right now?” question:

Even modest success in achieving the content and parsing requirements mentioned above would justify use of an open-ended natural language interface for interactive drama/comedy, because the bulk of the players the games would ultimately appeal to — the largely untapped market of TV and movie lovers who dislike current games — would forgive the interface’s shortcomings to get a chance to finally play games that interest them.

Thoughts?

I’ll spend the rest of this post talking about how to improve the parser misunderstanding problem, one of the two problems requiring at least modest solutions. (Regarding the content problem, again I’ll refer you to my brief response to this issue in answers 1 and 2 of the previous post; it warrants much further discussion.)

(Note that the Façade parser, as well as The Party‘s parser, actually translates the player’s natural text input into discourse acts, essentially parameterized commands just like the “yell angrily”, “ask joe about the banana”, or “flirt with Henry” examples that I excoriated earlier. So why not just have players directly enter discourse acts? Well, even if the player’s natural language does ultimately get turned into discourse acts under the hood, the process for the player of entering natural language is a lot more, well, natural. Also, a single natural language input, such as saying “martinis are gross” in Façade, could translate into several discourse acts at once, such as “refer-to drinks”, “criticize Trip” and “agree Grace”, especially if Grace had just said “maybe you’d prefer a mineral water?” It would be cumbersome and unintuitive to manually enter all three of those discourse acts simultaneously with a command-based interface.)
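Here is a hedged sketch of the kind of mapping just described: a single natural-language utterance translating into several parameterized discourse acts at once. The rule format, context keys and act names are invented for illustration; they are not the actual Façade or The Party rules.

```python
# A hedged sketch of mapping one natural-language utterance to several
# parameterized discourse acts, in the spirit of the "martinis are gross"
# example above. Rules, context keys and act names are hypothetical.

def to_discourse_acts(utterance, context):
    """Map surface text plus conversational context to a list of discourse acts."""
    text = utterance.lower()
    acts = []
    if "martini" in text or "drink" in text:
        acts.append(("refer-to", "drinks"))
    if "gross" in text or "hate" in text:
        # Trip just offered the drink, so disliking it lands on him...
        if context.get("last_offer_by") == "Trip":
            acts.append(("criticize", "Trip"))
        # ...and agrees with Grace if she just suggested an alternative.
        if context.get("last_suggestion_by") == "Grace":
            acts.append(("agree", "Grace"))
    return acts

context = {"last_offer_by": "Trip", "last_suggestion_by": "Grace"}
print(to_discourse_acts("martinis are gross", context))
# -> [('refer-to', 'drinks'), ('criticize', 'Trip'), ('agree', 'Grace')]
```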

On Façade, surprisingly little time was spent developing the parser itself, also known as the Phase I parsing rules (see pdf). We only spent about 8 person-weeks on that part of the engine. For The Party, significant resources will be applied to improving this — hopefully 9 person-months or more. The technical details of how we plan to improve this are too involved to go into here, but I anticipate significant improvements.

(I should reiterate that we do not need to solve the Turing Test here. We “only” need to understand what will be typically spoken in the context of the drama/comedy being played. The range of input required for a specific domain, such as a crazy cocktail party, is still very large, but is hopefully only a small fraction of what passing the Turing Test would require.)

In addition to technical improvements to the parser, there are design techniques we can implement to lessen these misunderstanding errors.

First, in Façade, we limited the number of words players can utter to about 8 at any one time. This limited the complexity of what players could enter, making the job of the parser easier, but still allowing significant expressivity for the player. Admittedly, this results in an asymmetric interface: the NPCs can speak more than 8 words at a time to you, but you can only speak 8 words back. Nonetheless, I think this was a reasonable tradeoff that players generally accepted, and we plan to stick with it for The Party. (Note that with a speech interface, there would be no way to limit the length of the player’s input.)
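The cap itself is trivial to enforce on a typed interface; something along these lines would do, where the exact limit and rejection message are just placeholders.

```python
MAX_WORDS = 8  # roughly the cap Façade used on player utterance length

def accept_utterance(text):
    """Reject overly long utterances before they ever reach the parser."""
    if len(text.split()) > MAX_WORDS:
        return False, "Try saying that in fewer words."
    return True, text

print(accept_utterance("hey, what are you doing later tonight"))                  # accepted (7 words)
print(accept_utterance("i was wondering if maybe you would like to get coffee"))  # rejected (11 words)
```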

A major new feature we could add to the interface for The Party is an optional real-time display of how the player’s words will be interpreted, viewable before the player hits enter and actually “speaks” the dialog. For example, imagine that the player, a man in this case, has typed “hey, what are you doing later tonight” to Joanne, one of the more attractive characters in the game, but hasn’t hit “enter” yet. Beneath the player’s text appear the parameterized discourse acts “flirt with Joanne; hint at rendezvous”.

This real-time display would tell the player how their words will be interpreted before the player truly speaks them, giving the player a chance to backspace and re-word what they are saying if they wish. As another example: if the player, a woman this time, says something more complicated to an NPC named Fred, like “you remind me of my favorite cousin”, the interpretation display might only read “express positive; refer-to family”, which is not all she meant to say — i.e., the parser couldn’t understand the deeper meaning of that expression. So she might backspace and re-type something more direct, like “i like you, you’re funny”, which would display “praise Fred; ally Fred”.
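Here is a sketch of what that optional display might look like in code, using a stand-in parser; the keyword rules and act names below are hypothetical placeholders for whatever NLU mapping the game actually runs.

```python
# A sketch of the proposed real-time interpretation display: as the player
# types, show how their words would be read before they "speak" them.

def parse_to_acts(text):
    """Hypothetical stand-in parser: surface keywords -> discourse acts."""
    text = text.lower()
    acts = []
    if "like you" in text or "funny" in text:
        acts += [("praise", "Fred"), ("ally", "Fred")]
    elif "cousin" in text or "family" in text:
        acts += [("express", "positive"), ("refer-to", "family")]
    return acts

def preview(typed_so_far):
    """What the optional on-screen display would show beneath the input box."""
    acts = parse_to_acts(typed_so_far)
    if not acts:
        return "(not understood yet)"
    return "; ".join(f"{verb} {param}" for verb, param in acts)

print(preview("you remind me of my favorite cousin"))
# -> "express positive; refer-to family"  (the shallow reading in the example)
print(preview("i like you, you're funny"))
# -> "praise Fred; ally Fred"
```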

Arguably, displaying the interpretation of the player’s dialog could reduce immersion and the naturalness of the interface. It could break the illusion (to the extent it exists) that the game is understanding more than it actually is. I think focus testing is required here to help us understand whether this feature helps more than it hurts. (Some players may prefer it; in any case, it would be an optional feature.)

Another major new feature we could add to The Party is rewind. If the player says something and then gets a reaction they don’t like, they could have the option to rewind the action to just before the last thing they spoke, or 10 seconds back in time, whichever is shorter. The ability to undo a misinterpretation by the parser could really help alleviate players’ frustration when it happens. (We’d only offer one level of undo; the player is not given the option to rewind again until they speak something new, or another 10 seconds has gone by.)
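Here is a simplified sketch of how the one-level rewind might be implemented, assuming the engine can snapshot its state (the GameState object is hypothetical); it restores the state captured just before the player’s last utterance, within roughly a 10-second window.

```python
import copy, time

# A rough sketch of the proposed one-level rewind: snapshot the (hypothetical)
# game state just before each player utterance, and allow a single undo back
# to that snapshot while it is still recent.

class RewindBuffer:
    WINDOW = 10.0  # seconds the undo stays available

    def __init__(self):
        self._snapshot = None
        self._when = 0.0
        self._armed = False

    def before_player_speaks(self, game_state):
        """Capture state right before the utterance; re-arms the single undo."""
        self._snapshot = copy.deepcopy(game_state)
        self._when = time.time()
        self._armed = True

    def try_rewind(self):
        """Return the saved state if the single undo is still available."""
        if not self._armed or time.time() - self._when > self.WINDOW:
            return None
        self._armed = False   # only one level of undo until the player speaks again
        return self._snapshot

# Usage (with a hypothetical GameState object):
# buffer.before_player_speaks(game_state)   # called as the player hits enter
# restored = buffer.try_rewind()            # called when the player asks to rewind
```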

Rewind would also affect gameplay itself, hopefully enhancing it; it would be a mini-version of the replay value in the overall game itself. It would be fun to say something, get a reaction, rewind, say something else, see a different reaction, rewind again, etc. (Overall, both Façade and The Party are designed to be replayed many times, to see how things can turn out differently each time, with The Party offering much more replay value than Façade.)

Finally, although this may seem antithetical to my previous arguments, we could also expose the discourse acts to the player, offering an optional command-based interface. While this would have all the drawbacks of command interfaces I mentioned earlier, it could appeal to hardcore gamers who prefer the tradeoff of a crystal-clear interface over naturalness. A drawback would be that players would probably rarely enter multiple discourse acts at once, diminishing the range of reactions they could activate compared to expressing multiple simultaneous discourse acts via natural language. But still, it may be a good option for those players, also worth focus testing.

In sum, I hope I’ve made a reasonable case for why an open-ended natural language interface could be viable for a commercial interactive drama/comedy project. Ultimately it relies on creating a moderately successful parser (e.g. with no greater than a 10% failure rate), and on the NPCs and drama manager being able to respond richly enough to do justice to the player’s now greater range of expressivity.