October 18, 2007
More on GTxA the Show
I thought the opening of the Grand Text Auto group show at the Beall Art+Tech gallery went very well, especially when you consider how elaborate the three large installations were. All of the artworks worked, installations and otherwise! And they were physically arranged to fit nicely in a somewhat small space, without feeling overly cramped. Thanks again to all those who put so much time into organizing and setup. (I wasn’t one of them. ;-)
As I hoped would happen, I found it really interesting to experience our various literary and ludic works together in one place. I think they play off each other in both obvious and non-obvious ways. For example, you've got [giantJoystick], an overgrown re-creation of the Atari 2600 controller, which thirty years ago was probably the first exposure to digital media for all of us, side-by-side with a proto-Holodeck, AR Façade, as futuristic a working prototype of digital fiction as I've ever played; perhaps we'll see a fully-working version of such a thing when we're all old and gray. You've got Screen, a story constructed as a massive black rectangle with flying white text, floating in the middle of the gallery, next to Implementation, a story constructed as a massive collection of small white rectangles and black text, mounted on the gallery walls, the walls outside the gallery, and the wide world beyond.
I also really like how the blog literally has a presence in the gallery, with gallery visitors’ guest entries showing up here on the blog, as well as being printed out and hung on the gallery walls.
The symposium was good too, although I felt it was just the start of what could be a bigger and better one. We each presented our current lines of work and concerns (the text of mine is at the end of this post), followed by some freewheeling discussion (blogged by Mark at WRT) responding to thoughts and questions from the audience. That was great, though in a future gathering I'd also love to have each of us directly address and discuss a set of focused topics. (Noah brought a few seed topics for us, but we didn't really get to them.) Some of us did manage to admit various feelings of guilt about this or that, so it was therapeutic in that way. ;-)
Here are a few comments on Augmented Reality Façade, which I played for the first time at the show. (Don’t miss Scott and his Unknown collaborators’ experience with it.)
I mentioned to Blair MacIntyre and Steven Dow of the GVU Center at Georgia Tech, where the AR was put into AR Façade, how remarkably suited the original Façade was for augmented reality adaptation! The drama takes place in a single location, with only two characters, and is primarily dialog driven, making minimal use of physical action, other than the player walking around. These were all design decisions we had made long ago, for reasons other than AR adaptation (see our initial 2000 paper for details), so it was really interesting to see how well they applied to AR.
For me, both the highlights and problems of AR Façade were in the interface. A highlight: being able to reach my arm out and put it around Grace's shoulders, to hug her, was really cool. That was a big improvement over clicking on Grace to hug her in the original desktop version. I thought Blair, Steve and the rest of the AR team did a good job merging the animation of Grace and Trip with the real-life environment; the head tracking worked pretty well, and I didn't feel any vertigo or motion sickness.
The primary problem I found with AR Façade, a lab prototype making its first public showing with understandably rough edges, is the delay between the player speaking her dialog and that dialog actually being transmitted to Grace and Trip. In AR Façade, a human "wizard" is hidden behind a curtain, listening to the player speak and watching her perform via a mic and surveillance camera. The wizard, in real time, manually enters the text of what the player is saying to Grace and Trip, as well as the actions she takes, e.g., hugging. Much of the time, though, the text reached the AI 5+ seconds after being spoken — the time it takes for the wizard to listen and type it in, plus any required simplification of the phrasing. Recall that the original AI (basically unaltered for AR Façade) can only handle about 8 words at a time, as frequently as every 5 seconds or so; of course, people often speak faster and at greater length than that. (Also, at the time of the opening, the wizards were mostly newbie gallery docents just learning how to be wizards; over time they will surely get better at it.)
So AR Façade suffers from a third layer of potential communication breakdown, the wizard intermediary, on top of Façade's original two potential breakdowns that I outlined in a past post on NLU interfaces: 1) at times the NPCs may not literally understand your words, due to parser/NLU limitations, and 2) even when they do understand your words, at the particular moment you spoke they may be unable to respond, or able to respond only in a limited way, for a variety of reasons such as content limitations.
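To make the layering concrete, here's a toy Python sketch of the three failure points, one function per layer. The function names and the tiny phrase table are my illustrative assumptions, not Façade's actual code:

```python
from typing import Optional

def wizard_transcribe(spoken: str) -> str:
    """Layer 3 (AR Façade only): the wizard hears, simplifies, and types
    the player's speech, adding several seconds of delay."""
    return spoken.strip().lower()

def nlu_parse(text: str) -> Optional[str]:
    """Layer 1: map surface text to a discourse act; may fail on phrasing
    the parser doesn't cover."""
    phrase_table = {"hi trip": "GREET", "i agree": "AGREE"}  # toy stand-in
    return phrase_table.get(text)

def select_response(act: Optional[str], beat_accepts: set) -> str:
    """Layer 2: even a recognized act may get only a limited response if
    the current dramatic beat has no content for it."""
    if act is None:
        return "deflect"               # breakdown 1: words not understood
    if act not in beat_accepts:
        return "weak acknowledgement"  # breakdown 2: understood, but no content now
    return f"full response to {act}"
```

Each layer can degrade the exchange independently, which is why adding the wizard compounds rather than replaces the original two breakdown types.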
Additionally, as a technology-based artwork situated in a gallery, it is cumbersome that AR Façade requires a human wizard to make it run, which arguably diminishes the AI appeal of the piece, even though the original AI is still in operation. (Voice recognition doesn’t work well enough to replace the wizard, especially for emotionally-rich speech.)
Still, all that said, AR Façade is a remarkable combination of technologies and designs, a real taste of what the Holodeck could be like. Like Façade, it is a research/art experiment, for which many people are willing to forgive the rough edges in order to experience a new form of digital fiction. And it's free to play until mid-December, so go check it and the rest of the show out!
Here is the text of my future-directions presentation at the symposium. I had pre-written it as a blog post, in the spirit of the show being born from the blog. (Moved to its own top-level blog post. -ed)
October 20th, 2007 at 10:16 pm
During our initial design discussions for AR Façade, we talked a lot about how the design of Façade nicely fit many of the design constraints for AR experiences (as well as allowing us to sidestep many of the technical difficulties that continue to plague AR, such as the need for precise alignment of virtual objects with the real world). For those not able to make it to the Beall show, there's decent documentation of the project on the AR Façade website. Our ACE 2006 paper describes the technical design of AR Façade, and includes a discussion of both the helpful and challenging aspects of Façade's design from the point of view of doing an AR adaptation.
We're very aware that the wizard interface introduces another layer of noise and delay in the interaction, and that the fact that the constraints imposed by Façade's NLU are not made explicit in the (wizarded) voice interface is particularly problematic. However, it's interesting to note that even if we'd somehow magically had a robust speech recognition solution, this would not have solved the problem. Speech recognizers don't recognize words one by one (they use language models containing statistics of word co-occurrences to compute the maximally probable textual representation of a speech signal), so a "display words as they are recognized and show when you hit a boundary" interface wouldn't work.

We talked about creating a speech recognition interface in which, after your speech has been "recognized" (in our case, by the wizard), the text would appear in a buffer at the bottom of the screen, with a truncation indication if it hit the buffer boundary; the player would then explicitly commit the recognized text, the speech equivalent of hitting the enter key in the text interface. But we decided not to implement this (at least for the first interface approach), since it requires giving the player an additional interface element to operate (such as a clicker held in the hand, or meta speech commands not heard by the characters). This certainly begins interfering with the "naturalness" of the speech interface; it becomes something you have to train on (like typing). All this is to say that speech doesn't magically make interface issues go away; if anything, it makes the interface design issues more pressing, because a speech interface is supposed to be all about doing away with interface (you know, "natural", "transparent").
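The commit-buffer idea could be sketched roughly like this. This is a hypothetical illustration, not the actual AR Façade code; the class name, word limit, and truncation marker are all my assumptions:

```python
MAX_WORDS = 8  # Façade's approximate per-utterance limit

class SpeechBuffer:
    """Hypothetical commit-buffer: recognized (or wizard-typed) text is
    displayed, truncated at the word limit, and sent to the characters
    only on an explicit commit (the speech equivalent of hitting Enter)."""

    def __init__(self):
        self.words = []
        self.truncated = False

    def recognize(self, text):
        # Accumulate recognized words, flagging overflow past the limit.
        for w in text.split():
            if len(self.words) < MAX_WORDS:
                self.words.append(w)
            else:
                self.truncated = True  # show a truncation marker to the player

    def display(self):
        return " ".join(self.words) + (" […]" if self.truncated else "")

    def commit(self):
        # Send the buffered utterance and clear for the next one.
        utterance = " ".join(self.words)
        self.words, self.truncated = [], False
        return utterance
```

The commit step is exactly the extra interface element (a clicker, or a meta speech command) that the comment above argues works against the "naturalness" of speech.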
Our CHI 2007 paper did a detailed comparison study of players' reactions to traditional desktop Façade, a wizarded speech-recognition version of desktop Façade (just like the original, except you talk to it instead of typing) and AR Façade. The big result is that an increased sense of presence does not necessarily yield increased engagement. In the HCI community there is a long-held belief that the more present you can make someone feel in an experience (and this is tied to notions of naturalness and transparency in the interface), the better, meaning more effective and more engaging, the experience will be. In our study we found this is not the case, and that this was not an effect of immature technology (if the underlying technologies were perfect, you'd still see the effect). It's no surprise to media artists that mediation has positive value, and that you should explicitly design your mediation instead of wishing it away. But within the discourse of HCI, this has rarely been discussed.
We also have an AAMAS 2007 paper, using the same AI traces and interviews as the CHI study, on the effectiveness of Façade's NLP. Our approach was to note places in the retrospective protocols where players noted breakdowns in the conversation, and to correlate these with what was happening in the AI. There's not a simple one-to-one relationship between failures in AI processing and perceived breakdowns in the experience. Sometimes the NLP fails (e.g., fails to understand an utterance) and the player notices nothing unusual in the conversation; sometimes the NLP does exactly what it's supposed to do, and the player is confused and feels like they are experiencing a bug (a conversation-design bug, rather than an NLP failure). We wanted to understand this relationship in detail. The paper reports some interesting results.
I also just want to point out that as a member of the AR Façade team and a GVU faculty member during the time the project was done, I was fully involved in the development of AR Façade. However, Steven Dow, a Ph.D. student at Tech, and someone who will be on the job market soon (you should hire him), was responsible for bringing the project to fruition. He served as the unofficial project manager, bringing all the pieces together, making sure that all the technology, experience design and physical design needed to pull off this project melded together into a coherent unity.
October 26th, 2007 at 1:54 pm
I'm glad you enjoyed AR Façade, Andrew. And yes, thanks Michael: readers out there should indeed hire me.
Speech interfaces for interactive drama remain an interesting challenge. There are at least three reasons speech recognition is not feasible (at least in Façade): a) recognition is difficult in uncontrolled, noisy environments; b) Façade limits the user to about 8 words, and there is no way to effectively truncate the user in the middle of a speech act; and c) Façade allows unconstrained language, whereas most speech recognition software relies on limited vocabularies. To maintain some illusion of conversation, we must employ wizard methods. The question is how to structure that interaction. As Michael suggests, we could give the player a handheld device with two buttons: one to enter the text when it finally appears, and another to clear it if the characters have already moved on to a different topic. Our earlier study points out that "natural" interfaces may not ultimately be what the player needs to have proper hooks into the game interaction, so I'm not against adding buttons. Timing is key. In our studies of desktop Façade, we noticed many players typed in words only to erase them later; this happened a lot. The explicit act of "entering" the text has no equivalent in the delayed speech interface.
I am actually investigating another approach. What if we tear out the NLP completely and task our wizard with selecting high-level discourse acts? So, instead of typing "Hi Trip, good to see you", the wizard would select the built-in Façade construct called "GREET". Façade has about 30 such discourse acts, with parameters. This will force the wizard to be an expert with the interface and to know the story well, but it's possible that this approach could lower the chance of communication breakdowns. It will in most cases be faster than typing, and it will likely eliminate some of the NLU breakdowns (the type 1 breakdowns Andrew mentions above). We have implemented this as a separate page in the wizard interface, and it's currently an option available to the docents at the Beall, so we will see. My initial observation is that some time delay still exists, and that the player still needs some sort of feedback as to when the system "hears" their statement. I will report on what I find later.
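A discourse-act wizard panel might look something like the sketch below. GREET is a real Façade construct mentioned above, but the other act names, their parameters, and the validation logic here are purely illustrative assumptions, not the actual wizard interface:

```python
# Hypothetical catalog mapping discourse acts to their expected parameters.
DISCOURSE_ACTS = {
    "GREET": [],               # e.g., "Hi Trip, good to see you"
    "AGREE": ["character"],    # agree with Trip or Grace
    "DISAGREE": ["character"],
    "REFER_TO": ["topic"],     # bring up a story topic
}

def wizard_select(act, **params):
    """Validate the wizard's selection against the act's expected parameters,
    then hand the structured act to the drama manager in place of parsed text."""
    if act not in DISCOURSE_ACTS:
        raise ValueError(f"unknown discourse act: {act}")
    missing = [p for p in DISCOURSE_ACTS[act] if p not in params]
    if missing:
        raise ValueError(f"{act} needs parameters: {missing}")
    return (act, params)
```

Since the wizard picks from a fixed, validated menu, the type 1 (parser) breakdowns drop out by construction; what remains is the selection delay and any misreading of the player's intent by the wizard.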
This approach certainly does diminish the AI appeal of the piece to some extent. We tore out the natural language parser, but most of the AI engine remains untouched. In my view, we are still presented with an opportunity to riff on a player’s expressive input. Only now, we are using an intermediary–the wizard–who not only interprets the meaning of the player’s speech act, but can also read into the player’s gestures and emotions. If we did, in fact, enable a wider range of expressive possibilities for the player, how would this change the design of the behaviors for Trip and Grace?