March 3, 2004
Silverman on Evaluating Interactive Fiction/Drama/Games
University of Pennsylvania Professor Barry Silverman has contributed a note on a very important topic for those doing engineering or science work on Grand Text Auto-style systems: how can you evaluate your development effort so that a granting agency (or thesis committee, or journal) can see if you made some measurable improvements?
Not too long ago, I became interested in the topic of how to evaluate an interactive fiction or drama-based game that might also have training value. This came about because I was applying to NIH for a grant to create a game and they wanted any game to prove it had value. My own thesis was that the training value would be enhanced to the extent that it was stealth learning, that the users were engaged/transported, and to the extent they were entertained. We compromised and the sponsors had me place a psychometrician and a narratologist on the team. It is now three years later, we completed the game (see here or here on this site), a clinical trial with 200 subjects is currently underway, and to conduct the trial the team had to produce a number of instruments by which to evaluate the game.
In particular, we developed various new survey instruments to capture user reactions to the game and shifts in their know-how and behavioral intent. By comparing across various experimental arms (i.e., game, movie version, pamphlet, etc.) we hope to capture the impact of the game relative to other media. Overall, in addition to a demographics instrument we have developed instruments for metrics in both training dimensions (Knowledge, Stated Intent, Willingness to Pay) as well as in aesthetic dimensions (Narrative Engagement, Game Entertainment, Usefulness, and Usability). We also make use of two previously developed instruments on decisional conflict and need for cognition. The nine draft instruments are provided in a recent report I have posted to my website – select “Evaluation” tab under http://www.seas.upenn.edu/~barryg/heart/index.html …
… It is in draft form until the instruments are fully validated. However, these instruments may be of use to a number of other researchers since, by substituting in keywords, one can readily adapt them for other drama-based games as well.
A few points about the design of the instruments might be in order here. First off, we approached this task as one would a multi-dimensional decision space – a fairly traditional approach in psychometrics. Each instrument contains dimensions and sub-dimensions, items to measure on these sub-dimensions, and questions that are the qualitative elicitation of user scores for the items. As an example, one of our instruments is for Engagement (ENG). One can think of engagement (ENG) as the strength of the bond, or how compelling the experience of the artifact is (can the user stop easily): the level to which one is personally involved in and affected by a story. In the literature on engagement, some researchers distinguish between things like immersion and presence or transport (e.g., see Whitton, 2003). The former refers to what the technology delivers (3-D effects, sound quality, etc.), while the latter refers to the absorption of or impact upon mental state of the user (Green & Brock, 2004). We think this distinction is relevant, but there is another dimension that is equally important in story-worlds, since it is not just the immersive technology that impacts presence: it’s also the 3Ps of the story (plot, people, place). Books have no immersive technology yet readers often find themselves transported, and compellingly so.
The result of all this is that there are several sub-dimensions that seem worth trying to introduce into the design. Overall, we might state that ENG = T + P1 + P2 + P3, where the terms are:
- TRANSPORT (T) – Defines the degree of ‘presence’ at the top level. According to Green & Brock (2000), transport is the phenomenological experience of being absorbed in a story.
- PLOT (P1) – Degree to which one is interested in themes raised and conflicts resolved.
- PEOPLE (P2) –
a) Degree to which characters are “like me” and/or “like people close to me.”
b) Degree to which the characters are interesting to me.
c) Degree to which character development occurs.
- PLACE (P3) –
a) Degree to which one can transfer the story experience to one’s real life.
b) Degree to which actual sensate experience (i.e., sound, visuals, smells, terrain, sets, etc.) of story is “like real.” This is what others measure as ‘immersion’.
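As a rough illustration of how the additive form ENG = T + P1 + P2 + P3 might be computed from survey responses, here is a minimal Python sketch. Everything in it beyond the formula itself is an assumption made for illustration: the 1-7 response scale, the item keys (T, P1, P2a, and so on), the use of unweighted means, and the function names are hypothetical, and the actual scoring procedure is whatever the tech report specifies.

```python
from statistics import mean

# Hypothetical grouping: each term of ENG maps to the sub-dimension keys
# from the list above. The keys themselves are illustrative names.
TERMS = {
    "T": ["T"],
    "P1": ["P1"],
    "P2": ["P2a", "P2b", "P2c"],
    "P3": ["P3a", "P3b"],
}


def term_score(responses, parts):
    """Average the per-sub-dimension means that make up one term."""
    return mean([mean(responses[p]) for p in parts])


def engagement_score(responses):
    """Return each term plus their unweighted sum, ENG = T + P1 + P2 + P3."""
    scores = {name: term_score(responses, parts) for name, parts in TERMS.items()}
    scores["ENG"] = sum(scores.values())
    return scores


# Made-up responses for one participant, on an assumed 1-7 scale.
example = {
    "T": [6, 5, 7],
    "P1": [4, 6],
    "P2a": [5],
    "P2b": [6],
    "P2c": [4],
    "P3a": [3, 5],
    "P3b": [6],
}

result = engagement_score(example)
print(result["ENG"])
```

Averaging within each term before summing keeps a term with many questions from dominating one with few. Whether the terms should instead be weighted is the kind of question the validation work would settle. The keys T, P1, P2a-c, and P3a-b simply mirror the items in the list above.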
These are the resulting items that questions are then constructed around, and which are now being validated through a number of efforts as the tech report elaborates. Also, the report goes into more detail on the derivation of some of these sub-dimensions for the ENG instrument and on the process for each of the other instruments as well. Later this year when the trial wraps up, our team hopes to be in a position to publish this material in some journal papers – some on the user reactions to the game, some on the validation of the instruments. In the interim, I would love to hear others’ reactions to this or their experiences with attempting to evaluate drama games in formal, measurable ways.
For those who have thoughts on this topic, I will add my invitation to Barry’s — please let us know how you’ve thought to go about quantitatively evaluating your work.
My own monolithic approach in my Master’s work at MIT was to just test students’ improvement when they used the interactive conversational character I developed vs. when they used a system that made all the same information available but which lacked the character interface. It’s nice and direct, but there are several problems with a blunt instrument like this one. For instance, who is to say that any improvement that is seen is actually due to the superiority along any dimension of the interface, rather than just the novelty? The development of more detailed instruments of the sort that Barry outlines here (and details in his report) could be an effective way to use quantitative techniques to highlight which particular aspects of a project are working and which ones need improvement.
March 3rd, 2004 at 4:50 pm
This is a hard question; I’ll be curious to see if anyone has a good answer for it.
As someone who has mostly worked in industry or independently, to me the issue of evaluation has long seemed to be a dilemma for artistically-oriented engineering academic projects, as you point out, both with regard to getting funding and for understanding the value of the research.
In industry, evaluation is usually measured in sales, of course, or otherwise is informal, with the occasional focus test to gather some data. (I think some, such as Brenda Laurel, think industry should be much more rigorous about doing user-oriented research/evaluations, in order to improve the design of the product.)
It occurs to me to ask, when artists in academia (technology-oriented or not) are applying for art grants, to what extent do they have to offer evaluations of their work? They don’t need to “measure” the value of their work in the user-testing-style terms Barry is describing. Is the value of artwork, vis-a-vis grants, in terms of reception in the art world (i.e., number of shows, awards, etc.)? I assume success in that domain, even for technology-heavy art/creative projects, holds little or no influence for getting money from traditional engineering sources of funding.
Also, I think it’s often the case that artistically-oriented engineering research tends to not have a strong artwork associated with it, at least at first, and so it’s probably tough to get funding from art sources. Not that you’d get enough money from them anyway to support costly engineering research.
This seems to be a problem of a hybrid discipline.