Computing Applications Human-computer etiquette: managing expectations with intentional agents

Unspoken Rules of Spoken Interaction

Body language and familiar silent signals are as much a part of social experience as the conversation. Building systems to recognize and respond to such moves will propel interface technology to the next horizon.

By Timothy W. Bickmore

Posted Apr 1 2004


Our face-to-face interactions with other people are governed by a complex set of rules, of which we are mostly unaware. For decades now, social scientists have been unraveling the threads of face-to-face interaction, investigating everything from descriptions of body posture used to indicate interest in starting a conversation, to eye gaze dynamics used to convey liking or disliking, to the myriad ways that language can convey attitude, social status, relationship status, and affective state. Even though we are not always aware of them, these rules underpin how we make sense of and navigate in our social world. These rules may seem uninteresting and irrelevant to many computer scientists, but to the extent that a given interaction rule is universally followed within a user population, it can be profitably incorporated into a human-machine interface in order to make the interface more natural and intuitive to use. Computers without anthropomorphic faces and bodies can (and already do) make use of a limited range of such rules, such as rules for conversational turn-taking in existing interfaces, but one kind of interface has the potential to make explicit, maximal use of these rules: embodied conversational agents (ECAs).

ECAs are animated humanoid computer characters that emulate face-to-face conversation through the use of hand gestures, facial display, head motion, gaze behavior, body posture, and speech intonation, in addition to speech content [5]. The use of verbal and nonverbal modalities gives ECAs the potential to fully employ the rules of etiquette observed in human face-to-face interaction. ECAs have been developed for research purposes, but there are also a growing number of commercial ECAs, such as those developed by Extempo, Artificial Life, and the Ananova newscaster. These systems vary greatly in their linguistic capabilities, input modalities (most are mouse/text/speech input only), and task domains, but all share the common feature of attempting to engage the user in natural, full-bodied (in some sense) conversation.

Social scientists have long recognized the utility of making a distinction between conversational behaviors (surface form, such as head nodding) and conversational function (the role played by the behavior, such as acknowledgement). This distinction is important if general rules of interaction are to be induced that capture the underlying regularities in conversation, enabling us to build ECA architectures that have manageable complexity, and that have the potential of working across languages and cultures. This distinction is particularly important given that there is usually a many-to-many mapping between functions and behaviors (for example, head nodding can also be used for emphasis and acknowledgment can also be indicated verbally).
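The many-to-many mapping can be made concrete with a small data structure. The following Python sketch is illustrative only: the head-nod and verbal-acknowledgment pairings come from the text above, while the eyebrow-raise entry and all identifier names are assumptions added for the example.

```python
from collections import defaultdict

# (function, behavior) pairs. The eyebrow-raise entry is a hypothetical
# addition; the others restate the examples given in the text.
PAIRS = [
    ("acknowledgment", "head_nod"),
    ("acknowledgment", "verbal_ok"),
    ("emphasis", "head_nod"),
    ("emphasis", "eyebrow_raise"),
]

def build_maps(pairs):
    """Index the pairs in both directions so either side can be looked up."""
    behaviors_for = defaultdict(set)   # function -> behaviors that can realize it
    functions_of = defaultdict(set)    # behavior -> functions it can serve
    for function, behavior in pairs:
        behaviors_for[function].add(behavior)
        functions_of[behavior].add(function)
    return behaviors_for, functions_of

behaviors_for, functions_of = build_maps(PAIRS)
print(sorted(behaviors_for["acknowledgment"]))  # two ways to acknowledge
print(sorted(functions_of["head_nod"]))         # two roles for one nod
```

Keeping rules at the level of functions, and resolving them to behaviors only at the last step, is one way an architecture could stay manageable while the behavior side of the table varies by language or culture.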

Although classical linguistics has traditionally focused on the conveying of propositional information, there are actually many different kinds of conversational function. The following list reviews the functions most commonly implemented in ECAs, along with the behaviors associated with each:

Propositional functions of conversational behavior involve representing a thought to be conveyed to a listener. In addition to the role played by speech, hand gestures are used extensively to convey propositional information either redundant with, or complementary to, the information delivered in speech. In ECA systems developed to date, the most common kind of hand gesture implemented is the deictic, or pointing gesture. Steve [10], the DFKI Persona [1], and pedagogical agents developed by Lester et al. [7], use pointing gestures that can reference objects in the agent’s immediate (virtual or real) environment.

Interactional functions are those that serve to regulate some aspect of the flow of conversation (also called “envelope” functions). Examples include turn-taking functions, such as signaling intent to take or give up a speaking turn, and conversation initiation and termination functions, such as greetings and farewells (used in REA, see “REA the Polite Real Estate Agent”). Other examples are “engagement” functions, which serve to continually verify that one’s conversational partner is still engaged in and attending to the conversation, as implemented in the MEL robotic ECA [11]. Framing functions (enacted through behaviors called “contextualization cues”) serve to signal changes in the kind of interaction taking place, such as problem-solving talk versus small talk versus joke-telling, and are used in the FitTrack Laura ECA (see “Managing Long-Term Relationships with Laura”).

Attitudinal functions signal liking, disliking, or other attitudes directed toward one’s conversational partner (as one researcher put it, “you can barely utter a word without indicating how you feel about the other”). One of the most consistent findings in this area is that the use of nonverbal immediacy behaviors—close conversational distance, direct body and facial orientation, forward lean, increased and direct gaze, smiling, pleasant facial expressions and facial animation in general, head nodding, frequent gesturing, and postural openness—projects liking for the other and engagement in the interaction, and is correlated with increased solidarity [2]. Attitudinal functions were built into the FitTrack ECA so it could signal liking when attempting to establish and maintain working relationships with users, and into the Cosmo pedagogical agent to express admiration or disappointment when students experienced success or difficulties [7].


Affective display functions. In addition to communicating attitudes about their conversational partners, people also communicate their overall affective state to each other using a wide range of verbal and nonverbal behaviors. Although researchers have widely differing opinions about the function of affective display in conversation, it seems clear it is the result of both spontaneous readouts of internal state and deliberate communicative action. Most ECA work in implementing affective display functions has focused on the use of facial display, such as the work by Poggi and Pelachaud [8].

Relational functions are those that either indicate a speaker’s current assessment of his or her social relationship to the listener (“social deixis”), or serve to move an existing relationship along a desired trajectory (for example, increasing trust, decreasing intimacy, among others). Explicit management of the ECA-user relationship is important in applications in which the purpose of the ECA is to help the user undergo a significant change in behavior or cognitive or emotional state, such as in learning, psychotherapy, or health behavior change [3]. Both REA and Laura were developed to explore the implementation and utility of relational functions in ECA interactions.

While it is easiest to think of the occurrence (versus non-occurrence) of a conversational behavior as achieving a given function, conversational functions are often achieved by the manner in which a given behavior is performed. For example, a gentle rhythmic gesture communicates a very different affective state or interpersonal attitude compared to a sharp exaggerated gesture. Further, while a given conversational behavior may be used primarily to effect a single function, it can usually be seen to achieve functions from several (if not all) of the categories listed here. A well-told conversational story can communicate information, transition a conversation into a new topic, convey liking and appreciation of the listener, explicate the speaker’s current emotional state, and serve to increase trust between the speaker and listener.

The Rules of Etiquette

Within this framework, rules of etiquette can be seen as those conversational behaviors that fulfill certain conversational functions. Emily Post would have us believe the primary purpose of etiquette is the explicit signaling of “consideration for the other”—that one’s conversational partner is important and valued [9]—indicating these behaviors enact a certain type of attitudinal function. Etiquette rules often serve as coordination devices (for example, ceremonial protocols) and can be seen as enacting an interactional function. They can also be used to explicitly signal group membership or to indicate a desire to move a relationship in a given direction, in which case they are fulfilling a relational function. Each of these functions has been (partially) explored in existing ECA systems.

Is etiquette—especially as enacted in nonverbal behavior—important in all kinds of human-computer interactions? Probably not. However, for tasks more fundamentally social in nature, the rules of etiquette and the affordances of nonverbal behavior can certainly have an impact. Several studies of mediated human-human interaction have found that the additional nonverbal cues provided by video-mediated communication do not affect performance in task-oriented interactions, but in interactions of a more relational nature, such as getting acquainted, video is superior [12]. These studies have found that for social tasks, interactions were more personalized, less argumentative, and more polite when conducted via video-mediated communication, that participants believed video-mediated (and face-to-face) communication was superior, and that groups conversing using video-mediated communication tended to like each other more, compared to audio-only interactions. The importance of nonverbal behavior is also supported by the intuition of business people who still conduct important meetings face-to-face rather than on the phone. It would seem that when a user is performing these kinds of social tasks with a computer, an ECA would have a distinct advantage over non-embodied interfaces.


Conclusion

Will users willingly engage in a social chat with an animated real estate agent or tell their troubles to a virtual coach? Evidence to date indicates the answer is, for the most part, yes. In the commercial arena, people have shown willingness to engage artifacts such as Tamagotchis, Furbies, and robotic baby dolls in ever more sophisticated and encompassing social interactions. Experience in the laboratory also indicates users will not only readily engage in a wide range of social behavior appropriate to the task context, but that the computer’s behavior will have the same effect on them as if they had been interacting with another person [35]. This trend seems to indicate a human readiness, or even need, to engage computational artifacts in deeper and more substantive social interactions.

Unfortunately, there is no cookbook defining all of the rules for human face-to-face interaction that human-computer interface practitioners can simply implement. However, many of the most fundamental rules have been codified in work by linguists, sociolinguists, and social psychologists (for example, [2]), and exploration that makes explicit use of these rules in work with ECAs and robotic interfaces has begun. By at least being cognizant of these rules, and at most by giving them explicit representation in system design, developers can build systems that not only are more natural, intuitive, and flexible to use, but also result in better outcomes for many different kinds of tasks.

Sidebar: REA the Polite Real Estate Agent

REA is a virtual real estate agent who conducts initial interviews with potential home buyers, then shows them virtual houses she has for sale [4]. In these interviews—based on studies of human real estate agent dialogue—REA is capable of using a variable level of etiquette, which in turn conveys varying levels of sensitivity to users’ “face needs” (needs for acceptance and autonomy). If the etiquette gain is turned up, she starts the conversation with small talk, gradually eases into the real estate conversation, and sequences to more threatening topics, like finance, toward the end of the interview. If the etiquette gain is turned down, her conversational moves are entirely driven by task goals, resulting in her asking the most important questions first (location and finance) and not conducting any small talk whatsoever. The amount of etiquette required at any given moment is dynamically updated each speaking turn of the conversation based on an assessment of the relationship between REA and the user, and how it changes as different topics are discussed.

Figure. REA interviewing a buyer.

REA’s dialogue planner is based on an activation network that integrates information from the following sources to choose her next conversational move:

Task goals. REA has a list of prioritized goals to discover the user’s housing needs in the initial interview. Conversational moves that directly work toward satisfying these goals (such as asking interview questions) are preferred (given activation energy).

Logical preconditions. Conversational moves have logical preconditions (for example, it makes no sense for REA to ask users how many bedrooms they want until she has established they are interested in buying a house), and are not selected for execution until all of their preconditions are satisfied. Activation energy flows through the network to prefer moves able to be executed (“forward chaining”) or that support (directly or indirectly) REA’s task goals (“backward chaining”).

Face threat. Moves expected to cause face threats to the user, including threats due to overly invasive topics (like finance), are not preferred.

Face threat avoidance. Conversational moves that advance the user-agent relationship in order to achieve task goals that would otherwise be threatening (for example, small talk and conversational storytelling to build trust) are preferred.

Topic coherence. Conversational moves that are somehow linked to topics currently under discussion are preferred.

Relevance. Moves that involve topics known to be relevant to the user are preferred.

Topic enablement. REA can plan to execute a sequence of moves that gradually transition the topic from its current state to one that REA wants to talk about (for example, from talk about the weather, to talk about Boston weather, to talk about Boston real estate). Thus, energy is propagated from moves whose topics are not currently active to moves whose topics would cause them to become current.
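To make the interplay of these sources easier to picture, here is a deliberately simplified Python sketch. It is not REA's planner: a flat weighted score stands in for the activation network, forward and backward chaining and topic enablement are left out, and every name, weight, and example move is an assumption invented for illustration. It shows only how an etiquette gain could tilt the choice between task-driven questions and relationship-building small talk once preconditions are met.

```python
from dataclasses import dataclass, field

@dataclass
class Move:
    name: str
    topic: str
    task_value: float = 0.0     # how directly the move serves a task goal
    face_threat: float = 0.0    # expected threat to the user's face needs
    builds_trust: bool = False  # e.g. small talk or storytelling
    preconditions: set = field(default_factory=set)

def score(move, current_topic, relevant_topics, etiquette_gain):
    """Hypothetical scoring: task value, minus gain-scaled face threat,
    plus bonuses for trust building, topic coherence, and relevance."""
    s = move.task_value
    s -= etiquette_gain * move.face_threat
    if move.builds_trust:
        s += etiquette_gain * 0.5
    if move.topic == current_topic:
        s += 0.25
    if move.topic in relevant_topics:
        s += 0.25
    return s

def choose_move(moves, facts, current_topic, relevant_topics, etiquette_gain):
    """Pick the best-scoring move whose preconditions are all established."""
    ready = [m for m in moves if m.preconditions <= facts]
    if not ready:
        return None
    return max(ready, key=lambda m: score(m, current_topic,
                                          relevant_topics, etiquette_gain))

# Made-up moves and numbers, loosely echoing the interview described above.
MOVES = [
    Move("small_talk_weather", "weather", builds_trust=True),
    Move("ask_location", "location", task_value=0.75, face_threat=0.25,
         preconditions={"wants_to_buy"}),
    Move("ask_finance", "finance", task_value=1.0, face_threat=1.0,
         preconditions={"wants_to_buy"}),
]

for gain in (0.0, 2.0):
    chosen = choose_move(MOVES, {"wants_to_buy"}, "weather", set(), gain)
    print(gain, chosen.name)
```

With these invented numbers, a gain of 0 selects the finance question and a gain of 2 selects small talk, which mirrors the qualitative behavior described for REA without claiming anything about her actual parameters.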


Sidebar: Automatic Generation of Nonverbal Behavior in BEAT

Although the nonverbal behavior exhibited by an ECA can play a significant role in enacting rules of etiquette, the correct production of these behaviors can be a very complex undertaking. Not only must the form of each behavior be correct, but the timing of the behavior’s occurrence relative to speech must be precise if the behavior is to have the intended effect on the user.

The BEAT system simplifies this task by taking the text to be spoken by an animated human figure as input, and outputting appropriate and synchronized nonverbal behaviors and synthesized speech in a form that can be sent to a number of different animation systems [6]. The nonverbal behaviors are assigned on the basis of linguistic and contextual analysis of the text, relying on rules derived from research into human conversational behavior. BEAT can currently generate hand gestures, gaze behavior, eyebrow raises, head nods, and body posture shifts, as well as intonation commands for a text-to-speech synthesizer.

The BEAT system was designed to be modular, to operate in real time, and to be easily extensible. To this end, it is written in Java, is based on an input-to-output pipeline approach with support for user-defined extensions, and uses XML as its primary data structure. Processing is decomposed into modules that operate as XML transducers, each taking an XML object tree as input and producing a modified XML tree as output. The first module in the pipeline operates by reading in XML-tagged text representing the character’s script and converting it into a parse tree. Subsequent modules augment this XML tree with suggestions for appropriate nonverbal behavior while filtering out suggestions in conflict or that do not meet specified criteria. The figure here shows an example XML tree at this stage of processing, with annotations for speech intonation (SPEECH-PAUSE, TONE, and ACCENT tags), gaze behavior (GAZE-AWAY and GAZE-TOWARDS, relative to the user), eyebrow raises (EYEBROWS), and hand gestures (GESTURE). In the final stage of processing, the tree is converted into a sequence of animation instructions and synchronized with the character’s speech by querying the speech synthesizer for timing information.
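The transducer pipeline idea can be sketched in a few lines. The example below is written in Python for brevity rather than in BEAT's Java, and it is not BEAT's code or tag schema: the UTTERANCE and W input tags, the "new" and "priority" attributes, the suggestion rules, and the choice to attach suggestions as child elements are all assumptions made for illustration. Only the EYEBROWS and GESTURE tag names are taken from the description above.

```python
import xml.etree.ElementTree as ET

SUGGESTION_TAGS = {"EYEBROWS", "GESTURE"}

def suggest_eyebrows(root):
    """Transducer: suggest an eyebrow raise on words marked as new information."""
    for word in list(root.iter("W")):
        if word.get("new") == "true":
            ET.SubElement(word, "EYEBROWS", priority="1")
    return root

def suggest_gestures(root):
    """Transducer: suggest a hand gesture on the same words, at higher priority."""
    for word in list(root.iter("W")):
        if word.get("new") == "true":
            ET.SubElement(word, "GESTURE", priority="2")
    return root

def drop_low_priority(min_priority):
    """Build a filtering transducer that removes weak suggestions."""
    def module(root):
        for parent in list(root.iter()):
            for child in list(parent):
                if (child.tag in SUGGESTION_TAGS
                        and int(child.get("priority", "0")) < min_priority):
                    parent.remove(child)
        return root
    return module

def run_pipeline(script_xml, modules):
    """Parse the tagged script, then pass the tree through each module in order."""
    tree = ET.fromstring(script_xml)
    for module in modules:
        tree = module(tree)
    return tree

SCRIPT = '<UTTERANCE><W>it</W><W>is</W><W new="true">spacious</W></UTTERANCE>'
result = run_pipeline(SCRIPT, [suggest_eyebrows, suggest_gestures,
                               drop_low_priority(2)])
print(ET.tostring(result, encoding="unicode"))
```

Because every module has the same tree-in, tree-out shape, adding a user-defined extension in this sketch amounts to appending another function to the list, which is the property the pipeline design is meant to provide.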

BEAT provides a very flexible architecture for the generation of nonverbal conversational behavior, and is in use on a number of different projects at different research centers, including the FitTrack system (see “Managing Long-Term Relationships with Laura”).

Figure. BEAT annotated parse tree and its performance.

Sidebar: Managing Long-Term Relationships with Laura

The effective establishment and maintenance of relationships requires the use of many subtle rules of etiquette that change over time as the nature of the relationship changes. The FitTrack system was developed to investigate the ability of ECAs to establish and maintain long-term, social-emotional relationships with users, and to determine if these relationships could be used to increase the efficacy of health behavior change programs delivered by the agent [3]. The system was designed to increase physical activity in sedentary users through the use of conventional health behavior change techniques combined with daily conversations with Laura, a virtual, embodied exercise advisor.

Laura’s appearance and nonverbal behavior were based on a review of the health communication literature and a series of pretest surveys (see figure). BEAT (see “Automatic Generation of Nonverbal Behavior in BEAT”) was used to generate nonverbal behavior for Laura, and was extended to generate different baseline nonverbal behaviors for high or low immediacy (liking or disliking of one’s conversational participant demonstrated through nonverbal behaviors such as proximity and gaze) and different conversational frames (health dialogue, social dialogue, empathetic dialogue, and motivational dialogue). In addition to the nonverbal immediacy behaviors, verbal relationship-building strategies used by Laura include empathy dialogue, social dialogue, meta-relational communication (talk about the relationship), humor, reference to past interactions and future together, inclusive pronouns, expressing happiness to see the user, use of close forms of address (user’s name), and appropriate politeness strategies.

The exercise-related portion of the daily dialogues Laura has with users was based on a review of the health behavior change literature, input from a cognitive-behavioral therapist, and observational studies of interactions between exercise trainers and MIT students. These interventions were coupled with goal-setting and self-monitoring, whereby users would enter daily pedometer readings and estimates of time in physical activity, and were then provided with graphs plotting their progress over time relative to their goals.

In a randomized trial of the FitTrack system, 60 users interacted daily with Laura for a month on their home computers, with one group interacting with the fully “relational” Laura, and the other interacting with an identical agent that had all relationship-building behaviors disabled. Users who interacted with the relational Laura reported significantly higher scores on measures of relationship quality, liking Laura, and desire to continue working with Laura, compared with users in the non-relational group, although no significant effects of relational behavior on exercise were found. Most users seemed to enjoy the relational aspects of the interaction (though there were definitely exceptions). As one user put it: “I like talking to Laura, especially those little conversations about school, weather, interests. She’s very caring. Toward the end, I found myself looking forward to these fresh chats that pop up every now and then. They make Laura so much more like a real person.”

Figure. Laura and the MIT FitTrack system.


Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

DOI: 10.1145/975817.975842

April 2004 Issue, Vol. 47 No. 4, Pages 38-44

Published: April 1, 2004


Communications of the ACM (CACM) is now a fully Open Access publication.
