Embodying a Virtual Agent in a Self-Driving Car: A Survey-Based Study on User Perceptions of Trust, Likeability, and Anthropomorphism

This article considers the visual appearance of a virtual agent designed to take over the driving task in a highly automated car, to answer the question of which visual appearance is appropriate for a virtual agent in a driving role. The authors first selected five models of visual appearance through a picture sorting procedure (N = 19). They then conducted a survey-based study (N = 146) using scales of trust, anthropomorphism, and likability to assess the appropriateness of those five models from an early-prototyping perspective. They found that human and mechanical-human models were more trusted than the other selected models in the context of highly automated cars. In contrast, animal and mechanical-animal models appeared less suited to the role of a driving assistant. Lessons learned from the methodology are discussed, and suggestions for further research are proposed.


INTRODUCTION
Virtual agents such as Siri, Google Assistant, Alexa, and Cortana represent a shift toward more natural interactions with technology. In everyday life, they enable quick access to web content and other functionalities through computers, mobile devices, and even our cars. Using virtual agents for in-car interactions may benefit drivers, enhancing cockpit features (e.g., music and navigation) with novel user interfaces (Weng et al., 2016). Recent implementations have come in various forms, such as mobile apps (e.g., Apple CarPlay and Google Assistant Driving Mode), in-car dedicated devices (e.g., Echo Auto, Chris, and Garmin Speak; Lugano, 2017), and proprietary personal agents from car manufacturers (e.g., Honda, BMW, Mercedes, and Volkswagen; Majji & Baskaran, 2021). With the rise of highly automated cars, new concepts of virtual agents (Ajitha & Nagra, 2021; Okamoto & Sano, 2017) are emerging with new roles. Beyond delivering complex messages about driving and road safety, particularly during handover procedures, these virtual agents embody the artificial intelligence of the automated system. In this context, the perception of trust toward such agents plays an important role in the adoption of this new technology.
Previous research has shown that providing a visual appearance, specifically an anthropomorphic one, might be a way of fostering trust toward these agents (Meng et al., 2021; Ekman et al., 2018; Waytz et al., 2014). Similar to human-to-human relationships, in which people tend to judge others' social warmth, honesty, trustworthiness, and intellectual competence based on facial appearance or attire (Zebrowitz & Montepare, 2008; Willis & Todorov, 2006; Smith et al., 2018), human-virtual-agent interactions are greatly influenced by visual appearance. Research has revealed that the visual appearance of an embodied virtual agent tends to have an impact on users' first impressions (ter Stal et al., 2020) and on perceptions of competence, intelligence, trust, and acceptance. Several studies have shown that pedagogical agents' visual attributes, including realism, gender, and ethnicity, significantly affect student self-efficacy and motivation (Baylor & Kim, 2004; Gulz & Haake, 2006; McDonnell et al., 2012; Ashby Plant et al., 2009).
Further, the appropriate appearance for a virtual agent may vary with the use case or context of use. For example, agents with realistic human-like looks tend to be preferred for health-related roles, whereas cartoon-like human looks are preferred for social interactions (Ring et al., 2014). Also, the level of realism in human-like representations encourages positive subjective user ratings (Yee et al., 2007). Zoomorphic agents tend to be associated with entertainment, education, and therapy-related roles, and agents with more mechanical looks are expected to carry out security-related tasks (Dingjun et al., 2010; Kalegina et al., 2018). Parmar et al. (2018) provided evidence that a virtual health agent designed with attire fitting the role (e.g., a white coat) was perceived to be more trustworthy, reassuring, and persuasive than an agent whose attire was not role appropriate (e.g., a casual outfit). Strohmann et al. (2019) attempted to provide visual design recommendations for in-car virtual agents, but the choice of a visual appearance for a virtual agent in a driving role can still be challenging. This is mainly due to the almost unlimited number of visual design options when creating a virtual agent for a new application and a specific type of user. Haake and Gulz (2009) proposed a comprehensive framework for this design space (see Figure 1). It defines three high-level aspects of static visual characteristics: basic model, physical properties, and graphical style. Basic model refers to the constitution of an embodied agent and the conceptual entity related to this constitution (human, animal or creature, inanimate object, fantasy, or a combination of these). Physical properties refers to design elements (e.g., face shape; eye, hair, and skin color; clothes; and accessories) that carry the most variety and suggest cultural, social, and psychological characteristics. Finally, graphical style (realistic or stylized) refers to both the degree of realism and the level of detail applied in a visual design.
The aspect of graphical style has received much attention in research on the visual design of virtual agents (Duffy, 2003; Billard, 2005; Blow et al., 2006; Kopp et al., 2005). It is inspired by the work of McCloud (1993), who presents an in-depth theory of the comic design space and gives clear definitions of abstract, iconic, and realistic visual design. Abstract design does not necessarily relate to a real-world entity (e.g., a Jackson Pollock painting); on the contrary, realistic design resembles actual things, like a photograph of a person, animal, or object; finally, iconic design (e.g., cartoons) is an abstraction of real appearances but is still perceived as resembling something real (Manning, 1998). Trust is a key factor for technology adoption in safety-critical contexts such as driving. While prolonged visual interactions might not be suitable for manual driving (i.e., because of competing attentional resources; Wickens, 2008), highly automated cars allow for building trust through visual interfaces such as visually embodied agents. Thus, it is necessary to investigate the relationship between visual design choices and the perception of trust.
In the case of anthropomorphism, measurement tools are also diverse. Anthropomorphism is a cognitive process in which human form or characteristics, such as intelligence or the capacity for emotions, are projected onto a non-human object such as a computer or robot. Single-item (MacDorman, 2006) and two-item (Aggarwal & McGill, 2007) scales have been used to measure anthropomorphism. Powers and Kiesler (2006) proposed a multiple-item scale that is often used for comparisons across studies (Bartneck et al., 2009; Häuslschmid et al., 2017; Ekman et al., 2018, 2019). Other studies use behavioral measurements (Minato et al., 2005), and scales have been developed specifically for intelligent personal assistants (Moussawi & Koufaris, 2019).
Additionally, the concepts of mindful (Araujo, 2018) and mindless (Kim & Sundar, 2012) anthropomorphism have been measured. Mindful anthropomorphism is defined as a "sincere belief that the object has human characteristics" (Nass & Moon, 2000, p. 93; Powers & Kiesler, 2006; Bartneck et al., 2009). In other words, after interacting with the machine, the user judges it as possessing qualities such as intelligence or the ability to be aware of its environment. Conversely, mindless anthropomorphism is described as an automatic response of the human to an anthropomorphic stimulus that leads them not only to attribute human characteristics to the machine from which the stimulus originates but also to sense its presence (Nass & Moon, 2000; Kim & Sundar, 2012). The difference between the two types of anthropomorphism is the perception of social presence, defined as the "feeling of being in the presence of another person" (Burgoon et al., 2001, p. 2). Hence, it is important to assess the trustworthiness of a virtual agent in the role of a driving assistant in a highly automated car (such as at SAE Level 4; SAE International, 2014). Higher trust in driving assistants may increase the acceptance of highly automated driving systems or improve safety. A first step toward this goal is to identify basic models that encourage the perception of trust toward a virtual agent. For these reasons, we present a study that exposes potential users to different basic models and assesses their relative trustworthiness, likability, and anthropomorphism.
In our study, we chose to use the zero-acquaintance approach (Vartanian et al., 2012; ter Stal et al., 2020). It consists of evaluating virtual agents based solely on static image prototypes, without any actual interaction between the agent and the user. While high-fidelity and interactive prototypes are more representative, such a low-fidelity approach is particularly relevant early in the design process. Being able to predict user trust with this approach may save resources for researchers and design practitioners. Here, the images were sorted into categories by a group of participants using a picture sorting procedure. The selected categories were included in an online survey that immersed respondents in the context of autonomous vehicles, virtual agents, and driving situations, using text and pictures to illustrate the context. Finally, the categories of basic models were rated in terms of trust, likability, and anthropomorphism. The scale from Waytz et al. (2014) was used, as it was specifically developed for the context of highly automated driving and includes measures of both trust and anthropomorphism.

METHOD

Participants
A total of 146 participants (60% female; mean age = 36.92, SD = 14.04) completed the online questionnaire. They were recruited through a public link to the questionnaire shared on social media and mailing lists (academic and professional). The first page of the questionnaire explained the context and goals of the study and asked participants to give informed consent. A total of 99% reported France as their current country of residence and 91% as the country where they had lived the most; 69% of participants had held a driver's license for over five years; 41% drive every day and 31% once a week; 58% declared using at least one driving assistance system. When asked about prior usage of virtual assistants from mainstream consumer electronics (such as those of Google and Apple), 29% of participants declared using them; 85% of participants had graduated at or above the baccalaureate level (high school or higher education diploma).

Material
This work focused on the basic model dimension from Haake and Gulz's (2009) design space. Five basic models were evaluated using a French-translated version of the 16-item scale from Waytz et al. (2014). This scale was used because it was specifically developed for the context of highly automated driving and includes items related to anthropomorphism, liking, and trust. Static high-fidelity images from an online database were used, and, to avoid evaluation of a specific image instead of a basic model category, four images were given for each basic model. The four example images were selected using a picture sorting procedure (described in more detail below). This procedure was performed with an independent group of participants to ensure that the basic models were built from user perception and understanding.

Picture Sorting Procedure
Picture sorting (Lobinger & Brantner, 2020) is a variant of the card sorting technique that helps to study users' mental models and understand how they categorize items. The aim of this procedure was to create basic model categories and to find pictures that illustrate these categories. Multiple illustration pictures are necessary for each basic model to avoid the interpretation of a single image instead of the model as a concept. First, we created 10 category labels using two aspects (basic model and graphical style) of the framework proposed by Haake and Gulz (2009). Then, each label was entered into the Pinterest search bar (one at a time), and pictures were selected from the returned results based on their quality. Depending on the category label, the results varied; hence, the number of pictures selected differed across categories. The complete selection was as follows: realistic human (5), stylized human (11), realistic animal (9), stylized animal (7), mechanical human (5), mechanical animal (5), mini mechanical (5), abstract (9), inanimate object (5), and fantasy (21). Ultimately, 83 pictures were collected from the Pinterest website. The pictures were printed and placed on a large table, with the model labels placed on an adjacent table (see Figure 2).
A total of 19 participants (eight females, mean age = 28.10), recruited from a research institute and a university campus (Paris, France), took part in the card sorting. The card sorting was performed individually and took place at IRT SystemX (Paris, France). First, each participant was presented with the definition of every label, along with questions to ensure they understood what each label meant. Then, they were asked to sort the pictures using the provided labels. A picture could be categorized under one or several labels or remain uncategorized. A picture was considered representative of a basic model when it was placed in that model's category by more than 70% of participants (Fincher & Tenenberg, 2005). Following the results of the picture sorting procedure, five basic models had at least four representative pictures: realistic human, mechanical human, realistic animal, mechanical animal, and abstract. The word realistic was dropped for simplification (see Figure 3). These basic models were selected for the questionnaire; all the others did not have enough representative pictures to be included.
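To make the representativeness criterion concrete, the following minimal sketch shows how the 70% agreement threshold could be computed from raw sorting data. The data layout and function name are hypothetical illustrations, not the authors' actual tooling.

```python
# Hypothetical sketch of the representativeness criterion: a picture counts
# as representative of a label when more than 70% of the 19 sorters placed
# it under that label (Fincher & Tenenberg, 2005).
from collections import defaultdict

N_PARTICIPANTS = 19
THRESHOLD = 0.70

def representative_pictures(assignments):
    """assignments: list of (participant_id, picture_id, label) triples,
    one per label a participant assigned to a picture (a picture may
    receive several labels, or none). Returns {label: [picture_id, ...]}."""
    votes = defaultdict(int)  # (label, picture) -> number of sorters
    for _participant, picture, label in assignments:
        votes[(label, picture)] += 1
    result = defaultdict(list)
    for (label, picture), count in votes.items():
        if count / N_PARTICIPANTS > THRESHOLD:
            result[label].append(picture)
    return dict(result)
```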

Survey Procedure
The questionnaire was designed using LimeSurvey. From the publicly shared link, participants landed on the introductory page of the questionnaire. Information was gathered about participants' driving behavior and experience, as well as their prior knowledge of virtual agent technology. To increase the relevance of participant ratings, we provided them with driving scenarios including potential interaction with a virtual agent. Participants were immersed in such scenarios using a picture and an accompanying text (see Figure 4).
Several questions about their perceptions and intentions were asked (e.g., "Will you let the assistant drive the car from the garage to the front door?") to help immerse participants in the automated driving context. They also had to project themselves into the scenario by answering a question about their intended behavior ("When the automated mode is on and the virtual assistant is in charge of the driving task, what are you most likely to do in this situation?"). Immersion was reinforced by presenting a risky driving situation (see Figure 5).
Once this immersive phase was completed, participants went through the rating procedure for each of the five basic models selected through our picture sorting results. For each basic model (represented by four examples; see Figure 6), participants completed the French-translated version of the 16-item scale from Waytz et al. (2014).

RESULTS
To calculate the perceived anthropomorphism score, four items (1-4 in the translated scale) were averaged to form a single composite (α = .92). For likability, three items (5-7) were averaged (α = .96), and nine items (8-16) were averaged for trust (α = .97). Table 1 shows median scores, standard deviations, and quartiles for each basic model on the three measures of trust, likability, and anthropomorphism. For each of these scales, the higher the score, the more favorable the participants' judgment of a given model.
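As an illustration, the sketch below shows how such composites and their reliabilities could be computed. The DataFrame layout and column names (`item1`…`item16`) are assumptions for illustration, not the authors' actual analysis code.

```python
# Sketch of the composite scores and Cronbach's alpha, assuming a pandas
# DataFrame `ratings` with one row per (participant, model) observation
# and hypothetical columns item1..item16 holding the 16 scale items.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a DataFrame whose columns are the scale items."""
    k = items.shape[1]
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - items.var(ddof=1).sum() / total_variance)

item_cols = [f"item{i}" for i in range(1, 17)]
subscales = {
    "anthropomorphism": item_cols[0:4],  # items 1-4
    "likability": item_cols[4:7],        # items 5-7
    "trust": item_cols[7:16],            # items 8-16
}

def add_composites(ratings: pd.DataFrame) -> pd.DataFrame:
    """Append one averaged composite column per subscale, printing alpha."""
    for name, cols in subscales.items():
        print(f"{name}: alpha = {cronbach_alpha(ratings[cols]):.2f}")
        ratings[name] = ratings[cols].mean(axis=1)
    return ratings
```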

Trust
The mechanical-human and human categories appeared to be the most trusted, with 53% and 51% of ratings above neutrality, respectively (see Figure 7). The abstract category was less trusted, with 46% of ratings above neutrality. In contrast, the mechanical-animal and animal categories had very low trust ratings: 44% and 51% of ratings below neutrality, respectively. It is noteworthy that neutral ratings ranged from 18% to 24% across categories. These differences were statistically significant, as shown by a Friedman non-parametric test (Chi-square value = 73.889; p < 0.05). An additional post hoc pair-wise comparison (Wilcoxon test) with Bonferroni correction (see Table 2) showed that the animal and mechanical-animal categories were significantly different from all other categories but not from each other.
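For illustration, a sketch of this analysis pipeline follows; the same procedure applies to the liking and anthropomorphism scales reported below. The `scores` structure is a hypothetical stand-in for the per-participant composite scores.

```python
# Sketch of the reported analysis: a Friedman test across the five basic
# models (repeated measures), followed by pairwise Wilcoxon signed-rank
# tests with Bonferroni correction. `scores` maps each basic model to its
# per-participant composites, all lists in the same participant order.
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

def friedman_with_posthoc(scores, alpha=0.05):
    chi2, p = friedmanchisquare(*scores.values())
    print(f"Friedman: Chi-square = {chi2:.3f}, p = {p:.4f}")
    pairs = list(combinations(scores.keys(), 2))
    corrected = alpha / len(pairs)  # Bonferroni: alpha / 10 for 5 models
    for a, b in pairs:
        stat, p_pair = wilcoxon(scores[a], scores[b])
        verdict = "significant" if p_pair < corrected else "n.s."
        print(f"{a} vs. {b}: W = {stat:.1f}, p = {p_pair:.4f} ({verdict})")
```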

Liking
The mechanical-human, abstract, and human categories appeared to be the most liked (see Figure 8), with relatively similar ratings: 47%, 47%, and 43% above neutrality, respectively. Animal and mechanical-animal were the least liked, with 36% and 31% of ratings above neutrality, respectively. Neutral ratings ranged from 11% to 18% across categories. These differences were statistically significant, as shown by a Friedman non-parametric test (Chi-square value = 47.788; p < 0.05). An additional post hoc pair-wise comparison (Wilcoxon test) with Bonferroni correction (see Table 3) showed that the animal model was significantly different from all other models.

Anthropomorphism
The human and abstract models were judged the most anthropomorphic (see Figure 9), with 58% and 55% of ratings above neutrality, respectively. They were followed by the mechanical-animal (47%), mechanical-human (45%), and animal (38%) categories. Neutral ratings ranged from 13% to 21% across categories. These differences were statistically significant, as shown by a Friedman non-parametric test (Chi-square value = 109.42; p < 0.05). An additional post hoc pair-wise comparison (Wilcoxon test) with Bonferroni correction (see Table 4) showed that anthropomorphism ratings for the animal and mechanical-animal categories were significantly different from all other models but not from each other.

Cross-Indicator Correlations
When correlating the three scales with one another, we found that all correlations were high and statistically significant. This means that when a participant rated a category high (or low) on one scale, they also tended to rate it high (or low, respectively) on the others. More precisely, anthropomorphism was positively correlated with liking (r(728) = 0.89, p < 0.05) and with trust (r(728) = 0.90, p < 0.05). Trust and liking were also highly and positively correlated with each other (r(728) = 0.90, p < 0.05).
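A minimal sketch of this computation follows. Pooling the 146 participants over the 5 models gives 730 observations, hence the reported 728 degrees of freedom; the DataFrame and its composite-score column names are hypothetical.

```python
# Sketch of the cross-indicator correlations, assuming a DataFrame `df`
# with one row per (participant, model) observation (146 x 5 = 730 rows,
# hence df = 730 - 2 = 728) and one column per composite score.
from itertools import combinations
from scipy.stats import pearsonr

def report_correlations(df, cols=("anthropomorphism", "likability", "trust")):
    for a, b in combinations(cols, 2):
        r, p = pearsonr(df[a], df[b])
        print(f"{a} vs. {b}: r({len(df) - 2}) = {r:.2f}, p = {p:.3g}")
```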

Discussion
This study aimed to assess the trustworthiness, perceived anthropomorphism, and likability of five visual appearance models for a virtual driving agent in the context of highly automated driving. We used an early-prototyping, survey-based methodology to make early design decisions and explore potential users' attitudes. Rather than seeking a high absolute level of trust, our goal was to find the most appropriate visual representation for the role under study. Our findings point to the mechanical-human and human models as the most appropriate in terms of both trust and likability. The human and abstract models were judged the most anthropomorphic. In contrast, the animal and mechanical-animal models were the least liked and trusted and were perceived as the least anthropomorphic, pointing to the low relevance of those visual representations in the context under study. This confirms our expectation that some visual appearances are more appropriate than others for the role of driving (Verberne et al., 2015). Specifically, animal appearances are not associated with safety-related tasks (Nomura et al., 2008) such as driving. Further, the results confirm that anthropomorphism is a key factor in the perception of trust toward virtual agents (Waytz et al., 2014; Ruijten et al., 2018; Lemoine & Cherif, 2012; Christoforakos et al., 2021). An unexpected result is that the abstract model was perceived as more anthropomorphic than the mechanical-human model. First, participants might have more experience with agents designed as abstract models (such as Alexa, Google Assistant, and Siri) than with mechanical-human ones; indeed, there is evidence that these intelligent personal agents are highly anthropomorphized by users (Kuzminykh et al., 2020). Second, abstract visual representations might leave more room for imagination; thus, it might be easier for participants to project human qualities onto them than onto mechanical-human models. It is also worth noting that, on all three scales, positive scores (above neutral) were never very high, and neutral scores accounted for a substantial share of participant ratings. This may come from the lack of interactivity in our setup. These latter findings call for further reflection.
This study would benefit from replication and extension. First, further research might implement the selected embodiment models in a more ecological driving environment, such as a driving simulator. Such an experimental setup would allow for studying the impact of actual interaction with the virtual agent on trust; interactivity might also have side effects on driving performance and safety. Also, in this study, we did not explore the effect of individual differences on the trust evaluation of specific basic models. Indeed, the human factors literature points to the effect of individual differences on trust in automation (Körber & Bengler, 2014; Matthews et al., 2019). Studies have shown an effect of prior knowledge and context (Khastgir et al., 2018; Lee & See, 2004); an effect of cognitive abilities in managing automation unreliability (Rovira et al., 2017); and even neurobiological attributes that modulate overtrust in computerized decision aids (Parasuraman et al., 2012). A replication of this study with individual differences controlled for might bring more understanding of how educational and professional background modulates trust in mainstream product automation.
Our findings might also need to be reproduced across different countries and cultures. Indeed, most of our respondents were nationals or residents of the country where this study was conducted (i.e., France). Also, most of them held a high school degree or above (higher education), which does reflect statistics for younger generations in France (Institut national de la statistique et des études économiques, 2021) and in Europe (EuroStat, 2021). However, we cannot conclude whether our findings generalize to countries with different cultural backgrounds or to individuals with lower levels of education. Finally, while our study targeted a high automation level (SAE Level 4; SAE International, 2014), prolonged visual interactions are unsuitable in a manual driving context, as driving is mainly a visual-manual task. Face-embodied interactions are suitable either when the car is stopped or in automated driving mode. This might allow for building a personalized and trustworthy relationship with the automation system. In a handover situation, the virtual driving assistant might switch from a visual embodiment to a voice-only embodiment to limit interference with the visual channel (Wickens, 2008). This raises important questions about the impact of prior visually embodied trust calibration on how trust endures variability in the situation (automated to manual driving) and in the communication channel (visual to auditory), and about possible side effects on situational awareness and driving performance.

CONCLUSION
Our rapid-prototyping approach, based on a card sorting procedure and a survey, allowed us to select embodiment models that are likely to elicit trust toward a virtual driving assistant in a highly automated driving context. According to our data, human and mechanical-human models are more likely to be trusted in such a context than animal and mechanical-animal ones. Follow-up research is needed to investigate the extension of these results to an ecological environment, such as driving simulation; the individual differences that might impact the perceived trustworthiness of visual appearance models; and the impact of visual embodiment on trust calibration and robustness.

ACKNOWLEDGMENT
This work has been supported by the French government under the France 2030 program, as part of the SystemX Technological Research Institute within the CAB project.

Figure 3. Basic model categories used in the online survey

Figure 5. Risky automated driving situation with high traffic density and a fast reaction required by the presence of an ambulance in the rearview mirror

Figure 7. Diverging stacked chart representing trust ratings from Likert scales across basic model categories: The central percentage represents neutral judgments, while the percentages at the right and left are positive and negative judgments, respectively

Figure 9. Diverging stacked chart representing anthropomorphism ratings from Likert scales across basic model categories: The more the stacked chart is shifted to the right, the more anthropomorphic the basic model was judged

Table 4. The p-values (significance threshold: p < 0.05) for the post hoc pair-wise comparisons of anthropomorphism, with Bonferroni correction