What's the evidence? A penny for your thoughts

April 2021

Young boy in a long-sleeved shirt at a table, looking like he’s thinking hard. His right hand supports his chin, his cheeks are puffed out, and he’s looking up at the ceiling.

Photo by mbpogue

Do you speak to yourself? Mutter under your breath, cuss, and swear even? You do? I wonder if you’d make a good candidate for think-aloud research. I’m thinking of the think-aloud protocol, where usability researchers ask you to speak your thoughts.

Usability professionals often use the method to identify usability issues or test a product or website. They ask users to say what they’re thinking as they carry out a task, to find out what they think, why they’re doing something, what they meant to do but couldn’t, and even how they feel.

A 2012 international survey of usability professionals found seven in ten of them usually or always used the method (Figure 1; McDonald, Edwards, & Zhao, 2012).

A bar graph showing that usability professionals use the think-aloud protocol all or most of the time. 29% always use it and 42% usually use it.

Figure 1: The think-aloud protocol is popular with usability professionals.

So how good is it really? Let’s discuss:

  • how it adds value (what people say reveals more than eye-tracking can)
  • concerns about it
  • how it changes behaviour
  • how it’s more demanding on participants
  • whether unmoderated tests work.

What people say reveals more than eye-tracking can

When technical communicator Cooke (2010) tracked 10 participants’ eye movements while they thought aloud, she found that their verbalisations matched their eye movements 80 percent of the time.

Elling, Lentz, and de Jong (2012) wondered: if the match is that good and participants are silent only 16 percent of the time, why do we need them to talk at all? Why not just track their eye movements? They had reservations, though:

  • Some verbalisations were actually reading. However, just because someone’s reading something doesn’t mean they understand what they’re reading.
  • Verbalisations that don’t correspond to eye movements may look inaccurate, but research shows users process information faster than they can verbalise.
  • Eye movements don’t say it all. For example, they won’t reveal expectations, ‘comments on missing information, expressions of doubt and confusion, and thoughts like “I feel foolish working at this site”’ (p. 209).

They sent 60 participants to municipal websites to search for information, for example, on a subsidy for first-time homebuyers. They found:

  • Forty percent of verbalisations matched eye-tracking observations (half of what Cooke found).
  • Participants were silent 27 percent of the time (much more than in Cooke’s study).
  • The types of verbalisation differed greatly from Cooke’s study. In Cooke’s study, most verbalisations were reading; in this study, most were observations such as ‘There’s a lot of links to choose from here’ and ‘I am not sure where to go from here’ (p. 214; Figure 2).

Types of verbalisations in the two studies differed greatly, especially for reading and observation. A column graph shows reading was 58% of the verbalisations in the first study vs 23% in the second. Observation was 10% in the first study vs 34% in the second.

Figure 2: A different research design meant participants made more observations and read less.

This suggests the other 60 percent of verbalisations added value beyond what eye movements alone could reveal. The researchers attributed the differences from Cooke’s results to better experimental design, especially the choice of participants, website, and tasks.

Silences aren’t a concern, by the way. They often occur because participants are so occupied with their tasks that they have ‘no cognitive energy left to describe what goes on in their minds . . . . not all cognitive activity, such as scanning website pages, can be easily verbalized’ (p. 217).
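The percentages in these studies come from coding each verbalisation and comparing it with the eye-tracking record. Neither paper publishes its analysis code, so purely as an illustration, here is a minimal Python sketch of that kind of tally. The categories, field names, and data are invented, not taken from Cooke (2010) or Elling, Lentz, and de Jong (2012).

    # Hypothetical tally of coded think-aloud data (illustration only).
    # Each verbalisation gets a category and a flag for whether it
    # corresponded to what the participant was looking at.
    from collections import Counter

    verbalisations = [
        {"category": "reading",     "matches_gaze": True},
        {"category": "observation", "matches_gaze": True},
        {"category": "observation", "matches_gaze": False},
        {"category": "procedure",   "matches_gaze": False},
        {"category": "doubt",       "matches_gaze": False},
    ]

    total = len(verbalisations)
    matched = sum(v["matches_gaze"] for v in verbalisations)

    print(f"Matched eye movements: {matched / total:.0%}")        # 40% here
    print(f"Added value beyond gaze: {1 - matched / total:.0%}")  # 60% here

    # Share of each verbalisation type (cf. Figure 2)
    for category, count in Counter(v["category"] for v in verbalisations).items():
        print(f"{category}: {count / total:.0%}")

In the real studies the coding is the hard part; the arithmetic at the end is the easy bit.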

Concerns about the think-aloud protocol

So maybe the protocol is useful, though concerns remain (Alhadreti & Mayhew, 2017; Alshammari, Alhadreti & Mayhew, 2015; McDonald, McGarry, & Willis, 2013). These are the main ones I found:

  • Thinking aloud may change participants’ thoughts, actions, and performance, since they have to process their thoughts more than usual. Some scholars argue thinking aloud is unnatural, and may improve or hinder task performance, threaten test validity, point to more or fewer problems, and so on.
  • Verbalisations can be mainly reading or about users’ actions and procedures, instead of explanations. That’s not terribly insightful.
  • The classic think-aloud protocol requires evaluators to listen passively, something that can be uncomfortable. So some evaluators interact with participants, but this could affect performance and verbalisation.

To search for solutions and explore new opportunities for the protocol, researchers have:

  • explicitly told participants to explain their thoughts
  • had participants explain their thoughts retrospectively
  • varied their interaction with participants
  • looked into unmoderated tests.

Close-up photo of a young girl looking intently at the camera. Her left hand obscures her mouth. Her long hair is slightly untidy.

Photo by Max Pixel

Explicitly telling participants to explain their thoughts

Applied scientists McDonald, McGarry, and Willis (2013) gave 20 participants ten tasks on a web-based encyclopaedia:

  • One group received the classic think-aloud instruction: ‘I want you to say out loud everything that you say to yourself silently. Just act as if you are alone in the room speaking to yourself’.
  • The other group was explicitly told to explain their thinking: ‘I want you to think-aloud, [and] explain your link choices as you complete each task’.

They found:

  • No difference in the overall number of verbalisations, but when tasks got hard, the explicit group talked more.
  • No difference in performance on easy tasks, but on harder tasks the explicit group performed better and clicked fewer links to navigate.

This suggests explaining things aloud changes behaviour and strategy.

Having participants explain their thoughts retrospectively

Another approach is to ask for thoughts retrospectively. However, computer scientists Alshammari, Alhadreti, and Mayhew (2015) found that those who think aloud as they go (concurrently) raise more usability problems and more unique ones too, perhaps because they can report problems in real time.

What if we gave retrospective participants a memory jogger? Researchers Beach and Willows (2017) gave an innovative, virtual twist to retrospective thinking aloud. They had 45 elementary school teachers use a professional development website, splitting them into three groups:

  • Concurrent: thinking aloud while performing tasks.
  • Retrospective: thinking aloud after completing the tasks, without memory aids.
  • Virtual revisit: thinking aloud after completing the tasks, while watching a screen recording of their session.

They found:

  • Concurrent participants made way more comments. (By the way, ‘comments’ is my word. The authors called them ‘thought units’: stretches of speech separated by pauses, each expressing a distinct idea.)
  • Retrospective and virtual revisit participants’ comments were more complex (Figure 3). Their comments involved more processing and long-term memory, such as comments about reasoning and planning. (Less complex comments are more mechanical, like those about procedures.)
  • Retrospective participants tended not to comment on their online actions or navigation.

Those thinking aloud later made a greater proportion of complex comments. The column for the concurrent condition shows 44% of the comments were more complex, 56% were less complex. The columns for the retrospective and virtual revisit conditions are similar to each other, with 80% of the comments more complex and 20% less complex.

Figure 3: Those thinking aloud after completing their tasks made nearly twice the proportion of complex comments, since they didn’t have competing mental demands. An example of a complex idea is: ‘I know that there is a lot that I can do with this in terms of reading and writing and oral and drama and themes like social justice and history and so on’ (Beach & Willows, 2017, p. 76).

This suggests:

  • Concurrent thinking aloud made more demands on participants’ thought processes, so their comments were less complex.
  • Retrospective and virtual revisit participants had lighter demands on their thought processes when thinking aloud, so they had energy for more complex ideas.
  • Screen recordings were helpful: the virtual revisit group not only made more complex comments but could also explain their navigation.

Varying interaction with participants

Could how evaluators interact with participants affect performance and results? Perhaps their ‘tones of voice, attitude, and friendliness’ could affect what participants say (cited in Alhadreti & Mayhew, 2017). The two studies below suggest it matters less than you might expect.

THREE THINK-ALOUD METHODS FIND SIMILAR USABILITY PROBLEMS

Computer scientists Alhadreti and Mayhew (2017) tested the Durham University library website on 60 students. Evaluators varied their behaviour:

  • Traditional condition: they said ‘please keep talking’ if participants were quiet for 15 seconds.
  • Active intervention condition: they asked participants to explain and give more details.
  • Speech communication condition: they acknowledged participants’ comments with ‘Mm hmm’ and asked ‘And now…?’ if participants were quiet for 15 seconds.

They found:

  • All three groups encountered similar usability problems (Figure 4), and of similar severity.
  • Those in the active intervention group were least successful with their tasks and took the longest. It was also the most expensive condition, since its sessions and analysis took the most time. Despite that, the group detected only marginally more usability problems.
  • Satisfaction with the website was similar, though the active intervention group considered the evaluator’s presence ‘disturbing’.

This suggests there’s little difference between the groups for uncovering usability problems.

The three groups found a similar number of usability problems, but those in the active intervention condition found more unique problems. They found 19, compared to 16 and 12 in the traditional and speech communication conditions.

Figure 4: Not only did the groups find a similar number of usability problems, but most problems were common to all three groups too.

THINK-ALOUD METHODS WORK WELL, AND WORK BETTER THAN SILENCE

Bruun & Stage (2015), also computer scientists, ran a more elaborate experiment on a Danish stats website. They had 43 university staff and students in four conditions:

  • Traditional: evaluators only spoke to prompt participants to talk if they had been quiet.
  • Active listening: evaluators gave feedback or acknowledgement, such as ‘Um-humm’.
  • Coaching: evaluators encouraged, sympathised, gave feedback, and even prompted participants with questions like ‘what options do you have?’
  • Silent: evaluators only introduced the experiment and tasks. Participants were specifically asked not to think aloud.

They found:

  • Task completion and time taken were similar across the four conditions.
  • The think-aloud conditions found a similar number of usability problems (double the silent condition; Figure 5), with similar severity (critical, serious, or cosmetic).
  • Those in the traditional condition raised more unique problems.
  • Those in the coaching condition raised more types of problem.
  • Those in the coaching and traditional conditions were more satisfied with the website.

Interaction styles don’t seem to affect how many usability problems participants identify. The three interaction conditions found about 38 problems each. The silent condition, however, found only 19.

Figure 5: This study suggests evaluators should do more than just introduce a study and tasks to participants. Some interaction, even just prompting participants to speak when they have been quiet, makes all the difference to finding usability problems.

This suggests the think-aloud protocols had ‘limited influence on user performance and satisfaction’, and were significantly more successful than silence. Indeed, the authors conclude that ‘no single [think-aloud] protocol version is superior overall; each has strengths and weaknesses’ (p. 17).

Looking into unmoderated tests

In our increasingly online world, researchers conduct usability tests online too. Participants do these anytime and anywhere they like, without an evaluator in sight. Do unmoderated think-aloud sessions work?

Researchers Hertzum, Borlund, and Kristoffersen (2015) sent 14 participants to a music news site, splitting them into two groups:

  • Moderated condition: evaluators probed when participants fell silent for some time, became visibly surprised without verbalizing why, or had completed a task.
  • Unmoderated condition: participants were told to record the session and think aloud.

They found:

  • Both groups made a similar number and type of comments. Moderated participants also acknowledged the evaluator.
  • Unmoderated participants made proportionately more comments that clearly helped identify usability issues (high-relevance comments; Figure 6). However, they didn’t identify more problems. The authors think there was ‘more duplication [which] constituted a stronger set of evidence for the same usability issues’ (p. 14).

Unmoderated participants’ comments were more likely to be highly relevant to usability: 21% of their comments were highly relevant, compared to 11% for moderated participants.

Figure 6: Unmoderated participants made more comments that were highly relevant to usability: comments that decisively identified usability issues. For example: ‘Well, what I really want to do now is just to go to Google and search because this is, eh’ (Hertzum, Borlund, & Kristoffersen, 2015; p. 10).

This suggests usability professionals should consider using unmoderated tests, at least to supplement moderated tests.

So what should we think?

These studies suggest it’s a good idea to get participants to think aloud, though explicitly telling them to explain their behaviour could change it. One way around it might be to record what they do and get them to think aloud retrospectively, while watching their screen recordings.

Don’t keep completely silent during the session, or ask them to; you might only find half the number of usability problems you otherwise would.

Finally, consider unmoderated tests using a recorded think-aloud method. Not only are they economical, you might find more highly relevant comments.

References

Alhadreti, O., & Mayhew, P. (2017). To intervene or not to intervene: an investigation of three think-aloud protocols in usability testing. Journal of Usability Studies, 12(3), 111-132. Retrieved from https://ueaeprints.uea.ac.uk/64914/1/Accepted_manuscript.pdf

Alshammari, T., Alhadreti, O., & Mayhew, P. (2015). When to ask participants to think aloud: A comparative study of concurrent and retrospective think-aloud methods. International Journal of Human Computer Interaction, 6(3), 48-64. Retrieved from https://ueaeprints.uea.ac.uk/57466/1/IJHCI_118.pdf

Beach, P., & Willows, D. (2017). Understanding teachers' cognitive processes during online professional learning: A methodological comparison. Online Learning, 21(1), 60-84. Retrieved from https://files.eric.ed.gov/fulltext/EJ1140245.pdf

Bruun, A., & Stage, J. (2015, September). An empirical study of the effects of three think-aloud protocols on identification of usability problems. In IFIP Conference on Human-Computer Interaction (pp. 159-176). Springer, Cham. Retrieved from https://hal.inria.fr/hal-01599881/document

Cooke, L. (2010). Assessing concurrent think-aloud protocol as a usability test method: A technical communication approach. IEEE Transactions on Professional Communication, 53(3), 202-215. Retrieved from IEEE database.

Elling, S., Lentz, L., & De Jong, M. (2012). Combining concurrent think-aloud protocols and eye-tracking observations: An analysis of verbalizations and silences. IEEE Transactions on Professional Communication, 55(3), 206-220. Retrieved from IEEE database.

Hertzum, M., Borlund, P., & Kristoffersen, K. B. (2015). What do thinking-aloud participants say? A comparison of moderated and unmoderated usability sessions. International Journal of Human-Computer Interaction, 31(9), 557-570. Retrieved from https://www.researchgate.net/profile/Morten_Hertzum/publication/281369948_What_Do_Thinking-Aloud_Those_Say_A_Comparison_of_Moderated_and_Unmoderated_Usability_Sessions/links/56353f4e08aeb786b702c4cf.pdf

McDonald, S., Edwards, H. M., & Zhao, T. (2012). Exploring think-alouds in usability testing: An international survey. IEEE Transactions on Professional Communication, 55(1), 2-19. Retrieved from IEEE database.

McDonald, S., McGarry, K., & Willis, L. M. (2013). Thinking-aloud about web navigation: The relationship between think-aloud instructions, task difficulty and performance. Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 57(1), 2037-2041. Los Angeles, CA: SAGE Publications. Retrieved from https://fas-web.sunderland.ac.uk/~cs0kmc/mcdonld-mcgarry-willis%20full.pdf