Laura A. McNamara, Susan Stevens-Adams, Charles Gieseler, John Greenfield, and Laura Matzen
Prepared by Laura A. McNamara
Human-Computer Interaction (HCI) literature describes a range of user interface (UI) evaluation approaches, from heuristic evaluation, to subjective usability evaluations, to more objective user studies organized around specific tasks. However, objective metrics that allow designers to compare the amount of work required for users to operate a particular interface, and to compare workload across different interfaces, are lacking. This is problematic for complicated information visualization and visual analytics packages, because existing evaluation techniques are highly subjective and application-specific. We believe that working memory-based metrics might provide an objective and consistent way of assessing interfaces across a range of applications. In this Late Start LDRD, we conducted a 10-participant pilot study to evaluate the utility of a Sternberg task as a source of information about cognitive load in simple software interfaces. Our pilot research demonstrates that dual-task approaches can provide a simple but highly valuable method for assessing the cognitive burden of user interface designs.
We are grateful to the New Directions Investment Area, and specifically the Cognitive Science and Technology area, for providing FY2010 late start monies for this pilot project. Chris Forsythe and the Capable Manpower team were kind enough to let us borrow their prototype video viewer and then “break” the user interface (UI), so that we had something for our participants to use in the study (fortunately, their users will never see the bad version of the UI). The Sandia Human Studies Board gave a thorough review to our proposal and provided insightful comments that helped enhance our research design. The Human Factors Team of the Networks Grand Challenge LDRD helped refine the pilot project as well. Special thanks to team member Susan Stevens-Adams for her efficient administration of the study sessions, to Charlie Gieseler for so thoroughly breaking the original Capable Manpower UI, and to Laura Matzen for developing and deploying the Sternberg task in E-Prime. We are also grateful to the research participants who put up with both the decent and the terrible user interfaces described in this report, and hope that the experience was as humorous for them as it was for us.
2. BACKGROUND AND CONTEXT
3. COGNITIVE WORKLOAD
4. THE STUDY
CMP Capable Manpower Project
DOE Department of Energy
HCI Human-Computer Interaction
InfoVis/VA Information Visualization and Visual Analytics
LDRD Laboratory Directed Research and Development
NASA TLX National Aeronautics and Space Administration Task Load Index
NGC Networks Grand Challenge
SNL Sandia National Laboratories
UI User interface
WM Working memory
This report describes a 10-participant pilot study to investigate the application of working memory (WM)-based evaluation methods for assessing user interface designs. Human-computer interaction (HCI) literature describes a range of user interface (UI) evaluation approaches, from heuristic evaluation, to subjective usability evaluations, to more controlled user studies organized around specific tasks. However, objective metrics that allow designers to assess cognitive workload for a particular interface, and to compare workload across different interfaces and users, are not currently a normal part of UI evaluation practice.
Instead, workload assessments tend to be derived from subjective reporting; e.g., using the NASA Task Load Index (TLX). The TLX is a post-event questionnaire that requires users to report their perceived workload along six different factors using a 21-point Likert scale, followed by a pair-wise comparison of those factors to rank the major contributors to overall workload. We have deployed this tool in our user studies for the Networks Grand Challenge (NGC) and found it (ironically) confusing and burdensome for our research participants, despite the fact that it is an established and widely accepted tool for assessing subjective cognitive load (Hart and Staveland 1988). One of the NGC team members, Courtney Dornburg, suggested that working-memory based techniques, specifically a Sternberg task (Sternberg 1966), might provide a real-time way to assess usability, by assessing the cognitive load induced as users complete tasks using different interface designs. We conducted a brief literature review and decided that the idea had merit; indeed, working memory is an important construct in cognitive workload evaluation, and Sternberg tasks are commonly used in a wide range of evaluation activities.
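The TLX scoring procedure described above (21-step ratings on six factors, weighted by 15 pairwise comparisons) can be sketched in a few lines. The factor names follow the standard TLX instrument; the ratings and weights below are purely illustrative, not data from our study.

```python
# Sketch of the standard NASA TLX weighted-score computation.
# Ratings use the 21-step scale (0-100 in increments of 5); each factor's
# weight is the number of times it was chosen in the 15 pairwise comparisons.

FACTORS = ["Mental Demand", "Physical Demand", "Temporal Demand",
           "Performance", "Effort", "Frustration"]

def tlx_score(ratings, weights):
    """Overall workload: sum of rating * weight over all factors, divided by 15."""
    assert sum(weights.values()) == 15, "15 pairwise comparisons in total"
    return sum(ratings[f] * weights[f] for f in FACTORS) / 15.0

# Illustrative ratings and weights (not data from this study).
ratings = {"Mental Demand": 70, "Physical Demand": 10, "Temporal Demand": 45,
           "Performance": 40, "Effort": 60, "Frustration": 55}
weights = {"Mental Demand": 5, "Physical Demand": 0, "Temporal Demand": 2,
           "Performance": 3, "Effort": 4, "Frustration": 1}
print(tlx_score(ratings, weights))  # 57.0
```

The weighting step is what distinguishes the full TLX from the simpler unweighted ("raw TLX") variant: a factor a participant never selects as a workload contributor drops out of the overall score entirely.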
In the context of the Networks Grand Challenge, we were interested in deploying this metric to assess how different information visualization vocabularies support problem solving. However, we decided it would be prudent to run a completely separate study to gather additional data on the working memory metric idea itself. This would require deploying the metric in a different application context using a completely different type of software than the information visualization software being developed under the Grand Challenge. In essence, we wanted to pursue a separate pilot study to see if a Sternberg task would provide relevant data on cognitive load in user interface evaluation. Accordingly, a subgroup of staff working on the NGC decided to apply for a small Late Start LDRD project to conduct a separate, pilot study that would help us assess the feasibility of using working memory-based techniques. Rather than focus on complicated information visualization software, we decided to deploy our technique in assessing the cognitive workload associated with two different user interfaces for a simple video viewer and annotation package.
This report describes our pilot study. We begin with a brief literature review, then describe the study protocol and how we executed it, including a detailed description of the software we used and of the Sternberg task we deployed to assess cognitive load associated with user interface designs. We then present our findings and some conclusions. Our pilot research demonstrates that working memory tasks are a simple but highly valuable method for assessing the cognitive burden of user interface designs.
BACKGROUND AND CONTEXT
This project spun off from the much larger, three-year Networks Grand Challenge Laboratory Directed Research and Development project (NGC LDRD, Hendrickson/Kegelmeyer, PIs). One goal of the NGC was to use the Titan toolkit to develop prototype information visualization and visual analytics software tools for intelligence analysts. These tools were intended to enhance the information exploration and reasoning activities of real-world users. To support this goal, the NGC leadership formed a “Human Factors” team comprising several social scientists, computer scientists, and intelligence analysts. This team was charged with supporting the design and evaluation of software from the users’ perspective.
As we reviewed literature to identify promising techniques for evaluating visual analytics software, we realized that the problem of cognitive load is understudied in the context of information visualization and visual analytics software. Courtney Dornburg, a member of the NGC research team, suggested that a Sternberg task (Sternberg 1966), could be usefully deployed to assess cognitive load associated with different information visualization software packages. Interestingly, although Sternberg tasks have been widely used in the human factors community to assess cognitive workload associated with a wide range of technologies, Sternberg tasks are not commonly employed in the software usability community.
Accordingly, rather than focus on the deeper problem of evaluating whether or not a particular software design facilitates “insight,” we decided to focus on the issue of cognitive workload in user interface designs, with an emphasis on UI designs in InfoVis/VA. As we discuss below, a literature review led us to explore concurrent working memory tasks, specifically Sternberg tasks, as a source of data for evaluating cognitive load in the context of InfoVis/VA software tools.
Cognitive workload refers to the cognitive resources being applied to a task at hand. Not surprisingly, measuring how humans experience workload is a complicated problem with physiological, performance, and subjective facets. It is also a well-studied problem, because cognitive workload has very practical implications for all of us: in a society that relies so heavily on complex technologies, understanding the design factors that influence human performance on those technologies is important for both individual and societal well-being.
Human cognitive resources are limited, and research in cognitive psychology has demonstrated the feasibility of measuring cognitive resource variability across many tasks. There are a number of such cognitive workload measurement methods, ranging from physiological measurements (e.g., EEG) to secondary tasks (such as tapping a rhythm while performing a primary task). In secondary- or dual-task paradigms, participants perform a primary task, such as manipulating an item of machinery, while simultaneously performing a secondary task, such as remembering a set of letters or numbers. As the primary task taxes the participant’s cognitive resources, performance on the secondary task typically falls. For example, as the participant concentrates on solving a problem, their tapping might slow or stop.
Working Memory for Evaluating Cognitive Workload
One common method for measuring cognitive load involves using secondary tasks that tax working memory. Working memory refers to the cognitive system that “enables the transient storage and manipulation of information needed for effective, moment-to-moment interaction with the environment” (Clark, et al. 2004). Working memory enables people to maintain “task-relevant information during the performance of a cognitive task,” (Shah and Miyake 1999). Crucially, working memory comprises limited and measurable storage and processing capabilities; it is widely (and famously) believed that humans can handle between five and nine discrete items of information (Baddeley and Hitch 1974; Cowan and Morey 2007).
Researchers in cognitive psychology and human factors often employ working memory-based techniques in dual-task studies that assess cognitive load; for example, to evaluate technology usability. One well-validated secondary task is called a Sternberg task, based on Saul Sternberg’s research on the retrieval of symbolic information from recent memory (Sternberg 1966). In a dual-task study using a Sternberg task, the researcher presents the participant with a set of letters or numbers to be remembered during performance of a primary task. As the participant is performing the primary task, s/he is presented with a random string of letters or numbers in which items from the original target set are embedded. Upon being presented with one of the letters or numbers from the initial target set, the participant pushes a button. Measurements include response time and accuracy, although response time is widely accepted as the more significant metric of cognitive workload. Sternberg tasks may be auditory or visual, depending on the format in which the letter string is presented. The difficulty of the Sternberg task can vary with the size of the set being used (Sternberg himself varied the size of the set from 1-6 items; see Sternberg 1966).
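As a concrete illustration of the structure of a Sternberg secondary task, the following sketch builds a target set and a probe stream with target items embedded among distractors. The set size, stream length, and target rate are illustrative parameters, not values from Sternberg's work or from any particular study.

```python
import random
import string

def make_sternberg_trial(set_size=3, stream_len=30, target_rate=0.25, rng=None):
    """Build one Sternberg trial: a target set plus a probe stream in which
    target items are embedded among distractors at roughly target_rate."""
    rng = rng or random.Random()
    targets = rng.sample(string.ascii_uppercase, set_size)
    distractors = [c for c in string.ascii_uppercase if c not in targets]
    stream = [rng.choice(targets) if rng.random() < target_rate
              else rng.choice(distractors)
              for _ in range(stream_len)]
    return targets, stream

def expected_responses(targets, stream):
    """Indices at which a participant should respond (probe is in the target set)."""
    return [i for i, probe in enumerate(stream) if probe in stream and probe in targets]

targets, stream = make_sternberg_trial(rng=random.Random(7))
print(targets, expected_responses(targets, stream))
```

During a dual-task session, each index returned by `expected_responses` corresponds to a moment when the participant should press the response button; response time and accuracy at those moments are the workload measures.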
Dual-task methods, including those using Sternberg tasks, are widely used to assess cognitive workload in technology design and evaluation projects, including software projects. However, most mainstream UI evaluation approaches rely on heuristic evaluation, structured tasks, or subjective assessments of difficulty from users. While cognitive workload might be implicit in such studies, it is not typically an explicit focus of the evaluators, although some software evaluation studies have used dual-task approaches. For example, ergonomics researchers recognize information search and retrieval tasks as cognitively complex activities, for which measurement of cognitive load is an important construct in assessing technologies that support search and retrieval on the internet (Back and Oppenheim 2001; Gwizdka 2009; Gwizdka and Spence 2006).
More typically, however, when workload or mental effort is assessed, it is measured indirectly: for example, by examining user behavior during structured evaluation tasks (such as whether a user stops in the middle of a task); by examining error rates and time-on-task; or through post-facto subjective reporting, either using the NASA TLX or by asking users to identify what parts of a software package or application are difficult to use, and why. As Tracy and Albers point out (2006), indirect assessments are not an optimal way to assess cognitive workload, since users often have difficulty articulating at what point the software became difficult to use, and traditional metrics such as time-on-task, completion, and error rates do not necessarily indicate increased cognitive workload. They suggest that dual-task approaches and structured tools such as the NASA TLX be used in tandem to gather information on cognitive workload for web applications (2006:257).
We believe that incorporating measures of cognitive workload into UI evaluations would provide quantitative comparisons that help researchers more precisely assess the features of user interfaces that facilitate a user’s operation of the interface. As in other technology design and evaluation projects (e.g., Wickens et al. 1986), it is reasonable to expect that users will expend greater mental effort compensating for a bad UI design, and that this tax on their resources will show up as a drop in performance on a concurrent secondary task. If this is true, it has important implications for the design of software for more complex analytical and problem-solving activities: as a user devotes more cognitive resources to navigating a difficult interface or interpreting a confusing UI, s/he will have fewer resources available for actually applying the tool to real-world data and real-world problems. Well-designed user interfaces should minimize cognitive load extraneous to the problem at hand.
Our pilot study was designed to provide preliminary information about the application of dual-task approaches, specifically a Sternberg task used in tandem with the NASA TLX, to evaluating cognitive workload associated with user interfaces. In our pilot study, ten participants performed simple tasks with two different versions of a UI. One version of the UI was designed to be easy to use, while the other was deliberately designed to be difficult to navigate. Participants also performed a concurrent auditory Sternberg task: remembering a three-letter target set and pressing a mouse button whenever they heard one of the target letters. The study design, implementation, and preliminary results are discussed below.
Pilot study design
Hypothesis: Users will experience increased cognitive load while interacting with a software interface purposely designed to be more difficult to use, given that the training and tasking conditions remain the same across both interfaces. Differences in cognitive load will be visible as decreased performance on the secondary Sternberg task (response time and accuracy), as well as on the primary task (completion rates, error rates, and time-on-task). Users will also report higher levels of perceived workload on the NASA TLX.
We recruited ten Sandia summer students to participate in a counterbalanced, within-subjects research study requiring performance of a concurrent Sternberg task while performing simple video annotation tasks using one of two Sandia-designed video viewers, described below. Our participants ranged in age from 19-32 (average 22.6 years). All were matriculating in technical fields. One participant was a graduate student, while the remaining nine were undergraduate students. Nine of the participants reported familiarity with video viewers. Five reported specific experience using video annotation/editing software.
The CMP Video Viewer.
The video viewers used for the primary task were designed for the Sandia Navy-funded Capable Manpower Project (CMP, Chris Forsythe, PI). Our study in no way supported the CMP project; instead, we “borrowed” the CMP video viewer software and redesigned the interface to meet the needs of our study. Participants in our study performed the primary tasks using two versions of the Sandia CMP Viewer. One version was modeled on Windows Media Player and was intended to be easy to use; among the research team members, we referred to it as the “Good” viewer. A second, “Evil” version of the CMP viewer was purposely redesigned to be more labor-intensive and confusing. We did not use the “Good” and “Evil” terminology with the participants, so as not to bias their performance, but we use it in this report because it is easy to remember and conveys the essence of the two viewers: one very user-friendly, the other considerably less so.
The “Good” CMP Viewer’s interface organizes user interactions around the timeline below the video screen. Annotations on the timeline are denoted with a yellow marker, and a moveable orange scroll bar enables users to navigate to a particular annotation or time in the video. Users can hover over a particular annotation with their mouse, and the annotation time and content appear in a “tool tip” window. See Figure 1.
Figure 1: The "Good" CMP Viewer
In the “Good” CMP viewer’s interface, users can double-click on the timeline to add an annotation and can right-click to edit or delete an annotation. See Figure 2.
Figure 2: The "Evil" Viewer with Multiple Menus
To design the “Evil” CMP Viewer, we took the “Good” CMP Viewer and redesigned the interface to make every action more burdensome and confusing. Rather than embedding the controls in the video timeline and making them accessible with a mouse click, we separated the controls out into a set of menus and entry boxes at the bottom of the screen. Finding a particular time in a film requires the user to type a specific time into the appropriate entry box, and working with annotations requires the user to move through layered menus. See Figure 3.
In addition, we created a confusing “delete confirmation” dialogue box in which multiple negative sentence constructions and button placements were designed to trick the user into clicking the wrong button. See Figure 4.
Primary Study Tasks
The primary task required users to perform simple search, time-notation, and editing tasks on one of two short Charlie Chaplin films. Participants used the “Good” CMP Viewer for one film and the “Evil” CMP Viewer for the second. We presented each participant with one set of four tasks for each video, for a total of eight tasks in two sets. The tasks were structured as follows:
Make X annotation when you see Y event.
At what time does X event happen?
Delete X annotation.
Find X and Y annotations. How many of each annotation do you find?
Users also answered two questions about the video content, to assess how well they had focused on and comprehended it. As the study design was counterbalanced, half the participants started with the “Evil” CMP Viewer and half started with the “Good” CMP Viewer. In addition, we switched videos and task sets across the viewers, so that the pairing of each video with each viewer was also balanced across participants.
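The counterbalancing scheme described above can be sketched as follows. The viewer labels come from this report; the film labels are placeholders, since the specific Chaplin films are not named here.

```python
from itertools import cycle, product

# The four counterbalanced cells: viewer order crossed with video/task-set order.
# Viewer labels come from the report; film labels ("Film A"/"Film B") are placeholders.
viewer_orders = [("Good", "Evil"), ("Evil", "Good")]
video_orders = [("Film A", "Film B"), ("Film B", "Film A")]
conditions = list(product(viewer_orders, video_orders))  # 4 cells

def assign(participants):
    """Cycle participants through the four cells in order of arrival."""
    plan = {}
    for pid, (viewers, videos) in zip(participants, cycle(conditions)):
        plan[pid] = list(zip(viewers, videos))  # [(first viewer, its film), (second viewer, its film)]
    return plan

plan = assign([f"P{n:02d}" for n in range(1, 11)])
print(plan["P01"])  # [('Good', 'Film A'), ('Evil', 'Film B')]
```

With ten participants and four cells, the design is not perfectly balanced (two cells get three participants each), which is one of the compromises of a small pilot sample.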
Secondary Sternberg Task
We programmed a three-letter-target auditory Sternberg task using E-Prime software. The target letter set was presented to the participants visually on their computer monitor and was refreshed prior to each new task in the task sets. Participants were asked to click a button on a special mouse when they heard a letter from their target set. We used a separate target letter set with each viewer. The stimulus stream of letters was presented over computer speakers at the participant’s laboratory workstation at a rate of one letter per second.
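Our study ran the task in E-Prime; purely as an illustration of how a run like the one just described might be scored afterward, the sketch below matches mouse clicks against target-letter onsets in an assumed log format (letter onsets at one per second, click timestamps in seconds). The response window and example values are assumptions, not parameters from our protocol.

```python
def score_sternberg(targets, stream, clicks, isi=1.0, window=1.0):
    """Score one auditory Sternberg run.

    stream: letters presented one per `isi` seconds (letter i onsets at i * isi).
    clicks: sorted mouse-click timestamps in seconds.
    A click counts as a hit if it falls within `window` seconds of the onset
    of a not-yet-claimed target letter; any other click is a false alarm."""
    used, hits, rts = set(), 0, []
    for t in clicks:
        for i, letter in enumerate(stream):
            onset = i * isi
            if letter in targets and i not in used and onset <= t < onset + window:
                hits += 1
                rts.append(t - onset)
                used.add(i)
                break
    return {"hits": hits,
            "targets": sum(1 for letter in stream if letter in targets),
            "false_alarms": len(clicks) - hits,
            "mean_rt": sum(rts) / len(rts) if rts else None}

# Illustrative log: targets B/K/R, four letters at 1 letter/s, two clicks.
result = score_sternberg({"B", "K", "R"}, ["B", "X", "K", "Q"], [0.4, 2.6])
print(result["hits"], result["false_alarms"])  # 2 0
```

Both response time (`mean_rt`) and accuracy (`hits` out of `targets`, plus `false_alarms`) fall out of the same log, matching the two Sternberg measures discussed earlier.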
All ten sessions were held at the NGC Human Factors usability laboratory in Building 899, in Technical Area 1 of the Sandia Albuquerque site. Upon arrival at the lab, each participant was given a copy of a Statement of Informed Consent that explained the study. Once the participant’s questions were answered and the Informed Consent was signed, witnessed, and dated, we showed the participant to the user workstation. Prior to beginning the actual study, participants had a short training session for each of the video viewers and had an opportunity to practice the Sternberg task as a two-minute stand-alone activity, which provided us with baseline data on each participant’s reaction time and accuracy. We also provided light refreshments.
For the actual study, we loaded the video viewer for the participant and provided a worksheet that presented the four tasks the participant was to perform. The worksheet included spaces for the participant’s answers to the questions. However, the worksheet was not the only place where participants recorded responses. The tasks also required the participants to add, edit, and delete annotations on the video files.
Prior to beginning the primary video task, the participant watched the respective video all the way through with no distractions. We did this to ensure that all participants had the same knowledge foundation of the video before they were asked to answer questions about the video. We then presented the participant with the target letter set for the Sternberg task. When the participant indicated readiness to begin the primary task, we initiated the auditory Sternberg task simultaneously with initiation of the video task. When the participant had completed all tasks in one set with one of the versions of the CMP Viewer, s/he answered two questions about the video’s content, then rated the contributing factors to their workload using the NASA TLX. We then loaded the other version of the video viewer and started the process again.
Between video viewer task sets, the participants were given a five-minute break. Once participants were finished with all the tasks using both versions of the CMP video viewer, we had each fill out a simple questionnaire that requested demographic and educational information, and asked the user to indicate his/her experience with other video viewer software packages.
Analysis and Findings
Our study generated a large amount of data, including primary task performance data, a substantial E-Prime dataset, and the NASA TLX responses. As this was a Late Start LDRD running concurrently with several other user studies our team was performing for the NGC, we only completed data collection in mid-August. We have not yet reviewed much of our data, although preliminary findings indicate that the “Evil” video viewer was indeed more difficult to use. For example, Table 1 shows the average NASA TLX scores across all participants and tasks, aggregated for each video viewer.
Table 1: Average NASA TLX Scores for All Tasks
NASA TLX Factors | Good CMP Viewer | Evil CMP Viewer
We also know that participants took nearly twice as long, on average, to complete the task set using the “Evil” CMP Viewer as they did with the “Good” CMP Viewer: the average time to complete the task set with the “Good” CMP Viewer was about 11:30, while completing the parallel tasks with the “Evil” CMP Viewer required an average of 22:33.
Over the next few months, we will clean and analyze the data from the Sternberg task, and will analyze performance metrics for the primary tasks. Since we designed the two task sets to have “parallel” structures, we will be able to compare participants’ Sternberg task accuracy and response times across CMP Viewer versions for each task. We expect to find that response times on the Sternberg task are significantly longer for each task performed using the “Evil” CMP Viewer, compared to the parallel task in the “Good” CMP Viewer.
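One plausible form for the planned per-task comparison is a paired test on matched response times, sketched below. The numbers are illustrative stand-ins, not data from the study, and the actual analysis may use a different statistic.

```python
import math
import statistics as st

def paired_t(good_rts, evil_rts):
    """Paired t statistic on matched per-task mean response times (seconds).
    Positive t means responses were slower in the second ("Evil") condition."""
    diffs = [e - g for g, e in zip(good_rts, evil_rts)]
    mean_d = st.mean(diffs)
    se_d = st.stdev(diffs) / math.sqrt(len(diffs))  # sample SD of the differences
    return mean_d / se_d

# Illustrative per-task mean Sternberg RTs (seconds), not data from the study.
good = [0.50, 0.55, 0.48, 0.52]
evil = [0.70, 0.66, 0.71, 0.69]
print(round(paired_t(good, evil), 2))  # positive: slower under the harder UI
```

Because the task sets are parallel, pairing by task controls for differences in task difficulty and leaves the viewer version as the factor of interest.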
In addition, we are in the midst of completing a secondary working memory task study comparing the cognitive workload of different graphical vocabularies. We expect to present the findings from the present study and the graph study at the Beyond Time and Error: Novel Methods for Information Visualization Evaluation (BELIV) workshop at the ACM SIGCHI meetings in April 2011.
We believe that developing memory-based metrics as an objective way to assess cognitive load in user interface design is a basic but critical “first step” toward a more comprehensive methodology for assessing how information visualization and visual analytics tools influence users’ problem-solving processes. We decided it would be prudent to gather additional data on the working memory metric idea itself by deploying it in a completely different context, using a completely different type of software than the information visualization software being developed under the Grand Challenge. Although we are still cleaning and analyzing our data from this study, we expect the findings to provide useful information about the utility of Sternberg tasks in assessing cognitive workload associated with user interface designs, both generally and in relation to information visualization toolsets.
Back, J, and C Oppenheim
2001 A Model of Cognitive Load for IR: Implications for User Relevance Feedback Interaction. Information Research 2001 6(2):http://informationr.net/ir/6-2/ws2.html.
Baddeley, Alan D, and Graham Hitch
1974 Working Memory. In The Psychology of Learning and Motivation, Vol. 8. G.H. Bower, ed. Pp. 47-89. New York: Academic Press.
Clark, C. Richard, et al.
2004 Spontaneous Alpha Peak Frequency Predicts Working Memory Performance Across the Age Span. International Journal of Psychophysiology 53:1-9.
Cowan, Nelson, and Candice Morey
2007 How Can Dual-Task Working Memory Retention Limits Be Investigated? Psychological Science 18(8):686-688.
Gwizdka, Jacek
2009 Assessing Cognitive Load on Web Search Tasks. The Ergonomics Open Journal (Bentham Open Access).
Gwizdka, Jacek, and I Spence
2006 What Can Searching Behavior Tell Us About the Difficulty of Information Tasks? A Study of Web Navigation. In Proceedings of the 69th Annual Meeting of the American Society for Information Science and Technology. Austin, TX.
Hart, S.G., and L.E. Staveland
1988 Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. Human Mental Workload 1:139-183.
Shah, P., and A. Miyake
1999 Models of working memory: An introduction. In Models of Working Memory: Mechanisms of Active Maintenance and Executive Control. Pp. 1-27. Cambridge, UK: Cambridge University Press.
Sternberg, Saul
1966 High Speed Scanning in Human Memory. Science 153:652-654.
Tracy, Janet Patton, and Michael J Albers
2006 Measuring Cognitive Load to Test the Usability of Web Sites. Usability and Information Design:256-260.
1 MS John Wagner 01432
1 MS0899 Technical Library 9536 (electronic copy)
1 MS0123 D. Chavez, LDRD Office 1011