CptS 543 Assignment #1
Critical Review of "The Evaluator Effect: A Chilling Fact About Usability Evaluation Methods"
Jessamyn Dahmen
2/9/2016

Summary. This article analyzes three established usability evaluation methods (UEMs) in terms of "the evaluator effect." The authors define the evaluator effect as the differences in how usability evaluators detect and rate usability problems within a system when using one of these UEMs. The three UEMs evaluated were cognitive walkthrough (CW), heuristic evaluation (HE), and thinking-aloud study (TA). To analyze the evaluator effect, the authors examined the results of several previous studies using these UEMs, some of which specifically explored the evaluator effect and some of which provided the necessary results but did not explicitly examine it. As their basic measure of the evaluator effect, the authors preferred the any-two-agreement measure over the detection-rate measure. The authors found that there is no single "best" UEM and that each method demonstrates different strengths and weaknesses in terms of the evaluator effect. The main contributions of this paper are to show how each UEM is affected by the evaluator effect and to demonstrate that using more than one evaluator is preferable to using more users and only one evaluator. The authors identified three aspects of the analyzed UEMs that contribute to the evaluator effect: vague goal analysis, vague evaluation procedures, and vague problem criteria. When trying to identify the number and severity of usability problems, they found that multiple evaluators are beneficial because of the variability that consistently occurs between evaluators in the results they examined. They also concluded that the evaluator effect can be mitigated but likely cannot be eliminated.
Finally, the researchers concluded that while each UEM has weaknesses in terms of the evaluator effect, they were still among the best usability evaluation techniques available at the time.

Critical Review. There are two listed authors for this paper: Morten Hertzum and Niels Ebbe Jacobsen. The former was associated with the Center for Human-Machine Interaction at Risø National Laboratory and the latter worked with Nokia Mobile Phones, both in Denmark. Morten Hertzum earned both a Master's degree and a PhD in Computer Science from the University of Copenhagen and has extensive experience in both industry and academia. It appears that Hertzum did not begin studying the evaluator effect specifically until about 1998, roughly three years prior to this paper's publication, although he had other publications related to Human-Computer Interaction (HCI). Niels Ebbe Jacobsen holds advanced degrees in Computer Science, Psychology, and Business. Jacobsen also has experience in industry and academia, with an emphasis on industry and HCI. It appears that Jacobsen also began examining the evaluator effect specifically around 1998, often co-publishing papers with Hertzum. Both authors appear to be well established, with extensive experience in this area of research even prior to this paper's publication. In their citations the authors draw on research from a variety of perspectives and countries, although these perspectives seem to be limited primarily to researchers based in Europe and the United States. This may bias the authors in terms of the differences in how UEMs are administered across different countries. They could potentially be missing information on how the evaluator effect can be mitigated, or on effective ways to address the three problems related to vagueness.
It may be the case, however, that at the time of publication this area of research was new enough that there was not an extensive body of literature to draw upon beyond several experts based in Europe and the United States. One weakness of the paper, which is in a way acknowledged by the authors, is that the evaluator effect is a measure of reliability only. This means that studying the evaluator effect alone only addresses the extent to which independent evaluations produce similar results. It does not deal with validity, specifically the extent to which the problems identified during a usability study show up in real-world use. To strengthen the contributions of this paper, it would have been ideal if the authors had examined both a measure of reliability and a measure of validity. Another weakness relates to the sample sizes of the studies the authors used. As Table 1 indicates, several of the studies examined by the authors had a very small sample of evaluators, with the highest number being 77 individuals, and two studies used 6 and 3 laboratories respectively with an unspecified number of individuals. The low number of evaluators in many of the studies is an issue that the authors point out several times throughout the paper. However, it may have biased the extent to which evaluators appear to differ, especially the authors' calculated range of difference between two evaluators (5% to 65%). Furthermore, to measure the evaluator effect the authors examine two different measures, detection rate and any-two agreement. Each measure has its own drawbacks, but the authors state a preference for the latter. Unfortunately, some of the studies they used did not contain sufficient data to calculate any-two agreement. There is also no discussion of whether there are other ways to quantitatively measure the evaluator effect beyond these two.
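To make the two measures concrete, here is a small illustrative sketch. The evaluator names, problem sets, and problem count below are hypothetical, not taken from the paper: detection rate is the fraction of all known problems a single evaluator finds, while any-two agreement averages, over every pair of evaluators, the overlap of their detected problem sets divided by their union.

```python
from itertools import combinations

# Hypothetical data: each evaluator's set of detected problem IDs,
# out of 10 known usability problems in the system under test.
evaluators = {
    "E1": {1, 2, 3, 5, 8},
    "E2": {2, 3, 4, 8, 9},
    "E3": {1, 3, 8},
}
TOTAL_PROBLEMS = 10

def detection_rate(found, total):
    """Fraction of all known problems a single evaluator detected."""
    return len(found) / total

def any_two_agreement(evals):
    """Average, over all evaluator pairs, of |A ∩ B| / |A ∪ B|."""
    pairs = list(combinations(evals.values(), 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

rates = {name: detection_rate(s, TOTAL_PROBLEMS) for name, s in evaluators.items()}
print(rates)                        # per-evaluator detection rates
print(any_two_agreement(evaluators))  # a single reliability figure for the group
```

A low any-two-agreement value here would mirror the paper's point: individual evaluators can each find a reasonable share of problems while still overlapping surprisingly little with one another.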
Based on the descriptions of the studies evaluated, the authors tried to compare a variety of different studies using different evaluation methods and to draw general, cohesive conclusions from their findings. One strength of the paper is that the authors do examine how differences between the procedures for each UEM could affect the outcome. However, the studies they examined vary greatly in terms of the number and type of evaluators, the evaluated system, and the UEM used. There does not seem to be any single system that was evaluated using all three techniques and similar evaluators. Trying to compare these different studies without a more balanced representation of techniques and evaluator types may mean that many variables besides UEM procedures affected the outcome, variables the authors did not discuss as thoroughly.

Integration with Related Work. In terms of the work preceding it, this paper offers several novel insights. Compared to Jacobsen and Hertzum's (1998) previous publication on the evaluator effect, this paper offered a much more in-depth analysis and discussion, drawing on a larger sample of evaluators and different UEMs. In a larger context, this paper also seems to have been one of the first studies to specifically analyze the evaluator effect across all the most popular UEMs of the time. As demonstrated by a review comparing UEMs written by Gray and Salzman (1998), the evaluator effect was often indirectly studied but not explicitly addressed until the time of this paper's publication. Since its publication, the paper has been cited many times by studies examining UEMs and other aspects of usability in general. In some studies, such as the one conducted by Hvannberg, Law, and Lárusdóttir (2007), the researchers are skeptical that differences in usability evaluations, specifically heuristic evaluations, are due entirely to the evaluator effect.
This skepticism is based on the small sample of evaluators used in that study. Another subsequent study, conducted by Vatrapu and Pérez-Quiñones (2006), explored how differences in culture can affect the outcomes of usability studies. The results of this study seemed to support the findings of Jacobsen and Hertzum, especially when comparing evaluators interviewing users from the same culture to evaluators interviewing users from a different culture. However, these findings were not entirely supportive of Jacobsen and Hertzum's paper, as the evaluators did not make any judgment decisions about usability problems. It would seem that the small sample sizes used in these studies may call into question the actual significance of the evaluator effect in affecting usability testing outcomes. Nevertheless, much of the subsequent literature that cites Jacobsen and Hertzum's study does acknowledge that the evaluator effect exists and can impact the reliability of UEMs.

Implications for HCI. A major implication for HCI researchers raised by this study and related work is that the validity of UEMs has not been studied extensively enough. Even if researchers can establish methods that reduce differences between evaluators, does doing so help each UEM detect problems that are useful to address in real-world settings? Evaluators may be able to detect similar problems more consistently, but those problems may not be the ones that matter to actual users. Another implication mentioned by the paper for researchers concerns the lack of studies examining whether evaluators are consistent across evaluations. It may be that the differences captured by the evaluator effect are not based fully on "true" disagreement between evaluators, but on inconsistencies in an individual evaluator's performance and abilities. Finally, another implication that this paper touches upon is how reliable the measures of the evaluator effect themselves are.
It seems that at the time of publication there were only two measures: detection rate and any-two agreement. Each has different strengths, but there may be other measures that capture aspects of the evaluator effect not fully captured by these two alone. One main implication for HCI practitioners is that, for any of the three UEMs analyzed in this paper, it is not a good idea to have only one evaluator. With only one evaluator, practitioners may not only miss major usability problems but also fail to assign the correct severity level to each detected problem. The paper also implied that adding another evaluator may be more beneficial than adding more users for some of the UEMs. Another implication for practitioners is that it is better to be specific about goals and task analysis for a subset of critical problems than to be vague and try to capture problems for all aspects of a system. According to the authors, usability evaluations will always involve evaluator judgment to some degree, so there is no way to completely eliminate differences between evaluators. Rather than trying to eliminate these differences, it is more effective to leverage them and include more evaluator perspectives to get better usability problem coverage. It is also important to clearly define goals and procedures and to focus on only the most important aspects of a system if possible. An implication for users of technology is that many deployed systems, even several years after this paper's publication, can be profoundly affected by the evaluator effect and by the vagueness that often guides evaluators performing UEMs. These systems may suffer from more usability issues than would be present if the teams evaluating them had included even one additional evaluator.
This paper also implies that without further analysis of validity in addition to reliability, users may have to interact with systems that produced consistent results when being evaluated but still may not be useful in terms of real-world problems.

References Cited

Gray, W. D., & Salzman, M. C. (1998). Damaged merchandise? A review of experiments that compare usability evaluation methods. Human–Computer Interaction, 13(3), 203-261.

Hvannberg, E. T., Law, E. L. C., & Lárusdóttir, M. K. (2007). Heuristic evaluation: Comparing ways of finding and reporting usability problems. Interacting with Computers, 19(2), 225-240.

Jacobsen, N. E., Hertzum, M., & John, B. E. (1998). The evaluator effect in usability tests. CHI 98 Conference Summary on Human Factors in Computing Systems - CHI '98.

Vatrapu, R., & Pérez-Quiñones, M. A. (2006). Culture and usability evaluation: The effects of culture in structured interviews. Journal of Usability Studies, 1(4), 156-170.