Discussion 8

read the instruction

Read the article(s): “Can Unicorns Help Users” and “Is that you, Alice”

(every question below should be academic and focus on engineering area)

1. List 3 takeaways from the paper that are most interesting to you.

• bullet point #1

• bullet point #2

• bullet point #3

2. List 3 strengths of the paper.

• bullet point #1

• bullet point #2

• bullet point #3

3. List 3 weaknesses of the paper.

• bullet point #1

• bullet point #2

• bullet point #3

Bullet points are good enough for the questions above.

Each point should be a complete sentence (one sentence per point at most).
Make a clear, specific point. None of the points above may be overly
general or about writing skills.

4. Write 2 other short paragraphs about two different, specific points from the article, giving
your own thoughts. You will be expected to expand a little on these thoughts.

Exceeded Expectations

For example, by drawing on materials outside the paper (e.g. news, other research, personal
experience)

Important:

You cannot make unsubstantiated claims. As scientists, it is important for us to be clear about
whether what we are saying is:

• a personal opinion

• our understanding without back up, or researched fact

In the latter case, one should back up that argument with appropriate references such as
peer-reviewed research articles or white papers, or news articles from reputable sources. In your
written submissions, you can either hyperlink or put a short reference at the bottom.

Can Unicorns Help Users Compare
Crypto Key Fingerprints?

Joshua Tan, Lujo Bauer, Joseph Bonneau†,
Lorrie Faith Cranor, Jeremy Thomas, Blase Ur*

Carnegie Mellon University, {jstan, lbauer, lorrie, thomasjm}@cmu.edu
† Stanford University, [email protected]

* University of Chicago, [email protected]

ABSTRACT
Many authentication schemes ask users to manually compare
compact representations of cryptographic keys, known as fin-
gerprints. If the fingerprints do not match, that may signal a
man-in-the-middle attack. An adversary performing an attack
may use a fingerprint that is similar to the target fingerprint, but
not an exact match, to try to fool inattentive users. Fingerprint
representations should thus be both usable and secure.

We tested the usability and security of eight fingerprint repre-
sentations under different configurations. In a 661-participant
between-subjects experiment, participants compared finger-
prints under realistic conditions and were subjected to a sim-
ulated attack. The best configuration allowed attacks to suc-
ceed 6% of the time; the worst 72%. We find the seemingly
effective compare-and-select approach performs poorly for
key fingerprints and that graphical fingerprint representations,
while intuitive and fast, vary in performance. We identify
some fingerprint representations as particularly promising.

ACM Classification Keywords
K.6.5 Security and Protection: Authentication; H.5.2 User
Interfaces: Evaluation/methodology

Author Keywords
usability; key fingerprints; authentication; secure messaging

INTRODUCTION
To protect the privacy of communications like email and in-
stant messaging, users can encrypt messages using public-key
encryption. For Alice to send a message to Bob that only Bob
can read, she needs to encrypt the message with Bob’s public
key. Bob will use his private key to decrypt the message.

CHI 2017 May 06-11, 2017, Denver, CO, USA
© 2017 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-4655-9/17/05.
DOI: http://dx.doi.org/10.1145/3025453.3025733

While this method of securing communication is believed to
be technically sound, it hinges on Alice knowing Bob’s public
key. To learn Bob’s key, Alice would typically look it up
on a web site (e.g., a public key server) that publishes such
information. Unfortunately, an attacker seeking to intercept
Alice’s communications to Bob might try to add his own key
to the key server under Bob’s name. When trying to find
Bob’s public key, Alice would then unwittingly download the
attacker’s key. Any messages she composed for Bob would
then be readable by the attacker, and not by Bob.

A more reliable method would be for Bob to deliver his public
key to Alice in person. Because public keys are long strings
of arbitrary bits, this approach is unfortunately unwieldy and
impractical. A common alternative is for Bob to give Alice
a fingerprint of his key, which is a short digest (hash) of the
key. Alice then manually compares (e.g., looks at them side
by side) the fingerprint received from Bob to the fingerprint
computed from the key she downloaded from the key server.
Fingerprints are by design long enough for it to be exceedingly
unlikely that two different keys will have the same fingerprint,
yet short enough for manual comparison to be feasible.
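
The idea of a fingerprint as a short digest of a key can be illustrated with a minimal sketch. It assumes SHA-1 over the raw key bytes and a grouping into chunks of four hexadecimal characters; real tools such as GnuPG hash a specific key encoding, so the function names, the placeholder key bytes, and the normalization step below are illustrative assumptions rather than any tool's actual behavior.

```python
import hashlib


def hex_fingerprint(public_key: bytes, chunk: int = 4) -> str:
    """Return the SHA-1 digest of the key as uppercase hex in space-separated chunks."""
    digest = hashlib.sha1(public_key).hexdigest().upper()  # 40 hex chars = 160 bits
    return " ".join(digest[i:i + chunk] for i in range(0, len(digest), chunk))


def fingerprints_match(received: str, computed: str) -> bool:
    """Compare two fingerprints in full, ignoring case and whitespace."""
    def normalize(text: str) -> str:
        return "".join(text.split()).lower()
    return normalize(received) == normalize(computed)


# Hypothetical key material, used only to make the example runnable.
bob_key = b"hypothetical public key bytes for Bob"
attacker_key = b"hypothetical public key bytes for an attacker"

from_bob = hex_fingerprint(bob_key)            # fingerprint Bob hands to Alice
from_server = hex_fingerprint(attacker_key)    # fingerprint of the downloaded key

print(from_bob)
print(fingerprints_match(from_bob, hex_fingerprint(bob_key)))
print(fingerprints_match(from_bob, from_server))
```

A program can compare every character; the study concerns what happens when a person performs this comparison by eye and may check only part of the string.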

Fingerprint verification is only useful, however, if Alice is able
to determine easily and successfully whether the fingerprint
she obtained from Bob matches the one she computed. If
Alice is only comparing the first part of the two fingerprints,
for example, this opens the door to attackers who try to create
a public key whose fingerprint will be similar to Bob’s key’s
fingerprint, in the hope that Alice’s (cursory) examination will
not distinguish it from the real fingerprint.

Fingerprints can be represented in many ways, which may
impact the efficiency and accuracy with which users compare
them. Besides the commonly used hexadecimal format [2, 8],
other representations used in practice include ASCII art [22],
numbers [34], pronounceable strings [13], and avatars [20].
Additional representations have been proposed, including abstract
art [23], sentences [1], snowflakes [18], and fractal flames [25].

Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA

This work is licensed under a Creative Commons Attribution International 4.0 License.

In this paper we report on the results of a 661-participant online
study through which we compare the usability and efficacy
of a range of fingerprint representations and configurations.
We test eight different representations and examine how likely
users are to notice fingerprint mismatches caused by an attacker
who creates public keys whose fingerprints are similar
to the authentic key’s fingerprint. We include fingerprint
representations used in practice (e.g., hexadecimal strings, ASCII
art) as well as ones hypothesized to be good alternatives (e.g.,
randomly generated images of unicorns). We also examine
different approaches used to compare fingerprints (seeing two
fingerprints side by side versus selecting the correct fingerprint
from a list) and the effect on security against differently pow-
erful attackers. We pay special attention to our experiment’s
realism, simulating time pressure that users may feel in prac-
tice when performing security tasks, accounting for the fact
that attacks are rare rather than the typical case, and simulating
other practical challenges (e.g., comparing a fingerprint on a
business card to one on a screen).

Our results include findings important for designing systems
that involve comparing fingerprints. We find that graphical rep-
resentations, thought promising [11], have mixed success. Al-
though all allowed quick comparisons, some were much more
susceptible to attack than more standard representations. We
also find that the compare-and-select method of comparing fin-
gerprints resulted in many users failing to detect mismatches,
to the point where its use for fingerprint comparison should
be strongly discouraged. Echoing prior work [6], we find that
non-hexadecimal textual representations have promise, espe-
cially for usability. The traditional hexadecimal representation,
however, overall fares surprisingly well.

These findings lead to a set of suggestions for when a finger-
print representation may be appropriate. When security is
paramount, none of the representations tested seem adequate.
When the risk and impact of attack is low but usability is
paramount, visual representations excel. For casual use, when
security and usability need to be balanced, textual representa-
tions, including hexadecimal, seem to be the most appropriate.

BACKGROUND AND RELATED WORK
In this section we discuss the uses of fingerprints and prior
attempts to improve their usability. We also explain how to
use entropy to quantify the security afforded by fingerprint
representations. Finally, we relate attacker strengths to real-
world costs for performing brute-force attacks on fingerprints.

Fingerprint Applications
One type of attack in communication systems is a man-in-the-
middle (MitM) attack, in which an attacker inserts himself
or herself between two communicating parties in such a way
that neither party is aware of the attacker. Once inserted, the
attacker can eavesdrop or actively manipulate communications.
Fingerprint comparisons enable a user to detect MitM attacks.

Well-known applications that make use of fingerprints in-
clude GnuPG [2], a tool for encrypting communications, and
OpenSSH [8], commonly used for remote access to servers.
Off-the-Record (OTR) Messaging applications provide multi-
ple ways to authenticate message recipients, including finger-
print verification [21]. Many popular secure chat smartphone
apps also use fingerprints, usually in a layered approach in
which fingerprint comparison is optional [34, 36].

A variety of fingerprint representations and formats are used,
the majority of which are textual. Examples of textual representations
used in real systems are shown in Table 1. Graphical
representations have seen limited use; examples include Peerio [20],
which uses an avatar representation, and OpenSSH
Visual Host Key, which resembles ASCII art (Figure 2).

Application     Example fingerprint
GnuPG           3A70 F9A0 4ECD B5D7 8A89
                D32C EDA0 A352 66E2 C53D
OpenSSH         ef:6d:bb:4c:25:3a:6d:f8:79:d3:a7:90:db:c9:b4:25
bubblebabble    xucef-masiv-zihyl-bicyr-zalot-cevyt-lusob-negul-biros-zuhal-cixex
OTR             4206EA15 1E029807 C8BA9366 B972A136 C6033804
WhatsApp        54040 65258 71972 73974
                10879 55897 71430 75600
                25372 60226 27738 71523

Table 1: Examples of textual fingerprint representations used
in actual applications.

Usability of Fingerprints
Public key cryptography has long had usability and adoption
issues [3, 9, 10, 26, 35]. Fingerprints have been part of the
problem. In a user study on OTR messaging, participants
were confused by fingerprints and struggled to verify them
correctly [28]. In order to make fingerprint verification more
usable, researchers have explored a variety of approaches.

One way to make comparison easier is to shorten fingerprints.
This approach is used in Short Authentication Strings (SAS).
SAS can provide reasonable security using only a 15-bit
string [33], though they are primarily useful in synchronous en-
vironments. Alternatively, the computational security of small
fingerprints can be increased by slowing down the hashing
algorithm using a stretching function. In our study, we focus
on fingerprint comparisons that apply to public-key fingerprint
verification, which is traditionally asynchronous.

Adding structure to a fingerprint representation may make it
easier to compare. Textual fingerprint representations can be
separated into smaller chunks. Representing fingerprints as
pronounceable words or sentences may also facilitate compar-
ison. For graphical representations, structured images resem-
bling abstract art have been suggested for improving usability.
These include Random Art [23] and OpenSSH Visual Host
Key, which was inspired by Random Art [22]. Another way to
add structure is to represent fingerprints as avatars, such as uni-
corns [32] or robots [5]. In our study, we examine textual and
graphical representations with varying degrees of structure.

Different comparison modes have also been proposed, primar-
ily for SAS-based device pairing or synchronous authentica-
tion [7, 27]. These include compare-and-confirm (compare
two strings and indicate if they match), compare-and-select
(compare one string to a set of others and select the matching
option), and copy-and-enter (copy a string from one device to
another and let the device itself perform the check). Compare-
and-select has also been used for anti-phishing tools that ask
users to select the website they want to visit from a list, rather
than ask for a yes/no answer to whether users would like to
proceed to a given website [37]. In our study, we test both
compare-and-confirm and compare-and-select. In particular,
we explore compare-and-select due to its potential benefits for
inattentive users. Prior work has postulated that compare-and-select
may help prevent users from “verifying” fingerprints
without actually comparing them [7, 31].

A number of usability studies on device pairing have explored
representations and comparison modes similar to those we
test [14, 16, 17]. They considered representations such as num-
bers, images of visual patterns, and phrases, as well as various
comparison modes. In each of these studies, participants per-
formed comparison tasks similar to ours in a lab setting. In
some cases, participants were also subjected to a simulated
MitM attack [14, 17]. These studies had mixed findings; for
example, Kainda et al. conclude that compare-and-confirm
and compare-and-select should not be used because they are
subject to security failures [14], while Kumar et al. recommend
that compare-and-confirm with numbers be used due to its low
error rate and comparison speed [17]. Although these studies
were useful for informing our selection of fingerprint represen-
tations and comparison modes, a number of differences limit
the extent to which their findings might extend to verifying
public key fingerprints. These differences include the size of
fingerprints tested (15-20 bits, compared to at least 128 bits
in our study) and the methodology used (lab studies are often
too small to perform statistical testing).

In a 400-participant online study, Hsiao et al. examined the
speed and accuracy of fingerprint comparisons under different
representations [11]. In contrast to lab studies, their methodol-
ogy enabled statistical testing. However, with one exception,
all fingerprints were sized between 22-28 bits, so it is unclear
how many of their findings might translate to our setting. The
exception to this caveat is that they tested Random Art [23],
which can encode bit sizes comparable to those we explore.
They found Random Art performed well in both accuracy
and speed, recommending its use for color-display devices
with sufficient computational power. We examine Vash, an
open-source implementation of abstract art fingerprints [4, 30].
Although the two representations are similar, Hsiao et al.’s
findings for Random Art may not extend to the attacks we
consider; the similar-looking pairs they used were selected
from only 2000 Random Art images, whereas we consider
attackers who can generate 2^60 candidate images.

Other than recent work by Dechand et al. [6], scant research
has tested representations suitable for key fingerprints. That
study involved a large-scale, within-subjects experiment on
Mechanical Turk (MTurk) in which participants compared fin-
gerprints displayed in different textual representations, includ-
ing hexadecimal, numbers, words, and sentences. They mea-
sured comparison speed and accuracy, and they also recorded
whether participants correctly compared fingerprints for multiple
simulated 2^80 attacks. They found that hexadecimal
performed significantly worse than numbers and sentences in
both attack detection rates and usability ratings. In addition,
they found that sentences had a significantly higher attack
detection rate than numbers, while also being rated as more
usable.

Our work shares many similarities with Dechand et al.’s
study [6], particularly in the textual representations tested.
Similar to their study, we also subject participants to simulated
attacks for textual representations to determine the usability
and security of those representations. Unfortunately, direct
comparison of our work to theirs is difficult due to parameter
differences in the textual representations we test, in particular
the chosen security level for fingerprints (see footnote 1). The other
primary differences between the previous study and ours are:
we evaluate fingerprint comparisons under realistic conditions
of habituation and distraction using a between-subjects de-
sign; we explore graphical representations; we perform an
initial investigation on compare-and-select for cryptographic
key fingerprints; and we test additional attacker strengths.

Entropy as a Security Metric
Entropy can be used to measure the computational security
afforded by different fingerprint representations. If users com-
pare fingerprints fully, then the entropy of a fingerprint rep-
resentation quantifies the average work needed to find a key
whose fingerprint collides with the target fingerprint. If users
do not compare fingerprints completely, but only compare
certain aspects, then an intelligent attacker may attempt to
only match the aspects he expects will be actually compared.
This type of attack, in which the adversary attempts to find a
visually similar fingerprint to the target fingerprint, has been
explored for both the hexadecimal and graphical fingerprints
used in OpenSSH [19, 24]. More recently, the previously men-
tioned study by Dechand et al. investigated users’ ability to
detect such attacks for different textual representations [6].

The reduction in entropy of the original representation (af-
ter fixing the matched aspects) can be used to quantify the
work an attacker must spend to produce a key whose finger-
print matches specific aspects of the target fingerprint. This
approach to quantifying attacker work has the added benefit
of being independent of the particular binary encoding used
for a fingerprint. A 2^60 attack corresponds to an attacker that
generates 2^60 keys, computes the fingerprint for each, and then
(manually or programmatically) selects the key whose fingerprint
maximizes similarity according to some metric. Stevens
et al. estimate the cost of renting CPU/GPU time from EC2
to find a SHA-1 collision, which requires resources similar to
those needed to perform a 2^60 attack on fingerprints. They estimate
this cost to be between 75K and 120K USD, which they note
is within the budget of criminal organizations [29].
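
The relationship between attacker work and matched fingerprint content can be made concrete with a short sketch. It assumes an idealized hash whose output symbols are uniform and independent, so that fixing k symbols from an alphabet of size a costs a^k attempts on average; the function names are hypothetical and the model ignores any similarity metric beyond exact symbol matches.

```python
import math


def matchable_symbols(attack_bits: float, alphabet_size: int) -> int:
    """Number of whole symbols an attacker can expect to fix with 2^attack_bits attempts."""
    bits_per_symbol = math.log2(alphabet_size)
    return math.floor(attack_bits / bits_per_symbol)


def expected_attempts(matched_symbols: int, alphabet_size: int) -> int:
    """Average number of candidate keys needed to match the given number of symbols."""
    return alphabet_size ** matched_symbols


hex_matched = matchable_symbols(60, 16)      # hexadecimal digits carry 4 bits each
digit_matched = matchable_symbols(60, 10)    # decimal digits carry about 3.32 bits each

print(hex_matched, "of 40 hexadecimal characters")
print(digit_matched, "of 48 decimal digits")
print(expected_attempts(hex_matched, 16) == 2 ** 60)
```

Under these assumptions a 2^60 attacker controls well under half of a 160-bit hexadecimal fingerprint, which is why the choice of which positions to match matters.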

METHODOLOGY
We conducted a between-subjects experiment to evaluate and
compare the usability and security of fingerprint represen-
tations and configurations. We recruited participants from
MTurk in August 2016. We required participants be 18 years
or older and live in the United States. Our protocol was ap-
proved by our university’s IRB.

We advertised the study as a “role-playing activity involving
technology and communication in the workplace” that would
take about 20 minutes. We compensated participants $3, with
the opportunity to earn a $1 bonus. As our activity was not
designed for use on tablets or smartphones, we asked that
participants use a desktop or laptop computer. Participants
were randomly assigned to a condition, which determined the
fingerprint representation and configuration they saw.

1 We began our study prior to publication of their work, limiting our
ability to choose parameters consistent with their study.

Figure 1: Screenshot of the task. This example is the
compare-and-confirm, simultaneously visible, hexadecimal condition.

We asked participants to imagine they worked as an accountant
at a company that was updating its employee database. To
perform this update, participants had to retrieve the social
security numbers (SSNs) for 30 employees and enter them
into a database. We chose SSNs to motivate the need for
secure communication. In the U.S., SSNs are highly sensitive
because knowing an individual’s SSN can enable identity theft
and have other financial ramifications. Our activity web page
divided the browser window into two sections that mimicked
the appearance of a computer screen and a desk (Figure 1). The
computer screen section displayed a spreadsheet-like database
where participants would need to fill in missing SSN details
for employees. Several business cards were sitting on the desk.
A stopwatch and information about the participant’s progress
completing the task appeared in the top right corner.

We provided the instructions for the activity via an interactive
tutorial. Participants could repeat the tutorial any number of
times until they were comfortable proceeding, and the on-
screen stopwatch did not begin until after the tutorial ended.

At the start of a task, a chat window appeared on the simulated
computer screen informing participants of an incoming mes-
sage from one of the 30 employees. To proceed, they had to
perform a security check. Shortly afterward, a dialog box was
displayed on the computer screen containing a fingerprint and
instructions to compare it to the fingerprint on the employee’s
business card, which appeared simultaneously on the desk.

For our baseline configuration, a security check involved com-
paring two fingerprints (one in the security check dialog box
and one on the business card) and pressing a button to indicate
if they were the same. We informed participants that this check
was needed to ensure a secure chat session and avoid potential
eavesdroppers. Depending on which fingerprint representa-
tion was shown, we provided guidance on what differences
participants should look for when comparing fingerprints.

If the participant indicated that the fingerprints matched, the
chat window displayed a message from the employee with her
SSN. The participant was instructed to type the SSN into the
database. If the participant instead indicated that fingerprints
did not match, the chat message instructed the participant
to instead enter “ERROR” in the database. Each participant
repeated this task for 30 employees.

We instrumented our activity to record participants’ database
entries, security-task decisions, and detailed timing informa-
tion. We also recorded their browser user agent strings to
determine whether participants used a tablet or smartphone.
For one condition in which users had to toggle between two
views, we recorded the number and timing of toggles.

Afterwards, participants filled out a survey. In this survey, we
told participants whether they had missed our attack and asked
them to explain why they thought they missed or detected
that attack. To aid in memory, we showed the fingerprint pair
corresponding to the attack alongside these questions. We
asked participants to describe their strategy for comparing
fingerprints, respond on a Likert scale to statements about the
fingerprints they saw, and provide general demographic data.

Security Task Design Considerations
Security tasks are rarely performed for their own sake. Finger-
print comparisons are secondary to a primary purpose, such as
communicating with someone. Combined with the pressures
and stresses of everyday life, this state of affairs often results in
users performing security tasks while distracted or otherwise
not fully attentive. In addition, few users will have previously
been the target of a MitM attack and might have little reason
to believe they would become such a target. Users asked to
compare fingerprints might only encounter mismatching fin-
gerprints due to device misconfiguration, the acquisition of a
new device, or security software re-installation. Many design
decisions for our activity reflect these real-world factors.

To increase distraction and stress, we incentivized participants
to perform the task both quickly and correctly by informing
them that the “15% fastest participants with the fewest mis-
takes” would receive an additional $1 bonus. We considered
a mistake to be entering an incorrect SSN into the database
or failing the security check for an employee. The interface
contained both a persistent reminder of this bonus and a timer
showing the elapsed time, a target time participants should
try to beat, and the number of employees remaining. We also
highlighted this box in the tutorial. To avoid cases where par-
ticipants felt under-pressured due to exceeding the target time,
we noted that beating the target time did not guarantee the
bonus as future participants could lower or raise it.

We expected most participants would have little to no expe-
rience comparing key fingerprints and thus would lack the
expectations typically held by users who frequently compare
fingerprints. In an effort to ingrain these expectations quickly,
we sacrificed realism with respect to the role-playing activity.
To habituate participants to benign situations, 28 of the 30
comparison tasks involved matching fingerprints. One finger-
print comparison task involved obviously different fingerprints,
reflecting what would commonly be seen by users in benign
situations (e.g., misconfigurations). This task was shown in a
randomly determined position between the second and fifth
pairs, inclusive. We also used this task to filter out participants
who mindlessly clicked through the entire activity.

Threat Model and Simulated Attack
In our threat model, an adversary attempts a MitM attack
on a specific user in the context of a fingerprint comparison
task. We assume preimage attacks on the user’s fingerprint
are infeasible. Instead, an adversary attempts to present a
key whose fingerprint is similar to the target user’s fingerprint.
We assume the adversary has finite resources, limiting how
similar the attack fingerprint can be to the target. Participants
were shown a single simulated attack according to this threat
model. The comparison task in which the attack appeared was
randomly chosen from between the 25th and 29th pairs, inclu-
sive. To minimize bias, we generated three attack instances
for each configuration and randomly selected one of these
three for each participant. Since we tested attack strengths
requiring computational resources not available to us, we simulated
these attacks; we generated a 2^60 attack by creating a
version of the target fingerprint whose similarity was such that
it would take 2^60 brute-force attempts on average to produce.
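
The ordering of the 30 comparison tasks described above (28 matching pairs, one obvious mismatch between the second and fifth pairs, one simulated attack between the 25th and 29th pairs) can be sketched as follows. The function and label names are hypothetical; this is an illustration of the stated design, not the study's actual code.

```python
import random


def build_schedule(rng: random.Random, total: int = 30) -> list:
    """Return a list of task labels, one per fingerprint comparison."""
    schedule = ["match"] * total
    # Obvious mismatch: a randomly chosen position among pairs 2..5 (1-indexed).
    schedule[rng.randint(2, 5) - 1] = "obvious_mismatch"
    # Simulated attack: a randomly chosen position among pairs 25..29 (1-indexed).
    schedule[rng.randint(25, 29) - 1] = "attack"
    return schedule


def pick_attack_instance(rng: random.Random, instances: list):
    """Choose one of the pre-generated attack instances for this participant."""
    return rng.choice(instances)


rng = random.Random(2016)  # fixed seed only so the example is repeatable
schedule = build_schedule(rng)
instance = pick_attack_instance(rng, ["attack_a", "attack_b", "attack_c"])

print(schedule.count("match"), "matching pairs")
print("obvious mismatch at pair", schedule.index("obvious_mismatch") + 1)
print("attack at pair", schedule.index("attack") + 1, "using", instance)
```

Placing the attack late in the sequence means participants have already seen many benign matches by the time it appears.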

Applications can optionally use key stretching to increase re-
sistance to brute-force attacks. WhatsApp implements this
approach using iterated hashing [34]. The simulated attacks in
our study assume that hash strengthening techniques are not
applied. Thus, our attack detection results directly apply to
applications such as GnuPG and OpenSSH, which do not cur-
rently implement such defenses. However, the attack strengths
we test can be translated to applications that do use hash
strengthening. Specifically, if a fingerprint scheme employs
hash strengthening to require an additional 2^20 work per generated
fingerprint (reasonable even on mobile devices), then our
results for 2^60 attacks on fingerprints without strengthening
translate to 2^80 attacks on fingerprints with strengthening.
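
A minimal sketch of key stretching by iterated hashing, together with the work arithmetic from the paragraph above, might look like this. The use of SHA-1, the iteration count, and the function names are assumptions chosen for illustration; this is not the scheme used by WhatsApp or any other application.

```python
import hashlib


def stretched_fingerprint(public_key: bytes, iterations: int) -> str:
    """Hash the key repeatedly so each candidate fingerprint costs `iterations` hash calls."""
    digest = public_key
    for _ in range(iterations):
        digest = hashlib.sha1(digest).digest()
    return digest.hex()


def effective_attack_bits(attack_bits: int, stretch_bits: int) -> int:
    """Work (as a power of two) needed to mount the same attack against a stretched fingerprint."""
    # 2^attack_bits candidates, each costing 2^stretch_bits hashes: exponents add.
    return attack_bits + stretch_bits


# A small iteration count keeps the demonstration fast; 2^20 is the figure in the text.
print(stretched_fingerprint(b"hypothetical key bytes", 2 ** 10))
print(effective_attack_bits(60, 20))
```

The legitimate user pays the stretching cost once per fingerprint, while the attacker pays it for every candidate key.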

Experimental Factors

Representations
For textual representations, we chose to target a fingerprint
security level of 160 bits. This is the same fingerprint secu-
rity provided by current implementations of GnuPG, which
uses SHA-1 for its hash function. Where applicable, textual
representations were chunked in groups of four, with chunks
separated by spaces. This chunk size performed well in prior
work [6]. For our graphical representations, we evaluate the
representation implementation in its original form, leaving the
fingerprint security as is. Table 2 demonstrates our textual
representations and Figure 2 our graphical representations.
As we introduce the representations, we note the number of
bits representing the space of possibilities that can be gener-
ated using that representation with those parameters. For all
representations, both textual and graphical, the number of pos-
sibilities that a human can distinguish can only be determined
empirically, which is implicit in the rate at which participants
in our study detect attacks.

For textual representations, we tested hexadecimal (uppercase
and lowercase), alternating vowels/consonants, words, num-
bers, and sentences. Hexadecimal was 40 characters long (160
bits), numbers was 48 digits long (159.5 bits), alternating was
48 characters long (161.1 bits), and words was 16 words long
(155.7 bits). We selected words from Ogden’s Basic English
word list (see footnote 2). For sentences, we used an implementation based
on a deterministic sentence generator [1] (159.8 bits). For this
representation, the average number of sentences is 3.6 (max: 7)
and the average length of the longest sentence is 9 words (max:
12). Each of hex, numbers, and alternating vowels/consonants
was equally spread over two lines. Words were spread over 4
lines. For the sentences representation, each sentence began
on a separate line, wrapping where necessary.

Representation    Example fingerprint
Hexadecimal       BAAA 9AE6 7B8B 0D41 BD83
                  05E7 5209 8EDF 1058 41F6
Alt vow./cons.    bunu difu tura wefi wiwe haqe
                  tano haco qevu cori qife nufi
Words             learning equal education bent
                  collar religion new shelf
                  angle table train sad
                  keep meal thing punishment
Numbers           7748 5689 7453 6977 5604 5939
                  2765 8791 5022 4957 3805 0309
Sentences         The basket ends your right cat on his linen.
                  Her range repeats her nerve.
                  The smile tells secretly.
                  My clean cake pulls your waiting pocket.

Table 2: Textual fingerprint representations used for experiments.
For hexadecimal, we tested both uppercase and lowercase
variations.

Figure 2: Visual fingerprint representations: (a) OpenSSH Visual
Host Key, (b) Vash, (c) Unicorn.
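
The bit counts quoted for the textual representations follow from length times log2 of the alphabet size. The sketch below reproduces them under stated assumptions that are not given in the text: that the alternating representation uses 21 consonants and 5 vowels, and that the word list contains 850 words. The sentence generator's 159.8 bits depends on its grammar and is not recomputed here.

```python
import math

# Assumptions made for this illustration (not stated in the text):
VOWELS = 5          # a, e, i, o, u
CONSONANTS = 21     # remaining letters of the English alphabet
WORD_LIST_SIZE = 850


def bits(alphabet_size: int, length: int) -> float:
    """Entropy in bits of `length` independent symbols drawn uniformly from the alphabet."""
    return length * math.log2(alphabet_size)


entropy = {
    "hexadecimal": bits(16, 40),                                 # 40 hex characters
    "numbers": bits(10, 48),                                     # 48 decimal digits
    "alternating": bits(CONSONANTS, 24) + bits(VOWELS, 24),      # 48 letters, half of each kind
    "words": bits(WORD_LIST_SIZE, 16),                           # 16 words
}

for name, value in entropy.items():
    print(f"{name}: {value:.1f} bits")
```

With these assumptions the computed values agree with the figures reported for each representation, all close to the 160-bit target.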

For graphical representations, we test OpenSSH Visual Host
Key (≤ 128 bits), Vash (≈ 5,438 bits), and unicorns [32]
(≈ 2,854 bits) (see footnote 3). Visual Host Key was included because it
is widely deployed in SSH software. Prior work explored
Random Art fingerprints [23], of which Vash is an open-source
implementation. We included unicorns to test fingerprints that
use avatar-like representations. We limited consideration to
those representations whose entropy was large enough for use
as cryptographic fingerprints in asynchronous settings (i.e.,
at least 128 bits). Unicorn fingerprints were generated by a
program in which unicorn attributes were set according to
numbers drawn from a pseudorandom number generator. The
key to be hashed served as the seed to the generator. For Vash
fingerprints, which can be viewed abstractly as a graph, a similar
process was used to determine graph node types and properties,
which uniquely determine the fingerprint appearance.
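
Deriving avatar attributes from a key-seeded pseudorandom number generator can be sketched as below. The attribute names echo aspects mentioned in this paper (background hue, horn length), but the value ranges, the extra attribute, the use of SHA-256 to turn key bytes into an integer seed, and Python's `random.Random` as the generator are all assumptions for illustration, not the unicorn generator's actual design.

```python
import hashlib
import random


def avatar_attributes(public_key: bytes) -> dict:
    """Map a key deterministically to a set of drawable attributes."""
    # Assumption: hash the key to obtain a fixed-size integer seed.
    seed = int.from_bytes(hashlib.sha256(public_key).digest(), "big")
    rng = random.Random(seed)
    return {
        "background_hue": rng.randrange(360),     # degrees on a color wheel
        "horn_length": rng.randrange(10, 100),    # arbitrary units
        "mane_style": rng.choice(["curly", "straight", "braided"]),
    }


first = avatar_attributes(b"hypothetical key bytes")
second = avatar_attributes(b"hypothetical key bytes")

print(first)
print(first == second)  # the same key always yields the same avatar
```

Because every attribute is drawn from the same seeded stream, changing the key changes the whole avatar, while the total number of distinguishable avatars depends on how many attribute values a viewer can tell apart.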

2 http://ogden.basic-english.org/words.html
3 Given the computational difficulty of an exact calculation, for our
graphical representations we roughly estimate security.

Figure 3: Security task dialog for participants in the compare-
and-select, hexadecimal condition.

For all representations, we allowed the fingerprint to take
up a maximum of 50% of the card, horizontal or vertical.
For sentences, this required formatting the fingerprint using
a slightly smaller font size than for the other textual formats.
For textual representations, we advised participants that they
should ignore differences in font size or type. In addition, for
words and sentences, we informed participants that differences
in fingerprints due to misspelled words would not occur.

For each representation, we thus needed to pick the specific
aspects that would be matched and ensure that the reduction
in entropy matched the assumed attacker strength. Our overall
strategy was to match the aspects we expected users would
focus on during comparisons. For textual representations,
we primarily matched the beginning and end of lines. For
graphical representations, we mainly attempted to match “big
picture” elements, such as overall pattern and color. For Vash,
we matched node types up to a certain depth. For unicorns, we
chose to roughly match many aspects, such as background hue
and horn length. Images of each attack used in our experiment
are included as supplementary material.
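
As a rough illustration of how attacker strength bounds what can be matched, the sketch below assumes that fingerprint bits behave as uniformly random, so that an attacker able to try about 2^k candidate keys can expect to find one agreeing with the target on about k chosen bits (k/4 hexadecimal characters). The function names, the SHA-256-based `toy_fingerprint`, and the tiny 12-bit search are assumptions made for illustration; they are not the attack generator used in the study.

```python
import hashlib
from itertools import count


def matchable_hex_chars(attack_bits: int, bits_per_char: int = 4) -> int:
    """Hex characters an attacker can expect to control with 2**attack_bits tries."""
    return attack_bits // bits_per_char


def toy_fingerprint(key: bytes) -> str:
    """Hypothetical 128-bit fingerprint: first 32 hex digits of SHA-256."""
    return hashlib.sha256(key).hexdigest()[:32]


def find_partial_match(target: str, prefix_len: int, suffix_len: int) -> bytes:
    """Brute-force a key whose fingerprint matches the target's start and end."""
    prefix = target[:prefix_len]
    suffix = target[len(target) - suffix_len:]
    for i in count():
        candidate = b"attacker-key-%d" % i
        fp = toy_fingerprint(candidate)
        if fp.startswith(prefix) and fp.endswith(suffix):
            return candidate


for bits in (40, 60, 80):
    print(f"2^{bits} attack: about {matchable_hex_chars(bits)} matching hex characters")
```

Calling `find_partial_match(target, 2, 1)` fixes three hexadecimal characters (12 bits) and needs about 4,096 hashes on average; the same search at 2^60 is what the study simulates rather than performs.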

Comparison Mode
We tested both compare-and-confirm and compare-and-select,
as previously described. While compare-and-confirm is the tra-
ditional method of fingerprint comparison, we were interested
in the performance of compare-and-select, given its potential
benefits for inattentive users. An example of compare-and-
select is shown in Figure 3. Our implementation is similar to
that used in SafeSlinger [7]; three fingerprints are shown in a
random order, with two fingerprints randomly generated and
one fingerprint corresponding to the received fingerprint.
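
The option list for compare-and-select can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the SafeSlinger or study implementation: fingerprints are modeled as uppercase hexadecimal strings of even length, and the helper names are hypothetical.

```python
import random
import secrets


def random_fingerprint(n_hex_digits: int = 32) -> str:
    """Generate a random uppercase hex string to serve as a decoy."""
    return secrets.token_hex(n_hex_digits // 2).upper()


def build_select_options(received: str, n_decoys: int = 2) -> list:
    """Return the received fingerprint plus random decoys, in random order."""
    options = [received]
    while len(options) < n_decoys + 1:
        decoy = random_fingerprint(len(received))
        if decoy not in options:  # avoid an (extremely unlikely) duplicate
            options.append(decoy)
    random.SystemRandom().shuffle(options)
    return options


def selection_is_safe(card: str, chosen: str) -> bool:
    """A selection is safe only if it equals the fingerprint on the user's card."""
    return chosen == card
```

Under this model, an attack means the received fingerprint differs from the one on the card, so none of the three options matches the card and the safe action is to reject all of them.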

Visibility Mode
In most cases, participants tasked with comparing fingerprints
are able to see both fingerprints simultaneously. For example,
to compare a fingerprint on a business card to one on a com-
puter screen, the user can simply hold the business card up to
the screen to place the fingerprints side-by-side. However, in
certain use cases, it may not be possible or easy to view both
fingerprints simultaneously in order to compare. As an exam-
ple, many versions of Android do not have any easy way to
view two applications in a split-screen view. In this case, if the
user needs to compare fingerprints shown in two applications
(say, a fingerprint shown in a secure chat app and one shown
on a website), the user will need to toggle back and forth in
order to compare.

We tested situations in which fingerprints are both visible
simultaneously as well as situations in which the user must
toggle between them. We expected the need to toggle to affect
representations differently, as it may be easier to place certain
representations in short-term memory than others.

Attack Strength
We considered three different attack strengths: 2^40, 2^60, and
2^80. These strengths correspond to the estimated capabilities
of an attack performed on commodity hardware, an attack
performed by a well-funded criminal organization, and an
attack performed by a state-sponsored actor.

Other Factors
We included two other experimental factors. For hexadecimal,
we tested the impact of letter case on performance. We were
interested in this because both types have been used for real
security applications (e.g., lowercase in OpenSSH and upper-
case in GnuPG). We also tested a variation of our activity in
which the target time-to-beat was doubled, from 540 seconds
to 1080 seconds. We tested this to provide insight into the
extent to which our results depend on the specific target time
that was used.

Experimental Conditions
Participants were randomly assigned to one of 17 experimental
conditions, which determined the specific fingerprint repre-
sentation, configuration, and attacker strength used for that
participant in the activity. Eight of our conditions assumed an
attacker strength of 2^60 and varied only according to fingerprint
representation. To explore the effect of different attacker
strength assumptions, we tested four additional conditions:
2^40 attacks on uppercase hexadecimal and unicorns and 2^80
attacks on uppercase hexadecimal and Visual Host Key.

The four remaining experimental conditions were designed to
be compared with our baseline hexadecimal condition, namely
uppercase hex using the compare-and-confirm configuration in
which both fingerprints to be compared were simultaneously
visible, with an assumed attacker strength of 2^60. These conditions
varied from the baseline condition with respect to only
one factor: compare-and-select (comparison mode), toggle
(visibility mode), lowercase (hex letter case), and twice the
target time-to-beat.

We performed some modifications to test fingerprint compar-
isons under adverse conditions. In all conditions, fingerprints
were displayed in different font types and sizes. The business
cards were presented randomly tilted up to 10 degrees. For
the Vash and unicorn representations, we applied a random
gamma correction to the fingerprint shown on the computer
screen section in the range [0.8, 1.2], in order to simulate
the effect of an improperly calibrated display. For sentences,
we showed fingerprint pairs formatted such that line breaks
occurred in different places.
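
The display perturbations can be sketched as follows. This is a minimal sketch under assumptions the text does not settle: gamma is applied per 8-bit channel as `out = 255 * (in / 255) ** gamma`, the tilt is drawn uniformly and may go in either direction, and the function names are hypothetical.

```python
import random


def apply_gamma(pixel: int, gamma: float) -> int:
    """Gamma-correct one 8-bit channel value (0 to 255)."""
    return round(255 * (pixel / 255) ** gamma)


def random_display_conditions(rng: random.Random) -> dict:
    """Draw one set of perturbations: gamma in [0.8, 1.2], tilt within 10 degrees."""
    return {
        "gamma": rng.uniform(0.8, 1.2),
        "tilt_degrees": rng.uniform(-10.0, 10.0),
    }


conditions = random_display_conditions(random.Random())
print(conditions, apply_gamma(128, conditions["gamma"]))
```

With this convention, a gamma above 1 darkens mid-tones and a gamma below 1 brightens them, while pure black and pure white are unchanged.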

Statistical Analysis
We performed hypothesis tests for each metric we measured.
To test for significant effects with respect to the proportion
of security failures across conditions, we used Pearson’s chi-squared
test (for omnibus tests) and Fisher’s exact test (to
compare two conditions). To test for significant effects for
median comparison time, number of false positives, and Likert
ratings, we used the Kruskal-Wallis test (for omnibus tests) and
the Mann-Whitney-Wilcoxon test (to compare two conditions).
Given the large number of comparisons we made, we applied
the Holm-Bonferroni method and report corrected p-values.
All hypothesis tests used a significance value of α = 0.05.
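
For readers unfamiliar with the correction, the Holm-Bonferroni step-down procedure can be sketched in a few lines. This is a generic textbook implementation, not the authors' analysis code, and the p-values in the example are made up.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    # Visit hypotheses from the smallest raw p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up raw p-values from three pairwise comparisons.
print(holm_bonferroni([0.01, 0.04, 0.03]))
```

A comparison is then declared significant when its adjusted p-value is below α = 0.05.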

Limitations
For each representation, we used an attack strategy to maxi-
mize similarity to the target fingerprint for particular aspects
(or in particular locations) of that fingerprint. Our goal was
to find the most effective attack given a fixed computational
budget. Nonetheless, it is unlikely that we selected the best
possible attack for each representation. Given this limitation,
we believe that our statistical power, sufficient to detect only
large differences in attack detection rate, is appropriate. Such
large differences, if they exist, may indicate weakness intrinsic
to a representation or configuration, not only in our attack
strategy.

Our participants were recruited from MTurk, which is not
representative of the general U.S. population; MTurk workers
have been found to be younger and better educated [15].

To habituate participants to benign security scenarios in a short
amount of time, we had to sacrifice some realism. However,
feedback from participants indicates that we achieved our goal
of recreating the types of conditions under which people would
likely perform fingerprint comparisons as a security task.

Since the primary goal of most MTurk users is to get paid, and
since we tied the bonus payment to participants’ performance
on the security tasks in our activity, the security tasks were
not secondary to some other task, as would be the case in the
real world. However, as previously explained, we believe we
captured the desirable characteristics intrinsic to security as a
secondary task.

Participants
A total of 677 participants completed our MTurk HIT. We
excluded 16: 3 participants used an Android or iOS device, 3
encountered technical issues, and 10 failed an attention check.
The average time spent on our HIT was 14 minutes and 12
seconds. We paid all workers that accepted our HIT.

Our reduced sample consisted of 661 participants. Participants’
ages ranged from 18 to 74 with an average of 33 years (σ =
9.7). Participants were 44% female and 55% male, with 1%
choosing not to specify. The two most common education
levels were a four-year college degree (35%) and some years of
college without finishing (27%). The most frequently reported
occupations were service (16%); business, management, or
financial (13%); and computer engineering or IT professional
(12%). We considered participants as technical if two out of
three of the following were true: they listed their occupation
type as computer engineering or IT professional, they knew
a programming language, or they indicated that people often
asked them for computer-related advice. According to this
definition, 18% of participants qualified as technical.
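
The classification rule can be written as a small predicate. The sketch assumes that "two out of three" means at least two, and the parameter names are hypothetical labels for the three survey items.

```python
def is_technical(it_occupation: bool,
                 knows_programming_language: bool,
                 asked_for_computer_advice: bool) -> bool:
    """Classify a participant as technical if at least two criteria hold."""
    criteria = [it_occupation, knows_programming_language, asked_for_computer_advice]
    return sum(criteria) >= 2


# A participant who programs and is often asked for advice, but is not in IT.
print(is_technical(False, True, True))
```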

Condition                               Frac. Missed   M(comp)   # Part.

hex, confirm, bothvis, 2^60             0.21            8.03     42
num, confirm, bothvis, 2^60             0.35            8.09     43
alt, confirm, bothvis, 2^60             0.17            8.71     40
word, confirm, bothvis, 2^60            0.14            6.70     42
sent, confirm, bothvis, 2^60            0.06            7.57     33
ssh, confirm, bothvis, 2^60             0.10            5.51     42
uni, confirm, bothvis, 2^60             0.54            2.04     39
vash, confirm, bothvis, 2^60            0.12            2.17     33

hex, select, bothvis, 2^60              0.72            5.21     47
hex, confirm, toggle, 2^60              0.30            9.79     40
vash, confirm, toggle, 2^60             0.27            5.18     33
hex, confirm, bothvis, 2^40             0.06            9.20     31
uni, confirm, bothvis, 2^40             0.67            1.84     42
hex, confirm, bothvis, 2^80             0.37            8.80     43
ssh, confirm, bothvis, 2^80             0.25            4.10     32
hex, confirm, bothvis, 2^60, 2x time    0.10           11.04     40
hex (low), confirm, bothvis, 2^60       0.21            8.22     39
Table 3: Summary statistics by condition, including me-
dian comparison time (M(comp)), fraction of participants that
missed the attack, and total number of participants.

RESULTS
We first describe the performance of different fingerprint rep-
resentations. We then describe the effects of different ways
of eliciting confirmation (compare-and-confirm vs. compare-
and-select) and varying whether users could see the two fin-
gerprints they were comparing one at a time or both at once.
Finally, we discuss participants’ self-reported strategies for
comparing fingerprints. An overview of our results is provided
in Table 3.

Representations

Attack Detection Rate
The fraction of participants who failed to notice a simulated
2^60 attack varied significantly by condition (χ² = 131.93,
df = 16, p < .001). The best performing representation was
sentences, causing participants to miss just 6% of attacks; uni-
corns, surprisingly, were worst, with participants missing 54%
of attacks. Our baseline, the hexadecimal representation, was
roughly in the middle, with 21% of participants missing the
attack. The uppercase and lowercase variants of hexadecimal
had an attack success rate within 1% of each other. Figure 4
summarizes these results.

The difference in attack success rate between hexadecimal
(21%) and unicorns (54%) was borderline significant (p =
.052). Unicorns performed significantly worse than both Vash
(p = .003) and Visual Host Key (p < .001). In contrast to
related work by Dechand et al. [6], we did not observe any sta-
tistically significant differences in attack success rate between
textual representations.

We also tested whether participants who fell under our defini-
tion of technical were more successful at detecting attacks than
those who did not. Technical users were better at detecting
attacks (80% to 69%), but this difference was not statistically
significant after correction (p = .08). Similarly, for the three
representations in which we varied attack strength (Visual
Host Key, uppercase hexadecimal, unicorns), we did not observe
any statistically significant differences in the fraction of
participants that missed an attack between attacks of different
strengths.

[Figure 4 plot: one row per condition; x-axis: Fraction that Missed Attack (0.0 to 0.6); legend: Attack ID 1, 2, 3.]

Figure 4: Fraction of participants in each condition that missed
the attack, grouped by condition and attack instance. The
attack IDs correspond to three different attacks we generated
for each configuration.

[Figure 5 plot: one row per condition; x-axis: Time (seconds), 0 to 30.]

Figure 5: Median comparison time by condition.

Comparison Time
The median time spent comparing fingerprints4 ranged from
2.0 seconds for the unicorns representation to 8.7 seconds for
the alternating vowels/consonants representation. Graphical
representations were generally faster to compare than textual
ones. The median comparison times for each representation
are shown in Figure 5.

Differences in comparison time between conditions were significant
(χ² = 289.39, df = 16, p < .001). Looking at individual
conditions, the median comparison time was significantly
lower for unicorns (2.0 s) compared to both hexadecimal (8.0
s; p < .001) and Visual Host Key (5.51 s; p < .001). In con-
trast to related work by Dechand et al. [6], we did not observe
any statistically significant differences in comparison time
between textual representations.

Subjective Ratings
We asked participants for their subjective ratings before we
revealed whether they had missed the attack. For all repre-
sentations, most participants (70% for alternating to 91% for
Vash) believed that the time it took to compare fingerprints
was reasonable for a security check. The majority also thought
that it was easy to compare fingerprints and were confident
that they could do so correctly. We did not observe statistically
significant differences in ratings by condition for confidence
(p = .386), ease of use (p = .102), or reasonableness of comparison
time (p = .117).

4For participants in experimental conditions where the time to beat
was set at 540 seconds.

Compare-and-select
Compare-and-select participants did not spend much time
comparing fingerprints. Of the configurations involving textual
representations, the compare-and-select configuration had the
lowest median comparison time (5.2 seconds), though this
difference was not statistically significant.

Across all experimental conditions, the compare-and-select
condition had the lowest attack detection rate; 72% of partic-
ipants missed the simulated attack. The difference in attack
detection rate between our baseline compare-and-confirm hex-
adecimal condition and the compare-and-select hexadecimal
condition was statistically significant (p < .001).

Toggle Use Case
Our toggle use case explored the effect of requiring users to
toggle back and forth between fingerprints in order to make
comparisons, as opposed to being able to view both simul-
taneously. For hexadecimal, we did not observe statistically
significant differences in attack detection rate or comparison
time between toggle and simultaneously visible configurations.
For Vash, only the difference in comparison time between toggle
and simultaneously visible configurations was statistically
significant (median time of 5.2 seconds when toggling, compared
to 2.2 seconds when not; p < .001).

While it is not surprising that participants would take longer
to compare fingerprints when they have to toggle between two
views to perform that comparison, it is interesting to compare
the interaction between visibility mode and representation.
While participants did not take significantly longer comparing
hexadecimal fingerprints when they had to toggle (median tog-
gle: 9.8 s; both visible: 8.0 s), they did take significantly longer
comparing Vash fingerprints when they had to toggle (median
toggle: 5.2 s; both visible: 2.2 s). One explanation for this is
that fingerprints based on abstract art are difficult to commit
to memory, and so require more toggles (and thus more time)
to compare than textual representations like hexadecimal.

Target Time-to-beat
We set the time-to-beat to 1080 seconds in one condition,
which allowed approximately 36 seconds per task. All partici-
pants finished before the time-to-beat, with a median elapsed
activity time (reflected in the on-screen stopwatch) of 716
seconds. The median comparison time for participants in the
2x time configuration was significantly different from those in
the comparable configuration where the time-to-beat was 540
seconds (11.03 compared to 8.02 s; p = .005). The fraction of
attacks missed did not differ significantly between the 2x time
configuration (10% missed) and the corresponding regular
configuration (21% missed).

Comparison Strategies
Participants had a variety of strategies for comparing finger-
prints. For textual fingerprints, participants often compared
some subset of the beginning, middle, and end of fingerprints.
Some participants also chose to compare random parts of the
fingerprint, including one participant who was shown the hex-
adecimal representation, who said: “I first checked the last set
of numbers, then randomly glanced at other sections until I
felt I had verified enough values.”

Other participants compared fingerprints in reading order, and
instead focused on methods for efficiently doing so. For ex-
ample, some participants had a strategy similar to the one
described by a participant shown uppercase hexadecimal: “I
would quickly read a segment and shift my eyes over as I
repeated it, then compare and immediately/cross over into
reading the next set from the other card to myself as I shifted
back and compared to the next set of numbers, etc. – constant
crossovers, but not having to cross over without going towards
the next step to reduce time.” Interestingly, one participant
chose to compare hexadecimal fingerprints in reverse reading
order: “I started with the rightmost set of numbers on the
first line of the card, found one of the offered fingerprints that
matched, then compared numbers back and forth in reverse
reading order. I thought it would be easier to be accurate if
using a technique that wouldn’t make me fall into an easy
‘reading’ mode.”

For all textual fingerprint representations besides sentences,
fingerprints were presented in chunks of a fixed size, which
provide a natural unit of comparison.5 Indeed, many participants
reported comparing fingerprints chunk-by-chunk. However,
some participants chose to devise their own chunking
strategy, a strategy prior work in the area of system-generated
PINs has observed [12]. Many participants chose to com-
pare multiple hexadecimal chunks at a time. Other reported
units of comparison for textual representations included sen-
tences, rows, and columns. Participants also described chunk-
ing strategies for graphical representations, such as comparing
fingerprints by quadrant. Interestingly, more than one partic-
ipant treated the visual host key representation more akin to
textual representations, and compared in units of lines of text.
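
The fixed-size chunking that participants relied on can be illustrated with a small formatter. This is a sketch only: the chunk size of four characters, the two-line layout, and the function name are assumptions chosen to resemble the hexadecimal presentation, not the study's rendering code.

```python
def format_hex_fingerprint(hex_digits: str, chunk: int = 4,
                           lines: int = 2, upper: bool = True) -> str:
    """Split a hex fingerprint into space-separated chunks spread over lines."""
    digits = hex_digits.upper() if upper else hex_digits.lower()
    chunks = [digits[i:i + chunk] for i in range(0, len(digits), chunk)]
    if not chunks:
        return ""
    per_line = -(-len(chunks) // lines)  # ceiling division
    rows = [" ".join(chunks[i:i + per_line])
            for i in range(0, len(chunks), per_line)]
    return "\n".join(rows)


print(format_hex_fingerprint("0123456789abcdef0123456789abcdef"))
```

Each chunk gives the reader a fixed reference point, which is what makes chunk-by-chunk, row-by-row, or column-by-column comparison possible.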

Although some participants compared graphical representa-
tions according to a chunking strategy, more commonly par-
ticipants mentioned comparing particular features or charac-
teristics of the specific fingerprints shown. One participant
shown Vash said they tried “to pick out a couple things that
might be different between pictures and then alternate between
them to see if they are different.” For Visual Host Key, one
participant’s strategy was to “look at general cues like the
placement of the dots or big letters like B or S or E.”

Participants’ strategies sometimes distinguished between the
size of differences they looked for when performing compar-
isons, particularly for graphical representations. Some partic-
ipants only checked for large differences. Others adopted a
layered approach in which they looked for large differences
first, followed by a search for more subtle differences. For
example, one participant described this strategy for comparing
Visual Host Key fingerprints: “The first thing I did was to
try and glance at the whole fingerprint and see if anything
jumped out at me. If I saw no difference by looking at the
basic overview of it then I looked with a little more detail. If
I was still unsure of its validity then I examined each line as
quickly and accurately as possible.”

DISCUSSION

Impact of Methodology
A main difference between our study and closely related
work [6] was our focus on examining practical effects such as
habituation, stress, and difficulties in comparison caused by
variations in color, font, and position of the two fingerprints
relative to each other. Based on both quantitative and qualita-
tive data, we believe we successfully simulated some of these
practical constraints, which we show do affect user behavior.

For example, in practice, attacks are likely to be few and far
between. We simulated this by asking users to perform many
comparisons of matching fingerprints before exposing them
to an attack. Failing to find differences after several comparisons,
most participants refined their strategy to compare
only selected parts of fingerprints, which in turn increased the
chance that they would miss attacks. Our experiment appeared
to successfully simulate situations where users feel pressed
for time. When asked what the hardest thing about the activity
was, many participants responded that they felt stressed,
pressured, distracted, or in a rush.

5For the sentence representation, individual sentences served a similar
purpose, though the way we presented it made it difficult to
immediately discern sentences as individual units.

To verify that results weren’t unduly driven by participants’
need to finish tasks quickly, we included a condition in which
participants had twice the time to complete the activity. In this
condition, participants took only about 37% more time for each
comparison, suggesting that time pressure was no longer a
significant factor. These participants did make fewer mistakes
(though the difference was not significant) but continued to
miss attacks, suggesting both that time pressure in the study
was at least partly successful at simulating real-life stress and
that this stress was not the only factor that led to mistakes.

Compare-and-select vs. Compare-and-confirm
We were surprised by how susceptible the compare-and-select
method of verifying fingerprints was to attacks. Compare-and-
select seeks to ease comparisons by offering users multiple
options from which to select a fingerprint that matches another
one that they are looking at. Over time, however, the compare-
and-select approach appears to train users that one of the
options is always correct (since some options always aren’t).
More specifically, at the end of the study we asked participants
to report their concern about false negatives (saying that the
fingerprints did not match when in fact they did) and false pos-
itives (saying that two fingerprints matched when they did not).
While there was no significant difference in the rate at which
compare-and-confirm and compare-and-select participants reported
concern about false negatives (p = .657), compare-and-select
participants were significantly less worried about false
positives, i.e., failing to detect an attack (p < .001).

At the same time, while current implementations of compare-
and-select appear to be a poor fit at least for fingerprint comparisons,
there has been discussion of a compare-and-select
approach where the options are chosen to be visually simi-
lar. A potential benefit of this approach is that users would
be forced to focus on small details (since through a cursory
comparison all options would look alike), leading to more
effective security. A potential downside is the usability cost of
performing detailed comparisons between all the options that
need to be compared.

Desirable Properties and Tradeoffs
For both textual and graphical representations, participants
struggled to decide how detailed a comparison to perform. For
graphical representations, participants noted slight differences
in color (as could potentially be caused by comparing a finger-
print printed on a business card to one on a computer screen)
that caused uncertainties.

Participants shown graphical fingerprints tended to look at the
big picture more often. While this is fine if small differences
do not exist, it may be feasible for a determined attacker to
find a key whose fingerprint is overall similar to the target
fingerprint but different in small details, as was the case for
our unicorn condition.

One advantage of textual formats like hexadecimal over image-
based formats like Vash is that the former allow a user to
know for certain whether two fingerprints are the same. For
textual formats, a motivated user can check each digit and
compare. For Vash, manual human comparison can only go
so far; the user cannot confirm each pixel value through just
visual inspection, and the representation does not convey what
the smallest difference the user should look for is.

Another advantage of textual representations relates to the
fact that they easily lend themselves to being segmented into a
particular structure, e.g., chunks of four characters, lines of
text, etc. This structure seems to offer participants a useful
reference point at semantically arbitrary locations within a
fingerprint. Participants reported (unknowingly) taking advan-
tage of this structure by making multiple detailed comparisons
between various parts of corresponding fingerprints. This kind
of behavior seems likely to make successful attacks less likely.

Recommendations
Overall, all the representations and configurations we experi-
mented with exhibited higher rates of successful attack than
seems desirable for high-risk situations. This strongly suggests
that additional effort should be put towards removing
the human in the loop, e.g., by using a smartphone camera to
capture a printed fingerprint and having smartphone software
make the comparison. When manual fingerprint comparison is
necessary, the right choice likely depends on the context, since
the different fingerprint representations we experimented with
showed substantially different security and usability proper-
ties.

For all representations we tested, we observed participants
making rational (if not always well informed) assumptions
about how to go about comparing fingerprints. Graphical
representations in general seemed to be more susceptible to
comparison strategies that ignored fine details; at the same
time, they allowed seemingly easy and quick comparisons.
Consequently, unless the representation is accompanied by
measures to help the user compare small details between two
fingerprints, graphical representations appear not to be well
suited for high-risk situations, but could be of benefit in low-
risk environments, when attackers are not likely to be strong
and usability is paramount.

When security is paramount, the best option is likely one we
did not test: manually copying a printed fingerprint into a de-
vice and having software on that device make the comparison.
This virtually eliminates the possibility of missing an attack,
but at a high usability cost. For situations in which risk is
not high and there is a need to balance security and usability,
textual representations like hex (but also others like ASCII art
and sentences) may be appropriate.
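
The software-side comparison recommended above can be sketched as follows, assuming a hexadecimal fingerprint that the user types in from a printed card. The normalization rules (ignoring whitespace, colons, and letter case) and the function names are assumptions made for illustration; `hmac.compare_digest` is the standard-library constant-time comparison.

```python
import hmac


def normalize(fingerprint: str) -> str:
    """Drop whitespace and colons and fold case so formatting does not matter."""
    return "".join(fingerprint.split()).replace(":", "").lower()


def fingerprints_match(typed: str, computed: str) -> bool:
    """Compare a typed-in fingerprint against the locally computed one."""
    return hmac.compare_digest(normalize(typed).encode("ascii"),
                               normalize(computed).encode("ascii"))


print(fingerprints_match("0123 4567 89AB CDEF", "0123456789abcdef"))
```

Unlike a human reader, the comparison checks every character, so a near-match that differs in a single digit is rejected.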

REFERENCES
1. akwizgran. 2014. Basic English: Encode random
bitstrings as pseudo-random poems. (2014).
https://github.com/akwizgran/basic-english

2. J. Callas, L. Donnerhacke, H. Finney, D. Shaw, and R.
Thayer. 2007. OpenPGP message format. (2007).
https://tools.ietf.org/html/rfc4880

3. Sandy Clark, Travis Goodspeed, Perry Metzger, Zachary
Wasserman, Kevin Xu, and Matt Blaze. 2011. Why
(special agent) Johnny (still) can’t encrypt: A security
analysis of the APCO Project 25 two-way radio system.
In Proceedings of the 20th USENIX Conference on
Security (SEC’11).
http://dl.acm.org/citation.cfm?id=2028067.2028071

4. Terrence Cole. 2011. Vash: Visually pleasing and distinct
abstract art, generated uniquely for any input data. (2011).
https://github.com/thevash/vash

5. Colin Davis. 2016. Robohash. (2016).
https://robohash.org/

6. Sergej Dechand, Dominik Schürmann, Karoline Busse,
Yasemin Acar, Sascha Fahl, and Matthew Smith. 2016.
An empirical study of textual key-fingerprint
representations. In 25th USENIX Security Symposium
(USENIX Security 16).
https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/dechand

7. Michael Farb, Yue-Hsun Lin, Tiffany Hyun-Jin Kim,
Jonathan McCune, and Adrian Perrig. 2013. SafeSlinger:
Easy-to-use and secure public-key exchange. In
Proceedings of the 19th Annual International Conference
on Mobile Computing & Networking (MobiCom ’13).
DOI:http://dx.doi.org/10.1145/2500423.2500428

8. J. Galbraith and R. Thayer. 2006. The Secure Shell (SSH)
public key file format. (2006).
https://www.ietf.org/rfc/rfc4716.txt

9. Simson L. Garfinkel and Robert C. Miller. 2005. Johnny
2: A user test of key continuity management with
S/MIME and Outlook Express. In Proceedings of the
2005 Symposium on Usable Privacy and Security
(SOUPS ’05). DOI:
http://dx.doi.org/10.1145/1073001.1073003

10. Shirley Gaw, Edward W. Felten, and Patricia
Fernandez-Kelly. 2006. Secrecy, flagging, and paranoia:
Adoption criteria in encrypted email. In Proceedings of
the SIGCHI Conference on Human Factors in Computing
Systems (CHI ’06). DOI:
http://dx.doi.org/10.1145/1124772.1124862

11. Hsu-Chun Hsiao, Yue-Hsun Lin, Ahren Studer,
Cassandra Studer, King-Hang Wang, Hiroaki Kikuchi,
Adrian Perrig, Hung-Min Sun, and Bo-Yin Yang. 2009. A
study of user-friendly hash comparison schemes. In 2009
Annual Computer Security Applications Conference. DOI:
http://dx.doi.org/10.1109/ACSAC.2009.20

12. Jun Ho Huh, Hyoungshick Kim, Rakesh B. Bobba,
Masooda N. Bashir, and Konstantin Beznosov. 2015. On
the memorability of system-generated PINs: Can
chunking help?. In Eleventh Symposium On Usable
Privacy and Security (SOUPS ’15).
https://www.usenix.org/conference/soups2015/proceedings/presentation/huh

13. Antti Huima. 2000. The Bubble Babble binary data
encoding. (2000).
http://web.mit.edu/kenta/www/one/bubblebabble/spec/jrtrjwzi/draft-huima-01.txt

14. Ronald Kainda, Ivan Flechais, and A. W. Roscoe. 2009.
Usability and security of out-of-band channels in secure
device pairing protocols. In Proceedings of the 5th
Symposium on Usable Privacy and Security (SOUPS ’09).
DOI:http://dx.doi.org/10.1145/1572532.1572547

15. Ruogu Kang, Stephanie Brown, Laura Dabbish, and Sara
Kiesler. 2014. Privacy attitudes of Mechanical Turk
workers and the U.S. public. In Proceedings of the Tenth
Symposium on Usable Privacy and Security (SOUPS ’14).
https://www.usenix.org/conference/soups2014/proceedings/presentation/kang

16. Alfred Kobsa, Rahim Sonawalla, Gene Tsudik, Ersin
Uzun, and Yang Wang. 2009. Serial hook-ups: A
comparative usability study of secure device pairing
methods. In Proceedings of the 5th Symposium on Usable
Privacy and Security (SOUPS ’09). DOI:
http://dx.doi.org/10.1145/1572532.1572546

17. Arun Kumar, Nitesh Saxena, Gene Tsudik, and Ersin
Uzun. 2009. Caveat emptor: A comparative study of secure
device pairing methods. In 2009 IEEE International
Conference on Pervasive Computing and
Communications. DOI:
http://dx.doi.org/10.1109/PERCOM.2009.4912753

18. Raph Levien and Donald Johnson. 1998. Snowflake.
(1998). http://dlakwi.net/snowflake/snowflake.html

19. Dirk Loss, Tobias Limmer, and Alexander von Gernler.
2009. The drunken bishop: An analysis of the OpenSSH
fingerprint visualization algorithm. (2009).
http://dirk-loss.de/sshvis/drunken_bishop.pdf

20. Skylar Nagao. 2016. Avatars. (Oct 2016).
https://peerio.zendesk.com/hc/en-us/articles/202729949-Avatars

21. Off-the-Record Messaging. 2016. Fingerprints. (2016).
https://otr.cypherpunks.ca/help/fingerprint.php

22. OpenSSH. 2008. OpenSSH 5.1 release announcement.
(2008). https://www.openssh.com/txt/release-5.1

23. Adrian Perrig and Dawn Song. 1999. Hash visualization:
A new technique to improve real-world security. (1999).
https://users.ece.cmu.edu/~adrian/projects/validation/

24. Plasmoid. 2003. Fuzzy fingerprints: Attacking
vulnerabilities in the human brain. (2003).
https://www.thc.org/papers/ffp.pdf

25. David Roundy. 2014. Visual hash. (2014).
http://visual-hash.readthedocs.io/en/latest/

26. Scott Ruoti, Nathan Kim, Ben Burgon, Timothy van der
Horst, and Kent Seamons. 2013. Confused Johnny: When
automatic encryption leads to confusion and mistakes. In
Proceedings of the Ninth Symposium on Usable Privacy
and Security (SOUPS ’13). DOI:
http://dx.doi.org/10.1145/2501604.2501609

27. Maliheh Shirvanian and Nitesh Saxena. 2015. On the
security and usability of crypto phones. In Proceedings of
the 31st Annual Computer Security Applications
Conference (ACSAC 2015). DOI:
http://dx.doi.org/10.1145/2818000.2818007

Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA


28. Ryan Stedman, Kayo Yoshida, and Ian Goldberg. 2008. A
user study of Off-the-Record Messaging. In Proceedings
of the 4th Symposium on Usable Privacy and Security
(SOUPS ’08). DOI:
http://dx.doi.org/10.1145/1408664.1408678

29. Marc Stevens, Pierre Karpman, and Thomas Peyrin. 2016.
Freestart collision for full SHA-1. In Proceedings of the
35th Annual International Conference on Advances in
Cryptology (EUROCRYPT 2016). DOI:
http://dx.doi.org/10.1007/978-3-662-49890-3_18

30. Zettabyte Storage. 2011. Vash: The visual hash. (2011).
https://web.archive.org/web/20130127121849/http://www.thevash.com

31. Nik Unger, Sergej Dechand, Joseph Bonneau, Sascha
Fahl, Henning Perl, Ian Goldberg, and Matthew Smith.
2015. SoK: Secure messaging. In Proceedings of the
2015 IEEE Symposium on Security and Privacy (SP ’15).
DOI:http://dx.doi.org/10.1109/SP.2015.22

32. Ben Dumke v. d. Ehe. 2012. Unicornify! How does it
work? (2012). https://unicornify.appspot.com/making-of

33. Serge Vaudenay. 2005. Secure communications over
insecure channels based on Short Authenticated Strings.
In Proceedings of the 25th Annual International
Conference on Advances in Cryptology (CRYPTO’05).
DOI:http://dx.doi.org/10.1007/11535218_19

34. WhatsApp. 2016. WhatsApp encryption overview:
Technical white paper. (April 2016).
https://www.whatsapp.com/security/WhatsApp-Security-Whitepaper.pdf

35. Alma Whitten and J. D. Tygar. 1999. Why Johnny can’t
encrypt: A usability evaluation of PGP 5.0. In
Proceedings of the 8th Conference on USENIX Security
Symposium (SSYM’99).
http://dl.acm.org/citation.cfm?id=1251421.1251435

36. Wickr. 2016. What is the key verification feature? (2016).
https://wickr.desk.com/customer/en/portal/articles/2342342-what-is-the-key-verification-feature-

37. Min Wu, Robert C. Miller, and Greg Little. 2006. Web
Wallet: Preventing phishing attacks by revealing user
intentions. In Proceedings of the Second Symposium on
Usable Privacy and Security (SOUPS ’06). DOI:
http://dx.doi.org/10.1145/1143120.1143133


  • Introduction
  • Background and Related Work
    • Fingerprint Applications
    • Usability of Fingerprints
    • Entropy as a Security Metric
  • Methodology
    • Security Task Design Considerations
    • Threat Model and Simulated Attack
    • Experimental Factors
      • Representations
      • Comparison Mode
      • Visibility Mode
      • Attack Strength
      • Other Factors
    • Experimental Conditions
    • Statistical Analysis
    • Limitations
    • Participants
  • Results
    • Representations
      • Attack Detection Rate
      • Comparison Time
      • Subjective Ratings
    • Compare-and-select
    • Toggle Use Case
    • Target Time-to-beat
    • Comparison Strategies
  • Discussion
    • Impact of Methodology
    • Compare-and-select vs. Compare-and-confirm
    • Desirable Properties and Tradeoffs
    • Recommendations
  • References

This paper is included in the Proceedings of the
Thirteenth Symposium on Usable Privacy and Security (SOUPS 2017).

July 12–14, 2017 • Santa Clara, CA, USA

ISBN 978-1-931971-39-3

Open access to the Proceedings of the
Thirteenth Symposium on Usable Privacy and Security
is sponsored by USENIX.


https://www.usenix.org/conference/soups2017/technical-sessions/presentation/vaziripour

Is that you, Alice? A Usability Study of the Authentication
Ceremony of Secure Messaging Applications

Elham Vaziripour, Justin Wu, Mark O’Neill, Ray Clinton, Jordan Whitehead,
Scott Heidbrink, Kent Seamons, Daniel Zappala

Brigham Young University
{elhamvaziripour,justinwu,mto,rclinton,jaw,sheidbri}@byu.edu, {seamons,zappala}@cs.byu.edu

ABSTRACT
The effective security provided by secure messaging applica-
tions depends heavily on users completing an authentication
ceremony—a sequence of manual operations enabling users
to verify they are indeed communicating with one another.
Unfortunately, evidence to date suggests users are unable
to do this. Accordingly, we study in detail how well users
can locate and complete the authentication ceremony when
they are aware of the need for authentication. We execute
a two-phase study involving 36 pairs of participants, using
three popular messaging applications with support for secure
messaging functionality: WhatsApp, Viber, and Facebook
Messenger. The first phase included instruction about poten-
tial threats, while the second phase also included instructions
about the importance of the authentication ceremony. We
find that, across the three apps, the average success rate
of finding and completing the authentication ceremony
increases from 14% to 79% from the first to the second phase,
with second-phase success rates as high as 96% for Viber.
However, the time required to find and complete the cere-
mony is undesirably long from a usability standpoint, and
our data is inconclusive on whether users make the connec-
tion between this ceremony and the security guarantees it
brings. We discuss in detail the success rates, task timings,
and user feedback for each application, as well as common
mistakes and user grievances. We conclude by exploring user
threat models, finding significant gaps in user awareness and
understanding.

1. INTRODUCTION
Recent disclosures of government surveillance and fears over
cybersecurity attacks have increased public interest in secure
and private communication. As a result, numerous secure
messaging applications have been developed, including Signal,
WhatsApp, and Viber, which provide end-to-end encryption
of personal messages [19].

Most popular secure messaging applications are usable
because they hide many of the details of how encryption is
provided. Indeed, people are primarily using these
applications due to peer influence, not due to concern over
privacy or security [5].

Copyright is held by the author/owner. Permission to make digital or hard
copies of all or part of this work for personal or classroom use is granted
without fee.
Symposium on Usable Privacy and Security (SOUPS) 2017, July 12–14,
2017, Santa Clara, California.

The strength of the security properties of these applications
rests on the authentication ceremony, in which users vali-
date the encryption keys being used. Unfortunately, there
is evidence that most users do not know how to successfully
complete this ceremony and are thus vulnerable to potential
attacks [15]. Any user who does not execute the authenti-
cation ceremony for a particular conversation is essentially
trusting the application’s servers to correctly distribute the
encryption keys. This leaves users vulnerable to compromise
threats that can intercept communications.

Several recent papers have shown that the authentication
ceremony in secure messaging applications is difficult to
use and prone to failure. A study of Signal showed that
users, all of whom were computer science students, were
highly vulnerable to active attacks [15]. A comparison of
WhatsApp, Viber, Telegram, and Signal found that most
users were unable to properly authenticate [8], though after
being instructed on what to do most users were subsequently
able to authenticate after a key reset.

This state of affairs motivates our study, which examines to
what extent users can successfully locate and complete the
authentication ceremony in secure messaging applications if
they are aware of the need for authentication. To answer
this question, we conduct a two-phase user study of Whats-
App [22], Facebook Messenger [7], and Viber [21]. We chose
these applications because of their popularity and their differ-
ent designs. The authentication ceremony in WhatsApp uses
either a QR code or a numeric key representation that users
can compare. Viber presents a numeric key representation
and provides functionality for users to call each other within
the ceremony to compare the key. Facebook Messenger pro-
vides a numeric representation of the keys for both users.
In addition to these differences, WhatsApp and Viber offer
only secure messaging, while Facebook Messenger offers both
insecure and secure messaging. We are curious as to whether
the inclusion of an insecure messaging interface hinders the
ability of users to find and successfully use secure messaging
and the authentication ceremony.

In the first phase of our study, we asked 12 pairs of
participants to complete a scenario where one participant
needed to send a credit card number to the other participant.
They were both instructed to verify that they were truly
communicating with their partner (authenticity) as well as to
ensure that no other party could read their messages
(confidentiality). Participants were told the application
would help them accomplish these goals.

In the second phase of the study, we presented 24 pairs of
participants with the same task and scenario provided in
the first phase. However, unlike the first phase, participants
first read through an additional set of instructional slides
before beginning the task. These slides informed them about
traffic interception, that secure messaging applications use
a “key” to secure conversations, and that to be secure they
needed to confirm that they saw the same “key” as their
partner. Participants were not instructed on how to use the
applications to compare keys, nor shown any screenshots of
the authentication ceremony; they were only told that each
application had some way of providing this functionality. For
both study phases, the method used for authentication was
left to their discretion.

Each phase was a within-subjects study, and all participants
engaged with all three applications in each phase. Partici-
pants differed between the two phases, allowing us to capture
between-subjects differences in instruction between the two
phases. We measured success rates in completing the au-
thentication ceremony, time to both find and complete the
ceremony, and user feedback on the applications, which in-
cludes System Usability Scale (SUS) scores, ratings of favorite
application, ratings of trustworthiness for each application,
and qualitative feedback.
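
The SUS scores mentioned above follow a fixed scoring rule that this paper does not restate. As a reminder, a minimal sketch of the standard rule is given below: ten items answered on a 1 to 5 scale, odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5. The function name `sus_score` and the example responses are hypothetical and are not taken from the study data.

```python
def sus_score(responses):
    """Score one 10-item SUS questionnaire; each response is an integer 1..5."""
    if len(responses) != 10 or any(r not in range(1, 6) for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # odd items are positively worded
        else:
            total += 5 - response   # even items are negatively worded
    return total * 2.5              # rescale the 0..40 sum to 0..100


# Hypothetical answers from a single participant (illustration only).
example_responses = [4, 2, 4, 2, 3, 2, 4, 3, 4, 2]
print(sus_score(example_responses))
```

A per-application SUS score would then be the mean of `sus_score` over all participants who used that application.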

Our findings include:

• In the first phase, despite the instruction about poten-
tial threats, the overall success rate over all participants
and all applications was 14%, and only two of the twelve
pairs of participants successfully located and completed
the authentication ceremony. All other pairs attempted
to authenticate one another through video calls, asking
questions that required special knowledge to answer,
or other ad hoc methods.

• In the second phase, the overall success rate increased to
79% for location and completion of the authentication
ceremony. The success rates for the three applications
were: 96% for Viber, 79% for WhatsApp, and 63% for
Facebook Messenger.

• Viber’s higher success rate was statistically significant
when compared to the other two applications. This
is interesting because Viber’s authentication ceremony
uses an in-app phone call and provides a UI that helps
users view and read the encryption key during the
phone call. Both WhatsApp and Facebook Messen-
ger also provide manual verification of the encryption
key, but do not provide this assistance. For both of
these applications, numerous participants sent the key
through in-app text, voice, and video, with a minority
comparing the keys in person. Nearly half of partici-
pants chose to use the option WhatsApp provided for
scanning a QR code.

• Averaged across the three applications, discovery of
ceremony functionality took 3.2 minutes with ceremony
completion necessitating another 7.8 minutes.

• All applications were rated in the “C” range on the
System Usability Scale, indicating a need for significant
usability enhancements.

• Most participants had not heard of Viber prior to their
participation in our study. Trust ratings were very low
in the first phase, but increased significantly in the
second phase, when some instruction about security
was received. This provides some evidence that learning
about security features can enhance trust in a secure
messaging application.

• Numerous participants complained about the length of
the encryption key when having to compare it manu-
ally, taking shortcuts and often feeling fatigued by the
process.

• Our qualitative data indicates that our participants
have a healthy wariness for, and high-level understand-
ing of: impersonation attacks, government and devel-
oper backdoors, and physical theft. They are, however,
generally unaware of the existence of man-in-the-middle
attacks, both passive and active. Our data is incon-
clusive on whether users make the connection between
this ceremony and its security guarantees.
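
Because every pair used all three applications, each application contributes the same number of attempts, so the overall second-phase rate should equal the plain mean of the three per-application rates. The short check below uses only the rounded percentages and timings reported in the list above; the variable names are illustrative.

```python
# Second-phase success rates per application, in percent, as reported above.
phase2_success = {"Viber": 96, "WhatsApp": 79, "Facebook Messenger": 63}

# Equal attempts per application, so the pooled rate is the unweighted mean.
overall_rate = sum(phase2_success.values()) / len(phase2_success)
print(f"mean second-phase success rate: {overall_rate:.1f}%")

# Average time to find the ceremony plus average time to complete it.
find_minutes, complete_minutes = 3.2, 7.8
total_minutes = find_minutes + complete_minutes
print(f"mean time to find and complete: {total_minutes:.1f} minutes")
```

The mean of the rounded per-application figures agrees with the reported 79% after rounding, and the two timing figures add up to roughly eleven minutes per application.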

Our main takeaway is that even with an awareness of po-
tential threats, users are not aware of and do not easily
find the authentication ceremony in the secure messaging
applications we tested. If given some instruction on the
importance of comparing keys, they can find and use the
authentication ceremony, and Viber’s second-phase success
rate indicates that a high success rate is a realizable goal.
However, for all applications, the time to find and use the
authentication ceremony is unsatisfactory from a usability
standpoint. The natural tendency of our participants to use
personal characteristics for authentication, such as a person’s
voice, face, or shared knowledge, indicates that future work
could leverage this for a more user-understandable method
of authentication.

2. RELATED WORK
Several papers have studied the usability of the authentica-
tion ceremony in secure messaging applications.

Two papers study the usability of the ceremony in a par-
ticular application. Schröder et al. studied Signal, showing
that users were vulnerable to active attacks due to usabil-
ity problems and incomplete mental models of public key
cryptography [15]. This study included 28 computer scien-
tists; of the participants, four clicked through the warning
message, eight could not find the ceremony, and ultimately
only seven were able to successfully authenticate their peer.
Assal et al. asked participants to perform the authentication
ceremony in ChatSecure using different key representations,
which include a fingerprint, shared secret, and QR code [1].
Of the 20 participants in this study, 20% were successful for
the fingerprint, 85% for the shared secret, and 30% for the
QR code.

Two papers have compared the usability of various fingerprint
representations. Tan et al. compared eight representations,
including textual and graphical representations with varying
degrees of structure, in a simulated attack scenario [18].
Graphical representations were relatively more susceptible
to attack, but were easy to use and comparison was fast.
Participants used different strategies for comparison, often
comparing only a portion of the fingerprint or comparing
multiple blocks at a time. Dechand et al. studied textual key
verification methods, finding that users are more resistant to
attacks when using sentence-based encoding as compared to
hexadecimal, alphanumeric, or numeric representations [6].
Sentence-based encoding rated high on usability but low on
trustworthiness.

Herzberg and Leibowitz examined the usability of Whats-
App, Viber, Telegram, and Signal, finding that most users
were unable to properly authenticate, both in an initial au-
thentication ceremony and after a key reset [8]. The study
included 39 participants from a variety of backgrounds and all
were given instruction on end-to-end encryption. Most users
failed to authenticate on the first attempt; they were then
given additional instruction about authentication. About
three-quarters authenticated properly after the additional
instruction was given.

Our work differs from these studies in several important
ways. First, we study in detail the ability of users to discover
and use the authentication ceremony in a variety of secure
messaging applications, giving us insight into the differences
among these applications. Schröder et al. only study Signal,
and Dechand et al. do not study any particular applications.
Second, we use a paired participant methodology, so that
users are asked to identify a friend they already know, rather
than an unknown study coordinator. This method is more
realistic than most prior studies and yields important insights
into user behavior. For example, our study participants called
each other, verified through voice and vision, and asked
questions based on shared knowledge. Third, we conduct a
between-subjects study on the effects of instruction, so that
those receiving instruction are not biased by their previous
experiences. The first set of participants were asked to
authenticate given only general awareness of threats, while
the second set of participants received instruction about the
importance of comparing encryption keys.

Another important aspect of our work is that it provides
replicability that is not possible with prior work. Herzberg
and Leibowitz report a similar result, that participants au-
thenticated properly after additional instruction about au-
thentication was given. However, their paper provides few
details about the instruction given and does not report de-
tailed statistics, so it is difficult to draw any quantitative
conclusions about the effect of the instruction or the rel-
ative merits of the different applications they tested. We
report detailed statistics about what methods users tried
with each application, the time taken to authenticate, SUS
scores, trust ratings, and favorite systems. We include our
full study materials in the appendix and provide our dataset
on a companion web site.

Significant work in the area of secure email has also exam-
ined issues related to usable authentication. Obtaining and
verifying the key for a recipient is an important use case for
email, and lessons learned may apply to secure messaging as
well. Numerous papers attest to the difficulties users have
with this and other key management steps [23, 16, 12].

The greatest success in this area has been in the use of
automatic authentication using a trusted key server. Bai et
al. [3] have shown that individuals recognize the security
benefits of manual key exchange, but prefer a centralized key
server that authenticates users and distributes keys
associated with their email address, due to greater usability
and “good enough” security. This model has been simulated by
Atwater et al. [2] and implemented using IBE by Ruoti et
al. [11]. Likewise, the use of secure messaging applications
is generally considered a success for automatic key
management.

3. APPLICATION DESCRIPTIONS
The three secure messaging applications used in our study
are WhatsApp, Viber, and Facebook Messenger. These three
applications were chosen because they present users with
distinct key verification experiences and because of their
popularity and large installation base.

3.1 WhatsApp
WhatsApp is perhaps the best-known and most widely used
messaging application, boasting a user base of over one
billion users. While it did not offer secure messaging
functionality at its inception, in November of 2014 WhatsApp
partnered with Open Whisper Systems to incorporate
end-to-end encryption using the Signal encryption protocol.

When a conversation is initiated, WhatsApp inserts a message
informing users that messages they send are encrypted with
end-to-end encryption. Users are given two options for key
verification: QR code scanning and key fingerprint
verification (both parties see the same fingerprint). To reach
this dialog, users open a menu in which a short caption
accompanies the “Encryption” option, informing them that they
can “Tap to verify.” Doing so brings up the verification
dialog shown in Figure 1a.
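
To illustrate why both parties can be shown the same fingerprint, and why a substituted key makes the comparison fail, the sketch below derives one numeric code from a pair of public keys. It is an illustration under stated assumptions, not WhatsApp's actual derivation, which this paper does not describe: the function name `shared_fingerprint`, the use of SHA-512, the 12 groups of 5 digits, and the placeholder 32-byte keys are all hypothetical choices.

```python
import hashlib


def shared_fingerprint(key_a: bytes, key_b: bytes, groups: int = 12) -> str:
    """Derive one numeric code from two public keys, independent of order.

    Assumes fixed-length keys, so plain concatenation is unambiguous.
    """
    first, second = sorted([key_a, key_b])   # same order on both devices
    digest = hashlib.sha512(first + second).digest()  # 64 bytes
    chunks = []
    for i in range(groups):
        # Take 5 bytes per group and reduce them to 5 decimal digits.
        value = int.from_bytes(digest[5 * i:5 * i + 5], "big") % 100_000
        chunks.append(f"{value:05d}")
    return " ".join(chunks)


# Placeholder keys (illustration only, not real key material).
alice_key = b"\x01" * 32
bob_key = b"\x02" * 32
attacker_key = b"\x03" * 32

# Honest case: each device combines its own key with the peer's key.
alice_view = shared_fingerprint(alice_key, bob_key)
bob_view = shared_fingerprint(bob_key, alice_key)

# Attack case: the server hands each user the attacker's key instead.
alice_view_attacked = shared_fingerprint(alice_key, attacker_key)
bob_view_attacked = shared_fingerprint(bob_key, attacker_key)

print("honest codes match:  ", alice_view == bob_view)
print("attacked codes match:", alice_view_attacked == bob_view_attacked)
```

Sorting the two keys before hashing is what lets both devices arrive at an identical code without agreeing on who goes first; the comparison step of the ceremony then reduces to checking that the two displayed strings are equal.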

3.2 Viber
Viber is another widely-used messaging application with an
install base of over 800 million users. As with WhatsApp, it
did not originally offer end-to-end encryption, adding this
functionality in April of 2016. Its encryption protocol is a
proprietary design allegedly based on the principles of the
Signal protocol.

While—as with the other two applications—Viber does not
immediately make apparent the need to verify keys, once
begun, it does—unlike the other two applications—carefully
guide the user through the process with a set of instructional
dialogs. In displaying this functionality, Viber does not opt
to use the terms “encryption” or “key” at the outset, instead
characterizing the verification process as “trust[ing]” one’s
conversation partner. Only after the user selects this option
are they prompted with a dialog that explains the need
to confirm that “secret keys are identical.” This process is
facilitated via a free Viber call. After making the call, both
sides may see their keys by tapping a lock icon that appears
during the call, allowing for verification. This dialog is shown
in Figure 1b. It should be noted, however, that Viber does
not allow the user to view their keys without initiating this
call, nor does it allow the user to view these keys once a
contact has been marked as trusted.

3.3 Facebook Messenger
Facebook Messenger is the messaging utility designed by Face-
book to integrate into their chat system, and, like WhatsApp,
has a user base of over 1 billion users. Again, as with the
other two applications, it did not originally offer end-to-end
encryption, adding this functionality in October of 2016. It
also uses the Signal protocol.


Figure 1: Authentication ceremonies in each of the applications: (a) WhatsApp, (b) Viber, (c) Facebook Messenger.

The user experience of Facebook Messenger’s encryption
functionality differs substantially from that of WhatsApp and
Viber. While the first two applications encrypt all
communication automatically, Facebook Messenger defaults to an
unencrypted chat session, with users required to initiate a
standard chat session before accessing a “Secret Conversation”
function via the conversation menu. Once within the secret
conversation menu, users can access their device keys via the
context menu. At this point, the experience again diverges
from the two other applications, as the key verification
dialog presents users with two keys instead of one.
Furthermore, the Facebook Messenger key verification interface
does not give users an easy way to communicate these key
values to the other party. This dialog is shown in Figure 1c.

4. METHODOLOGY
We conducted an IRB-approved, two-phase user study exam-
ining how participant pairs locate and complete the authen-
tication ceremony in three secure messaging applications:
WhatsApp, Viber, and Facebook Messenger. Our study ma-
terials are shown in Appendix B and our full data set is
available at https://alice.internet.byu.edu.

In both phases, we asked participants to complete a scenario
where one participant needed to send a credit card number
to the other participant. We instructed participants to verify
that they were truly communicating with their partner and
to ensure that no other party could read their messages.
Our instructions informed participants that the application
would help them accomplish these goals, but they were left
in control of the methods used to ensure these conditions
were met. In the second phase, participants viewed and read
aloud an instructional set of slides that informed them about
the importance of comparing encryption keys.

Each phase was a within-subjects study, and all participants
used all three applications in each phase. The participants
differed between the two phases, allowing us to see between-
subjects differences in instruction between the two phases.

To choose the three applications, we compared the
authentication ceremony in 10 secure messaging applications:
WhatsApp, Telegram, Signal, Zendo, Facebook Messenger, Viber,
ChatSecure, Allo, Line, and SafeSlinger. We binned the
applications into groups, based on the authentication methods
used. We then narrowed our choices to the following:
Signal/WhatsApp (use both QR codes and manual verification),
Telegram/Facebook Messenger (use manual verification,
include non-secure chatting), and Zendo (uses NFC or QR
code, requires verification before chatting). We chose
WhatsApp over Signal and Facebook Messenger over Telegram
because of their greater popularity in the United States. As
explained below, we were unable to proceed with Zendo
in the study. We chose Viber as an alternative because it
provides a method for manually comparing encryption keys
using a phone call built into the application. This provided
us with three different applications that use a variety of
authentication methods.

4.1 Pilot study
We conducted a pilot study of the first phase with three pairs
of participants, using WhatsApp, Facebook Messenger, and
Zendo. The Zendo secure messenger employs key verification
as a forcing function: users must first scan each other’s QR
codes, or use NFC communication, before the conversation
can begin. Unfortunately, we experienced multiple, severe
technical difficulties with the application during the pilot
study, leading us to abandon it in favor of Viber.

4.2 Study recruitment and design
We placed flyers advertising the study around the campus
of a local university. These flyers contained a link that par-
ticipants could use to schedule online, and they included a
requirement that all participants bring a friend and smart-
phones in order to take part in the study. Recruitment
proceeded from February 3, 2017 to February 28, 2017, with
39 unique participant pairs being recruited in total: 12 for
the first phase of the study, and 24 for the second.¹

¹ One second-phase participant pair experienced difficulty
because one participant had limited English proficiency and
our study was executed entirely in English (this participant
thought that they were being tasked with locating a physical
key). Technical errors occurred during the data collection
of two other pairs and they were presented with incorrect
post-task questionnaires. Accordingly, the data for these
three pairs were excluded from the study and we recruited
replacements in their place.


What Is Secure Messaging?

When you use regular text messaging, your phone company can
read your text messages.

When you use secure messaging apps, you are having a private
conversation with your friend. Not even the company running
the service can see your messages.

But you still need to be careful. A hacker could intercept
your traffic. (Illustration label: “The bad guy”)

To make sure your conversation is secure, these applications
assign a “key” to each person. You need to make sure the key
you see is the same key your friend sees.

Secure messaging apps provide a way for you to compare these
keys. We want to see how well the application helps you do
this.

Figure 2: Instructional slides used in the second phase.

To ensure different pairs of participants tried applications in
different orders, we calculated a complete set of permutations
listing the order in which each of the three applications
would be used by a given pair. We then randomized the
permutation that was assigned to each participant. This
ensured a collectively uniform distribution of sequences while
keeping the assignment of a given sequence to a particular
pair random. Each ordering of the three systems occurred
exactly twice in the first phase and four times in the second.
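
The counterbalancing described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: the function name `assign_orders`, the fixed seeds, and the use of Python's `random` module are assumptions made for the example.

```python
import itertools
import random
from collections import Counter

APPS = ("WhatsApp", "Viber", "Facebook Messenger")


def assign_orders(num_pairs, rng):
    """Give each pair one application ordering, using every ordering equally often."""
    orders = list(itertools.permutations(APPS))   # 3! = 6 possible orderings
    if num_pairs % len(orders) != 0:
        raise ValueError("number of pairs must be a multiple of 6")
    schedule = orders * (num_pairs // len(orders))  # equal count per ordering
    rng.shuffle(schedule)                           # random pair-to-ordering match
    return schedule


phase1_schedule = assign_orders(12, random.Random(1))  # 12 first-phase pairs
phase2_schedule = assign_orders(24, random.Random(2))  # 24 second-phase pairs

print(Counter(phase1_schedule))
print(Counter(phase2_schedule))
```

Repeating the full set of orderings before shuffling keeps the distribution exactly uniform (two per ordering for 12 pairs, four per ordering for 24 pairs), while the shuffle keeps the assignment of any particular ordering to any particular pair random.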

The study was conducted in two phases, spanning a period
of one month. The first phase ran from February 3, 2017
to February 16, 2017. It took roughly 40 to 45 minutes for
each pair of participants to complete, for which they were
compensated $10 USD each. The second phase ran from
February 17, 2017 to March 2, 2017. The second phase stud-
ies were more involved and took longer to complete, roughly
60 minutes each, and so all participants were compensated
at a higher rate of $15 USD.

When participants arrived for their scheduled appointment,
we presented them with the requisite forms for consent and
compensation. We instructed them to download and install
any of the three applications—WhatsApp, Viber, and Face-
book Messenger—that they did not already have on their
phones, to minimize the likelihood of technical difficulties
during the study.² We then read them a brief introduction
describing the study conditions and their rights as study
participants. We informed them that they would be placed
in separate rooms, but could freely communicate or meet
with one another if they deemed it necessary to complete
their task. We also informed participants that a study coor-
dinator would be with them at all times and would answer
any questions they might have.

We randomly assigned one member of each pair as Participant A,
with his or her counterpart becoming Participant B,
delineating their roles in the subsequent tasks. We then led
them to their respective rooms, seated them at a computer,
and initiated audio recording. We preloaded each computer
with a Qualtrics survey that guided participants through
the study; it included a demographic questionnaire,
instructions regarding the three tasks they were to perform,
and post-task questionnaires. Each of the three tasks was
identical in nature, differing only by which of the three secure
messaging applications participants were to use to complete
the task. Throughout the study, study coordinators were
available to answer general questions about the study, but
were careful not to provide any specific instructions that
would aid in the use of the applications themselves.

² In our pilot study, several participants lacked sufficient
space on their phones to install the applications or had phones
that were too old to run the applications properly. We
subsequently adopted this measure in an attempt to forestall
re-occurrence.

4.3 Task design
In both phases, the tasks participants completed were the
same: Participant A was to securely retrieve a credit card
number in Participant B’s possession by using the application
that was being tested. This scenario was intended solely as a
narrative backdrop for the tasks we were truly concerned with:
finding and completing the authentication ceremony. When
asked to complete the task, participants were instructed as
follows:

Your task is to make sure that you are really talk-
ing to your friend and that nobody else (such as
the service provider) can read your text messages.
The application should have ways to help you do
this.

Accordingly, despite a difference in roles, there were no prac-
tical differences between the tasks Participant A and Par-
ticipant B needed to complete. Participants were instructed
and encouraged to “talk aloud” as they completed the task,
explaining the choices they made and the actions they took.

Additional instruction was given in the second phase. Before
participants were introduced to the task, they were asked to
read aloud a short set of slides, shown in Figure 2. These
slides informed them that traffic interception was a possibility,
that secure messaging applications accordingly provide a “key”
that could be compared to ensure that conversations were
indeed secure, and that they needed to make sure that they
saw the same key as their counterpart. Furthermore, on the

USENIX Association Thirteenth Symposium on Usable Privacy and Security 33

first task, if second phase participants had failed to verify
one another’s identity either prior to sensitive data exchange
or after ten minutes had passed, they were marked as having
failed the task and prompted by study coordinators to look
for a way to authenticate properly.

4.4 Study questionnaire
Participants were led through the study by a web-based
Qualtrics survey. We first discuss those aspects that were
held constant for both phases, followed by an explanation of
how the questionnaire differed in the second phase.

Upon beginning the survey, participants first answered a set
of demographic questions. They then answered questions
about their past experience, if any, with secure messaging
applications. This included questions about which applica-
tions they might have used, their reasons for doing so, and
their general experiences with sending sensitive information.
Participants were next shown a description of their first task
(all three tasks were identical, diverging only on the sys-
tem being used). Each task was followed with a post-task
questionnaire assessing their level of trust in the application,
whether or not they believed they had successfully verified
their partner’s identity and why, and who they believed was
capable of reading their conversation. After all three tasks
had been completed, participants were then asked which of
the three applications was their favorite and why.

In the second phase, participants were given supplementary
instructions and asked additional questions. First, after the
demographic questions, participants were asked a series of
six questions intended to gauge their relative familiarity with
end-to-end encryption. Next, prior to beginning the first
task, they were presented with, and asked to read aloud, a
set of six slides that very briefly introduced the role of keys
and informed them that the applications they were about
to use would provide a way for them to compare these keys.
These instructional slides are shown in the appendix. Finally,
at the end of each task, the post-task questionnaire from
the first phase was augmented by the ten questions from the
System Usability Scale (SUS).

4.5 Post-study debrief
At the conclusion of each study, participant pairs were gath-
ered in the same room and asked a series of questions. This
served as a complement to the questionnaires that they had
answered individually, and gave them an opportunity to re-
act to one another. Participants were prompted regarding
incidents specific to their experience—e.g., if they had ev-
idenced visible frustration with a particular app—as well
as general questions. Examples of the latter include having
participants contrast the authentication ceremony used by
each application, as well as asking them to explain what role
they thought keys played in verifying one another’s identity.

4.6 Demographics
Our sample population skewed slightly female (n=40, 56%)
and young, with 74% (n=53) between the ages of 18 and 24,
and 26% (n=19) between 25 and 34. Because we distributed
recruitment flyers on a university campus, most of our
participants were college students (n=44, 61%), with 17% (n=12)
having less educational experience than that, and 22% (n=16)
having at least finished college. Participants had a variety
of backgrounds, with roughly even representation between

technical (i.e., STEM; n=34, 48%) and non-technical back-
grounds (n=37, 52%), and 10 (14%) in explicitly IT-related
fields. (One participant failed to identify their field of study
or occupation.)

In the second phase, the questionnaire included a series
of six multiple-choice questions intended to assess partic-
ipants’ knowledge of end-to-end encryption. We assigned
equal weights of one point to each question, and scored each
participant from 0-6, corresponding to the number of cor-
rect answers given by the participant. Participants were
further placed into categories of “beginner,” “intermediate,”
and “advanced” for scores in the range of 0-2 for beginners,
3-4 for intermediate, and 5-6 for advanced. There were an
equal number of participants with beginner and intermediate
ratings—21—with 6 participants netting an advanced rating.
Beginners were mostly female (3:18), intermediate partici-
pants were mostly male (15:6), while the advanced category
had an even gender split (3:3).
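The scoring rule above is simple enough to state as code. The sketch below assumes only the thresholds given in the text (0-2, 3-4, 5-6); the function name and the sample scores are hypothetical.

```python
from collections import Counter


def knowledge_category(correct_answers):
    """Map a 0-6 quiz score to the category thresholds stated in the text."""
    if not 0 <= correct_answers <= 6:
        raise ValueError("score must be between 0 and 6")
    if correct_answers <= 2:
        return "beginner"
    if correct_answers <= 4:
        return "intermediate"
    return "advanced"


# Hypothetical scores for illustration only, not the study's data.
sample_scores = [0, 2, 3, 4, 5, 6, 1]
category_counts = Counter(knowledge_category(s) for s in sample_scores)
print(category_counts)
```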

4.7 Limitations
The instructions given to the first three participant pairs
of the first phase were slightly different from those given
to the remaining nine. They were directed to ensure that
no one was “listening in” on their conversation, a directive
participants took literally as they would visibly scan the
room for potentially intrusive parties. This wording was
subsequently altered, with participants instead instructed to
ensure that “nobody else (such as the service provider) can
read your text messages.”

The slides we provided participants to teach about crypto-
graphic keys were necessarily simplified so that they could
be understood by novices. In this material we mentioned
that participants should ensure the key they see is the same
as their partner’s. While this was sufficient in describing
tasks for Viber and WhatsApp, Facebook Messenger actually
utilizes two keys, one for each partner. This subtlety was not
mentioned by any participant nor did it seem to adversely
affect their performance.

Finally, due to our method of recruitment, our participants
were largely students and their acquaintances, and consequently
exhibited some degree of homogeneity, e.g., all participants
were between 18 and 34 years of age. They are thus
not representative of a larger population. Furthermore, while
an effort was made to place participants in a more organic
setting—e.g., by having them communicate with real mem-
bers of their social circle as opposed to study coordinators—
this was still ultimately a lab study and has limitations
common to all studies run in a trusted environment [10, 17].

5. FIRST PHASE RESULTS
In the first phase of the study, only 2 of the 12 pairs experi-
enced some success in locating and completing the authenti-
cation ceremony, with an overall success rate of 14% across
all pairs and applications.

Participants used a variety of ad hoc methods for authen-
tication. Listed in the order they appear in Table 1, these
methods were: utilization of a picture for visual identifica-
tion, utilization of a live video feed for visual identification,
recognition of the other party’s voice, utilization of shared
secrets for identification, utilization of contact information
(e.g., phone number, profile picture) for identification,
utilization of a shared second language for identification,
and performing the actual authentication ceremony. These
categories were compiled by asking users how they
authenticated the other party, and are not mutually
exclusive (some used more than one method).

Application          Send     Recognize  Recognize  Shared     Contact  Second    Authentication
                     Picture  Video      Voice      Knowledge  Info     Language  Ceremony

WhatsApp             0        0          13         10         3        2         2
Viber                0        10         4          7          2        2         4
Facebook Messenger   2        12         2          7          0        0         2

Table 1: Methods of authentication used in the first phase by pairs of participants.

We examined the two pairs that were successful to better un-
derstand their experiences. One pair was successful because
of their curiosity, which led to them exploring the application
settings. This pair started with Viber and began to verify
each other simply through a phone call, when they suddenly
noticed the option in Viber to authenticate a contact, mak-
ing that contact “trusted.” They subsequently verified the
encryption key through the phone feature embedded in the
authentication ceremony. After this experience, this pair no-
ticed they should be looking for similar functionality in the
other applications. They followed the on-screen instructions
in WhatsApp to scan the QR code, and they exchanged a
screenshot of the authentication code in Facebook Messenger.

A second pair started the study with Facebook Messenger.
This pair called each other using an insecure phone call,
spoke in Korean, and transferred the credit card number
used in the scenario without completing the authentication
ceremony. They next used WhatsApp, and because it was
their first time using the application, they were prompted
with a notice about end-to-end encryption after sending their
first message. After clicking to learn more, this pair was able
to locate and complete the authentication ceremony by using
a phone call to read and verify the key. After this experience,
the pair was also able to locate the lock icon in Viber, follow
the instructions in the ceremony, and use a phone call to
verify the key. However, they were unsure about the role
of the key and still verified each other’s identity by asking
questions that relied on their common knowledge.

6. SECOND PHASE RESULTS
In this section we discuss results regarding participant use
of the authentication ceremony for the second phase, when
additional instruction was given regarding the importance of
comparing keys.

6.1 Success Rate
The success rate for completing the authentication ceremony
in the second phase was drastically higher than for the first
phase. Overall, the success rate was 79% across all participant
pairs and the three applications. Table 2 shows the
breakdown of the success rate for each application. Failures
occurred when participants transmitted sensitive data before
verifying keys, or if they failed to find and validate the keys
within ten minutes of opening the application. Successes
indicate that participants identified and compared keys in
some fashion. The Error column indicates three cases where
Facebook Messenger failed to deliver messages or failed to
display important UI elements that allow participants to
access key information. We noted various mistakes made by
participants, but these were considered distinct from failures
and are discussed later.

Application          Success    Fail      Error

WhatsApp             19 (79%)   5 (21%)   0 (0%)
Viber                23 (96%)   1 (4%)    0 (0%)
Facebook Messenger   15 (63%)   6 (25%)   3 (13%)

Table 2: Success rates per pair of participants for the au-
thentication ceremony in the second phase.

The leap from a 14% success rate in the first phase to 79% in
the second phase suggests that users are capable of locating
and performing the authentication ceremony when prompted.
Some of these applications indicate that keys need to be
validated, yet our results from phase one indicate that these
instructions are largely ignored. We therefore suspect that the
independent prompts from our study accounted for much of
the difference seen in authentication ceremony success rates.
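As a quick arithmetic check, the overall rates can be tallied from the counts reported in the text and in Table 2. The sketch below assumes that each pair-application attempt counts once and that the phase-one successes are the five ceremonies described in Section 5 (three by one pair, two by the other); the helper name `percent` is hypothetical.

```python
def percent(part, whole):
    """Percentage rounded half up (62.5 becomes 63, unlike Python's round())."""
    return int(100 * part / whole + 0.5)


# Phase one: 12 pairs x 3 applications = 36 attempts.
phase_one_successes = 3 + 2
phase_one_attempts = 12 * 3

# Phase two: (success, fail, error) counts per application, from Table 2.
table_2 = {
    "WhatsApp": (19, 5, 0),
    "Viber": (23, 1, 0),
    "Facebook Messenger": (15, 6, 3),
}
phase_two_successes = sum(s for s, _, _ in table_2.values())
phase_two_attempts = sum(sum(row) for row in table_2.values())

phase_one_rate = percent(phase_one_successes, phase_one_attempts)
phase_two_rate = percent(phase_two_successes, phase_two_attempts)
per_app = {app: percent(s, s + f + e) for app, (s, f, e) in table_2.items()}

print(phase_one_rate, phase_two_rate, per_app)
```

Under these assumptions the tallies are 5 of 36 attempts in the first phase and 57 of 72 in the second.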

To test whether there are any differences between the ap-
plications, we used Cochran’s Q test. We found that the
success rate was statistically different for the applications
(χ²(2) = 15.429, p < .0005). We then ran McNemar’s test
to identify the significant differences among the pairs of
applications. We found there is a significant difference be-
tween WhatsApp and Viber (p = 0.008) as well as between
Facebook Messenger and Viber (p < 0.0005).
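For readers unfamiliar with these tests, the sketch below shows how Cochran's Q and an exact (binomial) McNemar test can be computed with the standard library alone. It is an illustration under assumptions: the per-pair outcome matrix is hypothetical because the paper does not publish per-pair data, and the text does not say whether the exact or the chi-square variant of McNemar's test was used. The closed form exp(-x/2) is the chi-square upper tail for 2 degrees of freedom only.

```python
import math


def cochran_q(rows):
    """Cochran's Q for binary outcomes; rows = subjects, columns = treatments."""
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    n = sum(row_totals)
    denom = k * n - sum(t * t for t in row_totals)
    if denom == 0:
        raise ValueError("every subject has identical outcomes in all columns")
    q = (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom
    return q, k - 1  # statistic and degrees of freedom


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-x / 2)


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcomes (1 = success) for four pairs and three applications.
outcomes = [[1, 1, 0], [1, 1, 0], [1, 0, 0], [1, 1, 1]]
q, df = cochran_q(outcomes)
print(q, df, chi2_sf_df2(q))

# The reported statistic of 15.429 with 2 df is consistent with p < .0005.
print(chi2_sf_df2(15.429))

# Hypothetical discordant counts for one pairwise comparison.
print(mcnemar_exact(1, 9))
```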

It is interesting that Viber’s success rate is significantly higher
than the other two applications. Viber’s authentication
ceremony uses an in-app phone call and provides a UI that
helps users view and read the encryption key during the
phone call. Facebook Messenger’s authentication ceremony also
provides only manual verification of the encryption key, but
it does not provide this assistance.

6.2 Verification Methods
The methods used by participants to perform the authen-
tication ceremony are shown in Table 3. Note that some
participants used more than one method. We do not include
methods for three pairs of participants who encountered
errors when utilizing Facebook Messenger. These errors pro-
hibited us from assessing how these participants would have
interacted with the authentication ceremony.

The most-selected method for the ceremony through WhatsApp
was scanning the QR code of the key fingerprint in person.
Of the applications we studied, this method is unique to
WhatsApp. Some pairs opted to take a screenshot of the key
or QR code and send it this way, while others remembered
substrings of the key fingerprint and repeatedly visited the
text screen to send pieces of it to their partner. This behavior
occurred when participants discovered the QR code and key
fingerprint but were confused as to what to do next.

Action                                                          WhatsApp   Viber      Messenger

Secure Methods
Scanned QR code in person                                       11 (46%)   N/A        N/A
Read key in person                                              1 (4%)     0 (0%)     7 (29%)
Called out of band or used Viber’s call method to provide key   1 (4%)     23 (96%)   1 (4%)

Less Secure Methods
Sent key through in-app text                                    7 (29%)    N/A        10 (42%)
Sent key through in-app video                                   3 (13%)    N/A        4 (17%)
Sent key through in-app voice                                   1 (4%)     N/A        1 (4%)

Failures
Sent sensitive information before validation                    5 (21%)    1 (4%)     5 (21%)
Failed to find key within 10 minutes and after a hint           1 (4%)     0 (0%)     1 (4%)

Table 3: Methods used for the authentication ceremony in
the second phase. Numbers indicate pairs and percentages
are out of the total number of pairs.

Numerous participants using WhatsApp read the key data in
person, read the key using a voice or video call, or sent the
key using text. Most participants using Facebook Messenger
used these methods, since they were the only ones available.

Viber provides a much stricter interface once a user has lo-
cated the option to verify his partner’s identity. Instead of
offering key material immediately, an in-app call must be
initiated before the key material is provided to the user. As
a result, all pairs who successfully completed the ceremony
utilized this feature to verify their keys. We note that this
policy resulted in no mistakes made for the authentication
ceremony. However, the process confused some participants,
and one pair sent sensitive information through the application
without performing this procedure.

6.3 Timing
We timed each pair of participants to obtain two metrics:
the time taken to locate and identify the authentication
ceremony as it is presented within the application interface
and the time taken to complete the ceremony successfully.
In the case of finding the ceremony, the time reported is the
time taken for the first partner to identify the key material
or complete the task. We consider timing data only for
cases where the pair authenticated successfully,
because we stopped participants after 10 minutes if they
could not find the ceremony.

Figure 3: Timing for finding and using the authentication
ceremony in the second phase. Lighter shades indicate the
time taken to find the ceremony and the full bar indicates
time taken for completing the ceremony.

Figure 3 shows the geometric mean of both time metrics for
the three applications tested.3 Applications that are selected
to be evaluated first in a given study have a disadvantage with
respect to time because it is users’ first exposure to the task
and possibly keys in general. To account for this, Figure 3
also includes comparisons showing timing data from when
each application was studied first and when the application
was not studied first.
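To illustrate why the geometric mean is attractive for task times, the sketch below compares it with the arithmetic mean on a small, positively skewed sample. The times are hypothetical and are not taken from the study.

```python
import math
import statistics


def geometric_mean_minutes(times):
    """Geometric mean computed as exp(mean(log(t))); requires positive times."""
    if any(t <= 0 for t in times):
        raise ValueError("task times must be positive")
    return math.exp(sum(math.log(t) for t in times) / len(times))


# Hypothetical task times in minutes; one slow pair skews the distribution.
times = [1.5, 2.0, 2.5, 3.0, 12.0]

arithmetic = statistics.mean(times)
geometric = geometric_mean_minutes(times)

print(arithmetic, geometric)
# The standard library (Python 3.8+) offers the same computation directly.
print(statistics.geometric_mean(times))
```

The single long time pulls the arithmetic mean upward far more than the geometric mean, which is the behavior that motivates the choice for skewed timing data.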

To test whether there is a significant difference in the time to
complete these tasks among the three different applications,
we used the Kruskal-Wallis test. We found that there are
statistically significant differences among the applications
for both finding the ceremony (p = 0.031) and completing
the ceremony (p = 0.043). We next ran pairwise post-hoc
Dunn’s tests to determine where the differences occur. We
found a significant difference between Facebook Messenger
and WhatsApp for finding the ceremony (p = 0.030), with
Facebook Messenger being faster (mean time, Facebook Mes-
senger=2.5 minutes, WhatsApp=3.7 minutes). We also found
a significant difference between Viber and WhatsApp for com-
pleting the ceremony (p = 0.045), with Viber being faster
(mean time, Viber=6.9 minutes, WhatsApp=8.5 minutes).
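The omnibus test used above can be sketched with the standard library. This is an illustrative implementation of the Kruskal-Wallis H statistic with average ranks and the usual tie correction, run on hypothetical times; it does not reproduce the study's data, and it omits the Dunn post-hoc comparisons. With three groups the statistic has 2 degrees of freedom, so the chi-square upper tail is exp(-H/2).

```python
import math
from collections import Counter


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = shared
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return (H, degrees of freedom) for a list of samples."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum * rank_sum / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction, len(groups) - 1


# Hypothetical completion times (minutes) for three applications.
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h, df = kruskal_wallis(groups)
p_value = math.exp(-h / 2)  # valid only because df == 2
print(h, df, p_value)
```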

A major takeaway from the timing data shown is that key
discovery and key verification both require substantial time
for all three applications. On average, across all applications
discovery of the ceremony required 3.2 minutes and ceremony
completion required another 7.8 minutes. Given that the
participants were informed about the existence of the keys
beforehand and told explicitly to verify them, these times
are unsatisfactory from a usability standpoint. The usability
issues and concerns voiced by participants responsible for
these times are discussed in Section 7.

7. APPLICATION FEEDBACK
In this section we discuss feedback that participants provided
regarding the secure messaging applications, including us-
ability, their favorite application, and the trustworthiness of
the applications.

3 Sauro and Lewis recommend using the geometric mean for
task timing [14] because timing data is positively skewed and
the geometric mean reduces error.

SUS subcategory     WhatsApp   Viber   Messenger

Overall             65.45      67.45   67.78
First system        65.47      67.97   69.22
Not first system    64.45      66.02   67.97

Success             64.41      67.86   72.71
Failure             66.25      63.13   69.50

Table 4: SUS scores for the applications in the second phase.

7.1 Usability
During the second phase of our study, participants evaluated
each application using the System Usability Scale (SUS).
Table 4 presents the breakdown of the scores for each system
across various subcategories. The values shown are the mean
values for each subcategory, while bolded values highlight
the highest SUS score for each subcategory.

We report SUS scores across five subcategories for each ap-
plication: overall SUS score, the mean SUS score when the
application was the first of the three presented, the mean
SUS score when the application was not the first shown, the
mean SUS score for participants who succeeded at the task
using the given application, and the mean SUS score for
participants who failed the task.
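For context, a single participant's SUS score is derived from the ten questionnaire items using the standard SUS scoring rule: odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5. The sketch below applies that rule; the function name and the sample responses are hypothetical.

```python
def sus_score(responses):
    """Score one SUS questionnaire: ten responses, each from 1 to 5."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs ten responses on a 1-5 scale")
    total = 0
    for item, response in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (response - 1) if item % 2 == 1 else (5 - response)
    return total * 2.5


# Hypothetical responses for illustration only.
neutral = sus_score([3] * 10)
best_case = sus_score([5, 1] * 5)
worst_case = sus_score([1, 5] * 5)
print(neutral, best_case, worst_case)
```

The per-application values reported here are means of such individual scores across participants.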

Although SUS scores range from 0 to 100, this is not a
percentile value and can thus be difficult to interpret. Ac-
cordingly, to help contextualize the values shown, we draw
on the findings of researchers familiar with SUS. Sauro [13],
extending work from other researchers such as Bangor et al.
[4], created a guide for interpreting a given SUS score by nor-
malizing it relative to those achieved by other systems. This
framework associates SUS scores with percentile rankings
and with letter grades (from A+ to F).

For reference, the applications’ overall SUS scores fall within
the “C” range, landing somewhere within the 41st to 59th
percentile. The single lowest SUS score—Viber’s mean failure
score—nets a “C-” grade, falling within the 35th to 40th per-
centile. The highest SUS score—Facebook Messenger’s mean
success score—achieves a “C+” grade, somewhere within the
60th to 64th percentile.

7.2 Favorite application
Participants were asked to select which, if any, of the three
applications was their favorite and why. Table 5 shows the
breakdown of responses for each phase. Facebook Messenger
was the most preferred system, followed by WhatsApp. We
ran a chi-square test to determine whether the differences in
responses between phase one and phase two were statistically
significant; they were not.

Though numerous reasons were given for why a particular
system was a participant’s favorite, familiarity was by far
the most commonly cited reason for preference (except with
Viber, which was not previously used by any of our par-
ticipants). The next most common reason given, and one
that held true for each of the three systems, was ease-of-use,
with what constituted “easy to use” varying from system to
system. Some WhatsApp users, for example, appreciated its
ability to scan QR codes for key verification, obviating the
need to read aloud the long string of digits comprising a key
fingerprint. Those who liked Viber found its key verification
process the simplest to access and execute. By contrast,
those who mentioned ease-of-use relative to Facebook Mes-
senger typically associated it with familiarity as opposed to
any mechanism in particular.

Study phase   WhatsApp   Viber   Messenger   None

One           39.1%      8.7%    47.8%       4.4%
Two           31.3%      22.9%   43.8%       2.0%

Table 5: Participants’ favorite applications. Each cell con-
tains the fraction of participants from each phase who, when
prompted for their favorite system, gave the respective re-
sponse.

Figure 4: Participant ratings of trust for each application.
(a) Trust ratings in the first phase. (b) Trust ratings in the
second phase.

7.3 Trust
As part of each post-task questionnaire, participants were
asked to rate their trust in each application. They were
presented with the statement “I trust this application to be
secure” and asked to rate the statement on a 5-point Likert
scale ranging from “strongly disagree” to “strongly agree.”
Responses for the two phases are shown, normalized, in
Figure 4.

Comparing the trust scores from the two phases, two points
stand out. First, a “strongly disagree” response—indicating a
total lack of trust in the application—appeared for all three
of the applications in the first phase, but not at all in the
second phase. This is mostly due to one participant from
the first phase who chose “strongly disagree” for all three
systems. Secondly, responses of “strongly agree”—indicating
confidence and trust in the application—are much more
prevalent in the second phase.

To compare the trust scores in more detail, we ran a mixed-model
ANOVA, which allowed us to see the interaction between the two
independent variables (application and phase). We found that
there is a significant interaction between the application and
the study phase (F(2,140) = 5.023, p = 0.008, partial η² = 0.067).

To determine whether there was a simple main effect for
the application, we ran a repeated measures ANOVA on
each phase. There was a statistically significant effect of
the application on trust for phase one (F(2,46) = 4.173, p
= 0.022, partial η² = .154). By examining the pairwise
comparisons, we found that the trust score was significantly
lower for Viber as compared to WhatsApp in the first phase
(M = 0.542, SE = 0.180, p = 0.019).

To determine whether there was a simple main effect for the
study phase, we ran a one-way ANOVA on each application
to compare the trust between the two phases. There was a
statistically significant difference in trust ratings between the
two phases for Viber (F(1,70) = 14.994, p < 0.0005, partial
η² = .176). The mean trust for Viber in the first phase was
3.58, and in the second phase it increased to 4.40.
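The reported effect sizes can be cross-checked from the F statistics and their degrees of freedom, using the standard identity partial η² = (F · df_effect) / (F · df_effect + df_error). The sketch below applies it to the three tests reported in this section; the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


interaction = partial_eta_squared(5.023, 2, 140)   # application x phase
phase_one_app = partial_eta_squared(4.173, 2, 46)  # application effect, phase one
viber_phase = partial_eta_squared(14.994, 1, 70)   # phase effect for Viber

print(round(interaction, 3), round(phase_one_app, 3), round(viber_phase, 3))
```

All three recomputed values agree with the reported effect sizes to three decimal places.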

Altogether, this analysis indicates that Viber was trusted
less than WhatsApp in the first phase, but that trust in Viber
rose significantly in the second phase, after some instruction
about the importance of the authentication ceremony. The
trust for Viber increased in the second phase to the point
that it was not significantly different from WhatsApp.

Participant commentary raised two other points of interest.
First, participants strongly associated reputation with the
trustworthiness of applications. Viber, for example, despite
possessing a large user base outside of the United States,
was essentially unknown to our participants, leading them
to express wariness of this application. Facebook’s status
as a household name inspired both confidence and distrust.
While its reputation as a large and established company
reassured some, others were discomfited by the many nega-
tive stories they had heard about account hacks and privacy
invasions on Facebook. Second, responding to descriptions of
end-to-end encryption and promises of secure communication
by the various applications, multiple participants remarked
that they had no way to truly gauge the validity of those
statements. Both these sentiments are captured by a remark
from R10B, “I would say it’s a double-edged sword because
Facebook—everyone knows Facebook—but it has that repu-
tation of getting hacked all the time. But I’ve never heard
of Viber or WhatsApp, so it could easily be some third-party
Ukrainian mean people who want to steal information because
that’s just who they are. And whether it states that they’re
not gonna read or listen to the conversations and stuff like
that… well, who knows?” However, most opted to believe,
for as one participant concluded, “at some point, you have
to trust something.”

8. OBSERVATIONS
During our study, certain participant experiences and com-
mentary stood out, highlighting a handful of concerns about
each of the three applications individually, and in general.
We feel that these observations are worthy of note in that
they suggest directions for focus and improvement in the
domain of secure messaging.

8.1 Single key fingerprint
WhatsApp and Viber both generate a single key fingerprint
to be shared between pairs. While alternating recitation of
segments of the key is likely the intention of developers, in
practice, relationship dynamics complicate the issue. We
observed several instances where the dominant partner in
the relationship read the entire key on their own, with their
partner simply offering an affirmation in response. When
key verification is done in this manner, one party never
actually demonstrates any knowledge of the shared secret—it
is entirely possible that a man-in-the-middle could simply
convey validation of the key when their key is, in actuality,
different. This effect is further emphasized when, as we
saw in one instance, the listening party asks the speaking
party to repeat the first part of the key, reinforcing the
speaking party’s belief that their partner is in possession
of the correct key. It is, however, worth noting that this
“extended” validation once again did not demonstrate any
actual knowledge of the secret.
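One way to picture the alternating recitation that would avoid this problem is to split the fingerprint into short groups and assign consecutive groups to alternating speakers, so that each party must produce digits the other can check. The sketch below is purely illustrative: the fingerprint string, the group size of five, and the function names are assumptions, not a description of how any of the applications behaves.

```python
def split_fingerprint(fingerprint, group_size=5):
    """Strip whitespace and cut the fingerprint into fixed-size groups."""
    digits = "".join(fingerprint.split())
    return [digits[i:i + group_size] for i in range(0, len(digits), group_size)]


def alternating_script(fingerprint, speakers=("A", "B"), group_size=5):
    """Assign consecutive groups to alternating speakers."""
    groups = split_fingerprint(fingerprint, group_size)
    return [(speakers[i % len(speakers)], group) for i, group in enumerate(groups)]


# Hypothetical fingerprint for illustration only.
script = alternating_script("12345 67890 11111 22222")
for speaker, group in script:
    print(f"Participant {speaker} reads: {group}")
```

Because each participant reads half of the groups, a simple "yes, that matches" from the listener is no longer sufficient to complete the comparison.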

8.2 Key length
It was often observed during the study that participants were
surprised at the length of the key data they were intended to
relay to their partners. Though every application used a form
of fingerprinting to greatly reduce the total characters that
needed to be read, users often verbally remarked that strings
were too long. During the key exchange process we often
witnessed fatigue, as participants would read only half the
key and claim that was “good enough” and some recipients
even ignored the key being read to them by their partners
after the first few numbers matched. R27A used a QR code
transmission to handle her first authentication ceremony with
WhatsApp. Upon realizing that no such option existed for
Viber, her second application used, she looked at the key and
exclaimed, “It’s about eight years long!” R27A successfully
checked every digit of the key data with her partner, but
voiced her disapproval of its length repeatedly throughout.

8.3 Viber-specific issues
We observed two issues with Viber. The first relates to its
mechanism for verifying a new user’s phone number. Like
most applications, Viber can send a confirmation text
containing a code, but by default it first calls the new user
as its primary confirmation mechanism.
This took many of our participants by surprise and left
them ill-at-ease to see an unknown number suddenly calling
them. Secondly, and far more concerningly, Viber does not
provide a mechanism to revoke trust. While this is likely a
conscious decision on the developers’ part, it can cause issues
in practice. More specifically, one participant inadvertently
tapped the trust button while trying to figure out how to
verify his partner’s key, thus accidentally conveying to the
application in an apparently irreversible manner that this
individual was now trusted.

Many users were also critical of the Viber UI’s phrasing
for the option to begin the process of key verification. The
option is labeled “Trust this contact,” which many users
hesitated to press, unsure if it would inform the application
to trust the contact or if it would bring up further dialogues
to perform the validation. R36A visibly hesitated during this
step during the study and articulated this concern in the
exit interview: “if I click ‘Trust this Contact’ but I haven’t
verified [my partner] yet, it’s kind of weird.”

8.4 WhatsApp-specific issues
We observed several issues with WhatsApp. WhatsApp ap-
pends a pair of checkmarks next to each message, representing
the delivery and read status of the respective message. How-
ever, a handful of participants mistakenly associated these
checkmarks with security, operating under the misconception
that a checkmark beside a given message indicated that it
had been secured. The other two issues concern the key veri-
fication mechanism. When a matching QR code is scanned,
the application briefly flashes a green checkmark logo over
the QR code area, indicating that the fingerprint has been
validated and is correct. However, because it disappears
quickly, leaving no lasting indication that verification has
occurred, numerous participants wondered if they had veri-
fied the key or not. Additionally, the key verification screen
includes a button to share a screenshot of the verification
screen. Some of our participants assumed that they could
use this to send a screenshot to their partner, who could
then scan the QR code contained therein. Unfortunately for
them, WhatsApp does not provide functionality to scan a
QR code from an image, serving to confuse those who tried.

8.5 Facebook Messenger-specific issues
In addition to the usability concerns already described, such
as the difficulty in locating device keys, Facebook Messenger’s
Secret Conversation functionality—its mechanism for secure
communication—errored more than a few times during our
study. More important, however, was that these errors were
not apparent to participants. Participants were thus unaware
that the Secret Conversation was not operating as intended,
and instead blamed themselves or their counterparts for
failure. One example we encountered several times was
that encrypted messages sent via this mechanism appeared
normally on the user’s phone despite never being received
by their partner. One such participant began shouting in
exasperation at her phone, exclaiming, “I feel like I am having
a conversation with myself ! What’s wrong with this app?!”

8.6 Key changes
One important issue that secure messengers must deal with
in practice is a key change occurring mid-conversation. As
this was not tested by our participants during our study, we
recreated this scenario in each of the three applications to ob-
serve their respective reactions. Facebook Messenger inlines
a message when one’s conversation partner’s key changes,
informing the user that their device has changed and that
their key has changed. Although it does not explicitly instruct
the user to re-verify the key, it is the only one of the three
applications that proactively makes the user aware that a key
change has occurred. Viber gives no proactive notification to
the user that a key change has
occurred, but when the conversation menu is again accessed
post-change, Viber includes an explicit message warning the
user that they will need to re-verify the identity of their
conversation partner. WhatsApp presented no notification
that we could observe. It neither inlined a notification, as
Facebook Messenger did, nor indicated to the user
that re-verification must be performed. In fact, WhatsApp
presents no lasting UI change that allows a user to confirm
that verification has occurred at all.
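
The differences among the three applications can be summarized as a small piece of per-contact state. The following is a minimal sketch of one possible client-side policy, not a description of how any of the studied applications is implemented. The names `Contact`, `mark_verified`, `revoke_trust`, and `on_key_received` are hypothetical, and the policy itself (clear the verified flag and leave a lasting notice on every key change) is our assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Contact:
    """Hypothetical per-contact state kept by a messaging client."""
    name: str
    current_key: Optional[bytes] = None
    verified: bool = False  # lasting indicator that verification happened
    notices: List[str] = field(default_factory=list)  # inlined messages

    def mark_verified(self) -> None:
        """Record that the user completed the authentication ceremony."""
        self.verified = True

    def revoke_trust(self) -> None:
        """Let the user undo an accidental 'trust' action."""
        self.verified = False

    def on_key_received(self, new_key: bytes) -> None:
        """Handle a key announced for this contact."""
        if self.current_key is not None and new_key != self.current_key:
            # A key change invalidates any earlier verification.
            self.verified = False
            self.notices.append(
                f"{self.name}'s key has changed. Re-verify before "
                "sharing sensitive data."
            )
        self.current_key = new_key


bob = Contact("Bob")
bob.on_key_received(b"key-1")   # first key: nothing to compare against
bob.mark_verified()             # user completes the ceremony
bob.on_key_received(b"key-2")   # key change mid-conversation
print(bob.verified, bob.notices)
```

In this sketch the `verified` flag doubles as the lasting UI indicator that participants looked for, and `revoke_trust` covers the accidental-tap case described for Viber.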

9. USER THREAT MODEL
Two authors jointly coded responses to two survey questions
used in both phases regarding participant perception of the
authentication ceremony. These questions were:

• Please explain why you think you have (or have not)
verified the identity of your friend.

• Who do you think can read your message except you
and your friend?

In reviewing the coded data, some details of the threat models
perceived by users became evident.

Note that, if correctly followed, completing the authentica-
tion ceremony successfully guarantees that a participant has
authenticated their partner and no other party can listen in
on the conversation. This of course assumes that the appli-
cations have properly implemented cryptographic protocols.
None of the applications studied are open source, so their
claims cannot be verified.

Of the 141 times the first of these prompts was presented
(excluding Facebook Messenger errors), 109 responses indi-
cated that the authentication ceremony was a primary reason
for successful identification. This is encouraging, but also
expected given the focus that the study placed on its signifi-
cance, which may have biased participants. For example, in
response to the first prompt, R13B stated “…I asked him a
person[al question] that he responded [to] in the right man-
ner, but also because our messages were encrypted and our
personal keys matched.” The use of questions that rely on
shared knowledge was a common response to this prompt,
and it was often coupled with a reference to verifying the
key.

Where verification of personal inquiries is mentioned in
tandem with key verification as a reason for verified identi-
ties, it is unclear whether participants believe the inquiry
can be used as a substitute for key verification or if they
are expressing the more secure notion that proper key ver-
ification includes explicit identity matching. To mitigate
any mislabeling due to this lack of clarity, we focus on the
responses that did not mention key verification as the reason
for identity verification, which occurred 32 times. These
responses focused on verifying features of their partner and
considered impersonation or physical duress attack vectors.
For example, R24A asserted he had verified the identity of
his partner because he had “asked personal questions that
are difficult to know from online material/searches,” and R36B
confided that his partner “was able to tell [him] something
that no one else would know. Unless he was being held at
gunpoint.” Of these 32 responses, 28 (88%) of them mention
using features of their partner as the method of verifying
identity (e.g. physical appearance in video, shared private
knowledge, familiar voice). Two others mentioned trust in
the application itself, one admitted no attempt to verify, and
one trusted that their partner verified on their behalf.

The second prompt listed above provided some insight into
the set of possible attackers considered by participants. This
question was issued 141 times as well, immediately following
the prompt mentioned earlier. Though 109 responses indi-
cated that the identity of their partner had been verified,
only 76 (70%) responses indicated that no other party could
read messages exchanged between the two partners. The
responses of those who indicated that other parties may be
privy to the information were coded to determine the na-
ture of the suspect parties. Five distinct entities were found
to be mentioned in those responses: government, cellular
service providers, physical accessors (e.g., shoulder surfers,
thieves), the application developer, and remote “hackers”.
The number of times each of these entities was mentioned
in a response is recorded in Table 6. Thirty-three of these
labels come from persons who identified the importance of
the key in verifying their partner’s identity but obviously
remained skeptical as to the full security of the application.

Type                    Times Mentioned
Service Provider        4
Government              8
Hackers                 17
Physical Accessors      18
Application Developer   19

Table 6: Attacker types suspected by participants.
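
The counts reported for the two prompts and in Table 6 can be cross-checked with a few lines of arithmetic. The sketch below only restates figures given in the text; the variable names and the `pct` helper are ours.

```python
total_prompts = 141      # times each prompt was presented
cited_ceremony = 109     # first prompt: ceremony cited as a primary reason
no_key_mention = total_prompts - cited_ceremony
partner_features = 28    # of those, relied on features of their partner
no_other_reader = 76     # second prompt: no other party can read messages

# Attacker mentions as listed in Table 6.
table6 = {
    "Service Provider": 4,
    "Government": 8,
    "Hackers": 17,
    "Physical Accessors": 18,
    "Application Developer": 19,
}


def pct(part: int, whole: int) -> int:
    """Percentage rounded half up to the nearest integer."""
    return int(100 * part / whole + 0.5)


print("responses without key verification:", no_key_mention)
print("relying on partner features (%):", pct(partner_features, no_key_mention))
print("no other reader (%):", pct(no_other_reader, cited_ceremony))
print("total attacker mentions:", sum(table6.values()))
```

The percentages use the denominators implied by the text: 32 responses for the partner-feature share and 109 responses for the share reporting no other reader.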

It is interesting to note that man-in-the-middle attacks were
not explicitly mentioned as a possible attack vector in the
responses to either of the prompts evaluated here. Imper-
sonation was mentioned frequently in responses to the first
prompt, and various tampering by governments and those
with physical access to phones and their software were men-
tioned in responses to the second prompt. The apparent
lack of awareness of man-in-the-middle attacks seemed to
influence the trust users had in each other’s identity, based
on the frequent mentions of things like shared knowledge
and videos used when identifying users. Many respondents
further demonstrated this unknown attack surface through
additional commentary. For example, R24A said he “just did
not consider verifying her identity. Thought [it] would [be]
hard to replicate it within this short time.”

Many users did seem to grasp that there were other attacks
possible, but used the term “hacker” as a generic catchall for
these. For example, R27B mentioned that no one could read
the messages sent between her and her partner “unless people
read over our shoulder or people hack into our Facebook
accounts and read them before we delete them.” Similarly,
R36A and R28A stated that the only people who could
read the encrypted messages were “just the two of us unless
there were hackers” and “not WhatsApp or third parties! But
probably people with skills,” respectively.

In addition to being a catchall, use of the “hacker” response
may also provide insight into users’ belief in a theoretical
ceiling on network security. Since most users are
unfamiliar with the mathematical foundations of cryptog-
raphy and the details of security protocols, many struggle
to adopt secure practices and understand the nature of var-
ious threats. On the other hand, users are often aware of
their own ignorance in such matters, and these responses
might indicate that users account for this in mental models
by incorporating a “hacker” entity that is always capable of
subverting any piece of the system. In this sense, lack of
security knowledge both affects users’ ability to make secure
decisions and lowers their confidence in security itself.

Some users also expressed some suspicion of the applications
themselves for government and/or developer eavesdropping.
R24B was suspicious of both: “Viber (if they want to) &
government investigation agencies”. Other respondents
explicitly mentioned “backdoors” built into the applications or
general suspicions like R29B: “I still feel like WhatsApp can
read the messages even though they say they can’t.” Finally,
some users were wary of logging, as exemplified by R15A:
“The company I’m sure has records of the texts but [security]
depends on if they go through them or not.”

Overall, the responses indicate that users have a healthy wari-
ness and high-level understanding of impersonation attacks,
government and developer backdoors, and physical theft, but
that the same cannot be said for man-in-the-middle attacks,
both passive and active. It is assumed that some of the
mentions of “hackers” refer to this, but these responses were
far less specific than for other attacks. In other words, it
appears that users’ threat models do not include the ability
for attackers to be positioned in between the two endpoints of
a conversation. If this were understood, we hypothesize that
far fewer respondents would have relied on physical appearance
or shared knowledge as an identity verification mechanism.
Since one of the primary goals of the secure exchange of keys
is to thwart man-in-the-middle attacks, work may be needed
to help users understand this attack vector.
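
A toy example may help show why shared knowledge does not rule out an attacker positioned between the endpoints, while an out-of-band key comparison does. This is a simplified sketch under our own assumptions: the byte strings stand in for real public keys, `fingerprint` is a hypothetical helper based on SHA-256 rather than any application's actual fingerprint format, and `relay` models an attacker who forwards messages unchanged after reading them.

```python
import hashlib


def fingerprint(public_key: bytes) -> str:
    """Short digest of a public key (toy stand-in for a key fingerprint)."""
    return hashlib.sha256(public_key).hexdigest()[:16]


# Placeholder byte strings stand in for real public keys.
bob_key = b"bob-public-key"
mallory_key = b"mallory-public-key"

# Active attack: the key Alice's app holds for "Bob" is really Mallory's.
key_alice_holds_for_bob = mallory_key


def relay(message: str) -> str:
    """Mallory reads each message and forwards it unchanged."""
    return message


# Check 1: shared knowledge. Bob's genuine answer is relayed, so it passes.
expected_answer = "the lake house"
shared_knowledge_ok = relay("the lake house") == expected_answer

# Check 2: fingerprints compared out of band (in person or over a call).
fingerprints_match = fingerprint(key_alice_holds_for_bob) == fingerprint(bob_key)

print("shared knowledge check passes:", shared_knowledge_ok)
print("fingerprints match:", fingerprints_match)
```

The personal question succeeds because the real partner is still answering it; only the key comparison reveals that the key in use is not the partner's.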

10. CONCLUSION
We used a two-phase study to examine whether users are
able to locate and complete the authentication ceremony
in secure messaging applications. In the first phase, users
were aware only of the need to authenticate and ensure
confidentiality, and only two of twelve users were able to
locate the authentication ceremony, with an overall success
rate of 14%. Participants instead primarily used personal
characteristics, such as a person’s voice, face, or shared
knowledge. In the second phase, users were instructed about
the importance of comparing encryption keys in order to
authenticate a partner, leading to an overall success rate of
78%. Users were significantly more successful using Viber.
However, the time required to find and use the authentication
ceremony was 11 minutes, combined, on average across all
applications, which may be so long that it would discourage
users from authenticating each other.

Based on our findings, we believe that many users can locate
and complete the authentication ceremony in secure messag-
ing applications if they know they are supposed to compare
keys. However, most people do not understand the threat
model, so it is not clear that they will know how important
it is to compare keys.

An open question is how secure messaging applications can
prompt the correct behavior, even without user understand-
ing. It may be possible to leverage the tendency users have
to rely on personal characteristics for authentication. We are
exploring the use of social authentication [20] as a way of
translating authentication of encryption keys into a method
that is more understandable to users.

Another area for future work is improving the authentication
ceremony so that it does not take so long to complete. A
system like CONIKS [9] may help to automate the process of
discovering another person’s key without relying on a single
trusted party, while also providing non-equivocation so that
key servers cannot deceive users.

11. ACKNOWLEDGMENTS
The authors thank the anonymous reviewers and our shep-
herd, Lujo Bauer, for their helpful feedback. This material
is based upon work supported by the National Science Foun-
dation under Grant No. CNS-1528022.


12. REFERENCES
[1] H. Assal, S. Hurtado, A. Imran, and S. Chiasson.
What’s the deal with privacy apps?: A comprehensive
exploration of user perception and usability. In
International Conference on Mobile and Ubiquitous
Multimedia. ACM, 2015.

[2] E. Atwater, C. Bocovich, U. Hengartner, E. Lank, and
I. Goldberg. Leading Johnny to water: Designing for
usability and trust. In Eleventh Symposium On Usable
Privacy and Security (SOUPS 2015), pages 69–88,
Montreal, Canada, 2015. USENIX Association.

[3] W. Bai, D. Kim, M. Namara, Y. Qian, P. G. Kelley,
and M. L. Mazurek. An inconvenient trust: User
attitudes toward security and usability tradeoffs for
key-directory encryption systems. In Twelfth
Symposium On Usable Privacy and Security (SOUPS
2016). USENIX Association, 2016.

[4] A. Bangor, P. Kortum, and J. Miller. Determining
what individual SUS scores mean: Adding an adjective
rating scale. Journal of Usability Studies (JUS),
4(3):114–123, 2009.

[5] A. De Luca, S. Das, M. Ortlieb, I. Ion, and B. Laurie.
Expert and non-expert attitudes towards (secure)
instant messaging. In Twelfth Symposium On Usable
Privacy and Security (SOUPS 2016). USENIX
Association, 2016.

[6] S. Dechand, D. Schürmann, T. IBR, K. Busse, Y. Acar,
S. Fahl, and M. Smith. An empirical study of textual
key-fingerprint representations. In Twenty-Fifth
USENIX Security Symposium (USENIX Security 2016).
USENIX Association, 2016.

[7] Facebook. facebookmessenger.com.
https://www.messenger.com/. Accessed: 8 March,
2017.

[8] A. Herzberg and H. Leibowitz. Can Johnny finally
encrypt? Evaluating E2E-encryption in popular IM
applications. In Sixth International Workshop on
Socio-Technical Aspects in Security and Trust (STAST
2016), Los Angeles, California, USA, 2016.

[9] M. S. Melara, A. Blankstein, J. Bonneau, E. W. Felten,
and M. J. Freedman. CONIKS: Bringing key
transparency to end users. In Twenty-Fourth USENIX
Security Symposium (USENIX Security 2015), pages
383–398. USENIX Association, 2015.

[10] S. Milgram and E. Van den Haag. Obedience to
authority, 1978.

[11] S. Ruoti, J. Andersen, T. Hendershot, D. Zappala, and
K. Seamons. Private Webmail 2.0: Simple and
easy-to-use secure email. In Twenty-Ninth ACM User
Interface Software and Technology Symposium (UIST
2016), Tokyo, Japan, 2016. ACM.

[12] S. Ruoti, J. Andersen, D. Zappala, and K. Seamons.
Why Johnny still, still can’t encrypt: Evaluating the
usability of a modern PGP client, 2015. arXiv preprint
arXiv:1510.08555.

[13] J. Sauro. A practical guide to the system usability scale:
Background, benchmarks & best practices. Measuring
Usability LLC, 2011.

[14] J. Sauro and J. R. Lewis. Average task times in
usability tests: what to report? In Twenty-Eighth ACM
Conference on Human Factors in Computing Systems
(CHI 2010), pages 2347–2350. ACM, 2010.

[15] S. Schröder, M. Huber, D. Wind, and C. Rottermanner.
When SIGNAL hits the fan: On the usability and
security of state-of-the-art secure mobile messaging. In
First European Workshop on Usable Security
(EuroUSEC 2016), 2016.

[16] S. Sheng, L. Broderick, C. Koranda, and J. Hyland.
Why Johnny still can’t encrypt: Evaluating the
usability of email encryption software. In Poster
Session at the Symposium On Usable Privacy and
Security, Pittsburgh, PA, 2006.

[17] A. Sotirakopoulos, K. Hawkey, and K. Beznosov. “I did
it because i trusted you”: Challenges with the study
environment biasing participant behaviours. In SOUPS
Usable Security Experiment Reports (USER) Workshop,
2010.

[18] J. Tan, L. Bauer, J. Bonneau, L. F. Cranor, J. Thomas,
and B. Ur. Can unicorns help users compare crypto key
fingerprints? In Thirty-Fifth ACM Conference on
Human Factors and Computing Systems (CHI 2017),
pages 3787–3798. ACM, 2017.

[19] N. Unger, S. Dechand, J. Bonneau, S. Fahl, H. Perl,
I. Goldberg, and M. Smith. SoK: secure messaging. In
Thirty-Sixth IEEE Symposium on Security and Privacy
(SP 2015), pages 232–249. IEEE, 2015.

[20] E. Vaziripour, M. O’Neill, J. Wu, S. Heidbrink,
K. Seamons, and D. Zappala. Social authentication for
end-to-end encryption. In Who Are You?! Adventures
in Authentication (WAY 2016). USENIX Association,
2016.

[21] Viber. Viber.com. https://www.viber.com/en/.
Accessed: 8 March, 2017.

[22] WhatsApp. Whatsapp.com.
https://www.whatsapp.com/. Accessed: 8 March,
2017.

[23] A. Whitten and J. Tygar. Why Johnny can’t encrypt:
A usability evaluation of PGP 5.0. In Eighth USENIX
Security Symposium (USENIX Security 1999), pages
14–28, Washington, D.C., 1999. USENIX Association.

APPENDIX
A. STATISTICAL TESTS
This section contains the details of the statistical tests we
ran.

A.1 Success and Failure Rates
This data measures whether the participants were successful
in using the authentication ceremony for each application
in the second phase of the study. We want to test whether
there are any differences between the applications.

Because the data is dichotomous we used Cochran’s Q Test
and found that the success rate was statistically different for
the applications (χ2(2) = 15.429, p<.0005).

We then ran McNemar’s test to find the significant differences
among the pairs of applications. As shown in Table 7, and
after applying a manual Bonferroni correction for the three
tests (requiring p<0.0167), there is a significant difference
between WhatsApp and Viber (p=0.008) as well as between
Facebook Messenger and Viber (p<0.0005).
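
As a rough check, the exact significance values in Table 7 can be recomputed from the discordant counts alone, that is, the participants who succeeded with one application of a pair but failed with the other. The following is a minimal sketch, assuming the "Exact Sig." column is the two-sided exact (binomial) form of McNemar's test; `mcnemar_exact` is our own helper, not the authors' analysis script.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    k = min(b, c)
    # Lower tail of Binomial(n, 0.5), doubled and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Discordant counts read from Table 7.
pairs = {
    "WhatsApp vs. Viber": (8, 0),
    "Messenger vs. Viber": (12, 0),
    "Messenger vs. WhatsApp": (2, 8),
}

alpha = 0.05 / 3  # Bonferroni threshold for three tests (about 0.0167)
for name, (b, c) in pairs.items():
    p = mcnemar_exact(b, c)
    print(f"{name}: p={p:.4f}, significant={p < alpha}")
```

Only the two comparisons involving Viber fall below the corrected threshold, which agrees with the result stated above.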

Row app     Column app   Row outcome   Column: Fail   Column: Success   N    Exact Sig.
WhatsApp    Viber        Fail          2              8                 48   0.008
                         Success       0              38
Messenger   Viber        Fail          0              12                42   0.000
                         Success       0              30
WhatsApp    Messenger    Fail          4              2                 42   0.109
                         Success       8              28

Table 7: McNemar’s test for success and failure

A.2 Task Completion Times

This data measures the time taken by participants to (a) find
the authentication ceremony and (b) complete the authen-
tication ceremony, which was only measured in the second
phase of the study. We want to know if there is a significant
difference in the time to complete these tasks among the
three different applications tested—WhatsApp, Viber, and
Facebook Messenger.

We first tested for normality using the Shapiro-Wilk test. As
Table 8 shows, the data is not normally distributed for any
application (p<0.05).

Task                  Application   Statistic   df   Sig.
Finding Ceremony      WhatsApp      0.902       38   0.003
                      Viber         0.878       46   0.000
                      Messenger     0.886       30   0.004
Completing Ceremony   WhatsApp      0.856       38   0.000
                      Viber         0.835       46   0.000
                      Messenger     0.762       30   0.000

Table 8: Shapiro-Wilk test for task completion times

We next ran the Kruskal-Wallis test, which is a nonparametric
test that can determine if there are statistically significant
differences between two or more groups. This test rejects
the null hypothesis that the distribution of task times is the
same across the applications, for both finding the ceremony
(p=0.031) and completing the ceremony (p=0.043). We next
ran pairwise post-hoc tests to determine where the differences
occur.

As Table 9 shows, we found a significant difference between
WhatsApp and Facebook Messenger for finding the ceremony
(p=0.029), with Facebook Messenger being faster (mean time,
WhatsApp=3.7 minutes, Facebook Messenger=2.5 minutes).
We also found a significant difference between Viber and
WhatsApp for completing the ceremony (p=0.021), with
Viber being faster (mean time, WhatsApp=8.5 minutes,
Viber=6.7 minutes). Note that the significance has been adjusted by
the Bonferroni correction for multiple tests.

Task                  Comparison             Test Statistic   Std. Error   Std. Test Statistic   Adj. Sig.
Finding Ceremony      Messenger – Viber      14.887           7.616        1.955                 0.152
                      Viber – WhatsApp       5.492            7.114        0.772                 1.000
                      Messenger – WhatsApp   20.379           7.926        2.571                 0.030
Completing Ceremony   Messenger – Viber      -12.000          7.702        -1.558                0.358
                      Viber – WhatsApp       17.526           7.195        2.436                 0.045
                      Messenger – WhatsApp   5.526            8.016        0.689                 1.000

Table 9: Pairwise comparisons from Kruskal-Wallis post-hoc
tests for task completion times
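
The adjusted significance values in Table 9 can be related to the standardized test statistics with a short calculation. The sketch below assumes that each adjusted value is the two-sided standard normal p-value of the standardized statistic, multiplied by the number of pairwise comparisons (three) and capped at 1. This convention is our assumption about how the adjustment was applied, and `bonferroni_adjusted_p` is a hypothetical helper.

```python
from math import erfc, sqrt


def bonferroni_adjusted_p(z: float, n_comparisons: int = 3) -> float:
    """Bonferroni-adjusted two-sided p-value for a standardized statistic."""
    two_sided = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return min(1.0, n_comparisons * two_sided)


# Standardized test statistics taken from Table 9.
std_stats = {
    ("Finding", "Messenger – Viber"): 1.955,
    ("Finding", "Viber – WhatsApp"): 0.772,
    ("Finding", "Messenger – WhatsApp"): 2.571,
    ("Completing", "Messenger – Viber"): -1.558,
    ("Completing", "Viber – WhatsApp"): 2.436,
    ("Completing", "Messenger – WhatsApp"): 0.689,
}

for (task, pair), z in std_stats.items():
    print(f"{task:<10} {pair:<22} adj. p = {bonferroni_adjusted_p(z):.3f}")
```

Under this assumption, only the Messenger – WhatsApp comparison for finding the ceremony and the Viber – WhatsApp comparison for completing it fall below 0.05.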

A.3 Favorite Rankings
This data measures the system participants selected as their
favorite, which was collected in both phases of the
study. We want to test whether there are any differences
between the favorite rankings for each application between
the two phases.

We ran a Chi-Square test using the scores for the favorite
application. As shown in Table 10, there are no statistically
significant differences.

Phase   Favorite WhatsApp   Favorite Viber   Favorite Messenger   Pearson Chi-Square   df   Asym. Sig.
1       9                   2                11                   2.069                2    0.355
2       15                  11               21

Table 10: Chi-Square test for favorite application ranking
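
The Pearson statistic in Table 10 can be recomputed directly from the six counts. This is a minimal sketch using only the standard library; `pearson_chi_square` is our own helper. It relies on the fact that, for two degrees of freedom, the chi-square upper-tail probability has the closed form exp(-x/2).

```python
from math import exp

# Favorite counts from Table 10: WhatsApp, Viber, Messenger.
observed = [
    [9, 2, 11],    # phase 1
    [15, 11, 21],  # phase 2
]


def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


stat, df = pearson_chi_square(observed)
# Closed form of the upper-tail probability, valid only for df == 2.
p_value = exp(-stat / 2)
print(f"chi-square = {stat:.3f}, df = {df}, p = {p_value:.3f}")
```

For tables with other degrees of freedom, the closed form no longer applies and a general chi-square distribution function would be needed.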

A.4 Trust Scores
We ran a mixed-model ANOVA test because we are interested
in seeing the interaction between two independent variables
(application and phase). This data is not well suited to
a Kruskal-Wallis test because the use of the Likert scale
provides too many ties when measuring trust. Mauchly’s test
of sphericity indicated that the assumption of sphericity was
met for the two-way interaction (χ2(2) = 3.385, p=.184).

We next examined the results for tests of within-subject
effects and found that there is a significant interaction be-
tween the application and the study phase (F(2,140)=5.023,
p=0.008, partial η2 = 0.067).

To determine whether there was a simple main effect for
the application, we ran a repeated measures ANOVA on
each phase. As shown in Table 11, there was a statisti-
cally significant effect of the application on trust for phase 1
(F(2,46)=4.173, p=0.022, partial η2 = .154). Note that due
to a violation of the sphericity assumption in phase 2, we
use the Greenhouse-Geisser correction.

Phase   Mean WhatsApp   Mean Viber   Mean Messenger   df            F       Sig.    η2
1       4.13            3.58         3.79             2, 46         4.173   0.022   0.154
2       4.10            4.40         4.17             1.69, 79.42   1.843   0.171   0.038

Table 11: Repeated measures ANOVA on each phase

By examining the pairwise comparisons, shown in Table 12,
we found that the trust score was significantly lower for
Viber as compared to WhatsApp in the first phase (M=0.542,
SE=0.180, p=0.019). Note, we use the Bonferroni correction
for multiple tests.

Comparison           Mean Difference   Std. Error   Adj. Sig.   Lower Bound   Upper Bound
WhatsApp-Viber       0.542             0.180        0.019       0.076         1.007
WhatsApp-Messenger   0.333             0.155        0.128       -0.068        0.735
Messenger-Viber      0.208             0.225        1.000       -0.373        0.789

Table 12: Pairwise comparisons from one-way ANOVA on
each application, phase 1

To determine whether there was a simple main effect for
the study phase, we ran a one-way ANOVA on each application
to compare the trust between the two phases. As shown in
Table 13, there was a statistically significant difference
in trust ratings between the two phases for Viber
(F(1,70)=14.994, p<0.0005, partial η2 = .176). The mean
trust for Viber in the first phase was 3.58, and in the second
phase it increased to 4.40.

Application   Mean Phase 1   Mean Phase 2   df      F        Sig.    η2
WhatsApp      4.13           4.12           1, 70   0.007    0.935   0.00
Viber         3.58           4.40           1, 70   14.994   0.00    0.176
Messenger     3.79           4.17           1, 70   2.230    0.140   0.031

Table 13: One-way ANOVA on each application
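
The partial η2 values reported in this section follow from each F statistic and its degrees of freedom through the standard identity partial η2 = (F · df_effect) / (F · df_effect + df_error). The sketch below applies that identity to the reported values; `partial_eta_squared` and the labels in `reported` are ours.

```python
def partial_eta_squared(f: float, df_effect: float, df_error: float) -> float:
    """Partial eta squared recovered from an F statistic and its dfs."""
    return f * df_effect / (f * df_effect + df_error)


# (label, F, df_effect, df_error, reported partial eta squared)
reported = [
    ("application x phase interaction", 5.023, 2, 140, 0.067),
    ("application, phase 1", 4.173, 2, 46, 0.154),
    ("application, phase 2 (Greenhouse-Geisser)", 1.843, 1.69, 79.42, 0.038),
    ("phase, Viber", 14.994, 1, 70, 0.176),
    ("phase, Messenger", 2.230, 1, 70, 0.031),
]

for label, f, df1, df2, eta in reported:
    computed = partial_eta_squared(f, df1, df2)
    print(f"{label}: computed={computed:.3f}, reported={eta:.3f}")
```

Each computed value agrees with the reported one to three decimal places.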

B. STUDY MATERIALS
This section contains the study materials we used. The in-
terview guide and interview form were used by the study
coordinators to ensure that each pair of participants experi-
enced an identical study. The questionnaire was followed by
study participants to guide them through the study.

B.1 Interview Guide
Make sure to complete the following steps:

1. When the participants arrive, read them the following:

Welcome to our secure messaging application study. We
are the study coordinators and are here to assist you as
needed.

Before we start the study, we need you to install the
following applications: WhatsApp, Facebook Messenger,
Viber.

In this study, the two of you will be in different rooms
and will use the applications to communicate with each
other. You will each be asked to play the role of an-
other person. I will provide you with information about
this person. During the study, please use the provided
information and not your own personal information.

Note that even though you are in separate rooms, you are
welcome to ask to meet, call, or email your
study partner during the study if you need to in order to complete
the study.

You will be asked to do the task while thinking
aloud and expressing your feelings or thoughts about each
single task that you are doing. During the course of this
study we will be recording what is happening in the study
room, including any verbal communication with the
study coordinators. These recordings will not be seen
by anyone besides the researchers and will be destroyed
once our research is complete. We will not collect any
personally identifying information. Any data, besides
the previously mentioned recordings and answers to the
study survey, will be deleted automatically upon your
completion of the study.

You will each receive $10 as compensation for your par-
ticipation in this study. The expected time commitment
is approximately 60 minutes. If you have any questions
or concerns, feel free to ask us. You can end participa-
tion in this survey at any time and we will delete all
data collected at your request. A study coordinator will
be with you at all times to observe the study and also to
answer any questions you may have.

2. Before going to the study rooms, make sure they sign
the audio recording consent form.

3. Make sure their phone has enough space for installing
the three apps (you can ask them to install the apps
before the study starts)

4. Choose one of the available codes for later usage in the
study from the following link (a spreadsheet for time
slots)

5. Flip a coin and choose one participant to be Person A
and one person to be Person B.

6. Take the user with whom you decided to work to the
study room. Complete the following setup steps:

(a) Ask the participant to sit down.

(b) Start the audio recording using the phones in the
lab.

(c) Read the following instructions to your participant:
We are going to ask you to do a series of tasks.
Once you are done with each step, let the study
coordinator know you have finished the task. You
will then fill out a questionnaire and go to the next
step.
We need you to think out loud while you are doing
the tasks, meaning you are supposed to talk about
how you are accomplishing the task and express any
feelings you have.
If you have any questions about the study ask the
study coordinator. Remember you are allowed to
talk to or meet your friend during the study.
Please do not forget to think aloud.

7. On the chromebook, load the survey from Qualtrics

8. Give the code you already selected to the user.

9. Before using each system, the survey will instruct the
participant to tell you they are ready to begin the next
task.

10. During the course of the task, pay attention to what
the user is doing and fill out one of the attached sheets.

(a) The user is supposed to think aloud while doing the
tasks. If she forgets, gently remind her.

(b) If the user complains that he is confused, suggest he
can consult with his study partner and do not help
him to accomplish the task. Try not to instruct
the user when they ask questions. Answer them
while giving as little information as you can away
about the study, but try to remind him that he has
a partner who can help him.

(c) If it takes the pair too long to use one application
(10 minutes), then record that as a failure and guide
the user to the next task. If you end the task, inform
the other study coordinator that you have done so,
so that he catches up with you.

11. When the survey is finished, ask the participant about
their experience.

(a) Use the situations you noted while they took the
study or interesting things they said on the survey.

(b) If they had any problems during the study, ask them
to use their own words to describe the problem. Ask
them how they would like to see it resolved.


12. When the participant is finished, go to meet the other
group in your room. Next, ask them the following
questions: (If it is applicable)

(a) You saw QR codes, strings of digits, and maybe
NFC communication (touching your phones) as
methods for verifying keys. Which one did you
prefer and why?

(b) If you were in a different city or state from your
friend, how would you verify your friend’s key?
Would this be convenient?

(c) Some of these applications, like Facebook Messenger,
let you chat both securely and insecurely. The
rest of the applications only let you have secure
conversations. Which approach do you prefer and
why?

13. Thank the participants for their time. Help them fill
out the compensation forms. Send them to the CS office
to be compensated.

14. Stop the audio recording. Save the record by time.

15. Fill in your name:

16. Return this form.

B.2 Interview Form
Study Coordinator’s Name:
Study Number:
System:
WhatsApp, Viber, Facebook Messenger
Start Time:
End Time:
Key Verification:

☐ QR Code

☐ Manual verification via phone call

☐ Manual verification in person

☐ Manual verification other:

☐ NFC

☐ Verified successfully

☐ Notices conversation encrypted

Mistakes Made:

☐ The user sends the key or anything related to the key
via the application itself

☐ The user sends sensitive data (the credit card number)
unencrypted or before doing the identity verification

☐ Does not use an encrypted conversation

☐ Other:

Other:

☐ The user calls, texts or meets his study partner. Explain:

☐ The application crashes and needs to be restarted. Explain:

☐ The user expresses any strong feelings toward the task
(e.g. how boring or hard or easy it is). Explain:

☐ Other. Explain:

C. STUDY QUESTIONNAIRE
Secure Messaging Application Study

1. Please enter the Type.

◦ A
◦ B

2. Please enter the code that study coordinator provides
for you, here.

3. What is your gender?

◦ Male
◦ Female
◦ I prefer not to answer

4. What is your age?

◦ 17 and under
◦ 18-24
◦ 25-34
◦ 35-45
◦ 46-64
◦ 65 and over
◦ I prefer not to answer

5. What is the highest degree or level of school you have
completed?

◦ None
◦ Primary/grade school
◦ Some high school, no diploma
◦ High school graduate: diploma or equivalent (e.g.,
GED)

◦ Some college, no diploma
◦ Associate’s or technical degree
◦ Bachelor’s degree
◦ Graduate/professional degree
◦ I prefer not to answer

6. What is your occupation or major?

7. Mark any of the following options which apply to you.

☐ Others often ask me for help with the computer.

☐ I often ask others for help with the computer.

☐ I have never designed a website.

☐ I have never installed software.

☐ I have never used SSH.

☐ Computer security is one of my job responsibilities.

☐ I have taken courses related to computer security,
electronic engineering, security, or IT.

☐ I often use secure messaging applications such as
WhatsApp.

☐ I have never sent an encrypted email.

☐ I am familiar with cryptography.

☐ I understand the difference between secure and
non-secure messaging applications.

8. (Second phase only) How would you rate your knowl-
edge of computer security?

◦ Beginner
◦ Intermediate
◦ Advanced


9. (Second phase only) If you are reading a website,
such as CNN, using HTTP, who can see what you are
reading?

◦ Nobody, this is a private connection.
◦ Your ISP and CNN, but nobody else.
◦ Any router in between you and CNN.
◦ Your ISP and nobody else.
◦ I don’t know

10. (Second phase only) If you use a regular text mes-
saging application, who can read your text messages?

◦ Only the person you send the text message to.
◦ The person you send the text message to and the
company providing the text messaging service.

◦ Anybody who is nearby.
◦ Google.
◦ I don’t know.

11. (Second phase only) How can you tell if it is safe to
enter your username and password on a website?

◦ The website has a good reputation.
◦ The website has a privacy statement.
◦ There is a lock icon in the URL bar and the URL
shows the right host name.

◦ The website is professionally designed.
◦ I don’t know.

12. (Second phase only) What is phishing?

◦ Making a fake website that looks legitimate to steal
your private information.

◦ Hacking someone’s computer.
◦ Calling someone pretending to be a company to
steal their information.

◦ Tracking your internet habits to send advertisements.

◦ I don’t know.

13. (Second phase only) What is a public key used for?

◦ I do not know what a public key is.
◦ To encrypt data for the person who owns the
corresponding private key.

◦ To set up 2-factor authentication so your password
can’t be stolen.

◦ To identify you to a bank.
◦ To protect an application so you know it is safe to
use.

14. (Second phase only) If you receive a message
encrypted with your friend’s private key, then you know
that

◦ Your friend has been hacked.
◦ Your friend was the one who sent the message.
◦ Everything you send your friend is private.
◦ You can’t trust what your friend is sending you.
◦ I do not know what a private key is.

15. Which of the following applications have you ever used?
Select all options that apply to you.

□ WhatsApp
□ ChatSecure
□ Signal
□ Telegram
□ Zendo
□ SafeSlinger
□ Allo
□ FB messenger
□ iMessage
□ imo
□ Skype
□ Viber
□ Other

16. What is the main reason why you use these applications
(list of applications from previous question)?

17. Have you ever tried to verify the identity of the person
you are communicating with when you are using (list of
applications from previous question)?

◦ Yes
◦ No
◦ Not Sure

18. Have you ever tried to send sensitive information when
you use (list of applications from previous question)?

◦ Yes
◦ No

19. Have you ever had an experience or heard any stories
about any secure messaging applications being
compromised?

◦ Yes
◦ No

20. If yes, what story did you hear and what application
was it about?

21. Second Phase Only:

Read aloud the following instructions:

What Is Secure Messaging?

When you use regular text messaging, your
phone company can read your text messages.

When you use secure messaging apps, you are
having a private conversation with your friend.

Not even the company running the service can
see your messages.

But you still need to be careful. A hacker could
intercept your traffic.

[Illustration: “The bad guy”]

To make sure your conversation is secure, these
applications assign a “key” to each person.

You need to make sure the key you see is the
same key your friend sees.

Secure messaging apps provide a way for you to
compare these keys.

We want to see how well the application helps
you do this.
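
The key comparison described in the instructions above can be illustrated with a minimal sketch. It assumes each side holds the raw bytes of the same public key and reduces them to a short, human-readable fingerprint; the function names `fingerprint` and `keys_match` and the choice of SHA-256 are illustrative assumptions, not the scheme used by WhatsApp, Viber, or Facebook Messenger.

```python
import hashlib
import hmac


def fingerprint(public_key: bytes, groups: int = 8) -> str:
    """Return a short hex fingerprint, split into 4-character groups.

    Hypothetical display format: real applications use their own encodings.
    """
    digest = hashlib.sha256(public_key).hexdigest()
    short = digest[: groups * 4]
    return " ".join(short[i:i + 4] for i in range(0, len(short), 4))


def keys_match(seen_by_me: bytes, seen_by_friend: bytes) -> bool:
    """Compare the two keys by their full SHA-256 digests."""
    mine = hashlib.sha256(seen_by_me).digest()
    theirs = hashlib.sha256(seen_by_friend).digest()
    # compare_digest avoids leaking where the first difference occurs.
    return hmac.compare_digest(mine, theirs)


if __name__ == "__main__":
    alice_view = b"example public key bytes"
    bob_view = b"example public key bytes"
    print(fingerprint(alice_view))
    print("match" if keys_match(alice_view, bob_view) else "MISMATCH")
```

If an attacker substitutes a different key on one side, the two fingerprints differ, which is what the participants are asked to detect by comparing them.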

22. Tell the study coordinator that you are ready for the
next task to begin.

Repeat the following block for each of the three
applications

23. You would like to send secure text messages to your
friend. For example, you might want to ask for a credit
card number you left at home, or talk confidentially
about a friend who is depressed.

In this study we need you to do the following steps:

For Person A

You are going to be using (WhatsApp/Viber/Facebook
Messenger) for secure texting with your friend. This
application is designed to help you have a private
conversation with your friend.

Your task is to make sure that you are really talking to
your friend and that nobody else (such as the service
provider) can read your text messages. The application
should have ways to help you do this.

We want you to talk and think aloud as you figure this
out.

Once you are sure the conversation is secure, ask the
other person to send you your credit card number with
the following message.

“Hello! Can you send me my credit card number that I
left on my desk at home?”

For Person B

You are going to be using (WhatsApp/Viber/Facebook
Messenger) for secure texting with your friend. This
application is designed to help you have a private
conversation with your friend.

Your task is to make sure that you are really talking to
your friend and that nobody else (such as the service
provider) can read your text messages. The application
should have ways to help you do this.

We want you to talk and think aloud as you figure this
out.

Say out loud why you believe you are texting the
right person and why nobody else can read the text
messages. Your preference is to figure this out without
the other person in the same room, but if you need to
visit the other person to do this, you should go ahead
and visit them.

Once you are sure the conversation is secure, he/she
will ask you to send his/her credit card number through
the application. Use the following number in the study:
“132542853779”

24. You will now be asked several questions concerning your
experience with (WhatsApp/Viber/Facebook Messenger).

25. (Second phase only) Please answer the following
questions about (WhatsApp/Viber/Facebook Messenger).
Try to give your immediate reaction to each statement
without pausing to think for a long time. Mark
the middle column if you don’t have a response to a
particular statement.

• I think that I would like to use this system
frequently.

◦ Strongly agree
◦ Somewhat agree
◦ Neither agree nor disagree
◦ Somewhat disagree
◦ Strongly disagree

• I found the system unnecessarily complex.
◦ Strongly agree
◦ Somewhat agree
◦ Neither agree nor disagree
◦ Somewhat disagree
◦ Strongly disagree

• I thought the system was easy to use.
◦ Strongly agree
◦ Somewhat agree
◦ Neither agree nor disagree
◦ Somewhat disagree
◦ Strongly disagree

• I think that I would need the support of a technical
person to be able to use this system.

◦ Strongly agree
◦ Somewhat agree
◦ Neither agree nor disagree
◦ Somewhat disagree
◦ Strongly disagree

• I found the various functions in this system were
well integrated.

◦ Strongly agree
◦ Somewhat agree
◦ Neither agree nor disagree
◦ Somewhat disagree
◦ Strongly disagree

• I thought there was too much inconsistency in this
system.

◦ Strongly agree
◦ Somewhat agree
◦ Neither agree nor disagree
◦ Somewhat disagree
◦ Strongly disagree

• I would imagine that most people would learn to
use this system very quickly.

◦ Strongly agree
◦ Somewhat agree
◦ Neither agree nor disagree
◦ Somewhat disagree
◦ Strongly disagree

• I found the system very cumbersome to use.
◦ Strongly agree
◦ Somewhat agree
◦ Neither agree nor disagree
◦ Somewhat disagree
◦ Strongly disagree

• I felt very confident using the system.
◦ Strongly agree
◦ Somewhat agree
◦ Neither agree nor disagree
◦ Somewhat disagree
◦ Strongly disagree

• I needed to learn a lot of things before I could get
going with this system.

◦ Strongly agree
◦ Somewhat agree
◦ Neither agree nor disagree
◦ Somewhat disagree
◦ Strongly disagree
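
The ten statements above match the standard System Usability Scale (SUS) items. A minimal sketch of the conventional SUS scoring follows, assuming the response labels map to 1 through 5 with "Strongly agree" as 5; this mapping and the function name `sus_score` are assumptions for illustration, and the questionnaire itself does not state how responses were coded.

```python
# Assumed mapping from response label to the conventional 1-5 SUS value.
LIKERT = {
    "Strongly disagree": 1,
    "Somewhat disagree": 2,
    "Neither agree nor disagree": 3,
    "Somewhat agree": 4,
    "Strongly agree": 5,
}


def sus_score(responses):
    """Return the 0-100 SUS score for ten labels in questionnaire order."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly ten responses")
    total = 0
    for index, label in enumerate(responses):
        value = LIKERT[label]
        # Statements 1, 3, 5, 7, 9 are positively worded: contribute value - 1.
        # Statements 2, 4, 6, 8, 10 are negatively worded: contribute 5 - value.
        total += (value - 1) if index % 2 == 0 else (5 - value)
    # Each statement contributes 0-4, so the sum (0-40) is scaled to 0-100.
    return total * 2.5


if __name__ == "__main__":
    neutral = ["Neither agree nor disagree"] * 10
    print(sus_score(neutral))
```

Under this scoring, marking the middle column for every statement yields the midpoint of the scale, which is why the instructions ask participants to use it when they have no response.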

26. I trust this application to be secure.

◦ Strongly agree
◦ Somewhat agree
◦ Neither agree nor disagree
◦ Somewhat disagree
◦ Strongly disagree

27. Have you managed to verify the identity of your friend
correctly?

◦ No
◦ Yes
◦ Not sure

28. Please explain why you think you have (or have not)
verified the identity of your friend.

29. Who do you think can read your message besides you
and your friend?

End of the repeated block

30. You have finished all the tasks for this study. Please
answer the following questions about your experience.

31. Which system was your favorite?

◦ WhatsApp
◦ Viber
◦ Facebook Messenger
◦ I didn’t like any of the systems I used

32. Please explain why.

33. Which of the following applications have you ever used
for secure communication? Select all options that
apply to you.

□ WhatsApp
□ ChatSecure
□ Signal
□ Telegram
□ Zendo
□ SafeSlinger
□ Allo
□ FB messenger
□ iMessage
□ Skype
□ imo
□ Viber
□ Other

34. Please answer the following question. Try to give your
immediate reaction to each statement without pausing
to think for a long time. Mark the middle column if
you don’t have a response to a particular statement.

It is important to me to be able to have private
conversations with my friends and family using secure
applications (like WhatsApp).

◦ Strongly disagree
◦ Disagree
◦ Neither Agree nor Disagree
◦ Agree
◦ Strongly Agree

35. Did you know about encryption before attending this
study?

36. Are you willing to participate in a follow-up study? If
so, please leave your name and phone number with the
study coordinator.
