week3-Designing Studies You Can Learn From

In this lecture, we’re going to talk about trying out your interface with people, and doing so in a way that lets you improve your designs based on what you learn. One of the most common things that people ask when running studies is: “Do you like my interface?” It’s a really natural thing to ask, because on some level it’s what we all want to know. But it’s problematic on a whole lot of levels.

For one, it’s not very specific. Sometimes people try to make it better by asking something like: “How much do you like my interface, on a one-to-five scale?” Or: “‘This is a useful interface.’ Agree or disagree, on a one-to-five scale.” This adds a patina of scientificness, but really it’s just the same thing: you’re asking somebody “Do you like my interface?” And people are nice, so they’re going to say “Sure, I like your interface.” This is the “please the experimenter” bias. It can be especially strong when there are social, cultural, or power differences between the experimenter and the people that you’re trying out your interface with. For example, [inaudible] and colleagues showed this effect in India, where it was exacerbated when the experimenter was white.

Now, you should not take this to mean that you shouldn’t have your developers try out stuff with users. Being both the developer and the person who is trying stuff out is incredibly valuable. One example of this that I like a lot is Mike Krieger, one of the Instagram founders, who is also a former master’s student and TA of mine. When Mike left Stanford and joined Silicon Valley, every Friday afternoon he would bring people into his office and have them try out whatever they were working on that week.

That way they were able to get regular feedback each week, and the people who were building those systems got to see real people trying them out. This can be nails-on-a-chalkboard painful, but you’ll also learn a ton. So how do we get beyond “Do you like my interface?” The basic strategy that we’re going to talk about today is to use specific measures and concrete questions to deliver meaningful results. One of the problems with “Do you like my interface?” is “Compared to what?”

I think one of the reasons people say “Yeah, sure” is that there’s no comparison point. So when you’re measuring the effectiveness of your interface, even informally, it’s really nice to have some kind of comparison. It’s also important to think about the yardstick: What constitutes “good” in this arena? What are the measures that you’re going to use? So how can we get beyond “Do you like my interface?” One way to start is by asking a base rate question, like “What fraction of people click on the first link in a search results page?” or “What fraction of students come to class?” Once we start to measure correlations, things get even more interesting, like, “Is there a relationship between the time of day a class is offered and how many students attend it?”
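
A minimal Python sketch of these two kinds of questions, a base rate and a correlation. The function names and all of the numbers are illustrative assumptions, invented for the example rather than taken from any study:

```python
from statistics import mean


def base_rate(outcomes):
    """Fraction of observations where the event happened (True)."""
    return sum(outcomes) / len(outcomes)


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / (spread_x * spread_y) ** 0.5


# Hypothetical observations, invented purely for illustration.
clicked_first_link = [True, True, False, True]   # one entry per search
class_start_hour = [8, 10, 13, 15]               # one entry per class
students_attending = [22, 30, 34, 37]

print("base rate:", base_rate(clicked_first_link))
print("correlation:", pearson_r(class_start_hour, students_attending))
```

A correlation like this only says that the two quantities move together; it does not say why.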

Or “Is there a relationship between the order of a search result and the clickthrough rate?” For both students and clickthrough, there can be multiple explanations. For example, if fewer students attend early morning classes, is that a function of when students want to show up, or is that a function of when good professors want to teach? With the clickthrough example, there are also two kinds of explanations. If lower-placed links yield fewer clicks, is that because the links are of intrinsically poorer quality, or is it because people just click on the first link and don’t bother getting to the second one, even if it might be better? To isolate the effect of placement and identify it as playing a causal role, you’d need to manipulate it as a variable by, say, randomizing the order of search results.
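
The randomization idea can be sketched in a few lines of Python. This is a simplified illustration under assumed names (`randomized_order`, `clickthrough_by_position`) and an assumed log format, not a description of how any real search engine runs such a test:

```python
import random
from collections import defaultdict


def randomized_order(results, rng):
    """Return a shuffled copy, so placement no longer tracks link quality."""
    shuffled = list(results)
    rng.shuffle(shuffled)
    return shuffled


def clickthrough_by_position(impressions):
    """Compute the clickthrough rate for each position on the page.

    impressions: list of (shown_order, clicked_link) pairs, where
    clicked_link is None if the user clicked nothing.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for shown_order, clicked_link in impressions:
        for position, link in enumerate(shown_order, start=1):
            shown[position] += 1
            if link == clicked_link:
                clicked[position] += 1
    return {pos: clicked[pos] / shown[pos] for pos in sorted(shown)}


# Hypothetical usage: show each user a freshly shuffled result list.
rng = random.Random(42)  # fixed seed only to make the example repeatable
page = randomized_order(["link A", "link B", "link C"], rng)
print(page)
```

Because every link is equally likely to land in every position, a remaining difference in clickthrough between positions can be attributed to placement rather than to the quality of the links.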

As we start to talk about these experiments, let’s introduce a few terms that are going to help us. The different conditions that we try are the things we are manipulating: for example, the time of a class, or the location of a particular link on a search results page. These manipulations are independent variables, because they are independent of what the user does; they are in the control of the experimenter. Then we measure what the user does, and those measures are called dependent variables, because they depend on what the user does.

Common measures in HCI include things like task completion time: How long does it take somebody to complete a task (for example, find something they want to buy, create a new account, order an item)? Accuracy: How many mistakes did people make, and were those fatal errors or things that they were able to quickly recover from? Recall: How much does a person remember afterward, or after periods of non-use? And emotional response: How does the person feel about the tasks being completed? Were they confident? Were they stressed? Would the user recommend this system to a friend? So, your independent variables are the things that you manipulate, and your dependent variables are the things that you measure.
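
One way to keep the two kinds of variables apart is to record them in separate, clearly labeled fields. The sketch below assumes a hypothetical `Trial` record with one independent variable (the condition) and two of the dependent variables mentioned above; the field names and the sample values are invented for illustration:

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    participant: str
    condition: str   # independent variable: chosen by the experimenter
    seconds: float   # dependent variable: task completion time
    errors: int      # dependent variable: accuracy


def summarize(trials):
    """Average each dependent variable separately for each condition."""
    by_condition = {}
    for trial in trials:
        by_condition.setdefault(trial.condition, []).append(trial)
    return {
        condition: {
            "mean_seconds": mean(t.seconds for t in group),
            "mean_errors": mean(t.errors for t in group),
        }
        for condition, group in by_condition.items()
    }


# Hypothetical trials, invented purely for illustration.
trials = [
    Trial("P1", "design A", 10.0, 1),
    Trial("P2", "design A", 20.0, 3),
    Trial("P3", "design B", 12.0, 0),
]
print(summarize(trials))
```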

How reliable is your experiment? If you ran it again, would you see the same results? That’s the internal validity of an experiment. So, to have a precise experiment, you need to remove the confounding factors. It’s also important to study enough people that the result is unlikely to have occurred by chance. You may be able to run the same study over and over and get the same results, but it may not matter in any real-world sense. The external validity is the generalizability of your results: Does this apply only to eighteen-year-olds in a college classroom, or does it apply to everybody in the world? Let’s bring this back to HCI and talk about one of the problems you’re likely to face as a designer. One of the things that we commonly want to do is ask something like “Is my cool new approach better than the industry standard?” Because after all, that’s why you’re making the new thing.
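
One way to check whether a difference between two groups is “unlikely to have been by chance” is a permutation test: shuffle the group labels many times and count how often a difference at least as large shows up by accident. The lecture does not prescribe this test; it is one reasonable option, sketched here with an assumed function name and invented completion times:

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate how often random relabeling gives a mean difference
    at least as large as the one actually observed."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        difference = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if difference >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


# Hypothetical task completion times in seconds, invented for illustration.
times_standard = [41.0, 38.5, 44.2, 40.1, 39.7]
times_new = [35.2, 36.8, 33.9, 37.5, 34.4]
print("estimated p-value:", permutation_p_value(times_standard, times_new))
```

A small value means the observed difference rarely arises from shuffling alone. With only a handful of participants, even a real effect can fail this check, which is one reason to study enough people.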

Now, one of the challenges with this, especially early on in the design process, is that you may have something which is very much in its prototype stages, while the industry standard is likely to benefit from years and years of refinement. At the same time, it may be stuck with years and years of cruft, which may or may not be intrinsic to its approach. So if you compare your cool new tool to some industry standard, there are two things varying here. One is the fidelity of the implementation, and the other, of course, is the approach. Consequently, when you get the results, you can’t know whether to attribute them to fidelity, to approach, or to some combination of the two. So we’re going to talk about ways of teasing apart those different causal factors. Now, one thing I should say right off the bat is that there are times when it may be more or less relevant whether you have a good handle on what the causal factors are.
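
The confound described here can be made concrete by writing each condition down as a set of factors and listing the ones that differ. This is a small illustrative sketch; the helper name `differing_factors` and the factor labels are assumptions made for the example:

```python
def differing_factors(condition_a, condition_b):
    """Return the names of the factors on which two conditions differ."""
    names = set(condition_a) | set(condition_b)
    return sorted(n for n in names if condition_a.get(n) != condition_b.get(n))


# Hypothetical description of the comparison discussed above.
new_tool = {"approach": "new", "fidelity": "prototype"}
industry_standard = {"approach": "standard", "fidelity": "production"}

varying = differing_factors(new_tool, industry_standard)
if len(varying) > 1:
    # More than one factor changes at once, so a result cannot be
    # attributed to any single one of them.
    print("Confounded comparison; these factors vary together:", varying)
```

If the list contains exactly one factor, a difference in the results can be attributed to that factor.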

So for example, if you’re trying to decide between two different digital cameras, at the end of the day, maybe all you care about is image quality or usability or some other factor, and exactly what makes that image quality better or worse may be less relevant to you. If you don’t have control over the variables, then identifying cause may not always be what you want. But when you are a designer, you do have control over the variables, and that’s when it is really important to ascertain cause. Here’s an example of a study that came out right when the iPhone was released, done by the research firm User Centric, and I’m going to read from this news article here. Research firm User Centric has released a study that tries to gauge how effective the iPhone’s unusual onscreen keyboard is. The goal is certainly a noble one, but I cannot say the survey’s approach results in data that makes much sense.

User Centric brought in twenty owners of other phones. Half had qwerty keyboards; half had ordinary numeric phones with keypads. None were familiar with the iPhone. The research involved having the test subjects enter six sample text messages with the phones that they already had, and six with the iPhone. The end result was that the iPhone newbies took twice as long to enter text with an iPhone as they did with their own phones, and made lots more typos. So let’s critique this study and talk about its benefits and drawbacks. Here’s the webpage directly from User Centric. What’s our manipulation in this study? The manipulation is the input style. How about the measure? It’s words per minute. And there’s absolutely value in being able to measure the initial usability of the iPhone.

That’s valuable for several reasons; one is that if you’re introducing new technology, it’s beneficial if people are able to get up to speed pretty quickly. However, it’s important to realize that this comparison is intrinsically unfair, because the users of the previous cell phones were experts at that input modality, and the people using the iPhone are novices in that modality. It seems quite likely that iPhone users, once they become actual users, are going to get better over time. So if you’re not used to something the first time you try it, that may not be a deal killer, and it’s certainly not an apples-to-apples comparison. Another thing that we don’t get out of this article is “Is this difference significant?” We read that each person typed six messages in each of two conditions: their own device and then the iPhone, or vice versa.
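
The “or vice versa” suggests that the order of the two conditions varied between participants. A minimal sketch of such counterbalancing, assuming participants are shuffled and then alternated between the two orders; the function name, the condition labels, and the participant IDs are illustrative assumptions, not details from the User Centric study:

```python
import random


def counterbalanced_orders(participants, conditions=("own phone", "iPhone"), seed=0):
    """Assign half the participants each order of the two conditions."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)  # so the order does not depend on sign-up sequence
    forward = tuple(conditions)
    backward = tuple(reversed(conditions))
    return {
        person: (forward if index % 2 == 0 else backward)
        for index, person in enumerate(shuffled)
    }


# Hypothetical participant IDs for a twenty-person study.
orders = counterbalanced_orders([f"P{i}" for i in range(1, 21)])
for person, order in sorted(orders.items()):
    print(person, "->", " then ".join(order))
```

Splitting the orders evenly keeps practice or fatigue from systematically favoring whichever device is used second.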

Six messages each, and the people typing with the iPhone were half as fast as when they typed with a mini qwerty on the device that they were accustomed to. So while this may tell us something about the initial usability of the iPhone, in terms of long-term usability, I don’t think we get so much out of it. If you weren’t satisfied by that initial data, you’re in good company: neither were the authors of that study. So they went back a month later and ran another study, where they brought 40 new people into the lab who were either iPhone users, qwerty users, or nine-key users. Now it’s more of an apples-to-apples comparison, in that they are testing people who are relatively expert in these three different modalities. After about a month on the iPhone, you’re probably starting to asymptote in terms of your performance.

Definitely it gets better over time, even past a month; but a month starts to get more reasonable. What they found was that iPhone users and qwerty users were about the same in terms of speed, and that the numeric keypad users were much slower. So once again our manipulation is input style and we measure speed. This time we also measure error rate. And what we see is that iPhone users and qwerty users are essentially the same speed; however, the iPhone users make many more errors. Now, one thing I should point out about the study is that each of the different devices was used by a different group of people. It was done this way so that each device was used by somebody who was comfortable and experienced with that device. And so we remove the worry that you have newbies working on these devices.

However, especially in 2007, there may have been significant differences in who the people using each device were. Who were the early adopters of the 2007 iPhone? Maybe business users were particularly drawn to the qwerty devices, and maybe people who had better things to do with their time than send e-mail on their telephone were using the nine-key devices. So while this comparison is better than the previous one, the potential for variation between the user populations is still problematic. If what you’d like to claim is something about the intrinsic properties of the device, the result may at least in part have to do with the users. So, what are some strategies for fairer comparison? To brainstorm a couple of options: one thing that you can do is insert your approach into your production setting. This may seem like a lot of work, and sometimes it is, but in the age of the web this is a lot easier than it used to be.

And it’s possible even if you don’t have access to the server of the service that you’re comparing against. You can use things like a proxy server or client-side scripting to put your own technique in and have an apples-to-apples comparison. A second strategy for neutralizing the environment difference between a production version and your new approach is to make a version of the production thing in the same style as your new approach. That also makes them equivalent in terms of their implementation fidelity. A third strategy, and one that’s used commonly in research, is to scale things down so you’re looking at just a piece of the system at a particular point in time. That way you don’t have to worry about implementing a whole big, giant thing. You can just focus on one small piece and have that comparison be fair.

And the fourth strategy is that when expertise is relevant, train people up: give them the practice that they need, so that they can at least start hitting that asymptote in terms of performance, and you can get a better read than you would from newbies. So now, to close out this lecture: if somebody asks you the question “Is interface x better than interface y?” you know that we’re off to a good start, because we have a comparison. However, you also know to be worried: What does “better” mean? Often, in a complex system, you’re going to have several measures. That’s totally cool. There’s a lot of value, though, in being explicit about what it is you mean by better: What are you trying to accomplish? What are you trying to [im]prove? And if anybody ever tells you that their interface is always better, don’t believe them, because nearly all of the time the answer is going to be “it depends.”

And the interesting question is “What does it depend on?” Most interfaces are good for some things and not for others. For example, if you have a tablet computer where all of the screen is devoted to display, that is going to be great for reading, for web browsing, for looking at pictures, that kind of activity. Not so good if you want to type a novel. So here, we’ve introduced controlled comparison as a way of finding the smoking gun, as a way of inferring cause. And often, when you have only two conditions, we’re going to talk about that as being a minimal pairs design. As a practicing designer, the reason to care about what’s causal is that it gives you the material to make a better decision going forward. A lot of studies violate this constraint, and that gets dangerous because it prevents you from being able to make sound decisions. I hope that the tools that we’ve talked about today and in the next several lectures will help you become a wise skeptic like our friend in this XKCD comic. I’ll see you next time.
