Original article: http://www.nytimes.com/2009/08/06/technology/06stats.html
[Chinese translation]
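Carrie Grimes majored in anthropology and archaeology at Harvard and did fieldwork in places like the rain forests of Honduras, where she studied Mayan settlement patterns by mapping where artifacts were unearthed. "Most people's picture of archaeology comes from Indiana Jones adventures in the movies, but in reality most of what you do is data analysis," she said. She found herself drawn to what she calls "all the computer and math stuff" in the work. Today Ms. Grimes does a different kind of digging. She is a statistical analyst at Google, where she faces mounds of data every day and applies statistical methods to find ways to improve the company's search engine.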
The sexy job of the next decade will be the statistician
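Ms. Grimes is an Internet-age statistician, and like many of her talented peers she is helping to overturn the stereotype of statisticians as number nerds who spend all day at a computer and never leave the house. They are finding that their skills are not only needed but increasingly in demand.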
"I keep saying that the sexy job in the next 10 years will be statisticians, and I'm not kidding," said Hal Varian, Google's chief economist.
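The recent explosion of information has driven growing demand for statisticians. A newly graduated statistics Ph.D. can earn $125,000 (about NT$4 million) in the first year at a top American company. Across information and Web technology, new kinds of data are waiting to be studied and analyzed: sensor signals, surveillance tapes, social network chatter, public records and more. According to a projection by IDC, a market research firm, the surge of digital data will only accelerate, growing fivefold by 2012.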
Analyzing data is the core value of statistics
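Yet piles of data are not the same as useful knowledge. "We're rapidly entering a world where everything can be monitored and measured," said Erik Brynjolfsson, an economist and director of the Massachusetts Institute of Technology's Center for Digital Business. "But the big problem is going to be the ability of humans to use, analyze and make sense of the data."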
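A new breed of statisticians has emerged to meet this challenge. They use powerful computers and sophisticated mathematical models to work through vast amounts of data, hunting for meaningful patterns and valuable insights. The applications are varied: improving Internet search, making online advertising more effective, screening gene sequencing data for cancer research, and optimizing the handling of food shipments.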
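Netflix, the largest DVD rental site in the United States, recently concluded a contest that offered $1 million to anyone who could significantly improve its movie recommendation system. The contest was a battle fought with the weapons of modern statistics.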
More experts are embracing statistics
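Although they are at the forefront of this wave of digital data, formally trained statisticians are only a small share of the people riding it. Experts say computing and numerical skills matter far more than degrees, so specialists from many backgrounds, including economics, computer science and mathematics, are adopting the latest statistical techniques for data analysis.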
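These experts are welcome in every field today, and the White House is no exception. "Robust, unbiased data are the first step toward addressing our long-term economic needs and key policy priorities," Peter R. Orszag, director of the Office of Management and Budget, said in a speech in May this year. Later that day he wrote in a blog entry that the importance of statistics was a subject "near to my (admittedly wonkish) heart."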
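I.B.M. also saw the business opportunity in data-hunting services and created its Business Analytics and Optimization Services group in April this year. The group draws on more than 200 mathematicians, statisticians and other data analysts in I.B.M.'s research labs, but that number falls well short of what the company needs. I.B.M. plans to retrain or hire 4,000 more analysts across the company.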
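There are other signs of the field's rise. According to the American Statistical Association, about 6,400 people are attending the statistics profession's annual conference in Washington this week, roughly a thousand more than the 5,400 or so of recent years. The attendees, men and women, young and old, looked much like any other tourists in the capital, but their rapt conversations were about randomization, parameters, regressions and data clusters. The data surge is raising the standing of a profession that traditionally handled less visible and less lucrative work, such as calculating life expectancy rates for insurance companies.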
The Web gives statistics more room to work
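To return to Ms. Grimes: now 32, she received her doctorate in statistics from Stanford in 2003 and joined Google later that year. She now belongs to a team of 250 data analysts, where she uses statistical models to improve the company's search engine.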
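For example, Ms. Grimes worked on tuning the algorithm behind the search engine's crawler, the software that roams the Internet and keeps the search index up to date. Her job was to find a model that makes the crawler visit frequently updated pages more often than pages with static content.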
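The ultimate goal, Ms. Grimes explained, is to improve the efficiency of computing and network use. Even an improvement of a percent or two is huge overall, because Google repeats these operations millions and billions of times and the gains add up.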
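The article does not describe Google's actual model, so the following is only a minimal sketch of the general idea: estimate how often each page changes, then spend more of a fixed crawl budget on the pages that change most. The `Page` record, the add-one smoothing, the proportional allocation rule and the example URLs are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical crawl history for one page (not a real crawler record)."""
    url: str
    visits: int   # times the crawler has fetched the page
    changes: int  # fetches on which the content had changed


def change_rate(page: Page) -> float:
    """Smoothed fraction of visits that found new content (add-one smoothing)."""
    return (page.changes + 1) / (page.visits + 2)


def allocate_visits(pages: list[Page], budget: int) -> dict[str, int]:
    """Split a crawl budget in proportion to each page's estimated change rate.

    Each share is rounded independently, so the total can differ slightly
    from the budget.
    """
    rates = {p.url: change_rate(p) for p in pages}
    total = sum(rates.values())
    return {url: round(budget * rate / total) for url, rate in rates.items()}


# Made-up histories: a busy front page, an occasional blog, a static manual.
pages = [
    Page("news.example/front", visits=98, changes=89),
    Page("blog.example/post", visits=98, changes=29),
    Page("docs.example/manual", visits=98, changes=1),
]
plan = allocate_visits(pages, budget=100)
print(plan)
```

Under these assumptions the frequently updated page receives most of the visits and the static page only a few, which is the behavior the paragraph above describes.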
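The enormous data sets on the Web have also opened a new world for data mining. Traditionally, social scientists studied human behavior by interviewing or surveying people. "But the Web provides this amazing resource for observing how millions of people interact," said Jon Kleinberg, a computer scientist and social networking researcher at Cornell University.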
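In a study just published, Mr. Kleinberg and two colleagues followed the flow of ideas across the Web. Using algorithms that scanned for phrases associated with news topics, they tracked 1.6 million news sites and blogs during the 2008 U.S. presidential campaign. They found that traditional media typically led the blogs by 2.5 hours, but that a handful of blogs were the quickest to pick up quotes that later gained wide attention.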
Sound statistical analysis yields more with less effort
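Although the Web holds a wealth of data, experts warn that it carries risks. Its sheer volume can easily overwhelm existing statistical models. Statisticians also caution that data that appear strongly correlated do not necessarily have a cause-and-effect relationship.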
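For example, in the late 1940s, before there was a polio vaccine, public health experts in America observed that polio cases rose in step with the consumption of soft drinks and ice cream, according to David Alan Grier, a historian and statistician at George Washington University. Cutting out such treats was even recommended as part of an anti-polio diet. In fact, polio outbreaks were most common in the hot summer months, when people naturally ate more ice cream, so the data showed only an association and not a cause, Mr. Grier said.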
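The polio example can be reproduced with invented numbers. In the sketch below, a hidden variable (temperature) drives both ice cream sales and case counts, which never influence each other. All values and coefficients are synthetic assumptions, not historical data. The raw correlation comes out strong, and it should largely vanish once the effect of temperature is removed from both series.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data: temperature is the common cause of both series.
temperature = [rng.uniform(0, 35) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
polio = [0.5 * t + rng.gauss(0, 3) for t in temperature]

raw = pearson(ice_cream, polio)
adjusted = pearson(residuals(ice_cream, temperature),
                   residuals(polio, temperature))
print(f"raw correlation:      {raw:.2f}")
print(f"after removing temp.: {adjusted:.2f}")
```

Correlating the residuals is a simple form of partial correlation. It only helps when the confounding variable has been identified and measured, which is exactly what the experts of the 1940s lacked.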
The data explosion not only magnifies longstanding issues in statistics; it also opens up new frontiers.
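"The key is to let computers do what they are good at, which is trawling these massive data sets for something that is mathematically odd," said Daniel Gruhl, an I.B.M. researcher whose work includes mining medical data. "And that makes it easier for humans to do what they are good at, which is to explain those anomalies."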
[English original]
MOUNTAIN VIEW, Calif. — At Harvard, Carrie Grimes majored in anthropology and archaeology and ventured to places like Honduras, where she studied Mayan settlement patterns by mapping where artifacts were found. But she was drawn to what she calls “all the computer and math stuff” that was part of the job.
“People think of field archaeology as Indiana Jones, but much of what you really do is data analysis,” she said.
Now Ms. Grimes does a different kind of digging. She works at Google, where she uses statistical analysis of mounds of data to come up with ways to improve its search engine.
Ms. Grimes is an Internet-age statistician, one of many who are changing the image of the profession as a place for dronish number nerds. They are finding themselves increasingly in demand — and even cool.
“I keep saying that the sexy job in the next 10 years will be statisticians,” said Hal Varian, chief economist at Google. “And I’m not kidding.”
The rising stature of statisticians, who can earn $125,000 at top companies in their first year after getting a doctorate, is a byproduct of the recent explosion of digital data. In field after field, computing and the Web are creating new realms of data to explore — sensor signals, surveillance tapes, social network chatter, public records and more. And the digital data surge only promises to accelerate, rising fivefold by 2012, according to a projection by IDC, a research firm.
Yet data is merely the raw material of knowledge. “We’re rapidly entering a world where everything can be monitored and measured,” said Erik Brynjolfsson, an economist and director of the Massachusetts Institute of Technology’s Center for Digital Business. “But the big problem is going to be the ability of humans to use, analyze and make sense of the data.”
The new breed of statisticians tackle that problem. They use powerful computers and sophisticated mathematical models to hunt for meaningful patterns and insights in vast troves of data. The applications are as diverse as improving Internet search and online advertising, culling gene sequencing information for cancer research and analyzing sensor and location data to optimize the handling of food shipments.
Even the recently ended Netflix contest, which offered $1 million to anyone who could significantly improve the company’s movie recommendation system, was a battle waged with the weapons of modern statistics.
Though at the fore, statisticians are only a small part of an army of experts using modern statistical techniques for data analysis. Computing and numerical skills, experts say, matter far more than degrees. So the new data sleuths come from backgrounds like economics, computer science and mathematics.
They are certainly welcomed in the White House these days. “Robust, unbiased data are the first step toward addressing our long-term economic needs and key policy priorities,” Peter R. Orszag, director of the Office of Management and Budget, declared in a speech in May. Later that day, Mr. Orszag confessed in a blog entry that his talk on the importance of statistics was a subject “near to my (admittedly wonkish) heart.”
I.B.M., seeing an opportunity in data-hunting services, created a Business Analytics and Optimization Services group in April. The unit will tap the expertise of the more than 200 mathematicians, statisticians and other data analysts in its research labs — but that number is not enough. I.B.M. plans to retrain or hire 4,000 more analysts across the company.
In another sign of the growing interest in the field, an estimated 6,400 people are attending the statistics profession’s annual conference in Washington this week, up from around 5,400 in recent years, according to the American Statistical Association. The attendees, men and women, young and graying, looked much like any other crowd of tourists in the nation’s capital. But their rapt exchanges were filled with talk of randomization, parameters, regressions and data clusters. The data surge is elevating a profession that traditionally tackled less visible and less lucrative work, like figuring out life expectancy rates for insurance companies.
Ms. Grimes, 32, got her doctorate in statistics from Stanford in 2003 and joined Google later that year. She is now one of many statisticians in a group of 250 data analysts. She uses statistical modeling to help improve the company’s search technology.
For example, Ms. Grimes worked on an algorithm to fine-tune Google’s crawler software, which roams the Web to constantly update its search index. The model increased the chances that the crawler would scan frequently updated Web pages and make fewer trips to more static ones.
The goal, Ms. Grimes explained, is to make tiny gains in the efficiency of computer and network use. “Even an improvement of a percent or two can be huge, when you do things over the millions and billions of times we do things at Google,” she said.
It is the size of the data sets on the Web that opens new worlds of discovery. Traditionally, social sciences tracked people’s behavior by interviewing or surveying them. “But the Web provides this amazing resource for observing how millions of people interact,” said Jon Kleinberg, a computer scientist and social networking researcher at Cornell.
For example, in research just published, Mr. Kleinberg and two colleagues followed the flow of ideas across cyberspace. They tracked 1.6 million news sites and blogs during the 2008 presidential campaign, using algorithms that scanned for phrases associated with news topics like “lipstick on a pig.”
The Cornell researchers found that, generally, the traditional media leads and the blogs follow, typically by 2.5 hours. But a handful of blogs were quickest to quotes that later gained wide attention.
The rich lode of Web data, experts warn, has its perils. Its sheer volume can easily overwhelm statistical models. Statisticians also caution that strong correlations of data do not necessarily prove a cause-and-effect link.
For example, in the late 1940s, before there was a polio vaccine, public health experts in America noted that polio cases increased in step with the consumption of ice cream and soft drinks, according to David Alan Grier, a historian and statistician at George Washington University. Eliminating such treats was even recommended as part of an anti-polio diet. It turned out that polio outbreaks were most common in the hot months of summer, when people naturally ate more ice cream, showing only an association, Mr. Grier said.
If the data explosion magnifies longstanding issues in statistics, it also opens up new frontiers.
“The key is to let computers do what they are good at, which is trawling these massive data sets for something that is mathematically odd,” said Daniel Gruhl, an I.B.M. researcher whose recent work includes mining medical data to improve treatment. “And that makes it easier for humans to do what they are good at — explain those anomalies.”
Andrea Fuller contributed reporting.