Original article: http://www.readwriteweb.com/archives/web_30_when_web_sites_become_web_services.php
Translation: http://www.yeeyan.com/articles/view/zhouchengly/565
Written by Alex Iskold / March 19, 2007 12:11 PM
Today's Web has terabytes of information available to humans, but hidden from computers. It is a paradox that information is stuck inside HTML pages, formatted in esoteric ways that are difficult for machines to process. The so-called Web 3.0, which is likely to be a precursor of the real semantic web, is going to change this. What we mean by 'Web 3.0' is that major web sites are going to be transformed into web services - and will effectively expose their information to the world.
The transformation will happen in one of two ways. Some web sites will follow the example of Amazon, del.icio.us and Flickr and will offer their information via a REST API. Others will try to keep their information proprietary, but it will be opened via mashups created using services like Dapper, Teqlo and Yahoo! Pipes. The net effect will be that unstructured information will give way to structured information - paving the road to more intelligent computing. In this post we will look at how this important transformation is taking place already and how it is likely to evolve.
We have written here before about Amazon's visionary WebOS strategy. The Seattle web giant is reinventing itself by exposing its own infrastructure via a set of elegant APIs. One of the first web services opened up by Amazon was the E-Commerce service. This service opens access to the majority of items in Amazon's product catalog. The API is quite rich, allowing manipulation of users, wish lists and shopping carts. However, its essence is the ability to look up Amazon's products.
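To make this concrete, here is a minimal sketch of roughly what a REST lookup against the E-Commerce service looked like at the time. The endpoint and parameter names follow the ItemSearch operation as documented in that era; the access key is a placeholder.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# A sketch of a circa-2007 REST call to Amazon's E-Commerce service.
# The access key is a placeholder - real calls required a registered key.
params = urllib.parse.urlencode({
    "Service": "AWSECommerceService",
    "Operation": "ItemSearch",
    "SearchIndex": "Books",
    "Keywords": "semantic web",
    "AWSAccessKeyId": "YOUR-ACCESS-KEY",  # placeholder
})
url = "http://webservices.amazon.com/onca/xml?" + params

with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

# Each returned Item includes its Amazon URL (DetailPageURL), which is
# how the service drives traffic back to Amazon.
for element in tree.iter():
    if element.tag.endswith("DetailPageURL"):
        print(element.text)
```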
Why has Amazon offered this service completely free? Because most applications built on top of this service drive traffic back to Amazon (each item returned by the service contains the Amazon URL). In other words, with the E-Commerce service Amazon enabled others to build ways to access Amazon's inventory. As a result many companies have come up with creative ways of leveraging Amazon's information - you can read about these successes in one of our previous posts.
The web 2.0 poster child, del.icio.us, is also famous as one of the first companies to open a subset of its web site functionality via an API. Many services followed, giving rise to a true API culture. John Musser over at ProgrammableWeb has been tirelessly cataloging APIs and mashups that use them. This page shows almost 400 APIs organized by category, which is an impressive number. However, only a fraction of those APIs are opening up information - most focus on manipulating the service itself. This is an important distinction to understand in the context of this article.
The del.icio.us API offering today is different from Amazon's, because it does not open the del.icio.us database to the world. What it does do is allow authorized mashups to manipulate the user information stored in del.icio.us. For example, an application may add a post, or update a tag, programmatically. However, there is no way to ask del.icio.us, via the API, what URLs have been posted to it or what has been tagged with the tag web 2.0 across the entire del.icio.us database. These questions are easy to answer via the web site, but not via the current API.
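For illustration, here is a minimal sketch of what such an authorized call looked like against the v1 API's posts/add operation; the credentials are placeholders.

```python
import urllib.parse
import urllib.request

# The del.icio.us v1 API authenticated with HTTP Basic auth; the
# username and password here are placeholders.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "https://api.del.icio.us/",
                          "your-username", "your-password")
opener = urllib.request.build_opener(
    urllib.request.HTTPBasicAuthHandler(password_mgr))

params = urllib.parse.urlencode({
    "url": "http://www.readwriteweb.com/",
    "description": "Read/WriteWeb",
    "tags": "web2.0 blog",
})
# posts/add stores a bookmark for the authenticated user. Note the
# asymmetry described above: there is no corresponding call that queries
# the whole database, e.g. "every URL anyone has tagged web2.0".
opener.open("https://api.del.icio.us/v1/posts/add?" + params)
```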
Despite the fact that there is no direct API (into the database), many companies have managed to leverage the information stored in del.icio.us. Here are some examples...
Delexa is an interesting and useful mashup that uses del.icio.us to categorize Alexa sites. For example, it can show the popular sites tagged with the word book.
Another web site called similicio.us uses del.icio.us to recommend similar sites. For example, it can list the sites that it thinks are related to Read/WriteWeb.
So how do these services get around the fact that there is no API? The answer is that they leverage standardized URLs and a technique called Web scraping. Let's understand how this works. In del.icio.us, for example, all URLs that have the tag book can be found under the URL http://del.icio.us/tag/book; all URLs tagged with the tag movie are at http://del.icio.us/tag/movie; and so on. The structure of this URL is always the same: http://del.icio.us/tag/[TAG]. So given any tag, a computer program can fetch the page that contains the list of sites tagged with it. Once the page is fetched, the program can now perform the scraping - the extraction of the necessary information from the page.
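In code, the fetching half of this amounts to simple string templating; a minimal sketch:

```python
import urllib.request

# del.icio.us exposed one predictable URL per tag
# (http://del.icio.us/tag/[TAG]), so fetching the page for any tag is
# simple string templating.
def fetch_tag_page(tag):
    url = "http://del.icio.us/tag/" + tag
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

html = fetch_tag_page("book")  # raw HTML, ready to be scraped
```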
Web Scraping is essentially reverse engineering of HTML pages. It can also be thought of as parsing out chunks of information from a page. Web pages are coded in HTML, which uses a tree-like structure to represent the information. The actual data is mingled with layout and rendering information and is not readily available to a computer. Scrapers are the programs that "know" how to get the data back from a given HTML page. They work by learning the details of the particular markup and figuring out where the actual data is. For example, a scraper for del.icio.us extracts the bookmarked URLs from a tag page, as in the sketch below; by applying such a scraper, it is possible to discover what URLs are tagged with any given tag.
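Here is a toy scraper in that spirit, using Python's standard HTML parser. The heuristic it uses to spot bookmarks (external links) is an assumption for illustration - a real scraper would key off del.icio.us's actual markup:

```python
from html.parser import HTMLParser

# A toy scraper: walk the HTML and pull out the bookmarked URLs. The
# "external link" check below is a stand-in for learning the site's
# actual markup.
class TagPageScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        # On a tag page, links pointing away from del.icio.us itself
        # are the bookmarked sites - the data buried in the layout.
        if href.startswith("http") and "del.icio.us" not in href:
            self.urls.append(href)

sample = ('<ul><li><a href="http://example.com/">Example</a></li>'
          '<li><a href="http://del.icio.us/tag/book">book</a></li></ul>')
scraper = TagPageScraper()
scraper.feed(sample)
print(scraper.urls)  # ['http://example.com/']
```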
We recently covered Yahoo! Pipes, a new app from Yahoo! focused on remixing RSS feeds. Another similar technology, Teqlo, has recently launched. It focuses on letting people create mashups and widgets from web services and RSS. Before both of these, Dapper launched a generic scraping service for any web site. Dapper is an interesting technology that facilitates the scraping of web pages, using a visual interface.
It works by letting the developer define a few sample pages and then helping her denote similar information using a marker. This looks simple, but behind the scenes Dapper uses a non-trivial tree-matching algorithm to accomplish this task. Once the user defines similar pieces of information on the page, Dapper allows the user to make it into a field. By repeating the process with other information on the page, the developer is able to effectively define a query that turns an unstructured page into a set of structured records.
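Reduced to code, the output of such a query looks something like the sketch below. The markup pattern is hypothetical, standing in for whatever Dapper's tree-matching would learn from the developer's sample pages:

```python
import re

# A hypothetical pattern standing in for what the tree-matching
# algorithm learns from the sample pages.
ITEM = re.compile(
    r'<a class="item" href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>')

def page_to_records(html):
    # Each repeated field the developer marked becomes a column; each
    # match becomes one structured record.
    return [m.groupdict() for m in ITEM.finditer(html)]

sample = '<a class="item" href="http://example.com/a">A</a>'
print(page_to_records(sample))
# [{'url': 'http://example.com/a', 'title': 'A'}]
```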
The net effect of apps like Dapper and Teqlo is to turn ordinary web pages into sources of structured, remixable data.
So bringing together open APIs (like the Amazon E-Commerce service) and scraping/mashup technologies gives us a way to treat any web site as a web service that exposes its information. The information, or to be more exact the data, becomes open. In turn, this enables software to take advantage of this information collectively. With that, the Web truly becomes a database that can be queried and remixed.
Scraping technologies are actually fairly questionable. In a way, they can be perceived as stealing the information owned by a web site. The whole issue is complicated because it is unclear where copy/paste ends and scraping begins. It is okay for people to copy and save information from web pages, but it might not be legal to have software do this automatically. And scraping a page, then offering a service that leverages the information without crediting the original source, is unlikely to be legal.
But it does not seem that scraping is going to stop - just as legal issues with Napster did not stop people from writing peer-to-peer sharing software, and the more recent YouTube lawsuit is not likely to stop people from posting copyrighted videos. Information that seems to be free is perceived as being free.
The opportunities that will come after the web has been turned into a database are just too exciting to pass up. So if the conversion is going to take place anyway, would it not be better to rethink how to do it in a consistent way?
There are several good reasons why web sites (online retailers in particular) should think about offering an API. The most important reason is control. Having an API will make scrapers unnecessary, and it will also allow tracking of who is using the data - as well as how and why. Like Amazon, sites can do this in a way that fosters affiliates and drives traffic back to their sites.
The old perception is that closed data is a competitive advantage. The new reality is that open data is a competitive advantage. The likely solution then is to stop worrying about protecting information and instead start charging for it, by offering an API. Having a small fee per API call (think Amazon Web Services) is likely to be acceptable, since the cost for any given subscriber of the service is not going to be high. But there is a big opportunity to make money on volume. This is what Amazon is betting on with their Web Services strategy and it is probably a good bet.
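A sketch of what such pay-per-call metering might look like; the key registry, catalog, and price here are all hypothetical:

```python
# Everything here - the key registry, the catalog, the price - is
# hypothetical; the point is the shape of the model: authenticate,
# meter, serve, and bill on volume.
PRICE_PER_CALL = 0.0001  # a small fee per call; revenue comes from volume

registered_keys = {"demo-key"}             # hypothetical key registry
catalog = {"book": ["Item A", "Item B"]}   # hypothetical product data
usage = {}                                 # api_key -> billable calls

def handle_request(api_key, query):
    """Gate one API call: authenticate, meter it, then serve the data."""
    if api_key not in registered_keys:
        raise PermissionError("unknown API key")
    usage[api_key] = usage.get(api_key, 0) + 1  # track who, and how much
    return catalog.get(query, [])

def monthly_bill(api_key):
    return usage.get(api_key, 0) * PRICE_PER_CALL

handle_request("demo-key", "book")
print(monthly_bill("demo-key"))  # 0.0001
```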
As more and more of the Web becomes remixable, the entire system is turning into both a platform and a database. Yet such transformations are never smooth. For one, scalability is a big issue. And of course the legal aspects are never simple.
But it is not a question of if web sites become web services, but when and how. APIs are a more controlled, cleaner and altogether preferred way of becoming a web service. However, when APIs are not available or sufficient, scraping is bound to continue and expand. As always, time will be the best judge; but in the meanwhile we turn to you for feedback and stories about how your businesses are preparing for 'web 3.0'.