I came across Doug Cutting while reading up on Hadoop. The Hadoop team was formed as early as 2006, which of course only happened because Doug Cutting had joined Yahoo!. A quick web search told me he is also the creator of the well-known full-text search engine Lucene and of Nutch.
Below is the original text of the Doug Cutting interview; a translation is easy to find online.
Doug Cutting is the primary developer of the Lucene and Nutch open source search projects. He has worked in the search technology field for nearly two decades, including five years at Xerox PARC, three years at Apple, and four years at Excite.
What do you do for a living, and how are you involved in search engine development?
I work from home on Lucene & Nutch, two open source search projects. I get paid by various contracts related to these projects. I have an ongoing relationship with Yahoo! Labs that funds me to work part-time on Nutch. I take other short-term contract work related to both projects.
Could you briefly tell us about Nutch, and where you are trying to take it?
First, let me say what Lucene is, to provide context. Lucene is a software library for full-text search. It's not an application, but rather technology that can be incorporated in applications. It's an Apache project and is widely used. A small subset of folks using Lucene are listed at wiki.apache.org/jakarta-lucene/PoweredBy.
Nutch builds on Lucene to implement web search. Nutch is an application: you can download it and run it. It adds a crawler and other web-specific stuff to Lucene. Nutch aims to scale from simple intranet searching to search of the entire web, like Google and Yahoo!. To rival these guys you need a lot of tricks. We've demoed it on over 100M pages, and it's designed to scale to over 1B pages. But it also works well on a single machine, searching just a few servers.
From your perspective, what are the core principles of search engine architecture? What are the main things to consider and the big modules search engine software can be broken up into?
Let's see, the major bits are:
fetching – downloading lists of pages that have been referenced.
database – keeping track of what pages you've fetched, when you fetched them, what they've linked to, etc.
link analysis – analyzing the database to assign a priori scores to pages (e.g., PageRank & WebRank) and to prioritize fetching. The value of this is somewhat overrated. Indexing anchor text is probably more important (that's what makes, e.g., Google Bombing so effective).
indexing – combines content from the fetcher, incoming links from the database, and link analysis scores into a data structure that's quickly searchable.
searching – ranks pages against a query using an index.
To scale to billions of pages, all of these must be distributable, i.e., each must be able to run in parallel on multiple machines.
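To make the indexing and searching pieces listed above a bit more concrete, here is a minimal toy sketch in Java of an in-memory inverted index. It is not Lucene or Nutch code, just an illustration of the idea: indexing maps each term to the documents containing it, and searching merges those postings with a trivial term-frequency ranking; real engines persist postings on disk and distribute them across machines.

    import java.util.*;

    // Toy "indexing" and "searching" modules: an in-memory inverted index that
    // maps each term to the documents containing it, with a trivial
    // term-frequency ranking. Real systems (Lucene, Nutch) keep postings on
    // disk and add many refinements.
    public class TinyInvertedIndex {
        private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        private final List<String> docs = new ArrayList<>();

        // "Indexing": tokenize a document and record term -> (docId, frequency).
        public int add(String text) {
            int docId = docs.size();
            docs.add(text);
            for (String term : text.toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                postings.computeIfAbsent(term, t -> new HashMap<>())
                        .merge(docId, 1, Integer::sum);
            }
            return docId;
        }

        // "Searching": score every document containing a query term by summing
        // term frequencies, then return document ids best-first.
        public List<Integer> search(String query) {
            Map<Integer, Integer> scores = new HashMap<>();
            for (String term : query.toLowerCase().split("\\W+")) {
                Map<Integer, Integer> hits = postings.get(term);
                if (hits == null) continue;
                hits.forEach((doc, tf) -> scores.merge(doc, tf, Integer::sum));
            }
            List<Integer> ranked = new ArrayList<>(scores.keySet());
            ranked.sort((a, b) -> scores.get(b) - scores.get(a));
            return ranked;
        }

        public static void main(String[] args) {
            TinyInvertedIndex idx = new TinyInvertedIndex();
            idx.add("open source web search with nutch");
            idx.add("lucene is a library for full-text search");
            System.out.println(idx.search("full-text search library")); // [1, 0]
        }
    }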
You are saying people can download Nutch to run on their machines. Is there a possibility for small-time webmasters who don't have full control over their Apache servers to make use of Nutch?
Unfortunately, most of them probably won't. Nutch requires a Java servlet container, which some ISPs support, but most do not.
Can I combine Lucene and the Google Web API, or Lucene and some other application I wrote?
A couple of folks have contributed Google-like APIs for Nutch, but none has yet made it into the system. We should have something like this soon, however.
What do you think is the biggest hurdle to overcome when implementing a search engine – is it the hardware and storage barrier, or the ranking algorithms? Also, how much space do you need to assure the search engine makes some sense, say, by writing an engine restricted to search a million RSS feeds?
Nutch requires a total of around 10 KB per web page. RSS feeds tend to point to small pages, so you'd probably do better than that. Nutch doesn't yet have specific support for RSS.
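A quick back-of-the-envelope check of what that figure implies for the question above, assuming (my assumption, purely for illustration) one indexed page per feed:

    // Rough sizing using the ~10 KB/page figure from the answer above.
    // The one-page-per-feed assumption is only for illustration.
    public class SizeEstimate {
        public static void main(String[] args) {
            long pages = 1_000_000L;     // one million RSS feeds, one page each
            long bytesPerPage = 10_000L; // ~10 KB per page, per the interview
            double gb = pages * bytesPerPage / 1e9;
            System.out.printf("~%.0f GB total%n", gb); // prints "~10 GB total"
        }
    }

So a million feeds lands in the tens-of-gigabytes range at most, and likely less given how small feed items tend to be.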
Is it easy to get funded by Yahoo! Labs? Who can apply, and what do you need to give back in return?
I was invited, I didn't apply. I have no idea what the process is.
Did Google Inc. ever show interest in Nutch?
I've talked with folks there, including Larry Page. They'd like to help, but they can't see a way to do that without also helping their competitors.
In Nutch, do you implement your own PageRank or WebRank system? What considerations go into ranking?
Yes, Nutch has a link analysis module. Use of it is optional. For intranet search we find that it's really not needed.
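For readers who haven't seen link analysis before, the sketch below shows the basic power-iteration form of PageRank over a tiny hard-coded link graph. It is a generic textbook illustration, not Nutch's actual link analysis module.

    import java.util.*;

    // Generic power-iteration PageRank over a small link graph, to show what an
    // optional link-analysis module computes. Not Nutch's implementation.
    public class PageRankSketch {
        static double[] pageRank(int[][] outLinks, int iterations, double damping) {
            int n = outLinks.length;
            double[] rank = new double[n];
            Arrays.fill(rank, 1.0 / n);                 // start from a uniform score
            for (int it = 0; it < iterations; it++) {
                double[] next = new double[n];
                Arrays.fill(next, (1.0 - damping) / n); // the "random jump" term
                for (int page = 0; page < n; page++) {
                    if (outLinks[page].length == 0) continue; // skip dangling pages, for simplicity
                    double share = damping * rank[page] / outLinks[page].length;
                    for (int target : outLinks[page]) {
                        next[target] += share;          // a page passes its score to the pages it links to
                    }
                }
                rank = next;
            }
            return rank;
        }

        public static void main(String[] args) {
            // Page 0 links to 1 and 2, page 1 links to 2, page 2 links back to 0.
            int[][] links = { {1, 2}, {2}, {0} };
            System.out.println(Arrays.toString(pageRank(links, 50, 0.85)));
        }
    }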
I guess you heard it before, but doesn't an open-source search engine open itself up for blackhat Search Engine Optimization?
Potentially.
Let's say it takes spammers six weeks to reverse engineer a closed-source search engine's latest spam-detecting algorithm. With an open source engine, this can be done much faster. But in either case, the spammers will eventually figure out how it works; the only difference is how quickly. So the best anti-spam techniques, open or closed source, are those that continue to work even when their mechanism is known.
Also, if you, e.g., remove detected spammers from the index for six months, then there's not much they can do, once detected, to change their sites to elude detection. And if your spam detectors are based on statistical analyses of good and bad example sites, then you can, overnight, notice new patterns and remove the spammers before they have a chance to respond.
So open source can make it a little harder to stop spam, but it doesn't make it impossible. And closed-source search engines have not been able to use secrecy to solve this problem. I think the closed-source advantage here is not as great as folks imagine it to be.
How does Nutch relate to the distributed Web crawler Grub, and what do you think of it?
As far as I can tell, Grub is a project that lets folks donate their hardware and bandwidth to LookSmart's crawling effort. Only the client is open source, not the server, so folks can neither deploy their own version of Grub, nor can they access the data that Grub gathers.
What about distributed crawling more generally? When a search engine gets big, crawl-related expenses are dwarfed by search-related expenses. So a distributed crawler doesn't significantly improve costs, rather it makes more complicated something that is already relatively inexpensive. That's not a good tradeoff.
Widely distributed search is interesting, but I'm not sure it can yet be done and keep things as fast as they need to be. A faster search engine is a better search engine. When folks can quickly revise queries then they more frequently find what they're looking for before they get impatient. But building a widely distributed search system that can search billions of pages in a fraction of a second is difficult, since network latencies are high. Most of the half-second or so that Google takes to perform a search is network latency within a single datacenter. If you were to spread that same system over a bunch of PCs in people's houses, even connected by DSL and cable modems, the latencies are much higher and searches would probably take several seconds or longer. And hence it wouldn't be as good of a search engine.
You are emphasizing the importance of speed in a search engine. I'm often puzzled by how fast Google returns a result. Do you have an idea how they do it, and what is your experience with Nutch?
I believe Google does roughly what Nutch does: they broadcast queries to a number of nodes, each of which returns the top results over its set of pages. With a couple of million pages per node, disk accesses can be avoided for most queries and each node can process tens to hundreds of queries per second. If you want to search billions of pages then you have to broadcast each query to thousands of nodes. That's a lot of network traffic.
Some of this is described in www.computer.org/micro/mi2003/m2022.pdf
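The broadcast-and-merge scheme described here is straightforward to sketch. The outline below is a generic scatter-gather illustration, not Nutch's or Google's code: each "node" (here just a thread over a fake in-memory shard) returns its own top-k hits, and the frontend merges them into a global top-k.

    import java.util.*;
    import java.util.concurrent.*;

    // Scatter-gather sketch of the broadcast scheme described above: the query
    // goes to every node, each node returns its top-k hits for its shard of
    // pages, and the frontend merges them. Shard scoring is faked here.
    public class ScatterGatherSearch {
        record Hit(String url, double score) {}

        // Stand-in for one node's index shard: scores are random, for illustration.
        static List<Hit> searchShard(int shard, String query, int k) {
            List<Hit> hits = new ArrayList<>();
            for (int i = 0; i < k; i++) {
                hits.add(new Hit("http://shard" + shard + "/doc" + i, Math.random()));
            }
            hits.sort((a, b) -> Double.compare(b.score(), a.score()));
            return hits;
        }

        public static void main(String[] args) throws Exception {
            int shards = 4, k = 3;
            ExecutorService pool = Executors.newFixedThreadPool(shards);
            List<Future<List<Hit>>> futures = new ArrayList<>();
            for (int s = 0; s < shards; s++) {
                int shard = s;
                futures.add(pool.submit(() -> searchShard(shard, "open source search", k)));
            }
            // Merge the per-shard top-k lists into a single global top-k.
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> f : futures) merged.addAll(f.get());
            merged.sort((a, b) -> Double.compare(b.score(), a.score()));
            merged.subList(0, k).forEach(h -> System.out.println(h.url() + "  " + h.score()));
            pool.shutdown();
        }
    }

In a real deployment the scatter step is network RPC rather than a thread pool, which is exactly where the latency budget Cutting mentions gets spent.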
When you mention spam, do you have any spam-fighting algorithms in Nutch? How can one differentiate between spam patterns like link farms, and sites which just happen to be very popular?
We haven't yet had time to start working on this, but it's obviously an important area. Before we get to link farms we need to do the simple stuff: look for word stuffing, white-on-white text, etc.
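The "simple stuff" can be approximated with crude heuristics. The sketch below flags a page when one term accounts for an outsized share of its words (keyword stuffing); the thresholds are arbitrary, it says nothing about white-on-white text, and it is not Nutch code.

    import java.util.*;

    // Crude keyword-stuffing check: flag a page as suspicious when a single term
    // dominates its text. The 20% threshold and 20-word minimum are arbitrary.
    public class StuffingCheck {
        static boolean looksStuffed(String text) {
            Map<String, Integer> counts = new HashMap<>();
            int total = 0;
            for (String term : text.toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                counts.merge(term, 1, Integer::sum);
                total++;
            }
            int max = counts.values().stream().max(Integer::compare).orElse(0);
            return total > 20 && max > 0.2 * total; // one term is >20% of a non-trivial page
        }

        public static void main(String[] args) {
            String spam = String.join(" ", Collections.nCopies(30, "cheap pills")) + " buy now";
            System.out.println(looksStuffed(spam));                      // true
            System.out.println(looksStuffed("a short normal sentence")); // false
        }
    }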
I think the key to search quality in general (of which spam detection is a sub-problem) is to have a trusted supply of hand-evaluated search results. With this, one can train a ranking algorithm to generate better results. (Spammy results are just one kind of bad results.) Commercial search engines get trusted evaluations by hiring folks. It remains to be seen how Nutch will do this. We obviously cannot just accept all evaluations donated, or else spammers will spam the evaluations. So we need a means of establishing the trustability of volunteer evaluators. I think a peer-review system, perhaps something like Slashdot's karma system, could work here.
Where do you see search engines heading in the near and far future, and what do you think are the biggest hurdles to overcome from a developer's perspective?
Sorry, I'm not very imaginative here. My prediction is that web search in the coming decade is going to look more or less like web search of today. It's a safe bet. Web search evolved quickly for the first few years. It started in 1994 with WebCrawler, using standard information retrieval methods. The development of more web-specific methods took a few years, culminating in Google's 1998 launch. Since then, the introduction of new methods has slowed dramatically. The low-hanging fruit has been harvested. Innovation is easy when an area is young, and becomes more difficult as the field matures. Web search grew up in the 1990s, is now a cash cow, and will soon be a commodity.
As far as development challenges, I think operational reliability is a big one. We're working on developing something like GFS, the Google Filesystem. Stuff like this is essential to large-scale web search: you cannot let a failure of any single component cause a major hiccough; you must be able to easily scale by throwing more hardware into the pool, without massive reconfiguration; and you can't require an army of operators – things should largely fix themselves.