关于gethostbyname在多线程环境下的阻塞问题

  Unix/Linux下的gethostbyname函数常用来向DNS查询一个域名的IP地址。 由于DNS的递归查询,常常会发生gethostbyname函数在查询一个域名时严重超时。而该函数又不能像connect和read等函数那样通过setsockopt或者select函数那样设置超时时间,因此常常成为程序的瓶颈。有人提出一种解决办法是用alarm设置定时信号,如果超时就用setjmp和longjmp跳过gethostbyname函数(这种方式我没有试过,不知道具体效果如何)。
    在多线程下面,gethostbyname会一个更严重的问题,就是如果有一个线程的gethostbyname发生阻塞,其它线程都会在gethostbyname处发生阻塞。我在编写爬虫时也遇到了这个让我疑惑很久的问题,所有的爬虫线程都阻塞在gethostbyname处,导致爬虫速度非常慢。在网上google了很长时间这个问题,也没有找到解答。今天凑巧在实验室的googlegroup里面发现了一本电子书"Mining the Web - Discovering Knowledge from Hypertext Data",其中在讲解爬虫时有下面几段文字:

    Many clients for DNS resolution are coded poorly.Most UNIX systems provide an implementation of gethostbyname (the DNS client API—application program interface), which cannot concurrently handle multiple outstanding requests. Therefore, the crawler cannot issue many resolution requests together and poll at a later time for completion of individual requests, which is critical for acceptable performance. Furthermore, if the system-provided client is used, there is no way to distribute load among a number of DNS servers. For all these reasons, many crawlers choose to include their own custom client for DNS name resolution. The Mercator crawler from Compaq System Research Center reduced the time spent in DNS from as high as 87% to a modest 25% by implementing a custom client. The ADNS asynchronous DNS client library is ideal for use in crawlers.
    In spite of these optimizations, a large-scale crawler will spend a substantial fraction of its network time not waiting for Http data transfer, but for address resolution. For every hostname that has not been resolved before (which happens frequently with crawlers), the local DNS may have to go across many network hops to fill its cache for the first time. To overlap this unavoidable delay with useful work, prefetching can be used. When a page that has just been fetched is parsed, a stream of HREFs is extracted. Right at this time, that is, even before any of the corresponding URLs are fetched, hostnames are extracted from the HREF targets, and DNS resolution requests are made to the caching server. The prefetching client is usually implemented using UDP  instead of TCP, and it does not wait for resolution to be completed. The request serves only to fill the DNS cache so that resolution will be fast when the page is actually needed later on.

    大意是说unix的gethostbyname无法处理在并发程序下使用,这是先天的缺陷是无法改变的。大型爬虫往往不会使用gethostbyname,而是实现自己独立定制的DNS客户端。这样可以实现DNS的负载平衡,而且通过异步解析能够大大提高DNS解析速度。DNS客户端往往用UDP实现,可以在爬虫爬取网页前提前解析URL的IP。文章中还提到了一个开源的异步DNS库adns,主页是http://www.chiark.greenend.org.uk/~ian/adns/
    从以上可看出,gethostbyname并不适用于多线程环境以及其它对DNS解析速度要求较高的程序。

你可能感兴趣的:(UNIX/Linux,多线程,asynchronous,network,caching,performance,unix)