Configuring Nutch to simulate a browser and bypass anti-crawler restrictions

Original link: http://yangshangchuan.iteye.com/blog/2030741

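When we configure Nutch to crawl http://yangshangchuan.iteye.com, every fetched page contains only the message "Your access request was denied ......". This is the simplest anti-crawler strategy: the server just reads the User-Agent HTTP request header to decide whether the client is a person (a browser) or a crawler. To get around this restriction, we only need to configure Nutch to simulate a web browser.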
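To make the strategy concrete, here is a minimal sketch of what such a server-side check might look like. The rule the site really uses is not known; the class name NaiveUserAgentFilter and the marker list are assumptions made purely for illustration.

```java
import java.util.Locale;

// Hypothetical illustration of a naive User-Agent based filter.
// The marker substrings are assumptions; a real site may use a different rule.
class NaiveUserAgentFilter {

    private static final String[] CRAWLER_MARKERS = {"nutch", "bot", "crawler", "spider"};

    // Returns true when the User-Agent value looks like it comes from a crawler.
    static boolean looksLikeCrawler(String userAgent) {
        if (userAgent == null || userAgent.trim().isEmpty()) {
            return true; // a missing User-Agent is treated as a machine
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String marker : CRAWLER_MARKERS) {
            if (ua.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "MyCrawler" is a made-up agent name.
        System.out.println(looksLikeCrawler("MyCrawler/Nutch-1.7"));
        System.out.println(looksLikeCrawler(
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"));
    }
}
```

A filter of this kind only looks at one header, which is why changing the header value is enough to pass it.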

 

nutch-default.xml contains five properties related to the User-Agent:

XML code
<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>
<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>
<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.
  </description>
</property>
<property>
  <name>http.agent.version</name>
  <value>Nutch-1.7</value>
  <description>A version string to advertise in the User-Agent
   header.</description>
</property>

 

The class nutch1.7/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java shows how these five properties are combined into the User-Agent:

Java code
this.userAgent = getAgentString( conf.get("http.agent.name"),
        conf.get("http.agent.version"),
        conf.get("http.agent.description"),
        conf.get("http.agent.url"),
        conf.get("http.agent.email") );

 

Java code
private static String getAgentString(String agentName,
                                     String agentVersion,
                                     String agentDesc,
                                     String agentURL,
                                     String agentEmail) {

  if ( (agentName == null) || (agentName.trim().length() == 0) ) {
    // TODO : NUTCH-258
    if (LOGGER.isErrorEnabled()) {
      LOGGER.error("No User-Agent string set (http.agent.name)!");
    }
  }

  StringBuffer buf= new StringBuffer();

  buf.append(agentName);
  if (agentVersion != null) {
    buf.append("/");
    buf.append(agentVersion);
  }
  if ( ((agentDesc != null) && (agentDesc.length() != 0))
  || ((agentEmail != null) && (agentEmail.length() != 0))
  || ((agentURL != null) && (agentURL.length() != 0)) ) {
    buf.append(" (");

    if ((agentDesc != null) && (agentDesc.length() != 0)) {
      buf.append(agentDesc);
      if ( (agentURL != null) || (agentEmail != null) )
        buf.append("; ");
    }

    if ((agentURL != null) && (agentURL.length() != 0)) {
      buf.append(agentURL);
      if (agentEmail != null)
        buf.append("; ");
    }

    if ((agentEmail != null) && (agentEmail.length() != 0))
      buf.append(agentEmail);

    buf.append(")");
  }
  return buf.toString();
}
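To see why an unmodified Nutch is so easy to recognise, the name/version part of getAgentString can be tried out in isolation. The sketch below mirrors only that part and assumes the description, URL and email properties are left empty, so no parenthesised suffix is produced. AgentStringDemo and the agent name "MyCrawler" are made up for this illustration and are not part of Nutch.

```java
// Stand-alone sketch of the name/version join performed by getAgentString.
// AgentStringDemo is a hypothetical class, not part of Nutch.
class AgentStringDemo {

    // Mirrors only the first part of getAgentString: name, then "/" and the
    // version when a version is present. The parenthesised part is omitted
    // on the assumption that description, URL and email are all empty.
    static String joinNameAndVersion(String agentName, String agentVersion) {
        StringBuilder buf = new StringBuilder();
        buf.append(agentName);
        if (agentVersion != null) {
            buf.append("/");
            buf.append(agentVersion);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // A one-word bot name combined with the version shipped in nutch-default.xml.
        System.out.println(joinNameAndVersion("MyCrawler", "Nutch-1.7"));
        // prints: MyCrawler/Nutch-1.7

        // Without a version, only the name is sent.
        System.out.println(joinNameAndVersion("MyCrawler", null));
        // prints: MyCrawler
    }
}
```

The important detail is the fixed "/" between the two values: whatever we put into http.agent.name and http.agent.version ends up joined by exactly one slash.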

 

The class nutch1.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java sends the User-Agent request header. The userAgent returned by http.getUserAgent() here is the userAgent field set in HttpBase.java:

Java code
String userAgent = http.getUserAgent();
if ((userAgent == null) || (userAgent.length() == 0)) {
    if (Http.LOG.isErrorEnabled()) { Http.LOG.error("User-agent is not set!"); }
} else {
    reqStr.append("User-Agent: ");
    reqStr.append(userAgent);
    reqStr.append("\r\n");
}
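For readers who have not looked at a raw HTTP request before, the following sketch shows where the header ends up on the wire. It is a generic illustration, not a copy of Nutch's HttpResponse code; the class RequestHeaderDemo, the HTTP/1.0 request line and the host example.com are assumptions.

```java
// Generic illustration of a raw HTTP request head with a User-Agent line.
// RequestHeaderDemo is a hypothetical helper, not part of Nutch.
class RequestHeaderDemo {

    static String buildRequestHead(String host, String path, String userAgent) {
        StringBuilder reqStr = new StringBuilder();
        reqStr.append("GET ").append(path).append(" HTTP/1.0\r\n");
        reqStr.append("Host: ").append(host).append("\r\n");
        // Same rule as in the snippet above: only send the header when it is set.
        if (userAgent != null && !userAgent.isEmpty()) {
            reqStr.append("User-Agent: ");
            reqStr.append(userAgent);
            reqStr.append("\r\n");
        }
        reqStr.append("\r\n"); // an empty line ends the header section
        return reqStr.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildRequestHead("example.com", "/", "MyCrawler/Nutch-1.7"));
    }
}
```

The server sees the User-Agent line exactly as it was assembled, so it has no way to tell whether the value came from a real browser or from a configuration file.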

 

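The analysis above shows that the header value is always http.agent.name, followed by "/" and http.agent.version. So to imitate a specific browser, we split that browser's User-Agent string at one of its "/" characters and put the two halves into these two properties in nutch-site.xml. The description, URL and email properties stay empty, because getAgentString would otherwise append them in parentheses. Adding any one of the following configurations is enough: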

 

1. Simulating Firefox:

XML code
<property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko</value>
</property>
<property>
    <name>http.agent.version</name>
    <value>20100101 Firefox/27.0</value>
</property>

 

2. Simulating IE:

XML code
<property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident</value>
</property>
<property>
    <name>http.agent.version</name>
    <value>6.0)</value>
</property>

 

3. Simulating Chrome:

XML code
<property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari</value>
</property>
<property>
    <name>http.agent.version</name>
    <value>537.36</value>
</property>

 

4. Simulating Safari:

XML code
<property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari</value>
</property>
<property>
    <name>http.agent.version</name>
    <value>534.57.2</value>
</property>

 

 

5. Simulating Opera:

XML code
<property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR</value>
</property>
<property>
    <name>http.agent.version</name>
    <value>19.0.1326.59</value>
</property>
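All five configurations use the same trick, so the two property values for any other browser can be derived mechanically. The helper below is a small sketch under that assumption: it cuts a full User-Agent string at its last "/". UserAgentSplitter is a hypothetical name. For some browsers the cut may fall at a different slash than in the examples above, but because Nutch re-inserts exactly one "/", the header that is sent is the same.

```java
// Hypothetical helper that derives http.agent.name and http.agent.version
// from a complete browser User-Agent string.
class UserAgentSplitter {

    // Returns {name, version} such that name + "/" + version equals fullUserAgent.
    static String[] split(String fullUserAgent) {
        int slash = fullUserAgent.lastIndexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException(
                "User-Agent contains no '/': " + fullUserAgent);
        }
        return new String[] {
            fullUserAgent.substring(0, slash),
            fullUserAgent.substring(slash + 1)
        };
    }

    public static void main(String[] args) {
        String chrome = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36";
        String[] parts = split(chrome);
        System.out.println("http.agent.name    = " + parts[0]);
        System.out.println("http.agent.version = " + parts[1]);
    }
}
```

For the Chrome string used here, the two printed values match the Chrome configuration in item 3.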

 

 

Postscript: ways to check your own User-Agent:

1. http://www.useragentstring.com

2. http://whatsmyuseragent.com

3. http://www.enhanceie.com/ua.aspx

 
