Articles on Web Crawlers

1. Programming a Spider in Java

English original: http://www.developer.com/java/other/article.php/1573761

Chinese translation: http://blog.csdn.net/shuidao/archive/2007/09/05/1772512.aspx

 

2. Steps for configuring Heritrix 1.14.3 in MyEclipse

http://blog.163.com/caixinbao1/blog/static/161494162009730103718497/

 

3. Heritrix-related articles (JVM heap option: -Xmx512m)

http://www.cnblogs.com/hustcat/category/139956.html

http://atwo.iteye.com/blog/216960

4. Heritrix home page: http://crawler.archive.org/

Heritrix developer manual: http://crawler.archive.org/articles/developer_manual/index.html

Heritrix user manual: http://crawler.archive.org/articles/user_manual/index.html

Notes on using Heritrix: http://www.ruanko.com:9090/uchome/space.php?uid=871&do=blog&id=5773

Starting Heritrix programmatically: http://www.soidc.net/discuss/1/040101/00/615080_1.html

http://lucenebook.spaces.live.com/

http://www.iteye.com/topic/141272

Heritrix Yahoo group: http://tech.groups.yahoo.com/group/archive-crawler/

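Problem: options cannot be added:

In Eclipse's Run Dialog, open the Classpath tab and select User Entries. An Advanced option then becomes available on the right; choose Add External Folder and add your conf folder. Try again, and the features on the Modules page work normally.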
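To check whether the conf folder really ended up on the classpath of the run configuration, a small lookup like the one below can help. This is only a sketch: the class name ClasspathCheckSketch is made up for illustration, and the default resource name modules/Processor.options is an assumption about where Heritrix 1.x looks for its module lists, so pass a different name as the first argument if your layout differs.

```java
// Hypothetical helper, not part of Heritrix: reports whether a resource
// is visible on the classpath of the current run configuration.
final class ClasspathCheckSketch {

    // Returns true if the system class loader can locate the named resource.
    static boolean isOnClasspath(String resourceName) {
        return ClassLoader.getSystemResource(resourceName) != null;
    }

    public static void main(String[] args) {
        // Assumed default: the options file Heritrix 1.x is believed to read
        // from the conf folder; override it via the first program argument.
        String resource = args.length > 0 ? args[0] : "modules/Processor.options";
        if (isOnClasspath(resource)) {
            System.out.println(resource + " found at " + ClassLoader.getSystemResource(resource));
        } else {
            System.out.println(resource + " is NOT on the classpath; check User Entries");
        }
    }
}
```

Run it with the same classpath entries as the Heritrix launch configuration; if the resource is reported as missing, the conf folder was probably not added under User Entries.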

5. WSDL document downloads

http://www.biocatalogue.org/

http://www.webservicex.net/WCF/Default.aspx

6. Collected search engine resources: http://wind-bell.iteye.com/blog/81504

package my.processor;

import java.util.logging.Logger;

import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.postprocessor.FrontierScheduler;

/**
 * A FrontierScheduler that drops URIs which are unlikely to be WSDL
 * documents (images, .doc and .html files) before they reach the frontier.
 */
public class FrontierWsdlOnly extends FrontierScheduler
{

	private static final Logger logger = Logger.getLogger(FrontierWsdlOnly.class.getName());

	public FrontierWsdlOnly(String name) {
		super(name);
	}

	@Override
	protected void schedule(CandidateURI caUri) {
		String url = caUri.toString();
		// Skip unwanted resources; note that the match is case-sensitive.
		if (url.endsWith(".jpg")
				|| url.endsWith(".gif")
				|| url.endsWith(".doc")
				|| url.endsWith(".html")
				|| url.contains("/images/"))
		{
			return;
		}
		// Everything else is handed to the frontier as usual.
		getController().getFrontier().schedule(caUri);
	}
}
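The filtering rule above can be tried out without a Heritrix installation. The following is a minimal sketch that pulls the same checks into a standalone predicate; the class name UrlFilterSketch and the method shouldSkip are hypothetical, and unlike the processor above it lowercases the URL first, so that upper-case suffixes such as .JPG are also skipped.

```java
import java.util.Locale;

// Hypothetical standalone version of the skip rule, for experimenting
// outside Heritrix. Not part of the Heritrix API.
final class UrlFilterSketch {

    // Suffixes taken from the processor above.
    private static final String[] SKIPPED_SUFFIXES = {".jpg", ".gif", ".doc", ".html"};

    // Returns true when the URL should NOT be scheduled.
    // Assumption: case-insensitive matching, which differs from the original.
    static boolean shouldSkip(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        if (lower.contains("/images/")) {
            return true;
        }
        for (String suffix : SKIPPED_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/service?wsdl",
            "http://example.com/index.html",
            "http://example.com/images/logo.png",
            "http://example.com/photo.JPG"
        };
        for (String url : samples) {
            System.out.println(url + " -> " + (shouldSkip(url) ? "skip" : "schedule"));
        }
    }
}
```

Keeping the rule in a plain static method makes it easy to adjust the suffix list and test it before wiring it into schedule().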

 

Important: be sure to add 1.12.1-srcconf, not 1.12.1conf.
