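The URL is the heart of a crawler: a crawler works by following URLs layer by layer until the whole crawl is finished. URLs in Heritrix are somewhat unusual and come with the following inheritance chains, where each arrow means "extends" (the inheritance itself is not covered here, so there is no diagram):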
1) org.archive.crawler.datamodel.CrawlURI -> CandidateURI
2) org.archive.net.UURI -> org.archive.net.LaxURI -> org.apache.commons.httpclient.URI
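As mentioned earlier, the difference between CrawlURI and CandidateURI is that a CrawlURI is converted from a CandidateURI that has passed the scheduler (Frontier). Let's start with CandidateURI, focusing on its fields: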
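    // Scheduling directives: a smaller value means a more urgent URI.
    public static final int HIGHEST = 0;
    public static final int HIGH = 1;
    public static final int MEDIUM = 2;
    public static final int NORMAL = 3;

    // Cached string form of this CandidateURI.
    private String cachedCandidateURIString = null;

    // Key set by the Frontier for its own use, usually the name of the queue this URI goes into.
    private String classKey;

    // Force a fetch even if this URI has been seen before.
    private boolean forceRevisit = false;

    // Whether this URI is a seed.
    private boolean isSeed = false;

    // Flexible list of dynamic attributes (key/value pairs).
    private transient AList alist;

    // Hop letters describing how this URI was reached from a seed:
    // L = link, E = embed, R = redirect, P = prerequisite, X = speculative embed.
    private String pathFromSeed;

    // Scheduling directive of this URI; NORMAL by default.
    private int schedulingDirective = NORMAL;
    // The URI itself.
    private transient UURI uuri;
    // The URI from which this one was discovered.
    private transient UURI via;
    // Context in which this URI was found inside the via document.
    private CharSequence viaContext;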
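To make two of these fields more concrete, here is a minimal sketch of how pathFromSeed grows by one hop letter per discovery and how schedulingDirective could be used to order URIs. SketchCandidate and its methods are hypothetical stand-ins written for this post, not Heritrix classes, and the plain sort is only an assumption about how a scheduler might use the directive; the real Frontier is more involved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for CandidateURI; it only mirrors two of its fields.
class SketchCandidate {
    static final int HIGHEST = 0;
    static final int HIGH = 1;
    static final int MEDIUM = 2;
    static final int NORMAL = 3;

    final String uri;
    final String pathFromSeed;
    final int schedulingDirective;

    SketchCandidate(String uri, String pathFromSeed, int schedulingDirective) {
        this.uri = uri;
        this.pathFromSeed = pathFromSeed;
        this.schedulingDirective = schedulingDirective;
    }

    // A seed has an empty path and, in this sketch, the default directive.
    static SketchCandidate seed(String uri) {
        return new SketchCandidate(uri, "", NORMAL);
    }

    // A URI discovered from this one extends the path by one hop letter.
    SketchCandidate discover(String childUri, char hop, int directive) {
        return new SketchCandidate(childUri, pathFromSeed + hop, directive);
    }

    // Returns the URI strings ordered by directive, most urgent first.
    static List<String> inSchedulingOrder(List<SketchCandidate> candidates) {
        List<SketchCandidate> sorted = new ArrayList<SketchCandidate>(candidates);
        sorted.sort(Comparator.comparingInt((SketchCandidate c) -> c.schedulingDirective));
        List<String> uris = new ArrayList<String>();
        for (SketchCandidate c : sorted) {
            uris.add(c.uri);
        }
        return uris;
    }
}

class SketchCandidateDemo {
    public static void main(String[] args) {
        SketchCandidate seed = SketchCandidate.seed("http://example.com/");
        SketchCandidate page = seed.discover("http://example.com/a.html", 'L', SketchCandidate.NORMAL);
        SketchCandidate image = page.discover("http://example.com/logo.png", 'E', SketchCandidate.MEDIUM);
        SketchCandidate dns = seed.discover("dns:example.com", 'P', SketchCandidate.HIGHEST);

        // The image was reached by a link hop followed by an embed hop.
        System.out.println(image.pathFromSeed);

        List<SketchCandidate> all = new ArrayList<SketchCandidate>();
        all.add(page);
        all.add(image);
        all.add(dns);
        // The prerequisite (HIGHEST) comes first, the NORMAL page last.
        System.out.println(SketchCandidate.inSchedulingOrder(all));
    }
}
```

The point of the sketch is only that the path is a history of hops and the directive is a small integer where lower means sooner.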
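Next, the fields of CrawlURI. As said above, the biggest difference from CandidateURI is that a CrawlURI has passed the scheduler, which means it enters a queue and gets fetched. CrawlURI therefore carries many more fields than CandidateURI to record the state of the fetch, such as which processor comes next. See the code and comments below: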
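    // Keys of AList entries that are kept across processing passes.
    private static final List<Object> alistPersistentMember = new CopyOnWriteArrayList<Object>(
            new String[] { A_CREDENTIAL_AVATARS_KEY });

    // Upper bound on stored outlinks; can be overridden through a system property.
    public static final int MAX_OUTLINKS = Integer.parseInt(System.getProperty(
            CrawlURI.class.getName() + ".maxOutLinks", "6000"));

    // Number of outlinks dropped because MAX_OUTLINKS was reached.
    transient private int discardedOutlinks = 0;

    // Marker for values that have not been calculated yet.
    public static final int UNCALCULATED = -1;
    private String cachedCrawlURIString = null;   // cached string form of this CrawlURI
    private byte[] contentDigest = null;          // digest (hash) of the fetched content
    private String contentDigestScheme = null;    // name of the digest algorithm
    private long contentLength = UNCALCULATED;    // length of the content body
    private long contentSize = UNCALCULATED;      // bytes recorded so far
    private String contentType = null;            // content type of the response
    private int deferrals = 0;                    // times postponed because of prerequisites

    private int fetchAttempts = 0;                // fetch attempts made so far
    private int fetchStatus = 0;                  // fetch status; 0 means not attempted yet
    transient Object holder;                      // object currently holding this URI, e.g. a queue
    int holderCost = UNCALCULATED;                // cost assigned to this URI by its holder
    transient Object holderKey;                   // key identifying the holder
    private transient HttpRecorder httpRecorder = null;      // recorded request and response
    transient private boolean linkExtractorFinished = false; // set once a link extractor has handled this URI
    transient private Processor nextProcessor;               // next processor to run
    transient private ProcessorChain nextProcessorChain;     // next processor chain to run
    protected long ordinal;                       // increasing sequence number within a crawl
    transient Collection<Object> outLinks = new HashSet<Object>(); // discovered outbound links
    private boolean post = false;                 // send a POST instead of a GET
    private boolean prerequisite = false;         // whether this URI is a prerequisite of another one
    transient private int threadNumber;           // number of the thread processing this URI
    private String userAgent = null;              // User-Agent to send; null means use the global setting
    @Deprecated
    private int embedHopCount = UNCALCULATED;     // embed hops since the last link hop
    @Deprecated
    private int linkHopCount = UNCALCULATED;      // link hops from the seed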
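The MAX_OUTLINKS line is worth a closer look: the limit is read from a system property whose name is built from the class name, with 6000 as the fallback. Given the package shown earlier, that should make it settable with -Dorg.archive.crawler.datamodel.CrawlURI.maxOutLinks=... at JVM startup. Below is a minimal sketch of the same pattern. OutlinkCapSketch is a hypothetical class written for this post, and the way the counter is incremented is my assumption about how discardedOutlinks is used, not a copy of the Heritrix code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in; it imitates only the outlink cap of CrawlURI.
class OutlinkCapSketch {
    // Property name invented for this sketch, built from the class name.
    static final String PROPERTY = OutlinkCapSketch.class.getName() + ".maxOutLinks";

    final int maxOutlinks;
    final Set<String> outLinks = new HashSet<String>();
    int discardedOutlinks = 0;

    OutlinkCapSketch(int maxOutlinks) {
        this.maxOutlinks = maxOutlinks;
    }

    // Reads the cap from the system property, falling back to 6000.
    static int readLimit() {
        return Integer.parseInt(System.getProperty(PROPERTY, "6000"));
    }

    // Stores the link while below the cap; otherwise only counts it as discarded.
    void addOutLink(String link) {
        if (outLinks.size() < maxOutlinks) {
            outLinks.add(link);
        } else {
            discardedOutlinks++;
        }
    }

    public static void main(String[] args) {
        System.setProperty(PROPERTY, "2");
        OutlinkCapSketch uri = new OutlinkCapSketch(readLimit());
        uri.addOutLink("http://example.com/1");
        uri.addOutLink("http://example.com/2");
        uri.addOutLink("http://example.com/3");
        System.out.println(uri.outLinks.size() + " kept, " + uri.discardedOutlinks + " discarded");
    }
}
```

One difference from the original: MAX_OUTLINKS is a static final constant, so the property is read once when the class is loaded, while readLimit() in the sketch reads it on every call.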
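Many people who use Heritrix also need to attach their own attributes to a URL, and I had the same need. Back then I simply modified the source code, added a few fields, and assigned them to the URLs produced during extraction. Later I found this was completely unnecessary: Heritrix already provides a way to store arbitrary attributes and values. Heritrix uses the same mechanism itself at runtime for values that change dynamically, such as details of the HTTP transaction. Here is how it works and how to use it: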
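1) How it works:
CandidateURI has the field private transient AList alist;. AList is an interface backed by a Hashtable (the implementation is HashtableAList), where each key is an attribute name and each value is that attribute's value. It travels with the URI through the whole crawl and can be read and written at any point. The field is transient, so default Java serialization skips it; in addition, CrawlURI clears the AList at the end of every processing pass (processingCleanup()) and keeps only the keys registered as persistent. For that purpose CrawlURI has a variable that records which keys must be kept, that is, the attributes we want to persist: private static final List<Object> alistPersistentMember = new CopyOnWriteArrayList<Object>(new String[] { A_CREDENTIAL_AVATARS_KEY });. Its type is CopyOnWriteArrayList, a thread-safe List that copies its backing array on every write. So when you need a key to persist, add it to this list through the static method CrawlURI.addAlistPersistentMember(key).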
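The idea can be sketched in a few lines. This is not the Heritrix implementation: PersistentAttributesSketch, addPersistentKey, and the key names are hypothetical, and the sketch assumes that "persistent" means "survives the cleanup at the end of a processing pass".

```java
import java.util.Hashtable;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for the attribute handling described above.
class PersistentAttributesSketch {
    // Class-wide registry of keys that survive cleanup (placeholder default key).
    private static final List<String> PERSISTENT_KEYS =
            new CopyOnWriteArrayList<String>(new String[] { "sketch-default-key" });

    // Per-URI attribute table: attribute name -> value.
    private Map<String, Object> attributes = new Hashtable<String, Object>();

    static void addPersistentKey(String key) {
        PERSISTENT_KEYS.add(key);
    }

    void putString(String key, String value) {
        attributes.put(key, value);
    }

    boolean containsKey(String key) {
        return attributes.containsKey(key);
    }

    // Replaces the table with one that contains only the registered keys.
    void processingCleanup() {
        Map<String, Object> kept = new Hashtable<String, Object>();
        for (String key : PERSISTENT_KEYS) {
            Object value = attributes.get(key);
            if (value != null) {
                kept.put(key, value);
            }
        }
        attributes = kept;
    }

    public static void main(String[] args) {
        addPersistentKey("site-category");
        PersistentAttributesSketch uri = new PersistentAttributesSketch();
        uri.putString("site-category", "news");
        uri.putString("scratch-note", "only needed during this pass");
        uri.processingCleanup();
        System.out.println(uri.containsKey("site-category")); // registered, so kept
        System.out.println(uri.containsKey("scratch-note"));  // not registered, so dropped
    }
}
```

Because the registry is static, registering a key affects every URI, which is presumably why a list that is safe for concurrent use was chosen.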
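2) How to use it:
1. Store an attribute and its value; the value can be stored as one of several types:
    // Store an int value under the given key.
    public void putInt(String key, int value) {
        getAList().putInt(key, value);
    }

    // Store a long value under the given key.
    public void putLong(String key, long value) {
        getAList().putLong(key, value);
    }

    // Store an arbitrary object under the given key.
    public void putObject(String key, Object value) {
        getAList().putObject(key, value);
    }

    // Store a String value under the given key.
    public void putString(String key, String value) {
        getAList().putString(key, value);
    }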
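2. Read an attribute's value:
    // Read the value stored under the key as an int.
    public int getInt(String key) {
        return getAList().getInt(key);
    }

    // Read the value stored under the key as a long.
    public long getLong(String key) {
        return getAList().getLong(key);
    }

    // Read the value stored under the key as an Object.
    public Object getObject(String key) {
        return getAList().getObject(key);
    }

    // Read the value stored under the key as a String.
    public String getString(String key) {
        return getAList().getString(key);
    }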
3. Check whether an attribute is present:
    // True if a value is stored under the given key.
    public boolean containsKey(String key) {
        return getAList().containsKey(key);
    }
4. Get all attribute names:
    // Iterator over all keys currently stored.
    public Iterator keys() {
        return getAList().getKeys();
    }
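5. Make an attribute heritable, so that its value is carried over to the URIs discovered from this one (note that this is inheritance by descendant URIs, which is separate from the alistPersistentMember list described under "How it works"):
    public void makeHeritable(String key) {
        @SuppressWarnings("unchecked")
        List<String> heritableKeys = (List<String>) getObject(A_HERITABLE_KEYS);
        if (heritableKeys == null) {
            // First heritable key: create the list and make the list itself heritable.
            heritableKeys = new ArrayList<String>();
            heritableKeys.add(A_HERITABLE_KEYS);
            putObject(A_HERITABLE_KEYS, heritableKeys);
        }
        heritableKeys.add(key);
    }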
6. Stop an attribute from being heritable:
    public void makeNonHeritable(String key) {
        List heritableKeys = (List) getObject(A_HERITABLE_KEYS);
        if (heritableKeys == null) {
            return;
        }
        heritableKeys.remove(key);
        if (heritableKeys.size() == 1) {
            // Only the list's own key is left, so drop the list entirely.
            remove(A_HERITABLE_KEYS);
        }
    }
These six operations are all you need to add your own attributes and to control how they are carried along during the crawl.
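To see the operations working together without setting up Heritrix, here is a minimal self-contained sketch. AttributeUriSketch, createChild, and all key names are hypothetical; createChild in particular is my assumption about what "heritable" amounts to (values of the listed keys being copied to a newly discovered URI), not the Heritrix code.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for CandidateURI's attribute methods; not Heritrix code.
class AttributeUriSketch {
    // Placeholder key; the actual value of A_HERITABLE_KEYS is not assumed here.
    static final String HERITABLE_KEYS = "sketch-heritable-keys";

    private final Map<String, Object> attributes = new Hashtable<String, Object>();

    void putInt(String key, int value) { attributes.put(key, Integer.valueOf(value)); }
    void putString(String key, String value) { attributes.put(key, value); }
    void putObject(String key, Object value) { attributes.put(key, value); }

    // In this sketch, reading a missing int key throws a NullPointerException.
    int getInt(String key) { return ((Integer) attributes.get(key)).intValue(); }
    String getString(String key) { return (String) attributes.get(key); }
    Object getObject(String key) { return attributes.get(key); }
    boolean containsKey(String key) { return attributes.containsKey(key); }

    // Same bookkeeping as the makeHeritable listing above.
    @SuppressWarnings("unchecked")
    void makeHeritable(String key) {
        List<String> keys = (List<String>) getObject(HERITABLE_KEYS);
        if (keys == null) {
            keys = new ArrayList<String>();
            keys.add(HERITABLE_KEYS);
            putObject(HERITABLE_KEYS, keys);
        }
        keys.add(key);
    }

    // Assumed inheritance step: copy every heritable key into a new child.
    @SuppressWarnings("unchecked")
    AttributeUriSketch createChild() {
        AttributeUriSketch child = new AttributeUriSketch();
        List<String> keys = (List<String>) getObject(HERITABLE_KEYS);
        if (keys != null) {
            for (String key : keys) {
                if (containsKey(key)) {
                    child.putObject(key, getObject(key));
                }
            }
        }
        return child;
    }
}

class AttributeUriSketchDemo {
    public static void main(String[] args) {
        AttributeUriSketch parent = new AttributeUriSketch();
        parent.putString("site-category", "news");
        parent.putInt("my-depth", 2);
        parent.makeHeritable("site-category");

        AttributeUriSketch child = parent.createChild();
        System.out.println(child.getString("site-category")); // inherited from the parent
        // "my-depth" was never made heritable, so check before reading it.
        int depth = child.containsKey("my-depth") ? child.getInt("my-depth") : 0;
        System.out.println(depth);
    }
}
```

The demo also shows why containsKey is useful: in this sketch an unguarded getInt on a missing key fails, and it seems safest to assume the same when working with the real AList.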