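Heritrix 3.1.0 Source Code Analysis (26)

The previous post looked at how Heritrix 3.1.0 wraps the request-handling classes of the HttpClient component. This post looks at how Heritrix 3.1.0 wraps the credentials attached to requests.

The classes in the package org.archive.modules.credential all deal with request credentials.

Start with the CredentialStore class. It keeps every credential (Credential) configured for the application in a Map, so other code only needs to call this class to obtain a credential.

Its important methods are as follows: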

    KeyedProperties kp = new KeyedProperties();
    public KeyedProperties getKeyedProperties() {
        return kp;
    }

    /**
     * Credentials used by heritrix authenticating. See
     * http://crawler.archive.org/proposals/auth/ for background.
     * 
     * @see http://crawler.archive.org/proposals/auth/
     */
    {
        setCredentials(new HashMap<String, Credential>());
    }
    @SuppressWarnings("unchecked")
    public Map<String,Credential> getCredentials() {
        return (Map<String,Credential>) kp.get("credentials");
    }
    public void setCredentials(Map<String,Credential> map) {
        kp.put("credentials",map);
    }

    /**
     * List of possible credential types as a List.
     *
     * This types are inner classes of this credential type so they cannot
     * be created without their being associated with a credential list.
     */
    private static final List<Class<?>> credentialTypes;
    // Initialize the credentialType data member.
    static {
        // Array of all known credential types.
        Class<?> [] tmp = {HtmlFormCredential.class, HttpAuthenticationCredential.class};
        credentialTypes = Collections.unmodifiableList(Arrays.asList(tmp));
    }

    /**
     * Constructor.
     */
    public CredentialStore() {
    }

    /**
     * @return Unmodifable list of credential types.
     */
    public static List<Class<?>> getCredentialTypes() {
        return CredentialStore.credentialTypes;
    }

    /**
     * @param context Pass a ProcessorURI.  Used to set
     * context.
     * @return An iterator or null.
     */
    public Collection<Credential> getAll() {
        Map<String,Credential> map = getCredentials();
        return map.values();
    }

    /**
     * @param context  Used to set context.
     * @param name Name to give the manufactured credential.  Should be unique
     * else the add of the credential to the list of credentials will fail.
     * @return Returns <code>name</code>'d credential.
     * @throws AttributeNotFoundException
     * @throws MBeanException
     * @throws ReflectionException
     */
    public Credential get(/*StateProvider*/Object context, String name) {
        return getCredentials().get(name);
    }

    /**
     * Return set made up of all credentials of the passed
     * <code>type</code>.
     *
     * @param context  Used to set context.  
     * @param type Type of the list to return.  Type is some superclass of
     * credentials.
     * @param rootUri RootUri to match.  May be null.  In this case we return
     * all.  Currently we expect the CrawlServer name to equate to root Uri.
     * Its not.  Currently it doesn't distingush between servers of same name
     * but different ports (e.g. http and https).
     * @return Unmodifable sublist of all elements of passed type.
     */
    public Set<Credential> subset(CrawlURI context, Class<?> type, String rootUri) {
        Set<Credential> result = null;
        for (Credential c: getAll()) {
            if (!type.isInstance(c)) {
                continue;
            }
            if (rootUri != null) {
                String cd = c.getDomain();
                if (cd == null) {
                    continue;
                }
                if (!rootUri.equalsIgnoreCase(cd)) {
                    continue;
                }
            }
            if (result == null) {
                result = new HashSet<Credential>();
            }
            result.add(c);
        }
        return result;
    }

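The methods above provide access to all credentials (held in a Map), lookup of a single credential by name (the Map key), and the list of all credential types.

(Note that the last method, subset, never uses its CrawlURI context parameter. It returns only the credentials that are instances of the given type and, when rootUri is not null, whose domain equals rootUri ignoring case. When nothing matches it returns null rather than an empty set.)

The static initializer shows that the system ships with two credential types, HtmlFormCredential.class and HttpAuthenticationCredential.class. The former is used for HTML form authentication, the latter for Basic/Digest HTTP authentication.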
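To make the filtering rules of subset concrete, here is a minimal, self-contained sketch. It does not use the Heritrix classes: SketchCredential, SketchFormCredential, SketchHttpAuthCredential and SketchCredentialStore are hypothetical stand-ins that model only the domain property, and the unused CrawlURI parameter is dropped.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for Heritrix's Credential hierarchy; only the
// domain property is modelled.
abstract class SketchCredential {
    private final String domain;
    SketchCredential(String domain) { this.domain = domain; }
    String getDomain() { return domain; }
}

class SketchFormCredential extends SketchCredential {
    SketchFormCredential(String domain) { super(domain); }
}

class SketchHttpAuthCredential extends SketchCredential {
    SketchHttpAuthCredential(String domain) { super(domain); }
}

class SketchCredentialStore {
    private final Map<String, SketchCredential> credentials = new HashMap<>();

    void put(String name, SketchCredential c) { credentials.put(name, c); }

    // Same rules as the subset method quoted above: filter on type, then
    // (if rootUri is non-null) on domain, ignoring case.
    // Returns null, not an empty set, when nothing matches.
    Set<SketchCredential> subset(Class<?> type, String rootUri) {
        Set<SketchCredential> result = null;
        for (SketchCredential c : credentials.values()) {
            if (!type.isInstance(c)) continue;
            if (rootUri != null) {
                String cd = c.getDomain();
                if (cd == null || !rootUri.equalsIgnoreCase(cd)) continue;
            }
            if (result == null) result = new HashSet<>();
            result.add(c);
        }
        return result;
    }
}

class SubsetDemo {
    public static void main(String[] args) {
        SketchCredentialStore store = new SketchCredentialStore();
        store.put("form", new SketchFormCredential("example.com"));
        store.put("basic", new SketchHttpAuthCredential("example.com"));
        // Only the form credential matches; the domain comparison ignores case.
        System.out.println(store.subset(SketchFormCredential.class, "EXAMPLE.com").size());
        // No credential for this domain, so the result is null.
        System.out.println(store.subset(SketchFormCredential.class, "other.org"));
    }
}
```

Because the method can return null, a caller has to null-check the result before iterating over it.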

Both credential types extend the abstract class Credential, so let's look at the methods of that class first.

    /**
     * Domain name.
     * The root domain this credential goes against: E.g. www.archive.org
     */
    String domain = "";
    /**
     * @param context Context to use when searching for credential domain.
     * @return The domain/root URI this credential is to go against.
     * @throws AttributeNotFoundException If attribute not found.
     */
    public String getDomain() {
        return this.domain;
    }
    public void setDomain(String domain) {
        this.domain = domain;
    }

    /**
     * Adds this credential to the CrawlURI.
     * Attach this credentials avatar to the passed <code>curi</code> .
     *
     * Override if credential knows internally what it wants to attach as
     * payload.  Otherwise, if payload is external, use the below
     * {@link #attach(CrawlURI, String)}.
     *
     * @param curi CrawlURI to load with credentials.
     */
    public void attach(CrawlURI curi) {
        curi.getCredentials().add(this);
    }

    /**
     * Removes this credential from the CrawlURI.
     * Detach this credential from passed curi.
     *
     * @param curi
     * @return True if we detached a Credential reference.
     */
    public boolean detach(CrawlURI curi) {
        return curi.getCredentials().remove(this);
    }

    /**
     * Removes all credentials of this type from the CrawlURI.
     * Detach all credentials of this type from passed curi.
     *
     * @param curi
     * @return True if we detached references.
     */
    public boolean detachAll(CrawlURI curi) {
        boolean result = false;
        Iterator<Credential> iter = curi.getCredentials().iterator();
        while (iter.hasNext()) {
            Credential cred = iter.next();
            if (cred.getClass() ==  this.getClass()) {
                iter.remove();
                result = true;
            }
        }
        return result;
    }

    /**
     * Checks whether the CrawlURI is itself the prerequisite (for example,
     * the login page) of this credential.
     * @param curi CrawlURI to look at.
     * @return True if this credential IS a prerequisite for passed
     * CrawlURI.
     */
    public abstract boolean isPrerequisite(CrawlURI curi);

    /**
     * Checks whether this credential has an authentication URI that is a
     * prerequisite of the CrawlURI.
     * @param curi CrawlURI to look at.
     * @return True if this credential HAS a prerequisite for passed CrawlURI.
     */
    public abstract boolean hasPrerequisite(CrawlURI curi);

    /**
     * Gets the authentication URI that is a prerequisite of the CrawlURI.
     * Return the authentication URI, either absolute or relative, that serves
     * as prerequisite the passed <code>curi</code>.
     *
     * @param curi CrawlURI to look at.
     * @return Prerequisite URI for the passed curi.
     */
    public abstract String getPrerequisite(CrawlURI curi);

    /**
     * Gets the key that identifies this credential.
     * @param context Context to use when searching for credential domain.
     * @return Key that is unique to this credential type.
     * @throws AttributeNotFoundException
     */
    public abstract String getKey();

    /**
     * Checks whether this credential must be offered on every visit to the
     * server.
     * @return True if this credential is of the type that needs to be offered
     * on each visit to the server (e.g. Rfc2617 is such a type).
     */
    public abstract boolean isEveryTime();

    /**
     * Adds the authentication parameters to the HttpMethod.
     * @param curi CrawlURI to as for context.
     * @param http Instance of httpclient.
     * @param method Method to populate.
     * @return True if added a credentials.
     */
    public abstract boolean populate(CrawlURI curi, HttpClient http,
        HttpMethod method);

    /**
     * Whether the credential is submitted with POST.
     * @param curi CrawlURI to look at.
     * @return True if this credential is to be posted.  Return false if the
     * credential is to be GET'd or if POST'd or GET'd are not pretinent to this
     * credential type.
     */
    public abstract boolean isPost();

    /**
     * Checks whether the name of the CrawlServer for the CrawlURI matches
     * this credential's domain (used to rule out CrawlURIs that do not need
     * this credential).
     * Test passed curi matches this credentials rootUri.
     * @param controller
     * @param curi CrawlURI to test.
     * @return True if domain for credential matches that of the passed curi.
     */
    public boolean rootUriMatch(ServerCache cache, 
            CrawlURI curi) {
        String cd = getDomain();

        CrawlServer serv = cache.getServerFor(curi.getUURI());
        String serverName = serv.getName();
//        String serverName = controller.getServerCache().getServerFor(curi).
//            getName();
        logger.fine("RootURI: Comparing " + serverName + " " + cd);
        return cd != null && serverName != null &&
            serverName.equalsIgnoreCase(cd);
    }

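Together, these methods attach the current credential to a CrawlURI, detach it again, add the credential's parameters to an HttpMethod, check whether the server name of the CrawlURI matches the credential's domain, and so on.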
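The difference between detach and detachAll is easy to miss: detach removes only this credential instance, while detachAll removes every credential whose runtime class is the same as this one's. A minimal sketch of that behaviour, using hypothetical stand-ins (MiniCrawlUri, MiniCredential and its two subclasses) instead of the real Heritrix classes:

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Hypothetical stand-in for CrawlURI: it only holds a set of credentials.
class MiniCrawlUri {
    private final Set<MiniCredential> credentials = new HashSet<>();
    Set<MiniCredential> getCredentials() { return credentials; }
}

abstract class MiniCredential {
    void attach(MiniCrawlUri curi) { curi.getCredentials().add(this); }

    // Removes just this instance.
    boolean detach(MiniCrawlUri curi) { return curi.getCredentials().remove(this); }

    // Removes every credential whose runtime class equals this one's.
    boolean detachAll(MiniCrawlUri curi) {
        boolean result = false;
        Iterator<MiniCredential> iter = curi.getCredentials().iterator();
        while (iter.hasNext()) {
            if (iter.next().getClass() == this.getClass()) {
                iter.remove();
                result = true;
            }
        }
        return result;
    }
}

class MiniFormCredential extends MiniCredential {}
class MiniHttpAuthCredential extends MiniCredential {}

class AttachDemo {
    public static void main(String[] args) {
        MiniCrawlUri curi = new MiniCrawlUri();
        MiniCredential formA = new MiniFormCredential();
        MiniCredential formB = new MiniFormCredential();
        MiniCredential basic = new MiniHttpAuthCredential();
        formA.attach(curi);
        formB.attach(curi);
        basic.attach(curi);
        formA.detachAll(curi); // drops both form credentials
        // Only the HTTP authentication credential is left.
        System.out.println(curi.getCredentials().size());
    }
}
```

The comparison uses getClass() rather than instanceof, so a subclass of a credential type would not be removed by its parent's detachAll.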

HtmlFormCredential extends the Credential class above and provides form authentication for a CrawlURI. Its implementation of the relevant methods is as follows:

    /**
     * Full URI of page that contains the HTML login form we're to apply these
     * credentials too: E.g. http://www.archive.org
     */
    String loginUri = "";
    public String getLoginUri() {
        return this.loginUri;
    }
    public void setLoginUri(String loginUri) {
        this.loginUri = loginUri;
    }

    /**
     * Form items.
     */
    Map<String,String> formItems = new HashMap<String,String>();
    public Map<String,String> getFormItems() {
        return this.formItems;
    }
    public void setFormItems(Map<String,String> formItems) {
        this.formItems = formItems;
    }

    enum Method {
        GET,
        POST
    }
    /**
     * GET or POST.
     */
    Method httpMethod = Method.POST;
    public Method getHttpMethod() {
        return this.httpMethod;
    }
    public void setHttpMethod(Method method) {
        this.httpMethod = method; 
    }

    /**
     * Constructor.
     */
    public HtmlFormCredential() {
    }

    public boolean isPrerequisite(final CrawlURI curi) {
        boolean result = false;
        String curiStr = curi.getUURI().toString();
        String loginUri = getPrerequisite(curi);
        if (loginUri != null) {
            try {
                // Login URL, resolved against the URI of curi
                UURI uuri = UURIFactory.getInstance(curi.getUURI(), loginUri);
                if (uuri != null && curiStr != null &&
                        uuri.toString().equals(curiStr)) {
                    result = true;
                    if (!curi.isPrerequisite()) {
                        curi.setPrerequisite(true);
                        logger.fine(curi + " is prereq.");
                    }
                }
            } catch (URIException e) {
                logger.severe("Failed to uuri: " + curi + ", " +
                    e.getMessage());
            }
        }
        return result;
    }

    public boolean hasPrerequisite(CrawlURI curi) {
        return getPrerequisite(curi) != null;
    }

    public String getPrerequisite(CrawlURI curi) {
        return getLoginUri();
    }

    public String getKey() {
        return getLoginUri();
    }

    public boolean isEveryTime() {
        // This authentication is one time only.
        return false;
    }

    public boolean populate(CrawlURI curi, HttpClient http, HttpMethod method) {
        // http is not used
        boolean result = false;
        Map<String,String> formItems = getFormItems();
        if (formItems == null || formItems.size() <= 0) {
            try {
                logger.severe("No form items for " + method.getURI());
            } catch (URIException e) {
                logger.severe("No form items and exception getting uri: " +
                    e.getMessage());
            }
            return result;
        }

        NameValuePair[] data = new NameValuePair[formItems.size()];
        int index = 0;
        String key = null;
        for (Iterator<String> i = formItems.keySet().iterator();
                i.hasNext();) {
            key = i.next();
            data[index++] = new NameValuePair(key,
                (String)formItems.get(key));
        }
        if (method instanceof PostMethod) {
            ((PostMethod)method).setRequestBody(data);
            result = true;
        } else if (method instanceof GetMethod) {
            // Append these values to the query string.
            // Get current query string, then add data, then get it again
            // only this time its our data only... then append.
            HttpMethodBase hmb = (HttpMethodBase)method;
            String currentQuery = hmb.getQueryString();
            hmb.setQueryString(data);
            String newQuery = hmb.getQueryString();
            hmb.setQueryString(
                ((StringUtils.isNotEmpty(currentQuery))
                    ? currentQuery + "&"
                    : "") + newQuery);
            result = true;
        } else {
            logger.severe("Unknown method type: " + method);
        }
        return result;
    }

    public boolean isPost() {
        return Method.POST.equals(getHttpMethod());
    }

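I have already annotated what each of these methods does on the corresponding abstract method in Credential, so I won't repeat that here.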
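The GET branch of populate is the least obvious part: it keeps whatever query string the request already has and appends the encoded form items after an "&". The sketch below reproduces only that string handling, using the JDK alone. FormQuerySketch and its two methods are hypothetical names, and java.net.URLEncoder is used here as a stand-in for HttpClient's own encoding, which may differ in detail.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormQuerySketch {
    // Encode the form items as name=value pairs joined with '&'.
    static String encode(Map<String, String> formItems) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : formItems.entrySet()) {
                if (sb.length() > 0) sb.append('&');
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 is always supported", ex);
        }
        return sb.toString();
    }

    // Mirrors the GET branch: keep the existing query, then append the form data.
    static String appendToQuery(String currentQuery, Map<String, String> formItems) {
        String newQuery = encode(formItems);
        boolean hasCurrent = currentQuery != null && !currentQuery.isEmpty();
        return (hasCurrent ? currentQuery + "&" : "") + newQuery;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order so the output is predictable.
        Map<String, String> items = new LinkedHashMap<>();
        items.put("login", "mylogin");
        items.put("password", "my password");
        System.out.println(appendToQuery("lang=en", items));
    }
}
```

Note that HtmlFormCredential stores its form items in a plain HashMap, so the order in which the pairs appear in the real request is not guaranteed.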

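The other credential class, HttpAuthenticationCredential, provides Basic/Digest HTTP authentication. I won't walk through its source in detail; it is not hard to follow if you compare it with the authentication logic of HtmlFormCredential.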
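For readers unfamiliar with the Basic scheme, the sketch below shows what a login and password turn into on the wire. It is not Heritrix code and makes no claim about how HttpAuthenticationCredential is implemented; BasicAuthSketch and basicAuthorization are hypothetical names, and only the JDK is used. It also illustrates why such a credential has to be offered on every visit (compare isEveryTime above): the header travels with each request, whereas a form login is submitted once.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class BasicAuthSketch {
    // Value of the Authorization header for the Basic scheme:
    // "Basic " + base64(login + ":" + password).
    static String basicAuthorization(String login, String password) {
        String pair = login + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Placeholder credentials, same values as in the configuration example.
        System.out.println("Authorization: "
                + basicAuthorization("mylogin", "mypassword"));
    }
}
```

Digest authentication is more involved (the server issues a nonce and the client answers with a hash), so it is left out of this sketch.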

The official Heritrix 3.1.0 reference documentation gives examples of configuring both authentication methods in crawler-beans.cxml. The keywords in the official examples are wrong; in the version below, the property names follow the setters shown above (loginUri, formItems) and the id of the form credential bean matches the value-ref used in the credential store. To use the HTTP authentication credential, it needs its own entry in the credentials map as well.

<bean id="credentialStore"
   class="org.archive.modules.credential.CredentialStore">
     <property name="credentials">
       <map>
         <entry key="formCredential" value-ref="formCredential" />
       </map>
     </property>
</bean>

<bean id="formCredential"
   class="org.archive.modules.credential.HtmlFormCredential">
    <property name="domain" value="example.com" />
    <property name="loginUri" value="http://example.com/login"/>
    <property name="formItems">
        <map>
            <entry key="login" value="mylogin"/>
            <entry key="password" value="mypassword"/>
            <entry key="submit" value="submit"/>
        </map>
    </property>
</bean>

<bean id="credential"
  class="org.archive.modules.credential.HttpAuthenticationCredential">
    <property name="domain"><value>domain</value></property>
    <property name="realm"><value>myrealm</value></property>
    <property name="login"><value>mylogin</value></property>
    <property name="password"><value>mypassword</value></property>
</bean>

---------------------------------------------------------------------------

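This Heritrix 3.1.0 source code analysis series is my own original work.

If you repost it, please credit the source: 博客园 (cnblogs), 刺猬的温驯

Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/28/3049042.html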
