Larbin[1]hashtable checker 源代码分析

     昨天看Larbin源代码,觉得Larbin根本没有用bloom filter算法,他只hash了一次。不过他的按位保存的那段代码还是蛮精辟的。今天上网,发现了这位仁兄的博客,他也觉得Larbin没有用bloom Filter算法,而且他的blog对Larbin的重要之处都做了解释。现在转载过来。  原文出处是:http://quweiprotoss.blog.163.com/blog/static/4088288320103190243558/

      在larbin里判断用一个URL是否被抓取过,用的是bloom filter算法(至少网上的人这么说),但是我感觉与《数学之美系列二十一-布隆过滤器(Bloom Filter)》中所介绍的算法有着很大的不同,因为larbin中只是简单用了hash方法,它有点像位图法,但我的算法很一般,意见仅供参考。

Larbin[1]hashtable checker 源代码分析_第1张图片

       先看一下hashTable的类定义:

      

class hashTable {

private:

    ssize_t size;

    char *table;

 

public:

    /* constructor */

    hashTable(bool create);

 

    /* destructor */

    ~hashTable();

 

    /* save the hashTable in a file */

    void save();

 

    /* test if this url is allready in the hashtable

     * return true if it has been added

     * return false if it has allready been seen

     */

    bool test(url *U);

 

    /* set a url as present in the hashtable

     */

    void set(url *U);

 

    /* add a new url in the hashtable

     * return true if it has been added

     * return false if it has allready been seen

     */

    bool testSet(url *U);

};

       它只有两个成员变量,一个是它的大小,一个是它的内容。Save函数是把hashTable保存到一个文件中,test函数是判断一个url是否已经存在于这个hashtableset函数是将这个urlhashtable中标明已经见过了,testSet判断这个url有没有被加入过。     

/* constructor */

hashTable::hashTable(bool create) {

    ssize_t total = hashSize / 8;

    table = new char[total];

    if (create) {

       for (ssize_t i = 0; i < hashSize / 8; i++) {

           table[i] = 0;

       }

    } else {

       int fds = open("hashtable.bak", O_RDONLY);

       if (fds < 0) {

           cerr << "Cannot find hashtable.bak, restart from scratch\n";

           for (ssize_t i = 0; i < hashSize / 8; i++) {

              table[i] = 0;

           }

       } else {

           ssize_t sr = 0;

           while (sr < total) {

              ssize_t tmp = read(fds, table + sr, total - sr);

              if (tmp <= 0) {

                  cerr << "Cannot read hashtable.bak : " << 

strerror(errno) << endl;

                  exit(1);

              } else {

                  sr += tmp;

              }

           }

           close(fds);

       }

    }

}

       hashSize定义在type.h中,它的大小为64000000,因为是用位图法,一个字节有8位,所以除以8。如果是创建一个hashtable,就将table中所有值转置空,否则,去读hashtable.bak中的内容。

/* save the hashTable in a file */

void hashTable::save() {

    rename("hashtable.bak", "hashtable.old");

    int fds = creat("hashtable.bak", 00600);

    if (fds >= 0) {

       ecrireBuff(fds, table, hashSize / 8);

       close(fds);

    }

    unlink("hashtable.old");

}

  ecrire是法语的write,就是将一个char*buff写入文件的函数,unlink将文件的连接数减少一个,如果当前文件的连接数目为0,并且没有其他程序打开这个文件,则删除,而remove则将文件直接删除。

/* test if this url is allready in the hashtable

 * return true if it has been added

 * return false if it has allready been seen

 */

bool hashTable::test(url *U) {

    unsigned int code = U->hashCode();

    unsigned int pos = code / 8;

    unsigned int bits = 1 << (code % 8);

    return table[pos] & bits;

}

这里计算得到URLhash code,后计算它在哪个字节上,再看它在它个bit上,最后判断这个bit是否已经置过1。

/* return a hashcode for this url */

uint url::hashCode() {

    unsigned int h = port;

    unsigned int i = 0;

    while (host[i] != 0) {

       h = 31 * h + host[i];

       i++;

    }

    i = 0;

    while (file[i] != 0) {

       h = 31 * h + file[i];

       i++;

    }

    return h % hashSize;

}

       代码在utils/url.c。这里是将host name和后面部分做hashnutch里算这个hash code值的时候,是反过来算的,因为这样同一hosturlhashcode的值就会有更大的差异,是不是会更好些呢?这里乘31是经典做法,java中也是这样的。      

/* set a url as present in the hashtable

 */

void hashTable::set(url *U) {

    unsigned int code = U->hashCode();

    unsigned int pos = code / 8;

    unsigned int bits = 1 << (code % 8);

    table[pos] |= bits;

}

      这里有test函数差不多,只是最后用或运算置位。

       checker.h中声明了两个函数:

/** check if an url is already known

 * if not send it

 * @param u the url to check

 */

void check(url *u);

 

/** Check the extension of an url

 * @return true if it might be interesting, false otherwise

 */

bool filter1(char *host, char *file);

check的实现如下:

void check(url *u) {

    if (global::seen->testSet(u)) {

       hashUrls(); // stat

       // where should this link go ?

#ifdef SPECIFICSEARCH

       if (privilegedExts[0] != NULL

              && matchPrivExt(u->getFile())) {

           interestingExtension();

           global::URLsPriority->put(u);

       } else {

           global::URLsDisk->put(u);

       }

#else // not a SPECIFICSEARCH

       global::URLsDisk->put(u);

#endif

    } else {

       // This url has already been seen

       answers(urlDup);

       delete u;

    }

}

Global::seen是一个hashtable对象,用它来判断这个URL是否已经见过。

这里SPECIFICSEARCH是判断是我们我们想得到的那种格式,用URLsPriority处理,否则用URLsDisk处理,

/** Check the extension of an url

 * @return true if it might be interesting, false otherwise

 */

bool filter1(char *host, char *file) {

    int i = 0;

    if (global::domains != NULL) {

       bool ok = false;

       while ((*global::domains)[i] != NULL) {

           ok = ok || endWith((*global::domains)[i], host);

           i++;

       }

       if (!ok) {

           return false;

       }

    }

    i = 0;

    int l = strlen(file);

    if (endWithIgnoreCase("html", file, l) || file[l - 1] == '/'

           || endWithIgnoreCase("htm", file, l)) {

       return true;

    }

    while (global::forbExt[i] != NULL) {

       if (endWithIgnoreCase(global::forbExt[i], file, l)) {

           return false;

       }

       i++;

    }

    return true;

}

Filter1的实现如下:

/** Check the extension of an url

 * @return true if it might be interesting, false otherwise

 */

bool filter1(char *host, char *file) {

    int i = 0;

    if (global::domains != NULL) {

       bool ok = false;

       while ((*global::domains)[i] != NULL) {

           ok = ( ok || endWith((*global::domains)[i], host) );

           i++;

       }

       if (!ok) {

           return false;

       }

    }

    i = 0;

    int l = strlen(file);

    if (endWithIgnoreCase("html", file, l) || file[l - 1] == '/'

           || endWithIgnoreCase("htm", file, l)) {

       return true;

    }

    while (global::forbExt[i] != NULL) {

       if (endWithIgnoreCase(global::forbExt[i], file, l)) {

           return false;

       }

       i++;

    }

    return true;

}

前半部分是判断domain是否是配置中提到的,后面是判断这种格式是不是配置中禁止的,htmlhtm是永真的。感觉上面的代码是不是可以写成:

while ((*global::domains)[i] != NULL) {

           if( endWith((*global::domains)[i], host) )

                  return false;

           i++;

       }

配置在larbin.conf中,forbiddenExtensions是所有禁止的类型,注释上写到不要禁止htmhtml,你也禁止不了。

# Do you want to limit your search to a specific domain ?

# if yes, uncomment the following line

#limitToDomain .fr .dk .uk end

 

# What are the extensions you surely don't want

# never forbid .html, .htm and so on : larbin needs them

forbiddenExtensions

.tar .gz .tgz .zip .Z .rpm .deb

.ps .dvi .pdf

.png .jpg .jpeg .bmp .smi .tiff .gif

.mov .avi .mpeg .mpg .mp3 .qt .wav .ram .rm

.jar .java .class .diff

.doc .xls .ppt .mdb .rtf .exe .pps .so .psd

end












 

     

你可能感兴趣的:(File,filter,table,url,extension,代码分析)