利用simhash来进行文本去重复

原文http://d3s.mff.cuni.cz/~holub/sw/shash/#a1

传统的hash函数能够将一样的文本生成一样的hash函数,但是,通过simhash方法,能够差不多相同的文档得到的hash函数也比较相近。

Charikar's hash

通过Charikar‘s hash,能够将比较相似度的文档得到比较相近的fingerprint。

该算法的流程如下:

    *  Document is split into tokens (words for example) 
or super-tokens (word tuples)
    * Each token is represented by its hash value; a traditional 
hash function is used
    * Weights are associated with tokens
    * A vector V of integers is initialized to 0, length of the vector 
corresponds to the desired hash size in bits
    * In a cycle for all token's hash values (h), vector V is updated:
          o ith element is decreased by token's weight if the ith bit of 
the hash h is 0, otherwise
          o ith element is increased by token's weight if the ith bit of 
the hash h is 1
    * Finally, signs of elements of V corresponds to the bits of the
 final fingerprint

    该hash不是将文档总体计算hash值,而是将文档中的每个token计算哈希值,对文档中每个token的hash值,按照 对hash值进行求和,如果当前token的hash值在该位上是0,则减去1,如果在该位上是1,则加上1.将所有的token按照这种方式累加,求的最终的值作为fingerprint。

python对应的代码如下:

#!/usr/bin/python

# Implementation of Charikar simhashes in Python
# See: http://dsrg.mff.cuni.cz/~holub/sw/shash/#a1

class simhash():
    def __init__(self, tokens='', hashbits=128):
        self.hashbits = hashbits
        self.hash = self.simhash(tokens)

    def __str__(self):
        return str(self.hash)

    def __long__(self):
        return long(self.hash)

    def __float__(self):
        return float(self.hash)

    def simhash(self, tokens):
        # Returns a Charikar simhash with appropriate bitlength
        v = [0]*self.hashbits

        for t in [self._string_hash(x) for x in tokens]:
            bitmask = 0
            print (t)
            for i in range(self.hashbits):
                bitmask = 1 << i
                print(t,bitmask, t & bitmask)
                if t & bitmask:
                    v[i] += 1 #查看当前bit位是否为1,是的话则将该位+1 
                else:
                    v[i] –= 1 #否则得话,该位减1

        fingerprint = 0
        for i in range(self.hashbits):
            if v[i] >= 0:
                fingerprint += 1 << i 
#整个文档的fingerprint为最终各个位大于等于0的位的和
        return fingerprint

    def _string_hash(self, v):
        # A variable-length version of Python's builtin hash
        if v == "":
            return 0
        else:
            x = ord(v[0])<<7
            m = 1000003
            mask = 2**self.hashbits-1
            for c in v:
                x = ((x*m)^ord(c)) & mask
            x ^= len(v)
            if x == -1: 
                x = -2
            return x

    def hamming_distance(self, other_hash):
        x = (self.hash ^ other_hash.hash) & ((1 << self.hashbits) - 1)
        tot = 0
        while x:
            tot += 1
            x &= x-1
        return tot

    def similarity(self, other_hash):
        a = float(self.hash)
        b = float(other_hash)
        if a>b: return b/a
        return a/b

if __name__ == '__main__':
    s = 'This is a test string for testing'
    hash1 =simhash(s.split())
    #print("0x%x" % hash1)
    #print ("%s/t0x%x" % (s, hash1))

    s = 'This is a test string for testing also!'
    hash2 = simhash(s.split())
    #print ("%s/t[simhash = 0x%x]" % (s, hash2))

    print (hash1.similarity(hash2), "percent similar")
    print (hash1.hamming_distance(hash2), "bits differ out of", hash1.hashbits)

上述代码运行的结果如下:

0.97584098811 percent similar
14 bits differ out of 128

你可能感兴趣的:(vector,String,文档,token,float,distance)