A simple simhash algorithm

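simhash is a widely used algorithm for estimating text similarity. Most descriptions online use a 64-bit signature; here I use times33 as the ordinary hash function and a 32-bit signature. The algorithm is as follows: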

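#!/usr/bin/env perl
use strict;
use warnings;

# times33 hash: multiply the running value by 33, add the next character code,
# and keep the result within 32 bits.
sub hash {
    my ($input) = @_;
    my $hash = 5381;
    foreach my $char (split //, $input) {
        $hash = ($hash * 33 + ord($char)) & 0xFFFFFFFF;
    }
    return $hash;
}

# Build a 32-bit simhash signature from a list of tokens.
sub simhash {
    my @tokens = @_;
    my @counts = (0) x 32;    # one counter per bit position, index 0 = lowest bit

    foreach my $token (@tokens) {
        my $hash = hash($token);
        foreach my $bit (0 .. 31) {
            # A set bit votes +1 for its position, a clear bit votes -1.
            if ($hash & 0x1) {
                $counts[$bit]++;
            } else {
                $counts[$bit]--;
            }
            $hash >>= 1;
        }
    }

    # Assemble the signature from the highest bit down: positive count => 1.
    my $simhash = 0;
    foreach my $count (reverse @counts) {
        $simhash = ($simhash << 1) + ($count > 0 ? 1 : 0);
    }
    return $simhash;
}

#my @test  = qw(我 爱 吃 桔子);
#my @test2 = qw(我 喜欢 吃 苹果);
my @test  = qw(名 侦探 诅咒);
my @test2 = qw(名 侦探 蛋疼);

my $sim1 = simhash(@test);
my $sim2 = simhash(@test2);
printf "test=%x, test2=%x\n", $sim1, $sim2;

# Crude similarity: ratio of the smaller signature value to the larger.
my $simi = $sim1 < $sim2 ? $sim1 / $sim2 : $sim2 / $sim1;
print $simi . "\n";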
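
The listing above compares the two signatures by dividing the smaller numeric value by the larger. simhash signatures are more commonly compared by Hamming distance, that is, the number of bit positions in which they differ. Below is a minimal sketch of that comparison for 32-bit signatures. The names hamming_distance and similarity, and the two sample signatures, are made up for this illustration and are not part of the original code.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Count the bit positions in which two 32-bit signatures differ.
# (hamming_distance is a name chosen for this sketch.)
sub hamming_distance {
    my ($x, $y) = @_;
    my $diff  = ($x ^ $y) & 0xFFFFFFFF;    # 1 bits mark differing positions
    my $count = 0;
    while ($diff) {
        $count += $diff & 1;
        $diff >>= 1;
    }
    return $count;
}

# Map the distance to a score between 0 and 1; 1 means identical signatures.
# (similarity is also a name chosen for this sketch.)
sub similarity {
    my ($x, $y) = @_;
    return 1 - hamming_distance($x, $y) / 32;
}

# Made-up signatures, for illustration only.
my ($sig1, $sig2) = (0x0000000F, 0x000000F0);
printf "distance=%d, similarity=%.2f\n",
    hamming_distance($sig1, $sig2), similarity($sig1, $sig2);
# prints "distance=8, similarity=0.75"
```

To try it with real signatures, pass in the values returned by simhash() in place of the sample ones. Unlike the numeric ratio, this score does not depend on which bits happen to be the high-order ones.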
