part 1: resemblance with the jaccard coefficient

huh?

i started working on another rss feed classification technique, using a duplicate detection algorithm to classify articles.

the idea is that an article can be classified by determining which class it is most likely a duplicate of.

however, halfway through i realised this technique could help with a problem we were seeing at work, so i switched to working on that data instead.

it's a bit sad, i know, but data is data and it's still an interesting problem.

i'll use nothing but publicly available data for this, and if it looks promising i might get a chance to work on it further during business hours!

all discussed ruby/c++ code is available from http://github.com/matpalm/resemblance

so what is the actual problem?

given two very similar business name/address pairs, can we decide if they are actually the same company?

let's consider some examples...

eg 1

Burra Hotel, 5 Market Sq, Burra, SA, 5417

Camping Country Superstore, 401 Pacific Hwy, Belmont North, NSW, 2280

it's pretty obvious these are not the same company. next!

eg 2

One Stop Bakery, 1304 High St Rd, Wantirna, VIC, 3152

One Stop Bakery, 1304 High Street Rd, Wantirna South, VIC, 3152

i think these are the same, it's just that one is using an abbreviation for street.

eg 3

Park Beach Interiors, Showroom Park Beach Plaza Pacific Hwy, Coffs Harbour, NSW, 2450

Park Beach Interiors, Showroom Park Beach Plaza Pacific Highway, Coffs Harbour, NSW, 2450

Park Beach Interiors, Park Beach Plaza Pacific Hwy, Coffs Harbour, NSW, 2450

Park Beach Interiors, 26 Park Beach Plaza, Pacific Hwy, Coffs Harbour, NSW, 2450

i think these are all the same.

eg 4

Weaver Interiors, 955 Pacific Hwy, Pymble, NSW, 2073

Weaver Interiors, 997 Pacific Hwy, Pymble, NSW, 2073

this pair is interesting... they might be the same, but maybe not...

eg 5

Gibbon Hamor Commercial Interiors, 233 Johnston St, Annandale, NSW, 2038

Gibbon Hamor Development Planners, 233 Johnston St, Annandale, NSW, 2038

this pair is also interesting for the same reasons.

shingling

shingling is a way of generating a set that represents a bit of data which can be used for comparisons

eg. the 4 word bigram shingles of "the cat sat on the cat" are...

the cat

cat sat

sat on

on the

(note: this is a set so we only count the shingle "the cat" once)
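as a rough illustration, word shingling can be sketched in a few lines of ruby. the method name word_shingles is made up for this example and isn't necessarily what the code in the repo uses.

```ruby
require 'set'

# word n-gram shingles of a string, returned as a set so repeats collapse
# (word_shingles is a hypothetical helper name)
def word_shingles(text, n = 2)
  text.split.each_cons(n).map { |words| words.join(' ') }.to_set
end

shingles = word_shingles('the cat sat on the cat')
puts shingles.to_a.inspect  # ["the cat", "cat sat", "sat on", "on the"]
puts shingles.size          # 4
```

the sentence has five consecutive word pairs, but "the cat" appears twice so the set ends up with four members.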

the jaccard index

the jaccard index is a simple measure of how similar two sets are.

it's simply the ratio of the size of the intersection of the sets to the size of the union of the sets.

eg. if J(A,B) is jaccard index between sets A and B

and A = {1,2,3}, B = {2,3,4}, C = {4,5,6},

then J(A,B) = 2/4 = 0.5,

and J(A,C) = 0/6 = 0,

and J(B,C) = 1/5 = 0.2

so the most "similar" sets are A and B and the least similar are A and C

(note also J(A,A) = J(B,B) = J(C,C) = 1)
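the same sums can be checked with ruby's Set class. this is only a minimal sketch; the jaccard method name and the choice to treat two empty sets as identical are assumptions of the example.

```ruby
require 'set'

# size of intersection over size of union
def jaccard(a, b)
  union = a | b
  return 1.0 if union.empty? # assumption: two empty sets count as identical
  (a & b).size.to_f / union.size
end

a = Set[1, 2, 3]
b = Set[2, 3, 4]
c = Set[4, 5, 6]

puts jaccard(a, b)  # 0.5
puts jaccard(a, c)  # 0.0
puts jaccard(b, c)  # 0.2
puts jaccard(a, a)  # 1.0
```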

putting it all together

so given two business name/addresses we can build a shingle set for each and use the jaccard index to decide how similar they are.

we'll use bigrams for building our sets, but let's use character bigrams, not word bigrams.

this is because the documents are quite small and we want to include punctuation in the comparisons...

let's run through our above examples again...

eg 1

Burra Hotel, 5 Market Sq, Burra, SA, 5417

is represented by the set of character bigram shingles

{" 5", " B", " H", " M", " S", ", ", "17", "41", "5 ", "54", "A,", "Bu", "Ho", "Ma", "SA", "Sq", "a ", "a,", "ar", "el", "et", "ke", "l,", "ot", "q,", "ra", "rk", "rr", "t ", "te", "ur"}

Camping Country Superstore, 401 Pacific Hwy, Belmont North, NSW, 2280

is represented by the set of character bigram shingles

{" 2", " 4", " B", " C", " H", " N", " P", " S", ", ", "01", "1 ", "22", "28", "40", "80", "Be", "Ca", "Co", "Hw", "NS", "No", "Pa", "SW", "Su", "W,", "ac", "am", "c ", "ci", "e,", "el", "er", "fi", "g ", "h,", "ic", "if", "in", "lm", "mo", "mp", "ng", "nt", "on", "or", "ou", "pe", "pi", "re", "rs", "rt", "ry", "st", "t ", "th", "to", "tr", "un", "up", "wy", "y ", "y,"}

they have an intersection size of 6 shingles and a union size of 87 shingles, hence a jaccard index of 6/87 = 0.068
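the numbers for this first pair can be reproduced with a short ruby sketch. char_shingles is a hypothetical helper name, and the actual code in the repo may build its sets differently.

```ruby
require 'set'

# character n-gram shingles of a string (char_shingles is a made-up name)
def char_shingles(text, n = 2)
  text.each_char.each_cons(n).map(&:join).to_set
end

a = char_shingles('Burra Hotel, 5 Market Sq, Burra, SA, 5417')
b = char_shingles('Camping Country Superstore, 401 Pacific Hwy, Belmont North, NSW, 2280')

intersection_size = (a & b).size
union_size = (a | b).size

puts a.size             # 31
puts b.size             # 62
puts intersection_size  # 6
puts union_size         # 87
puts intersection_size.to_f / union_size
```

the ratio comes out at roughly 0.069; the 0.068 quoted above is the same value truncated to three places rather than rounded.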

eg 2

One Stop Bakery, 1304 High St Rd, Wantirna, VIC, 3152 and

One Stop Bakery, 1304 High Street Rd, Wantirna South, VIC, 3152

have an intersection size of 46 shingles and a union size of 57 shingles, hence a jaccard index of 46/57 = 0.807

eg 3

a) Park Beach Interiors, Showroom Park Beach Plaza Pacific Hwy, Coffs Harbour, NSW, 2450

b) Park Beach Interiors, Showroom Park Beach Plaza Pacific Highway, Coffs Harbour, NSW, 2450

c) Park Beach Interiors, Park Beach Plaza Pacific Hwy, Coffs Harbour, NSW, 2450

d) Park Beach Interiors, 26 Park Beach Plaza, Pacific Hwy, Coffs Harbour, NSW, 2450

have indexes J(a,b) = 0.888, J(a,c) = 0.861, J(a,d) = 0.808, J(b,c) = 0.760, J(b,d) = 0.716, J(c,d) = 0.932

eg 4

Weaver Interiors, 955 Pacific Hwy, Pymble, NSW, 2073 and

Weaver Interiors, 997 Pacific Hwy, Pymble, NSW, 2073

have an intersection size of 43 shingles and a union size of 49 shingles, hence a jaccard index of 43/49 = 0.877

eg 5

Gibbon Hamor Commercial Interiors, 233 Johnston St, Annandale, NSW, 2038 and

Gibbon Hamor Development Planners, 233 Johnston St, Annandale, NSW, 2038

have an intersection size of 49 shingles and a union size of 76 shingles, hence a jaccard index of 49/76 = 0.644

conclusion

though there is no obvious magic cutoff point, the jaccard index seems to give pretty good values.

it would find some obvious duplicates, though it would require a bit of human double checking to make sure.

here's a histogram of the frequency of resemblance values from the comparison of all pairs of 2000 name addresses

(a total of 1,999,000 comparisons and notice the y scale is logarithmic)

algorithmic discussion

order n squared sucks

the jaccard coefficient is, unfortunately, not transitive

(ie knowing J(A,B) and J(B,C) doesn't tell us what J(A,C) is)

naively then, determining the pair with the highest similarity requires that we compare every element with every other element.

this is O(n²) and O(n²) sucks since we are looking at (n(n-1))/2 comparisons, joy!
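a minimal sketch of the brute force approach in ruby, using three of the example records from earlier. the helper names are made up for this example and it isn't meant to be the code from the repo.

```ruby
require 'set'

# character n-gram shingles of a string (hypothetical helper)
def char_shingles(text, n = 2)
  text.each_char.each_cons(n).map(&:join).to_set
end

# size of intersection over size of union
def jaccard(a, b)
  (a & b).size.to_f / (a | b).size
end

records = [
  'Weaver Interiors, 955 Pacific Hwy, Pymble, NSW, 2073',
  'Weaver Interiors, 997 Pacific Hwy, Pymble, NSW, 2073',
  'Burra Hotel, 5 Market Sq, Burra, SA, 5417'
]
sets = records.map { |record| char_shingles(record) }

# every unordered pair of record indexes: n(n-1)/2 of them
pairs = (0...records.size).to_a.combination(2).to_a
best = pairs.max_by { |i, j| jaccard(sets[i], sets[j]) }

puts pairs.size    # 3
puts best.inspect  # [0, 1]
```

the shingle sets are built once up front, so the quadratic part is only the jaccard calculation itself.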

let's examine some of the ruby runtimes

num records  comparisons  time
50           1,225        0.2s
100          4,950        0.9s
250          31,125       5.6s
500          124,750      24s
750          280,875      52s
2000         1,999,000    6m 57s

and just say i ran this over a subset of the full data, say, 1,000,000 records

it would be 499,999,500,000 comparisons

and at about 300,000 per minute we'll be here till christmas (2011)

(luckily the actual data allows me to do something which reduces the runtime to O(n), but i'm not going to talk about it outside of work)
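the back of the envelope sums for the million record case can be checked like this. the 300,000 per minute rate is just the rough ruby throughput quoted above, and the names are made up for the example.

```ruby
# number of unordered pairs among n records
def comparisons(n)
  n * (n - 1) / 2
end

RATE_PER_MINUTE = 300_000 # rough ruby throughput quoted above

total = comparisons(1_000_000)
minutes = total / RATE_PER_MINUTE
days = minutes / (60.0 * 24)

puts comparisons(2000)  # 1999000
puts total              # 499999500000
puts days.round         # 1157
```

so a bit over three years of runtime on a single machine.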

bit level optimisation in c++

i decided to reimplement this in c++ and go the whole hog by using a bit level representation of the data to wring everything out of the machine.

the big question is: how to optimise the jaccard index calculation? it's where the time is spent.

consider the shingle sets for "cat" and "mat", ie {"ca","at"} and {"ma","at"}

we can convert shingles to ints by taking all the unique ones and mapping them to ints from a sequence starting at 0

ie { "ca" => 0, "at" => 1, "ma" => 2}

giving us the two equivalent shingle sets {0,1} and {2,1}

finally we can use the values in these sets to set bits in a nibble

giving us the two nibbles 0011 (setting bits 0 and 1) and 0110 (setting bits 2 and 1)
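the mapping is easy to try out in ruby before worrying about c++. this is only a sketch of the idea, with made-up names (SHINGLE_IDS, to_bits), and it leans on ruby integers rather than fixed width words.

```ruby
# hand each distinct shingle the next free integer id, starting at 0
SHINGLE_IDS = Hash.new { |hash, shingle| hash[shingle] = hash.size }

# turn a list of shingles into an integer with one bit set per shingle id
def to_bits(shingles)
  shingles.reduce(0) { |bits, shingle| bits | (1 << SHINGLE_IDS[shingle]) }
end

cat = to_bits(%w[ca at])
mat = to_bits(%w[ma at])

puts SHINGLE_IDS.inspect
puts format('%04b', cat)  # 0011
puts format('%04b', mat)  # 0110
```

ids are handed out in the order shingles are first seen, so "ca" gets 0, "at" gets 1 and "ma" gets 2, matching the mapping above.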

now consider the bit representations and the results of the bitwise operators | and &

  0011 (equivalent to {"ca","at"})

  0110 (equivalent to {"ma","at"})

& 0010 => and'ing the bit strings gives us their intersection!

| 0111 => or'ing the bit strings gives us their union!

the number of bits set in 0010 (size of intersection) is 1 and

the number of bits set in 0111 (size of union) is 3

so the jaccard index of "cat" and "mat" is 1/3
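here is the same calculation as a ruby sketch, just to show the idea. it is not the c++ from the repo, and counting bits via a binary string is a convenience for the example, not something you'd do for speed.

```ruby
# count set bits by counting '1' characters in the binary representation
# (fine for a demo with non-negative integers, far too slow for real use)
def popcount(bits)
  bits.to_s(2).count('1')
end

# jaccard index of two sets stored as bitmasks;
# assumes at least one bit is set across the two masks
def bit_jaccard(x, y)
  popcount(x & y).to_f / popcount(x | y)
end

cat = 0b0011
mat = 0b0110

puts format('%04b', cat & mat)  # 0010
puts format('%04b', cat | mat)  # 0111
puts bit_jaccard(cat, mat)      # one third
```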

note: we can count the number of bits set with a crazy bit of c like

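inline int count_number_bits_set(unsigned long l) {
    int c;
    // l &= l-1 clears the lowest set bit, so the loop runs once per set bit
    // (the argument is unsigned so that l-1 can never overflow)
    for (c = 0; l; c++)
        l &= l - 1;
    return c;
}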

(thanks to brian kernighan for that one)

using this method we can calculate the union or intersection of a 4 byte long (ie 32 set elements) in a single | or &!

bamm!

finally we can use the awesome openmp support (available as part of gcc since 4.2)

with two additional lines of code (both pragma statements) we can give the compiler hints about where the code can be multithreaded

num records  comparisons  ruby time  c++ time  c++ openmp time
50           1,225        0.29s      0.008s    0.013s
100          4,950        0.97s      0.01s     0.013s
250          31,125       5.5s       0.04s     0.04s
500          124,750      22s        0.12s     0.09s
1000         499,500      1m 30s     0.37s     0.2s
2000         1,999,000    6m 34s     1.2s      0.5s
4000         7,998,000    ?          7.4s      1.8s
8000         31,996,000   ?          21s       6.2s
16000        127,992,000  ?          ?         26s

so the ruby code is getting about 5,000 comparisons a second

the single threaded c++ implementation is getting about 1,500,000 a second

and the c++ implementation using openmp on a quad core box (utilising about 350% cpu) is getting about 5,000,000 a second

compared to the ruby version this is a speed up of about 1,000 times

booya! that's more like it!

now let's consider the jaccard distance, after which we'll consider the simhash algorithm as a way of avoiding all that O(n²) nastiness.
