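The Jaccard similarity index measures how similar two sets are. It is defined as the number of elements in the intersection of the two sets divided by the number of elements in their union.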
Mathematically, for two sets A and B:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
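import java.util.HashSet;
import java.util.Set;

/**
 * Created by bee on 17/4/12.
 */
public class JackcardSim {

    // Jaccard similarity: size of the intersection divided by size of the union.
    public static double calJackcardSim(Set<String> s1, Set<String> s2) {
        // Union: every element that appears in either set.
        Set<String> all = new HashSet<>();
        all.addAll(s1);
        all.addAll(s2);
        System.out.println(all);

        // Intersection: elements that appear in both sets.
        Set<String> both = new HashSet<>();
        both.addAll(s1);
        both.retainAll(s2);
        System.out.println(both);

        // Cast before dividing to avoid integer division.
        // Note: two empty sets give 0.0 / 0, which is NaN.
        return (double) both.size() / all.size();
    }

    public static void main(String[] args) {
        Set<String> s1 = new HashSet<>();
        s1.add("互联网");   // Internet
        s1.add("金融");     // finance
        s1.add("房产");     // real estate
        s1.add("融资");     // financing
        s1.add("科技");     // technology

        Set<String> s2 = new HashSet<>();
        s2.add("互联网");   // Internet
        s2.add("开源");     // open source
        s2.add("人工智能"); // artificial intelligence
        s2.add("软件");     // software
        s2.add("科技");     // technology

        System.out.println(calJackcardSim(s1, s2));
    }
}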
Output:
[科技, 房产, 软件, 融资, 人工智能, 互联网, 开源, 金融]
[科技, 互联网]
0.25
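As a quick check on the output above, the same value can be worked out by hand. Writing A for s1 and B for s2 (these letters are notation introduced here, not names from the code), the two sets share 2 elements, and their union has 5 + 5 - 2 = 8 elements:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
        = \frac{2}{5 + 5 - 2}
        = \frac{2}{8}
        = 0.25
```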
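The vector space model (VSM) is an algebraic model that represents text documents as vectors of identifiers (for example, index terms). It is used in information filtering, information retrieval, indexing, and relevance ranking.
Both documents and queries are represented as vectors.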
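To make "documents as vectors" concrete, here is a minimal sketch of one possible way to turn a piece of text into a term-weight vector. It assumes whitespace tokenization and raw term counts as weights; the class `TermVector` and the method `termFrequencies` are hypothetical names introduced for illustration, not part of the original code, and real systems often use other weightings such as TF-IDF.

```java
import java.util.HashMap;
import java.util.Map;

class TermVector {

    // Hypothetical helper (not from the original post): counts how often each
    // whitespace-separated token occurs and uses the count as the term weight.
    static Map<String, Double> termFrequencies(String text) {
        Map<String, Double> vector = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input splits into one empty token; skip it
            }
            // Add 1.0 to the existing weight, or start at 1.0 for a new term.
            vector.merge(token, 1.0, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Double> doc = termFrequencies("Hello css css Lucene Lucene Lucene");
        System.out.println(doc.get("Hello"));  // prints 1.0
        System.out.println(doc.get("css"));    // prints 2.0
        System.out.println(doc.get("Lucene")); // prints 3.0
    }
}
```

A map keyed by term is a sparse representation: terms that do not occur in the text are simply absent rather than stored with weight zero.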
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Created by bee on 17/4/10.
 */
public class Vsm {

    // Cosine similarity between two sparse term-weight vectors.
    public static double calCosSim(Map<String, Double> v1, Map<String, Double> v2) {
        double sclar = 0.0, norm1 = 0.0, norm2 = 0.0, similarity = 0.0;

        // Only terms present in both vectors contribute to the dot product.
        Set<String> v1Keys = v1.keySet();
        Set<String> v2Keys = v2.keySet();
        Set<String> both = new HashSet<>();
        both.addAll(v1Keys);
        both.retainAll(v2Keys);
        System.out.println(both);

        // Dot product over the shared terms.
        for (String str1 : both) {
            sclar += v1.get(str1) * v2.get(str1);
        }
        // norm1 and norm2 accumulate the squared Euclidean norms.
        for (String str1 : v1.keySet()) {
            norm1 += Math.pow(v1.get(str1), 2);
        }
        for (String str2 : v2.keySet()) {
            norm2 += Math.pow(v2.get(str2), 2);
        }
        // sqrt(norm1 * norm2) is the product of the two vector lengths.
        // Note: an empty or all-zero vector makes this 0.0 / 0.0, which is NaN.
        similarity = sclar / Math.sqrt(norm1 * norm2);

        System.out.println("sclar:" + sclar);
        System.out.println("norm1:" + norm1);
        System.out.println("norm2:" + norm2);
        System.out.println("similarity:" + similarity);
        return similarity;
    }

    public static void main(String[] args) {
        Map<String, Double> m1 = new HashMap<>();
        m1.put("Hello", 1.0);
        m1.put("css", 2.0);
        m1.put("Lucene", 3.0);

        Map<String, Double> m2 = new HashMap<>();
        m2.put("Hello", 1.0);
        m2.put("Word", 2.0);
        m2.put("Hadoop", 3.0);
        m2.put("java", 4.0);
        m2.put("html", 1.0);
        m2.put("css", 2.0);

        calCosSim(m1, m2);
    }
}
Output:
[css, Hello]
sclar:5.0
norm1:14.0
norm2:35.0
similarity:0.22587697572631282
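The printed values can be checked against the usual cosine similarity formula. Here v_1 and v_2 stand for the maps m1 and m2 (notation introduced here for illustration). Note that the printed norm1 and norm2 are squared norms, so the square root is taken over their product:

```latex
\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}
               = \frac{1 \cdot 1 + 2 \cdot 2}{\sqrt{(1^2 + 2^2 + 3^2)(1^2 + 2^2 + 3^2 + 4^2 + 1^2 + 2^2)}}
               = \frac{5}{\sqrt{14 \cdot 35}}
               = \frac{5}{\sqrt{490}}
               \approx 0.2259
```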