String Similarity Tools and Algorithms

1. fuzzywuzzy

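Introduction: JavaWuzzy is a Java port of FuzzyWuzzy, used to compute how closely two strings match.
FuzzySearch.ratio(String s1, String s2)
Full match; order-sensitive.
FuzzySearch.partialRatio(String s1, String s2)
Search match (partial match); order-sensitive.
FuzzySearch.tokenSortRatio(String s1, String s2)
Sorts the tokens first, then does a full match; order-insensitive (the score stays high even when the words are rearranged).
FuzzySearch.tokenSortPartialRatio(String s1, String s2)
Sorts the tokens first, then does a search match (partial match); order-insensitive.
FuzzySearch.tokenSetRatio(String s1, String s2)
Builds token sets first (dropping duplicate words), then does a full match; order-insensitive. The score is 100 when the second string contains all the tokens of the first.
FuzzySearch.tokenSetPartialRatio(String s1, String s2)
Builds token sets first, then does a search match (partial match); order-insensitive.
FuzzySearch.weightedRatio(String s1, String s2)
Uses a different algorithm, a weighted combination of the ratios above; word order still affects the score.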
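
To make the "sort first" and "take the set first" steps concrete, here is a minimal sketch of the token normalization idea in plain Java. The class and method names are hypothetical and are not part of the fuzzywuzzy API; lowercasing and whitespace splitting are assumptions made for the illustration, and the real token set ratio does more work than shown here (it also compares the shared tokens against the remainders).

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.TreeSet;

// Illustrative only: hypothetical helpers, not part of the fuzzywuzzy API.
final class TokenNormalizationSketch {

    // "Token sort" idea: split on whitespace, sort the tokens, re-join them.
    static String sortTokens(String s) {
        String[] tokens = s.trim().toLowerCase(Locale.ROOT).split("\\s+");
        Arrays.sort(tokens);
        return String.join(" ", tokens);
    }

    // "Token set" idea: same as above, but duplicate tokens are dropped first.
    static String uniqueSortedTokens(String s) {
        String[] tokens = s.trim().toLowerCase(Locale.ROOT).split("\\s+");
        return String.join(" ", new TreeSet<>(Arrays.asList(tokens)));
    }

    public static void main(String[] args) {
        System.out.println(sortTokens("world hello"));        // hello world
        System.out.println(uniqueSortedTokens("b a b a c"));  // a b c
    }
}
```

Once both inputs are normalized this way, any order-sensitive comparison applied to the results no longer depends on the original word order.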

Source: https://github.com/xdrop/fuzzywuzzy

Example:

        // Requires: import me.xdrop.fuzzywuzzy.FuzzySearch;
        //           import me.xdrop.diffutils.DiffUtils;
        //           import java.util.Arrays;
        System.out.println("1 "+FuzzySearch.ratio("admin", "admin"));
        System.out.println("2 "+FuzzySearch.partialRatio("ADMIN", "admin"));
        System.out.println("3 "+FuzzySearch.tokenSetPartialRatio("test", "test1"));
        System.out.println("4 "+FuzzySearch.weightedRatio("你是", "你是我"));
        System.out.println("5 "+FuzzySearch.tokenSortRatio("你是", "你是W"));
        System.out.println("6 "+FuzzySearch.tokenSetRatio("你是", "你是o"));
        // Lower-level helpers from the bundled diffutils package
        System.out.println(DiffUtils.getRatio("你是", "你是我"));
        System.out.println(DiffUtils.levEditDistance("你是", "你是我",1));
        // These two return arrays, so wrap them to print their contents
        System.out.println(Arrays.toString(DiffUtils.getMatchingBlocks("你是", "你是我")));
        System.out.println(Arrays.toString(DiffUtils.getEditOps("你是", "你是我")));

Maven:

        <dependency>
            <groupId>me.xdrop</groupId>
            <artifactId>fuzzywuzzy</artifactId>
            <version>1.3.1</version>
        </dependency>

2. commons-text

Introduction: Commons Text is a set of practical, reusable components for working with text in Java.

Source: http://commons.apache.org/proper/commons-text/

Example:

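        // Requires: import org.apache.commons.text.similarity.FuzzyScore;
        //           import java.util.Locale;
        // The locale is used to lowercase both strings before scoring
        FuzzyScore fuzzyScore = new FuzzyScore(Locale.ENGLISH);
        System.out.println("1 "+fuzzyScore.fuzzyScore("admin", "admin"));
        FuzzyScore fuzzyScores = new FuzzyScore(Locale.CHINESE);
        System.out.println("2 "+fuzzyScores.fuzzyScore("你是", "你是"));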
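
The numbers printed above are scores, not percentages, which can be surprising. As a rough illustration, the sketch below reimplements the scoring rule as the FuzzyScore Javadoc describes it: one point for every query character matched in order, plus two bonus points when a match immediately follows the previous one. The class name is hypothetical, and Locale.ROOT is an assumption made to keep the sketch locale-independent; this is not the library's own code.

```java
import java.util.Locale;

// Illustrative re-implementation of the scoring rule; not the Commons Text source.
final class FuzzyScoreSketch {

    static int score(String term, String query) {
        String t = term.toLowerCase(Locale.ROOT);
        String q = query.toLowerCase(Locale.ROOT);
        int score = 0;
        int termIndex = 0;       // the term is only ever scanned forward
        int previousMatch = -2;  // sentinel: no match seen yet
        for (int i = 0; i < q.length(); i++) {
            char c = q.charAt(i);
            boolean found = false;
            while (termIndex < t.length() && !found) {
                if (t.charAt(termIndex) == c) {
                    score++;                     // one point per matched character
                    if (previousMatch + 1 == termIndex) {
                        score += 2;              // bonus for consecutive matches
                    }
                    previousMatch = termIndex;
                    found = true;
                }
                termIndex++;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("admin", "admin"));   // 13
        System.out.println(score("Workshop", "wo"));   // 4
    }
}
```

Under this rule an exact match of n characters scores 1 + 3(n - 1), so longer identical strings score higher; the value is not normalized to a 0-100 range.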

Maven:

        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-text</artifactId>
            <version>1.4</version>
        </dependency>

3. java-string-similarity

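Introduction: A library implementing different string similarity and distance measures. A dozen algorithms are currently implemented (including Levenshtein edit distance and its siblings, Jaro-Winkler, Longest Common Subsequence, cosine similarity, etc.).

Normalized, metric, similarity and distance
Shingle (n-gram) based similarity and distance
Levenshtein
Normalized Levenshtein
Weighted Levenshtein
Damerau-Levenshtein
Optimal String Alignment
Jaro-Winkler
Longest Common Subsequence
Metric Longest Common Subsequence
N-Gram
Shingle (n-gram) based algorithms
Q-Gram
Cosine similarity
Jaccard index
Sorensen-Dice coefficient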
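
Several entries in this list build on the Levenshtein edit distance, so a minimal sketch of it may help. The version below is the standard dynamic-programming formulation with two rolling rows; the class name is hypothetical and this is not the library's implementation. The normalized variant assumes the common convention of dividing by the length of the longer string.

```java
// Illustrative only: textbook Levenshtein distance, not the library's code.
final class LevenshteinSketch {

    // Minimum number of single-character insertions, deletions and substitutions.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // match or substitution
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Distance scaled into [0, 1] by the length of the longer string (assumed convention).
    static double normalizedDistance(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 0.0 : (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(distance("My string", "My $tring")); // 1
        System.out.println(distance("kitten", "sitting"));      // 3
    }
}
```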

Source: https://github.com/tdebatty/java-string-similarity

Example:

        // Requires: import info.debatty.java.stringsimilarity.*;
        Levenshtein levenshtein = new Levenshtein();
        // One substitution ('s' -> '$')
        System.out.println(levenshtein.distance("My string", "My $tring"));
        NormalizedLevenshtein normalizedLevenshtein = new NormalizedLevenshtein();
        // Levenshtein distance divided by the length of the longer string
        System.out.println(normalizedLevenshtein.distance("My string", "My $tring"));
        Damerau damerau = new Damerau();
        // 1 transposition (C and D swapped)
        System.out.println(damerau.distance("ABCDEF", "ABDCEF"));
        // 2 transpositions
        System.out.println(damerau.distance("ABCDEF", "BACDFE"));
        // 1 deletion (at the end)
        System.out.println(damerau.distance("ABCDEF", "ABCDE"));
        // 1 deletion (at the start)
        System.out.println(damerau.distance("ABCDEF", "BCDEF"));
        // 1 insertion
        System.out.println(damerau.distance("ABCDEF", "ABCGDEF"));
        // All different
        System.out.println(damerau.distance("ABCDEF", "POIU"));
        OptimalStringAlignment optimalStringAlignment = new OptimalStringAlignment();
        // Will produce 3.0
        System.out.println(optimalStringAlignment.distance("CA", "ABC"));
        JaroWinkler jaroWinkler = new JaroWinkler();
        // s and t swapped
        System.out.println(jaroWinkler.similarity("My string", "My tsring"));
        // s and n swapped
        System.out.println(jaroWinkler.similarity("My string", "My ntrisg"));
        LongestCommonSubsequence longestCommonSubsequence = new LongestCommonSubsequence();
        // Will produce 4.0
        System.out.println(longestCommonSubsequence.distance("AGCAT", "GAC"));
        // Will produce 1.0
        System.out.println(longestCommonSubsequence.distance("AGCAT", "AGCT"));
        RatcliffObershelp ratcliffObershelp = new RatcliffObershelp();
        // s and t swapped
        System.out.println(ratcliffObershelp.similarity("My string", "My tsring"));
        // s and n swapped
        System.out.println(ratcliffObershelp.similarity("My string", "My ntrisg"));
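
The 4.0 and 1.0 noted for the longest common subsequence calls follow from the usual definition of LCS distance, |s1| + |s2| - 2 * |LCS(s1, s2)|. The sketch below works through that arithmetic in plain Java; the class name is hypothetical, it returns an int rather than the library's double, and it is not the library's code.

```java
// Illustrative only: LCS distance via the standard dynamic-programming table.
final class LcsDistanceSketch {

    // Length of the longest common subsequence of a and b.
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;              // extend a common subsequence
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]); // skip one character
                }
            }
        }
        return dp[a.length()][b.length()];
    }

    // Characters that must be deleted from a plus characters that must be inserted to reach b.
    static int distance(String a, String b) {
        return a.length() + b.length() - 2 * lcsLength(a, b);
    }

    public static void main(String[] args) {
        System.out.println(distance("AGCAT", "GAC"));  // 5 + 3 - 2*2 = 4
        System.out.println(distance("AGCAT", "AGCT")); // 5 + 4 - 2*4 = 1
    }
}
```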

Maven:

        <dependency>
            <groupId>info.debatty</groupId>
            <artifactId>java-string-similarity</artifactId>
            <version>2.0.0</version>
        </dependency>

4. java-diff-utils

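Introduction: Diff Utils is an open-source library for performing comparison operations between texts: computing diffs, applying patches, generating or parsing unified diffs, producing diff output for easy later display (such as a side-by-side view), and so on.
The main reason for building the library was the lack of an easy-to-use library covering everything you usually need when working with diff files. It was originally inspired by the JRCS library and the nice design of its diff module.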

Source: https://github.com/java-diff-utils/java-diff-utils

Example:

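        // Requires: import com.github.difflib.DiffUtils;
        //           import java.util.Arrays;
        // Character-level diff of two strings
        System.out.println(DiffUtils.diffInline("admin","admin"));
        // Line-level diff; the third argument also includes the equal parts
        System.out.println(DiffUtils.diff(Arrays.asList("admin"),Arrays.asList("admin"),true));
        System.out.println(DiffUtils.diff(Arrays.asList("admin"),Arrays.asList("admin")));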

Maven:

        <dependency>
            <groupId>io.github.java-diff-utils</groupId>
            <artifactId>java-diff-utils</artifactId>
            <version>4.7</version>
        </dependency>
