install.packages('stringdist')
or, to build from source:
git clone https://github.com/markvanderloo/stringdist.git
cd stringdist
bash ./build.bash
R CMD INSTALL output/stringdist_*.tar.gz
The package offers the following main functions:
stringdist(
a,
b,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
"soundex"),
useBytes = FALSE,
weight = c(d = 1, i = 1, s = 1, t = 1),
q = 1,
p = 0,
bt = 0,
nthread = getOption("sd_num_thread")
)
a :R object (target); will be converted by as.character.
b :R object (source); will be converted by as.character. This argument is optional for stringdistmatrix (see section Value).
method :Method for distance calculation.
useBytes :Perform byte-wise comparison.
weight :For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order.
When method='lv', the penalty for transposition is ignored.
When method='jw', the weights associated with characters of a, characters from b and the transposition weight, in that order.
Weights must be positive and not exceed 1.
weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', 'lcs', or 'soundex'.
q :Size of the q-gram; must be nonnegative. Only applies to method='qgram', 'jaccard' or 'cosine'.
p :Prefix factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25.
If p=0 (default), the Jaro-distance is returned. Applies only to method='jw'.
bt :Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than bt. Applies only to method='jw' and p>0.
useNames :Use input vectors as row and column names?
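As a rough illustration of how weight, q and p change the result, the sketch below calls stringdist with each of them in turn. The input strings are arbitrary examples chosen here, not taken from the package documentation.

```r
library(stringdist)

# weight: make a transposition cheaper than the default cost of 1
# (method defaults to "osa", which supports transpositions)
d_swap <- stringdist("ab", "ba", weight = c(d = 1, i = 1, s = 1, t = 0.5))

# q: compare q-gram profiles built from 1-grams and from 2-grams
d_q1 <- stringdist("abc", "abd", method = "qgram", q = 1)
d_q2 <- stringdist("abc", "abd", method = "qgram", q = 2)

# p: Jaro distance (p = 0) versus Jaro-Winkler distance with prefix factor 0.1
d_jaro <- stringdist("MARTHA", "MARHTA", method = "jw")
d_jw   <- stringdist("MARTHA", "MARHTA", method = "jw", p = 0.1)

print(c(swap = d_swap, q1 = d_q1, q2 = d_q2, jaro = d_jaro, jw = d_jw))
```

With p > 0 the shared prefix "MAR" lowers the distance, so d_jw should come out smaller than d_jaro.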
Note: String distance functions have two possible special output values.
NA is returned whenever at least one of the input strings to compare is NA.
Inf is returned when the distance between two strings is undefined according to the selected algorithm.
stringdist("bar","foo",method = "lv") # Levenshtein distance; returns 3
stringdist("ba","foo",method = "lv") # Levenshtein distance; returns 3 (note that the strings differ in length)
stringdist('fu', 'foo', method='hamming') # Hamming distance; returns Inf (undefined for strings of unequal length)
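Both special values can show up in the same result vector. The following is a minimal sketch of one way to screen them out before further processing. The variable names are made up for illustration.

```r
library(stringdist)

x <- c("foo", NA, "fu")

# The second element is NA, the third has a different length than "foo"
d <- stringdist(x, "foo", method = "hamming")
print(d)

# is.finite() is FALSE for both NA and Inf, so it keeps only usable distances
usable <- is.finite(d)
print(x[usable])
```

Filtering with is.finite is a design choice for this sketch; depending on the task it may be more appropriate to treat Inf as "no match" and NA as "missing input" separately.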
stringdistmatrix(
a,
b,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
"soundex"),
useBytes = FALSE,
weight = c(d = 1, i = 1, s = 1, t = 1),
q = 1,
p = 0,
bt = 0,
useNames = c("none", "strings", "names"),
nthread = getOption("sd_num_thread")
)
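Return value, depending on the arguments:
- Only one vector supplied: returns an object of class dist (the same kind of result the dist function produces)
- Two vectors supplied: returns a matrix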
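The two call forms can be compared side by side. This is a small sketch with made-up input vectors; it only checks the type and shape of what comes back.

```r
library(stringdist)

x <- c("foo", "bar", "boo")
y <- c("foo", "baz")

# One vector: pairwise distances within x, returned as a "dist" object
d_one <- stringdistmatrix(x)
print(class(d_one))

# Two vectors: a length(x) by length(y) matrix; useNames = "strings"
# labels rows and columns with the input strings
m_two <- stringdistmatrix(x, y, useNames = "strings")
print(m_two)

# A dist object can be expanded to a full symmetric matrix when needed
print(as.matrix(d_one))
```

Because the one-vector form returns a dist object, it can be handed directly to functions that expect one, such as hclust.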
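amatch is modeled on the base R function match. Its behavior is controlled by the maxDist argument: with a very small maxDist it behaves almost like an exact match, while a larger maxDist gives approximate matching. amatch relates to ain the way match relates to %in%: the former returns the index of the matching element, the latter returns TRUE/FALSE.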
amatch('fu', c('foo','bar')) # return NA
amatch('fu', c('foo','bar'), maxDist=2) # return 1
ain('fu', c('foo','bar')) # return FALSE
ain('fu', c('foo','bar'), maxDist=2) # return TRUE
ain('bar', c('foo','bar')) # return TRUE
ain('bar', c('foo','bar'), maxDist=2) # return TRUE
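The parallel with match and %in% is easier to see on a whole vector. The sketch below uses illustrative inputs and the default method; "qux" was picked because it is more than 2 edits away from both table entries.

```r
library(stringdist)

lookup <- c("foo", "bar")
x <- c("fu", "bar", "qux")

# Exact matching from base R
print(match(x, lookup))
print(x %in% lookup)

# Approximate matching: index of the closest entry within maxDist, or NA
idx <- amatch(x, lookup, maxDist = 2)
print(idx)

# Approximate membership test: TRUE where amatch would find a match
found <- ain(x, lookup, maxDist = 2)
print(found)
```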
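Note: Dosa and Ddl differ mainly in the last equation. Dosa only allows two adjacent characters to be swapped, and each substring may be edited at most once; Ddl lifts that restriction, so a swapped pair can take part in further edits when the distance is computed.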
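A commonly used pair of strings that separates the two distances is "ca" and "abc". The sketch below compares lv, osa and dl on that pair and on a plain adjacent swap; the strings are chosen here for illustration.

```r
library(stringdist)

methods <- c("lv", "osa", "dl")

# A plain adjacent swap: osa and dl both count it as one edit
swap <- sapply(methods, function(m) stringdist("ab", "ba", method = m))
print(swap)

# "ca" -> "abc": dl may swap "ca" to "ac" and then insert "b" in between,
# osa may not edit the swapped substring again and needs more steps
multi <- sapply(methods, function(m) stringdist("ca", "abc", method = m))
print(multi)
```

On the second pair, osa should give the same value as lv while dl comes out one lower, which is the practical effect of the restriction described above.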