https://blog.csdn.net/circle2015/article/details/102771232
Can be used to mine association relationships between items.
The explanation is thorough, and Spark Scala code is included.
https://blog.csdn.net/zongzhiyuan/article/details/77883641
https://blog.csdn.net/baixiangxue/article/details/80335469
On average, FP-Growth is far more efficient than Apriori, but high efficiency is not guaranteed: its performance depends on the dataset. When the frequent itemsets in the data share no common prefixes, every itemset hangs directly off the root of the FP-tree, so no compression is achieved; the FP-tree also carries extra bookkeeping overhead and can require more storage. Before using FP-Growth, inspect the data to check whether it is a good fit.
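To make the workflow concrete, here is a minimal sketch of running FP-Growth with Spark MLlib's org.apache.spark.ml.fpm.FPGrowth; the transaction data, column names, and the minSupport/minConfidence thresholds are illustrative assumptions, not taken from the posts above.
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

object FPGrowthSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FPGrowthSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical transactions: one row per basket, items stored in an array column
    val transactions = Seq(
      Seq("milk", "bread", "butter"),
      Seq("milk", "bread"),
      Seq("bread", "butter"),
      Seq("milk", "butter")
    ).toDF("items")

    // minSupport / minConfidence are placeholders; tune them on the real dataset
    val fpGrowth = new FPGrowth()
      .setItemsCol("items")
      .setMinSupport(0.5)
      .setMinConfidence(0.6)

    val model = fpGrowth.fit(transactions)
    model.freqItemsets.show(truncate = false)     // frequent itemsets and their counts
    model.associationRules.show(truncate = false) // rules with antecedent, consequent, confidence

    spark.stop()
  }
}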
https://www.cnblogs.com/EE-NovRain/p/3810737.html
When using the Spark package, the following error occurs:
User class threw exception: java.lang.NoClassDefFoundError: Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_void;
The cause is discussed here:
https://github.com/Azure/mmlspark/issues/482
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.font_manager
from sklearn import svm
xx, yy = np.meshgrid(np.linspace(-5, 5, 500), np.linspace(-5, 5, 500))
# Generate train data
X = 0.3 * np.random.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Regular "novel" observations drawn from the same two blobs as the training data
X_test = np.r_[X + 2, X - 2]
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=0.1, high=4, size=(20, 2))
# fit the model
clf = svm.OneClassSVM(nu=0.1, kernel='rbf', gamma=0.1)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
n_error_train = y_pred_train[y_pred_train == -1].size
n_error_test = y_pred_test[y_pred_test == -1].size
n_error_outlier = y_pred_outliers[y_pred_outliers == 1].size
print("number of train samples: %s, number of errors: %s" % (X_train.shape[0], n_error_train))
print("number of test samples: %s, number of errors: %s" % (X_test.shape[0], n_error_test))
print("number of outlier samples: %s, number of errors: %s" % (X_outliers.shape[0], n_error_outlier))
# plot the line, the points, and the nearest vectors to the plane
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.title("Novelty Detection")
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 7), cmap=plt.cm.PuBu)
a = plt.contour(xx, yy, Z, levels=[0, Z.max()], colors='palevioletred')
s = 40
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white', s=s, edgecolors='k')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='blueviolet', s=s, edgecolors='k')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='gold', s=s, edgecolors='k')
plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.legend([a.collections[0], b1, b2, c],
["learned frontier", 'training observations',
"new regular observations", "new abnormal observations"],
loc="upper left",
prop=matplotlib.font_manager.FontProperties(size=11))
plt.xlabel(
    "error train: %d/200; errors novel regular: %d/200; errors novel abnormal: %d/20" % (
        n_error_train, n_error_test, n_error_outlier))
plt.show()
<dependency>
    <groupId>org.scalanlp</groupId>
    <artifactId>nak_2.11</artifactId>
    <version>1.3</version>
</dependency>
This is a dynamic programming problem.
LeetCode 72. Edit Distance
Scala implementation of the text edit distance algorithm
How to measure the similarity of two strings
Edit distance
Edit distance (edit distance)
/**
  * Returns the similarity between two strings, normalized to [0, 1].
  * @param s1
  * @param s2
  * @return
  */
def editDistOptimizedSimilarity(s1: String, s2: String): Double = {
  val longest = if (s1.length > s2.length) s1.length else s2.length
  if (longest == 0) return 1.0 // two empty strings are identical
  val nums = editDistOptimized(s1, s2).toDouble
  1 - nums / longest
}
/**
  * Rolling-array optimization: reduces the space complexity of edit distance
  * from O(n^2) to O(n) by keeping only two rows; time complexity stays O(n*m).
  * @param s1
  * @param s2
  * @return https://www.jianshu.com/p/10519b584b50
  */
def editDistOptimized(s1: String, s2: String): Int = {
  // If either string is null or empty, the distance is simply the length of the other one.
  if (s1 == null || s1.isEmpty || s2 == null || s2.isEmpty) {
    return math.max(if (s1 == null) 0 else s1.length,
                    if (s2 == null) 0 else s2.length)
  }
  val s1_length = s1.length + 1
  val s2_length = s2.length + 1
  // Two-row rolling DP table
  val matrix = Array.ofDim[Int](2, s2_length)
  // First row: distance from the empty string to a prefix of s2 (j insertions)
  for (j <- 0 until s2_length) {
    matrix(0)(j) = j
  }
  // DP loop driven by the standard edit-distance transition.
  // Two indices mark which row of the rolling array is "previous" and which is "current".
  var last_index = 0
  var current_index = 1
  var tmp = 0
  var cost = 0
  for (i <- 1 until s1_length) {
    // First column: distance from a prefix of s1 to the empty string (i deletions)
    matrix(current_index)(0) = i
    for (j <- 1 until s2_length) {
      // Substitution cost for the diagonal element: 0 if the characters match, 1 otherwise
      if (s1.charAt(i - 1) == s2.charAt(j - 1)) {
        cost = 0
      } else {
        cost = 1
      }
      // Transition: min(delete, insert, substitute/match)
      matrix(current_index)(j) = math.min(math.min(matrix(last_index)(j) + 1, matrix(current_index)(j - 1) + 1), matrix(last_index)(j - 1) + cost)
    }
    // Swap the two row indices to reuse the array and keep space at O(n)
    tmp = last_index
    last_index = current_index
    current_index = tmp
  }
  matrix(last_index)(s2_length - 1)
}
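A quick usage sketch of the two functions above, with arbitrary sample strings:
println(editDistOptimized("kitten", "sitting"))           // 3 (substitute k->s, substitute e->i, insert g)
println(editDistOptimizedSimilarity("kitten", "sitting")) // 1 - 3/7 ≈ 0.571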
Covers ICTCLAS, Jieba, Ansj, THULAC, FNLP, Stanford CoreNLP, and LTP.
https://www.cnblogs.com/en-heng/p/6225117.html
val movieEmbSeq = movieEmbMap.toSeq.map(item => (item._1, Vectors.dense(item._2.map(f => f.toDouble))))
val movieEmbDF = spark.createDataFrame(movieEmbSeq).toDF("movieId", "emb")
//LSH bucket model
val bucketProjectionLSH = new BucketedRandomProjectionLSH()
.setBucketLength(0.1)
.setNumHashTables(3)
.setInputCol("emb")
.setOutputCol("bucketId")
val bucketModel = bucketProjectionLSH.fit(movieEmbDF)
val embBucketResult = bucketModel.transform(movieEmbDF)
println("movieId, emb, bucketId schema:")
embBucketResult.printSchema()
println("movieId, emb, bucketId data result:")
embBucketResult.show(10, truncate = false)
// Try approximate nearest-neighbor search for one sample embedding
println("Approximately searching for 5 nearest neighbors of the sample embedding:")
val sampleEmb = Vectors.dense(0.795,0.583,1.120,0.850,0.174,-0.839,-0.0633,0.249,0.673,-0.237)
bucketModel.approxNearestNeighbors(movieEmbDF, sampleEmb, 5).show(truncate = false)
The code above is taken from the book 《深度学习推荐系统实践》:
https://github.com/wzhe06/SparrowRecSys
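Besides approxNearestNeighbors, the fitted LSH model can also perform an approximate similarity join between two DataFrames. A minimal sketch, self-joining movieEmbDF with itself; the distance threshold of 2.0 and the output column name are illustrative assumptions:
// Find candidate movie pairs whose Euclidean distance is below the (illustrative) threshold 2.0
val similarPairs = bucketModel
  .approxSimilarityJoin(movieEmbDF, movieEmbDF, 2.0, "EuclideanDistance")
  .filter("datasetA.movieId < datasetB.movieId") // drop self-matches and mirrored pairs
similarPairs
  .select("datasetA.movieId", "datasetB.movieId", "EuclideanDistance")
  .show(10, truncate = false)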
Information entropy measures how disordered a dataset is: the larger the value, the more disordered the data.
The larger the entropy, the greater the uncertainty of the random variable.
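For reference, the Shannon entropy computed below is H(X) = -\sum_i p_i \log_2 p_i, where p_i is the fraction of samples carrying the i-th class label.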
The first function below computes the information entropy; the second creates the sample dataset.
import math
def calcShannonEnt(dataset):
    """Compute the Shannon entropy of a dataset whose last column is the class label."""
    numEntries = len(dataset)
    labelCounts = {}
    # Count how many samples fall into each class label
    for featVec in dataset:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    # H = -sum(p * log2(p)) over all class labels
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * math.log(prob, 2)
    return shannonEnt

def CreateDataSet():
    """Create a small toy dataset and its feature labels."""
    dataset = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataset, labels

myDat, labels = CreateDataSet()
print(calcShannonEnt(myDat))  # ≈ 0.9710 for 2 'yes' and 3 'no'
Proof of the maximum value of information entropy
Relative entropy, also known as KL divergence (Kullback–Leibler divergence)
Understanding how cross-entropy is used in machine learning and the intuition behind it
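For reference, the relationships between these quantities for discrete distributions p and q over the same support, in LaTeX notation:
H(p) = -\sum_x p(x) \log p(x), \qquad D_{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}
H(p, q) = -\sum_x p(x) \log q(x) = H(p) + D_{KL}(p \,\|\, q)
For a distribution over n outcomes, H(p) \le \log n, with equality exactly when p is uniform, which is the maximum-entropy result referenced above. Minimizing the cross-entropy H(p, q) with the target distribution p held fixed is therefore equivalent to minimizing D_{KL}(p \,\|\, q).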
The main idea is still encode-then-decode: keep the encoder and decoder symmetric, while the network architecture itself can be customized.
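In formula form (a sketch; the exact loss is a design choice): with encoder f and decoder g, training minimizes the reconstruction error \mathcal{L}(x) = \lVert x - g(f(x)) \rVert^2, and the "symmetry" typically means the decoder mirrors the encoder's layer sizes in reverse order.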