Loading and using a pretrained BERT model with sentence_transformers; KMeans clustering of vectors

Using sentence_transformers

Reference: https://www.sbert.net/docs/quickstart.html

sentence_transformers is built on top of the transformers library.

from sentence_transformers import SentenceTransformer, util

# cache_folder specifies where the downloaded model is saved
model1 = SentenceTransformer('peterchou/simbert-chinese-base', cache_folder=r"D:\simcse")

embeddings1 = model1.encode("语言模型", show_progress_bar=True)   # "language model"
embeddings2 = model1.encode("自然语言", show_progress_bar=True)   # "natural language"
print(embeddings1)

cos_sim = util.cos_sim(embeddings1, embeddings2)
print("Cosine-Similarity:", cos_sim)


Error: ValueError: unable to parse D:\simcse\peterchou_simbert-chinese-base\tokenizer_config.json as a URL or as a local path


transformers basically loads models through from_pretrained. If you load by name, e.g. 'roberta-base', the model is downloaded first; but sometimes you have already downloaded a pretrained model, such as chinese_roberta_wwm_ext_pytorch, and you can pass the local path to from_pretrained directly. This is where the first pitfall appears: from_pretrained expects the config file to be named config.json, while chinese_roberta_wwm_ext_pytorch ships it as bert_config.json (the name used when loading with pytorch_pretrained_bert), so you have to rename bert_config.json to config.json. Of course, if you warm-start from a different pretrained model whose download already contains a config.json, no renaming is needed.
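
A minimal sketch of loading such a local checkpoint with transformers; the directory path below is hypothetical, and bert_config.json is assumed to have already been renamed to config.json as described above.

from transformers import BertConfig, BertModel, BertTokenizer

# hypothetical local checkpoint directory (bert_config.json already renamed to config.json)
local_dir = r"D:\chinese_roberta_wwm_ext_pytorch"

tokenizer = BertTokenizer.from_pretrained(local_dir)
config = BertConfig.from_pretrained(local_dir)
model = BertModel.from_pretrained(local_dir, config=config)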

The file names of the model downloaded online don't match what the loader expects; renaming tokenizer_config.json to config.json fixes it.
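
A small sketch of applying that rename to the cache folder from the error message above; the path is the one used in this post, and whether this rename is actually needed may depend on the library versions involved.

import os

cache_dir = r"D:\simcse\peterchou_simbert-chinese-base"  # cache_folder path used above
src = os.path.join(cache_dir, "tokenizer_config.json")
dst = os.path.join(cache_dir, "config.json")

# only rename if the expected config.json is actually missing
if os.path.exists(src) and not os.path.exists(dst):
    os.rename(src, dst)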

KMeans clustering of the vectors

import numpy as np
from sklearn.cluster import KMeans

# Normalize the set of BERT sentence vectors to unit length
uv = np.linalg.norm(wv['vectors'], axis=1).reshape(-1, 1)  # per-vector L2 norm
wv['vectors'] = wv['vectors'] / uv

# Cluster the normalized vectors
kmeans = KMeans(n_clusters=30).fit(wv['vectors'])
labels = kmeans.labels_  # cluster label assigned to each vector
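
Putting the two parts together, a minimal end-to-end sketch (not from the original post): the sentence list and n_clusters value are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('peterchou/simbert-chinese-base', cache_folder=r"D:\simcse")

sentences = ["语言模型", "自然语言", "深度学习", "机器学习"]  # illustrative inputs
vectors = model.encode(sentences)  # shape: (len(sentences), hidden_size)

# L2-normalize so Euclidean KMeans roughly corresponds to cosine distance
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

kmeans = KMeans(n_clusters=2).fit(vectors)
print(kmeans.labels_)  # cluster id for each sentence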

KMeans usage example

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn import metrics
import matplotlib.pyplot as plt

# Generate 2-D toy data from 4 Gaussian blobs (n_features is taken from the centers)
x, y = make_blobs(n_samples=1000, n_features=2,
                  centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                  cluster_std=[0.4, 0.2, 0.2, 0.4], random_state=10)

# Fit KMeans with 3 clusters (the data was generated from 4 blobs)
k_means = KMeans(n_clusters=3, random_state=10)
k_means.fit(x)

y_predict = k_means.predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y_predict)
plt.show()

print(k_means.predict(x[:30, :]))          # predicted labels for the first 30 samples
# print(metrics.calinski_harabasz_score(x, y_predict))
print(k_means.cluster_centers_)            # fitted cluster centers
print(k_means.inertia_)                    # sum of squared distances to the nearest center
print(metrics.silhouette_score(x, y_predict))
print(x[:30])
print(k_means.labels_)                     # cluster label assigned to each sample
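
Since the example already prints inertia_ and the silhouette score, a natural follow-up (not in the original post) is to sweep n_clusters and compare the scores to pick K; a minimal sketch on the same x:

# Sweep candidate cluster counts and compare silhouette scores (higher is better)
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=10).fit(x)
    score = metrics.silhouette_score(x, km.labels_)
    print(f"n_clusters={k}  inertia={km.inertia_:.1f}  silhouette={score:.3f}")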
