A baseline example for text clustering

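from sklearn.cluster import KMeans
import numpy as np
# hidden_dim = 2: six samples, each a 2-dimensional vector
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                     # cluster index of each training sample
print(kmeans.predict([[0, 0], [12, 3]]))  # nearest cluster for two unseen points
print(kmeans.cluster_centers_)            # one centroid per cluster, shape (2, 2)

print()

# hidden_dim = 4: the same six samples, now as 4-dimensional vectors
X = np.array([[1, 2, 1, 1], [1, 4, 1, 1], [1, 0, 1, 1],
              [10, 2, 10, 10], [10, 4, 10, 10], [10, 0, 10, 10]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)
print(kmeans.predict([[0, 0, 1, 1], [12, 3, 10, 10]]))
print(kmeans.cluster_centers_)            # shape (2, 4)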

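The code above treats each sentence as a hidden_dim-dimensional vector. In practice you can use BERT to encode each sentence as, for example, a 768-dimensional vector and then apply the same code unchanged.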
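
To make the 768-dimensional case concrete, here is a minimal sketch. It does not run BERT: the matrix X is filled with synthetic random vectors that only stand in for sentence embeddings (an assumption made for illustration), with three vectors near 0 and three near 5 so the two groups are easy to separate. With a real encoder, X would simply be the stacked sentence vectors of shape (number of sentences, 768).

```python
import numpy as np
from sklearn.cluster import KMeans

hidden_dim = 768  # typical size of a BERT-base sentence vector

# ASSUMPTION: synthetic vectors stand in for real BERT sentence embeddings.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(3, hidden_dim))  # 3 "sentences" on one topic
group_b = rng.normal(loc=5.0, scale=0.1, size=(3, hidden_dim))  # 3 "sentences" on another
X = np.vstack([group_a, group_b])  # shape (6, 768)

# Same call as in the low-dimensional examples; only the width of X changed.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
centers = kmeans.cluster_centers_

print(labels)         # first three samples share one label, last three share the other
print(centers.shape)  # (2, 768)
```

The KMeans call is identical to the 2- and 4-dimensional cases, so switching to real embeddings should only require replacing how X is built.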
