I. Tools: Python 3.4, the scikit-learn library, and NLTK (the Natural Language Toolkit). Reference: Building Machine Learning Systems with Python (《机器学习系统设计》).
II. Steps:
1. Convert the raw text into a bag of words: count how often each word occurs and turn the counts into vectors.
from sklearn.feature_extraction.text import CountVectorizer
Note on opening the files: os.path.join() joins path components, e.g. os.path.join("D:\\", "test.txt") gives D:\test.txt.
import os
from utils import DATA_DIR

TOY_DIR = os.path.join(DATA_DIR, "toy")
posts = [open(os.path.join(TOY_DIR, f)).read() for f in os.listdir(TOY_DIR)]
new_post = "imaging databases"
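To see what the bag-of-words vectors actually look like, here is a minimal sketch using two toy sentences (they are only for illustration and are not part of the toy post set above):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["How to format my hard disk", "Hard disk format problems"]
vec = CountVectorizer(min_df=1)
X = vec.fit_transform(docs)       # sparse matrix: one row per document, one column per word
print(vec.get_feature_names())    # the learned vocabulary
print(X.toarray())                # the word counts as dense vectors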
2. Normalizing the word count vectors
http://blog.sina.com.cn/s/blog_70586e000100moen.html -- notes on SciPy's matrix and linear algebra functions
import scipy as sp
import scipy.linalg

def dist_norm(v1, v2):
    v1_normalized = v1 / sp.linalg.norm(v1.toarray())
    v2_normalized = v2 / sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())
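A quick sanity check of why the normalization matters (the two toy vectors below are my own, not from the book): a vector and a scaled copy of it end up at distance (almost) zero, so merely repeating a post's content no longer changes its distance:

from scipy.sparse import csr_matrix

v1 = csr_matrix([[1.0, 2.0, 0.0]])   # toy count vector
v2 = csr_matrix([[3.0, 6.0, 0.0]])   # the same content repeated three times

print(sp.linalg.norm((v1 - v2).toarray()))  # about 4.47: the raw distance is misled by repetition
print(dist_norm(v1, v2))                    # about 0: after normalization only the direction counts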
3. Removing stop words
vectorizer = CountVectorizer(min_df=1, stop_words='english')
stop_words='english' uses the built-in English stop word list of 318 words.
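If you want to inspect that list, CountVectorizer exposes it through get_stop_words():

stop = vectorizer.get_stop_words()   # frozenset with the 318 built-in English stop words
print(len(stop))
print(sorted(stop)[:10])             # a peek at the first few entries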
4. Stemming
For English we use NLTK's SnowballStemmer, which maps words that mean the same thing but differ in form onto one token, e.g. images and imaging.
import nltk.stem

english_stemmer = nltk.stem.SnowballStemmer('english')
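A few quick calls show what the stemmer produces (outputs as given in the book; exact forms may vary slightly with the NLTK version):

print(english_stemmer.stem("graphics"))     # 'graphic'
print(english_stemmer.stem("imaging"))      # 'imag'
print(english_stemmer.stem("image"))        # 'imag'
print(english_stemmer.stem("imagination"))  # 'imagin'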
Each post is processed as follows:
1) Lower-case the raw post in the preprocessing step (done in the parent class)
2) Extract all the words in the tokenization step (done in the parent class)
3) Convert each word into its stemmed form
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

# vectorizer = CountVectorizer(min_df=1, stop_words='english',
#                              preprocessor=stemmer)
vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')

5. Stop words on steroids (TF-IDF)
Term frequency times inverse document frequency (TF-IDF): TF is the counting part, while IDF applies a discount to words that appear in many documents.
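A back-of-the-envelope version of the idea, following the book's toy example (this is only a sketch; scikit-learn's TfidfVectorizer additionally applies smoothing and normalization):

import math

def tfidf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                        # how prominent the term is in this document
    num_docs_with_term = len([d for d in corpus if term in d])
    idf = math.log(len(corpus) / num_docs_with_term)       # discount terms that occur in many documents
    return tf * idf

a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"]
D = [a, abb, abc]
print(tfidf("a", a, D))    # 0.0 -- "a" occurs in every document, so it carries no information
print(tfidf("b", abb, D))  # about 0.27
print(tfidf("c", abc, D))  # about 0.37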
from sklearn.feature_extraction.text import TfidfVectorizer

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')
6. Computing the similarity between posts
import sys

X_train = vectorizer.fit_transform(posts)
num_samples, num_features = X_train.shape
print("#samples: %d, #features: %d" % (num_samples, num_features))

new_post_vec = vectorizer.transform([new_post])
print(new_post_vec, type(new_post_vec))
print(new_post_vec.toarray())
print(vectorizer.get_feature_names())

def dist_raw(v1, v2):
    delta = v1 - v2
    return sp.linalg.norm(delta.toarray())

def dist_norm(v1, v2):
    v1_normalized = v1 / sp.linalg.norm(v1.toarray())
    v2_normalized = v2 / sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())

dist = dist_norm
best_dist = sys.maxsize
best_i = None
for i in range(0, num_samples):
    post = posts[i]
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist(post_vec, new_post_vec)
    print("=== Post %i with dist=%.2f: %s" % (i, d, post))
    if d < best_dist:
        best_dist = d
        best_i = i
print("Best post is %i with dist=%.2f" % (best_i, best_dist))

7. Output
#samples: 5, #features: 17
  (0, 5)  0.707106781187
  (0, 4)  0.707106781187
<class 'scipy.sparse.csr.csr_matrix'>
[[ 0.  0.  0.  0.  0.70710678  0.70710678  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. ]]
['actual', 'capabl', 'contain', 'data', 'databas', 'imag', 'interest', 'learn', 'machin', 'perman', 'post', 'provid', 'save', 'storag', 'store', 'stuff', 'toy']
=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=1.08: Imaging databases provide storage capabilities.
=== Post 2 with dist=0.86: Most imaging databases save images permanently.
=== Post 3 with dist=0.92: Imaging databases store data.
=== Post 4 with dist=0.92: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 2 with dist=0.86