Machine Learning -- Application: Evaluating Post Relevance

I. Tools: Python 3, the scikit-learn library, and NLTK (the Natural Language Toolkit). Reference: Building Machine Learning Systems with Python (《机器学习系统设计》).

II. Steps:

1. Convert the raw text into a bag of words: count word occurrences and turn the term frequencies into vectors.

from sklearn.feature_extraction.text import CountVectorizer

Note: 1) Opening files: os.path.join() joins path components into a single path; for example, os.path.join("D:\\", "test.txt") returns D:\test.txt.

import os

from utils import DATA_DIR

TOY_DIR = os.path.join(DATA_DIR, "toy")
# read every file in the toy corpus directory into a list of posts
posts = [open(os.path.join(TOY_DIR, f)).read() for f in os.listdir(TOY_DIR)]

new_post = "imaging databases"
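
As a minimal sketch of what step 1 produces, here is CountVectorizer applied to two standalone example sentences (they are illustrative and not part of the toy corpus):

from sklearn.feature_extraction.text import CountVectorizer

demo_vectorizer = CountVectorizer(min_df=1)
demo_X = demo_vectorizer.fit_transform(
    ["How to format my hard disk", "Hard disk format problems"])
print(demo_vectorizer.get_feature_names())
# ['disk', 'format', 'hard', 'how', 'my', 'problems', 'to']
print(demo_X.toarray())  # one word-count row per document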


2. Normalizing the term-frequency vectors

http://blog.sina.com.cn/s/blog_70586e000100moen.html -- SciPy matrix computation functions


import scipy as sp
import scipy.linalg  # makes sp.linalg available


def dist_norm(v1, v2):
    # scale both vectors to unit length so that post length
    # does not dominate the distance
    v1_normalized = v1 / sp.linalg.norm(v1.toarray())
    v2_normalized = v2 / sp.linalg.norm(v2.toarray())

    delta = v1_normalized - v2_normalized

    return sp.linalg.norm(delta.toarray())
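
A quick illustration of why the normalization matters, using two hypothetical count vectors in which one post is simply the other repeated three times:

from scipy.sparse import csr_matrix

v_a = csr_matrix([[1, 1, 0]])  # hypothetical post
v_b = csr_matrix([[3, 3, 0]])  # the same post, repeated three times

# the raw Euclidean distance sees them as far apart ...
print(sp.linalg.norm((v_a - v_b).toarray()))  # ~2.83
# ... while the normalized distance recognizes the identical direction
print(dist_norm(v_a, v_b))                    # 0.0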

3. Removing less important words (stop words)

vectorizer = CountVectorizer(min_df=1, stop_words='english')

stop_words='english': uses scikit-learn's built-in English stop-word list, which contains 318 words.
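
A small check of what stop_words='english' actually loads (a hypothetical probe; the exact list may vary across scikit-learn releases):

stop = vectorizer.get_stop_words()
print(len(stop))         # 318
print(sorted(stop)[:5])  # ['a', 'about', 'above', 'across', 'after']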


4. Stemming

For English, use the SnowballStemmer from the NLTK package. It groups words that share a meaning but differ in form under one stem for counting, e.g. images and imaging.

import nltk.stem
english_stemmer = nltk.stem.SnowballStemmer('english')
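
A few illustrative calls, with the stemmer's outputs shown as comments:

print(english_stemmer.stem("imaging"))      # 'imag'
print(english_stemmer.stem("images"))       # 'imag'
print(english_stemmer.stem("imagination"))  # 'imagin'
# stemming is purely rule-based, so irregular forms stay distinct:
print(english_stemmer.stem("bought"))       # 'bought'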

Extend the word vectors with the NLTK stemmer.

Each post is processed as follows:

1) Lower-case the raw post in the preprocessing step (done in the parent class).

2) Extract all individual words in the tokenization step (done in the parent class).

3) Convert each word into its stemmed form (as shown in the class below).

class StemmedCountVectorizer(CountVectorizer):

    def build_analyzer(self):
        # wrap the parent analyzer (lower-casing, tokenization,
        # stop-word removal) and stem every token it yields
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

# A preprocessor-based variant is left commented out: the preprocessor
# receives the whole raw document rather than individual tokens.
# vectorizer = CountVectorizer(min_df=1, stop_words='english',
#                              preprocessor=stemmer)
vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')
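
A quick probe (not from the book's code) to confirm that the stemmed analyzer merges word forms as intended:

analyze = vectorizer.build_analyzer()
# 'imaging' and 'images' now collapse into the single stem 'imag'
print(list(analyze("imaging images databases")))
# ['imag', 'imag', 'databas']
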
5. Stop words on steroids

Term frequency-inverse document frequency (TF-IDF): TF is the counting part, while IDF discounts the weight of terms that occur in many documents.
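
For intuition, here is a hand-rolled sketch of the TF-IDF idea on a three-document toy corpus (a simplified formula; scikit-learn's real implementation adds smoothing and normalization):

import math


def tfidf(term, doc, corpus):
    # TF: relative frequency of the term within this document
    tf = doc.count(term) / len(doc)
    # IDF: discount terms that occur in many documents of the corpus
    idf = math.log(len(corpus) / len([d for d in corpus if term in d]))
    return tf * idf


a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"]
corpus = [a, abb, abc]
print(tfidf("a", a, corpus))    # 0.0   -- 'a' occurs in every document
print(tfidf("b", abb, corpus))  # ~0.27 -- frequent here, rarer elsewhere
print(tfidf("c", abc, corpus))  # ~0.37 -- rarest term, highest weight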

from sklearn.feature_extraction.text import TfidfVectorizer




class StemmedTfidfVectorizer(TfidfVectorizer):

    def build_analyzer(self):
        # same wrapping trick as above, now on top of TfidfVectorizer
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))


vectorizer = StemmedTfidfVectorizer(
    min_df=1, stop_words='english', decode_error='ignore')

scikit-learn wraps all of this in TfidfVectorizer. As a result, the document vectors no longer contain raw word counts; they contain TF-IDF values instead.
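
One detail worth noting: TfidfVectorizer L2-normalizes its output by default (norm='l2'), so every document vector already has unit length. A quick check, reusing the fit_transform call from step 6 below:

X_check = vectorizer.fit_transform(posts)
print(sp.linalg.norm(X_check.getrow(0).toarray()))  # ~1.0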

6. Computing post similarity

import sys

# learn the vocabulary and vectorize the whole toy corpus
X_train = vectorizer.fit_transform(posts)


num_samples, num_features = X_train.shape
print("#samples: %d, #features: %d" % (num_samples, num_features))


# map the new post into the feature space learned from the corpus
new_post_vec = vectorizer.transform([new_post])
print(new_post_vec, type(new_post_vec))
print(new_post_vec.toarray())
print(vectorizer.get_feature_names())  # renamed get_feature_names_out() in newer scikit-learn




def dist_raw(v1, v2):
    # plain Euclidean distance on the raw vectors, for comparison
    delta = v1 - v2
    return sp.linalg.norm(delta.toarray())




# dist_norm is reused exactly as defined in step 2 above.


dist = dist_norm


best_dist = sys.maxsize
best_i = None


for i in range(num_samples):
    post = posts[i]
    # skip the new post itself if it happens to be in the corpus
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist(post_vec, new_post_vec)


    print("=== Post %i with dist=%.2f: %s" % (i, d, post))


    if d < best_dist:
        best_dist = d
        best_i = i


print("Best post is %i with dist=%.2f" % (best_i, best_dist))
7. Output

#samples: 5, #features: 17
  (0, 5)	0.707106781187
  (0, 4)	0.707106781187 <class 'scipy.sparse.csr.csr_matrix'>
[[ 0.          0.          0.          0.          0.70710678  0.70710678
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.        ]]
['actual', 'capabl', 'contain', 'data', 'databas', 'imag', 'interest', 'learn', 'machin', 'perman', 'post', 'provid', 'save', 'storag', 'store', 'stuff', 'toy']
=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=1.08: Imaging databases provide storage capabilities.
=== Post 2 with dist=0.86: Most imaging databases save images permanently.
=== Post 3 with dist=0.92: Imaging databases store data.
=== Post 4 with dist=0.92: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 2 with dist=0.86


