Word2Vec: from parameter explanation to hands-on practice

1. Word2Vec parameter explanation

Word2Vec is a model packaged in the gensim library; the name gensim is short for "generate similar".

This article assumes a basic familiarity with word vectors. The parameters:

from gensim.models import Word2Vec
#the values below are the defaults
Word2Vec(sentences=None,  #sentences can be a list of tokenized sentences or a large corpus stream
        size=100,#dimensionality of the feature vectors
        alpha=0.025,#initial learning rate
        window=5,#maximum distance between the current and the predicted word within a sentence
        min_count=5,#minimum word frequency; rarer words are ignored
        max_vocab_size=None,#cap on vocabulary size while building it (limits RAM); None means no limit
        sample=0.001, #threshold for random downsampling of high-frequency words
        seed=1,#random seed
        workers=3,#number of worker threads
        min_alpha=0.0001,#the learning rate drops linearly to this minimum value
        sg=0, #choice of training algorithm: sg=1 uses skip-gram, sg=0 uses CBOW
        hs=0,# hs=1 uses hierarchical softmax; hs=0 together with negative > 0 uses negative sampling
        negative=5,#if > 0, negative sampling is used with this many 'noise words' drawn (usually 5-20); if 0, negative sampling is not used
        cbow_mean=1,#0: use the sum of the context word vectors; 1: use their mean; only applies to CBOW
        iter = 5,#number of iterations (epochs) over the corpus
        null_word = 0,
        trim_rule = None, #vocabulary trimming rule; None means the default rule (discard words below min_count)
        sorted_vocab = 1,#sort the vocabulary by descending frequency before assigning word indexes
        batch_words = 10000,#number of words per batch passed to the worker threads
        compute_loss = False,#if True, track the training loss (readable via get_latest_training_loss())
        callbacks = ())
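
As a quick illustration, here is a minimal sketch that trains a model on a tiny hand-made corpus using the parameter names above (toy_corpus is made up for illustration; note that size and iter are the pre-4.0 gensim names, renamed to vector_size and epochs in gensim 4.0+):

#minimal sketch: train Word2Vec on a toy corpus
from gensim.models import Word2Vec
toy_corpus = [['this', 'movie', 'was', 'great'],
              ['this', 'film', 'was', 'terrible'],
              ['great', 'acting', 'and', 'a', 'great', 'plot']]
toy_model = Word2Vec(sentences=toy_corpus, size=10, window=2, min_count=1, sg=0, iter=5)
print(toy_model.wv['movie'])  #the learned 10-dimensional vector for 'movie'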

2. Kaggle movie review hands-on practice

  • Import the required modules
import pandas as pd
import numpy as np
from gensim.models import word2vec
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
import nltk.data
import re

  • Training data details
Dataset download link: Kaggle movie review text sentiment analysis dataset
train = pd.read_csv('../Bag of Words Meets Bags of Popcorn/labeledTrainData.tsv/labeledTrainData.tsv',header=0,delimiter='\t',quoting=3)
print(train.head())#first 5 rows
print(train.tail())#last 5 rows

Result:

         id  sentiment                                             review
0  "5814_8"          1  "With all this stuff going down at the moment ...
1  "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"          0  "The film starts with a manager (Nicholas Bell...
3  "3630_4"          0  "It must be assumed that those who praised thi...
4  "9495_8"          1  "Superbly trashy and wondrously unpretentious ...
              id  sentiment                                             review
24995   "3453_3"          0  "It seems like more consideration has gone int...
24996   "5064_1"          0  "I don't believe they made this film. Complete...
24997  "10905_3"          0  "Guy is a loser. Can't get girls, needs to bui...
24998  "10194_3"          0  "This 30 minute documentary Buñuel made in the...
24999   "8478_8"          1  "I saw this movie as a child and it broke my h...

There are 25,000 rows in total, with three columns: id, sentiment (the label), and review.

The review column contains raw text with HTML markup and other characters that need cleaning; the cleaning functions are defined below.

  • Split a sentence into a word list
#define a function that splits a sentence into a list of words
def review_to_wordlist(review,remove_stopwords=False):
    #strip the HTML tags and keep only the text
    review_text = BeautifulSoup(review,'html.parser').get_text()
    #replace any non-letter character with a space
    review_text = re.sub('[^a-zA-Z]',' ',review_text)
    #lowercase everything and split on whitespace
    words = review_text.lower().split()
    #optionally remove English stop words, i.e. words with little meaning such as a, an, the
    if remove_stopwords:
        stops = set(stopwords.words('english'))
        words = [w for w in words if not w in stops]
    return words
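
As a quick check, the function can be run on a single review (a sketch; the exact output depends on the data):

#run the cleaning function on the first training review
sample_words = review_to_wordlist(train['review'][0], remove_stopwords=True)
print(sample_words[:10])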
  • Split a review into sentences
#load the punkt tokenizer; nltk.tokenize.punkt ships many pre-trained tokenizer models
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
#split a review into sentences; each returned sentence is a list of words
def review_to_sentences(review,tokenizer,remove_stopwords=False):
    raw_sentences = tokenizer.tokenize(review.strip()) #this only splits the review into sentences, punctuation included
    sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence) >0:
            #call the function above to turn each sentence into a word list
            #note: stop words are always removed here, regardless of the remove_stopwords argument
            sentences.append(review_to_wordlist(raw_sentence,remove_stopwords=True))
    return sentences

If the nltk.data.load() line is unclear, see this explanation of nltk.data.load().
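
If the punkt tokenizer or the stopwords corpus is not installed locally, nltk.data.load and stopwords.words raise a LookupError; both resources can be downloaded once in advance:

import nltk
nltk.download('punkt')      #pre-trained sentence tokenizer loaded by nltk.data.load above
nltk.download('stopwords')  #English stop word list used in review_to_wordlist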

  • Convert all training samples
sentences = []
for review in train['review']:
    sentences += review_to_sentences(review,tokenizer)
print(sentences[:2])#inspect the result

Result:

[['stuff', 'going', 'moment', 'mj', 'started', 'listening', 'music', 'watching', 'odd', 'documentary', 'watched', 'wiz', 'watched', 'moonwalker'],
 ['maybe', 'want', 'get', 'certain', 'insight', 'guy', 'thought', 'really', 'cool', 'eighties', 'maybe', 'make', 'mind', 'whether', 'guilty', 'innocent']]
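
Before training, it is worth a quick check of how many sentences came out of the 25,000 reviews (a sketch; the exact count depends on the punkt tokenizer):

print(len(train['review']))  #25000 reviews
print(len(sentences))        #total number of extracted sentences, several per review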

  • Model training
import logging
#print logging output during training
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s',level=logging.INFO)
#set the parameters
num_features = 300 #word vector dimensionality
min_word_count = 40 #minimum word count; rarer words are ignored
num_workers = 4 #number of threads to run in parallel
context = 10 #context window size
downsampling = 1e-3 #downsample setting for frequent words

#initialize and train the model
from gensim.models import word2vec
print('Training model...')
model = word2vec.Word2Vec(sentences,workers=num_workers,size=num_features,min_count= min_word_count,
                          window = context,sample = downsampling)
#init_sims(replace=True) makes the model more memory-efficient, but it cannot be trained further afterwards
model.init_sims(replace=True)
model_name = '300features_40minwords_10context'
#save the model so it can be loaded and used again later
model.save(model_name)
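
The saved model can be loaded back later for querying, for example:

from gensim.models import word2vec
model = word2vec.Word2Vec.load('300features_40minwords_10context')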
  • Inspecting the trained model
#check how well the model trained
#given four words, find the one that does not belong with the others
print(model.doesnt_match('man woman child kitchen'.split()))#'kitchen'
print(model.doesnt_match("france england germany berlin".split())) #'berlin'
#given a word, find the words most similar to it
print(model.most_similar('man'))
print(model.most_similar("queen"))
print(model.most_similar("awful"))

The three most_similar results are, in order:

[('woman', 0.6699092388153076), ('lady', 0.5975276827812195), ('men', 0.48970407247543335), ('doctor', 0.48383718729019165), ('lover', 0.4782255291938782), ('soldier', 0.4646846055984497), ('boy', 0.46273571252822876), ('scientist', 0.4554121196269989), ('lawyer', 0.45362791419029236), ('farmer', 0.44818687438964844)]
[('princess', 0.8943277597427368), ('bride', 0.861689567565918), ('wealthy', 0.8264931440353394), ('befriends', 0.8103165626525879), ('mistress', 0.8016114830970764), ('thief', 0.7950919270515442), ('servant', 0.7940468192100525), ('visits', 0.7895402908325195), ('orphan', 0.7882245182991028), ('widow', 0.7848430275917053)]
[('terrible', 0.8924846053123474), ('horrible', 0.8547634482383728), ('dreadful', 0.8026485443115234), ('sucks', 0.7753838300704956), ('pathetic', 0.7692033052444458), ('crappy', 0.768912136554718), ('lousy', 0.766276478767395), ('atrocious', 0.7649823427200317), ('horrid', 0.7589359283447266), ('abysmal', 0.757400631904602)]
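
Beyond these similarity queries, the learned vector of any in-vocabulary word can also be read directly (a sketch assuming the pre-4.0 gensim API used above):

vec = model.wv['man']            #300-dimensional numpy array for the word 'man'
print(vec.shape)                 #(300,)
print('man' in model.wv.vocab)   #check whether a word survived the min_count filter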

All of this code runs on Python 3.

Off-topic: good luck to China's chip industry!!!


