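To make the code easier to follow, I show the output of every step so that readers can see directly what each piece of code does. I first walk through the code on a handful of samples so the effect of each snippet is visible, and at the end of the article I run the experiment on a larger dataset and report the results for reference.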
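1. First, my data is stored in two Excel files, one holding the positive (pos) reviews and one holding the negative (neg) reviews. They are named poss.xlsx and negg.xlsx, and their contents are as follows:
Contents of poss.xlsx:
Contents of negg.xlsx: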
2. Next, read in the data. The code is as follows:
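import numpy as np
import pandas as pd

pos = pd.read_excel('poss.xlsx', header=None)  # read the positive reviews into a pandas DataFrame
pos['label'] = 1  # add a label column set to 1
neg = pd.read_excel('negg.xlsx', header=None)
neg['label'] = 0  # add a label column set to 0
# merge the two corpora; pd.concat is used because DataFrame.append was removed in pandas 2.0
all = pd.concat([pos, neg], ignore_index=True)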
Running print(all) gives the following output:
Next comes word segmentation:
import jieba

cw = lambda s: list(jieba.cut(s))  # segment a sentence with jieba
all['words'] = all[0].apply(cw)  # column 0 holds the raw review text
Running print(all['words']) gives the following output:
Collect all the words into one large vocabulary:
content = []
for i in all['words']:
    content.extend(i)  # flatten every token list into one long list
abc = pd.Series(content).value_counts()  # word frequencies, most frequent first
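To make the vocabulary step concrete without needing the Excel files, here is a minimal sketch of the same idea in plain Python. The toy reviews are invented for illustration, and `Counter.most_common()` stands in for `value_counts()`; both sort by descending frequency, but the order of words with equal counts may differ between them.

```python
from collections import Counter

# Toy, already-segmented reviews (hypothetical data, not from the article's files).
docs = [["very", "good", "value"], ["not", "good"], ["very", "not", "good"]]

# Flatten all token lists into one list, as content.extend(i) does in the loop.
content = [tok for doc in docs for tok in doc]

# Count each word and list the words from most to least frequent.
counts = Counter(content).most_common()
print(counts)
```

The most frequent word ends up first, which is what later lets the article number words by frequency rank.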
Assign each word a fixed index:
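abc[:] = range(1, len(abc)+1)  # replace each count with the word's frequency rank, starting at 1
abc[''] = 0  # the empty string is the padding token and gets index 0
maxlen = 10

def doc2num(s, maxlen):
    s = [i for i in s if i in abc.index]  # keep only words that are in the vocabulary
    s = s[:maxlen] + ['']*max(0, maxlen-len(s))  # truncate to maxlen, or pad on the right
    return list(abc[s])  # look up the index of every token

all['doc2num'] = all['words'].apply(lambda s: doc2num(s, maxlen))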
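The truncate-or-pad logic is easier to check on a tiny vocabulary. The sketch below mirrors `doc2num` with a plain dict instead of a pandas Series; `build_vocab` and the four-word vocabulary are hypothetical helpers introduced only for this illustration.

```python
def build_vocab(words_by_frequency):
    """Map each word to 1..N by frequency rank; '' (padding) maps to 0."""
    vocab = {word: rank for rank, word in enumerate(words_by_frequency, start=1)}
    vocab[""] = 0
    return vocab

def doc2num(tokens, vocab, maxlen):
    """Drop unknown words, truncate to maxlen, right-pad with '' and map to ids."""
    known = [t for t in tokens if t in vocab]
    padded = known[:maxlen] + [""] * max(0, maxlen - len(known))
    return [vocab[t] for t in padded]

vocab = build_vocab(["good", "very", "not", "value"])
short = doc2num(["very", "good", "unseen"], vocab, maxlen=5)
long = doc2num(["not", "very", "good", "value", "good", "not"], vocab, maxlen=5)
print(short)  # [2, 1, 0, 0, 0]
print(long)   # [3, 2, 1, 4, 1]
```

Every review therefore becomes a list of exactly `maxlen` integers, which is the fixed-length input the network expects.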
The result is as follows:
Shuffle the data and build the input arrays for Keras:
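idx = np.arange(len(all))  # an array rather than range(), so it can be shuffled in place on Python 3 as well
np.random.shuffle(idx)
all = all.loc[idx]  # reorder the rows with the shuffled index
x = np.array(list(all['doc2num']))
y = np.array(list(all['label']))
y = y.reshape((-1,1))  # one label per row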
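Shuffling matters here because the merged table holds all positive reviews first and all negative reviews after them, so an unshuffled tail split would contain only one class. The following is a minimal sketch of shuffle-then-split in plain Python; `shuffle_and_split`, the toy samples and the seed value are assumptions made for illustration, not the article's code.

```python
import random

def shuffle_and_split(samples, labels, train_num, seed=1337):
    """Shuffle samples and labels with one shared permutation, then split."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)  # seeded so the split is repeatable
    xs = [samples[i] for i in idx]
    ys = [labels[i] for i in idx]
    return (xs[:train_num], ys[:train_num]), (xs[train_num:], ys[train_num:])

# Ten toy samples: the first five are positive, the last five negative.
samples = [[i, i] for i in range(10)]
labels = [1] * 5 + [0] * 5
(train_x, train_y), (test_x, test_y) = shuffle_and_split(samples, labels, train_num=8)
print(len(train_x), len(test_x))
```

Because samples and labels are reordered by the same permutation, each review keeps its own label after shuffling.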
First, here is what the data in x looks like, as shown in the figure below:
Next, build the convolutional neural network model with Keras:
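from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Convolution1D, GlobalMaxPooling1D

# The hyperparameters and X_train/X_test used here are defined in the full listing at the end.
# These are Keras 1.x argument names; in Keras 2 the layer is Conv1D(filters, kernel_size, padding=...) and fit() takes epochs.
model = Sequential()
model.add(Embedding(len(abc), embedding_vecor_length, input_length=maxlen))  # word index -> dense vector
model.add(Convolution1D(nb_filter=nb_filter, filter_length=filter_length, border_mode='valid', activation='relu'))
model.add(GlobalMaxPooling1D())  # keep the strongest response of each filter
model.add(Dense(128))
model.add(Dropout(0.2))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))  # probability that the review is positive
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch, validation_data=(X_test, y_test))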
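As a rough check on what this stack does to the tensor shapes, the sketch below works out the convolution output length and the trainable parameter count of each layer. It uses the hyperparameters from the full listing (maxlen 200, 32-dimensional embeddings, 128 filters of length 3); the vocabulary size of 5000 is a made-up figure, since the real value depends on the corpus.

```python
def conv1d_valid_length(input_length, filter_length):
    """Output steps of a stride-1 1D convolution with 'valid' (no) padding."""
    return input_length - filter_length + 1

def parameter_counts(vocab_size, embed_dim, nb_filter, filter_length, dense_units):
    """Trainable parameters per layer for Embedding -> Conv1D -> Dense -> Dense(1)."""
    return {
        "embedding": vocab_size * embed_dim,
        "conv1d": filter_length * embed_dim * nb_filter + nb_filter,
        # global max pooling leaves one value per filter, so the Dense input is nb_filter wide
        "dense_hidden": nb_filter * dense_units + dense_units,
        "dense_output": dense_units * 1 + 1,
    }

steps = conv1d_valid_length(input_length=200, filter_length=3)
counts = parameter_counts(vocab_size=5000,  # hypothetical vocabulary size
                          embed_dim=32, nb_filter=128,
                          filter_length=3, dense_units=128)
print(steps)   # 198
print(counts)
```

With these assumed numbers the embedding table holds most of the parameters, which is typical for small text models of this kind.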
Finally, here is the complete sentiment-classification code for 1000 positive and 1000 negative reviews:
from __future__ import print_function
import jieba
import pandas as pd
import numpy as np
np.random.seed(1337)  # for reproducibility

# Keras 1.x API; in Keras 2 use Conv1D(filters, kernel_size, padding=...) and fit(..., epochs=...)
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Convolution1D, GlobalMaxPooling1D

embedding_vecor_length = 32
maxlen = 200
min_count = 5  # defined here but not used below
batch_size = 32
nb_epoch = 10
nb_filter = 128
filter_length = 3

# read the two corpora and label them
pos = pd.read_excel('poss.xls', header=None)
pos['label'] = 1
neg = pd.read_excel('negg.xls', header=None)
neg['label'] = 0
all = pd.concat([pos, neg], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

# segment every review with jieba
cw = lambda s: list(jieba.cut(s))
all['words'] = all[0].apply(cw)

# build the vocabulary and number the words by frequency rank
content = []
for i in all['words']:
    content.extend(i)
abc = pd.Series(content).value_counts()
abc[:] = range(1, len(abc)+1)
abc[''] = 0  # padding token

def doc2num(s, maxlen):
    s = [i for i in s if i in abc.index]
    s = s[:maxlen] + ['']*max(0, maxlen-len(s))
    return list(abc[s])

all['doc2num'] = all['words'].apply(lambda s: doc2num(s, maxlen))

# shuffle, then split into 1600 training and 400 test samples
idx = np.arange(len(all))  # an array, so it can be shuffled in place on Python 3 as well
np.random.shuffle(idx)
all = all.loc[idx]
x = np.array(list(all['doc2num']))
y = np.array(list(all['label']))
y = y.reshape((-1,1))
train_num = 1600
X_train = x[:train_num]
y_train = y[:train_num]
X_test = x[train_num:]
y_test = y[train_num:]

model = Sequential()
model.add(Embedding(len(abc), embedding_vecor_length, input_length=maxlen))
model.add(Convolution1D(nb_filter=nb_filter, filter_length=filter_length, border_mode='valid', activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(128))
model.add(Dropout(0.2))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print('Train...')
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch, validation_data=(X_test, y_test))
score, acc = model.evaluate(X_test, y_test, verbose=0)
print('Test score:', score)
print('Test accuracy:', acc)
The results are as follows:
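The two numbers printed by the full listing are the loss and the accuracy on the test set: `model.evaluate` returns the loss first, followed by the metrics passed to `compile`. As a rough sketch of what those quantities mean, the functions below compute a mean binary cross-entropy and a 0.5-threshold accuracy in plain Python. The labels and probabilities are invented for illustration, and the clipping constant `eps` is an assumption rather than a value taken from Keras.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Hypothetical labels and sigmoid outputs for four reviews.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
print('loss:', binary_crossentropy(y_true, y_prob))
print('accuracy:', binary_accuracy(y_true, y_prob))
```

A lower loss means the predicted probabilities sit closer to the true labels, while accuracy only counts which side of the threshold each prediction falls on.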