LintCode hosts a dozen or so small Kaggle-style projects that make excellent first exercises in machine learning, so let's get hands-on with this spam message classification problem!
(The original plan was to do it all in Keras for an easy start, but as you'll see below, a simple scikit-learn model ends up winning.)
Problem URL: https://www.lintcode.com/ai/spam-message-classification
Problem description:
The problem provides a dataset of 5,574 English SMS messages; each message consists of a few sentences of varying length. Every message is labeled as spam or ham. Train a classifier on the training set that predicts whether a message is spam.
Loading the data turned out to be the fiddly part: pd.read_csv kept hitting bugs and never read the full file, so in the end I fell back to the csv module. The reader below simply walks the file line by line; it yields a training set of shape (5572,) and a test set of shape (1115,). There isn't much to say about the function itself.
import csv
import numpy as np

def read_data(file):
    # First pass: count the rows so the label array can be pre-allocated
    train_data = csv.reader(open(file, encoding="utf-8"))
    lines = 0
    for r in train_data:
        lines += 1
    train_data_label = np.zeros([lines - 1, ])  # minus the header row
    train_data_content = []
    # Second pass: collect labels and message contents
    train_data = csv.reader(open(file, encoding="utf-8"))
    i = 0
    for data in train_data:
        # Skip the header row ("Label" in train.csv, "SmsId" in test.csv)
        if data[0] == "Label" or data[0] == "SmsId":
            continue
        if data[0] == "ham":
            train_data_label[i] = 0
        if data[0] == "spam":
            train_data_label[i] = 1
        train_data_content.append(data[1])
        i += 1
    print(train_data_label.shape, len(train_data_content))
    return train_data_label, train_data_content
# Load the data
train_y,train_data_content = read_data("./垃圾短信分类data/train.csv")
_,test_data_content = read_data("./垃圾短信分类data/test.csv")
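If you would rather keep trying pandas, a loader along these lines usually works once you pass an explicit encoding (a sketch under my own assumptions: that the encoding was the culprit, and that the column layout matches the CSVs described above, with the message text in the second column):

import pandas as pd

def read_data_pd(file):
    # Hypothetical pandas-based loader, not the one used in this post
    df = pd.read_csv(file, encoding="utf-8")
    if "Label" in df.columns:
        labels = (df["Label"] == "spam").astype(int).values
    else:
        labels = None  # test.csv carries SmsId instead of labels
    contents = df.iloc[:, 1].astype(str).tolist()
    return labels, contents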
Next comes a first pass of data cleaning: lowercase every word, strip everything that isn't an English letter, expand common contractions (English has plenty of them), and reduce every word to its stem.
import re
import nltk
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt') on first use

def clean_text(comment_text):
    comment_list = []
    stemmer = nltk.stem.snowball.EnglishStemmer()  # English stemmer
    for text in comment_text:
        # Lowercase everything
        text = text.lower()
        # Remove everything except letters and apostrophes
        text = re.sub(r"[^a-z']", " ", text)
        # Expand common contractions; the specific n't patterns must come
        # before the generic n't rule, or they would never match
        text = re.sub(r"what's", "what is ", text)
        text = re.sub(r"can't", "can not ", text)
        text = re.sub(r"cannot", "can not ", text)
        text = re.sub(r"ain't", " are not ", text)
        text = re.sub(r"aren't", " are not ", text)
        text = re.sub(r"couldn't", " could not ", text)
        text = re.sub(r"didn't", " did not ", text)
        text = re.sub(r"doesn't", " does not ", text)
        text = re.sub(r"don't", " do not ", text)
        text = re.sub(r"hadn't", " had not ", text)
        text = re.sub(r"hasn't", " has not ", text)
        text = re.sub(r"n't", " not ", text)
        text = re.sub(r"'ve", " have ", text)
        text = re.sub(r"'m", " am ", text)
        text = re.sub(r"'re", " are ", text)
        text = re.sub(r"'d", " would ", text)
        text = re.sub(r"'ll", " will ", text)
        # Drop the possessive 's last, so it can't eat "what's" above
        text = re.sub(r"'s", " ", text)
        # Stem every token
        new_text = ""
        for word in word_tokenize(text):
            new_text = new_text + " " + stemmer.stem(word)
        # Put the cleaned message back
        comment_list.append(new_text)
    return comment_list
train_data_content = clean_text(train_data_content)
test_data_content = clean_text(test_data_content)
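A quick sanity check of clean_text on a made-up message (illustrative only; the exact tokens depend on your NLTK version):

sample = ["I can't believe you're giving away FREE tickets!!!"]
print(clean_text(sample))
# Prints something like: [' i can not believ you are give away free ticket']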
Here TF-IDF turns each message into a vector over the 5,000 most frequent words, so every message becomes a 5000-dimensional vector. No lengthy explanation needed; the code speaks for itself:
# Compute TF-IDF features for the data
from sklearn.feature_extraction.text import TfidfVectorizer

all_comment_list = list(train_data_content) + list(test_data_content)
text_vector = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode', token_pattern=r'\w{1,}',
                              max_features=5000, ngram_range=(1, 1), analyzer='word')
text_vector.fit(all_comment_list)
train_x = text_vector.transform(train_data_content)
test_x = text_vector.transform(test_data_content)
train_x = train_x.toarray()
test_x = test_x.toarray()
print(train_x.shape,test_x.shape,type(train_x)) # (5572, 5000) (1115, 5000)
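To see what the vectorizer actually learned, you can peek at its vocabulary. A small sketch (get_feature_names_out exists in scikit-learn >= 1.0; on older versions use get_feature_names):

# Inspect a few of the 5000 learned terms
terms = text_vector.get_feature_names_out()
print(len(terms))   # 5000
print(terms[:10])   # first few terms in alphabetical order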
My first attempt was a neural network, and it performed terribly; it looked like vanishing gradients, and no amount of training helped. Fed up, I went back to basics with sklearn's LogisticRegression, and after a small tweak to C it scored 100% accuracy. Would you believe it?
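For reference, here is a minimal sketch of the kind of Keras MLP that was tried first (a hypothetical reconstruction: the original network code isn't shown, and the layer sizes are my own guesses):

from keras.models import Sequential
from keras.layers import Dense, Dropout

# Hypothetical MLP over the 5000-dim TF-IDF vectors; not the author's exact model
model = Sequential([
    Dense(128, activation='relu', input_shape=(5000,)),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(train_x, train_y, epochs=10, batch_size=32, validation_split=0.1)

The logistic-regression route below turned out to be both simpler and stronger.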
# Build the model
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=100.0)
clf.fit(train_x, train_y)
train_scores = clf.score(train_x, train_y)
print(train_scores)
import pandas as pd

test_y = clf.predict_proba(test_x)
# Predict the answers
print(test_y.shape)
answer = pd.read_csv("./垃圾短信分类data/sampleSubmission.csv")
for i in range(test_y.shape[0]):
    # Column 0 is the predicted probability of "ham"
    pred = test_y[i, 0]
    if pred < 0.5:
        answer.loc[i, "Label"] = "spam"
    else:
        answer.loc[i, "Label"] = "ham"
answer.to_csv("./垃圾短信分类data/submission.csv", index=False)  # don't save the index column
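A perfect training score with C=100.0 is a red flag for overfitting, so before trusting it, a quick cross-validation check is worthwhile. A minimal sketch (the 5-fold split is my own arbitrary choice):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy on the training set; if this stays
# near the training accuracy, the model is not just memorizing
scores = cross_val_score(LogisticRegression(C=100.0), train_x, train_y, cv=5)
print(scores.mean(), scores.std())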
Final result: a score of 1.000 (I was stunned; a 100% accurate submission really does exist...).
Rank: 3/108 (my best finish ever).
The code has been published; see GitHub.