Solving AI problems on lintcode: spam SMS classification

lintcode has a dozen or so small Kaggle-style projects that are a great way to get started with deep learning. Let's take on this spam message classifier problem!

(The plan was to use Keras throughout for an easy start, though as you'll see, the final model ends up being plain scikit-learn!)

Problem page: lintcode.com/ai/spam-me

Problem description:

The problem provides a dataset of 5,574 English SMS messages, each consisting of a few sentences of varying length. Every message is labeled as spam or not spam. The task is to train a classifier on the training set that predicts whether a message is spam.

1. Downloading and reading the data

This step was surprisingly fiddly: pd.read_csv kept hitting bugs and never read the complete data, so in the end I fell back to the csv module. The reading code below just walks the file line by line, nothing fancy, and yields a train set of shape (5572,) and a test set of shape (1115,).

import csv
import numpy as np

def read_data(file):
    # First pass: count the rows so we can pre-allocate the label array
    with open(file, encoding="utf-8") as f:
        lines = sum(1 for _ in csv.reader(f))
    train_data_label = np.zeros([lines - 1, ])
    train_data_content = []
    # Second pass: collect labels and message bodies
    with open(file, encoding="utf-8") as f:
        i = 0
        for data in csv.reader(f):
            # Skip the header row ("Label" in train.csv, "SmsId" in test.csv)
            if data[0] == "Label" or data[0] == "SmsId":
                continue
            if data[0] == "ham":
                train_data_label[i] = 0
            if data[0] == "spam":
                train_data_label[i] = 1
            train_data_content.append(data[1])
            i += 1
    print(train_data_label.shape, len(train_data_content))
    return train_data_label, train_data_content


# Load the data
train_y,train_data_content = read_data("./垃圾短信分类data/train.csv")
_,test_data_content = read_data("./垃圾短信分类data/test.csv")
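
For reference, pandas can usually handle this file too once quoting is left to the parser (quoted commas inside messages are what typically break naive CSV reading). Here is a minimal alternative sketch; the Label/SMS column layout is assumed from the csv-module version above:

import pandas as pd

def read_data_pd(file):
    # Sketch only: assumes column 0 holds the label ("ham"/"spam" in train.csv,
    # an id in test.csv) and column 1 holds the message text.
    df = pd.read_csv(file, encoding="utf-8")
    labels = (df.iloc[:, 0] == "spam").astype(float).values  # 0.0 = ham, 1.0 = spam
    contents = df.iloc[:, 1].astype(str).tolist()
    return labels, contents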

2. Cleaning the data

A first cleaning pass: lowercase every word, strip everything that isn't an English letter, expand the common English contractions, and reduce each word to its stem.

import re
import nltk
from nltk.tokenize import word_tokenize

def clean_text(comment_text):
    comment_list = []
    stemmer = nltk.stem.snowball.EnglishStemmer()  # English stemmer
    for text in comment_text:
        # Lowercase everything
        text = text.lower()
        # Keep only lowercase letters and apostrophes (digits and symbols become spaces)
        text = re.sub(r"[^a-z']", " ", text)
        # Expand common contractions
        text = re.sub(r"what's", "what is ", text)
        text = re.sub(r"\'s", " ", text)
        text = re.sub(r"\'ve", " have ", text)
        text = re.sub(r"can't", "can not ", text)
        text = re.sub(r"cannot", "can not ", text)
        text = re.sub(r"n't", " not ", text)
        text = re.sub(r"\'m", " am ", text)
        text = re.sub(r"\'re", " are ", text)
        text = re.sub(r"\'d", " would ", text)
        text = re.sub(r"ain\'t", " are not ", text)
        text = re.sub(r"aren't", " are not ", text)
        text = re.sub(r"couldn\'t", " can not ", text)
        text = re.sub(r"didn't", " do not ", text)
        text = re.sub(r"doesn't", " do not ", text)
        text = re.sub(r"don't", " do not ", text)
        text = re.sub(r"hadn't", " have not ", text)
        text = re.sub(r"hasn't", " have not ", text)
        text = re.sub(r"\'ll", " will ", text)
        # Stem each token and rebuild the text
        new_text = ""
        for word in word_tokenize(text):
            new_text = new_text + " " + stemmer.stem(word)
        comment_list.append(new_text)
    return comment_list

train_data_content = clean_text(train_data_content)
test_data_content = clean_text(test_data_content)
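
A quick sanity check on a made-up message (hypothetical example; the exact output depends on your tokenizer and stemmer versions):

print(clean_text(["I can't believe you've WON a free prize!!!"]))
# roughly: [' i can not believ you have won a free prize']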

3. Computing TF-IDF features

Here TF-IDF turns each message into a vector over the 5,000 most common terms, so every message becomes a 5000-dimensional vector. Roughly speaking, each component weighs how often a term appears in the message (term frequency) against how common it is across the whole corpus (inverse document frequency), so distinctive words score high. The code:

# Compute TF-IDF features for the data
from sklearn.feature_extraction.text import TfidfVectorizer

all_comment_list = list(train_data_content) + list(test_data_content)
text_vector = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode', token_pattern=r'\w{1,}',
                              max_features=5000, ngram_range=(1, 1), analyzer='word')
text_vector.fit(all_comment_list)
train_x = text_vector.transform(train_data_content)
test_x = text_vector.transform(test_data_content)
train_x = train_x.toarray()
test_x = test_x.toarray()
print(train_x.shape,test_x.shape,type(train_x)) # (5572, 5000) (1115, 5000) 
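
To see what the vectorizer actually learned, you can peek at a few vocabulary terms and their IDF weights (sketch; get_feature_names_out is the scikit-learn >= 1.0 name, older versions use get_feature_names):

print(text_vector.get_feature_names_out()[:5])  # first few vocabulary terms
print(text_vector.idf_[:5])                     # their inverse-document-frequency weights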

4. Modeling and prediction

My first attempt was a neural network, and it performed terribly; it looked like vanishing gradients, and no amount of training helped. Eventually I gave up and went back to basics with sklearn's LogisticRegression. After a small tweak to C, it hit 100% accuracy. Can you believe it?
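
For the record, the abandoned network looked roughly like the sketch below (hypothetical layer sizes; the exact architecture wasn't recorded):

from keras.models import Sequential
from keras.layers import Dense

# Hypothetical reconstruction of the MLP that failed to train well
model = Sequential([
    Dense(64, activation="relu", input_shape=(5000,)),
    Dense(32, activation="relu"),
    Dense(1, activation="sigmoid"),   # binary spam/ham output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_x, train_y, epochs=10, batch_size=64, validation_split=0.1)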

# Build the model
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=100.0)
clf.fit(train_x, train_y)
train_scores = clf.score(train_x, train_y)
print(train_scores)
test_y = clf.predict_proba(test_x)  # column 0 = P(ham), column 1 = P(spam)

# Predict and write out the answers
import pandas as pd

print(test_y.shape)
answer = pd.read_csv("./垃圾短信分类data/sampleSubmission.csv")
for i in range(test_y.shape[0]):
    pred = test_y[i, 0]  # probability of the "ham" class
    if pred < 0.5:
        answer.loc[i, "Label"] = "spam"
    else:
        answer.loc[i, "Label"] = "ham"
answer.to_csv("./垃圾短信分类data/submission.csv", index=False)  # don't save the index column
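
As a sanity check on that suspiciously perfect training score (not part of the original run), 5-fold cross-validation on the training set is a quick way to confirm the model isn't just memorizing:

from sklearn.model_selection import cross_val_score

# Hypothetical check: average accuracy across 5 folds of the training data
scores = cross_val_score(LogisticRegression(C=100.0), train_x, train_y, cv=5)
print(scores.mean())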

Final result: a score of 1.000 (I was stunned; 100% accuracy really does happen.....)

Rank: 3/108 (my best ever)

The code has been published; see GitHub.
