Natural Language Processing Tools: NLTK

Learning objectives:

1. Know the steps of a machine learning workflow
2. Know how to use NLTK

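Learning content:

Steps for using NLTK:

  1. Read the data
  2. Clean the data
  3. Convert case
  4. Remove stop words
  5. Stemming
  6. Rejoin the tokens into a string
  7. Sparse matrix
  8. Maximum-feature filtering
  9. Build the bag-of-words model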
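
To make the steps concrete before the full listing, here is a rough, standard-library-only sketch of steps 2 to 4 and 6 to 9 on three toy sentences. It is an illustration under stated assumptions, not the NLTK/scikit-learn implementation: `STOP_WORDS` is a hypothetical mini list standing in for `stopwords.words('english')`, stemming (step 5) is skipped because it needs NLTK's `PorterStemmer`, and `preprocess` and `bag_of_words` are made-up helper names that only approximate what `CountVectorizer(max_features=...)` does.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list (assumption); the full listing uses
# stopwords.words('english') from NLTK instead.
STOP_WORDS = {"this", "the", "is", "a", "was", "and", "i"}


def preprocess(text, stop_words=STOP_WORDS):
    """Steps 2-4 and 6: clean, lowercase, drop stop words, rejoin (no stemming)."""
    text = re.sub(r"[^a-zA-Z]", " ", text)               # step 2: non-letters -> spaces
    tokens = text.lower().split()                        # step 3: lowercase, then tokenize
    tokens = [t for t in tokens if t not in stop_words]  # step 4: remove stop words
    return " ".join(tokens)                              # step 6: back to one string


def bag_of_words(corpus, max_features):
    """Steps 7-9: keep the most frequent terms and count them per document."""
    totals = Counter(word for doc in corpus for word in doc.split())
    # Most frequent first; ties broken alphabetically (a choice made for this sketch).
    top_terms = sorted(totals, key=lambda w: (-totals[w], w))[:max_features]
    vocabulary = sorted(top_terms)  # one column per kept term, in alphabetical order
    matrix = [[doc.split().count(term) for term in vocabulary] for doc in corpus]
    return vocabulary, matrix


reviews = ["Wow... Loved this place.", "Crust is not good.", "The place was good!"]
corpus = [preprocess(r) for r in reviews]
vocabulary, matrix = bag_of_words(corpus, max_features=3)

print(corpus)      # ['wow loved place', 'crust not good', 'place good']
print(vocabulary)  # ['crust', 'good', 'place']
print(matrix)      # [[0, 0, 1], [1, 1, 0], [0, 1, 1]]
```

Each row of `matrix` is one review and each column is one vocabulary term. Most entries are zero once the vocabulary grows, which is why the result is described as a sparse matrix.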

Complete code:

import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Download the stop-word corpus (only needed the first time)
nltk.download('stopwords')

# 1. Read the data: import the dataset (tab-separated; quoting=3 ignores quote characters)
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)
# head() shows the first 5 rows by default
print(dataset.head())
print(dataset['Review'][0])  # Wow... Loved this place.

# Build the stop-word set and the stemmer once, instead of on every iteration
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

corpus = []
# The trailing comments show the output for the first review
for raw_review in dataset['Review']:
    # 2. Clean the data: replace every non-letter character with a space
    review = re.sub('[^a-zA-Z]', ' ', raw_review)
    print(review)  # Wow    Loved this place
    # 3. Convert case: lowercase all characters
    review = review.lower()
    print(review)  # wow    loved this place
    # 4. Remove stop words
    review = review.split()  # turn the review into a list of tokens
    print(review)  # ['wow', 'loved', 'this', 'place']
    review = [word for word in review if word not in stop_words]
    print(review)  # ['wow', 'loved', 'place']
    # 5. Stemming: reduce each remaining word to its stem
    review = [ps.stem(word) for word in review]
    print(review)  # ['wow', 'love', 'place']
    # 6. Rejoin the tokens into a string, separated by spaces
    review = ' '.join(review)
    print(review)  # wow love place
    print('-' * 100)
    corpus.append(review)

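# 7 and 9. Sparse matrix and bag-of-words model: text feature extraction
# 8. Maximum-feature filtering: keep only the 1500 most frequent terms
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()
print(X)
print(X.shape)
# Target values (second column of the dataset)
y = dataset.iloc[:, 1].values
# Split into training and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)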

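# Naive Bayes classifier
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = classifier.predict(X_test)

# Model evaluation: confusion matrix (rows are true labels, columns are predicted labels)
cm = confusion_matrix(y_test, y_pred)
print(cm)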
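
The printed confusion matrix can be turned into summary metrics. The following is a minimal sketch, assuming binary labels 0 and 1, for which scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`. The helper name `summarize` and the numbers in `example_cm` are made up for illustration; they are not results from this dataset.

```python
def summarize(cm):
    """Return (accuracy, precision, recall) for a 2x2 confusion matrix.

    Assumes the layout [[TN, FP], [FN, TP]]: rows are true labels,
    columns are predicted labels, with label 1 as the positive class.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # guard against no positive predictions
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # guard against no positive samples
    return accuracy, precision, recall


# Made-up counts for illustration only, not output from the listing above.
example_cm = [[40, 10], [5, 45]]
accuracy, precision, recall = summarize(example_cm)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

With the real pipeline, `summarize(cm.tolist())` would accept the array returned by `confusion_matrix`.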

Data file

Link: https://pan.baidu.com/s/1mdUTxdldjp-tN9_oXY-mJA
Extraction code: es96
