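This article examines two NLP techniques, sentiment analysis and topic modeling, and shows how to extract sentiment polarity and latent topics from text. It includes example code with explanations, runnable in a Python environment with scikit-learn, Gensim, and NLTK installed.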
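Sentiment Analysis
1.1 Concepts and overview of methods
1.2 Traditional machine learning methods
1.3 Deep learning and pretrained models
1.4 Code example: sentiment classification with machine learning
Topic Modeling
2.1 Concepts and the basic principle of LDA
2.2 Topic modeling methods beyond LDA
2.3 Code example: LDA topic modeling with Gensim
Summary and further directions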
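Sentiment analysis aims to determine the sentiment polarity of a text, for example whether a product review is positive, negative, or neutral.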
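Classic pipeline: convert text into features (for example TF-IDF), train a classifier (for example logistic regression), then evaluate on held-out data.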
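Pros: simple to implement and easy to interpret.
Cons: struggles to capture deeper semantics; performance is limited by the quality of feature engineering.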
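Since traditional methods depend so heavily on feature engineering, it helps to see what a TF-IDF feature actually is. The following is a minimal pure-Python sketch, assuming whitespace tokenization, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2 normalization, which is intended to mirror the defaults of sklearn's TfidfVectorizer. The helper name tfidf_vectors and the toy documents are illustrative and not part of the original example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Raw term count multiplied by the smoothed IDF.
        weights = {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in tf.items()
        }
        # L2-normalize so each document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Toy documents, made up for illustration.
docs = ["good movie", "bad movie", "good good plot"]
vecs = tfidf_vectors(docs)
for doc, vec in zip(docs, vecs):
    print(doc, "->", {t: round(w, 3) for t, w in vec.items()})
```

In the second toy document, "bad" receives a higher weight than "movie" because it appears in fewer documents, which is the property a linear classifier can then exploit.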
The following example uses sklearn to show a simplified pipeline:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# 1) Toy data
corpus = [
    ("I love this movie. It's fantastic!", "positive"),
    ("Absolutely terrible. Waste of time.", "negative"),
    ("Pretty good overall, but not the best.", "positive"),
    ("I hate this product, it's awful!", "negative"),
    ("The design is beautiful and I am satisfied.", "positive"),
    ("It's okay, not too bad, not too good.", "positive"),  # a "neutral" example, treated as positive here
    ("Horrible experience, I'm disappointed.", "negative"),
    ("Could be better, I'm not fully happy with it.", "negative")
]
texts = [item[0] for item in corpus]
labels = [item[1] for item in corpus]
# 2) Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)
# 3) TF-IDF vectorization
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
# 4) Train logistic regression
clf = LogisticRegression()
clf.fit(X_train_vec, y_train)
# 5) Test and evaluate
y_pred = clf.predict(X_test_vec)
print("Predictions:", y_pred)
print("True labels:", y_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
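Running this prints classification metrics such as accuracy, precision, and recall. With only 8 samples and test_size=0.25, the test set holds just 2 examples, so the numbers are illustrative only.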
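For reference, the metrics in the classification report can be computed by hand. The sketch below uses made-up labels (not the output of the model above) and a hypothetical helper, precision_recall, to show how accuracy, precision, and recall for one class are defined.

```python
def precision_recall(y_true, y_pred, positive="positive"):
    """Precision and recall for one class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels for illustration only.
y_true = ["positive", "negative", "positive", "negative", "positive"]
y_pred = ["positive", "positive", "positive", "negative", "negative"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints: accuracy=0.60 precision=0.67 recall=0.67
```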
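Topic modeling aims to discover latent topics in large collections of unlabeled text.
The following example trains a simple LDA model with the Gensim library to demonstrate the workflow.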
# pip install gensim nltk
import gensim
from gensim import corpora
import nltk
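# Uncomment to download the NLTK resources if they are missing
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('punkt_tab')  # newer NLTK releases may require this instead of 'punkt'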
from nltk.corpus import stopwords
documents = [
    "I love to watch football games. Football is a great sport!",
    "The bank is closing soon, check your bank account quickly.",
    "I prefer basketball to football, it is more dynamic.",
    "The investment bank raised interest rates yesterday.",
    "He watches basketball and football every weekend.",
    "Financial institutions are impacted by interest rate changes."
]
stop_words = set(stopwords.words('english'))
def tokenize_and_clean(text):
    # Lowercase, tokenize, then keep alphabetic tokens that are not stop words
    tokens = nltk.word_tokenize(text.lower())
    filtered = [w for w in tokens if w.isalpha() and w not in stop_words]
    return filtered
processed_docs = [tokenize_and_clean(doc) for doc in documents]
# Build the dictionary
dictionary = corpora.Dictionary(processed_docs)
dictionary.filter_extremes(no_below=1, no_above=0.9)
# Convert documents to bag-of-words vectors
corpus_bow = [dictionary.doc2bow(doc) for doc in processed_docs]
from gensim.models.ldamodel import LdaModel
num_topics = 2
lda_model = LdaModel(
    corpus=corpus_bow,
    id2word=dictionary,
    num_topics=num_topics,
    random_state=42,
    passes=10,
    alpha='auto'
)
for i in range(num_topics):
    print(f"Topic {i}:")
    print(lda_model.print_topic(i))
    print("------")
# Infer the topic distribution of a new document
new_doc = "The interest rate for bank deposits is increasing."
bow_new_doc = dictionary.doc2bow(tokenize_and_clean(new_doc))
topic_probs = lda_model.get_document_topics(bow_new_doc)
print("\nTopic distribution of the new document:", topic_probs)
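get_document_topics returns a list of (topic_id, probability) pairs. The sketch below shows one way such a result might be post-processed to pick the dominant topic. The numbers are made up for illustration and are not actual model output; dominant_topic and its min_prob threshold are hypothetical.

```python
def dominant_topic(topic_probs, min_prob=0.5):
    """Return the most likely (topic_id, prob) pair, or None if it is below min_prob."""
    if not topic_probs:
        return None
    best = max(topic_probs, key=lambda pair: pair[1])
    return best if best[1] >= min_prob else None

# Made-up distribution in the same shape as the LDA output.
example_probs = [(0, 0.18), (1, 0.82)]
print(dominant_topic(example_probs))                          # (1, 0.82)
print(dominant_topic([(0, 0.45), (1, 0.55)], min_prob=0.6))   # None
```

Returning None for low-confidence cases keeps documents that straddle several topics from being forced into a single one.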
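num_topics=2 means we want the model to split the corpus into 2 topics.
With the examples and explanations above, you should now have a systematic understanding of how to extract sentiment polarity and latent topics from text. In applications such as public-opinion monitoring, product review analysis, media clustering, and academic literature organization, sentiment analysis and topic modeling can surface patterns in text data that are hard to see by reading alone.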