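This experiment uses the AAAI 2014 conference paper data. The task is to implement or call an unsupervised clustering algorithm and to become familiar with clustering methods.
Countless academic conferences, large and small, are held around the world every year, and they publish a very large number of papers. A single large conference in computer science can publish several hundred papers spanning many research directions. Clustering papers by topic and content helps people find the papers they need efficiently. The data for this case study comes from the roughly 400 papers published at AAAI 2014 and is publicly provided by UCI. It includes each paper's title, authors, keywords, and abstract, among other fields. The goal is to construct feature vectors from this information that represent the papers well, and then to design and implement, or call, a clustering algorithm to cluster them. Finally, the clustering result can be inspected to see what kinds of papers fall into each cluster and whether the clusters correspond to recognizable topics.
Note: groups and topics cannot be regarded strictly as labels either, because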
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
import nltk
import sklearn.manifold # import the submodule explicitly so that sklearn.manifold.TSNE resolves
import seaborn as sns # plotting
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy import sparse # sparse matrices
RANDOM_STATE = 2023
data_df = pd.read_csv('./data/[UCI] AAAI-14 Accepted Papers - Papers.csv') # Read the CSV file into a pandas DataFrame
data_df.head(5)
 | title | authors | groups | keywords | topics | abstract |
---|---|---|---|---|---|---|
0 | Kernelized Bayesian Transfer Learning | Mehmet Gönen and Adam A. Margolin | Novel Machine Learning Algorithms (NMLA) | cross-domain learning\ndomain adaptation\nkern... | APP: Biomedical / Bioinformatics\nNMLA: Bayesi... | Transfer learning considers related but distin... |
1 | "Source Free" Transfer Learning for Text Class... | Zhongqi Lu, Yin Zhu, Sinno Pan, Evan Xiang, Yu... | AI and the Web (AIW)\nNovel Machine Learning A... | Transfer Learning\nAuxiliary Data Retrieval\nT... | AIW: Knowledge acquisition from the web\nAIW: ... | Transfer learning uses relevant auxiliary data... |
2 | A Generalization of Probabilistic Serial to Ra... | Haris Aziz and Paul Stursberg | Game Theory and Economic Paradigms (GTEP) | social choice theory\nvoting\nfair division\ns... | GTEP: Game Theory\nGTEP: Social Choice / Voting | The probabilistic serial (PS) rule is one of t... |
3 | Lifetime Lexical Variation in Social Media | Liao Lizi, Jing Jiang, Ying Ding, Heyan Huang ... | NLP and Text Mining (NLPTM) | Generative model\nSocial Networks\nAge Prediction | AIW: Web personalization and user modeling\nNL... | As the rapid growth of online social media att... |
4 | Hybrid Singular Value Thresholding for Tensor ... | Xiaoqin Zhang, Zhengyuan Zhou, Di Wang and Yi Ma | Knowledge Representation and Reasoning (KRR)\n... | tensor completion\nlow-rank recovery\nhybrid s... | KRR: Knowledge Representation (General/Other)\... | In this paper, we study the low-rank tensor co... |
Inspect the DataFrame
Data summary:
data_df.info()
RangeIndex: 398 entries, 0 to 397
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 398 non-null object
1 authors 398 non-null object
2 groups 396 non-null object
3 keywords 398 non-null object
4 topics 394 non-null object
5 abstract 398 non-null object
dtypes: object(6)
memory usage: 18.8+ KB
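The summary above shows that data_df contains missing values (groups has 396 non-null entries and topics has 394, out of 398 rows), so they need to be handled.
# stack() turns the boolean DataFrame into a Series indexed by (row, column); [lambda x: x] keeps only the True entries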
data_df.isnull().stack()[lambda x: x]
211 groups True
340 groups True
344 topics True
364 topics True
365 topics True
388 topics True
dtype: bool
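The stack-and-filter idiom above can be tried on a small frame first. The following is a minimal sketch on made-up data: `toy_df` and its values are illustrative assumptions, and only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame standing in for data_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "groups": ["GTEP", None, "KRR"],
    "topics": [None, "NMLA", None],
})

# Number of missing values in each column.
missing_per_column = toy_df.isnull().sum()

# (row, column) pairs where a value is missing, using the same idiom as above.
mask = toy_df.isnull().stack()
missing_cells = mask[mask].index.tolist()

print(missing_per_column.to_dict())
print(missing_cells)
```

The per-column count is a quick overview, while the stacked form pinpoints the affected rows.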
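Fill the missing values with empty strings.
data_df = data_df.fillna('') # Replace missing values with empty strings
Concatenate the different fields of each paper into a single string, then vectorize the text with a TF-IDF model.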
paper_df = data_df['title']+' '+data_df['authors']+' '+data_df['groups']+' '\
+data_df['keywords']+' '+data_df['topics']+' '+data_df['abstract']
paper_df
Output:
0 Kernelized Bayesian Transfer Learning Mehmet G...
1 "Source Free" Transfer Learning for Text Class...
2 A Generalization of Probabilistic Serial to Ra...
3 Lifetime Lexical Variation in Social Media Lia...
4 Hybrid Singular Value Thresholding for Tensor ...
...
393 Mapping Users Across Networks by Manifold Alig...
394 Compact Aspect Embedding For Diversified Query...
395 Contraction and Revision over DL-Lite TBoxes Z...
396 Zero Pronoun Resolution as Ranking Chen Chen a...
397 Supervised Transfer Sparse Coding Maruan Al-Sh...
Length: 398, dtype: object
vectorizer = TfidfVectorizer(max_df=0.9, min_df=10)
X_simple = vectorizer.fit_transform(paper_df)
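To make the effect of `min_df` and `max_df` concrete, here is a small sketch on a made-up corpus. The documents and the value `min_df=2` are assumptions chosen so that the pruning is visible on four documents; the notebook itself uses `min_df=10` on 398 papers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus, for illustration only.
toy_docs = [
    "transfer learning for text classification",
    "transfer learning with kernels",
    "voting rules and game theory",
    "game theory for auctions",
]

# min_df=2: keep only terms that appear in at least 2 documents.
# max_df=0.9: drop terms that appear in more than 90% of the documents.
vec = TfidfVectorizer(max_df=0.9, min_df=2)
X_toy = vec.fit_transform(toy_docs)

# Terms that survived the document-frequency filters.
vocab = sorted(vec.vocabulary_)
print(vocab)
print(X_toy.shape)
```

Terms that occur in only one document are dropped, so the matrix has one column per surviving term. Note that the default tokenizer does not remove stop words, so a word such as "for" is kept when it is frequent enough.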
def author_tokenizer(text):
    # Split the author string on commas or on the word "and"
    authors = re.split(r"\s+and\s+|\s*,\s*", text)
    return authors
authors = data_df['authors'][1]
author_split = author_tokenizer(authors)
print(authors,'\n',author_split)
Output:
Zhongqi Lu, Yin Zhu, Sinno Pan, Evan Xiang, Yujing Wang and Qiang Yang
['Zhongqi Lu', 'Yin Zhu', 'Sinno Pan', 'Evan Xiang', 'Yujing Wang', 'Qiang Yang']
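def text_tokenizer(text):
    # Tokenize into words (the NLTK tokenizer and stopwords data must be downloaded beforehand with nltk.download)
    words = nltk.tokenize.word_tokenize(text)
    # Remove English stop words
    stop_words = set(nltk.corpus.stopwords.words('english'))
    words = [word for word in words if word.lower() not in stop_words]
    # Stem each word with the Porter stemmer
    stemmer = nltk.stem.PorterStemmer()
    words = [stemmer.stem(word) for word in words]
    return words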
abstracts=data_df['abstract'][1]
abstracts_split = text_tokenizer(abstracts)
print(abstracts,'\n',abstracts_split)
Output:
Transfer learning uses relevant auxiliary data to help the learning task in a target domain where labeled data are usually insufficient to train an accurate model. Given appropriate auxiliary data, researchers have proposed many transfer learning models. How to find such auxiliary data, however, is of little research in the past. In this paper, we focus on this auxiliary data retrieval problem, and propose a transfer learning framework that effectively selects helpful auxiliary data from an open knowledge space (e.g. the World Wide Web). Because there is no need of manually selecting auxiliary data for different target domain tasks, we call our framework Source Free Transfer Learning (SFTL). For each target domain task, SFTL framework iteratively queries for the helpful auxiliary data based on the learned model and then updates the model using the retrieved auxiliary data. We highlight the automatic constructions of queries and the robustness of the SFTL framework. Our experiments on the 20 NewsGroup dataset and the Google search snippets dataset suggest that the new framework is capable to have the comparable performance to those state-of-the-art methods with dedicated selections of auxiliary data.
['transfer', 'learn', 'use', 'relev', 'auxiliari', 'data', 'help', 'learn', 'task', 'target', 'domain', 'label', 'data', 'usual', 'insuffici', 'train', 'accur', 'model', '.', 'given', 'appropri', 'auxiliari', 'data', ',', 'research', 'propos', 'mani', 'transfer', 'learn', 'model', '.', 'find', 'auxiliari', 'data', ',', 'howev', ',', 'littl', 'research', 'past', '.', 'paper', ',', 'focu', 'auxiliari', 'data', 'retriev', 'problem', ',', 'propos', 'transfer', 'learn', 'framework', 'effect', 'select', 'help', 'auxiliari', 'data', 'open', 'knowledg', 'space', '(', 'e.g', '.', 'world', 'wide', 'web', ')', '.', 'need', 'manual', 'select', 'auxiliari', 'data', 'differ', 'target', 'domain', 'task', ',', 'call', 'framework', 'sourc', 'free', 'transfer', 'learn', '(', 'sftl', ')', '.', 'target', 'domain', 'task', ',', 'sftl', 'framework', 'iter', 'queri', 'help', 'auxiliari', 'data', 'base', 'learn', 'model', 'updat', 'model', 'use', 'retriev', 'auxiliari', 'data', '.', 'highlight', 'automat', 'construct', 'queri', 'robust', 'sftl', 'framework', '.', 'experi', '20', 'newsgroup', 'dataset', 'googl', 'search', 'snippet', 'dataset', 'suggest', 'new', 'framework', 'capabl', 'compar', 'perform', 'state-of-the-art', 'method', 'dedic', 'select', 'auxiliari', 'data', '.']
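The token list above still contains pure punctuation tokens such as '.', ',' and '(', which may end up as features when the tokenizer is passed to TfidfVectorizer. One possible refinement, not part of the original notebook, is to drop tokens that contain no letter or digit. The helper name `drop_punctuation_tokens` is hypothetical.

```python
def drop_punctuation_tokens(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]


# A few tokens copied from the output above.
sample_tokens = ['model', '.', 'given', ',', '(', 'e.g', '.', 'state-of-the-art', ')', '20']
cleaned = drop_punctuation_tokens(sample_tokens)
print(cleaned)
```

Such a filter could be applied to `words` at the end of `text_tokenizer`. Whether it improves the clusters on this dataset would need to be checked.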
Column names:
data_df.columns
Output:
Index(['title', 'authors', 'groups', 'keywords', 'topics', 'abstract'], dtype='object')
Build the TF-IDF matrices:
vectorizer_authour = TfidfVectorizer(tokenizer = author_tokenizer)
vectorizer_text = TfidfVectorizer(tokenizer = text_tokenizer)
X_authours = vectorizer_authour.fit_transform(data_df['authors'].tolist())
X_title = vectorizer_text.fit_transform(data_df['title'].tolist())
X_groups = vectorizer_text.fit_transform(data_df['groups'].tolist())
X_keywords = vectorizer_text.fit_transform(data_df['keywords'].tolist())
X_topics = vectorizer_text.fit_transform(data_df['topics'].tolist())
vectorizer_texts = TfidfVectorizer(max_df=0.9, min_df=5, tokenizer = text_tokenizer)
X_abstract = vectorizer_texts.fit_transform(data_df['abstract'].tolist())
print(f'X_title:{X_title.shape}')
print(f'X_authours:{X_authours.shape}')
print(f'X_groups:{X_groups.shape}')
print(f'X_keywords:{X_keywords.shape}')
print(f'X_topics:{X_topics.shape}')
print(f'X_abstract:{X_abstract.shape}')
Output:
X_title:(398, 1124)
X_authours:(398, 1105)
X_groups:(398, 64)
X_keywords:(398, 1051)
X_topics:(398, 305)
X_abstract:(398, 1042)
Concatenate the sparse matrices
X_passage = sparse.hstack([X_title, X_authours, X_groups, X_keywords, X_topics, X_abstract]) # Horizontally stack the sparse feature blocks
print(X_passage.shape)
(398, 4691)
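The combined width is the sum of the block widths (1124 + 1105 + 64 + 1051 + 305 + 1042 = 4691), and every block contributes with equal weight. The sketch below shows the same stacking on two made-up blocks and, as an assumption beyond the notebook, how a block could be scaled before stacking. The weights `w_a` and `w_b` are hypothetical.

```python
import numpy as np
from scipy import sparse

# Two toy feature blocks describing the same 3 documents (values are made up).
X_a = sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
X_b = sparse.csr_matrix(np.array([[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]))

# Hypothetical per-block weights: a larger weight makes that block count more
# in the Euclidean distances that KMeans uses.
w_a, w_b = 1.0, 2.0

# Columns of the blocks are placed side by side; rows must line up.
X_combined = sparse.hstack([w_a * X_a, w_b * X_b]).tocsr()
print(X_combined.shape)
print(X_combined.toarray())
```

With both weights set to 1.0 this reduces to the plain `sparse.hstack` call used above.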
Cluster directly with KMeans on the simple TF-IDF features
k = 5 # Assume there are 5 clusters
model = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1)
model.fit(X_simple)
labels = model.labels_
data_df['label'] = labels
labels
Output:
array([1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 0, 1, 4, 1, 1, 1, 1, 3,
2, 1, 0, 1, 1, 2, 1, 1, 4, 0, 1, 1, 4, 3, 1, 4, 1, 4, 1, 3, 1, 0,
4, 3, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 3, 4, 1, 1, 4, 1, 3, 1, 1, 4,
3, 4, 1, 3, 4, 2, 1, 1, 1, 1, 3, 4, 1, 4, 1, 1, 1, 1, 1, 1, 1, 4,
1, 1, 1, 1, 0, 1, 1, 2, 1, 4, 1, 1, 3, 1, 1, 1, 3, 2, 4, 0, 1, 3,
4, 2, 1, 3, 1, 2, 1, 4, 1, 1, 1, 1, 1, 0, 4, 1, 1, 0, 1, 0, 1, 3,
1, 1, 4, 4, 1, 1, 0, 1, 3, 1, 1, 1, 1, 1, 0, 1, 0, 4, 1, 1, 0, 2,
1, 2, 1, 0, 1, 1, 1, 4, 3, 1, 2, 1, 4, 3, 0, 2, 3, 4, 0, 3, 3, 1,
1, 2, 4, 3, 3, 4, 1, 1, 3, 2, 1, 0, 4, 4, 4, 4, 2, 1, 1, 3, 0, 4,
2, 1, 2, 0, 1, 1, 3, 3, 0, 1, 1, 1, 1, 1, 3, 1, 1, 1, 0, 1, 0, 1,
1, 1, 1, 1, 4, 1, 3, 1, 1, 1, 3, 1, 1, 4, 1, 2, 3, 0, 2, 3, 1, 1,
1, 1, 1, 4, 1, 0, 1, 1, 2, 1, 4, 1, 1, 1, 0, 1, 1, 1, 1, 4, 1, 1,
1, 4, 0, 1, 1, 1, 4, 1, 4, 2, 1, 1, 1, 2, 1, 3, 1, 0, 1, 2, 2, 1,
1, 3, 1, 1, 1, 3, 2, 1, 3, 4, 1, 1, 1, 1, 1, 1, 1, 4, 4, 1, 1, 4,
0, 1, 1, 3, 0, 4, 2, 0, 1, 4, 1, 2, 4, 3, 1, 1, 3, 3, 3, 1, 1, 1,
4, 1, 1, 2, 2, 1, 4, 4, 2, 1, 3, 0, 4, 4, 1, 0, 0, 4, 3, 1, 1, 1,
3, 1, 3, 1, 3, 0, 1, 4, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 4, 2, 1, 1,
3, 1, 3, 0, 1, 1, 0, 1, 1, 3, 1, 1, 2, 2, 1, 2, 4, 0, 1, 1, 1, 3,
1, 1])
Summarize the patterns in the clusters
data_df[data_df['label']==4][['title', 'groups', 'topics']]
 | title | groups | topics |
---|---|---|---|
2 | A Generalization of Probabilistic Serial to Ra... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Social Choice / Voting |
16 | Multi-Organ Exchange: The Whole is Greater tha... | Applications (APP)\nGame Theory and Economic P... | APP: Biomedical / Bioinformatics\nGTEP: Auctio... |
30 | The Computational Rise and Fall of Fairness | Game Theory and Economic Paradigms (GTEP) | GTEP: Social Choice / Voting |
34 | Lazy Defenders Are Almost Optimal Against Dili... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Imperfect Information |
37 | Game-theoretic Resource Allocation for Protect... | Applications (APP)\nGame Theory and Economic P... | APP: Security and Privacy\nGTEP: Game Theory\n... |
39 | A Strategy-Proof Online Auction with Time Disc... | Game Theory and Economic Paradigms (GTEP) | GTEP: Auctions and Market-Based Systems |
44 | Simultaneous Cake Cutting | Game Theory and Economic Paradigms (GTEP) | GTEP: Social Choice / Voting |
57 | Solving Imperfect Information Games Using Deco... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Equilibrium\nGTEP: Im... |
60 | Online (Budgeted) Social Choice | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Social Choice / Votin... |
65 | Fixing a Balanced Knockout Tournament | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Social Choice / Voting |
67 | Incomplete Preferences in Single-Peaked Electo... | Game Theory and Economic Paradigms (GTEP) | GTEP: Social Choice / Voting\nGTEP: Imperfect ... |
70 | A Control Dichotomy for Pure Scoring Rules | Game Theory and Economic Paradigms (GTEP) | GTEP: Social Choice / Voting |
77 | Biased Games | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Equilibrium |
79 | Preference Elicitation and Interview Minimizat... | Game Theory and Economic Paradigms (GTEP)\nMul... | APP: Computational Social Science\nGTEP: Socia... |
87 | Minimising Undesired Task Costs in Multi-robot... | Multiagent Systems (MAS)\nRobotics (ROB) | GTEP: Auctions and Market-Based Systems\nMAS: ... |
97 | Congestion Games for V2G-Enabled EV Charging | Computational Sustainability and AI (CSAI)\nGa... | CSAI: Modeling the interactions of agents with... |
106 | Evolutionary dynamics of learning algorithms o... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Adversarial Learning\nGTEP: Equilibrium\... |
110 | A Game-theoretic Analysis of Catalog Optimization | Game Theory and Economic Paradigms (GTEP)\nKno... | GTEP: Auctions and Market-Based Systems\nGTEP:... |
117 | Automatic Game Design via Mechanic Generation | Game Playing and Interactive Entertainment (GPIE) | GPIE: AI in Game Design\nGPIE: Procedural Cont... |
124 | False-Name Bidding and Economic Efficiency in ... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Auctions and Market-Based Systems\nGTEP:... |
134 | Mechanism Design for Scheduling with Uncertain... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Auctions and Market-Based Systems\nGTEP:... |
135 | Robust Winners and Winner Determination Polici... | Game Theory and Economic Paradigms (GTEP)\nMul... | APP: Computational Social Science\nGTEP: Socia... |
149 | Regret Transfer and Parameter Optimization | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Equilibrium\nGTEP: Im... |
161 | Trading Multiple Indivisible Goods with Indiff... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Social Choice / Votin... |
166 | Item Bidding for Combinatorial Public Projects | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Coordination and Coll... |
171 | Increasing VCG revenue by decreasing the quali... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Auctions and Market-Based Systems\nMAS: ... |
178 | Theory of Cooperation in Complex Social Networks | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Coordination and Coll... |
181 | Prices Matter for the Parameterized Complexity... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Social Choice / Votin... |
188 | Incentives for Truthful Information Elicitatio... | Game Theory and Economic Paradigms (GTEP)\nHum... | GTEP: Game Theory\nGTEP: Equilibrium\nGTEP: Im... |
189 | Equilibria in Epidemic Containment Games | Applications (APP)\nComputational Sustainabili... | APP: Security and Privacy\nCSAI: Modeling the ... |
190 | Beat the Cheater: Computing Game-Theoretic Str... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Equilibrium\nGTEP: Im... |
191 | A Characterization of the Single-Peaked Single... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Social Choice / Votin... |
197 | Efficient buyer groups for prediction-of-use e... | Computational Sustainability and AI (CSAI)\nGa... | CSAI: Modeling the interactions of agents with... |
224 | On Detecting Nearly Structured Preference Prof... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Social Choice / Voting |
233 | Betting Strategies, Market Selection, and the ... | Game Theory and Economic Paradigms (GTEP) | GTEP: Auctions and Market-Based Systems |
245 | Leveraging Fee-Based, Imperfect Advisors in Hu... | Humans and AI (HAI) | HAI: Human-Computer Interaction |
252 | On the Structure of Synergies in Cooperative G... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory |
261 | On the Incompatibility of Efficiency and Strat... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Social Choice / Voting |
265 | Regret-based Optimization and Preference Elici... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory |
270 | Modal Ranking: A Uniquely Robust Voting Rule | Game Theory and Economic Paradigms (GTEP) | GTEP: Social Choice / Voting |
272 | Extending Tournament Solutions | Game Theory and Economic Paradigms (GTEP) | GTEP: Social Choice / Voting |
295 | On Computing Optimal Strategies in Open List P... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Social Choice / Votin... |
303 | Envy-Free Division of Sellable Goods | Game Theory and Economic Paradigms (GTEP) | GTEP: Auctions and Market-Based Systems\nGTEP:... |
304 | Potential-Aware Imperfect-Recall Abstraction w... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Imperfect Information |
307 | Voting with Rank Dependent Scoring Rules | Game Theory and Economic Paradigms (GTEP) | GTEP: Auctions and Market-Based Systems\nGTEP:... |
313 | Incentivizing High-quality Content from Hetero... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Equilibrium\nGTEP: Im... |
317 | New Models for Competitive Contagion | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Equilibrium |
320 | Approximate Equilibrium and Incentivizing Soci... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Coordination and Coll... |
330 | Internally Stable Kidney Exchange | Multiagent Systems (MAS) | GTEP: Auctions and Market-Based Systems\nMAS: ... |
336 | Strategyproof exchange with multiple private e... | Game Theory and Economic Paradigms (GTEP) | GTEP: Auctions and Market-Based Systems\nGTEP:... |
337 | Mechanism design for mobile geo-location adver... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Auctions and Market-Based Systems\nGTEP:... |
342 | A Multiarmed Bandit Incentive Mechanism for Cr... | Computational Sustainability and AI (CSAI)\nGa... | CSAI: Modeling the interactions of agents with... |
343 | Binary Aggregation by Selection of the Most Re... | Game Theory and Economic Paradigms (GTEP)\nKno... | GTEP: Social Choice / Voting\nKRR: Preferences... |
347 | Bounding the Support Size in Extensive Form Ga... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Equilibrium\nGTEP: Im... |
359 | The Fisher Market Game: Equilibrium and Welfare | Game Theory and Economic Paradigms (GTEP) | GTEP: Auctions and Market-Based Systems\nGTEP:... |
366 | On the Axiomatic Characterization of Runoff Vo... | Game Theory and Economic Paradigms (GTEP) | GTEP: Social Choice / Voting |
370 | Solving Zero-Sum Security Games in Discretized... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Equilibrium\nMAS: Mul... |
390 | Using Response Functions to Measure Strategy S... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Equilibrium\nGTEP: Im... |
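Inspecting the results for each cluster shows the following:
The simple clustering does separate the papers into several categories, but the categories overlap with one another.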
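Besides reading titles, a cluster can be summarized by the terms with the largest weights in its KMeans centroid. The sketch below illustrates this on a made-up corpus. It is an addition to the notebook's approach, and the documents and the names `vec`, `km` and `top_terms` are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus with two obvious themes.
toy_docs = [
    "auction mechanism game equilibrium",
    "voting game equilibrium auction",
    "kernel transfer learning classifier",
    "transfer learning kernel regression",
]
vec = TfidfVectorizer()
X_toy = vec.fit_transform(toy_docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# Feature names ordered by their column index in the TF-IDF matrix.
terms = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))

# For each centroid, take the 3 terms with the largest average TF-IDF weight.
top_terms = {}
for cluster_id, centroid in enumerate(km.cluster_centers_):
    top_idx = np.argsort(centroid)[::-1][:3]
    top_terms[cluster_id] = terms[top_idx].tolist()

print(km.labels_)
print(top_terms)
```

In the notebook, the same loop could be run with the fitted `vectorizer` and `model` for X_simple. Cluster numbering is arbitrary, so the term lists matter more than the ids.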
# Create a TSNE object that reduces to 2 dimensions, with RANDOM_STATE as the random seed
tsne = sklearn.manifold.TSNE(n_components=2, random_state=RANDOM_STATE, init="random")
# Call fit_transform on X_simple and store the reduced array in X_tsne
X_tsne = tsne.fit_transform(X_simple)
sns.scatterplot(x=X_tsne[:,0], y=X_tsne[:,1], hue=labels, palette="deep") # 散点图
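The plot above shows that the simple clustering does form clusters, but they overlap.
Next, cluster the X_passage matrix obtained in Section 3.2 into 10 clusters.
model = KMeans(n_clusters=10, init='k-means++', max_iter=100, n_init=1, random_state=RANDOM_STATE) # KMeans clustering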
model.fit(X_passage)
labels = model.labels_
data_df['label'] = labels
labels
array([2, 4, 3, 4, 0, 0, 8, 2, 4, 0, 5, 6, 2, 4, 6, 8, 3, 4, 4, 0, 1, 9,
3, 2, 2, 8, 4, 5, 4, 8, 3, 5, 2, 2, 7, 9, 2, 7, 8, 7, 2, 1, 2, 6,
3, 9, 2, 4, 4, 5, 4, 4, 4, 4, 2, 1, 9, 7, 1, 2, 3, 2, 4, 8, 4, 3,
9, 3, 8, 9, 3, 9, 2, 2, 8, 8, 9, 7, 8, 3, 1, 4, 4, 0, 8, 1, 2, 3,
8, 2, 2, 4, 6, 1, 2, 5, 0, 7, 8, 4, 9, 4, 2, 4, 6, 5, 7, 6, 8, 9,
7, 5, 0, 9, 3, 5, 4, 7, 2, 0, 4, 2, 6, 6, 3, 8, 2, 6, 2, 9, 2, 1,
6, 2, 3, 3, 8, 0, 6, 4, 9, 2, 6, 4, 2, 8, 5, 2, 6, 7, 2, 0, 6, 3,
2, 5, 4, 6, 2, 4, 2, 3, 9, 2, 5, 8, 3, 1, 9, 3, 9, 3, 6, 9, 9, 2,
8, 5, 3, 9, 9, 3, 8, 0, 1, 5, 4, 5, 3, 7, 7, 3, 5, 2, 2, 6, 6, 3,
5, 2, 5, 6, 8, 4, 4, 9, 6, 8, 0, 1, 2, 2, 6, 4, 8, 4, 6, 2, 6, 0,
4, 2, 8, 1, 3, 2, 4, 4, 4, 8, 6, 8, 2, 7, 4, 5, 3, 6, 5, 8, 2, 4,
2, 2, 4, 4, 8, 5, 2, 2, 5, 0, 7, 2, 2, 4, 5, 0, 2, 2, 2, 3, 4, 0,
2, 7, 5, 4, 1, 4, 3, 2, 3, 5, 2, 2, 8, 5, 2, 9, 4, 6, 2, 5, 5, 0,
2, 9, 2, 4, 1, 9, 5, 2, 9, 3, 9, 4, 2, 2, 2, 8, 2, 3, 7, 8, 4, 3,
9, 8, 5, 1, 6, 7, 5, 6, 2, 7, 2, 5, 7, 9, 8, 6, 9, 1, 9, 2, 2, 2,
3, 2, 8, 5, 5, 8, 7, 3, 5, 8, 1, 6, 7, 3, 8, 6, 6, 7, 9, 0, 0, 4,
9, 2, 1, 2, 9, 6, 2, 7, 8, 4, 6, 8, 2, 2, 3, 4, 1, 0, 7, 5, 2, 8,
9, 4, 1, 6, 2, 8, 6, 0, 9, 9, 6, 4, 5, 5, 2, 5, 7, 6, 2, 1, 4, 9,
4, 2])
data_df[data_df['label']==9][['title', 'groups', 'topics']]
 | title | groups | topics |
---|---|---|---|
21 | The Complexity of Reasoning with FODD and GFODD | Knowledge Representation and Reasoning (KRR) | KRR: Automated Reasoning and Theorem Proving\n... |
35 | PREGO: An Action Language for Belief-Based Cog... | Knowledge Representation and Reasoning (KRR) | KRR: Action, Change, and Causality\nKRR: Knowl... |
45 | Recovering from Selection Bias in Causal and S... | Knowledge Representation and Reasoning (KRR)\n... | KRR: Action, Change, and Causality\nRU: Bayesi... |
56 | A Parameterized Complexity Analysis of General... | Game Playing and Interactive Entertainment (GP... | GTEP: Social Choice / Voting\nKRR: Computation... |
66 | Querying Inconsistent Description Logic Knowle... | Knowledge Representation and Reasoning (KRR) | KRR: Ontologies\nKRR: Computational Complexity... |
69 | Knowledge Graph Embedding by Translating on Hy... | Knowledge Representation and Reasoning (KRR)\n... | KRR: Knowledge Representation (General/Other)\... |
71 | Fast consistency checking of very large real-w... | Knowledge Representation and Reasoning (KRR)\n... | KRR: Geometric, Spatial, and Temporal Reasonin... |
76 | The Computational Complexity of Structure-Base... | Knowledge Representation and Reasoning (KRR)\n... | KRR: Action, Change, and Causality\nKRR: Compu... |
100 | A Tractable Approach to ABox Abduction over De... | Knowledge Representation and Reasoning (KRR) | KRR: Description Logics\nKRR: Diagnosis and Ab... |
109 | Reasoning on LTL on Finite Traces: Insensitivi... | Knowledge Representation and Reasoning (KRR) | AIW: AI for web services: semantic description... |
113 | Programming by Example using Least General Gen... | Applications (APP)\nHeuristic Search and Optim... | APP: Intelligent User Interfaces\nAPP: Other A... |
129 | Using Model-Based Diagnosis to Improve Softwar... | Applications (APP)\nKnowledge Representation a... | APP: Other Applications\nKRR: Automated Reason... |
140 | Confident Reasoning on Raven’s Progressive Mat... | Knowledge Representation and Reasoning (KRR) | KRR: Geometric, Spatial, and Temporal Reasonin... |
162 | SenticNet 3: A Common and Common-Sense Knowled... | Cognitive Systems (CS)\nKnowledge Representati... | CS: Conceptual inference and reasoning\nKRR: C... |
168 | Backdoors to Planning | Knowledge Representation and Reasoning (KRR)\n... | KRR: Computational Complexity of Reasoning\nPS... |
170 | Datalog Rewritability of Disjunctive Datalog P... | Knowledge Representation and Reasoning (KRR) | KRR: Ontologies\nKRR: Automated Reasoning and ... |
173 | The Most Uncreative Examinee: A First Step tow... | Knowledge Representation and Reasoning (KRR) | KRR: Automated Reasoning and Theorem Proving |
174 | Acquiring Commonsense Knowledge for Sentiment ... | Human-Computation and Crowd Sourcing (HCC)\nKn... | HCC: Domain-specific implementation challenges... |
179 | Explanation-Based Approximate Weighted Model C... | Knowledge Representation and Reasoning (KRR)\n... | KRR: Logic Programming\nRU: Probabilistic Infe... |
180 | A Knowledge Compilation Map for Ordered Real-V... | Knowledge Representation and Reasoning (KRR) | KRR: Computational Complexity of Reasoning\nKR... |
205 | A reasoner for the RCC-5 and RCC-8 calculi ext... | Knowledge Representation and Reasoning (KRR)\n... | KRR: Computational Complexity of Reasoning\nKR... |
279 | Computing General First-order Parallel and Pri... | Knowledge Representation and Reasoning (KRR) | KRR: Common-Sense Reasoning\nKRR: Nonmonotonic... |
287 | Data Quality in Ontology-based Data Access: Th... | Knowledge Representation and Reasoning (KRR) | APP: Other Applications\nKRR: Ontologies\nKRR:... |
291 | Diagnosing Analogue Linear Systems Using Dynam... | Knowledge Representation and Reasoning (KRR) | KRR: Diagnosis and Abductive Reasoning |
294 | Elementary Loops Revisited | Knowledge Representation and Reasoning (KRR) | KRR: Logic Programming |
296 | Joint Morphological Generation and Syntactic L... | NLP and Knowledge Representation (NLPKR) | NLPKR: Natural Language Processing (General/Ot... |
308 | Implementing GOLOG in Answer Set Programming | Knowledge Representation and Reasoning (KRR)\n... | KRR: Action, Change, and Causality\nKRR: Logic... |
321 | Qualitative Reasoning with Modelica Models | Applications (APP)\nKnowledge Representation a... | APP: Other Applications\nKRR: Knowledge Repres... |
324 | Pathway Specification and Comparative Queries:... | Knowledge Representation and Reasoning (KRR) | APP: Biomedical / Bioinformatics\nKRR: Knowled... |
326 | Testable Implications of Linear Structural Equ... | Knowledge Representation and Reasoning (KRR)\n... | KRR: Action, Change, and Causality\nRU: Graphi... |
348 | Exploiting Support Sets for Answer Set Program... | Knowledge Representation and Reasoning (KRR) | KRR: Ontologies\nKRR: Description Logics\nKRR:... |
352 | Local-To-Global Consistency Implies Tractabili... | Knowledge Representation and Reasoning (KRR) | KRR: Computational Complexity of Reasoning\nKR... |
356 | Exploring the Boundaries of Decidable Verifica... | Knowledge Representation and Reasoning (KRR) | KRR: Action, Change, and Causality\nKRR: Geome... |
374 | Managing Change in Graph-structured Data Using... | Knowledge Representation and Reasoning (KRR) | KRR: Computational Complexity of Reasoning\nKR... |
382 | Coactive Learning for Locally Optimal Problem ... | Humans and AI (HAI)\nKnowledge Representation ... | HCC: Active learning from imperfect human labe... |
383 | Large Scale Analogical Reasoning | Cognitive Systems (CS)\nKnowledge Representati... | CS: Conceptual inference and reasoning\nCS: St... |
395 | Contraction and Revision over DL-Lite TBoxes | Knowledge Representation and Reasoning (KRR) | KRR: Belief Change\nKRR: Description Logics\nK... |
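Inspecting the results for each cluster shows that each cluster has fairly distinct characteristics.
# Create a TSNE object that reduces to 2 dimensions, with RANDOM_STATE as the random seed
tsne = sklearn.manifold.TSNE(n_components=2, random_state=RANDOM_STATE, init="random")
# Call fit_transform on X_passage and store the reduced array in X_tsne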
X_tsne = tsne.fit_transform(X_passage)
sns.scatterplot(x=X_tsne[:,0], y=X_tsne[:,1], hue=labels, palette="deep") # 散点图
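The plot above suggests that clustering works better after tokenizing by author and stemming the text fields.
This section analyzes how different values of k affect the clustering, and which k works best for this dataset.
k_range = range(5,15)
label_dict = {}
for k in k_range:
    model = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1, random_state=RANDOM_STATE)
    model.fit(X_passage)
    labels = model.labels_
    label_dict[k] = labels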
label_dict[7]
array([0, 0, 5, 6, 3, 3, 3, 0, 6, 3, 4, 0, 0, 6, 2, 3, 5, 6, 6, 3, 3, 1,
5, 0, 0, 3, 6, 4, 6, 6, 5, 4, 0, 0, 5, 1, 0, 5, 6, 5, 0, 1, 0, 3,
5, 1, 0, 3, 3, 4, 6, 6, 6, 3, 0, 6, 1, 5, 3, 0, 5, 0, 6, 3, 6, 5,
1, 5, 3, 1, 5, 1, 0, 0, 6, 3, 1, 5, 3, 5, 6, 3, 6, 3, 3, 6, 0, 5,
3, 0, 0, 6, 2, 3, 0, 4, 3, 5, 3, 3, 1, 6, 0, 6, 1, 4, 5, 2, 3, 1,
5, 4, 3, 1, 5, 4, 6, 3, 0, 3, 3, 0, 3, 2, 5, 3, 0, 2, 0, 1, 0, 1,
1, 0, 5, 5, 3, 3, 2, 3, 1, 0, 3, 6, 0, 3, 4, 0, 2, 5, 0, 3, 3, 5,
0, 4, 6, 2, 0, 6, 0, 5, 1, 0, 4, 3, 5, 6, 2, 5, 1, 5, 2, 1, 1, 0,
3, 4, 5, 1, 1, 5, 3, 3, 1, 4, 3, 6, 5, 5, 5, 5, 4, 0, 0, 1, 2, 5,
4, 0, 4, 2, 3, 3, 6, 1, 2, 3, 3, 6, 0, 3, 1, 3, 3, 6, 2, 0, 2, 3,
6, 3, 3, 3, 5, 0, 3, 6, 3, 3, 1, 3, 0, 5, 6, 4, 5, 2, 4, 3, 0, 3,
0, 0, 6, 6, 3, 4, 0, 0, 4, 3, 5, 0, 0, 6, 2, 3, 0, 0, 0, 5, 6, 3,
0, 5, 4, 0, 6, 6, 5, 0, 5, 4, 0, 0, 3, 4, 0, 1, 3, 2, 0, 4, 4, 3,
0, 1, 0, 6, 6, 1, 4, 0, 1, 5, 3, 6, 0, 0, 0, 3, 0, 5, 5, 0, 6, 5,
1, 3, 4, 1, 2, 5, 4, 2, 0, 5, 0, 4, 5, 1, 3, 1, 1, 1, 1, 0, 0, 0,
5, 0, 3, 4, 4, 3, 5, 5, 4, 3, 1, 2, 5, 5, 3, 2, 2, 5, 1, 3, 3, 6,
1, 0, 1, 0, 1, 2, 0, 5, 3, 3, 1, 3, 0, 0, 5, 6, 6, 3, 5, 4, 0, 3,
1, 3, 6, 2, 3, 3, 2, 3, 1, 1, 3, 3, 4, 4, 0, 4, 5, 2, 0, 6, 6, 1,
3, 0])
# Create a grid of subplots with 2 rows and 5 columns
fig, axes = plt.subplots(2, 5, figsize=(25, 10))
# Draw one scatter plot per value of k into the grid
for k, label in label_dict.items():
    row, col = divmod(k-5, 5) # Row and column of the subplot for this k
    ax = axes[row, col]
    sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=label, palette="deep", ax=ax)
    ax.set_title("cluster = %d" % k)
# Adjust the subplot layout
plt.tight_layout()
plt.show()
# Create a TSNE object that reduces to 3 dimensions, with RANDOM_STATE as the random seed
tsne = sklearn.manifold.TSNE(n_components=3, random_state=RANDOM_STATE, init="random")
# Call fit_transform on X_passage and store the reduced array in X_tsne
X_tsne = tsne.fit_transform(X_passage)
# Create one large figure containing 10 3D subplots
fig, axes = plt.subplots(2, 5, figsize=(25, 10), subplot_kw={'projection': '3d'})
# Fill the 10 subplots, one per value of k
for k, ax in zip(label_dict.keys(), axes.flatten()):
    # Draw the 3D scatter plot, colored by cluster label
    ax.scatter(X_tsne[:, 0], X_tsne[:, 1], X_tsne[:, 2], c=label_dict[k], cmap='Dark2')
    # Set the title and its font size
    ax.set_title("cluster = %d" % k, fontsize=16)
# Adjust the subplot layout
plt.tight_layout()
plt.show()
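As the 2D and 3D plots above show, none of the KMeans runs with k from 5 to 14 produces especially clean clusters, although k = 7 appears to cluster slightly better than the others.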
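The comparison above is made by eye. A quantitative criterion that could complement it is the silhouette coefficient, which the original notebook does not use. The sketch below computes it for several values of k on synthetic 2D points with three well-separated groups; the data, the range of k and the names `scores` and `best_k` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: 3 well-separated Gaussian blobs with 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit KMeans for each candidate k and record the mean silhouette coefficient.
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    scores[k] = silhouette_score(X_toy, km.labels_)

# The silhouette coefficient lies in [-1, 1]; higher means tighter, better separated clusters.
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

For the notebook's data, `X_toy` would be replaced by `X_passage.tocsr()` and the range by `k_range`. On high-dimensional sparse TF-IDF features the scores are typically low and close together, so they are best read as a rough guide alongside the plots.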