统计2019年全年计算机各个方向论文数量
步骤
:
1.找到update为2019年的数据
2.找出categories为计算机的数据
3.统计数量
#导入包
import seaborn as sns #用于画图
from bs4 import BeautifulSoup #爬取数据
import re #正则,匹配字符串模式
import requests #网络连接,发送网络请求,使用域名获取对应信息
import json #读取json格式数据
import pandas as pd #数据处理,数据分析
import matplotlib.pyplot as plt #画图
data = [] #初始化
#使用with语句优势,1.自动关闭文件句柄;2.自动显示(处理)文件读取数据异常
with open("arxiv-metadata-oai-snapshot.json",'r') as f:
for line in f:
data.append(json.loads(line))
'''
for idx, line in enumerate(f):
# 读取前100行,查看数据的时候,不需要跑很多,此处一定要注意
if idx >= 100:
break
'''
data = pd.DataFrame(data) #将list变为DataFrame格式,方便分析
data.shape #显示数据大小
(1796911, 14)
data.head(2)
id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | update_date | authors_parsed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0704.0001 | Pavel Nadolsky | C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-... | Calculation of prompt diphoton production cros... | 37 pages, 15 figures; published version | Phys.Rev.D76:013009,2007 | 10.1103/PhysRevD.76.013009 | ANL-HEP-PR-07-12 | hep-ph | None | A fully differential calculation in perturba... | [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... | 2008-11-26 | [[Balázs, C., ], [Berger, E. L., ], [Nadolsky,... |
1 | 0704.0002 | Louis Theran | Ileana Streinu and Louis Theran | Sparsity-certifying Graph Decompositions | To appear in Graphs and Combinatorics | None | None | None | math.CO cs.CG | http://arxiv.org/licenses/nonexclusive-distrib... | We describe a new algorithm, the $(k,\ell)$-... | [{'version': 'v1', 'created': 'Sat, 31 Mar 200... | 2008-12-13 | [[Streinu, Ileana, ], [Theran, Louis, ]] |
数据集的字段解释:
id
:arXiv ID,可用于访问论文;submitter
:论文提交者;authors
:论文作者;title
:论文标题;comments
:论文页数和图表等其他信息;journal-ref
:论文发表的期刊的信息;doi
:数字对象标识符,https://www.doi.org;report-no
:报告编号;categories
:论文在 arXiv 系统的所属类别或标签;license
:文章的许可证;abstract
:论文摘要;versions
:论文版本;authors_parsed
:作者的信息。首先查看论文的种类信息,目的是了解一下数据集的基本信息
data['categories'].describe()
count 1796911
unique 62055
top astro-ph
freq 86914
Name: categories, dtype: object
-count
:元素个数;
-unique
:元素的不同种类;
-top
:出现频率最高的元素;
-freq
:出现频率最高的元素个数;
data['categories'].head(4)
0 hep-ph
1 math.CO cs.CG
2 physics.gen-ph
3 math.CO
Name: categories, dtype: object
查看一下categories的分类信息,同时需要依据官方的论文种类对其进行整理,方便我们找到计算机类的数据。
从官网爬取类别数据
#获取网页文本数据
websit_url = requests.get('https://arxiv.org/category_taxonomy').text
#爬取数据,使用lxml的解析器,加速
soup = BeautifulSoup(websit_url,'lxml')
#找到BeautifulScoup对应的标签入口
root = soup.find('div',{
'id':'category_taxonomy_list'})
#读取tags
tags = root.find_all(["h2","h3","h4","p"], recursive=True)
#初始化 str 和 list 变量
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []
#进行
for t in tags:
if t.name == "h2":
level_1_name = t.text
level_2_code = t.text
level_2_name = t.text
elif t.name == "h3":
raw = t.text
level_2_code = re.sub(r"(.*)\((.*)\)",r"\2",raw) #正则表达式:模式字符串:(.*)\((.*)\);被替换字符串"\2";被处理字符串:raw
level_2_name = re.sub(r"(.*)\((.*)\)",r"\1",raw)
elif t.name == "h4":
raw = t.text
level_3_code = re.sub(r"(.*) \((.*)\)",r"\1",raw)
level_3_name = re.sub(r"(.*) \((.*)\)",r"\2",raw)
elif t.name == "p":
notes = t.text
level_1_names.append(level_1_name)
level_2_names.append(level_2_name)
level_2_codes.append(level_2_code)
level_3_names.append(level_3_name)
level_3_codes.append(level_3_code)
level_3_notes.append(notes)
#根据以上信息生成dataframe格式的数据
df_taxonomy = pd.DataFrame({
'group_name' : level_1_names,
'archive_name' : level_2_names,
'archive_id' : level_2_codes,
'category_name' : level_3_names,
'categories' : level_3_codes,
'category_description': level_3_notes
})
#按照 "group_name" 进行分组,在组内使用 "archive_name" 进行排序
df_taxonomy.groupby(["group_name","archive_name"])
df_taxonomy
group_name | archive_name | archive_id | category_name | categories | category_description | |
---|---|---|---|---|---|---|
0 | Computer Science | Computer Science | Computer Science | Artificial Intelligence | cs.AI | Covers all areas of AI except Vision, Robotics... |
1 | Computer Science | Computer Science | Computer Science | Hardware Architecture | cs.AR | Covers systems organization and hardware archi... |
2 | Computer Science | Computer Science | Computer Science | Computational Complexity | cs.CC | Covers models of computation, complexity class... |
3 | Computer Science | Computer Science | Computer Science | Computational Engineering, Finance, and Science | cs.CE | Covers applications of computer science to the... |
4 | Computer Science | Computer Science | Computer Science | Computational Geometry | cs.CG | Roughly includes material in ACM Subject Class... |
... | ... | ... | ... | ... | ... | ... |
150 | Statistics | Statistics | Statistics | Computation | stat.CO | Algorithms, Simulation, Visualization |
151 | Statistics | Statistics | Statistics | Methodology | stat.ME | Design, Surveys, Model Selection, Multiple Tes... |
152 | Statistics | Statistics | Statistics | Machine Learning | stat.ML | Covers machine learning papers (supervised, un... |
153 | Statistics | Statistics | Statistics | Other Statistics | stat.OT | Work in statistics that does not fit into the ... |
154 | Statistics | Statistics | Statistics | Statistics Theory | stat.TH | stat.TH is an alias for math.ST. Asymptotics, ... |
155 rows × 6 columns
观察上面的数据,我们会发现categories存在多种类的问题,使用空格作为了分隔符,这里我们需要去查看总共存在多少种分类
#使用集合的属性,直接去除重复值
unique_categories = set([i for j in [x.split(' ') for x in data['categories']] for i in j])
len(unique_categories)
176
unique_categories
{'acc-phys',
'adap-org',
'alg-geom',
'ao-sci',
'astro-ph',
'astro-ph.CO',
'astro-ph.EP',
'astro-ph.GA',
'astro-ph.HE',
'astro-ph.IM',
'astro-ph.SR',
'atom-ph',
'bayes-an',
'chao-dyn',
'chem-ph',
'cmp-lg',
'comp-gas',
'cond-mat',
'cond-mat.dis-nn',
'cond-mat.mes-hall',
'cond-mat.mtrl-sci',
'cond-mat.other',
'cond-mat.quant-gas',
'cond-mat.soft',
'cond-mat.stat-mech',
'cond-mat.str-el',
'cond-mat.supr-con',
'cs.AI',
'cs.AR',
'cs.CC',
'cs.CE',
'cs.CG',
'cs.CL',
'cs.CR',
'cs.CV',
'cs.CY',
'cs.DB',
'cs.DC',
'cs.DL',
'cs.DM',
'cs.DS',
'cs.ET',
'cs.FL',
'cs.GL',
'cs.GR',
'cs.GT',
'cs.HC',
'cs.IR',
'cs.IT',
'cs.LG',
'cs.LO',
'cs.MA',
'cs.MM',
'cs.MS',
'cs.NA',
'cs.NE',
'cs.NI',
'cs.OH',
'cs.OS',
'cs.PF',
'cs.PL',
'cs.RO',
'cs.SC',
'cs.SD',
'cs.SE',
'cs.SI',
'cs.SY',
'dg-ga',
'econ.EM',
'econ.GN',
'econ.TH',
'eess.AS',
'eess.IV',
'eess.SP',
'eess.SY',
'funct-an',
'gr-qc',
'hep-ex',
'hep-lat',
'hep-ph',
'hep-th',
'math-ph',
'math.AC',
'math.AG',
'math.AP',
'math.AT',
'math.CA',
'math.CO',
'math.CT',
'math.CV',
'math.DG',
'math.DS',
'math.FA',
'math.GM',
'math.GN',
'math.GR',
'math.GT',
'math.HO',
'math.IT',
'math.KT',
'math.LO',
'math.MG',
'math.MP',
'math.NA',
'math.NT',
'math.OA',
'math.OC',
'math.PR',
'math.QA',
'math.RA',
'math.RT',
'math.SG',
'math.SP',
'math.ST',
'mtrl-th',
'nlin.AO',
'nlin.CD',
'nlin.CG',
'nlin.PS',
'nlin.SI',
'nucl-ex',
'nucl-th',
'patt-sol',
'physics.acc-ph',
'physics.ao-ph',
'physics.app-ph',
'physics.atm-clus',
'physics.atom-ph',
'physics.bio-ph',
'physics.chem-ph',
'physics.class-ph',
'physics.comp-ph',
'physics.data-an',
'physics.ed-ph',
'physics.flu-dyn',
'physics.gen-ph',
'physics.geo-ph',
'physics.hist-ph',
'physics.ins-det',
'physics.med-ph',
'physics.optics',
'physics.plasm-ph',
'physics.pop-ph',
'physics.soc-ph',
'physics.space-ph',
'plasm-ph',
'q-alg',
'q-bio',
'q-bio.BM',
'q-bio.CB',
'q-bio.GN',
'q-bio.MN',
'q-bio.NC',
'q-bio.OT',
'q-bio.PE',
'q-bio.QM',
'q-bio.SC',
'q-bio.TO',
'q-fin.CP',
'q-fin.EC',
'q-fin.GN',
'q-fin.MF',
'q-fin.PM',
'q-fin.PR',
'q-fin.RM',
'q-fin.ST',
'q-fin.TR',
'quant-ph',
'solv-int',
'stat.AP',
'stat.CO',
'stat.ME',
'stat.ML',
'stat.OT',
'stat.TH',
'supr-con'}
官网是155种,我们的数据集是176种,问题不大
开始获取我们想要的数据
#得到2019年的论文数据
data['year'] = pd.to_datetime(data['update_date']).dt.year
#查找2019年的数据
data = data[data['year']== 2019]
#获取id,categories
data_d = pd.DataFrame(data[['id','categories']])
data_d = data_d.reset_index(drop=True)
data_d.shape
(170618, 2)
#将data_d与df_taxonomy拼接,找到论文的类型,使用值拼接,并删除重复的'id'和group_name
data_f = data_d.merge(df_taxonomy,on='categories',how='left').drop_duplicates(['id','group_name'])
#按照categories统计id的数量
data_f = pd.DataFrame(data_f.groupby('group_name')['id'].count())
data_f = data_f.rename(columns={
'id':'count'})
#一定要排序,否则会影响绘图
data_f = data_f.sort_values(by='count',ascending=False)
data_f = data_f.reset_index()
data_f
group_name | count | |
---|---|---|
0 | Physics | 38379 |
1 | Mathematics | 24495 |
2 | Computer Science | 18087 |
3 | Statistics | 1802 |
4 | Electrical Engineering and Systems Science | 1371 |
5 | Quantitative Biology | 886 |
6 | Quantitative Finance | 352 |
7 | Economics | 173 |
接下来可以查看计算机小类在2019年的情况
fig = plt.figure(figsize=(10,12))
explode = (0, 0, 0, 0.2, 0.3, 0.3, 0.2, 0.1)
plt.pie(data_f['count'], labels=data_f['group_name'], autopct='%1.2f%%', startangle=160, explode=explode)
plt.tight_layout()
plt.show()
comp = data_d.merge(df_taxonomy,on='categories',how='left').query("group_name == 'Computer Science'").drop_duplicates(['id','group_name'])
comp = pd.DataFrame(comp.groupby(['category_name'])['id'].count()).rename(columns={
'id':'count'}).sort_values(by='count',ascending=False).reset_index()
comp.head(5)
category_name | count | |
---|---|---|
0 | Computer Vision and Pattern Recognition | 5559 |
1 | Computation and Language | 2153 |
2 | Cryptography and Security | 1067 |
3 | Robotics | 917 |
4 | Networking and Internet Architecture | 864 |
从结果看出,Computer Vision and Pattern Recognition(计算机视觉与模式识别)类是CS中数量最多的子类,这里我只计算了2019年的数据。