Data analysis task 1: Paper statistics

Task description

Task topic: paper counts, i.e. count the number of papers in each computer science subfield over the whole of 2019.
Dataset: https://www.kaggle.com/Cornell-University/arxiv

1. Environment setup: Google Colab + Kaggle dataset

Run the following script in Colab to download the arXiv dataset:
!pip install kaggle
!mkdir -p ~/.kaggle
!cp /content/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle config set -n path -v /content
!kaggle datasets download -d Cornell-University/arxiv

2. Paper statistics

(1) Unzip the file

import zipfile
datapath = '/content/datasets/Cornell-University/arxiv/arxiv.zip'
datazip = zipfile.ZipFile(datapath)
print(datazip.namelist())
print(datazip.filename)
datazip.extractall()
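
Called without arguments, `extractall()` writes into the current working directory, which is why the JSON file is later read from `/content`. A minimal sketch of passing the target directory explicitly is shown below; the helper name `extract_archive` and the in-memory toy archive are illustrative assumptions, not part of the original notebook.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract_archive(archive, target_dir):
    """Extract every member of a zip archive into target_dir; return the member names."""
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
        zf.extractall(path=target_dir)
    return names


# Toy archive built in memory so the sketch runs without the real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("sample.json", '{"id": "0704.0001"}\n')
buffer.seek(0)

with tempfile.TemporaryDirectory() as tmp:
    extracted = extract_archive(buffer, tmp)
    extracted_text = (Path(tmp) / "sample.json").read_text()

print(extracted)
```

With the real archive, the first argument would be the `datapath` used above and the second a directory such as `/content`.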

(2) Import packages

import seaborn as sns 
from bs4 import BeautifulSoup
import re
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt

(3) Read the JSON data

# read data
data = []
with open('/content/arxiv-metadata-oai-snapshot.json', 'r') as f:
  for line in f:
    data.append(json.loads(line))
data = pd.DataFrame(data)
data.shape

(1796911, 14)
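
Loading all 14 columns of roughly 1.8 million records can use a lot of memory in Colab. The sketch below keeps only the keys needed for the later steps while parsing the JSON Lines file. The function name `read_selected_fields` and the second sample record are hypothetical; the first record reuses values from the `data.head(1)` output.

```python
import io
import json


def read_selected_fields(lines, fields):
    """Parse JSON Lines input and keep only the requested keys of each record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        records.append({key: record.get(key) for key in fields})
    return records


# In-memory stand-in for the snapshot file; the second record is made up.
sample = io.StringIO(
    '{"id": "0704.0001", "categories": "hep-ph", "update_date": "2008-11-26", "abstract": "..."}\n'
    '{"id": "hypothetical-1", "categories": "cs.AI cs.LG", "update_date": "2019-05-01", "abstract": "..."}\n'
)

rows = read_selected_fields(sample, ["id", "categories", "update_date"])
print(rows[0])
```

The resulting list of dicts can be passed to `pd.DataFrame` in the same way as the full list above; an open file handle works in place of the `StringIO` object.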

# inspect the data
data.head(1)

id	submitter	authors	title	comments	journal-ref	doi	report-no	categories	license	abstract	versions	update_date	authors_parsed
0704.0001	Pavel Nadolsky	C. Bal'azs, E. L. Berger, P. M. Nadolsky, C.-...	Calculation of prompt diphoton production cros...	37 pages, 15 figures; published version	Phys.Rev.D76:013009,2007	10.1103/PhysRevD.76.013009	ANL-HEP-PR-07-12	hep-ph	None	A fully differential calculation in perturba...	[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...	2008-11-26	[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,...

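Column descriptions

No.	Column	Description
0	id	arXiv ID, which can be used to access the paper
1	submitter	person who submitted the paper
2	authors	authors of the paper
3	title	title of the paper
4	comments	page count, number of figures, and other remarks
5	journal-ref	information about the journal in which the paper was published
6	doi	digital object identifier, https://www.doi.org
7	report-no	report number
8	categories	categories or tags of the paper in the arXiv system
9	license	license of the paper
10	abstract	abstract of the paper
11	versions	version history of the paper
12	update_date	date of the most recent update
13	authors_parsed	parsed author information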

(4) Data preprocessing

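'''
count: number of non-null elements in the column
unique: number of distinct values in the column
top: the most frequent value in the column
freq: number of occurrences of the most frequent value
'''
# inspect categories
data['categories'].describe()
'''
output: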
count      1796911
unique       62055
top       astro-ph
freq         86914
Name: categories, dtype: object
'''

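The column has 1,796,911 entries and 62,055 distinct values; the most frequent value is astro-ph, which occurs 86,914 times. Each value is the full space-separated category string of a paper, so 62,055 counts category combinations rather than individual categories.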

# how many distinct individual categories appear in this dataset
unique_categories = set([i for l in [x.split(' ') for x in data["categories"]] for i in l])
len(unique_categories)
unique_categories

 'ao-sci',
 'astro-ph', 'astro-ph.CO', 'astro-ph.EP', 'astro-ph.GA', 'astro-ph.HE', 'astro-ph.IM', 'astro-ph.SR',
 'atom-ph',
 'bayes-an',
 'chao-dyn',
 'chem-ph',
 'cmp-lg',
 'comp-gas',
 'cond-mat', 'cond-mat.dis-nn', 'cond-mat.mes-hall', 'cond-mat.mtrl-sci', 'cond-mat.other', 'cond-mat.quant-gas',
'cond-mat.soft', 'cond-mat.stat-mech', 'cond-mat.str-el', 'cond-mat.supr-con',
 'cs.AI', 'cs.AR','cs.CC', 'cs.CE', 'cs.CG', 'cs.CL','cs.CR', 'cs.CV', 'cs.CY', 'cs.DB','cs.DC', 'cs.DL','cs.DM','cs.DS','cs.ET', 'cs.FL', 'cs.GL','cs.GR','cs.GT','cs.HC', 'cs.IR', 'cs.IT', 'cs.LG', 'cs.LO', 'cs.MA', 'cs.MM', 'cs.MS', 'cs.NA', 'cs.NE', 'cs.NI', 'cs.OH', 'cs.OS','cs.PF', 'cs.PL','cs.RO','cs.SC','cs.SD','cs.SE','cs.SI', 'cs.SY',
 'dg-ga',
 'econ.EM','econ.GN', 'econ.TH',
 'eess.AS', 'eess.IV', 'eess.SP', 'eess.SY',
 'funct-an',
 'gr-qc',
 'hep-ex',
 'hep-lat',
 'hep-ph',
 'hep-th',
 'math-ph',
 'math.AC', 'math.AG', 'math.AP', 'math.AT', 'math.CA', 'math.CO', 'math.CT', 'math.CV', 'math.DG','math.DS', 'math.FA', 'math.GM', 'math.GN', 'math.GR', 'math.GT', 'math.HO', 'math.IT', 'math.KT', 'math.LO', 'math.MG', 'math.MP', 'math.NA', 'math.NT', 'math.OA', 'math.OC', 'math.PR', 'math.QA', 'math.RA', 'math.RT', 'math.SG', 'math.SP', 'math.ST',
 'mtrl-th',
 'nlin.AO','nlin.CD', 'nlin.CG', 'nlin.PS', 'nlin.SI',
 'nucl-ex','nucl-th',
 'patt-sol',
 'physics.acc-ph', 'physics.ao-ph', 'physics.app-ph', 'physics.atm-clus', 'physics.atom-ph', 'physics.bio-ph', 'physics.chem-ph', 'physics.class-ph', 'physics.comp-ph', 'physics.data-an', 'physics.ed-ph', 'physics.flu-dyn', 'physics.gen-ph', 'physics.geo-ph', 'physics.hist-ph', 'physics.ins-det', 'physics.med-ph', 'physics.optics', 'physics.plasm-ph', 'physics.pop-ph', 'physics.soc-ph','physics.space-ph',
 'plasm-ph',
 'q-alg',
 'q-bio','q-bio.BM','q-bio.CB', 'q-bio.GN','q-bio.MN', 'q-bio.NC','q-bio.OT','q-bio.PE', 'q-bio.QM','q-bio.SC', 'q-bio.TO',
'q-fin.CP','q-fin.EC','q-fin.GN','q-fin.MF','q-fin.PM', 'q-fin.PR', 'q-fin.RM', 'q-fin.ST', 'q-fin.TR',
 'quant-ph',
 'solv-int',
'stat.AP', 'stat.CO','stat.ME','stat.ML','stat.OT', 'stat.TH',
 'supr-con'}

print(len(unique_categories))
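
The set above tells which individual categories exist, but not how often each one occurs. A minimal sketch of counting them with `collections.Counter` follows; the toy category strings and the function name `count_individual_categories` are assumptions for illustration, not values taken from the dataset.

```python
from collections import Counter


def count_individual_categories(category_strings):
    """Count how often each individual category occurs across all papers."""
    counts = Counter()
    for categories in category_strings:
        # A paper's categories are stored as one space-separated string.
        counts.update(categories.split())
    return counts


# Toy input standing in for data["categories"].
toy_categories = ["astro-ph", "cs.AI cs.LG", "cs.LG stat.ML", "astro-ph"]

category_counts = count_individual_categories(toy_categories)
distinct_strings = len(set(toy_categories))  # what describe() reports as "unique"
distinct_categories = len(category_counts)   # what len(unique_categories) reports

print(category_counts.most_common(2))
```

On the real data, `count_individual_categories(data["categories"])` should give per-category frequencies that also count cross-listed papers.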

# restrict the analysis to papers from 2019 onward
data['year'] = pd.to_datetime(data["update_date"]).dt.year # convert update_date from str to datetime and extract the year
del data["update_date"]
data = data[data["year"] >= 2019]
data.reset_index(drop = True, inplace = True) # renumber the rows
data

395123 rows × 14 columns
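
Note that `update_date` is the date of the last update, so a paper first submitted before 2019 but revised later also passes this filter. If the submission year is wanted instead, one option is to parse the `created` field of the first entry in `versions`. The sketch below assumes that field is an RFC 2822 style date string, as the truncated `'Mon, 2 Apr 2007...'` sample suggests; the time of day and the second version entry are made up, and `first_version_year` is a hypothetical helper.

```python
from email.utils import parsedate_to_datetime


def first_version_year(versions):
    """Return the year in which the first listed version was created."""
    return parsedate_to_datetime(versions[0]["created"]).year


# Sample shaped like the versions column; times and the v2 entry are assumptions.
sample_versions = [
    {"version": "v1", "created": "Mon, 2 Apr 2007 00:00:00 GMT"},
    {"version": "v2", "created": "Tue, 1 Jan 2019 00:00:00 GMT"},
]

submitted_year = first_version_year(sample_versions)
print(submitted_year)
```

Applied to the DataFrame, this could look like `data["versions"].apply(first_version_year)`, at the cost of changing which papers count toward each year.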

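# scrape the arXiv category taxonomy, used below to select computer science papers from 2019 onward
website_url = requests.get('https://arxiv.org/category_taxonomy').text # fetch the page text
soup = BeautifulSoup(website_url, 'lxml') # parse the fetched HTML with the lxml parser, which is fast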
print(website_url)
root = soup.find('div',{'id':'category_taxonomy_list'})
tags = root.find_all(["h2","h3","h4","p"],recursive = True) # collect the heading and paragraph tags
print(tags)

# initialize the str and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []

# walk through the tags and fill the lists
for t in tags:
  if t.name == "h2":
    level_1_name = t.text
    level_2_code = t.text
    level_2_name = t.text
  elif t.name == "h3":
    raw = t.text
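    # In the regular expression, '.' matches any single character and '*' repeats the preceding item zero or more times.
    # "\(" matches a literal "(".
    # (.*) captures the text before the parentheses; \((.*)\) captures the text inside them.
    level_2_code = re.sub(r"(.*)\((.*)\)",r"\2",raw) # text inside the parentheses
    level_2_name = re.sub(r"(.*)\((.*)\)",r"\1",raw) # text before the parentheses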
  elif t.name == "h4":
    raw = t.text
    level_3_code = re.sub(r"(.*) \((.*)\)",r"\1", raw)
    level_3_name = re.sub(r"(.*) \((.*)\)",r"\2", raw)
  elif t.name == "p":
    notes = t.text
    level_1_names.append(level_1_name)
    level_2_names.append(level_2_name)
    level_2_codes.append(level_2_code)
    level_3_names.append(level_3_name)
    level_3_codes.append(level_3_code)
    level_3_notes.append(notes)
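
To see what the `h4` pattern does, the sketch below applies it to one sample heading. The sample strings are assumptions about how the taxonomy page formats its headings, and `split_code_and_name` is a hypothetical helper. It also shows that `re.sub` returns its input unchanged when the pattern does not match, which is easy to miss.

```python
import re

H4_PATTERN = re.compile(r"(.*) \((.*)\)")


def split_code_and_name(raw):
    """Split 'code (name)' into (code, name); return None if the text has no such shape."""
    match = H4_PATTERN.fullmatch(raw)
    if match is None:
        return None
    return match.group(1), match.group(2)


sample_heading = "cs.AI (Artificial Intelligence)"  # assumed heading format

# Same substitutions as in the loop above.
code_via_sub = re.sub(r"(.*) \((.*)\)", r"\1", sample_heading)
name_via_sub = re.sub(r"(.*) \((.*)\)", r"\2", sample_heading)

# No parentheses, so nothing is substituted and the input comes back as-is.
unchanged = re.sub(r"(.*) \((.*)\)", r"\1", "Computer Science")

print(split_code_and_name(sample_heading))
```

Using `fullmatch` and checking for `None` makes a heading with an unexpected shape visible instead of silently storing the whole text as both code and name.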

Build a DataFrame from the lists above

df_taxonomy = pd.DataFrame({
  'group_name':level_1_names,
  'archive_name':level_2_names,
  'archive_id':level_2_codes,
  'category_name':level_3_names,
  'categories':level_3_codes,
  'category_description':level_3_notes
})
df_taxonomy.groupby(["group_name", "archive_name"]) # note: the result is discarded, so this line does not change df_taxonomy
df_taxonomy

No.	group_name	archive_name	archive_id	category_name	categories	category_description
0	Computer Science	Computer Science	Computer Science	Artificial Intelligence	cs.AI	Covers all areas of AI except Vision, Robotics...
1	Computer Science	Computer Science	Computer Science	Hardware Architecture	cs.AR	Covers systems organization and hardware archi...
2	Computer Science	Computer Science	Computer Science	Computational Complexity	cs.CC	Covers models of computation, complexity class...
3	Computer Science	Computer Science	Computer Science	Computational Engineering, Finance, and Science	cs.CE	Covers applications of computer science to the...
4	Computer Science	Computer Science	Computer Science	Computational Geometry	cs.CG	Roughly includes material in ACM Subject Class...
...	...	...	...	...	...	...
153	Statistics	Statistics	Statistics	Other Statistics	stat.OT	Work in statistics that does not fit into the ...
154	Statistics	Statistics	Statistics	Statistics Theory	stat.TH	stat.TH is an alias for math.ST. Asymptotics, ...

155 rows × 6 columns

Data visualization

_df = data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()
_df
# visualize the result as a pie chart
fig = plt.figure(figsize = (15,12))
# explode: offset of each wedge from the center; it needs one value per row of _df
explode = (0,0,0,0.2,0.3,0.3,0.2,0.1)
plt.pie(_df["id"], labels = _df["group_name"], autopct="%1.2f%%", startangle = 160, explode=explode)
plt.tight_layout()
plt.show()

Figure: share of papers by subject group (不同学科论文数量占比.png)
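
The merge above joins on the full `categories` string, so only papers whose string equals exactly one taxonomy code find a match; a paper tagged `cs.AI cs.LG` gets no group. One alternative counting rule, different from the one used above, is to split the string and count each paper once in every group it touches. The sketch below shows that rule on toy data; the papers, the small code-to-group mapping, and the function name are illustrative assumptions.

```python
from collections import Counter


def count_papers_per_group(papers, code_to_group):
    """Count each paper once for every subject group among its categories."""
    counts = Counter()
    for paper in papers:
        groups = {
            code_to_group[code]
            for code in paper["categories"].split()
            if code in code_to_group
        }
        counts.update(groups)  # a set, so a paper adds at most 1 per group
    return counts


# Toy stand-ins for df_taxonomy and data.
code_to_group = {
    "cs.AI": "Computer Science",
    "cs.LG": "Computer Science",
    "stat.ML": "Statistics",
}
papers = [
    {"id": "p1", "categories": "cs.AI"},
    {"id": "p2", "categories": "cs.LG stat.ML"},
    {"id": "p3", "categories": "cs.AI cs.LG"},
]

group_counts = count_papers_per_group(papers, code_to_group)
exact_matches = sum(1 for paper in papers if paper["categories"] in code_to_group)

print(group_counts)
```

Under this rule the group totals add up to more than the number of papers, because cross-listed papers are counted in several groups; whether that is desirable depends on the question being asked.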

Paper counts for 2019 and 2020

group_name="Computer Science"
cats = data.merge(df_taxonomy, on="categories").query("group_name == @group_name")
cats.groupby(["year","category_name"]).count().reset_index().pivot(index="category_name", columns="year",values="id")

category_name	2019	2020
Artificial Intelligence	558	757
Computation and Language	2153	2906
Computational Complexity	131	188
Computational Engineering, Finance, and Science	108	205
Computational Geometry	199	216
Computer Science and Game Theory	281	323
Computer Vision and Pattern Recognition	5559	6517
Computers and Society	346	564
Cryptography and Security	1067	1238
Data Structures and Algorithms	711	902
Databases	282	342
Digital Libraries	125	157
Discrete Mathematics	84	81
Distributed, Parallel, and Cluster Computing	715	774
Emerging Technologies	101	84
Formal Languages and Automata Theory	152	137
General Literature	5	5
Graphics	116	151
Hardware Architecture	95	159
Human-Computer Interaction	420	580
Information Retrieval	245	331
Logic in Computer Science	470	504
Machine Learning	177	538
Mathematical Software	27	45
Multiagent Systems	85	90
Multimedia	76	66
Networking and Internet Architecture	864	783
Neural and Evolutionary Computing	235	279
Numerical Analysis	40	11
Operating Systems	36	33
Other Computer Science	67	69
Performance	45	51
Programming Languages	268	294
Robotics	917	1298
Social and Information Networks	202	325
Software Engineering	659	804
Sound	7	4
Symbolic Computation	44	36
Systems and Control	415	133
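
To compare categories of very different sizes, the relative change between the two years can be more telling than the raw counts. The sketch below computes it for three rows copied from the table above; the helper `relative_change` is a hypothetical name. Since `year` comes from `update_date` and the merge keeps only single-category papers, the percentages are rough indicators at best.

```python
def relative_change(old, new):
    """Relative change from old to new, e.g. 0.25 for a 25% increase."""
    return (new - old) / old


# (2019 count, 2020 count), copied from the table above.
counts_by_year = {
    "Computer Vision and Pattern Recognition": (5559, 6517),
    "Computation and Language": (2153, 2906),
    "Systems and Control": (415, 133),
}

percent_change = {
    name: round(100 * relative_change(count_2019, count_2020), 1)
    for name, (count_2019, count_2020) in counts_by_year.items()
}

for name, change in percent_change.items():
    print(f"{name}: {change:+.1f}%")
```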