Data analysis task 1: Paper statistics

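Task description

Topic: paper count statistics, i.e. count the number of papers in each area of computer science over the whole of 2019.
Dataset: https://www.kaggle.com/Cornell-University/arxiv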

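1. Environment setup: Google Colab + Kaggle dataset

Run the following script in Colab to download the arXiv dataset (the kaggle.json API token must already be uploaded to /content):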
!pip install kaggle
!mkdir -p ~/.kaggle
!cp /content/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle config set -n path -v /content
!kaggle datasets download -d Cornell-University/arxiv

2. Paper statistics

(1) Unzip the file

import zipfile
datapath = '/content/datasets/Cornell-University/arxiv/arxiv.zip'
datazip = zipfile.ZipFile(datapath)
print(datazip.namelist())
print(datazip.filename)
datazip.extractall()
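
extractall() unpacks every member of the archive. If only the metadata file is needed, a single member can be extracted instead. The following is a minimal sketch; the helper name extract_member is hypothetical, and it assumes the archive contains a member called arxiv-metadata-oai-snapshot.json (the file name that is opened in step (3)).

```python
import zipfile
from pathlib import Path


def extract_member(zip_path, member, dest_dir="."):
    """Extract one named member from a zip archive and return its path."""
    with zipfile.ZipFile(zip_path) as archive:
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member} not found in {zip_path}")
        # ZipFile.extract() returns the path of the file it wrote
        return Path(archive.extract(member, path=dest_dir))
```

A call such as extract_member(datapath, 'arxiv-metadata-oai-snapshot.json', '/content') would then write only that file.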

(2) Import packages
import seaborn as sns 
from bs4 import BeautifulSoup
import re
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
(3) Read the JSON data
# read data
data = []
with open('/content/arxiv-metadata-oai-snapshot.json', 'r') as f:
  for line in f:
    data.append(json.loads(line))
data = pd.DataFrame(data)
data.shape
(1796911, 14)
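
Loading all 14 fields of roughly 1.8 million records into a list of dicts can use a lot of memory on Colab. A possible lighter variant is sketched below; it keeps only the fields used later in this analysis. The function name iter_records and the default field list are assumptions for illustration, not part of the original notebook.

```python
import json


def iter_records(path, keep=("id", "categories", "update_date")):
    """Yield one dict per line of a JSON Lines file, keeping only the `keep` keys."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # record.get() returns None for a missing key instead of raising
            yield {key: record.get(key) for key in keep}
```

With this helper, data = pd.DataFrame(list(iter_records('/content/arxiv-metadata-oai-snapshot.json'))) would build a DataFrame with three columns rather than 14, so the shapes reported in this post would differ accordingly.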
# Inspect the data
data.head(1)
id submitter authors title comments journal-ref doi report-no categories license abstract versions update_date authors_parsed
0704.0001 Pavel Nadolsky C. Bal'azs, E. L. Berger, P. M. Nadolsky, C.-... Calculation of prompt diphoton production cros... 37 pages, 15 figures; published version Phys.Rev.D76:013009,2007 10.1103/PhysRevD.76.013009 ANL-HEP-PR-07-12 hep-ph None A fully differential calculation in perturba... [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... 2008-11-26 [[Balázs, C., ], [Berger, E. L., ], [Nadolsky,...
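Column descriptions
No. Column Description
0 id arXiv ID, which can be used to access the paper
1 submitter the person who submitted the paper
2 authors the authors of the paper
3 title the paper title
4 comments number of pages, figures, and other remarks
5 journal-ref information about the journal the paper was published in
6 doi Digital Object Identifier, https://www.doi.org
7 report-no report number
8 categories the categories (tags) the paper belongs to in the arXiv system
9 license the license of the paper
10 abstract the paper abstract
11 versions the version history of the paper
12 update_date the date the record was last updated
13 authors_parsed parsed author information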
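(4) Data preprocessing
'''
count: number of elements in the column
unique: number of distinct values in the column
top: the most frequent value in the column
freq: how many times the most frequent value occurs
'''
# Inspect categories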
data['categories'].describe()
'''
output:
count      1796911
unique       62055
top       astro-ph
freq         86914
Name: categories, dtype: object
'''

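There are 1,796,911 records and 62,055 distinct values in the categories column (each value is a space-separated combination of category codes). The most frequent value is astro-ph, which occurs 86,914 times.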
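
To make the four describe() fields concrete, here is a small sketch on made-up data (the toy Series is hypothetical, not taken from the dataset). It also illustrates why the unique count is so much larger than the number of arXiv categories: a combined string such as "cs.AI cs.LG" counts as its own distinct value.

```python
import pandas as pd

# Hypothetical toy column: four papers, one of them cross-listed
toy = pd.Series(["cs.AI", "cs.AI cs.LG", "cs.AI", "hep-ph"], name="categories")

summary = toy.describe()
print(summary)
# count is the number of values, unique the number of distinct strings,
# top the most frequent string, and freq how often that string occurs
```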

# How many distinct categories appear in the dataset
unique_categories = set([i for l in [x.split(' ') for x in data["categories"]] for i in l])
len(unique_categories)
unique_categories
 'ao-sci',
 'astro-ph', 'astro-ph.CO', 'astro-ph.EP', 'astro-ph.GA', 'astro-ph.HE', 'astro-ph.IM', 'astro-ph.SR',
 'atom-ph',
 'bayes-an',
 'chao-dyn',
 'chem-ph',
 'cmp-lg',
 'comp-gas',
 'cond-mat', 'cond-mat.dis-nn', 'cond-mat.mes-hall', 'cond-mat.mtrl-sci', 'cond-mat.other', 'cond-mat.quant-gas',
'cond-mat.soft', 'cond-mat.stat-mech', 'cond-mat.str-el', 'cond-mat.supr-con',
 'cs.AI', 'cs.AR','cs.CC', 'cs.CE', 'cs.CG', 'cs.CL','cs.CR', 'cs.CV', 'cs.CY', 'cs.DB','cs.DC', 'cs.DL','cs.DM','cs.DS','cs.ET', 'cs.FL', 'cs.GL','cs.GR','cs.GT','cs.HC', 'cs.IR', 'cs.IT', 'cs.LG', 'cs.LO', 'cs.MA', 'cs.MM', 'cs.MS', 'cs.NA', 'cs.NE', 'cs.NI', 'cs.OH', 'cs.OS','cs.PF', 'cs.PL','cs.RO','cs.SC','cs.SD','cs.SE','cs.SI', 'cs.SY',
 'dg-ga',
 'econ.EM','econ.GN', 'econ.TH',
 'eess.AS', 'eess.IV', 'eess.SP', 'eess.SY',
 'funct-an',
 'gr-qc',
 'hep-ex',
 'hep-lat',
 'hep-ph',
 'hep-th',
 'math-ph',
 'math.AC', 'math.AG', 'math.AP', 'math.AT', 'math.CA', 'math.CO', 'math.CT', 'math.CV', 'math.DG','math.DS', 'math.FA', 'math.GM', 'math.GN', 'math.GR', 'math.GT', 'math.HO', 'math.IT', 'math.KT', 'math.LO', 'math.MG', 'math.MP', 'math.NA', 'math.NT', 'math.OA', 'math.OC', 'math.PR', 'math.QA', 'math.RA', 'math.RT', 'math.SG', 'math.SP', 'math.ST',
 'mtrl-th',
 'nlin.AO','nlin.CD', 'nlin.CG', 'nlin.PS', 'nlin.SI',
 'nucl-ex','nucl-th',
 'patt-sol',
 'physics.acc-ph', 'physics.ao-ph', 'physics.app-ph', 'physics.atm-clus', 'physics.atom-ph', 'physics.bio-ph', 'physics.chem-ph', 'physics.class-ph', 'physics.comp-ph', 'physics.data-an', 'physics.ed-ph', 'physics.flu-dyn', 'physics.gen-ph', 'physics.geo-ph', 'physics.hist-ph', 'physics.ins-det', 'physics.med-ph', 'physics.optics', 'physics.plasm-ph', 'physics.pop-ph', 'physics.soc-ph','physics.space-ph',
 'plasm-ph',
 'q-alg',
 'q-bio','q-bio.BM','q-bio.CB', 'q-bio.GN','q-bio.MN', 'q-bio.NC','q-bio.OT','q-bio.PE', 'q-bio.QM','q-bio.SC', 'q-bio.TO',
'q-fin.CP','q-fin.EC','q-fin.GN','q-fin.MF','q-fin.PM', 'q-fin.PR', 'q-fin.RM', 'q-fin.ST', 'q-fin.TR',
 'quant-ph',
 'solv-int',
'stat.AP', 'stat.CO','stat.ME','stat.ML','stat.OT', 'stat.TH',
 'supr-con'}

print(len(unique_categories))
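
The nested list comprehension above splits every categories string and flattens the result into a set. An equivalent formulation with pandas string methods may be easier to read; the sketch below uses a hypothetical three-row DataFrame rather than the real data.

```python
import pandas as pd

# Hypothetical toy data with the same space-separated format
toy = pd.DataFrame({"categories": ["cs.AI cs.LG", "hep-ph", "cs.LG stat.ML"]})

# Split each string into a list, explode to one code per row, then deduplicate
toy_unique = set(toy["categories"].str.split(" ").explode())
print(sorted(toy_unique))
```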

# Restrict the analysis to papers updated in 2019 or later
data['year'] = pd.to_datetime(data["update_date"]).dt.year # convert update_date from str to datetime and extract the year
del data["update_date"]
data = data[data["year"] >= 2019]
data.reset_index(drop = True, inplace = True) # renumber the index
data

395123 rows × 14 columns
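
Note that update_date is the date a record was last updated, so an older paper that was revised in 2019 also passes this filter, and the filter keeps 2020 as well as 2019. If the goal is papers first submitted in a given year, one option is to read the year from the first entry of the versions column. The sketch below is an assumption-laden illustration: the helper name first_submission_year is hypothetical, and it assumes each 'created' value is a full RFC 2822 style timestamp such as 'Mon, 2 Apr 2007 19:18:42 GMT' (the sample row in this post shows only the truncated beginning of that string).

```python
from email.utils import parsedate_to_datetime


def first_submission_year(versions):
    """Return the year of the earliest 'created' timestamp in a versions list."""
    # Assumed entry shape: {'version': 'v1', 'created': 'Mon, 2 Apr 2007 19:18:42 GMT'}
    created = [parsedate_to_datetime(v["created"]) for v in versions]
    return min(created).year
```

It could be applied as data['versions'].apply(first_submission_year), followed by a filter such as == 2019 to select a single year.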

# Scrape the arXiv category taxonomy, needed to pick out the computer science papers from 2019 onwards
website_url = requests.get('https://arxiv.org/category_taxonomy').text # fetch the page text
soup = BeautifulSoup(website_url, 'lxml') # parse the scraped page with lxml, which is faster
print(website_url)
root = soup.find('div',{'id':'category_taxonomy_list'})
tags = root.find_all(["h2","h3","h4","p"],recursive = True) # collect the tags
print(tags)

# Initialize the str and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []

# Walk through the tags and record the current group, archive, and category
for t in tags:
  if t.name == "h2":
    level_1_name = t.text
    level_2_code = t.text
    level_2_name = t.text
  elif t.name == "h3":
    raw = t.text
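    # Regex notes: '.' matches any single character; '*' matches the preceding element zero or more times.
    # "\(" matches a literal '('.
    # The first (.*) captures everything before the parentheses; \((.*)\) captures the text inside them.
    level_2_code = re.sub(r"(.*)\((.*)\)",r"\2",raw) # text inside the parentheses
    level_2_name = re.sub(r"(.*)\((.*)\)",r"\1",raw) # text before the parentheses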
  elif t.name == "h4":
    raw = t.text
    level_3_code = re.sub(r"(.*) \((.*)\)",r"\1", raw)
    level_3_name = re.sub(r"(.*) \((.*)\)",r"\2", raw)
  elif t.name == "p":
    notes = t.text
    level_1_names.append(level_1_name)
    level_2_names.append(level_2_name)
    level_2_codes.append(level_2_code)
    level_3_names.append(level_3_name)
    level_3_codes.append(level_3_code)
    level_3_notes.append(notes)
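
The re.sub calls in the loop are easier to follow on a single string. The sketch below assumes an h4 heading of the form "cs.AI (Artificial Intelligence)", which is consistent with the category_name and categories columns shown in the resulting table. It also shows a variant with re.fullmatch, since re.sub returns the input unchanged when the pattern does not match.

```python
import re

PATTERN = r"(.*) \((.*)\)"
raw = "cs.AI (Artificial Intelligence)"  # assumed heading format

code = re.sub(PATTERN, r"\1", raw)  # text before " ("
name = re.sub(PATTERN, r"\2", raw)  # text inside the parentheses
print(code, "|", name)

# An explicit match makes an unexpected heading fail loudly
m = re.fullmatch(PATTERN, raw)
if m is None:
    raise ValueError(f"unexpected heading format: {raw!r}")
code2, name2 = m.groups()
```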
Build a DataFrame from the information above
df_taxonomy = pd.DataFrame({
  'group_name':level_1_names,
  'archive_name':level_2_names,
  'archive_id':level_2_codes,
  'category_name':level_3_names,
  'categories':level_3_codes,
  'category_description':level_3_notes
})
df_taxonomy.groupby(["group_name", "archive_name"])
df_taxonomy
No. group_name archive_name archive_id category_name categories category_description
0 Computer Science Computer Science Computer Science Artificial Intelligence cs.AI Covers all areas of AI except Vision, Robotics...
1 Computer Science Computer Science Computer Science Hardware Architecture cs.AR Covers systems organization and hardware archi...
2 Computer Science Computer Science Computer Science Computational Complexity cs.CC Covers models of computation, complexity class...
3 Computer Science Computer Science Computer Science Computational Engineering, Finance, and Science cs.CE Covers applications of computer science to the...
4 Computer Science Computer Science Computer Science Computational Geometry cs.CG Roughly includes material in ACM Subject Class...
... ... ... ... ... ... ...
153 Statistics Statistics Statistics Other Statistics stat.OT Work in statistics that does not fit into the ...
154 Statistics Statistics Statistics Statistics Theory stat.TH stat.TH is an alias for math.ST. Asymptotics, ...

155 rows × 6 columns

Data visualization
_df = data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()
_df
# Visualize the result with a pie chart
fig = plt.figure(figsize = (15,12))
# explode: how far each wedge is offset from the center
explode = (0,0,0,0.2,0.3,0.3,0.2,0.1)
plt.pie(_df["id"], labels = _df["group_name"], autopct="%1.2f%%", startangle = 160, explode=explode)
plt.tight_layout()
plt.show()
[Figure: share of papers by discipline]
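
One caveat: the merge above joins on the raw categories string, so a cross-listed paper such as "cs.AI cs.LG" finds no match in the taxonomy and drops out of the group counts. A possible alternative, sketched below on hypothetical toy data, splits and explodes the categories first so that each paper is counted once per group it belongs to.

```python
import pandas as pd

# Hypothetical toy inputs standing in for `data` and `df_taxonomy`
papers = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "categories": ["cs.AI cs.LG", "hep-ph", "cs.CV"],
})
taxonomy = pd.DataFrame({
    "categories": ["cs.AI", "cs.LG", "cs.CV", "hep-ph"],
    "group_name": ["Computer Science"] * 3 + ["Physics"],
})

# A direct merge only matches rows whose string is a single category code
direct = papers.merge(taxonomy, on="categories", how="inner")

# Splitting and exploding gives one row per (paper, category) pair
exploded = papers.assign(
    categories=papers["categories"].str.split(" ")
).explode("categories")

per_group = (
    exploded.merge(taxonomy, on="categories", how="left")
    .drop_duplicates(["id", "group_name"])  # count a paper once per group
    .groupby("group_name")["id"]
    .count()
)
print(per_group)
```

In this toy case the direct merge loses paper p1, while the exploded version counts it once under Computer Science.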

Paper counts for 2019 and 2020

group_name="Computer Science"
cats = data.merge(df_taxonomy, on="categories").query("group_name == @group_name")
cats.groupby(["year","category_name"]).count().reset_index().pivot(index="category_name", columns="year",values="id") 
category_name 2019 2020
Artificial Intelligence 558 757
Computation and Language 2153 2906
Computational Complexity 131 188
Computational Engineering, Finance, and Science 108 205
Computational Geometry 199 216
Computer Science and Game Theory 281 323
Computer Vision and Pattern Recognition 5559 6517
Computers and Society 346 564
Cryptography and Security 1067 1238
Data Structures and Algorithms 711 902
Databases 282 342
Digital Libraries 125 157
Discrete Mathematics 84 81
Distributed, Parallel, and Cluster Computing 715 774
Emerging Technologies 101 84
Formal Languages and Automata Theory 152 137
General Literature 5 5
Graphics 116 151
Hardware Architecture 95 159
Human-Computer Interaction 420 580
Information Retrieval 245 331
Logic in Computer Science 470 504
Machine Learning 177 538
Mathematical Software 27 45
Multiagent Systems 85 90
Multimedia 76 66
Networking and Internet Architecture 864 783
Neural and Evolutionary Computing 235 279
Numerical Analysis 40 11
Operating Systems 36 33
Other Computer Science 67 69
Performance 45 51
Programming Languages 268 294
Robotics 917 1298
Social and Information Networks 202 325
Software Engineering 659 804
Sound 7 4
Symbolic Computation 44 36
Systems and Control 415 133
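
A table like this can also be built with pd.crosstab, which makes it straightforward to add a year-over-year change column. The sketch below uses hypothetical toy rows, not the counts above, and the column name change is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy rows: one row per paper
toy = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020, 2020],
    "category_name": ["Robotics", "Databases", "Robotics", "Robotics", "Databases"],
})

# Rows are categories, columns are years, cells are paper counts
table = pd.crosstab(toy["category_name"], toy["year"])
table["change"] = table[2020] - table[2019]
print(table.sort_values("change", ascending=False))
```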
