Introduction to Data Analysis (Academic Frontier Trend Analysis) Task 1: Paper Data Statistics

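This competition task is a beginner-level introduction to data analysis (academic frontier trend analysis) that uses public arXiv papers to carry out the corresponding analyses. The task covers counting papers, author frequencies, and papers' source code, as well as classifying papers and modeling the relationships between paper authors.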

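Contents

Ⅰ. Data and Background

Ⅱ. Dataset Overview

Ⅲ. Code Implementation

Ⅰ. Environment Setup

Ⅱ. Data Preprocessing

Ⅲ. Data Analysis and Visualization

Ⅳ. References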


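Ⅰ. Data and Background

  1. Topic: counting papers
  2. Content: understand the task, and learn to read data with Pandas and compute statistics
  3. Dataset: arXiv is an important open-access academic website and a key tool for searching, browsing, and downloading academic papers. Its papers cover a very wide range, including the many branches of physics and the numerous subfields of computer science, as well as mathematics, statistics, electrical engineering, quantitative biology, and economics.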

Ⅱ. Dataset Overview

Dataset link

Detailed dataset description link

Part of the dataset description is given below:

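  • Dataset format:

id: arXiv ID, which can be used to access the paper
submitter: the person who submitted the paper
authors: the paper's authors
title: the paper's title
comments: number of pages, figures, and other information
journal-ref: information about the journal in which the paper was published
doi: digital object identifier
report-no: report number
categories: the categories or tags of the paper in the arXiv system
license: the license of the article
abstract: the paper's abstract
versions: the paper's versions
authors_parsed: author information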

  • Some of the paper categories:

astro-ph: Astrophysics
astro-ph.CO: Cosmology and Nongalactic Astrophysics
astro-ph.EP: Earth and Planetary Astrophysics
astro-ph.GA: Astrophysics of Galaxies
cs.AI: Artificial Intelligence
cs.AR: Hardware Architecture
cs.CC: Computational Complexity
cs.CE: Computational Engineering, Finance, and Science
cs.CV: Computer Vision and Pattern Recognition
cs.CY: Computers and Society
cs.DB: Databases
cs.DC: Distributed, Parallel, and Cluster Computing
cs.DL: Digital Libraries
cs.NA: Numerical Analysis
cs.NE: Neural and Evolutionary Computing
cs.NI: Networking and Internet Architecture
cs.OH: Other Computer Science
cs.OS: Operating Systems

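Ⅲ. Code Implementation

Ⅰ. Environment Setup

The required modules are seaborn, BeautifulSoup (imported as bs4), requests, json, pandas, and matplotlib; the code below also imports re.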
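Before running the code, it can help to confirm that these modules are importable. The following is a minimal sketch and not part of the original write-up; the helper name find_missing is hypothetical, and bs4 is the import name of the BeautifulSoup package.

```python
import importlib.util

# Import names of the modules listed above (BeautifulSoup is imported as bs4).
REQUIRED = ["seaborn", "bs4", "requests", "json", "pandas", "matplotlib"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

Any name reported as missing would need to be installed before the rest of the code can run.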

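Ⅱ. Data Preprocessing

Read the data:

import seaborn as sns  # plotting
from bs4 import BeautifulSoup  # parsing the HTML scraped from arXiv
import re  # regular expressions for matching string patterns
import requests  # sending HTTP requests and fetching pages by URL
import json  # parsing the data, which is stored in JSON format
import pandas as pd  # data processing and analysis
import matplotlib.pyplot as plt  # plotting
# Load the data
data = []  # one dict per paper
# A with statement closes the file handle automatically, even if an exception is raised while reading
with open(r"D:/xyc/competPractice/dataAnalysis2101/archive/arxiv-metadata-oai-snapshot.json", 'r', encoding='utf-8') as f:
    for line in f:  # each line of the file is one JSON record
        data.append(json.loads(line))

data = pd.DataFrame(data)  # convert the list to a DataFrame for analysis with pandas
print(data.shape)  # show the size of the data
data.head()  # show the first five rows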

The result is as follows:

(1796911, 14)
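Loading all 14 fields of roughly 1.8 million records into a list can use a lot of memory. One possible alternative, sketched here under assumptions, keeps only the fields that the later analysis needs while reading. The helper name read_selected_fields and the two sample records are hypothetical; in practice the file object returned by open(...) would be passed in place of the io.StringIO stand-in.

```python
import io
import json


def read_selected_fields(lines, fields):
    """Parse JSON-lines input and keep only the requested keys of each record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        paper = json.loads(line)
        records.append({key: paper.get(key) for key in fields})
    return records


# Hypothetical two-record sample standing in for the real snapshot file.
sample = io.StringIO(
    '{"id": "0001", "title": "A", "categories": "cs.AI", "update_date": "2019-02-20", "abstract": "..."}\n'
    '{"id": "0002", "title": "B", "categories": "math.AT cs.CC", "update_date": "2020-05-01", "abstract": "..."}\n'
)

records = read_selected_fields(sample, ["id", "categories", "update_date"])
print(records)
```

The resulting list of dicts can be passed to pd.DataFrame in the same way as the full list above.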

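First, get a rough summary of the paper categories:

'''
count: number of elements in the column;
unique: number of distinct values in the column;
top: the most frequent value in the column;
freq: how many times the most frequent value occurs;
'''

data["categories"].describe()

The result is as follows: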

count      1796911
unique       62055
top       astro-ph
freq         86914
Name: categories, dtype: object

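The result shows that the dataset contains 1,796,911 papers and that the categories column takes 62,055 distinct values. Each value is the full category string of a paper, so every combination of categories counts as a separate value. The most frequent value is astrophysics (astro-ph), which occurs 86,914 times.

Because many papers belong to more than one category, the summary above is far from sufficient. Next, count how many distinct individual categories appear in the dataset:

# All distinct individual categories

unique_categories = set(c for x in data["categories"] for c in x.split(' '))
print(len(unique_categories))
print(unique_categories)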

[Figure 1: printed output of the code above]

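The result shows 176 distinct categories, more than the number listed on the official website, which indicates that the data contains some categories that the website does not list. Within computer science, however, the categories are unchanged: there are still the 40 listed officially.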
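The difference between category strings and individual categories can be illustrated on a small sample. The sketch below counts how often each individual category occurs after splitting every string on whitespace; the sample strings are taken from the data preview in this article, and the variable names are illustrative.

```python
from collections import Counter

# A few "categories" strings as they appear in the data preview.
category_strings = [
    "astro-ph",
    "math.AT",
    "astro-ph",
    "solv-int nlin.SI",
    "solv-int adap-org hep-th nlin.AO nlin.SI",
    "quant-ph cond-mat math-ph math.MP",
]

# Distinct full strings (what describe() reports as "unique").
distinct_strings = set(category_strings)

# Occurrences of each individual category after splitting.
category_counts = Counter(c for s in category_strings for c in s.split())

print(len(distinct_strings))   # number of distinct strings
print(len(category_counts))    # number of distinct individual categories
print(category_counts.most_common(3))
```

A Counter also gives per-category frequencies, which the set in the code above does not keep.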

This task analyzes frontier research trends, so only papers from the last two years are selected:

# Data from 2019 onward
# Convert update_date from a str such as "2019-02-20" to datetime and extract the year.
# Note that this is the year of the last update, not of the first submission.
data["year"] = pd.to_datetime(data["update_date"]).dt.year
del data["update_date"]  # drop update_date, which is no longer needed
# Keep the rows whose year is 2019 or later and renumber them from 0
data = data[data["year"] >= 2019].reset_index(drop=True)

# View the result
print("Data from 2019 onward:")
data

Data from 2019 onward:

index | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | authors_parsed | year
0 | 0704.0297 | Sung-Chul Yoon | Sung-Chul Yoon, Philipp Podsiadlowski and Step... | Remnant evolution after a carbon-oxygen white ... | 15 pages, 15 figures, 3 tables, submitted to M... | None | 10.1111/j.1365-2966.2007.12161.x | None | astro-ph | None | We systematically explore the evolution of t... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,... | 2019
1 | 0704.0342 | Patrice Ntumba Pungu | B. Dugmore and PP. Ntumba | Cofibrations in the Category of Frolicher Spac... | 27 pages | None | None | None | math.AT | None | Cofibrations are defined in the category of ... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Dugmore, B., ], [Ntumba, PP., ]] | 2019
2 | 0704.0360 | Zaqarashvili | T.V. Zaqarashvili and K Murawski | Torsional oscillations of longitudinally inhom... | 6 pages, 3 figures, accepted in A&A | None | 10.1051/0004-6361:20077246 | None | astro-ph | None | We explore the effect of an inhomogeneous ma... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Zaqarashvili, T. V., ], [Murawski, K, ]] | 2019
3 | 0704.0525 | Sezgin Ayg\"un | Sezgin Aygun, Ismail Tarhan, Husnu Baysal | On the Energy-Momentum Problem in Static Einst... | This submission has been withdrawn by arXiv ad... | Chin.Phys.Lett.24:355-358,2007 | 10.1088/0256-307X/24/2/015 | None | gr-qc | None | This paper has been removed by arXiv adminis... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa... | 2019
4 | 0704.0535 | Antonio Pipino | Antonio Pipino (1,3), Thomas H. Puzia (2,4), a... | The Formation of Globular Cluster Systems in M... | 32 pages (referee format), 9 figures, ApJ acce... | Astrophys.J.665:295-305,2007 | 10.1086/519546 | None | astro-ph | None | The most massive elliptical galaxies show a ... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M... | 2019
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
395118 | quant-ph/9911051 | Stephen A. Fulling | S. A. Fulling | Large Numbers, the Chinese Remainder Theorem, ... | 9 pages; Plain TeX with vanilla.sty and pictex... | Phys. Rev. Applied 13, 024016 (2020) | 10.1103/PhysRevApplied.13.024016 | None | quant-ph | None | This is a pedagogical article cited in the f... | [{'version': 'v1', 'created': 'Thu, 11 Nov 199... | [[Fulling, S. A., ]] | 2020
395119 | solv-int/9511005 | Wen-Xiu Ma | Wen-Xiu Ma, Benno Fuchssteiner | Explicit and Exact Solutions to a Kolmogorov-P... | 14pages, Latex, to appear in Intern. J. Nonlin... | None | 10.1016/0020-7462(95)00064-X | None | solv-int nlin.SI | None | Some explicit traveling wave solutions to a ... | [{'version': 'v1', 'created': 'Tue, 14 Nov 199... | [[Ma, Wen-Xiu, ], [Fuchssteiner, Benno, ]] | 2019
395120 | solv-int/9809008 | Victor Enolskii | J C Eilbeck, V Z Enol'skii, V B Kuznetsov, D V... | Linear r-Matrix Algebra for a Hierarchy of One... | plain LaTeX, 28 pages | None | None | None | solv-int nlin.SI | None | We consider a hierarchy of many-particle sys... | [{'version': 'v1', 'created': 'Wed, 2 Sep 1998... | [[Eilbeck, J C, ], [Enol'skii, V Z, ], [Kuznet... | 2019
395121 | solv-int/9909010 | Pierre van Moerbeke | M. Adler, T. Shiota and P. van Moerbeke | Pfaff tau-functions | 42 pages | None | None | None | solv-int adap-org hep-th nlin.AO nlin.SI | None | Consider the evolution $$ \frac{\pl m_\iy}{\... | [{'version': 'v1', 'created': 'Wed, 15 Sep 199... | [[Adler, M., ], [Shiota, T., ], [van Moerbeke,... | 2019
395122 | solv-int/9909014 | David Fairlie | D.B. Fairlie and A.N. Leznov | The General Solution of the Complex Monge-Amp\... | 13 pages, latex, no figures | None | 10.1088/0305-4470/33/25/307 | None | solv-int nlin.SI | None | A general solution to the Complex Monge-Amp\... | [{'version': 'v1', 'created': 'Thu, 16 Sep 199... | [[Fairlie, D. B., ], [Leznov, A. N., ]] | 2019

395123 rows × 14 columns
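The preview contains papers whose first version dates from 2007 because the filter uses the year of update_date rather than the year of first submission. If the submission year were wanted instead, one option is to parse the created field of the first entry in versions. The sketch below assumes that created holds a full RFC 2822 style date string (the preview truncates it); the sample record and the helper name first_version_year are hypothetical.

```python
from email.utils import parsedate_to_datetime


def first_version_year(versions):
    """Return the year in which the first listed version was created."""
    created = versions[0]["created"]
    return parsedate_to_datetime(created).year


# Hypothetical record: the exact timestamp format is an assumption.
paper = {
    "id": "0704.0001",
    "update_date": "2019-08-19",
    "versions": [
        {"version": "v1", "created": "Tue, 3 Apr 2007 12:00:00 GMT"},
        {"version": "v2", "created": "Mon, 19 Aug 2019 08:30:00 GMT"},
    ],
}

update_year = int(paper["update_date"][:4])
submit_year = first_version_year(paper["versions"])
print(update_year, submit_year)
```

Under the same assumption, the function could be applied to the versions column with data["versions"].apply(first_version_year) to obtain a submission-year column.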

# Data from 2020 onward
# Keep the rows whose year is 2020 or later and renumber them from 0
data20 = data[data["year"] >= 2020].reset_index(drop=True)
print("Data from 2020 onward:")
data20

Data from 2020 onward:

index | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | authors_parsed | year
0 | 0704.0752 | Davoud Kamani | Davoud Kamani | Actions for the Bosonic String with the Curved... | 8 pages, Latex, no figure, Some minor changes ... | Braz. J. Phys. 38, 268-271 (2008) | 10.1590/S0103-97332008000200010 | None | hep-th | None | At first we introduce an action for the stri... | [{'version': 'v1', 'created': 'Thu, 5 Apr 2007... | [[Kamani, Davoud, ]] | 2020
1 | 0704.0880 | Qiuping A. Wang | Q. A. Wang (ISMANS), F. Tsobnang (ISMANS), S. ... | Stochastic action principle and maximum entropy | This work is a further development of the idea... | Chaos, Solitons and Fractals, 40(2009)2550-2556 | None | None | cond-mat.stat-mech | None | A stochastic action principle for stochastic... | [{'version': 'v1', 'created': 'Fri, 6 Apr 2007... | [[Wang, Q. A., , ISMANS], [Tsobnang, F., , ISM... | 2020
2 | 0704.1403 | Alberto S. Cattaneo | Alberto S. Cattaneo, Florian Schaetz | Equivalences of Higher Derived Brackets | 16 pages; minor changes; corrected typos; to a... | J. Pure Appl. Algebra, 212, 2450-2460 (2008) | 10.1016/j.jpaa.2008.03.013 | None | math.QA math.DG math.SG | None | This note elaborates on Th. Voronov's constr... | [{'version': 'v1', 'created': 'Wed, 11 Apr 200... | [[Cattaneo, Alberto S., ], [Schaetz, Florian, ]] | 2020
3 | 0704.2498 | Daniel H. Lenz | Daniel Lenz, Nicolae Strungaru | Pure Point spectrum for measure dynamical syst... | 22 pages | Journal de Math\'ematiques Pures et Appliqu\'e... | 10.1016/j.matpur.2009.05.013 | None | math-ph math.MP | http://arxiv.org/licenses/nonexclusive-distrib... | We show equivalence of pure point diffractio... | [{'version': 'v1', 'created': 'Thu, 19 Apr 200... | [[Lenz, Daniel, ], [Strungaru, Nicolae, ]] | 2020
4 | 0704.2967 | Serhiy Samokhvalov E. | Serhiy E. Samokhvalov | Group-theoretic Description of Riemannian Spaces | 14 pages | Ukrainian Math. J., v.55 (2003), 1238-1248 | 10.1023/B:UKMA.0000018010.14309.76 | None | math.DG math.GR | None | It is shown that a locally geometrical struc... | [{'version': 'v1', 'created': 'Mon, 23 Apr 200... | [[Samokhvalov, Serhiy E., ]] | 2020
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
224500 | quant-ph/9802022 | Masanao Ozawa | Masanao Ozawa (Nagoya University) | On the Concept of Quantum State Reduction: Inc... | 15 pages, LaTeX, 1 Postscript figure | Annals of the Japan Association for Philosophy... | 10.4288/jafpos1956.11.107 | None | quant-ph | None | The argument is re-examined that the program... | [{'version': 'v1', 'created': 'Mon, 9 Feb 1998... | [[Ozawa, Masanao, , Nagoya University]] | 2020
224501 | quant-ph/9806088 | Jens Eisert | J. Eisert, M. Wilkens, and M. Lewenstein | Quantum Games and Quantum Strategies | 4 pages, 4 figures, typographic sign error in ... | Phys. Rev. Lett. 83, 3077 (1999) | 10.1103/PhysRevLett.83.3077 | None | quant-ph | http://arxiv.org/licenses/nonexclusive-distrib... | We investigate the quantization of non-zero ... | [{'version': 'v1', 'created': 'Fri, 26 Jun 199... | [[Eisert, J., ], [Wilkens, M., ], [Lewenstein,... | 2020
224502 | quant-ph/9807034 | Jens Eisert | J. Eisert (U. Potsdam, Germany), M. B. Plenio ... | A comparison of entanglement measures | 6 pages (RevTeX), 4 figures | J. Mod. Opt. 46, 145 (1999) | 10.1080/09500349908231260 | J. Mod. Opt. 46, 145-154 (1999) | quant-ph | None | We compare the entanglement of formation wit... | [{'version': 'v1', 'created': 'Mon, 13 Jul 199... | [[Eisert, J., , U. Potsdam, Germany], [Plenio,... | 2020
224503 | quant-ph/9910035 | Pavel Exner | P.Duclos, P.Exner, and D. Krejcirik | Locally curved quantum layers | LaTeX2e, 15 pages, to appear in the Ukrainian ... | Ukrainian J. Phys. 45 (2000), 595-601 | None | None | quant-ph cond-mat math-ph math.MP | None | We consider a quantum particle constrained t... | [{'version': 'v1', 'created': 'Fri, 8 Oct 1999... | [[Duclos, P., ], [Exner, P., ], [Krejcirik, D.... | 2020
224504 | quant-ph/9911051 | Stephen A. Fulling | S. A. Fulling | Large Numbers, the Chinese Remainder Theorem, ... | 9 pages; Plain TeX with vanilla.sty and pictex... | Phys. Rev. Applied 13, 024016 (2020) | 10.1103/PhysRevApplied.13.024016 | None | quant-ph | None | This is a pedagogical article cited in the f... | [{'version': 'v1', 'created': 'Thu, 11 Nov 199... | [[Fulling, S. A., ]] | 2020

224505 rows × 14 columns

We now have all papers from 2019 onward and from 2020 onward. Next, pick out the computer science papers from each set, using the category information from the arXiv Category Taxonomy page:

# Pick out the computer science statistics
import random
# Scrape all categories
user_agent_list = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15"
    ]
headers = {'User-Agent': random.choice(user_agent_list)}  # send a randomly chosen browser-like User-Agent
website_url = requests.get('https://arxiv.org/category_taxonomy', headers=headers).text  # text of the web page
soup = BeautifulSoup(website_url, 'html.parser')  # parse the HTML
root = soup.find('div', {'id': 'category_taxonomy_list'})  # entry tag of the taxonomy list
tags = root.find_all(["h2", "h3", "h4", "p"], recursive=True)  # collect the heading and paragraph tags

# Initialize the str and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_3_name = ""
level_3_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []

# Walk through the tags: h2 = group, h3 = archive, h4 = category, p = description
for t in tags:
    if t.name == "h2":
        level_1_name = t.text
        level_2_code = t.text
        level_2_name = t.text
    elif t.name == "h3":
        raw = t.text
        level_2_code = re.sub(r"(.*)\((.*)\)", r"\2", raw)  # regex: pattern (.*)\((.*)\); replacement "\2"; input string raw
        level_2_name = re.sub(r"(.*)\((.*)\)", r"\1", raw)
    elif t.name == "h4":
        raw = t.text
        level_3_code = re.sub(r"(.*) \((.*)\)", r"\1", raw)
        level_3_name = re.sub(r"(.*) \((.*)\)", r"\2", raw)
    elif t.name == "p":
        notes = t.text
        level_1_names.append(level_1_name)
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)

# Build a DataFrame from the collected information
df_taxonomy = pd.DataFrame({
    'group_name' : level_1_names,
    'archive_name' : level_2_names,
    'archive_id' : level_2_codes,
    'category_name' : level_3_names,
    'categories' : level_3_codes,
    'category_description': level_3_notes
})

df_taxonomy
index | group_name | archive_name | archive_id | category_name | categories | category_description
0 | Computer Science | Computer Science | Computer Science | Artificial Intelligence | cs.AI | Covers all areas of AI except Vision, Robotics...
1 | Computer Science | Computer Science | Computer Science | Hardware Architecture | cs.AR | Covers systems organization and hardware archi...
2 | Computer Science | Computer Science | Computer Science | Computational Complexity | cs.CC | Covers models of computation, complexity class...
3 | Computer Science | Computer Science | Computer Science | Computational Engineering, Finance, and Science | cs.CE | Covers applications of computer science to the...
4 | Computer Science | Computer Science | Computer Science | Computational Geometry | cs.CG | Roughly includes material in ACM Subject Class...
... | ... | ... | ... | ... | ... | ...
150 | Statistics | Statistics | Statistics | Computation | stat.CO | Algorithms, Simulation, Visualization
151 | Statistics | Statistics | Statistics | Methodology | stat.ME | Design, Surveys, Model Selection, Multiple Tes...
152 | Statistics | Statistics | Statistics | Machine Learning | stat.ML | Covers machine learning papers (supervised, un...
153 | Statistics | Statistics | Statistics | Other Statistics | stat.OT | Work in statistics that does not fit into the ...
154 | Statistics | Statistics | Statistics | Statistics Theory | stat.TH | stat.TH is an alias for math.ST. Asymptotics, ...

155 rows × 6 columns
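The two regular expressions in the loop split a heading into a code and a name. A small self-contained sketch of that step follows. The heading strings are illustrative examples of the two layouts the patterns expect, not text copied from the page, and re.fullmatch with named groups is used here as an alternative to the paired re.sub calls; the helper name parse_heading is hypothetical.

```python
import re

# h3 headings are assumed to look like "<archive name>(<archive id>)".
H3_PATTERN = re.compile(r"(?P<name>.*)\((?P<code>.*)\)")
# h4 headings are assumed to look like "<category code> (<category name>)".
H4_PATTERN = re.compile(r"(?P<code>.*) \((?P<name>.*)\)")


def parse_heading(pattern, text):
    """Return (code, name) for a heading, or None if it does not match."""
    match = pattern.fullmatch(text.strip())
    if match is None:
        return None
    return match.group("code").strip(), match.group("name").strip()


print(parse_heading(H3_PATTERN, "Astrophysics(astro-ph)"))
print(parse_heading(H4_PATTERN, "cs.AI (Artificial Intelligence)"))
print(parse_heading(H4_PATTERN, "no parentheses here"))
```

re.sub returns its input unchanged when the pattern does not match, whereas this variant returns None, which makes unexpected headings easier to notice.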

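[Note] Some problems I ran into and how I solved them

Problem 1: the original code was

soup = BeautifulSoup(website_url,'lxml') # parse the HTML; the lxml parser is used here for speed

which raised the error

FeatureNotFound

My fix was to change it to

soup = BeautifulSoup(website_url,'html.parser') # parse the HTML

The error means that BeautifulSoup could not find the lxml parser, which is a separate third-party library that has to be installed (as the message "Do you need to install a parser library?" suggests). Switching to Python's built-in html.parser avoids the dependency. Reference article: bs4 FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

Problem 2: the following error appeared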

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='baike.baidu.com', port=443):
 Max retries exceeded with url: https://baike.baidu.com/item/%E5%88%98%E5%BE%B7%E5%8D%8E/114923
 (Caused by NewConnectionError(':
 Failed to establish a new connection: [Errno -2] Name or service not known',))

This error comes up often when scraping. In my experience, it can be caused by an unstable network or by the IP being blocked after too many requests. The "Name or service not known" part of the message means that the host name could not be resolved, which points to the network side. My workaround was to send a browser-like User-Agent chosen at random, using the user_agent_list and headers defined in the scraping code above:

headers['User-Agent'] = random.choice(user_agent_list)

Reference articles:

requests.exceptions.ConnectionError: (‘Connection aborted.’, RemoteDisconnected(‘Remote end closed c

Python crawler: requests exception requests.exceptions.ConnectionError: HTTPSConnectionPool Max retries exceeded
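For transient network failures, retrying with a delay may help more than changing headers alone. A minimal sketch using the retry support that requests exposes through urllib3 follows. The function name build_session, the retry count, the back-off factor, and the list of status codes are illustrative assumptions rather than settings from the original code.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(user_agent, total_retries=3, backoff_factor=1.0):
    """Create a requests session that retries failed connections and some HTTP errors."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,               # wait longer after each failed attempt
        status_forcelist=(429, 500, 502, 503, 504),  # also retry on these HTTP statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    session.headers.update({"User-Agent": user_agent})
    return session


session = build_session("Mozilla/5.0")
# Example call (needs network access):
# html = session.get("https://arxiv.org/category_taxonomy", timeout=30).text
```

The session can be used in place of the bare requests.get call; a randomly chosen entry from user_agent_list could be passed as the user_agent argument.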

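Ⅲ. Data Analysis and Visualization

First, look at the distribution of paper counts across all top-level groups:

# Paper counts for all top-level groups (2019 onward)
_df = data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()

_df

The two tables are merged on the shared "categories" column, the papers are counted per "group_name", and the counts are stored in the "id" column and sorted in descending order. The result is as follows: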

index | group_name | id
0 | Physics | 79985
1 | Mathematics | 51567
2 | Computer Science | 40067
3 | Statistics | 4054
4 | Electrical Engineering and Systems Science | 3297
5 | Quantitative Biology | 1994
6 | Quantitative Finance | 826
7 | Economics | 576
Similarly, the data from 2020 onward is as follows:

# Paper counts for all top-level groups (2020 onward)
_df20 = data[data['year']>=2020].merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()

_df20

index | group_name | id
0 | Physics | 41606
1 | Mathematics | 27072
2 | Computer Science | 21980
3 | Statistics | 2252
4 | Electrical Engineering and Systems Science | 1926
5 | Quantitative Biology | 1108
6 | Quantitative Finance | 474
7 | Economics | 403

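Visualize the result with a pie chart:

# Pie chart (2019 onward)
fig = plt.figure(figsize=(15,12))
explode = (0, 0, 0, 0.2, 0.3, 0.3, 0.2, 0.1)  # pull the smaller slices outward so their labels do not overlap
plt.pie(_df["id"],  labels=_df["group_name"], autopct='%1.2f%%', startangle=160, explode=explode)
plt.tight_layout()
plt.show()

[Figure 2: pie chart of paper counts by group, 2019 onward]

The pie chart for 2020 onward can be drawn in the same way:

[Figure 3: pie chart of paper counts by group, 2020 onward]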

Next, count the papers in each computer science subfield for 2019 and 2020:

# Paper counts in each computer science subfield for 2019 and 2020
group_name="Computer Science"
cats = data.merge(df_taxonomy, on="categories").query("group_name == @group_name")
cats.groupby(["year","category_name"]).count().reset_index().pivot(index="category_name", columns="year",values="id") 

category_name | 2019 | 2020
Artificial Intelligence | 558 | 757
Computation and Language | 2153 | 2906
Computational Complexity | 131 | 188
Computational Engineering, Finance, and Science | 108 | 205
Computational Geometry | 199 | 216
Computer Science and Game Theory | 281 | 323
Computer Vision and Pattern Recognition | 5559 | 6517
Computers and Society | 346 | 564
Cryptography and Security | 1067 | 1238
Data Structures and Algorithms | 711 | 902
Databases | 282 | 342
Digital Libraries | 125 | 157
Discrete Mathematics | 84 | 81
Distributed, Parallel, and Cluster Computing | 715 | 774
Emerging Technologies | 101 | 84
Formal Languages and Automata Theory | 152 | 137
General Literature | 5 | 5
Graphics | 116 | 151
Hardware Architecture | 95 | 159
Human-Computer Interaction | 420 | 580
Information Retrieval | 245 | 331
Logic in Computer Science | 470 | 504
Machine Learning | 177 | 538
Mathematical Software | 27 | 45
Multiagent Systems | 85 | 90
Multimedia | 76 | 66
Networking and Internet Architecture | 864 | 783
Neural and Evolutionary Computing | 235 | 279
Numerical Analysis | 40 | 11
Operating Systems | 36 | 33
Other Computer Science | 67 | 69
Performance | 45 | 51
Programming Languages | 268 | 294
Robotics | 917 | 1298
Social and Information Networks | 202 | 325
Software Engineering | 659 | 804
Sound | 7 | 4
Symbolic Computation | 44 | 36
Systems and Control | 415 | 133

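As can be seen, most categories have more papers in 2020 than in 2019, and the most productive category in both years is Computer Vision and Pattern Recognition, by a wide margin. CV and PR evidently remain a mainstream direction of current research.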
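The comparison can be made more concrete by computing the relative change between the two columns. The sketch below does this for a few rows copied from the table above; the helper name growth_rate is hypothetical.

```python
# (2019 count, 2020 count) for a few categories, copied from the table above.
counts = {
    "Computer Vision and Pattern Recognition": (5559, 6517),
    "Computation and Language": (2153, 2906),
    "Machine Learning": (177, 538),
    "Networking and Internet Architecture": (864, 783),
}


def growth_rate(old, new):
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old


# Print the categories from fastest-growing to fastest-shrinking.
for name, (n2019, n2020) in sorted(
    counts.items(), key=lambda item: growth_rate(*item[1]), reverse=True
):
    print(f"{name}: {growth_rate(n2019, n2020):+.1%}")
```

In these rows, the largest category is not the fastest-growing one, so absolute counts and growth rates are worth reading side by side.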

Ⅳ. References

Datawhale Data Analysis Training Camp Study Handbook (Academic Frontier Trend Analysis), Task 1: Paper Data Statistics

A Detailed Guide to the Basic Attributes of Pandas DataFrame

Getting Started with Pandas (2): DataFrame Structure and Common Operations

bs4 FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

requests.exceptions.ConnectionError: (‘Connection aborted.’, RemoteDisconnected(‘Remote end closed c

Python crawler: requests exception requests.exceptions.ConnectionError: HTTPSConnectionPool Max retries exceeded
