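Data Analysis - Academic Frontier Trend Analysis - Paper Statistics

Paper Statistics

  • 1  Dataset overview
  • 2  arXiv paper categories
  • 3  Code walkthrough
    • 3.1  Importing packages and reading the raw data
    • 3.2  Data preprocessing
      • 3.2.1  Rough statistics on paper categories
      • 3.2.2  Counting the distinct categories
        • 3.2.2.1  Code explanation
  • 4  Data analysis and visualization
    • 4.1  Distribution of paper counts across all top-level categories
      • 4.1.1  Code explanation 1
      • 4.1.2  Code explanation 2
    • 4.2  Visualizing the results with a pie chart
    • 4.3  Paper counts in each computer science subfield since 2019
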
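Dataset overview

Each record in the dataset has the following fields:

id: the arXiv ID, which can be used to access the paper;

submitter: the person who submitted the paper;

authors: the paper's authors;

title: the paper's title;

comments: additional information such as the page count and the number of figures;

journal-ref: information about the journal in which the paper was published;

doi: the Digital Object Identifier (see https://www.doi.org);

report-no: the report number;

categories: the categories (tags) the paper belongs to in the arXiv system;

license: the license of the paper;

abstract: the paper's abstract;

versions: the paper's version history;

update_date: the date the record was last updated;

authors_parsed: the authors' names in parsed form.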
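
To make the field list concrete, here is a minimal sketch that parses a single record with the standard `json` module. The sample record is hand-shortened and purely illustrative (it reuses the ID, submitter, category, and update date of the first row shown in this post, and the title is a placeholder), so it should not be read as a verbatim line from the dataset.

```python
import json

# One line of the metadata file is one JSON object. This sample is a
# hand-shortened, illustrative record, not a verbatim line from the dataset.
sample_line = json.dumps({
    "id": "0704.0297",
    "submitter": "Sung-Chul Yoon",
    "title": "(title omitted)",
    "categories": "astro-ph",
    "update_date": "2019-08-19",
    "authors_parsed": [["Yoon", "Sung-Chul", ""]],
})

record = json.loads(sample_line)      # str -> dict
print(sorted(record.keys()))          # field names present in this record
print(record["categories"].split())   # categories is one space-separated string
```

Note that `categories` is stored as one string, which is why the later steps have to split it before counting individual categories.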

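arXiv paper categories

The category names and their descriptions come from the arXiv website: see the Subject Classifications part of Section 5.3 of https://arxiv.org/help/api/user-manual, or https://arxiv.org/category_taxonomy. An excerpt of the 153 paper categories is shown below: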

'''
'astro-ph': 'Astrophysics',
'astro-ph.CO': 'Cosmology and Nongalactic Astrophysics',
'astro-ph.EP': 'Earth and Planetary Astrophysics',
'astro-ph.GA': 'Astrophysics of Galaxies',
'cs.AI': 'Artificial Intelligence',
'cs.AR': 'Hardware Architecture',
'cs.CC': 'Computational Complexity',
'cs.CE': 'Computational Engineering, Finance, and Science',
'cs.CV': 'Computer Vision and Pattern Recognition',
'cs.CY': 'Computers and Society',
'cs.DB': 'Databases',
'cs.DC': 'Distributed, Parallel, and Cluster Computing',
'cs.DL': 'Digital Libraries',
'cs.NA': 'Numerical Analysis',
'cs.NE': 'Neural and Evolutionary Computing',
'cs.NI': 'Networking and Internet Architecture',
'cs.OH': 'Other Computer Science',
'cs.OS': 'Operating Systems',
'''

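Code walkthrough

Importing packages and reading the raw data

# Import the required packages
import seaborn as sns  # plotting
from bs4 import BeautifulSoup  # parsing the HTML scraped from arXiv
import re  # regular expressions, for matching string patterns
import requests  # HTTP client, for sending network requests and fetching pages
import json  # our data is in JSON format
import pandas as pd  # data processing and analysis
import matplotlib.pyplot as plt  # plotting
# Read the data
data = []

# Advantage of the with statement: the file handle is closed automatically, even if an exception is raised while reading
with open("arxiv-metadata-oai-2019.json", 'r') as f:
    # enumerate() pairs each item of an iterable (list, tuple, string, file, ...) with its index; it is typically used in for loops
    for idx, line in enumerate(f):

        # Read only the first 100 lines; loading the full dataset needs about 8 GB of memory
        if idx >= 100:
            break

        data.append(json.loads(line))

data = pd.DataFrame(data)  # convert the list to a DataFrame for analysis with pandas
data.shape  # show the size of the data
(100, 14)
data.head()  # show the first five rows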
id submitter authors title comments journal-ref doi report-no categories license abstract versions update_date authors_parsed
0 0704.0297 Sung-Chul Yoon Sung-Chul Yoon, Philipp Podsiadlowski and Step... Remnant evolution after a carbon-oxygen white ... 15 pages, 15 figures, 3 tables, submitted to M... None 10.1111/j.1365-2966.2007.12161.x None astro-ph None We systematically explore the evolution of t... [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... 2019-08-19 [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,...
1 0704.0342 Patrice Ntumba Pungu B. Dugmore and PP. Ntumba Cofibrations in the Category of Frolicher Spac... 27 pages None None None math.AT None Cofibrations are defined in the category of ... [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... 2019-08-19 [[Dugmore, B., ], [Ntumba, PP., ]]
2 0704.0360 Zaqarashvili T.V. Zaqarashvili and K Murawski Torsional oscillations of longitudinally inhom... 6 pages, 3 figures, accepted in A&A None 10.1051/0004-6361:20077246 None astro-ph None We explore the effect of an inhomogeneous ma... [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... 2019-08-19 [[Zaqarashvili, T. V., ], [Murawski, K, ]]
3 0704.0525 Sezgin Ayg\"un Sezgin Aygun, Ismail Tarhan, Husnu Baysal On the Energy-Momentum Problem in Static Einst... This submission has been withdrawn by arXiv ad... Chin.Phys.Lett.24:355-358,2007 10.1088/0256-307X/24/2/015 None gr-qc None This paper has been removed by arXiv adminis... [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... 2019-10-21 [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa...
4 0704.0535 Antonio Pipino Antonio Pipino (1,3), Thomas H. Puzia (2,4), a... The Formation of Globular Cluster Systems in M... 32 pages (referee format), 9 figures, ApJ acce... Astrophys.J.665:295-305,2007 10.1086/519546 None astro-ph None The most massive elliptical galaxies show a ... [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... 2019-08-19 [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M...
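
Since loading every field of every record is what makes the full file expensive, one option is to keep only the fields an analysis needs. The helper below is a sketch under that assumption; `read_metadata` and its `fields` and `limit` parameters are hypothetical names introduced here, not part of the original tutorial.

```python
import json
import pandas as pd

def read_metadata(path, fields=None, limit=None):
    """Read a JSON-lines file into a DataFrame, optionally keeping only some fields."""
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for idx, line in enumerate(f):
            if limit is not None and idx >= limit:
                break
            record = json.loads(line)
            if fields is not None:
                # Keep only the requested keys to reduce memory use
                record = {key: record.get(key) for key in fields}
            rows.append(record)
    return pd.DataFrame(rows)

# Example call (requires the dataset file to be present):
# data = read_metadata("arxiv-metadata-oai-2019.json",
#                      fields=["id", "categories", "update_date"], limit=100)
```

With `fields=None` and `limit=100` this behaves like the loop above; dropping long fields such as `abstract` is what would save memory on the full file.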

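Data preprocessing

Rough statistics on paper categories

count: the number of elements in the column;

unique: the number of distinct values in the column;

top: the most frequent value in the column;

freq: the number of times the most frequent value occurs;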

data["categories"].describe()
count          100
unique          31
top       astro-ph
freq            46
Name: categories, dtype: object

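The result shows 100 records and 31 distinct values of the categories field. This is only a rough count, because a paper can carry several categories and each distinct combination is counted as its own value: for example, a paper tagged cs.AI cs.MM and a paper tagged cs.AI cs.OS count as two different values. The most frequent value is astro-ph (Astrophysics), which occurs 46 times.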
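
A small toy example may help show why this count is only rough. The four category strings below are made up for illustration; `describe()` treats each full string as one value, so the two multi-category papers count as two separate values even though they share cs.AI.

```python
import pandas as pd

# Toy data: each string is the full categories value of one paper
categories = pd.Series(["astro-ph", "cs.AI cs.MM", "cs.AI cs.OS", "astro-ph"])

summary = categories.describe()
print(summary)
```

Here `count` is 4, `unique` is 3, `top` is astro-ph, and `freq` is 2, although only four individual categories (astro-ph, cs.AI, cs.MM, cs.OS) actually appear.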

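Counting the distinct categories

Here we use split to break each multi-category string on " " (a space) into a list, use for loops to pull out every individual category, and collect them in a set, which removes duplicates and leaves all the distinct paper categories.

# All distinct categories
unique_categories = set([i for l in [x.split(' ') for x in data["categories"]] for i in l])
len(unique_categories)
unique_categories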
{'astro-ph',
 'cond-mat.mes-hall',
 'cond-mat.str-el',
 'cs.FL',
 'cs.LO',
 'cs.NI',
 'gr-qc',
 'hep-ex',
 'hep-ph',
 'hep-th',
 'math-ph',
 'math.AC',
 'math.AG',
 'math.AT',
 'math.CA',
 'math.CO',
 'math.CV',
 'math.DG',
 'math.DS',
 'math.FA',
 'math.GR',
 'math.LO',
 'math.MP',
 'math.PR',
 'math.RA',
 'math.SG',
 'math.SP',
 'nlin.CD',
 'nucl-ex',
 'physics.acc-ph',
 'physics.class-ph',
 'physics.comp-ph',
 'quant-ph'}
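
The same set can also be obtained with pandas string methods, which additionally gives a per-category count. This is an alternative sketch on toy data, not the tutorial's own approach; it relies on `Series.str.split`, `Series.explode`, and `Series.value_counts`.

```python
import pandas as pd

# Toy stand-in for data["categories"]
categories = pd.Series(["astro-ph", "math.SP math.DG", "math.DG", "hep-ph"])

# Split each string into a list, then give every list element its own row
individual = categories.str.split(" ").explode()

unique_categories = set(individual)
counts = individual.value_counts()
print(len(unique_categories))
print(counts)
```

A paper with several categories contributes one row per category after `explode()`, so the counts add up to more than the number of papers.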

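Python's split() slices a string at the given separator; if maxsplit is specified, at most maxsplit splits are made, giving at most maxsplit+1 substrings.

Syntax of split():
str.split(sep=None, maxsplit=-1)

Parameters

  • sep – the separator. By default, any run of whitespace (spaces, newlines (\n), tabs (\t), etc.).
  • maxsplit – the maximum number of splits. The default is -1, meaning no limit.

**Return value:** a list of the resulting substrings.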
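
The behaviour described above can be checked on a short category string. The strings below are illustrative; the last two lines show the difference between the default separator and an explicit single space.

```python
categories = "math.SP math.DG cs.NI"

all_parts = categories.split(" ")       # split at every space
first_split = categories.split(" ", 1)  # maxsplit=1: at most one split
print(all_parts)    # ['math.SP', 'math.DG', 'cs.NI']
print(first_split)  # ['math.SP', 'math.DG cs.NI']

# With no separator, any run of whitespace counts as a single separator
whitespace_parts = "cs.AI   cs.MM\n".split()
print(whitespace_parts)  # ['cs.AI', 'cs.MM']

# With an explicit separator, consecutive separators yield empty strings
explicit_parts = "cs.AI  cs.MM".split(" ")
print(explicit_parts)  # ['cs.AI', '', 'cs.MM']
```

For data whose spacing may be irregular, calling `split()` without arguments avoids the empty strings that `split(' ')` can produce.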

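Our task is to analyze papers from 2019 onward, so we first preprocess the time feature to keep only the papers, of all categories, from 2019 onward. Note that the filter uses update_date, the date the record was last updated, not the date the paper was first submitted.

data["year"] = pd.to_datetime(data["update_date"]).dt.year  # convert update_date from a str such as 2019-02-20 to datetime and extract the year
# after to_datetime, accessors such as pandas.Series.dt.day or pandas.Series.dt.month return the actual date components

del data["update_date"]  # drop the update_date column; it has served its purpose

data = data[data["year"] >= 2019]  # keep only rows whose year is 2019 or later
# data.groupby(['categories','year'])  # group by categories, then by year within each category

data.reset_index(drop=True, inplace=True)  # renumber the rows
data  # view the result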
id submitter authors title comments journal-ref doi report-no categories license abstract versions authors_parsed year
0 0704.0297 Sung-Chul Yoon Sung-Chul Yoon, Philipp Podsiadlowski and Step... Remnant evolution after a carbon-oxygen white ... 15 pages, 15 figures, 3 tables, submitted to M... None 10.1111/j.1365-2966.2007.12161.x None astro-ph None We systematically explore the evolution of t... [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,... 2019
1 0704.0342 Patrice Ntumba Pungu B. Dugmore and PP. Ntumba Cofibrations in the Category of Frolicher Spac... 27 pages None None None math.AT None Cofibrations are defined in the category of ... [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... [[Dugmore, B., ], [Ntumba, PP., ]] 2019
2 0704.0360 Zaqarashvili T.V. Zaqarashvili and K Murawski Torsional oscillations of longitudinally inhom... 6 pages, 3 figures, accepted in A&A None 10.1051/0004-6361:20077246 None astro-ph None We explore the effect of an inhomogeneous ma... [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... [[Zaqarashvili, T. V., ], [Murawski, K, ]] 2019
3 0704.0525 Sezgin Ayg\"un Sezgin Aygun, Ismail Tarhan, Husnu Baysal On the Energy-Momentum Problem in Static Einst... This submission has been withdrawn by arXiv ad... Chin.Phys.Lett.24:355-358,2007 10.1088/0256-307X/24/2/015 None gr-qc None This paper has been removed by arXiv adminis... [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa... 2019
4 0704.0535 Antonio Pipino Antonio Pipino (1,3), Thomas H. Puzia (2,4), a... The Formation of Globular Cluster Systems in M... 32 pages (referee format), 9 figures, ApJ acce... Astrophys.J.665:295-305,2007 10.1086/519546 None astro-ph None The most massive elliptical galaxies show a ... [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M... 2019
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 0705.3267 Valeri Makarov V. V. Makarov and D. W. Murphy The local stellar velocity field via vector sp... accepted in AJ Astron.J.134:367-375,2007 10.1086/518242 None astro-ph None We analyze the local field of stellar tangen... [{'version': 'v1', 'created': 'Tue, 22 May 200... [[Makarov, V. V., ], [Murphy, D. W., ]] 2019
96 0705.3638 Eilat Glikman Eilat Glikman (1), S. G. Djorgovski (1), Danie... Discovery of Two Spectroscopically Peculiar, L... 15 pages, 5 figures, Accepted for publicated i... None 10.1086/520085 None astro-ph None We report the discovery of two low-luminosit... [{'version': 'v1', 'created': 'Thu, 24 May 200... [[Glikman, Eilat, , Caltech], [Djorgovski, S. ... 2019
97 0705.3769 Marc Schumann Marc Schumann (for the PERKEO II collaboration) Precision Measurements in Neutron Decay 6 pages, to appear in the proceedings of the X... None None None hep-ph None We present new precision measurements of ang... [{'version': 'v1', 'created': 'Fri, 25 May 200... [[Schumann, Marc, , for the PERKEO II collabor... 2019
98 0705.3804 Koji Terashi Koji Terashi (for the CDF and D0 Collaborations) Exclusive e+e-, Di-photon and Di-jet Productio... 4 pages, To be submitted to the proceedings of... None None None hep-ex None Results from studies on exclusive production... [{'version': 'v1', 'created': 'Fri, 25 May 200... [[Terashi, Koji, , for the CDF and D0 Collabor... 2019
99 0705.3857 Niels Martin M{\o}ller Niels Martin Moller Extremal metrics for spectral functions of Dir... 45 pages; title and content edited to reflect ... Adv. Math. 229 (2012), no. 2, 1001--1046. MR28... 10.1016/j.aim.2011.10.012 None math.SP math.DG http://arxiv.org/licenses/nonexclusive-distrib... Let (M^n, g) be a closed smooth Riemannian s... [{'version': 'v1', 'created': 'Fri, 25 May 200... [[Moller, Niels Martin, ]] 2019

100 rows × 14 columns
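
The rows kept by the filter have IDs such as 0704.0297 and first versions created in 2007, which is consistent with update_date being the date of the last update, not the submission date. The sketch below compares the two years on a toy row. It assumes that the created field is an RFC 2822 style date string; the time-of-day part is a placeholder, because the post only shows the truncated prefix "Tue, 3 Apr 2007...".

```python
from email.utils import parsedate_to_datetime
import pandas as pd

# Toy row. The "00:00:00 GMT" part of "created" is a placeholder, not real data.
toy = pd.DataFrame({
    "id": ["0704.0297"],
    "update_date": ["2019-08-19"],
    "versions": [[{"version": "v1", "created": "Tue, 3 Apr 2007 00:00:00 GMT"}]],
})

# Year of the last update, as used by the filter in this post
toy["update_year"] = pd.to_datetime(toy["update_date"]).dt.year

# Year in which the first version was created
toy["first_year"] = toy["versions"].apply(
    lambda versions: parsedate_to_datetime(versions[0]["created"]).year
)
print(toy[["id", "update_year", "first_year"]])
```

Which of the two years is the right one to filter on depends on the question being asked; the tutorial uses the update year.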

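Because the file we use only contains records updated in 2019, the output is the original 100 rows. Next, we scrape the full list of categories from the arXiv website. (This step only builds the category taxonomy; the computer science papers are picked out later, after the visualization.)

# Scrape all the categories
website_url = requests.get('https://arxiv.org/category_taxonomy').text  # fetch the page's HTML text
soup = BeautifulSoup(website_url, 'lxml')  # parse the HTML; the lxml parser is used here for speed
root = soup.find('div', {'id': 'category_taxonomy_list'})  # find the entry point: the div that wraps the whole category list
tags = root.find_all(["h2", "h3", "h4", "p"], recursive=True)  # collect the tags: h2 (top-level group), h3, h4 (category within the group), and p (description)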
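
The scraping steps above can be tried offline on a small HTML string. The markup below is a hypothetical, heavily simplified stand-in for the arXiv page (the real page is more deeply nested), and the sketch uses Python's built-in html.parser instead of lxml so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified HTML; the real arXiv page is more nested
html = """
<div id="category_taxonomy_list">
  <h2 class="accordion-head">Computer Science</h2>
  <h4>cs.AI <span>(Artificial Intelligence)</span></h4>
  <p>Covers all areas of AI except Vision, Robotics, ...</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed
root = soup.find("div", {"id": "category_taxonomy_list"})
tags = root.find_all(["h2", "h3", "h4", "p"], recursive=True)

for tag in tags:
    print(tag.name, "->", tag.get_text(strip=True))
```

`find_all` returns the matching tags in document order, which is what lets later code associate each h4 category with the h2 group that precedes it.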


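Code explanation

**The get() method of the Requests library:**

The most common usage is r = requests.get(url), which builds a Request object that asks the server for the resource at url.

This object is created internally by the Requests library.

The returned r is a Response object that holds everything the server sent back.

  • What is a URL?

A URL is a path for accessing a resource over the HTTP protocol, much like the path of a file on our own computer.

  • The full signature of this function takes three parameters:
    requests.get(url, params=None, **kwargs)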
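
To see what the params argument does without contacting a server, the request can be built and prepared but not sent. This is a sketch: the endpoint and the search_query and max_results parameter names are taken from the arXiv API as an assumption and are not used elsewhere in this post.

```python
import requests
from urllib.parse import urlparse, parse_qs

# Build the request without sending it, to see how params end up in the URL
request = requests.Request(
    "GET",
    "https://export.arxiv.org/api/query",
    params={"search_query": "cat:cs.AI", "max_results": 1},
)
prepared = request.prepare()
print(prepared.url)

# Decode the query string back into a dict to check the round trip
query = parse_qs(urlparse(prepared.url).query)
print(query)
```

Sending the same request would be requests.get(url, params=...), where extra keyword arguments such as timeout are passed through **kwargs.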
root  # id="category_taxonomy_list" gives the element an id whose value is "category_taxonomy_list"; class="accordion-head" gives an element a class named "accordion-head"

Computer Science

cs.AI (Artificial Intelligence)

Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.

cs.AR (Hardware Architecture)

Covers systems organization and hardware architecture. Roughly includes material in ACM Subject Classes C.0, C.1, and C.5.

cs.CC (Computational Complexity)

Covers models of computation, complexity classes, structural complexity, complexity tradeoffs, upper and lower bounds. Roughly includes material in ACM Subject Classes F.1 (computation by abstract devices), F.2.3 (tradeoffs among complexity measures), and F.4.3 (formal languages), although some material in formal languages may be more appropriate for Logic in Computer Science. Some material in F.2.1 and F.2.2, may also be appropriate here, but is more likely to have Data Structures and Algorithms as the primary subject area.

cs.CE (Computational Engineering, Finance, and Science)

Covers applications of computer science to the mathematical modeling of complex systems in the fields of science, engineering, and finance. Papers here are interdisciplinary and applications-oriented, focusing on techniques and tools that enable challenging computational simulations to be performed, for which the use of supercomputers or distributed computing platforms is often required. Includes material in ACM Subject Classes J.2, J.3, and J.4 (economics).

cs.CG (Computational Geometry)

Roughly includes material in ACM Subject Classes I.3.5 and F.2.2.

cs.CL (Computation and Language)

Covers natural language processing. Roughly includes material in ACM Subject Class I.2.7. Note that work on artificial languages (programming languages, logics, formal systems) that does not explicitly address natural-language issues broadly construed (natural-language processing, computational linguistics, speech, text retrieval, etc.) is not appropriate for this area.

cs.CR (Cryptography and Security)

Covers all areas of cryptography and security including authentication, public key cryptosytems, proof-carrying code, etc. Roughly includes material in ACM Subject Classes D.4.6 and E.3.

cs.CV (Computer Vision and Pattern Recognition)

Covers image processing, computer vision, pattern recognition, and scene understanding. Roughly includes material in ACM Subject Classes I.2.10, I.4, and I.5.

cs.CY (Computers and Society)

Covers impact of computers on society, computer ethics, information technology and public policy, legal aspects of computing, computers and education. Roughly includes material in ACM Subject Classes K.0, K.2, K.3, K.4, K.5, and K.7.

cs.DB (Databases)

Covers database management, datamining, and data processing. Roughly includes material in ACM Subject Classes E.2, E.5, H.0, H.2, and J.1.

cs.DC (Distributed, Parallel, and Cluster Computing)

Covers fault-tolerance, distributed algorithms, stabilility, parallel computation, and cluster computing. Roughly includes material in ACM Subject Classes C.1.2, C.1.4, C.2.4, D.1.3, D.4.5, D.4.7, E.1.

cs.DL (Digital Libraries)

Covers all aspects of the digital library design and document and text creation. Note that there will be some overlap with Information Retrieval (which is a separate subject area). Roughly includes material in ACM Subject Classes H.3.5, H.3.6, H.3.7, I.7.

cs.DM (Discrete Mathematics)

Covers combinatorics, graph theory, applications of probability. Roughly includes material in ACM Subject Classes G.2 and G.3.

cs.DS (Data Structures and Algorithms)

Covers data structures and analysis of algorithms. Roughly includes material in ACM Subject Classes E.1, E.2, F.2.1, and F.2.2.

cs.ET (Emerging Technologies)

Covers approaches to information processing (computing, communication, sensing) and bio-chemical analysis based on alternatives to silicon CMOS-based technologies, such as nanoscale electronic, photonic, spin-based, superconducting, mechanical, bio-chemical and quantum technologies (this list is not exclusive). Topics of interest include (1) building blocks for emerging technologies, their scalability and adoption in larger systems, including integration with traditional technologies, (2) modeling, design and optimization of novel devices and systems, (3) models of computation, algorithm design and programming for emerging technologies.

cs.FL (Formal Languages and Automata Theory)

Covers automata theory, formal language theory, grammars, and combinatorics on words. This roughly corresponds to ACM Subject Classes F.1.1, and F.4.3. Papers dealing with computational complexity should go to cs.CC; papers dealing with logic should go to cs.LO.

cs.GL (General Literature)

Covers introductory material, survey material, predictions of future trends, biographies, and miscellaneous computer-science related material. Roughly includes all of ACM Subject Class A, except it does not include conference proceedings (which will be listed in the appropriate subject area).

cs.GR (Graphics)

Covers all aspects of computer graphics. Roughly includes material in all of ACM Subject Class I.3, except that I.3.5 is is likely to have Computational Geometry as the primary subject area.

cs.GT (Computer Science and Game Theory)

Covers all theoretical and applied aspects at the intersection of computer science and game theory, including work in mechanism design, learning in games (which may overlap with Learning), foundations of agent modeling in games (which may overlap with Multiagent systems), coordination, specification and formal methods for non-cooperative computational environments. The area also deals with applications of game theory to areas such as electronic commerce.

cs.HC (Human-Computer Interaction)

Covers human factors, user interfaces, and collaborative computing. Roughly includes material in ACM Subject Classes H.1.2 and all of H.5, except for H.5.1, which is more likely to have Multimedia as the primary subject area.

cs.IR (Information Retrieval)

Covers indexing, dictionaries, retrieval, content and analysis. Roughly includes material in ACM Subject Classes H.3.0, H.3.1, H.3.2, H.3.3, and H.3.4.

cs.IT (Information Theory)

Covers theoretical and experimental aspects of information theory and coding. Includes material in ACM Subject Class E.4 and intersects with H.1.1.

cs.LG (Machine Learning)

Papers on all aspects of machine learning research (supervised, unsupervised, reinforcement learning, bandit problems, and so on) including also robustness, explanation, fairness, and methodology. cs.LG is also an appropriate primary category for applications of machine learning methods.

cs.LO (Logic in Computer Science)

Covers all aspects of logic in computer science, including finite model theory, logics of programs, modal logic, and program verification. Programming language semantics should have Programming Languages as the primary subject area. Roughly includes material in ACM Subject Classes D.2.4, F.3.1, F.4.0, F.4.1, and F.4.2; some material in F.4.3 (formal languages) may also be appropriate here, although Computational Complexity is typically the more appropriate subject area.

cs.MA (Multiagent Systems)

Covers multiagent systems, distributed artificial intelligence, intelligent agents, coordinated interactions. and practical applications. Roughly covers ACM Subject Class I.2.11.

cs.MM (Multimedia)

Roughly includes material in ACM Subject Class H.5.1.

cs.MS (Mathematical Software)

Roughly includes material in ACM Subject Class G.4.

cs.NA (Numerical Analysis)

cs.NA is an alias for math.NA. Roughly includes material in ACM Subject Class G.1.

cs.NE (Neural and Evolutionary Computing)

Covers neural networks, connectionism, genetic algorithms, artificial life, adaptive behavior. Roughly includes some material in ACM Subject Class C.1.3, I.2.6, I.5.

cs.NI (Networking and Internet Architecture)

Covers all aspects of computer communication networks, including network architecture and design, network protocols, and internetwork standards (like TCP/IP). Also includes topics, such as web caching, that are directly relevant to Internet architecture and performance. Roughly includes all of ACM Subject Class C.2 except C.2.4, which is more likely to have Distributed, Parallel, and Cluster Computing as the primary subject area.

cs.OH (Other Computer Science)

This is the classification to use for documents that do not fit anywhere else.

cs.OS (Operating Systems)

Roughly includes material in ACM Subject Classes D.4.1, D.4.2., D.4.3, D.4.4, D.4.5, D.4.7, and D.4.9.

cs.PF (Performance)

Covers performance measurement and evaluation, queueing, and simulation. Roughly includes material in ACM Subject Classes D.4.8 and K.6.2.

cs.PL (Programming Languages)

Covers programming language semantics, language features, programming approaches (such as object-oriented programming, functional programming, logic programming). Also includes material on compilers oriented towards programming languages; other material on compilers may be more appropriate in Architecture (AR). Roughly includes material in ACM Subject Classes D.1 and D.3.

cs.RO (Robotics)

Roughly includes material in ACM Subject Class I.2.9.

cs.SC (Symbolic Computation)

Roughly includes material in ACM Subject Class I.1.

cs.SD (Sound)

Covers all aspects of computing with sound, and sound as an information channel. Includes models of sound, analysis and synthesis, audio user interfaces, sonification of data, computer music, and sound signal processing. Includes ACM Subject Class H.5.5, and intersects with H.1.2, H.5.1, H.5.2, I.2.7, I.5.4, I.6.3, J.5, K.4.2.

cs.SE (Software Engineering)

Covers design tools, software metrics, testing and debugging, programming environments, etc. Roughly includes material in all of ACM Subject Classes D.2, except that D.2.4 (program verification) should probably have Logics in Computer Science as the primary subject area.

cs.SI (Social and Information Networks)

Covers the design, analysis, and modeling of social and information networks, including their applications for on-line information access, communication, and interaction, and their roles as datasets in the exploration of questions in these and other domains, including connections to the social and biological sciences. Analysis and modeling of such networks includes topics in ACM Subject classes F.2, G.2, G.3, H.2, and I.2; applications in computing include topics in H.3, H.4, and H.5; and applications at the interface of computing and other disciplines include topics in J.1--J.7. Papers on computer communication systems and network protocols (e.g. TCP/IP) are generally a closer fit to the Networking and Internet Architecture (cs.NI) category.

cs.SY (Systems and Control)

cs.SY is an alias for eess.SY. This section includes theoretical and experimental research covering all facets of automatic control systems. The section is focused on methods of control system analysis and design using tools of modeling, simulation and optimization. Specific areas of research include nonlinear, distributed, adaptive, stochastic and robust control in addition to hybrid and discrete event systems. Application areas include automotive and aerospace control systems, network control, biological systems, multiagent and cooperative control, robotics, reinforcement learning, sensor networks, control of cyber-physical and energy-related systems, and control of computing systems.

Economics

econ.EM (Econometrics)

Econometric Theory, Micro-Econometrics, Macro-Econometrics, Empirical Content of Economic Relations discovered via New Methods, Methodological Aspects of the Application of Statistical Inference to Economic Data.

econ.GN (General Economics)

General methodological, applied, and empirical contributions to economics.

econ.TH (Theoretical Economics)

Includes theoretical contributions to Contract Theory, Decision Theory, Game Theory, General Equilibrium, Growth, Learning and Evolution, Macroeconomics, Market and Mechanism Design, and Social Choice.

Electrical Engineering and Systems Science

eess.AS (Audio and Speech Processing)

Theory and methods for processing signals representing audio, speech, and language, and their applications. This includes analysis, synthesis, enhancement, transformation, classification and interpretation of such signals as well as the design, development, and evaluation of associated signal processing systems. Machine learning and pattern analysis applied to any of the above areas is also welcome. Specific topics of interest include: auditory modeling and hearing aids; acoustic beamforming and source localization; classification of acoustic scenes; speaker separation; active noise control and echo cancellation; enhancement; de-reverberation; bioacoustics; music signals analysis, synthesis and modification; music information retrieval; audio for multimedia and joint audio-video processing; spoken and written language modeling, segmentation, tagging, parsing, understanding, and translation; text mining; speech production, perception, and psychoacoustics; speech analysis, synthesis, and perceptual modeling and coding; robust speech recognition; speaker recognition and characterization; deep learning, online learning, and graphical models applied to speech, audio, and language signals; and implementation aspects ranging from system architecture to fast algorithms.

eess.IV (Image and Video Processing)

Theory, algorithms, and architectures for the formation, capture, processing, communication, analysis, and display of images, video, and multidimensional signals in a wide variety of applications. Topics of interest include: mathematical, statistical, and perceptual image and video modeling and representation; linear and nonlinear filtering, de-blurring, enhancement, restoration, and reconstruction from degraded, low-resolution or tomographic data; lossless and lossy compression and coding; segmentation, alignment, and recognition; image rendering, visualization, and printing; computational imaging, including ultrasound, tomographic and magnetic resonance imaging; and image and video analysis, synthesis, storage, search and retrieval.

eess.SP (Signal Processing)

Theory, algorithms, performance analysis and applications of signal and data analysis, including physical modeling, processing, detection and parameter estimation, learning, mining, retrieval, and information extraction. The term "signal" includes speech, audio, sonar, radar, geophysical, physiological, (bio-) medical, image, video, and multimodal natural and man-made signals, including communication signals and data. Topics of interest include: statistical signal processing, spectral estimation and system identification; filter design, adaptive filtering / stochastic learning; (compressive) sampling, sensing, and transform-domain methods including fast algorithms; signal processing for machine learning and machine learning for signal processing applications; in-network and graph signal processing; convex and nonconvex optimization methods for signal processing applications; radar, sonar, and sensor array beamforming and direction finding; communications signal processing; low power, multi-core and system-on-chip signal processing; sensing, communication, analysis and optimization for cyber-physical systems such as power grids and the Internet of Things.

eess.SY (Systems and Control)

This section includes theoretical and experimental research covering all facets of automatic control systems. The section is focused on methods of control system analysis and design using tools of modeling, simulation and optimization. Specific areas of research include nonlinear, distributed, adaptive, stochastic and robust control in addition to hybrid and discrete event systems. Application areas include automotive and aerospace control systems, network control, biological systems, multiagent and cooperative control, robotics, reinforcement learning, sensor networks, control of cyber-physical and energy-related systems, and control of computing systems.

Mathematics

math.AC (Commutative Algebra)

Commutative rings, modules, ideals, homological algebra, computational aspects, invariant theory, connections to algebraic geometry and combinatorics

math.AG (Algebraic Geometry)

Algebraic varieties, stacks, sheaves, schemes, moduli spaces, complex geometry, quantum cohomology

math.AP (Analysis of PDEs)

Existence and uniqueness, boundary conditions, linear and non-linear operators, stability, soliton theory, integrable PDE's, conservation laws, qualitative dynamics

math.AT (Algebraic Topology)

Homotopy theory, homological algebra, algebraic treatments of manifolds

math.CA (Classical Analysis and ODEs)

Special functions, orthogonal polynomials, harmonic analysis, ODE's, differential relations, calculus of variations, approximations, expansions, asymptotics

math.CO (Combinatorics)

Discrete mathematics, graph theory, enumeration, combinatorial optimization, Ramsey theory, combinatorial game theory

math.CT (Category Theory)

Enriched categories, topoi, abelian categories, monoidal categories, homological algebra

math.CV (Complex Variables)

Holomorphic functions, automorphic group actions and forms, pseudoconvexity, complex geometry, analytic spaces, analytic sheaves

math.DG (Differential Geometry)

Complex, contact, Riemannian, pseudo-Riemannian and Finsler geometry, relativity, gauge theory, global analysis

math.DS (Dynamical Systems)

Dynamics of differential equations and flows, mechanics, classical few-body problems, iterations, complex dynamics, delayed differential equations

math.FA (Functional Analysis)

Banach spaces, function spaces, real functions, integral transforms, theory of distributions, measure theory

math.GM (General Mathematics)

Mathematical material of general interest, topics not covered elsewhere

math.GN (General Topology)

Continuum theory, point-set topology, spaces with algebraic structure, foundations, dimension theory, local and global properties

math.GR (Group Theory)

Finite groups, topological groups, representation theory, cohomology, classification and structure

math.GT (Geometric Topology)

Manifolds, orbifolds, polyhedra, cell complexes, foliations, geometric structures

math.HO (History and Overview)

Biographies, philosophy of mathematics, mathematics education, recreational mathematics, communication of mathematics, ethics in mathematics

math.IT (Information Theory)

math.IT is an alias for cs.IT. Covers theoretical and experimental aspects of information theory and coding.

math.KT (K-Theory and Homology)

Algebraic and topological K-theory, relations with topology, commutative algebra, and operator algebras

math.LO (Logic)

Logic, set theory, point-set topology, formal mathematics

math.MG (Metric Geometry)

Euclidean, hyperbolic, discrete, convex, coarse geometry, comparisons in Riemannian geometry, symmetric spaces

math.MP (Mathematical Physics)

math.MP is an alias for math-ph. Mathematical methods in quantum field theory, quantum mechanics, statistical mechanics, condensed matter, nuclear and atomic physics.

math.NA (Numerical Analysis)

Numerical algorithms for problems in analysis and algebra, scientific computation

math.NT (Number Theory)

Prime numbers, diophantine equations, analytic number theory, algebraic number theory, arithmetic geometry, Galois theory

math.OA (Operator Algebras)

Algebras of operators on Hilbert space, C^*-algebras, von Neumann algebras, non-commutative geometry

math.OC (Optimization and Control)

Operations research, linear programming, control theory, systems theory, optimal control, game theory

math.PR (Probability)

Theory and applications of probability and stochastic processes: e.g. central limit theorems, large deviations, stochastic differential equations, models from statistical mechanics, queuing theory

math.QA (Quantum Algebra)

Quantum groups, skein theories, operadic and diagrammatic algebra, quantum field theory

math.RA (Rings and Algebras)

Non-commutative rings and algebras, non-associative algebras, universal algebra and lattice theory, linear algebra, semigroups

math.RT (Representation Theory)

Linear representations of algebras and groups, Lie theory, associative algebras, multilinear algebra

math.SG (Symplectic Geometry)

Hamiltonian systems, symplectic flows, classical integrable systems

math.SP (Spectral Theory)

Schrödinger operators, operators on manifolds, general differential operators, numerical studies, integral operators, discrete models, resonances, non-self-adjoint operators, random operators/matrices

math.ST (Statistics Theory)

Applied, computational and theoretical statistics: e.g. statistical inference, regression, time series, multivariate analysis, data analysis, Markov chain Monte Carlo, design of experiments, case studies

Physics

Astrophysics
(astro-ph)

astro-ph.CO (Cosmology and Nongalactic Astrophysics)

Phenomenology of early universe, cosmic microwave background, cosmological parameters, primordial element abundances, extragalactic distance scale, large-scale structure of the universe. Groups, superclusters, voids, intergalactic medium. Particle astrophysics: dark energy, dark matter, baryogenesis, leptogenesis, inflationary models, reheating, monopoles, WIMPs, cosmic strings, primordial black holes, cosmological gravitational radiation

astro-ph.EP (Earth and Planetary Astrophysics)

Interplanetary medium, planetary physics, planetary astrobiology, extrasolar planets, comets, asteroids, meteorites. Structure and formation of the solar system

astro-ph.GA (Astrophysics of Galaxies)

Phenomena pertaining to galaxies or the Milky Way. Star clusters, HII regions and planetary nebulae, the interstellar medium, atomic and molecular clouds, dust. Stellar populations. Galactic structure, formation, dynamics. Galactic nuclei, bulges, disks, halo. Active Galactic Nuclei, supermassive black holes, quasars. Gravitational lens systems. The Milky Way and its contents

astro-ph.HE (High Energy Astrophysical Phenomena)

Cosmic ray production, acceleration, propagation, detection. Gamma ray astronomy and bursts, X-rays, charged particles, supernovae and other explosive phenomena, stellar remnants and accretion systems, jets, microquasars, neutron stars, pulsars, black holes

astro-ph.IM (Instrumentation and Methods for Astrophysics)

Detector and telescope design, experiment proposals. Laboratory Astrophysics. Methods for data analysis, statistical methods. Software, database design

astro-ph.SR (Solar and Stellar Astrophysics)

White dwarfs, brown dwarfs, cataclysmic variables. Star formation and protostellar systems, stellar astrobiology, binary and multiple systems of stars, stellar evolution and structure, coronas. Central stars of planetary nebulae. Helioseismology, solar neutrinos, production and detection of gravitational radiation from stellar systems

Condensed Matter
(cond-mat)

cond-mat.dis-nn (Disordered Systems and Neural Networks)

Description coming soon

cond-mat.mes-hall (Mesoscale and Nanoscale Physics)

Semiconducting nanostructures: quantum dots, wires, and wells. Single electronics, spintronics, 2d electron gases, quantum Hall effect, nanotubes, graphene, plasmonic nanostructures

cond-mat.mtrl-sci (Materials Science)

Techniques, synthesis, characterization, structure. Structural phase transitions, mechanical properties, phonons. Defects, adsorbates, interfaces

cond-mat.other (Other Condensed Matter)

Work in condensed matter that does not fit into the other cond-mat classifications

cond-mat.quant-gas (Quantum Gases)

Ultracold atomic and molecular gases, Bose-Einstein condensation, Feshbach resonances, spinor condensates, optical lattices, quantum simulation with cold atoms and molecules, macroscopic interference phenomena

cond-mat.soft (Soft Condensed Matter)

Membranes, polymers, liquid crystals, glasses, colloids, granular matter

cond-mat.stat-mech (Statistical Mechanics)

Phase transitions, thermodynamics, field theory, non-equilibrium phenomena, renormalization group and scaling, integrable models, turbulence

cond-mat.str-el (Strongly Correlated Electrons)

Quantum magnetism, non-Fermi liquids, spin liquids, quantum criticality, charge density waves, metal-insulator transitions

cond-mat.supr-con (Superconductivity)

Superconductivity: theory, models, experiment. Superflow in helium

General Relativity and Quantum Cosmology
(gr-qc)

gr-qc (General Relativity and Quantum Cosmology)

Description coming soon

High Energy Physics - Experiment
(hep-ex)

hep-ex (High Energy Physics - Experiment)

Description coming soon

High Energy Physics - Lattice
(hep-lat)

hep-lat (High Energy Physics - Lattice)

Description coming soon

High Energy Physics - Phenomenology
(hep-ph)

hep-ph (High Energy Physics - Phenomenology)

Description coming soon

High Energy Physics - Theory
(hep-th)

hep-th (High Energy Physics - Theory)

Description coming soon

Mathematical Physics
(math-ph)

math-ph (Mathematical Physics)

Description coming soon

Nonlinear Sciences
(nlin)

nlin.AO (Adaptation and Self-Organizing Systems)

adaptation, self-organizing systems, statistical physics, fluctuating systems, stochastic processes, interacting particle systems, machine learning

nlin.CD (Chaotic Dynamics)

dynamical systems, chaos, quantum chaos, topological dynamics, cycle expansions, turbulence, propagation

nlin.CG (Cellular Automata and Lattice Gases)

computational methods, time series analysis, signal processing, wavelets, lattice gases

nlin.PS (Pattern Formation and Solitons)

pattern formation, coherent structures, solitons

nlin.SI (Exactly Solvable and Integrable Systems)

exactly solvable systems, integrable PDEs, integrable ODEs, Painlevé analysis, integrable discrete maps, solvable lattice models, integrable quantum systems

Nuclear Experiment
(nucl-ex)

nucl-ex (Nuclear Experiment)

Description coming soon

Nuclear Theory
(nucl-th)

nucl-th (Nuclear Theory)

Description coming soon

Physics
(physics)

physics.acc-ph (Accelerator Physics)

Description coming soon

physics.ao-ph (Atmospheric and Oceanic Physics)

Description coming soon

physics.app-ph (Applied Physics)

Description coming soon

physics.atm-clus (Atomic and Molecular Clusters)

Description coming soon

physics.atom-ph (Atomic Physics)

Description coming soon

physics.bio-ph (Biological Physics)

Description coming soon

physics.chem-ph (Chemical Physics)

Description coming soon

physics.class-ph (Classical Physics)

Description coming soon

physics.comp-ph (Computational Physics)

Description coming soon

physics.data-an (Data Analysis, Statistics and Probability)

Description coming soon

physics.ed-ph (Physics Education)

Description coming soon

physics.flu-dyn (Fluid Dynamics)

Description coming soon

physics.gen-ph (General Physics)

Description coming soon

physics.geo-ph (Geophysics)

Description coming soon

physics.hist-ph (History and Philosophy of Physics)

Description coming soon

physics.ins-det (Instrumentation and Detectors)

Description coming soon

physics.med-ph (Medical Physics)

Description coming soon

physics.optics (Optics)

Description coming soon

physics.plasm-ph (Plasma Physics)

Description coming soon

physics.pop-ph (Popular Physics)

Description coming soon

physics.soc-ph (Physics and Society)

Description coming soon

physics.space-ph (Space Physics)

Description coming soon

Quantum Physics
(quant-ph)

quant-ph (Quantum Physics)

Description coming soon

Quantitative Biology

q-bio.BM (Biomolecules)

DNA, RNA, proteins, lipids, etc.; molecular structures and folding kinetics; molecular interactions; single-molecule manipulation.

q-bio.CB (Cell Behavior)

Cell-cell signaling and interaction; morphogenesis and development; apoptosis; bacterial conjugation; viral-host interaction; immunology

q-bio.GN (Genomics)

DNA sequencing and assembly; gene and motif finding; RNA editing and alternative splicing; genomic structure and processes (replication, transcription, methylation, etc); mutational processes.

q-bio.MN (Molecular Networks)

Gene regulation, signal transduction, proteomics, metabolomics, gene and enzymatic networks

q-bio.NC (Neurons and Cognition)

Synapse, cortex, neuronal dynamics, neural network, sensorimotor control, behavior, attention

q-bio.OT (Other Quantitative Biology)

Work in quantitative biology that does not fit into the other q-bio classifications

q-bio.PE (Populations and Evolution)

Population dynamics, spatio-temporal and epidemiological models, dynamic speciation, co-evolution, biodiversity, foodwebs, aging; molecular evolution and phylogeny; directed evolution; origin of life

q-bio.QM (Quantitative Methods)

All experimental, numerical, statistical and mathematical contributions of value to biology

q-bio.SC (Subcellular Processes)

Assembly and control of subcellular structures (channels, organelles, cytoskeletons, capsules, etc.); molecular motors, transport, subcellular localization; mitosis and meiosis

q-bio.TO (Tissues and Organs)

Blood flow in vessels, biomechanics of bones, electrical waves, endocrine system, tumor growth

Quantitative Finance

q-fin.CP (Computational Finance)

Computational methods, including Monte Carlo, PDE, lattice and other numerical methods with applications to financial modeling

q-fin.EC (Economics)

q-fin.EC is an alias for econ.GN. Economics, including micro and macro economics, international economics, theory of the firm, labor economics, and other economic topics outside finance

q-fin.GN (General Finance)

Development of general quantitative methodologies with applications in finance

q-fin.MF (Mathematical Finance)

Mathematical and analytical methods of finance, including stochastic, probabilistic and functional analysis, algebraic, geometric and other methods

q-fin.PM (Portfolio Management)

Security selection and optimization, capital allocation, investment strategies and performance measurement

q-fin.PR (Pricing of Securities)

Valuation and hedging of financial securities, their derivatives, and structured products

q-fin.RM (Risk Management)

Measurement and management of financial risks in trading, banking, insurance, corporate and other applications

q-fin.ST (Statistical Finance)

Statistical, econometric and econophysics analyses with applications to financial markets and economic data

q-fin.TR (Trading and Market Microstructure)

Market microstructure, liquidity, exchange and auction design, automated trading, agent-based modeling and market-making

Statistics

stat.AP (Applications)

Biology, Education, Epidemiology, Engineering, Environmental Sciences, Medical, Physical Sciences, Quality Control, Social Sciences

stat.CO (Computation)

Algorithms, Simulation, Visualization

stat.ME (Methodology)

Design, Surveys, Model Selection, Multiple Testing, Multivariate Methods, Signal and Image Processing, Time Series, Smoothing, Spatial Statistics, Survival Analysis, Nonparametric and Semiparametric Methods

stat.ML (Machine Learning)

Covers machine learning papers (supervised, unsupervised, semi-supervised learning, graphical models, reinforcement learning, bandits, high dimensional inference, etc.) with a statistical or theoretical grounding

stat.OT (Other Statistics)

Work in statistics that does not fit into the other stat classifications

stat.TH (Statistics Theory)

stat.TH is an alias for math.ST. Asymptotics, Bayesian Inference, Decision Theory, Estimation, Foundations, Inference, Testing.
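
Several descriptions in this listing state that one category is an alias for another (math.IT for cs.IT, math.MP for math-ph, q-fin.EC for econ.GN, stat.TH for math.ST, cs.NA for math.NA, cs.SY for eess.SY). When counting papers per category, it may help to fold aliases into a single name so the same subject is not counted twice. The following is only a minimal sketch: the helper names `canonical` and `canonical_categories` are hypothetical, and it assumes the `categories` field is a space-separated string.

```python
# Alias pairs exactly as stated in the category descriptions of this listing.
ALIASES = {
    "math.IT": "cs.IT",
    "math.MP": "math-ph",
    "q-fin.EC": "econ.GN",
    "stat.TH": "math.ST",
    "cs.NA": "math.NA",
    "cs.SY": "eess.SY",
}


def canonical(category):
    """Return the category an alias points to, or the input unchanged."""
    return ALIASES.get(category, category)


def canonical_categories(categories_field):
    """Split a categories string (assumed space-separated) and fold aliases.

    Order of first appearance is kept and duplicates are dropped.
    """
    result = []
    for raw in categories_field.split():
        name = canonical(raw)
        if name not in result:
            result.append(name)
    return result


print(canonical_categories("cs.IT math.IT stat.TH"))
```

Whether folding is appropriate depends on the analysis: keeping the raw labels preserves what the authors actually selected, while folding gives one count per subject.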

tags
[

Computer Science

,

cs.AI (Artificial Intelligence)

,

Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.

,

cs.AR (Hardware Architecture)

,

Covers systems organization and hardware architecture. Roughly includes material in ACM Subject Classes C.0, C.1, and C.5.

,

cs.CC (Computational Complexity)

,

Covers models of computation, complexity classes, structural complexity, complexity tradeoffs, upper and lower bounds. Roughly includes material in ACM Subject Classes F.1 (computation by abstract devices), F.2.3 (tradeoffs among complexity measures), and F.4.3 (formal languages), although some material in formal languages may be more appropriate for Logic in Computer Science. Some material in F.2.1 and F.2.2 may also be appropriate here, but is more likely to have Data Structures and Algorithms as the primary subject area.

,

cs.CE (Computational Engineering, Finance, and Science)

,

Covers applications of computer science to the mathematical modeling of complex systems in the fields of science, engineering, and finance. Papers here are interdisciplinary and applications-oriented, focusing on techniques and tools that enable challenging computational simulations to be performed, for which the use of supercomputers or distributed computing platforms is often required. Includes material in ACM Subject Classes J.2, J.3, and J.4 (economics).

,

cs.CG (Computational Geometry)

,

Roughly includes material in ACM Subject Classes I.3.5 and F.2.2.

,

cs.CL (Computation and Language)

,

Covers natural language processing. Roughly includes material in ACM Subject Class I.2.7. Note that work on artificial languages (programming languages, logics, formal systems) that does not explicitly address natural-language issues broadly construed (natural-language processing, computational linguistics, speech, text retrieval, etc.) is not appropriate for this area.

,

cs.CR (Cryptography and Security)

,

Covers all areas of cryptography and security including authentication, public key cryptosystems, proof-carrying code, etc. Roughly includes material in ACM Subject Classes D.4.6 and E.3.

,

cs.CV (Computer Vision and Pattern Recognition)

,

Covers image processing, computer vision, pattern recognition, and scene understanding. Roughly includes material in ACM Subject Classes I.2.10, I.4, and I.5.

,

cs.CY (Computers and Society)

,

Covers impact of computers on society, computer ethics, information technology and public policy, legal aspects of computing, computers and education. Roughly includes material in ACM Subject Classes K.0, K.2, K.3, K.4, K.5, and K.7.

,

cs.DB (Databases)

,

Covers database management, data mining, and data processing. Roughly includes material in ACM Subject Classes E.2, E.5, H.0, H.2, and J.1.

,

cs.DC (Distributed, Parallel, and Cluster Computing)

,

Covers fault-tolerance, distributed algorithms, stability, parallel computation, and cluster computing. Roughly includes material in ACM Subject Classes C.1.2, C.1.4, C.2.4, D.1.3, D.4.5, D.4.7, E.1.

,

cs.DL (Digital Libraries)

,

Covers all aspects of the digital library design and document and text creation. Note that there will be some overlap with Information Retrieval (which is a separate subject area). Roughly includes material in ACM Subject Classes H.3.5, H.3.6, H.3.7, I.7.

,

cs.DM (Discrete Mathematics)

,

Covers combinatorics, graph theory, applications of probability. Roughly includes material in ACM Subject Classes G.2 and G.3.

,

cs.DS (Data Structures and Algorithms)

,

Covers data structures and analysis of algorithms. Roughly includes material in ACM Subject Classes E.1, E.2, F.2.1, and F.2.2.

,

cs.ET (Emerging Technologies)

,

Covers approaches to information processing (computing, communication, sensing) and bio-chemical analysis based on alternatives to silicon CMOS-based technologies, such as nanoscale electronic, photonic, spin-based, superconducting, mechanical, bio-chemical and quantum technologies (this list is not exclusive). Topics of interest include (1) building blocks for emerging technologies, their scalability and adoption in larger systems, including integration with traditional technologies, (2) modeling, design and optimization of novel devices and systems, (3) models of computation, algorithm design and programming for emerging technologies.

,

cs.FL (Formal Languages and Automata Theory)

,

Covers automata theory, formal language theory, grammars, and combinatorics on words. This roughly corresponds to ACM Subject Classes F.1.1, and F.4.3. Papers dealing with computational complexity should go to cs.CC; papers dealing with logic should go to cs.LO.

,

cs.GL (General Literature)

,

Covers introductory material, survey material, predictions of future trends, biographies, and miscellaneous computer-science related material. Roughly includes all of ACM Subject Class A, except it does not include conference proceedings (which will be listed in the appropriate subject area).

,

cs.GR (Graphics)

,

Covers all aspects of computer graphics. Roughly includes material in all of ACM Subject Class I.3, except that I.3.5 is likely to have Computational Geometry as the primary subject area.

,

cs.GT (Computer Science and Game Theory)

,

Covers all theoretical and applied aspects at the intersection of computer science and game theory, including work in mechanism design, learning in games (which may overlap with Learning), foundations of agent modeling in games (which may overlap with Multiagent systems), coordination, specification and formal methods for non-cooperative computational environments. The area also deals with applications of game theory to areas such as electronic commerce.

,

cs.HC (Human-Computer Interaction)

,

Covers human factors, user interfaces, and collaborative computing. Roughly includes material in ACM Subject Classes H.1.2 and all of H.5, except for H.5.1, which is more likely to have Multimedia as the primary subject area.

,

cs.IR (Information Retrieval)

,

Covers indexing, dictionaries, retrieval, content and analysis. Roughly includes material in ACM Subject Classes H.3.0, H.3.1, H.3.2, H.3.3, and H.3.4.

,

cs.IT (Information Theory)

,

Covers theoretical and experimental aspects of information theory and coding. Includes material in ACM Subject Class E.4 and intersects with H.1.1.

,

cs.LG (Machine Learning)

,

Papers on all aspects of machine learning research (supervised, unsupervised, reinforcement learning, bandit problems, and so on) including also robustness, explanation, fairness, and methodology. cs.LG is also an appropriate primary category for applications of machine learning methods.

,

cs.LO (Logic in Computer Science)

,

Covers all aspects of logic in computer science, including finite model theory, logics of programs, modal logic, and program verification. Programming language semantics should have Programming Languages as the primary subject area. Roughly includes material in ACM Subject Classes D.2.4, F.3.1, F.4.0, F.4.1, and F.4.2; some material in F.4.3 (formal languages) may also be appropriate here, although Computational Complexity is typically the more appropriate subject area.

,

cs.MA (Multiagent Systems)

,

Covers multiagent systems, distributed artificial intelligence, intelligent agents, coordinated interactions, and practical applications. Roughly covers ACM Subject Class I.2.11.

,

cs.MM (Multimedia)

,

Roughly includes material in ACM Subject Class H.5.1.

,

cs.MS (Mathematical Software)

,

Roughly includes material in ACM Subject Class G.4.

,

cs.NA (Numerical Analysis)

,

cs.NA is an alias for math.NA. Roughly includes material in ACM Subject Class G.1.

,

cs.NE (Neural and Evolutionary Computing)

,

Covers neural networks, connectionism, genetic algorithms, artificial life, adaptive behavior. Roughly includes some material in ACM Subject Class C.1.3, I.2.6, I.5.

,

cs.NI (Networking and Internet Architecture)

,

Covers all aspects of computer communication networks, including network architecture and design, network protocols, and internetwork standards (like TCP/IP). Also includes topics, such as web caching, that are directly relevant to Internet architecture and performance. Roughly includes all of ACM Subject Class C.2 except C.2.4, which is more likely to have Distributed, Parallel, and Cluster Computing as the primary subject area.

,

cs.OH (Other Computer Science)

,

This is the classification to use for documents that do not fit anywhere else.

,

cs.OS (Operating Systems)

,

Roughly includes material in ACM Subject Classes D.4.1, D.4.2, D.4.3, D.4.4, D.4.5, D.4.7, and D.4.9.

,

cs.PF (Performance)

,

Covers performance measurement and evaluation, queueing, and simulation. Roughly includes material in ACM Subject Classes D.4.8 and K.6.2.

,

cs.PL (Programming Languages)

,

Covers programming language semantics, language features, programming approaches (such as object-oriented programming, functional programming, logic programming). Also includes material on compilers oriented towards programming languages; other material on compilers may be more appropriate in Architecture (AR). Roughly includes material in ACM Subject Classes D.1 and D.3.

,

cs.RO (Robotics)

,

Roughly includes material in ACM Subject Class I.2.9.

,

cs.SC (Symbolic Computation)

,

Roughly includes material in ACM Subject Class I.1.

,

cs.SD (Sound)

,

Covers all aspects of computing with sound, and sound as an information channel. Includes models of sound, analysis and synthesis, audio user interfaces, sonification of data, computer music, and sound signal processing. Includes ACM Subject Class H.5.5, and intersects with H.1.2, H.5.1, H.5.2, I.2.7, I.5.4, I.6.3, J.5, K.4.2.

,

cs.SE (Software Engineering)

,

Covers design tools, software metrics, testing and debugging, programming environments, etc. Roughly includes material in all of ACM Subject Class D.2, except that D.2.4 (program verification) should probably have Logic in Computer Science as the primary subject area.

,

cs.SI (Social and Information Networks)

,

Covers the design, analysis, and modeling of social and information networks, including their applications for on-line information access, communication, and interaction, and their roles as datasets in the exploration of questions in these and other domains, including connections to the social and biological sciences. Analysis and modeling of such networks includes topics in ACM Subject classes F.2, G.2, G.3, H.2, and I.2; applications in computing include topics in H.3, H.4, and H.5; and applications at the interface of computing and other disciplines include topics in J.1--J.7. Papers on computer communication systems and network protocols (e.g. TCP/IP) are generally a closer fit to the Networking and Internet Architecture (cs.NI) category.

,

cs.SY (Systems and Control)

,

cs.SY is an alias for eess.SY. This section includes theoretical and experimental research covering all facets of automatic control systems. The section is focused on methods of control system analysis and design using tools of modeling, simulation and optimization. Specific areas of research include nonlinear, distributed, adaptive, stochastic and robust control in addition to hybrid and discrete event systems. Application areas include automotive and aerospace control systems, network control, biological systems, multiagent and cooperative control, robotics, reinforcement learning, sensor networks, control of cyber-physical and energy-related systems, and control of computing systems.

,

Economics

,

econ.EM (Econometrics)

,

Econometric Theory, Micro-Econometrics, Macro-Econometrics, Empirical Content of Economic Relations discovered via New Methods, Methodological Aspects of the Application of Statistical Inference to Economic Data.

,

econ.GN (General Economics)

,

General methodological, applied, and empirical contributions to economics.

,

econ.TH (Theoretical Economics)

,

Includes theoretical contributions to Contract Theory, Decision Theory, Game Theory, General Equilibrium, Growth, Learning and Evolution, Macroeconomics, Market and Mechanism Design, and Social Choice.

,

Electrical Engineering and Systems Science

,

eess.AS (Audio and Speech Processing)

,

Theory and methods for processing signals representing audio, speech, and language, and their applications. This includes analysis, synthesis, enhancement, transformation, classification and interpretation of such signals as well as the design, development, and evaluation of associated signal processing systems. Machine learning and pattern analysis applied to any of the above areas is also welcome. Specific topics of interest include: auditory modeling and hearing aids; acoustic beamforming and source localization; classification of acoustic scenes; speaker separation; active noise control and echo cancellation; enhancement; de-reverberation; bioacoustics; music signals analysis, synthesis and modification; music information retrieval; audio for multimedia and joint audio-video processing; spoken and written language modeling, segmentation, tagging, parsing, understanding, and translation; text mining; speech production, perception, and psychoacoustics; speech analysis, synthesis, and perceptual modeling and coding; robust speech recognition; speaker recognition and characterization; deep learning, online learning, and graphical models applied to speech, audio, and language signals; and implementation aspects ranging from system architecture to fast algorithms.

,

eess.IV (Image and Video Processing)

,

Theory, algorithms, and architectures for the formation, capture, processing, communication, analysis, and display of images, video, and multidimensional signals in a wide variety of applications. Topics of interest include: mathematical, statistical, and perceptual image and video modeling and representation; linear and nonlinear filtering, de-blurring, enhancement, restoration, and reconstruction from degraded, low-resolution or tomographic data; lossless and lossy compression and coding; segmentation, alignment, and recognition; image rendering, visualization, and printing; computational imaging, including ultrasound, tomographic and magnetic resonance imaging; and image and video analysis, synthesis, storage, search and retrieval.

,

eess.SP (Signal Processing)

,

Theory, algorithms, performance analysis and applications of signal and data analysis, including physical modeling, processing, detection and parameter estimation, learning, mining, retrieval, and information extraction. The term "signal" includes speech, audio, sonar, radar, geophysical, physiological, (bio-) medical, image, video, and multimodal natural and man-made signals, including communication signals and data. Topics of interest include: statistical signal processing, spectral estimation and system identification; filter design, adaptive filtering / stochastic learning; (compressive) sampling, sensing, and transform-domain methods including fast algorithms; signal processing for machine learning and machine learning for signal processing applications; in-network and graph signal processing; convex and nonconvex optimization methods for signal processing applications; radar, sonar, and sensor array beamforming and direction finding; communications signal processing; low power, multi-core and system-on-chip signal processing; sensing, communication, analysis and optimization for cyber-physical systems such as power grids and the Internet of Things.

,

eess.SY (Systems and Control)

,

This section includes theoretical and experimental research covering all facets of automatic control systems. The section is focused on methods of control system analysis and design using tools of modeling, simulation and optimization. Specific areas of research include nonlinear, distributed, adaptive, stochastic and robust control in addition to hybrid and discrete event systems. Application areas include automotive and aerospace control systems, network control, biological systems, multiagent and cooperative control, robotics, reinforcement learning, sensor networks, control of cyber-physical and energy-related systems, and control of computing systems.

,

Mathematics

,

math.AC (Commutative Algebra)

,

Commutative rings, modules, ideals, homological algebra, computational aspects, invariant theory, connections to algebraic geometry and combinatorics

,

math.AG (Algebraic Geometry)

,

Algebraic varieties, stacks, sheaves, schemes, moduli spaces, complex geometry, quantum cohomology

,

math.AP (Analysis of PDEs)

,

Existence and uniqueness, boundary conditions, linear and non-linear operators, stability, soliton theory, integrable PDE's, conservation laws, qualitative dynamics

,

math.AT (Algebraic Topology)

,

Homotopy theory, homological algebra, algebraic treatments of manifolds

,

math.CA (Classical Analysis and ODEs)

,

Special functions, orthogonal polynomials, harmonic analysis, ODE's, differential relations, calculus of variations, approximations, expansions, asymptotics

,

math.CO (Combinatorics)

,

Discrete mathematics, graph theory, enumeration, combinatorial optimization, Ramsey theory, combinatorial game theory

,

math.CT (Category Theory)

,

Enriched categories, topoi, abelian categories, monoidal categories, homological algebra

,

math.CV (Complex Variables)

,

Holomorphic functions, automorphic group actions and forms, pseudoconvexity, complex geometry, analytic spaces, analytic sheaves

,

math.DG (Differential Geometry)

,

Complex, contact, Riemannian, pseudo-Riemannian and Finsler geometry, relativity, gauge theory, global analysis

,

math.DS (Dynamical Systems)

,

Dynamics of differential equations and flows, mechanics, classical few-body problems, iterations, complex dynamics, delayed differential equations

,

math.FA (Functional Analysis)

,

Banach spaces, function spaces, real functions, integral transforms, theory of distributions, measure theory

,

math.GM (General Mathematics)

,

Mathematical material of general interest, topics not covered elsewhere

,

math.GN (General Topology)

,

Continuum theory, point-set topology, spaces with algebraic structure, foundations, dimension theory, local and global properties

,

math.GR (Group Theory)

,

Finite groups, topological groups, representation theory, cohomology, classification and structure

,

math.GT (Geometric Topology)

,

Manifolds, orbifolds, polyhedra, cell complexes, foliations, geometric structures

,

math.HO (History and Overview)

,

Biographies, philosophy of mathematics, mathematics education, recreational mathematics, communication of mathematics, ethics in mathematics

,

math.IT (Information Theory)

,

math.IT is an alias for cs.IT. Covers theoretical and experimental aspects of information theory and coding.

,

math.KT (K-Theory and Homology)

,

Algebraic and topological K-theory, relations with topology, commutative algebra, and operator algebras

,

math.LO (Logic)

,

Logic, set theory, point-set topology, formal mathematics

,

math.MG (Metric Geometry)

,

Euclidean, hyperbolic, discrete, convex, coarse geometry, comparisons in Riemannian geometry, symmetric spaces

,

math.MP (Mathematical Physics)

,

math.MP is an alias for math-ph. Mathematical methods in quantum field theory, quantum mechanics, statistical mechanics, condensed matter, nuclear and atomic physics.

,

math.NA (Numerical Analysis)

,

Numerical algorithms for problems in analysis and algebra, scientific computation

,

math.NT (Number Theory)

,

Prime numbers, diophantine equations, analytic number theory, algebraic number theory, arithmetic geometry, Galois theory

,

math.OA (Operator Algebras)

,

Algebras of operators on Hilbert space, C^*-algebras, von Neumann algebras, non-commutative geometry

,

math.OC (Optimization and Control)

,

Operations research, linear programming, control theory, systems theory, optimal control, game theory

,

math.PR (Probability)

,

Theory and applications of probability and stochastic processes: e.g. central limit theorems, large deviations, stochastic differential equations, models from statistical mechanics, queuing theory

,

math.QA (Quantum Algebra)

,

Quantum groups, skein theories, operadic and diagrammatic algebra, quantum field theory

,

math.RA (Rings and Algebras)

,

Non-commutative rings and algebras, non-associative algebras, universal algebra and lattice theory, linear algebra, semigroups

,

math.RT (Representation Theory)

,

Linear representations of algebras and groups, Lie theory, associative algebras, multilinear algebra

,

math.SG (Symplectic Geometry)

,

Hamiltonian systems, symplectic flows, classical integrable systems

,

math.SP (Spectral Theory)

,

Schrodinger operators, operators on manifolds, general differential operators, numerical studies, integral operators, discrete models, resonances, non-self-adjoint operators, random operators/matrices

,

math.ST (Statistics Theory)

,

Applied, computational and theoretical statistics: e.g. statistical inference, regression, time series, multivariate analysis, data analysis, Markov chain Monte Carlo, design of experiments, case studies

,

Physics

,

Astrophysics
(astro-ph)

,

astro-ph.CO (Cosmology and Nongalactic Astrophysics)

,

Phenomenology of early universe, cosmic microwave background, cosmological parameters, primordial element abundances, extragalactic distance scale, large-scale structure of the universe. Groups, superclusters, voids, intergalactic medium. Particle astrophysics: dark energy, dark matter, baryogenesis, leptogenesis, inflationary models, reheating, monopoles, WIMPs, cosmic strings, primordial black holes, cosmological gravitational radiation

,

astro-ph.EP (Earth and Planetary Astrophysics)

,

Interplanetary medium, planetary physics, planetary astrobiology, extrasolar planets, comets, asteroids, meteorites. Structure and formation of the solar system

,

astro-ph.GA (Astrophysics of Galaxies)

,

Phenomena pertaining to galaxies or the Milky Way. Star clusters, HII regions and planetary nebulae, the interstellar medium, atomic and molecular clouds, dust. Stellar populations. Galactic structure, formation, dynamics. Galactic nuclei, bulges, disks, halo. Active Galactic Nuclei, supermassive black holes, quasars. Gravitational lens systems. The Milky Way and its contents

,

astro-ph.HE (High Energy Astrophysical Phenomena)

,

Cosmic ray production, acceleration, propagation, detection. Gamma ray astronomy and bursts, X-rays, charged particles, supernovae and other explosive phenomena, stellar remnants and accretion systems, jets, microquasars, neutron stars, pulsars, black holes

,

astro-ph.IM (Instrumentation and Methods for Astrophysics)

,

Detector and telescope design, experiment proposals. Laboratory Astrophysics. Methods for data analysis, statistical methods. Software, database design

,

astro-ph.SR (Solar and Stellar Astrophysics)

,

White dwarfs, brown dwarfs, cataclysmic variables. Star formation and protostellar systems, stellar astrobiology, binary and multiple systems of stars, stellar evolution and structure, coronas. Central stars of planetary nebulae. Helioseismology, solar neutrinos, production and detection of gravitational radiation from stellar systems

,

Condensed Matter
(cond-mat)

,

cond-mat.dis-nn (Disordered Systems and Neural Networks)

,

Description coming soon

,

cond-mat.mes-hall (Mesoscale and Nanoscale Physics)

,

Semiconducting nanostructures: quantum dots, wires, and wells. Single electronics, spintronics, 2d electron gases, quantum Hall effect, nanotubes, graphene, plasmonic nanostructures

,

cond-mat.mtrl-sci (Materials Science)

,

Techniques, synthesis, characterization, structure. Structural phase transitions, mechanical properties, phonons. Defects, adsorbates, interfaces

,

cond-mat.other (Other Condensed Matter)

,

Work in condensed matter that does not fit into the other cond-mat classifications

,

cond-mat.quant-gas (Quantum Gases)

,

Ultracold atomic and molecular gases, Bose-Einstein condensation, Feshbach resonances, spinor condensates, optical lattices, quantum simulation with cold atoms and molecules, macroscopic interference phenomena

,

cond-mat.soft (Soft Condensed Matter)

,

Membranes, polymers, liquid crystals, glasses, colloids, granular matter

,

cond-mat.stat-mech (Statistical Mechanics)

,

Phase transitions, thermodynamics, field theory, non-equilibrium phenomena, renormalization group and scaling, integrable models, turbulence

,

cond-mat.str-el (Strongly Correlated Electrons)

,

Quantum magnetism, non-Fermi liquids, spin liquids, quantum criticality, charge density waves, metal-insulator transitions

,

cond-mat.supr-con (Superconductivity)

,

Superconductivity: theory, models, experiment. Superflow in helium

,

General Relativity and Quantum Cosmology
(gr-qc)

,

gr-qc (General Relativity and Quantum Cosmology)

,

Description coming soon

,

High Energy Physics - Experiment
(hep-ex)

,

hep-ex (High Energy Physics - Experiment)

,

Description coming soon

,

High Energy Physics - Lattice
(hep-lat)

,

hep-lat (High Energy Physics - Lattice)

,

Description coming soon

,

High Energy Physics - Phenomenology
(hep-ph)

,

hep-ph (High Energy Physics - Phenomenology)

,

Description coming soon

,

High Energy Physics - Theory
(hep-th)

,

hep-th (High Energy Physics - Theory)

,

Description coming soon

,

Mathematical Physics
(math-ph)

,

math-ph (Mathematical Physics)

,

Description coming soon

,

Nonlinear Sciences
(nlin)

,

nlin.AO (Adaptation and Self-Organizing Systems)

,

adaptation, self-organizing systems, statistical physics, fluctuating systems, stochastic processes, interacting particle systems, machine learning

,

nlin.CD (Chaotic Dynamics)

,

dynamical systems, chaos, quantum chaos, topological dynamics, cycle expansions, turbulence, propagation

,

nlin.CG (Cellular Automata and Lattice Gases)

,

computational methods, time series analysis, signal processing, wavelets, lattice gases

,

nlin.PS (Pattern Formation and Solitons)

,

pattern formation, coherent structures, solitons

,

nlin.SI (Exactly Solvable and Integrable Systems)

,

exactly solvable systems, integrable PDEs, integrable ODEs, Painleve analysis, integrable discrete maps, solvable lattice models, integrable quantum systems

,

Nuclear Experiment
(nucl-ex)

,

nucl-ex (Nuclear Experiment)

,

Description coming soon

,

Nuclear Theory
(nucl-th)

,

nucl-th (Nuclear Theory)

,

Description coming soon

,

Physics
(physics)

,

physics.acc-ph (Accelerator Physics)

,

Description coming soon

,

physics.ao-ph (Atmospheric and Oceanic Physics)

,

Description coming soon

,

physics.app-ph (Applied Physics)

,

Description coming soon

,

physics.atm-clus (Atomic and Molecular Clusters)

,

Description coming soon

,

physics.atom-ph (Atomic Physics)

,

Description coming soon

,

physics.bio-ph (Biological Physics)

,

Description coming soon

,

physics.chem-ph (Chemical Physics)

,

Description coming soon

,

physics.class-ph (Classical Physics)

,

Description coming soon

,

physics.comp-ph (Computational Physics)

,

Description coming soon

,

physics.data-an (Data Analysis, Statistics and Probability)

,

Description coming soon

,

physics.ed-ph (Physics Education)

,

Description coming soon

,

physics.flu-dyn (Fluid Dynamics)

,

Description coming soon

,

physics.gen-ph (General Physics)

,

Description coming soon

,

physics.geo-ph (Geophysics)

,

Description coming soon

,

physics.hist-ph (History and Philosophy of Physics)

,

Description coming soon

,

physics.ins-det (Instrumentation and Detectors)

,

Description coming soon

,

physics.med-ph (Medical Physics)

,

Description coming soon

,

physics.optics (Optics)

,

Description coming soon

,

physics.plasm-ph (Plasma Physics)

,

Description coming soon

,

physics.pop-ph (Popular Physics)

,

Description coming soon

,

physics.soc-ph (Physics and Society)

,

Description coming soon

,

physics.space-ph (Space Physics)

,

Description coming soon

,

Quantum Physics
(quant-ph)

,

quant-ph (Quantum Physics)

,

Description coming soon

,

Quantitative Biology

,

q-bio.BM (Biomolecules)

,

DNA, RNA, proteins, lipids, etc.; molecular structures and folding kinetics; molecular interactions; single-molecule manipulation.

,

q-bio.CB (Cell Behavior)

,

Cell-cell signaling and interaction; morphogenesis and development; apoptosis; bacterial conjugation; viral-host interaction; immunology

,

q-bio.GN (Genomics)

,

DNA sequencing and assembly; gene and motif finding; RNA editing and alternative splicing; genomic structure and processes (replication, transcription, methylation, etc); mutational processes.

,

q-bio.MN (Molecular Networks)

,

Gene regulation, signal transduction, proteomics, metabolomics, gene and enzymatic networks

,

q-bio.NC (Neurons and Cognition)

,

Synapse, cortex, neuronal dynamics, neural network, sensorimotor control, behavior, attention

,

q-bio.OT (Other Quantitative Biology)

,

Work in quantitative biology that does not fit into the other q-bio classifications

,

q-bio.PE (Populations and Evolution)

,

Population dynamics, spatio-temporal and epidemiological models, dynamic speciation, co-evolution, biodiversity, foodwebs, aging; molecular evolution and phylogeny; directed evolution; origin of life

,

q-bio.QM (Quantitative Methods)

,

All experimental, numerical, statistical and mathematical contributions of value to biology

,

q-bio.SC (Subcellular Processes)

,

Assembly and control of subcellular structures (channels, organelles, cytoskeletons, capsules, etc.); molecular motors, transport, subcellular localization; mitosis and meiosis

,

q-bio.TO (Tissues and Organs)

,

Blood flow in vessels, biomechanics of bones, electrical waves, endocrine system, tumor growth

,

Quantitative Finance

,

q-fin.CP (Computational Finance)

,

Computational methods, including Monte Carlo, PDE, lattice and other numerical methods with applications to financial modeling

,

q-fin.EC (Economics)

,

q-fin.EC is an alias for econ.GN. Economics, including micro and macro economics, international economics, theory of the firm, labor economics, and other economic topics outside finance

,

q-fin.GN (General Finance)

,

Development of general quantitative methodologies with applications in finance

,

q-fin.MF (Mathematical Finance)

,

Mathematical and analytical methods of finance, including stochastic, probabilistic and functional analysis, algebraic, geometric and other methods

,

q-fin.PM (Portfolio Management)

,

Security selection and optimization, capital allocation, investment strategies and performance measurement

,

q-fin.PR (Pricing of Securities)

,

Valuation and hedging of financial securities, their derivatives, and structured products

,

q-fin.RM (Risk Management)

,

Measurement and management of financial risks in trading, banking, insurance, corporate and other applications

,

q-fin.ST (Statistical Finance)

,

Statistical, econometric and econophysics analyses with applications to financial markets and economic data

,

q-fin.TR (Trading and Market Microstructure)

,

Market microstructure, liquidity, exchange and auction design, automated trading, agent-based modeling and market-making

,

Statistics

,

stat.AP (Applications)

,

Biology, Education, Epidemiology, Engineering, Environmental Sciences, Medical, Physical Sciences, Quality Control, Social Sciences

,

stat.CO (Computation)

,

Algorithms, Simulation, Visualization

,

stat.ME (Methodology)

,

Design, Surveys, Model Selection, Multiple Testing, Multivariate Methods, Signal and Image Processing, Time Series, Smoothing, Spatial Statistics, Survival Analysis, Nonparametric and Semiparametric Methods

,

stat.ML (Machine Learning)

,

Covers machine learning papers (supervised, unsupervised, semi-supervised learning, graphical models, reinforcement learning, bandits, high dimensional inference, etc.) with a statistical or theoretical grounding

,

stat.OT (Other Statistics)

,

Work in statistics that does not fit into the other stat classifications

,

stat.TH (Statistics Theory)

,

stat.TH is an alias for math.ST. Asymptotics, Bayesian Inference, Decision Theory, Estimation, Foundations, Inference, Testing.

]
# Initialize the str and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []

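Regex operation: re.sub(pattern, repl, string, count=0, flags=0)

The main thing to explain here is the regex operation used in the code below: we use re.sub to replace the matches of a pattern in a string. str.replace can only replace a fixed literal substring (for example, turning 333 into 444), whereas re.sub can replace any text that matches a pattern (for example, turning any run of digits such as 123 into 444).

  • pattern: the regex pattern string.
  • repl: the replacement string; it can also be a function.
  • string: the original string to search and replace in.
  • count: the maximum number of replacements; the default 0 replaces every match.
  • flags: the matching flags used when the pattern is compiled, given as integer constants (for example re.I).
  • pattern, repl, and string are required arguments.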
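A minimal sketch of the difference between str.replace and re.sub, and of the count and callable repl options listed above. The sample string and the helper name double_number are made up for illustration.

```python
import re

text = "id 123, id 333, id 7"

# str.replace only swaps one fixed literal substring.
literal = text.replace("333", "444")

# re.sub replaces every substring that matches the pattern (here: any run of digits).
every_number = re.sub(r"\d+", "444", text)

# count=1 stops after the first match.
first_only = re.sub(r"\d+", "444", text, count=1)


def double_number(match):
    """Hypothetical helper: repl may be a function that receives the match object."""
    return str(int(match.group(0)) * 2)


doubled = re.sub(r"\d+", double_number, text)

print(literal)
print(every_number)
print(first_only)
print(doubled)
```

When repl is a function, it is called once per match and its return value is used as the replacement text.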
# Process the tags
# (.*) captures everything before the parentheses; \((.*)\) captures the text inside them
for t in tags:
    if t.name == "h2":
        level_1_name = t.text
        level_2_code = t.text
        level_2_name = t.text
        # level_1_name, level_2_code and level_2_name all take the text of the h2 heading

    elif t.name == "h3":
        raw = t.text
        level_2_code = re.sub(r"(.*)\((.*)\)",r"\2",raw)  # pattern: (.*)\((.*)\); replacement: \2; input string: raw
        level_2_name = re.sub(r"(.*)\((.*)\)",r"\1",raw)  # in plain terms, raw is replaced by the text before the parentheses
        # level_2_code is the text inside the parentheses of the h3 heading
        # level_2_name is the text before the parentheses

    elif t.name == "h4":
        raw = t.text
        level_3_code = re.sub(r"(.*) \((.*)\)",r"\1",raw)
        level_3_name = re.sub(r"(.*) \((.*)\)",r"\2",raw)
        # level_3_code is the text before the parentheses of the h4 heading
        # level_3_name is the text inside the parentheses

    elif t.name == "p":
        notes = t.text
        level_1_names.append(level_1_name)
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)
        # Each description appends one entry to every list, so the columns stay aligned

# Build a DataFrame from the lists above, one column per list; the regex step exists to produce these columns
df_taxonomy = pd.DataFrame({
    'group_name' : level_1_names,
    'archive_name' : level_2_names,
    'archive_id' : level_2_codes,
    'category_name' : level_3_names,
    'categories' : level_3_codes,
    'category_description': level_3_notes
})
# groupby returns a new object and does not sort in place; the result is not assigned here, so df_taxonomy is displayed in its original order
df_taxonomy.groupby(["group_name","archive_name"])
df_taxonomy
group_name archive_name archive_id category_name categories category_description
0 Computer Science Computer Science Computer Science Artificial Intelligence cs.AI Covers all areas of AI except Vision, Robotics...
1 Computer Science Computer Science Computer Science Hardware Architecture cs.AR Covers systems organization and hardware archi...
2 Computer Science Computer Science Computer Science Computational Complexity cs.CC Covers models of computation, complexity class...
3 Computer Science Computer Science Computer Science Computational Engineering, Finance, and Science cs.CE Covers applications of computer science to the...
4 Computer Science Computer Science Computer Science Computational Geometry cs.CG Roughly includes material in ACM Subject Class...
... ... ... ... ... ... ...
150 Statistics Statistics Statistics Computation stat.CO Algorithms, Simulation, Visualization
151 Statistics Statistics Statistics Methodology stat.ME Design, Surveys, Model Selection, Multiple Tes...
152 Statistics Statistics Statistics Machine Learning stat.ML Covers machine learning papers (supervised, un...
153 Statistics Statistics Statistics Other Statistics stat.OT Work in statistics that does not fit into the ...
154 Statistics Statistics Statistics Statistics Theory stat.TH stat.TH is an alias for math.ST. Asymptotics, ...

155 rows × 6 columns
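The loop above carries the most recent heading values forward and emits one row each time it reaches a description paragraph. A minimal sketch of that pattern, assuming plain (tag name, text) tuples as a stand-in for the BeautifulSoup tags; the sample rows are illustrative and not the full scraped page.

```python
import re

# Stand-in for the scraped tags: (tag name, text) pairs in page order.
sample_tags = [
    ("h2", "Computer Science"),
    ("h4", "cs.AI (Artificial Intelligence)"),
    ("p", "Covers all areas of AI except Vision, Robotics..."),
    ("h2", "Physics"),
    ("h3", "Astrophysics(astro-ph)"),
    ("h4", "astro-ph.CO (Cosmology and Nongalactic Astrophysics)"),
    ("p", "Phenomenology of early universe..."),
]

rows = []
group = archive_name = archive_id = code = name = ""
for tag_name, text in sample_tags:
    if tag_name == "h2":
        # A new top-level group resets the archive fields to the group name.
        group = archive_name = archive_id = text
    elif tag_name == "h3":
        archive_name = re.sub(r"(.*)\((.*)\)", r"\1", text)
        archive_id = re.sub(r"(.*)\((.*)\)", r"\2", text)
    elif tag_name == "h4":
        code = re.sub(r"(.*) \((.*)\)", r"\1", text)
        name = re.sub(r"(.*) \((.*)\)", r"\2", text)
    elif tag_name == "p":
        # A description closes one category, so emit a complete row here.
        rows.append((group, archive_name, archive_id, name, code, text))

for row in rows:
    print(row)
```

Groups without an h3 heading, such as Computer Science, keep the group name in both archive columns, which is why the first rows of the table above repeat "Computer Science" three times.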

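Here is an example of a regex operation:

import re
phone = "2004-959-559 # This is a phone number"
# Remove the trailing comment
num = re.sub(r'#.*$', "", phone)
print("Phone number : ", num)

# Remove everything that is not a digit
num = re.sub(r'\D', "", phone)  # \D matches any non-digit character
print("Phone number : ", num)
Phone number :  2004-959-559 
Phone number :  2004959559

For more details, see: https://www.runoob.com/python3/python3-reg-expressions.html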


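For our code:

raw = "Astrophysics(astro-ph)"
re.sub(r"(.*)\((.*)\)",r"\2",raw)
#output = astro-ph

# . matches any single character (except a newline)
# * matches the preceding element zero or more times
# \( matches a literal "("
# In this statement, (.*) captures everything before the parentheses and \((.*)\) captures the text inside them
# For example:
# the original str is: Astrophysics(astro-ph)
# after re.sub(r"(.*)\((.*)\)",r"\2",raw) the str is astro-ph
# after re.sub(r"(.*)\((.*)\)",r"\1",raw) the str is Astrophysics
'astro-ph'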

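The corresponding arguments:

The pattern string has the form "any characters" + "(" + "any characters" + ")".

The replacement string repl is the content of the second group.

The string to search and replace in is the raw scraped text.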
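As an alternative sketch (not the approach used in the tutorial code), a single explicit match can return both groups at once instead of calling re.sub twice. The helper name split_name_and_code is hypothetical.

```python
import re

PATTERN = re.compile(r"(.*)\((.*)\)")


def split_name_and_code(raw):
    """Hypothetical helper: return (text before parentheses, text inside them)."""
    match = PATTERN.fullmatch(raw)
    if match is None:
        # No parentheses: keep the text as the name and leave the code empty.
        return raw, ""
    return match.group(1), match.group(2)


with_code = split_name_and_code("Astrophysics(astro-ph)")
without_code = split_name_and_code("Computer Science")
print(with_code)
print(without_code)
```

re.sub returns its input unchanged when the pattern does not match, so a failed match passes silently; an explicit match object makes that case visible.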

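An online regex tester is handy here: https://tool.oschina.net/regex/

A regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain kind of substring, to replace the matched substring, or to extract the substrings that satisfy some condition.

For example:

  • runoo+b matches runoob, runooob, runoooooob, and so on; + means the preceding character must appear at least once (one or more times).

  • runoo*b matches runob, runoob, runoooooob, and so on; * means the preceding character may be absent or may appear any number of times (zero or more times).

  • colou?r matches color or colour; ? means the preceding character may appear at most once (zero or one time).

  • . matches any single character (except a newline)

  • * matches the preceding character zero or more times

  • \( matches a literal "("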
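The quantifier rules above can be checked directly. This is a small sketch using re.fullmatch, which succeeds only when the whole candidate string matches; the third candidate in the first and last rows is a deliberate non-match added for contrast.

```python
import re

checks = {
    r"runoo+b": ["runoob", "runooob", "runob"],
    r"runoo*b": ["runob", "runoob", "runoooooob"],
    r"colou?r": ["color", "colour", "colouur"],
}

results = {}
for pattern, candidates in checks.items():
    # One True/False per candidate, in the same order as the list.
    results[pattern] = [bool(re.fullmatch(pattern, c)) for c in candidates]
    print(pattern, results[pattern])
```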

Data analysis and visualization

Distribution of paper counts across all top-level groups

_df = data.merge(df_taxonomy, on="categories",how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({
     "id":"count"}).sort_values(by="id",ascending=False).reset_index()
_df
group_name id
0 Physics 28
1 Mathematics 10
2 Computer Science 1

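Code explanation 1

DataFrame merge joins two DataFrames on the columns they share, so when no key is specified the two DataFrames must have a column with the same name.

merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)

Parameters:

  • left and right: the two DataFrames to join
  • how: the join type: inner, left (left outer join), right (right outer join), or outer (full outer join); the default is inner
  • on: the column name(s) to join on. They must exist in both the left and the right DataFrame. If on is not given and no other key argument is given, the intersection of the two DataFrames' column names is used as the join key
  • left_on: the column(s) of the left DataFrame to use as the join key; very useful when the key columns have different names on the two sides but the same meaning
  • right_on: the column(s) of the right DataFrame to use as the join key
  • left_index: use the row index of the left DataFrame as the join key
  • right_index: use the row index of the right DataFrame as the join key
  • sort: sorts the merged result by the join keys; the default is False in current pandas, and leaving it off is usually faster
  • suffixes: a tuple of strings appended to column names that exist in both DataFrames; the default is ('_x', '_y')
  • copy: defaults to True in older pandas versions, which always copies the data; setting it to False can avoid some copies
  • indicator: added in pandas 0.17.0; adds a _merge column showing where each row came from: left_only, right_only, or both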
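A small sketch of how the how and indicator parameters change the result. The two toy tables and their values are illustrative stand-ins, not the real dataset.

```python
import pandas as pd

# Toy stand-ins for the paper table and the taxonomy table.
papers = pd.DataFrame({
    "id": ["0001", "0002", "0003"],
    "categories": ["cs.AI", "math.GR", "not-a-category"],
})
taxonomy = pd.DataFrame({
    "categories": ["cs.AI", "math.GR", "stat.ML"],
    "group_name": ["Computer Science", "Mathematics", "Statistics"],
})

# inner (the default) keeps only keys present in both tables.
inner = papers.merge(taxonomy, on="categories")

# left keeps every paper; unmatched rows get NaN in the taxonomy columns.
# indicator=True adds a _merge column recording where each row came from.
left = papers.merge(taxonomy, on="categories", how="left", indicator=True)

print(inner)
print(left)
```

With how="left", a paper whose categories value has no match in the taxonomy stays in the result with a missing group_name, so it is ignored by a later groupby on that column.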

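Code explanation 2

merge joins the two DataFrames on their shared column "categories";

drop_duplicates removes duplicate rows, here duplicates of the (id, group_name) pair;

groupby groups the rows by group_name;

agg({"id": "count"}) counts the id values in each group;

sort_values(by="id", ascending=False) sorts by the id count in descending order;

reset_index() resets the index.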
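The same chain, split into steps on toy data so that each intermediate result can be inspected. The ids and categories are made up; paper "p2" is listed twice to show what drop_duplicates removes.

```python
import pandas as pd

data = pd.DataFrame({
    "id": ["p1", "p2", "p2", "p3", "p4", "p5", "p6"],
    "categories": ["hep-ph", "math.GR", "math.GR", "hep-ph", "cs.NI", "hep-ph", "math.GR"],
})
df_taxonomy = pd.DataFrame({
    "categories": ["hep-ph", "math.GR", "cs.NI"],
    "group_name": ["Physics", "Mathematics", "Computer Science"],
})

merged = data.merge(df_taxonomy, on="categories", how="left")
deduped = merged.drop_duplicates(["id", "group_name"])  # one row per (paper, group)
counts = (
    deduped.groupby("group_name")
    .agg({"id": "count"})                    # papers per group
    .sort_values(by="id", ascending=False)   # largest group first
    .reset_index()                           # group_name back to a regular column
)
print(counts)
```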

Visualizing the result with a pie chart

fig = plt.figure(figsize=(15,12))
explode = ( 0, 0, 0.1)
plt.pie(_df["id"], labels=_df["group_name"], autopct='%1.2f%%',startangle=160, explode=explode)
plt.tight_layout()
plt.show()


pyplot.pie(x, explode=None, labels=None……)
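The autopct='%1.2f%%' argument formats each wedge's share of the total as a percentage with two decimals. The sketch below reproduces those labels by hand, without drawing anything, for the counts in the reduced-sample table above (28, 10, 1).

```python
# Counts taken from the reduced-sample table above.
counts = {"Physics": 28, "Mathematics": 10, "Computer Science": 1}

total = sum(counts.values())
labels = {}
for group, n in counts.items():
    share = 100.0 * n / total          # percentage of the whole pie
    labels[group] = "%1.2f%%" % share  # same format string as autopct
    print(group, labels[group])
```

The explode tuple in the plotting code has one entry per wedge, so it must have the same length as the data passed to plt.pie (three groups here).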


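Paper counts in each computer science subfield from 2019 onward

We again use merge to join the two DataFrames on their shared column categories and then filter the result with query. After that we count and reshape the data to get the following result: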

group_name="Computer Science"
cats = data.merge(df_taxonomy, on="categories").query("group_name ==@group_name")
cats.groupby(["year","category_name"]).count().reset_index().pivot(index="category_name", columns="year",values="id")  # pivot: reshape the counts into a pivot table
year 2019
category_name
Networking and Internet Architecture 1
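A minimal sketch of the groupby and pivot step on toy data, showing how the long per-(year, category) counts become a table with one row per category and one column per year. The ids, years, and category names are illustrative.

```python
import pandas as pd

# Toy merged table standing in for cats.
cats = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "year": [2019, 2019, 2020, 2020, 2020],
    "category_name": ["Robotics", "Databases", "Robotics", "Robotics", "Databases"],
})

# Long format: one row per (year, category) with the number of papers.
long_counts = cats.groupby(["year", "category_name"]).count().reset_index()

# Wide format: categories as rows, years as columns, counts as cell values.
wide = long_counts.pivot(index="category_name", columns="year", values="id")
print(wide)
```

A category with no papers in some year shows up as NaN in the wide table; fillna(0) turns those cells into zeros if that reads better.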

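The output above comes from a reduced sample of the data. With the full dataset, the conclusions are as follows:

The results show that Computer Vision and Pattern Recognition is the CS subcategory with the most papers, far ahead of the other CS subcategories, and its paper count is still growing year by year. In addition, Computation and Language, Cryptography and Security, and Robotics each have more than 1000 or close to 1000 papers in 2019, which matches our expectations.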

