使用Python网络爬虫抓取CodeForces题目

文章目录

    • 1. 背景
    • 2. 前期准备
    • 3. 获取网页内容
    • 4. 内容处理
      • 4.1. Limit
      • 4.2. Problem Description
      • 4.3. Input
      • 4.4. Output
      • 4.5. Sample Input & Output
      • 4.6. Note
      • 4.7. Source
    • 5. 输出

1. 背景

最近做题的时候要写一些题解,在把CodeForces的题目复制下来的时候,数学公式的处理比较麻烦,所以我用Pythonurllib.requestBeautifulSoup4库对题目信息进行了爬取,写题解的时候时间节约了很多。

考虑到大家可能也会遇到同样的问题,写一篇笔记分享给大家。

2. 前期准备

安装urllibBeautifulSoup库。

pip3 install urllib
pip3 install beautifulsoup4

3. 获取网页内容

以 CodeForces 1353 B. Two Arrays And Swaps 为例。

# 导入库
import urllib.request
import bs4
from bs4 import BeautifulSoup

# 题目属性
problemSet = "1353"
problemId = "B"

# 题目链接
url = f"https://codeforces.com/problemset/problem/{problemSet}/{problemId}"
# 获取网页内容
html = urllib.request.urlopen(url).read()
# 格式化
soup = BeautifulSoup(html,'lxml')

# 存储
data_dict = {}
# 找到主体内容
mainContent = soup.find_all(name="div", attrs={"class" :"problem-statement"})[0]

4. 内容处理

4.1. Limit

先从比较简单的信息入手,找到题目标题、时间、和内存限制。

# Limit
# 找到题目标题、时间、和内存限制
# Title
data_dict['Title'] = f"CodeForces {problemSet} " + mainContent.find_all(name="div", attrs={"class":"title"})[0].contents[-1]
# Time Limit
data_dict['Time Limit'] = mainContent.find_all(name="div", attrs={"class":"time-limit"})[0].contents[-1]
# Memory Limit
data_dict['Memory Limit'] = mainContent.find_all(name="div", attrs={"class":"memory-limit"})[0].contents[-1]




定义函数,处理主体内容中诡异的空格和公式的三个美元符号$$$

def divTextProcess(div):
    """
    处理
标签中

的文本内容 """ strBuffer = '' # 遍历处理每个

标签 for each in div.find_all("p"): for content in each.contents: # 如果不是第一个,加换行符 if (strBuffer != ''): strBuffer += '\n\n' # 处理 if (type(content) != bs4.element.Tag): # 如果是文本,添加至字符串buffer中 strBuffer += content.replace(" ", " ").replace("$$$", "$") else: # 如果是html元素,如span等,加上粗体 strBuffer += "**" + content.contents[0].replace(" ", " ").replace("$$$", "$") + "**" # 返回结果 return strBuffer

4.2. Problem Description

获取题目描述,由于题目描述的

标签没有idclass属性,这里通过找列表中第10div的方式来获取。

# 处理题目描述
data_dict['Problem Description'] = divTextProcess(mainContent.find_all("div")[10])

4.3. Input

输入描述

div = mainContent.find_all(name="div", attrs={"class":"input-specification"})[0]
data_dict['Input'] = divTextProcess(div)

4.4. Output

输出描述

div = mainContent.find_all(name="div", attrs={"class":"output-specification"})[0]
data_dict['Output'] = divTextProcess(div)

4.5. Sample Input & Output

输入样例,用代码框环境包围。

# Input
div = mainContent.find_all(name="div", attrs={"class":"input"})[0]
data_dict['Sample Input'] = "```cpp" + div.find_all("pre")[0].contents[0] + '```'
# Onput
div = mainContent.find_all(name="div", attrs={"class":"output"})[0]
data_dict['Sample Output'] = "```cpp" + div.find_all("pre")[0].contents[0] + '```'


4.6. Note

样例说明

# 若有样例说明
if(len(mainContent.find_all(name="div", attrs={"class":"note"})) > 0):
    div = mainContent.find_all(name="div", attrs={"class":"note"})[0]
    data_dict['Note'] = divTextProcess(div)
    

4.7. Source

题目链接

data_dict['Source'] = '[' + data_dict['Title'] + ']' + '(' + url + ')'

5. 输出

for each in data_dict.keys():
    print('### ' + each + '\n')
    print(data_dict[each].replace("\n\n**", "**").replace("**\n\n", "**") + '\n')
    

下面是最后的输出结果

### Title

CodeForces 1353 B. Two Arrays And Swaps

### Time Limit

1 second

### Memory Limit

256 megabytes

### Problem Description

You are given two arrays $a$ and $b$ both consisting of $n$ positive (greater than zero) integers. You are also given an integer $k$.

In one move, you can choose two indices $i$ and $j$ ($1 \le i, j \le n$) and swap $a_i$ and $b_j$ (i.e. $a_i$ becomes $b_j$ and vice versa). Note that $i$ and $j$ can be equal or different (in particular, swap $a_2$ with $b_2$ or swap $a_3$ and $b_9$ both are acceptable moves).

Your task is to find the **maximum** possible sum you can obtain in the array $a$ if you can do no more than (i.e. at most) $k$ such moves (swaps).

You have to answer $t$ independent test cases.

### Input

The first line of the input contains one integer $t$ ($1 \le t \le 200$) — the number of test cases. Then $t$ test cases follow.

The first line of the test case contains two integers $n$ and $k$ ($1 \le n \le 30; 0 \le k \le n$) — the number of elements in $a$ and $b$ and the maximum number of moves you can do. The second line of the test case contains $n$ integers $a_1, a_2, \dots, a_n$ ($1 \le a_i \le 30$), where $a_i$ is the $i$-th element of $a$. The third line of the test case contains $n$ integers $b_1, b_2, \dots, b_n$ ($1 \le b_i \le 30$), where $b_i$ is the $i$-th element of $b$.

### Output

For each test case, print the answer — the **maximum** possible sum you can obtain in the array $a$ if you can do no more than (i.e. at most) $k$ swaps.

### Sample Input

// 这里会有```cpp代码环境,在这里为了展示方便去掉了
5
2 1
1 2
3 4
5 5
5 5 6 6 5
1 2 5 4 3
5 3
1 2 3 4 5
10 9 10 10 9
4 0
2 2 4 3
2 4 2 3
4 4
1 2 2 1
4 4 5 4

### Sample Output

6
27
39
11
17

### Note

In the first test case of the example, you can swap $a_1 = 1$ and $b_2 = 4$, so $a=[4, 2]$ and $b=[3, 1]$.

In the second test case of the example, you don't need to swap anything.

In the third test case of the example, you can swap $a_1 = 1$ and $b_1 = 10$, $a_3 = 3$ and $b_3 = 10$ and $a_2 = 2$ and $b_4 = 10$, so $a=[10, 10, 10, 4, 5]$ and $b=[1, 9, 3, 2, 9]$.

In the fourth test case of the example, you cannot swap anything.

In the fifth test case of the example, you can swap arrays $a$ and $b$, so $a=[4, 4, 5, 4]$ and $b=[1, 2, 2, 1]$.

### Source

[CodeForces 1353 B. Two Arrays And Swaps](https://codeforces.com/problemset/problem/1353/B)




















联系邮箱:[email protected]

Github:https://github.com/CurrenWong

欢迎转载/Star/Fork,有问题欢迎通过邮箱交流。

你可能感兴趣的:(算法刷题笔记,#,网络爬虫)