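[Python] Scraping Chinese university data (name, designations, location, type). Useful for analyses such as checking whether a university is Double First-Class (双一流) or whether it is an undergraduate (本科) or vocational (专科) institution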

Source site: http://college.gaokao.com/schlist/p1

The HTML is parsed with Python's lxml library.
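
As a minimal sketch of how lxml's XPath lookups behave, the snippet below parses a small hand-written HTML fragment. The fragment and the name `sample_html` are assumptions made for illustration; they only approximate the markup of the real listing page.

```python
from lxml import etree

# Hypothetical fragment, loosely modeled on one listing entry (an assumption,
# not a copy of the real page).
sample_html = """
<div id="wrapper">
  <dl>
    <dt><strong><a href="#">北京大学</a></strong></dt>
    <dd><ul>
      <li>高校所在地:北京</li>
      <li>院校特色:<span>211</span><span>985</span></li>
    </ul></dd>
  </dl>
</div>
"""

root = etree.HTML(sample_html)
entry = root.xpath('//*[@id="wrapper"]/dl')[0]

# Relative XPath expressions are evaluated from the <dl> element.
name = entry.xpath('dt/strong/a/text()')[0].strip()
location = entry.xpath('dd/ul/li[1]/text()')[0].strip()
tags = ' '.join(entry.xpath('dd/ul/li[2]/span/text()'))

print(name, location, tags)
```

Each `xpath()` call returns a list, so indexing with `[0]` raises `IndexError` when nothing matches; checking the list first is safer when the markup may vary.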

Source code:

import requests
from lxml import etree
import pandas as pd

# Base URL; the page number is appended to it
url = 'http://college.gaokao.com/schlist/p'

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Scrape the data
data = []
for page in range(1, 37):
    print(f"Scraping page {page}...")
    print(url + str(page))
    res = requests.get(url + str(page), headers=headers, timeout=10)
    res.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
    # GBK is a superset of GB2312; errors='replace' keeps a stray byte from aborting the run
    html = etree.HTML(res.content.decode('gbk', errors='replace'))
    items = html.xpath('//*[@id="wrapper"]/div[4]/div[1]/dl')
    print('items:', len(items))
    for item in items:
        name = item.xpath('dt/strong/a/text()')[0].strip()  # Chinese name
        location = item.xpath('dd/ul/li[1]/text()')[0].strip()  # location (the label prefix is kept)
        # School designations (e.g. 211, 985), joined with single spaces
        lev_str = ' '.join(t.strip() for t in item.xpath('dd/ul/li[2]/span/text()'))
        nature = item.xpath('dd/ul/li[5]/text()')[0][5:]  # school type; [5:] drops the 5-character label
        data.append([name, location, lev_str, nature])

# Save the data locally (column names stay in Chinese to match the CSV shown below)
df = pd.DataFrame(data, columns=['中文名', '所在地', '办学层次', '高校性质'])
df.to_csv('university_data.csv', index=False, encoding='utf-8-sig')
print(f"Data saved: {len(data)} records in total.")
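
The fixed slice `[5:]` assumes the label in front of the school type is always exactly five characters long. A label-agnostic alternative, sketched here with a hypothetical helper `strip_label` (not part of the original script), cuts at the first colon instead. It accepts both the ASCII and the full-width colon, since which one the page uses is an assumption; the same helper would also remove the `高校所在地:` prefix that the location column keeps.

```python
import re

# Everything up to and including the first ASCII or full-width colon.
_LABEL = re.compile(r'^[^::]*[::]\s*')


def strip_label(text):
    """Remove a leading 'label:' prefix; text without a colon is only trimmed."""
    return _LABEL.sub('', text.strip(), count=1)


print(strip_label('高校所在地:北京'))
print(strip_label('高校性质:本科'))
```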




CSV data format:

中文名,所在地,办学层次,高校性质
北京大学,高校所在地:北京,211 985,本科
中国人民大学,高校所在地:北京,211 985,本科
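
To illustrate how this format can be queried, here is a small sketch using only the standard library. The two sample rows above are inlined so it runs without the file; `has_tag` is a hypothetical helper name, not part of the original post.

```python
import csv
import io

# The two sample rows shown above, inlined so the snippet needs no file.
sample = """中文名,所在地,办学层次,高校性质
北京大学,高校所在地:北京,211 985,本科
中国人民大学,高校所在地:北京,211 985,本科
"""

rows = list(csv.DictReader(io.StringIO(sample)))


def has_tag(row, tag):
    """True if `tag` is one of the space-separated designations in the row."""
    return tag in row['办学层次'].split()


names_985 = [row['中文名'] for row in rows if has_tag(row, '985')]
print(names_985)
```

Splitting the column before testing membership avoids false positives from substring matches, such as `'98'` matching inside `'985'`.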

Utility functions:

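import csv


def get_school_level(school):
    """Return the school's designations (e.g. '211 985'), or None if it has none or is not listed."""
    # utf-8-sig matches the encoding the scraper wrote and strips the BOM
    with open('university_data.csv', encoding='utf-8-sig', newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            # The scraper leaves this field empty when there are no designations;
            # '无' ("none") is treated as missing as well
            if row[0] == school and row[2] not in ('', '无'):
                return row[2]
    return None


def get_school_nature(school):
    """Return the school's type (e.g. '本科'), or None if it is not listed."""
    with open('university_data.csv', encoding='utf-8-sig', newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            if row[0] == school:
                return row[3]
    return None


school = '浙江大学'
res1 = get_school_level(school)
res2 = get_school_nature(school)
print(res1)
print(res2)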
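
Both functions reopen and rescan the CSV on every call. When many schools need to be checked, one possible alternative is to read the file once into a dictionary. The sketch below assumes the column names shown in the CSV sample; `load_schools` and `lookup` are hypothetical names, not part of the original utilities.

```python
import csv


def load_schools(path='university_data.csv'):
    """Read the CSV once and index its rows by school name."""
    schools = {}
    with open(path, encoding='utf-8-sig', newline='') as f:
        for row in csv.DictReader(f):
            # Keep the first row seen for a name, as get_school_nature does.
            schools.setdefault(row['中文名'], row)
    return schools


def lookup(schools, name, column):
    """Return the value of `column` for a school, or None if it is unknown."""
    row = schools.get(name)
    return row[column] if row else None
```

With the scraped file in place, `schools = load_schools()` followed by `lookup(schools, '浙江大学', '高校性质')` would return the same value as `get_school_nature('浙江大学')`.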
