Scraping three years of admission score lines for 2,822 Chinese universities in every province with Python

Data update: the scraped data for 2022, 2021, and 2020 is available at the link below
Link: https://pan.baidu.com/s/1UrYmrE5chYuJ6VeJCLbdzA
Extraction code: ozu5

The national college entrance exam (gaokao) has just ended and candidates are waiting for their scores. I had long wanted to scrape information about universities to help candidates choose a school, so I wrote the code below. It collects the admission score lines, in every province, of 2,822 institutions nationwide, covering both undergraduate universities and higher vocational colleges.

The figure below shows each university's admission scores in Hubei Province over the past three years, sorted by 软科 (ShanghaiRanking) rank:
[Figure 1: three-year admission scores of universities in Hubei Province, sorted by 软科 rank]
Full data download:
Link: https://pan.baidu.com/s/1uohDZQk2SPSjI0htZBJd1g
Extraction code: z1db

[Figure 2]

Blank cells in the score columns mean the school does not admit students in that province.
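To reproduce the ranking-sorted view shown in Figure 1, the per-province CSV can be loaded and sorted on the 软科排名 column. This is only a minimal sketch, assuming pandas is installed; the file path and column names below are the ones the scraper script writes, and 湖北.csv is just the Hubei example:

import pandas as pd

# Load the CSV exported for one province (湖北.csv is produced by the script below)
df = pd.read_csv('D:/PYTHON_CODE/高校分数线/湖北.csv', encoding='utf-8-sig')

# Treat an empty 软科排名 as "unranked" and push those schools to the bottom
df['软科排名'] = pd.to_numeric(df['软科排名'], errors='coerce')
df = df.sort_values('软科排名', na_position='last')

# Drop schools that have no score line in this province in any of the three years
df = df.dropna(subset=['2022分数线', '2021分数线', '2020分数线'], how='all')
print(df.head(20))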

Part of the code is shown below, unoptimized… (the code has since been updated)

from time import sleep
import requests
import json
import csv
import time
import random

def save_data(s, data):
    # Append one row to the CSV for province s (the header is written once in the main loop)
    with open('D:/PYTHON_CODE/高校分数线/' + s + '.csv', encoding='UTF-8', mode='a+', newline='') as f:
        f_csv = csv.writer(f)
        f_csv.writerow(data)
headers_list = [
    {
        'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 8.0.0; SM-G955U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 10; SM-G981B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (iPad; CPU OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/87.0.4280.77 Mobile/15E148 Safari/604.1'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.109 Safari/537.36 CrKey/1.54.248666'
    }, {
        'user-agent': 'Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.188 Safari/537.36 CrKey/1.54.250320'
    }, {
        'user-agent': 'Mozilla/5.0 (BB10; Touch) AppleWebKit/537.10+ (KHTML, like Gecko) Version/10.0.9.2372 Mobile Safari/537.10+'
    }, {
        'user-agent': 'Mozilla/5.0 (PlayBook; U; RIM Tablet OS 2.1.0; en-US) AppleWebKit/536.2+ (KHTML like Gecko) Version/7.2.1.0 Safari/536.2+'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; U; Android 4.3; en-us; SM-N900T Build/JSS15J) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; U; Android 4.1; en-us; GT-N7100 Build/JRO03C) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; U; Android 4.0; en-us; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 7.0; SM-G950U Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 8.0.0; SM-G965U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.111 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 8.1.0; SM-T837A) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.80 Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; U; en-us; KFAPWI Build/JDQ39) AppleWebKit/535.19 (KHTML, like Gecko) Silk/3.13 Safari/535.19 Silk-Accelerated=true'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; U; Android 4.4.2; en-us; LGMS323 Build/KOT49I.MS32310c) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/102.0.0.0 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; Lumia 550) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Mobile Safari/537.36 Edge/14.14263'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 6.0.1; Moto G (4)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 10 Build/MOB31T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 8.0.0; Nexus 5X Build/OPR4.170623.006) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 7.1.1; Nexus 6 Build/N6F26U) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 8.0.0; Nexus 6P Build/OPP3.170518.006) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 7 Build/MOB30X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; NOKIA; Lumia 520)'
    }, {
        'user-agent': 'Mozilla/5.0 (MeeGo; NokiaN9) AppleWebKit/534.13 (KHTML, like Gecko) NokiaBrowser/8.5.0 Mobile Safari/534.13'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 9; Pixel 3 Build/PQ1A.181105.017.A1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 10; Pixel 4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 11; Pixel 3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.181 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (Linux; Android 8.0.0; Pixel 2 XL Build/OPD1.170816.004) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36'
    }, {
        'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1'
    }, {
        'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1'
    }, {
        'user-agent': 'Mozilla/5.0 (iPad; CPU OS 11_0 like Mac OS X) AppleWebKit/604.1.34 (KHTML, like Gecko) Version/11.0 Mobile/15A5341f Safari/604.1'
    }
]

headers = random.choice(headers_list)  # one User-Agent is picked per run and reused for every request

def get_url(url):
    try:
        response = requests.get(url, headers=headers, timeout=1)  # first attempt with a 1-second timeout
    except requests.RequestException:
        for i in range(4):  # retry the request up to 4 times with a longer timeout
            response = requests.get(url, headers=headers, timeout=20)
            if response.status_code == 200:
                break
    html_str = response.text
    return html_str

print("#########"
      " 版权所有:殷宗敏 & 数据接口来源-https://www.gaokao.cn/school/search  & 在此表示感谢!"
      "##########")

url = 'https://static-data.gaokao.cn/www/2.0/school/name.json'
html = requests.get(url).text
unicodestr = json.loads(html)  # parse the JSON string into a dict
dat = unicodestr["data"]
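# The name.json response is assumed (judging only from how it is consumed below) to look roughly like
# {"data": [{"school_id": "...", "name": "...", ...}, ...]}; only school_id and name are used here.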

province_id=[{"name":11,"value":"北京"},{"name":12,"value":"天津"},{"name":13,"value":"河北"},{"name":14,"value":"山西"},{"name":15,"value":"内蒙古"},{"name":21,"value":"辽宁"},{"name":22,"value":"吉林"},{"name":23,"value":"黑龙江"},{"name":31,"value":"上海"},{"name":32,"value":"江苏"},{"name":33,"value":"浙江"},{"name":34,"value":"安徽"},{"name":35,"value":"福建"},{"name":36,"value":"江西"},{"name":37,"value":"山东"},{"name":41,"value":"河南"},{"name":42,"value":"湖北"},{"name":43,"value":"湖南"},{"name":44,"value":"广东"},{"name":45,"value":"广西"},{"name":46,"value":"海南"},{"name":50,"value":"重庆"},{"name":51,"value":"四川"},{"name":52,"value":"贵州"},{"name":53,"value":"云南"},{"name":54,"value":"西藏"},{"name":61,"value":"陕西"},{"name":62,"value":"甘肃"},{"name":63,"value":"青海"},{"name":64,"value":"宁夏"},{"name":65,"value":"新疆"}]
for l in province_id:
    # One CSV per province, written with a UTF-8 BOM so Excel opens it correctly
    header = ['名称', '省', '市', '区', '地址', '介绍', '985', '211', '软科排名', '学校类型', '学校属性', '特色专业', "2022分数线", "2021分数线", "2020分数线"]
    with open('D:/PYTHON_CODE/高校分数线/' + l["value"] + '.csv', encoding='utf-8-sig', mode='w', newline='') as f:
        f_csv = csv.writer(f)
        f_csv.writerow(header)
    for i in dat:
        schoolid = i['school_id']
        schoolname = i['name']

        url1 = 'https://static-data.gaokao.cn/www/2.0/school/' + schoolid + '/info.json'

        print("正在下载"+schoolname)

        html1 = get_url(url1)
        unicodestr1 = json.loads(html1)  # parse the JSON string into a dict
        if len(unicodestr1) != 0:  # skip schools whose info.json is empty
            dat1 = unicodestr1["data"]

            name = dat1["name"]
            content = dat1["content"]
            f985 = dat1["f985"]
            if f985 == "1":
                f985 = "是"
            else:
                f985 = "否"
            f211 = dat1["f211"]
            if f211 == "1":
                f211 = "是"
            else:
                f211 = "否"

            ruanke_rank = dat1["ruanke_rank"]
            if ruanke_rank == '0':  # a rank of 0 is treated as unranked
                ruanke_rank = ''
            type_name = dat1["type_name"]
            school_nature_name = dat1["school_nature_name"]
            province_name = dat1["province_name"]
            city_name = dat1["city_name"]
            town_name = dat1["town_name"]
            address = dat1["address"]
            special = []
            for j in dat1["special"]:  # collect the names of the school's featured majors
                special.append(j["special_name"])
            pro_type_min = dat1["pro_type_min"]


            fen2022 = ''
            fen2021 = ''
            fen2020 = ''

            # pro_type_min maps a province code to a list of per-year entries; each entry's
            # 'type' dict maps a subject-category code to the minimum admission score
            type_labels = {'2073': '物理类', '2074': '历史类', '1': '理科', '2': '文科', '3': '综合类'}
            for k in pro_type_min.keys():
                if int(k) == l["name"]:  # keep only the province currently being exported
                    for m in pro_type_min[k]:
                        s = ' '
                        for j in m['type'].keys():
                            if j in type_labels:
                                s = s + type_labels[j] + ':' + m['type'][j] + ' '
                        if m['year'] == 2022:
                            fen2022 = s
                        elif m['year'] == 2021:
                            fen2021 = s
                        else:
                            fen2020 = s

            row = (name, province_name, city_name, town_name, address, content, f985, f211,
                   ruanke_rank, type_name, school_nature_name, special, fen2022, fen2021, fen2020)
            save_data(l["value"], row)
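The script above picks one User-Agent for the whole run and fires requests back to back. If the endpoint starts throttling, a common tweak is to rotate the User-Agent on every request and pause briefly between attempts. The variant below is only a sketch under that assumption, not part of the original script; it reuses the sleep import from the top of the file:

def get_url(url):
    try:
        # pick a fresh User-Agent for every request instead of once per run
        response = requests.get(url, headers=random.choice(headers_list), timeout=10)
    except requests.RequestException:
        for i in range(4):
            sleep(random.uniform(1, 3))  # back off a little before retrying
            response = requests.get(url, headers=random.choice(headers_list), timeout=20)
            if response.status_code == 200:
                break
    return response.text

Calling sleep(random.uniform(0.5, 1.5)) after each school inside the main loop paces the crawl in the same way.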

   
