Scraping the "中国最好大学排名" (Best Chinese Universities Ranking) with Python

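Source code reference:
The "中国大学排名爬虫" (Chinese university ranking crawler) example from the Beijing Institute of Technology open course 《Python网络爬虫与信息提取》 (Python Web Crawling and Information Extraction).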

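Changes made on top of that source code:
(1) Added request headers.
(2) The site's URLs for different years differ only in the year, so the data for any year from 2016 to 2019 can be scraped directly by putting that year into the URL.
(3) The number of universities to print, i.e. the top N, is chosen on each run.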
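
The URL pattern noted in (2) can be sketched as a small helper. This is only an illustration: the name `build_ranking_url` and the 2016-2019 range check are assumptions based on the years mentioned above, not part of the original course code.

```python
BASE_URL = "http://www.zuihaodaxue.com/zuihaodaxuepaiming"
SUPPORTED_YEARS = range(2016, 2020)  # 2016-2019, the years mentioned above


def build_ranking_url(year):
    """Return the ranking page URL for the given year (hypothetical helper)."""
    year = int(year)  # accepts "2019" as well as 2019
    if year not in SUPPORTED_YEARS:
        raise ValueError(f"year must be between 2016 and 2019, got {year}")
    return f"{BASE_URL}{year}.html"


print(build_ranking_url("2019"))
```

Validating the year before sending the request avoids fetching a page that does not exist and then failing later while parsing it.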

import requests
from bs4 import BeautifulSoup
import bs4

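def getHtmlText(url):
    # Send a browser-like User-Agent so the request looks like a normal page visit
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
        'Upgrade-Insecure-Requests': '1',
    }
    try:
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()  # raise on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the encoding from the page content
        #print(r.text)
        return r.text
    except requests.RequestException:
        # Return an empty page on any network or HTTP error
        return ""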

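def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # empty or unexpected page: nothing to parse
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):  # skip the text nodes between rows
            tds = tr('td')
            #print(tds[0])  #62
            #print(tds[0].string) #62
            # Note: .string is None when a cell is empty or has more than one child
            ulist.append([tds[0].string, tds[1].string, tds[2].string, tds[3].string])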

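def printUnivList(ulist, num):
    # chr(12288) is the full-width space; using it as the fill character keeps Chinese names aligned
    tlp = "{0:^10}\t{1:{4}^10}\t{2:^10}\t{3:^10}"
    print(tlp.format("Rank", "University", "Province", "Score", chr(12288)))
    for i in range(min(num, len(ulist))):  # never read past the rows actually scraped
        u = ulist[i]
        print(tlp.format(u[0],u[1],u[2],u[3],chr(12288)))
    print("suc" + str(num))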
            
def main():
    year = input("Enter the year to query: ")
    url = "http://www.zuihaodaxue.com/zuihaodaxuepaiming" + year + ".html"
    print("URL of the " + year + " Best Chinese Universities Ranking: " + url)
    html = getHtmlText(url)

    uinfo = []
    fillUnivList(uinfo, html)

    # input() returns a str, so it must be converted to int; otherwise range() raises
    # TypeError: 'str' object cannot be interpreted as an integer
    num = int(input("Enter the number of universities to show: "))
    printUnivList(uinfo, num)
       
if __name__ == "__main__":
    main()
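
The `int(input(...))` conversion in `main()` still fails when the user types something that is not a number. A minimal sketch of a more forgiving conversion follows; `parse_count` is a hypothetical helper, and returning 0 for invalid input is just one possible choice.

```python
def parse_count(text, available):
    """Convert user input to a row count between 0 and `available` (hypothetical helper)."""
    try:
        count = int(text.strip())
    except ValueError:
        return 0  # non-numeric input: print nothing instead of crashing
    return max(0, min(count, available))


print(parse_count("20", 500))    # a normal request
print(parse_count("9999", 500))  # more rows than were scraped
print(parse_count("abc", 500))   # not a number
```

In `main()` this could be called as `parse_count(input(...), len(uinfo))`, assuming `uinfo` has already been filled.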

① Successful run
② An exception occurs:

  File "C:…spyder-py3/temp.py", line 32, in printUnivList
    print(tlp.format(u[0],u[1],u[2],u[3],chr(12288)))

TypeError: unsupported format string passed to NoneType.__format__
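
This error appears when a value passed to `str.format` is `None` and the placeholder carries a format spec such as `^10`, because `None` only supports an empty format spec. BeautifulSoup's `.string` returns `None` for a tag that has no children or more than one child, so an empty or nested `<td>` is a likely source here. Below is a minimal reproduction with one possible workaround; `safe_cell` is a hypothetical helper name and the row is made-up placeholder data, not scraped output.

```python
def safe_cell(value):
    """Return a printable string, mapping None to an empty string (hypothetical helper)."""
    return "" if value is None else str(value)


# Placeholder row; the None mimics a cell for which .string returned None
row = ["1", "示例大学", "示例省", None]

try:
    print("{0:^10}".format(row[3]))
except TypeError as exc:
    print("TypeError:", exc)

# Same template as printUnivList, but every value is made printable first
tlp = "{0:^10}\t{1:{4}^10}\t{2:^10}\t{3:^10}"
print(tlp.format(*[safe_cell(v) for v in row], chr(12288)))
```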

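Shortcomings:
(1) The video course scrapes only the first three columns; scraping the first four columns raises the exception above.
(2) Scraping the information in the drop-down list has not been implemented.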
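
For shortcoming (1), one option is to read each cell with `get_text(strip=True)` instead of `.string`, since `get_text` always returns a string, even for an empty cell. The sketch below runs against a small inline HTML fragment that is made up for illustration (it is not copied from the site), and `extract_rows` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the second row has an empty fourth cell
SAMPLE_HTML = """
<table><tbody>
<tr><td>1</td><td><div>示例大学A</div></td><td>示例省</td><td>95.0</td></tr>
<tr><td>2</td><td><div>示例大学B</div></td><td>示例省</td><td></td></tr>
</tbody></table>
"""


def extract_rows(html, columns=4):
    """Return the first `columns` cells of every table row as plain strings."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")
    if tbody is None:  # page failed to load or has a different layout
        return []
    rows = []
    for tr in tbody.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) >= columns:  # skip rows that are too short
            rows.append([td.get_text(strip=True) for td in tds[:columns]])
    return rows


for row in extract_rows(SAMPLE_HTML):
    print(row)
```

Because every cell is already a string, the rows can be passed to `tlp.format` without hitting the `NoneType.__format__` error. Whether the real page has the same cell layout is an assumption that would need checking against the live HTML.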
