Python Web Scraping: Fixing Garbled Chinese Text in Scraped Pages

1. Check the encoding declared in the page source

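To find a page's encoding, open the page (this one, for example), right-click and choose "View page source", then look at the charset attribute of the <meta> tag. Here the declared encoding is utf-8.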
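The same check can be done in code rather than by eye. The following is a minimal sketch, not part of the original post: `declared_charset` and `META_CHARSET_RE` are hypothetical names, and the regular expression covers only the two common `<meta>` forms, so treat it as an approximation rather than a full HTML parser.

```python
import re

# Matches both <meta charset="utf-8"> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=gbk"> form.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE
)


def declared_charset(raw_html: bytes):
    """Return the charset named in a <meta> tag, or None if there is none."""
    # Only the first few KB are searched, since the declaration lives in <head>.
    match = META_CHARSET_RE.search(raw_html[:4096])
    return match.group(1).decode('ascii').lower() if match else None


sample = b'<html><head><meta charset="UTF-8"><title>demo</title></head></html>'
print(declared_charset(sample))
```

With requests, the raw bytes to pass in would be `resp.content`, which is the undecoded response body.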

2. A UTF-8 page can still come out garbled

For example, China News (中国新闻网) declares utf-8 as its encoding, yet the page source fetched by a scraper comes out garbled:

import bs4
import requests

URL = 'https://www.chinanews.com/scroll-news/news1.html'
resp = requests.get(
    url=URL,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
        'Cookie': 'cnsuuid=ae27ca94-6648-a265-8959-579aae56af412059.884365667969_1610175649841'
    }
)
# resp.text is decoded with the encoding that requests inferred for this response
soup = bs4.BeautifulSoup(resp.text, 'html.parser')
print(soup)

The result is garbled, as the screenshot shows:
[Image 1: screenshot of the garbled output]
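A likely explanation, assuming the server's Content-Type header carries no charset (the post does not show the headers): for text responses without a declared charset, requests falls back to ISO-8859-1, so `resp.text` decodes UTF-8 bytes with the wrong codec. The sketch below reproduces that effect without any network access. The sample string is only an illustration.

```python
# Bytes as a UTF-8 server would send them.
raw = '中国新闻网'.encode('utf-8')

# Decoding with the wrong codec produces mojibake instead of an error,
# because every byte value is a valid ISO-8859-1 character.
garbled = raw.decode('iso-8859-1')
print(repr(garbled))

# Decoding the same bytes as UTF-8 recovers the original text.
fixed = raw.decode('utf-8')
print(fixed)
```

On a real response, printing `resp.encoding` (what requests will use) next to `resp.apparent_encoding` (what the bytes look like) is a quick way to confirm the mismatch.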

The fix is to set the response encoding explicitly, so that requests decodes the body as utf-8:

import bs4
import requests

URL = 'https://www.chinanews.com/scroll-news/news1.html'
resp = requests.get(
    url=URL,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
        'Cookie': 'cnsuuid=ae27ca94-6648-a265-8959-579aae56af412059.884365667969_1610175649841'
    }
)
# The default decoding was garbled, so tell requests to decode the body as utf-8.
# This must be set before resp.text is read.
resp.encoding = 'utf-8'
soup = bs4.BeautifulSoup(resp.text, 'html.parser')
print(soup)

This time the output is correct:
[Image 2: screenshot of the correctly decoded output]

3. Another case: pages encoded as GBK

For example, the official Honor of Kings (王者荣耀) site is encoded as GBK:

import bs4
import requests

URL = 'https://pvp.qq.com/'
resp = requests.get(
    url=URL,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
        'Cookie': 'cnsuuid=ae27ca94-6648-a265-8959-579aae56af412059.884365667969_1610175649841'
    }
)
# resp.text is decoded with the encoding that requests inferred for this response
soup = bs4.BeautifulSoup(resp.text, 'html.parser')
print(soup)

The result is garbled:
[Image: screenshot of the garbled output]

Now try the same fix as before:

import bs4
import requests

URL = 'https://pvp.qq.com/'
resp = requests.get(
    url=URL,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
        'Cookie': 'cnsuuid=ae27ca94-6648-a265-8959-579aae56af412059.884365667969_1610175649841'
    }
)
# Same fix as before: tell requests to decode the body as utf-8.
# This page is GBK, so utf-8 is the wrong codec here.
resp.encoding = 'utf-8'
soup = bs4.BeautifulSoup(resp.text, 'html.parser')
print(soup)

The result is still garbled:
[Image 3: screenshot of the still-garbled output]

What is going on here?
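The page's bytes are GBK, so decoding them as utf-8 is just a different wrong guess. The sketch below shows this offline with a sample string standing in for the page body (an illustration, not the site's actual content). requests decodes `resp.text` with `errors='replace'`, which is why the scraper prints garbage instead of raising an exception.

```python
# Bytes as a GBK server would send them.
raw = '王者荣耀'.encode('gbk')

# Strict decoding shows that these bytes are not valid UTF-8 at all.
try:
    raw.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)

# With errors='replace', invalid bytes turn into U+FFFD replacement
# characters instead of raising, so the damage is silent.
lossy = raw.decode('utf-8', errors='replace')
print('\ufffd' in lossy)

# Decoding with the codec the page actually uses works.
print(raw.decode('gbk'))
```

Because the replacement characters discard the original byte values, text that was decoded this way cannot be repaired afterwards; the decoding has to start again from the raw bytes.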

Take a different approach: turn the wrongly decoded text back into bytes, then decode those bytes as GBK.

import bs4
import requests

URL = 'https://pvp.qq.com/'
resp = requests.get(
    url=URL,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
        'Cookie': 'cnsuuid=ae27ca94-6648-a265-8959-579aae56af412059.884365667969_1610175649841'
    }
)
# Undo the ISO-8859-1 decoding behind resp.text, then decode the bytes as GBK.
# This only works when requests decoded the body as ISO-8859-1 in the first place.
soup = bs4.BeautifulSoup(resp.text.encode('iso-8859-1').decode('gbk'), 'html.parser')
print(soup)

The result:
[Image 4: screenshot of the correctly decoded output]

Done!
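Why the round trip works: ISO-8859-1 maps each of the 256 byte values to exactly one character, so encoding the wrongly decoded text back to ISO-8859-1 returns the original bytes untouched. The sketch below checks this offline, assuming (as in the example above) that the text was produced by an ISO-8859-1 decode. The sample string is only an illustration.

```python
original = '王者荣耀'
raw = original.encode('gbk')  # bytes as the server would send them

# What resp.text would hold if the body was decoded as ISO-8859-1 (assumed).
wrong_text = raw.decode('iso-8859-1')

# ISO-8859-1 maps every byte 0x00-0xFF to exactly one character, so
# encoding the wrong text back yields the original bytes unchanged.
recovered_bytes = wrong_text.encode('iso-8859-1')
print(recovered_bytes == raw)

# Decoding the raw bytes directly gives the same result in one step.
print(recovered_bytes.decode('gbk') == raw.decode('gbk') == original)
```

In practice the detour is often unnecessary: setting `resp.encoding = 'gbk'` before reading `resp.text`, or calling `resp.content.decode('gbk')`, decodes the raw bytes directly and does not depend on which codec requests guessed.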
