Detecting a Web Page's Encoding in Python

There is a thirst that only wine can quench, and that thirst is loneliness.

Finding data based on the encoding the page reports

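Say I want the title of a page: a plain regex match on <title>(.*?)</title> is enough. Often, though, an encoding problem keeps the requests library from decoding the page correctly, so the data cannot be extracted.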
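
To make the problem concrete, here is a small sketch that needs no network access. The HTML bytes and the title in it are made up for illustration. The point is only that the same bytes give a readable title under the right codec and garbage under ISO-8859-1, which is what requests assumes for a text response whose headers name no charset.

```python
import re

# Hypothetical page body: a GBK-encoded document, as raw bytes off the wire.
raw = '<html><head><title>编码测试</title></head></html>'.encode('gbk')

TITLE_RE = re.compile(r'<title>(.*?)</title>', re.S)


def title_of(html):
    """Return the text inside <title>...</title>, or None if absent."""
    match = TITLE_RE.search(html)
    return match.group(1) if match else None


# Decoding with the wrong codec still "succeeds" but yields mojibake.
wrong = title_of(raw.decode('iso-8859-1'))
# Decoding with the codec the page was written in recovers the title.
right = title_of(raw.decode('gbk'))

print(repr(wrong))
print(repr(right))
```

Note that the wrong decode raises no error at all, which is why the failure usually shows up only later as an unreadable title.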

The workaround:

        r_port_top = requests.get(url='http://' + url, headers=headers, timeout=5)
        # requests reports ISO-8859-1 for a text response whose headers name no charset
        if r_port_top.encoding == 'ISO-8859-1':
            # Prefer a charset declared inside the HTML itself
            encodings = requests.utils.get_encodings_from_content(r_port_top.text)
            if encodings:
                encoding = encodings[0]
            else:
                # Otherwise fall back to a guess made from the raw bytes
                encoding = r_port_top.apparent_encoding
            # Decode the raw bytes with the chosen encoding; undecodable bytes are replaced
            encode_content = r_port_top.content.decode(encoding, 'replace')
            port_title = re.search('<title>(.*?)</title>', encode_content, re.S).group(1)
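
The code above leans on two sources for the encoding: a charset declared inside the HTML, and a guess made from the bytes. The first of these can be sketched with the standard library alone. The helper names sniff_meta_charset and decode_html are hypothetical, and the regex is a simplification for illustration, not the pattern requests uses internally.

```python
import re

# Simplified pattern covering <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET_RE = re.compile(rb'<meta[^>]*?charset=["\']?([\w-]+)', re.I)


def sniff_meta_charset(raw):
    """Return the charset named in a <meta> tag, or None if none is found."""
    match = META_CHARSET_RE.search(raw)
    return match.group(1).decode('ascii') if match else None


def decode_html(raw, fallback='utf-8'):
    """Decode raw HTML bytes using the declared charset, else the fallback."""
    encoding = sniff_meta_charset(raw) or fallback
    try:
        return raw.decode(encoding, 'replace')
    except LookupError:
        # The page declared a codec name Python does not know.
        return raw.decode(fallback, 'replace')


# Made-up page that declares its own encoding in a <meta> tag.
page = ('<html><head><meta charset="gbk">'
        '<title>测试页面</title></head></html>').encode('gbk')
print(sniff_meta_charset(page))
print(decode_html(page))
```

The declaration is searched for in the raw bytes, which works because the tag itself is plain ASCII in every encoding discussed here.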

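This approach first works out the page's encoding and then converts the content. But sometimes the page is UTF-8 and this approach is no help, so here comes the ultimate version.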

    try:
        UA = random.choice(headerss)
        headers = {'User-Agent': UA}
        r_port_top = requests.get(url='http://' + url, headers=headers, timeout=5)
        # Every encoding listed here gets the same treatment, so one branch covers them all
        if r_port_top.encoding in ('ISO-8859-1', 'GB2312', 'gb2312', 'GBK', 'gbk'):
            # Prefer a charset declared inside the HTML itself
            encodings = requests.utils.get_encodings_from_content(r_port_top.text)
            if encodings:
                encoding = encodings[0]
            else:
                # Otherwise fall back to a guess made from the raw bytes
                encoding = r_port_top.apparent_encoding
            encode_content = r_port_top.content.decode(encoding, 'replace')
            port_title = re.search('<title>(.*?)</title>', encode_content, re.S).group(1)
        else:
            # Any other reported encoding (UTF-8, for instance) is trusted as-is
            port_title = re.search('<title>(.*?)</title>', r_port_top.text, re.S).group(1)
    except Exception:
        try:
            # Last resort: decode the raw bytes as UTF-8 and search again
            port_title = re.search('<title>(.*?)</title>',
                                   r_port_top.content.decode('utf-8', 'replace'), re.S).group(1)
        except Exception:
            port_title = 'Unable to get the site title for now'
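
The check above has to list both 'GB2312' and 'gb2312', and both 'GBK' and 'gbk', because the same encoding can be spelled in more than one way. One way to sidestep that, sketched here on the assumption that the name is one Python's codec registry recognises, is to normalise the name with codecs.lookup before comparing. The helper names canonical_name and needs_recheck are hypothetical.

```python
import codecs


def canonical_name(encoding):
    """Return Python's canonical codec name, or None for unknown or missing names."""
    if not encoding:
        return None
    try:
        return codecs.lookup(encoding).name
    except LookupError:
        return None


# The reported encodings that the code above re-checks, in canonical form.
SUSPECT = {canonical_name(n) for n in ('ISO-8859-1', 'GB2312', 'GBK')}


def needs_recheck(reported):
    """True when the reported encoding is a suspect one, however it is spelled."""
    name = canonical_name(reported)
    return name is not None and name in SUSPECT


for reported in ('gbk', 'GBK', 'Gb2312', 'latin-1', 'utf-8', None):
    print(reported, needs_recheck(reported))
```

Aliases are resolved as well, so latin-1 is treated the same as ISO-8859-1.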

Detecting and converting directly with chardet

The method above is really clumsy. chardet takes care of the page-encoding problem with far less effort.

    # -*- coding: utf-8 -*-
    # @Time    : 2018/5/4 0004 8:55
    # @Author  : Langzi
    # @Blog    : www.langzi.fun
    # @File    : get urls.py
    # @Software: PyCharm
    import chardet
    import re
    import requests

    url = 'https://stackoverflow.com'
    d1 = requests.get(url)
    print(d1.content)
    if isinstance(d1.content, str):
        # Already decoded text (requests normally returns bytes here)
        a = d1.content
    else:
        # Let chardet guess the encoding from the raw bytes, then decode with it
        codesty = chardet.detect(d1.content)
        a = d1.content.decode(codesty['encoding'] or 'utf-8', 'replace')

The resulting a is the decoded page text, so at this point re.search('<title>(.*?)</title>', a) is enough to match the title of any site.
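
chardet.detect can report None as the encoding when it has no guess, so a reusable helper needs a fallback. Below is a minimal sketch of such a helper. The function names decode_page and extract_title are hypothetical, and the detector is passed in as a parameter so that the logic can be exercised without the network; fake_detect is a stand-in used only for the demo.

```python
import re


def decode_page(raw, detect, fallback='utf-8'):
    """Decode raw bytes with the encoding that detect(raw) reports.

    detect is any callable shaped like chardet.detect: it takes bytes and
    returns a dict with an 'encoding' key whose value may be None.
    """
    encoding = detect(raw).get('encoding') or fallback
    try:
        return raw.decode(encoding, 'replace')
    except LookupError:
        # The detector named a codec this Python build does not ship.
        return raw.decode(fallback, 'replace')


def extract_title(html):
    """Return the stripped <title> text, or None when the page has no title."""
    match = re.search(r'<title[^>]*>(.*?)</title>', html, re.S | re.I)
    return match.group(1).strip() if match else None


# Stand-in detector for the demo; in real use pass chardet.detect instead.
def fake_detect(raw):
    return {'encoding': 'gb2312', 'confidence': 0.99}


raw = '<html><title> 首页 </title></html>'.encode('gb2312')
print(extract_title(decode_page(raw, fake_detect)))
```

With the real libraries installed, the call would look like extract_title(decode_page(requests.get(url).content, chardet.detect)).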

Personal blog: www.langzi.fun
Feel free to get in touch about Python development and security testing.
