UnicodeEncodeError: 'gbk' codec can't encode character '\xee' in position 71: illegal multibyte sequence

Solved: UnicodeEncodeError: 'gbk' codec can't encode character '\xee' in position 71: illegal multibyte sequence

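Original code

with open('douban.html', 'r', encoding='utf-8') as f:
    # The with block closes the file automatically, so no f.close() is needed.
    data = f.read()
print(data)  # this is the call that fails when the console encoding is GBK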

The error

[Image 1: screenshot of the error, UnicodeEncodeError: 'gbk' codec can't encode character '\xee' in position 71: illegal multibyte sequence]

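I could not get the contents of the HTML file printed. I tried several encodings and none of them helped. What finally worked was encoding and then decoding the text right after reading it: encode it as GBK, ignoring the characters GBK cannot represent, then decode it back. (The read itself succeeds as UTF-8. A 'gbk' UnicodeEncodeError is raised on output, when print() writes to a console that uses GBK.)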
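The exception is an encode error from the 'gbk' codec, which points at the moment text is written out rather than the moment the file is decoded as UTF-8. Here is a minimal sketch of that behaviour, assuming an in-memory stream can stand in for a GBK console. `write_as_gbk` and the sample string are hypothetical names made up for this illustration:

```python
import io


def write_as_gbk(text, errors="strict"):
    """Write text to an in-memory stream that encodes as GBK (stand-in for a GBK console)."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="gbk", errors=errors)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


# Hypothetical sample: '\xee' is the character named in the error message.
sample = "电影\xee评分"

try:
    write_as_gbk(sample)  # strict error handling, the default
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print("strict write failed:", exc.reason)

# With errors="ignore" the unencodable character is silently dropped.
survived = write_as_gbk(sample, errors="ignore").decode("gbk")
print(repr(survived))
```

The same text that fails under strict error handling goes through once unencodable characters are ignored, which is what the fix relies on.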

Fixed code

with open('douban250.html', 'r', encoding='utf-8') as f:
    # Encode to GBK, dropping characters GBK cannot represent, then decode back to str.
    data = f.read().encode('GBK', 'ignore').decode('GBK')
print(data)
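The encode/decode round trip discards every character GBK cannot represent, so it can help to see what is being lost. This is a small sketch on a made-up string. `gbk_round_trip` and `dropped_characters` are hypothetical helper names, not part of the original code:

```python
def gbk_round_trip(text):
    """Apply the same encode-ignore-decode step used in the fix."""
    return text.encode("gbk", "ignore").decode("gbk")


def dropped_characters(text):
    """Return the characters of text that GBK cannot encode, in order."""
    dropped = []
    for ch in text:
        try:
            ch.encode("gbk")
        except UnicodeEncodeError:
            dropped.append(ch)
    return dropped


# Hypothetical sample containing the character from the error message.
sample = "肖申克\xee的救赎"
cleaned = gbk_round_trip(sample)
lost = dropped_characters(sample)
print(repr(cleaned), lost)
```

If the dropped characters matter for later processing, it may be better to keep the original string for parsing and apply the round trip only to what gets printed.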

Output:
[Image 2: screenshot of the output]

The file was read and printed successfully, so the data can now be processed further!

from lxml import etree
import pandas as pd

test = etree.HTML(data)
# Each <li> in the ordered list is one movie entry.
key = test.xpath('//*[@id="content"]/div/div[1]/ol/li')

row = []
for i in key:
    name_chinese = i.xpath('./div/div[2]/div[1]/a/span[1]/text()')[0]
    name_english = i.xpath('./div[1]/div[2]/div[1]/a/span[2]/text()')[0].strip()
    name = name_chinese + name_english
    director = i.xpath('./div/div[2]/div[2]/p[1]/text()')[0].strip()
    time = i.xpath('./div[1]/div[2]/div[2]/p/text()')[1].strip()
    # Keep only the part of the class value between the first 6 and last 2 characters.
    rating = i.xpath('./div/div[2]/div[2]/div/span[1]/@class')[0][6:-2]
    rating_num = i.xpath('./div/div[2]/div[2]/div/span[2]/text()')[0]
    # Strip the "人评价" ("people rated") suffix from the count.
    comment_num = i.xpath('./div/div[2]/div[2]/div/span[4]/text()')[0].replace('人评价', '')
    comments = i.xpath('./div/div[2]/div[2]/p[2]/span/text()')[0]

    # Columns: name, details, time, rating level, score, number of ratings, comment.
    record = {'名称': name, '详情': director, '时间': time, '等级': rating, '评分': rating_num, '评价人数': comment_num, '评价': comments}
    row.append(record)

row = pd.DataFrame(row)

# Writing .xlsx needs an Excel engine such as openpyxl to be installed.
row.to_excel('电影信息.xlsx')
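Every field in the loop above is taken with `[0]`, which raises IndexError as soon as one entry lacks that node. A hedged sketch of two small helpers that make this step more forgiving follows. `first_text` and `rating_from_class` are hypothetical names, and the `rating5-t` / `rating45-t` class values are an assumption about the page markup, not something stated in the text:

```python
def first_text(nodes, default=""):
    """Return the first XPath result stripped of whitespace, or default if there is none."""
    return nodes[0].strip() if nodes else default


def rating_from_class(class_name):
    """Apply the same [6:-2] slice as the loop above to a class attribute value."""
    return class_name[6:-2]


# XPath queries return plain lists, so ordinary lists can stand in for them here.
quote = first_text(["  希望让人自由。 "])
missing = first_text([], default="(none)")

# Assumed class values; the slice keeps what sits between "rating" and "-t".
five_stars = rating_from_class("rating5-t")
four_and_half = rating_from_class("rating45-t")
print(quote, missing, five_stars, four_and_half)
```

In the loop, a call such as `first_text(i.xpath('./div/div[2]/div[2]/p[2]/span/text()'))` could replace the direct `[0]` lookup for fields that may be absent.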

This is the final result:
[Image 3: screenshot of the result]
