Fixing garbled Chinese text when scraping web pages

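The page is encoded as gb2312, so Chinese text extracted from it comes out garbled. The fix:

# This assumes the text was decoded as ISO 8859-1; encoding it back recovers the raw bytes
str1 = paper.css('a::text').extract_first()
str1 = str1.encode("ISO 8859-1")
# gbk is a superset of gb2312, so it decodes the page's bytes correctly
print(str1.decode('gbk'))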
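
The round trip works because ISO 8859-1 maps every byte value from 0 to 255 to exactly one character, so encoding with it gives back the original bytes unchanged. Below is a minimal sketch that simulates the wrong decode without a live page; the sample string and the variable names are assumptions for illustration only.

```python
# Simulate a gbk/gb2312 page whose bytes were decoded with the wrong codec.
original = '中文测试'                      # hypothetical page text
raw_bytes = original.encode('gbk')         # the bytes a server would send
garbled = raw_bytes.decode('iso-8859-1')   # wrong decode: mojibake, but no bytes are lost

# Same fix as above: turn the garbled str back into bytes, then decode as gbk.
recovered = garbled.encode('iso-8859-1').decode('gbk')

print(garbled != original)    # True
print(recovered == original)  # True
```

If the wrong codec had been a lossy one (for example a decode with errors='replace'), the original bytes could not be recovered this way.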
 

Python string prefixes r, b, u, f: what they mean, str/bytes conversion, and format

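What the r, b, u, and f prefixes on a string literal mean:

b'input\n' # bytes literal; its repr is printed with a leading b.
Output:
b'input\n'
r'input\n' # raw string: escape sequences are not processed, so '\n' stays as the two characters '\\' and 'n' instead of becoming a newline.
Output:
'input\\n'
u'input\n' # Unicode string; every str in Python 3 is already Unicode, so the u prefix has no effect there.
Output:
'input\n'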
import time
t0 = time.time()
time.sleep(1)
name = 'processing'
print(f'{name} done in {time.time() - t0:.2f} s')  # the f prefix allows Python expressions inside the braces
Output (approximately):
processing done in 1.00 s
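
The differences between the four prefixes are easy to check by comparing lengths and types. A small sketch follows; the variable names are chosen here for illustration.

```python
raw = r'input\n'      # raw: the backslash and 'n' stay as two characters
cooked = 'input\n'    # normal: \n is a single newline character
data = b'input\n'     # bytes literal

print(len(raw))               # 7
print(len(cooked))            # 6
print(raw == 'input\\n')      # True
print(u'input\n' == cooked)   # True: u changes nothing in Python 3
print(type(data).__name__)    # bytes
print(data[0])                # 105: indexing bytes yields an int

value = 3
print(f'{value + 1}')         # 4: the expression inside the braces is evaluated
```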
str.format works much like the f prefix: braces mark the fields, and a colon introduces the format spec.
coord = (3, 5)
'X: {0[0]};  Y: {0[1]}'.format(coord)
'X: 3;  Y: 5'

'{0}, {1}, {0}'.format(*'abc')      # unpacking argument sequence
'a, b, a'

'Coordinates: {latitude}, {longitude}'.format(latitude='37.24N', longitude='-115.81W')
'Coordinates: 37.24N, -115.81W'

'{:,}'.format(1234567890)
'1,234,567,890'

points, total = 19, 22
'Correct answers: {:.2%}'.format(points/total)
'Correct answers: 86.36%'
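
The same format specs work in both str.format and f-strings, which can be verified side by side. This is a short sketch; the values of points and total are sample numbers assumed for illustration.

```python
points, total = 19, 22   # sample values, assumed for illustration
coord = (3, 5)

with_format = 'Correct answers: {:.2%}'.format(points / total)
with_fstring = f'Correct answers: {points / total:.2%}'
print(with_format)                   # Correct answers: 86.36%
print(with_format == with_fstring)   # True

# Thousands separator, written both ways.
print('{:,}'.format(1234567890) == f'{1234567890:,}')   # True

# Width, alignment, and precision in one spec: right-align in 8 columns, 3 decimals.
padded = '{:>8.3f}|'.format(3.14159)
print(padded)                        # '   3.142|' without the quotes

# Indexing inside a field: str.format uses 0[0], an f-string uses the name directly.
print('X: {0[0]}; Y: {0[1]}'.format(coord) == f'X: {coord[0]}; Y: {coord[1]}')  # True
```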

Converting between str and bytes:

'€20'.encode('utf-8')
# b'\xe2\x82\xac20'
b'\xe2\x82\xac20'.decode('utf-8')
# '€20'
s1 = '123'
print(s1)
print(type(s1))
s2 = b'123'
print(s2)
print(type(s2))

Output, showing the difference:
123
<class 'str'>
b'123'
<class 'bytes'>
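In Python 2, a plain string literal is a byte string (str is the same type as bytes).
In Python 3, a plain string literal is Unicode text (str), and bytes is a separate type.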
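
In Python 3 the two types count different things and do not mix implicitly. The sketch below illustrates this; the sample text and variable names are assumptions for illustration.

```python
text = '中文'
data = text.encode('utf-8')

print(len(text))          # 2: str counts characters
print(len(data))          # 6: bytes counts bytes (3 per character here in UTF-8)
print(data[0])            # 228: indexing bytes yields an int (0xE4)
print('123' == b'123')    # False: str and bytes never compare equal

# Mixing the two types in one operation raises TypeError.
try:
    'abc' + b'def'
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)           # False
```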

str to bytes:
bytes('123', encoding='utf8')
str.encode('123')

bytes to str:
str(b'123', encoding='utf-8')
bytes.decode(b'123')

# bytes object
b = b"example"

# str object
s = "example"

# str to bytes
bytes(s, encoding="utf8")

# bytes to str
str(b, encoding="utf-8")

# an alternative method
# str to bytes
str.encode(s)

# bytes to str
bytes.decode(b)
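
In everyday code these conversions are usually written as method calls on the object itself, with the codec named explicitly. Decoding with the wrong codec raises UnicodeDecodeError unless an errors policy is given. The following is a minimal sketch of that behavior, with variable names chosen for illustration.

```python
s = '€20'
utf8_bytes = s.encode('utf-8')          # same result as str.encode(s, 'utf-8')
print(utf8_bytes)                       # b'\xe2\x82\xac20'
print(utf8_bytes.decode('utf-8') == s)  # True

# The euro sign's three bytes are not valid ASCII, so a strict decode fails.
try:
    utf8_bytes.decode('ascii')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)                        # False

# errors='replace' substitutes U+FFFD for each undecodable byte; 'ignore' drops them.
replaced = utf8_bytes.decode('ascii', errors='replace')
ignored = utf8_bytes.decode('ascii', errors='ignore')
print(ignored)                          # 20

# When encoding, errors='replace' substitutes '?' for unencodable characters.
print(s.encode('ascii', errors='replace'))  # b'?20'
```

Both replace and ignore lose information, so they are better suited to logging or display than to data that must round-trip.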

 
