The page is encoded as gb2312, and Chinese text scraped from it comes out garbled. Fix:
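str1 = paper.css('a::text').extract_first()
# Assumes the text was wrongly decoded as ISO 8859-1: re-encode to get the raw bytes back,
str1 = str1.encode("ISO 8859-1")
# then decode those bytes as GBK, which is a superset of GB2312.
print(str1.decode('gbk'))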
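The round trip can be checked without a crawler. This is a minimal sketch, assuming the page bytes were GBK/GB2312 and were decoded as ISO 8859-1 somewhere along the way; the sample text and variable names are illustrative and not taken from the original page.

```python
# Simulate a GBK-encoded page that was wrongly decoded as ISO 8859-1.
original = '中文'
raw_bytes = original.encode('gbk')          # the bytes the server would send
garbled = raw_bytes.decode('iso-8859-1')    # mojibake: one character per byte

# ISO 8859-1 maps every byte value to exactly one code point, so encoding
# with it recovers the original bytes; decoding as GBK then gives the text.
recovered = garbled.encode('iso-8859-1').decode('gbk')

print(len(garbled))            # 4, because each Chinese character is 2 bytes in GBK
print(recovered == original)   # True
```

If the garbled text was produced by a different wrong codec, the same idea applies, but the codec passed to encode() has to match the one that was wrongly used.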
Python string prefixes r, b, u, f: what they mean, str/bytes conversion, and format
What each string prefix means:
b'input\n' # bytes literal; its repr is printed with a leading b.
Output:
b'input\n'
r'input\n' # raw string: backslash escapes are not processed, so \n here is two characters (a backslash and 'n'), not a newline.
Output:
'input\\n'
u'input\n' # Unicode string. In Python 3 every str is already Unicode, so the u prefix has no effect.
Output:
'input\n'
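The differences between the b, r, and u prefixes are easier to see by comparing lengths and types. A small sketch; the variable names are made up for illustration.

```python
plain = 'input\n'
raw = r'input\n'
uni = u'input\n'
data = b'input\n'

print(len(plain))            # 6: \n is a single newline character
print(len(raw))              # 7: the backslash and 'n' are kept as two characters
print(raw == 'input\\n')     # True: same string as escaping the backslash by hand
print(uni == plain)          # True: u'' and '' are the same type in Python 3
print(type(data).__name__)   # bytes
print(data[0])               # 105: indexing bytes gives an int, here ord('i')
```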
import time
t0 = time.time()
time.sleep(1)
name = 'processing'
print(f'{name} done in {time.time() - t0:.2f} s') # the f prefix lets the string evaluate Python expressions inside braces
Output:
processing done in 1.00 s
str.format works much like the f prefix: braces mark the fields to fill in, and a colon introduces the format spec.
coord = (3, 5)
'X: {0[0]}; Y: {0[1]}'.format(coord)
'X: 3; Y: 5'
'{0}, {1}, {0}'.format(*'abc') # unpacking argument sequence
'a, b, a'
'Coordinates: {latitude}, {longitude}'.format(latitude='37.24N', longitude='-115.81W')
'Coordinates: 37.24N, -115.81W'
'{:,}'.format(1234567890)
'1,234,567,890'
points = 19
total = 22
'Correct answers: {:.2%}'.format(points/total)
'Correct answers: 86.36%'
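The same format specs can be written with f-strings. A short sketch comparing the two spellings; the values 19 and 22 are illustrative and were picked because they give 86.36%.

```python
coord = (3, 5)
points, total = 19, 22  # illustrative values

# Index into the tuple directly instead of using {0[0]} and {0[1]}.
via_format = 'X: {0[0]}; Y: {0[1]}'.format(coord)
via_fstring = f'X: {coord[0]}; Y: {coord[1]}'
print(via_format == via_fstring)   # True

# Whatever follows the colon is the same format spec in both styles.
thousands = f'{1234567890:,}'
percent = f'Correct answers: {points / total:.2%}'
print(thousands)   # 1,234,567,890
print(percent)     # Correct answers: 86.36%
```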
Converting between str and bytes:
'€20'.encode('utf-8')
# b'\xe2\x82\xac20'
b'\xe2\x82\xac20'.decode('utf-8')
# '€20'
s1 = '123'
print(s1)
print(type(s1))
s2 = b'123'
print(s2)
print(type(s2))
Output, showing the difference:
123
<class 'str'>
b'123'
<class 'bytes'>
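In Python 2, a plain string literal is a byte string (str holds bytes).
In Python 3, a plain string literal is Unicode text (str holds Unicode code points).
str to bytes: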
bytes('123', encoding='utf8')
str.encode('123')
bytes to str:
str(b'123', encoding='utf-8')
bytes.decode(b'123')
# bytes object
b = b"example"
# str object
s = "example"
# str to bytes
bytes(s, encoding = "utf8")
# bytes to str
str(b, encoding = "utf-8")
# an alternative method
# str to bytes
str.encode(s)
# bytes to str
bytes.decode(b)
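The method form s.encode() / b.decode() is the more common spelling of the conversions above. The sketch below also shows what happens when the codec does not match the bytes; the variable names are illustrative.

```python
s = '€20'

# With no argument, encode() and decode() use UTF-8 in Python 3.
encoded = s.encode()
print(encoded)                   # b'\xe2\x82\xac20'
print(encoded.decode() == s)     # True

# Decoding with a codec that cannot represent the bytes raises an error.
try:
    encoded.decode('ascii')
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)                    # True

# errors='replace' substitutes U+FFFD for undecodable bytes instead of raising.
lossy = encoded.decode('ascii', errors='replace')
print(lossy.endswith('20'))      # True
```

Passing the encoding explicitly is still worth doing whenever the data comes from outside the program, since the bytes may not be UTF-8.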