Web Scraping Basics 5 (Scraping Qiushibaike)

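import re

import requests

# Browser-like User-Agent so the site is less likely to reject the request.
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'}
info_lists = []  # one dict per scraped post


def judgement_sex(class_name):
    # The gender is encoded in the CSS class of the icon element.
    if class_name == "womenIcon":
        return "女"  # female
    else:
        return "男"  # male


def get_info(url):
    # Send the headers defined above with the request.
    res = requests.get(url, headers=headers)
    # NOTE: the HTML tags inside the ids, levels and contents patterns were
    # lost from the source text. The tags used below are assumed
    # reconstructions and may not match the real page markup.
    ids = re.findall(r'<h2>(.*?)</h2>', res.text, re.S)
    levels = re.findall(r'<div class="articleGender \D+Icon">(.*?)</div>', res.text, re.S)
    sexs = re.findall(r'<div class="articleGender (.*?)">', res.text, re.S)
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    # zip stops at the shortest list, so posts with a missing field are dropped.
    for id, level, sex, content in zip(ids, levels, sexs, contents):
        info = {
            "id": id,
            "level": level,
            "sex": judgement_sex(sex),
            "content": content,
        }
        info_lists.append(info)


if __name__ == "__main__":
    # Pages 1 through 13 of the text section.
    urls = ['https://www.qiushibaike.com/text/page/{}/'.format(str(i)) for i in range(1, 14)]
    for url in urls:
        get_info(url)
    for info in info_lists:
        print(info["id"])
        print(info["level"])
        print(info['sex'])
        print(info['content'])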
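The extraction step in get_info can be tried offline. The sketch below runs the same kind of non-greedy pattern with re.S against a made-up HTML fragment. The fragment and the helper name extract_genders are assumptions for illustration only, not the real page markup.

```python
import re

# Hypothetical HTML fragment; the real page markup may differ.
SAMPLE_HTML = """
<div class="articleGender womenIcon">25</div>
<div class="articleGender manIcon">31</div>
"""


def extract_genders(html):
    # (.*?) is non-greedy: it stops at the first '">' after the class prefix.
    # re.S lets '.' match newlines too, which matters when a tag spans lines.
    return re.findall(r'<div class="articleGender (.*?)">', html, re.S)


print(extract_genders(SAMPLE_HTML))
```

Each captured class name can then be passed to judgement_sex to turn it into a label.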
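To save the results to a file, write them to a plain .txt file. Note that a dict cannot be written to a file directly, because write() accepts only strings, so write the fields out one at a time.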
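Writing the fields one at a time could look roughly like the sketch below. The helper name save_infos, the output file name and the sample record are all assumptions made for illustration; in the scraper itself the list to pass in would be info_lists.

```python
import os
import tempfile


def save_infos(infos, path):
    # write() accepts only str, so each dict is written field by field.
    with open(path, "w", encoding="utf-8") as f:
        for info in infos:
            f.write("id: {}\n".format(info["id"]))
            f.write("level: {}\n".format(info["level"]))
            f.write("sex: {}\n".format(info["sex"]))
            f.write("content: {}\n".format(info["content"]))
            f.write("\n")  # blank line between posts


# Made-up sample record standing in for the scraped data.
sample = [{"id": "user1", "level": "25", "sex": "女", "content": "hello"}]
path = os.path.join(tempfile.gettempdir(), "qiushi_sample.txt")
save_infos(sample, path)
```

Opening the file with encoding="utf-8" keeps the Chinese text readable regardless of the platform default. Another option would be to turn each dict into a string first, for example with json.dumps(info, ensure_ascii=False).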
