The three main crawler libraries: requests, BeautifulSoup, and lxml.
lxml is the recommended parser because it is the most efficient.
Use request headers to make the crawler look like a browser: right-click the page and choose Inspect, open the Network tab, refresh the page, then pick a request and look for User-Agent in its request headers,
scrolling to the bottom of the headers panel if necessary.
import lxml          # lxml only needs to be installed; BeautifulSoup uses it through the "lxml" parser name
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
}
res = requests.get("http://bj.xiaozhu.com/", headers=headers)   # send the request with the browser-like headers
soup = BeautifulSoup(res.text, "lxml")                          # parse the HTML with the lxml parser
print(soup.prettify())                                          # pretty-print the parsed document
The parsed Soup object provides the find(), find_all(), and select() methods (a minimal demo follows the list below):
1. soup.find_all("div", "item")  # find every div tag whose class is "item"
2. find() is used like find_all(), but find_all() returns all matches while find() returns only the first one.
3. The select() method:
soup.select("div.item > a > h1")  # the argument is a quoted CSS selector; it can be copied from Chrome: pick the element you want, right-click, Inspect, then copy its selector
Chrome then gives you something like:
page_list > ul > li:nth-child(1) > div.result_btm_con.lodgeunitname > div:nth-child(1) > span > i
Change li:nth-child(1) to li:nth-of-type(1) (older versions of BeautifulSoup's select() only support :nth-of-type), and note that there is a space on each side of >. In my experience this approach is fairly awkward; it is enough to know it exists.
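To make the three methods concrete, here is a minimal sketch run against a small hand-written HTML snippet (the snippet and its class names are invented for illustration; they are not taken from xiaozhu.com):
from bs4 import BeautifulSoup

html = '''
<div class="item"><a href="/a"><h1>First</h1></a></div>
<div class="item"><a href="/b"><h1>Second</h1></a></div>
'''
soup = BeautifulSoup(html, "lxml")
print(soup.find("div", "item").h1.get_text())    # find(): only the first match -> First
print(len(soup.find_all("div", "item")))         # find_all(): a list of every match -> 2
for h1 in soup.select("div.item > a > h1"):      # select(): CSS selector, also returns a list
    print(h1.get_text())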
In practice find_all() is easier to work with; the code below uses it:
import lxml
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
}
res = requests.get("http://bj.xiaozhu.com/", headers=headers)
soup = BeautifulSoup(res.text, "lxml")
prices = soup.find_all("span", "result_price")             # every price span on the listing page
names = soup.find_all("span", "result_title hiddenTxt")    # every title span
for price, name in zip(prices, names):                     # pair each title with its price
    print(name.get_text(), price.get_text())
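The full example further down also pulls each listing's detail-page URL out of its <a> tag; tag attributes are read with .get(). A short sketch, continuing with the soup object from the block above (the class name resule_img_a is taken from that example and assumes the page markup has not changed):
links = soup.find_all("a", "resule_img_a")   # the anchor wrapping each listing's photo
for link in links:
    print(link.get("href"))                  # read the tag's href attribute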
Putting it all together: scraping short-term rental listings for Beijing.
To scrape several pages of results, first page through by hand to see what each page's URL looks like:
http://bj.xiaozhu.com/search-duanzufang-p2-0/
http://bj.xiaozhu.com/search-duanzufang-p3-0/
As you can see, only the number after p changes, so paging can be automated; this time we scrape all of the pages.
import lxml
import requests
from bs4 import BeautifulSoup
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
}

def get_links(url):
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, "lxml")
    prices = soup.find_all("span", "result_price")             # listing prices
    names = soup.find_all("span", "result_title hiddenTxt")    # listing titles
    informations = soup.find_all("em", "hiddenTxt")            # short descriptions
    links = soup.find_all("a", "resule_img_a")                 # links to the detail pages
    for name, price, link, information in zip(names, prices, links, informations):
        data = {
            "name": name.get_text().strip(),
            "price": price.get_text().strip(),
            "href": link.get("href"),
            "information": information.get_text().replace("\n", " ").strip()
        }
        print(data)

urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(number) for number in range(1, 14)]
for url in urls:
    get_links(url)
    time.sleep(2)   # pause between pages so the site is not hit too quickly