The three main crawler libraries: requests, BeautifulSoup, and lxml.
lxml is the recommended parser because it is the most efficient.
Use request headers to make the crawler look like a browser: right-click the page and choose Inspect, open the Network tab, refresh the page, then pick a request and look for User-Agent in its request headers,
scrolling to the bottom of the headers panel if necessary.
import lxml          # lxml only needs to be installed; BeautifulSoup uses it through the "lxml" parser name
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
}
res = requests.get("http://bj.xiaozhu.com/", headers=headers)   # send the request with the browser-like headers
soup = BeautifulSoup(res.text, "lxml")                          # parse the HTML with the lxml parser
print(soup.prettify())                                          # pretty-print the parsed document
The parsed Soup object provides the find(), find_all(), and select() methods (a minimal demo follows the list below):
1. soup.find_all("div", "item")  # find every div tag whose class is "item"
2. find() is used like find_all(), but find_all() returns all matches while find() returns only the first one.
3. The select() method:
soup.select("div.item > a > h1")  # the argument is a quoted CSS selector; it can be copied from Chrome: pick the element you want, right-click, Inspect, then copy its selector
Chrome then gives you something like:
page_list > ul > li:nth-child(1) > div.result_btm_con.lodgeunitname > div:nth-child(1) > span > i
Change li:nth-child(1) to li:nth-of-type(1) (older versions of BeautifulSoup's select() only support :nth-of-type), and note that there is a space on each side of >. In my experience this approach is fairly awkward; it is enough to know it exists.
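To make the three methods concrete, here is a minimal sketch run against a small hand-written HTML snippet (the snippet and its class names are invented for illustration; they are not taken from xiaozhu.com):
from bs4 import BeautifulSoup

html = '''
<div class="item"><a href="/a"><h1>First</h1></a></div>
<div class="item"><a href="/b"><h1>Second</h1></a></div>
'''
soup = BeautifulSoup(html, "lxml")
print(soup.find("div", "item").h1.get_text())    # find(): only the first match -> First
print(len(soup.find_all("div", "item")))         # find_all(): a list of every match -> 2
for h1 in soup.select("div.item > a > h1"):      # select(): CSS selector, also returns a list
    print(h1.get_text())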
In practice find_all() is easier to work with; the code below uses it:
import lxml
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
}
res = requests.get("http://bj.xiaozhu.com/", headers=headers)
soup = BeautifulSoup(res.text, "lxml")
prices = soup.find_all("span", "result_price")             # every price span on the listing page
names = soup.find_all("span", "result_title hiddenTxt")    # every title span
for price, name in zip(prices, names):                     # pair each title with its price
    print(name.get_text(), price.get_text())
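The full example further down also pulls each listing's detail-page URL out of its <a> tag; tag attributes are read with .get(). A short sketch, continuing with the soup object from the block above (the class name resule_img_a is taken from that example and assumes the page markup has not changed):
links = soup.find_all("a", "resule_img_a")   # the anchor wrapping each listing's photo
for link in links:
    print(link.get("href"))                  # read the tag's href attribute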
Putting it all together: scraping short-term rental listings for Beijing.
To scrape several pages of results, first page through by hand to see what each page's URL looks like:
http://bj.xiaozhu.com/search-duanzufang-p2-0/
http://bj.xiaozhu.com/search-duanzufang-p3-0/
As you can see, only the number after p changes, so paging can be automated; this time we scrape all of the pages.
import lxml
import requests
from bs4 import BeautifulSoup
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
}

def get_links(url):
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, "lxml")
    prices = soup.find_all("span", "result_price")             # listing prices
    names = soup.find_all("span", "result_title hiddenTxt")    # listing titles
    informations = soup.find_all("em", "hiddenTxt")            # short descriptions
    links = soup.find_all("a", "resule_img_a")                 # links to the detail pages
    for name, price, link, information in zip(names, prices, links, informations):
        data = {
            "name": name.get_text().strip(),
            "price": price.get_text().strip(),
            "href": link.get("href"),
            "information": information.get_text().replace("\n", " ").strip()
        }
        print(data)

urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(number) for number in range(1, 14)]
for url in urls:
    get_links(url)
    time.sleep(2)   # pause between pages so the site is not hit too quickly