Python Web Scraping in Practice (2): Parsing Elements in a Web Page

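  • Sending a request with requests
  • Writing your own selector
  • Filtering content by attribute value
  • Selecting one-to-many relationships
  • Scraping paginated results
  • Simulating a mobile client to scrape images
  • Summary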

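In the previous post we parsed a local web page; in this one we parse a page served over the real network.
Goal: scrape content from the Tripadvisor site with the requests and BeautifulSoup libraries.
Tripadvisor URL: https://www.tripadvisor.cn/Attractions-g60763-Activities-New_York_City_New_York.html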

Sending a request with requests

First, import the requests and BeautifulSoup libraries.

import requests
from bs4 import BeautifulSoup

Call requests.get() to fetch the response for the given URL, then parse the response text with BeautifulSoup to get the page's HTML source.

url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text,'lxml')
print(soup)
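
A slightly more defensive version of the request step is sketched below. The helper names `parse_html` and `fetch_soup`, the 10-second `timeout`, and the use of the built-in `html.parser` (so the sketch runs without lxml) are assumptions for illustration, not part of the original code.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(html):
    """Parse an HTML string into a BeautifulSoup tree."""
    # 'html.parser' ships with Python; 'lxml' (used in this post) also works if installed.
    return BeautifulSoup(html, 'html.parser')


def fetch_soup(url, timeout=10):
    """Download a page and parse it, failing loudly on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return parse_html(response.text)
```

Calling `fetch_soup(url)` with the URL above returns the same kind of soup object as the original three lines, but a blocked or failed request raises an exception instead of silently handing an error page to the parser.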


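Writing your own selector

As in the previous post, the next step is to locate the content we want to scrape. For example, pick the title "中央公园" (Central Park) on the page, right-click it, choose Inspect, and copy its selector. Passing that selector to soup.select() returns the matching HTML.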

titles = soup.select('#taplc_attraction_coverpage_attraction_0 > div:nth-of-type(1) > div > div > div.shelf_item_container > div:nth-of-type(1) > div.poi > div > div.item.name > a')
print(titles)
[<a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="1|poi|105127" data-tpid="162" data-tpp="CoverPage" href="/Attraction_Review-g60763-d105127-Reviews-Central_Park-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">中央公园</a>]

This gives us one title, but we want all of them. Looking at the page's HTML again, the tag directly above every title has the same form: <div class="item name">.
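Let's assume that every a tag directly under a <div class="item name"> is a title we want, and change the selector accordingly. In practice, to check whether a tag really covers everything you need, copy the tag text, search for it in the inspector, and see whether the matches are exactly the content you want to scrape. Once you are comfortable with this, you can write a selector just by looking at a tag instead of copying it. Use ">" between a parent and its direct child, and use "div.classname" to pick out a div by its class. Because class="item name" holds two classes, item and name, the selector chains both: div.item.name.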

titles = soup.select('div.item.name > a')
print(titles)
[<a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="1|poi|105127" data-tpid="162" data-tpp="CoverPage" href="/Attraction_Review-g60763-d105127-Reviews-Central_Park-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">中央公园</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="1|poi|1687489" data-tpid="162" data-tpp="CoverPage" href="/Attraction_Review-g60763-d1687489-Reviews-The_National_9_11_Memorial_Museum-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">9/11纪念馆</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="1|poi|105125" data-tpid="162" data-tpp="CoverPage" href="/Attraction_Review-g60763-d105125-Reviews-The_Metropolitan_Museum_of_Art-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">大都会艺术博物馆</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="1|poi|587661" data-tpid="162" data-tpp="CoverPage" href="/Attraction_Review-g60763-d587661-Reviews-Top_of_the_Rock_Observation_Deck-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">峭石之巅观景台</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="2|poi|587661" data-tpid="175" data-tpp="CoverPage" href="/Attraction_Review-g60763-d587661-Reviews-Top_of_the_Rock_Observation_Deck-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">峭石之巅观景台</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="2|poi|143363" data-tpid="175" data-tpp="CoverPage" href="/Attraction_Review-g60763-d143363-Reviews-Staten_Island_Ferry-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">史泰登岛渡轮</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="2|poi|548557" data-tpid="175" data-tpp="CoverPage" 
href="/Attraction_Review-g60763-d548557-Reviews-Roosevelt_Island_Aerial_Tram-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">罗斯福岛棕榈泉</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="2|poi|8072300" data-tpid="175" data-tpp="CoverPage" href="/Attraction_Review-g60763-d8072300-Reviews-One_World_Observatory_World_Trade_Center-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">世贸一号观景台</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="3|poi|1687489" data-tpid="9" data-tpp="CoverPage" href="/Attraction_Review-g60763-d1687489-Reviews-The_National_9_11_Memorial_Museum-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">9/11纪念馆</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="3|poi|104370" data-tpid="9" data-tpp="CoverPage" href="/Attraction_Review-g60763-d104370-Reviews-Ellis_Island-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">埃利斯岛</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="3|poi|9868012" data-tpid="9" data-tpp="CoverPage" href="/Attraction_Review-g60763-d9868012-Reviews-World_Trade_Center_Memorial_Foundation-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">World Trade Center Memorial Foundation</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="3|poi|136072" data-tpid="9" data-tpp="CoverPage" href="/Attraction_Review-g60763-d136072-Reviews-Governors_Island_National_Monument-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">总督岛国家纪念碑</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="4|poi|272517" data-tpid="20" data-tpp="CoverPage" 
href="/Attraction_Review-g60763-d272517-Reviews-Conservatory_Garden-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">温室花园</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="4|poi|532140" data-tpid="20" data-tpp="CoverPage" href="/Attraction_Review-g60763-d532140-Reviews-Shakespeare_Garden-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">莎士比亚公园</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="4|poi|4550105" data-tpid="20" data-tpp="CoverPage" href="/Attraction_Review-g60763-d4550105-Reviews-Winter_Garden-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">玻璃花房</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="4|poi|3546109" data-tpid="20" data-tpp="CoverPage" href="/Attraction_Review-g60763-d3546109-Reviews-The_Jefferson_Market_Garden-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">The Jefferson Market Garden</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="5|poi|1687489" data-tpid="145" data-tpp="CoverPage" href="/Attraction_Review-g60763-d1687489-Reviews-The_National_9_11_Memorial_Museum-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">9/11纪念馆</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="5|poi|105125" data-tpid="145" data-tpp="CoverPage" href="/Attraction_Review-g60763-d105125-Reviews-The_Metropolitan_Museum_of_Art-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">大都会艺术博物馆</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="5|poi|107466" data-tpid="145" data-tpp="CoverPage" href="/Attraction_Review-g60763-d107466-Reviews-Frick_Collection-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', 
event, this)" target="_blank">弗里克美术收藏馆</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="5|poi|626630" data-tpid="145" data-tpp="CoverPage" href="/Attraction_Review-g60763-d626630-Reviews-Ground_Zero_Museum_Workshop-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">世贸大厦遗址博物馆工作室</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="6|poi|110164" data-tpid="150" data-tpp="CoverPage" href="/Attraction_Review-g60763-d110164-Reviews-Radio_City_Music_Hall-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">无线电城音乐大厅</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="6|poi|136028" data-tpid="150" data-tpp="CoverPage" href="/Attraction_Review-g60763-d136028-Reviews-Lincoln_Center_for_the_Performing_Arts-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">林肯表演艺术中心</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="6|poi|447348" data-tpid="150" data-tpp="CoverPage" href="/Attraction_Review-g60763-d447348-Reviews-Jazz_at_Lincoln_Center-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">林肯中心爵士乐表演</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="6|poi|505289" data-tpid="150" data-tpp="CoverPage" href="/Attraction_Review-g60763-d505289-Reviews-Gershwin_Theater-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">盖西文剧院</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="7|poi|1687489" data-tpid="40" data-tpp="CoverPage" href="/Attraction_Review-g60763-d1687489-Reviews-The_National_9_11_Memorial_Museum-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">9/11纪念馆</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="7|poi|103887" data-tpid="40" 
data-tpp="CoverPage" href="/Attraction_Review-g60763-d103887-Reviews-Statue_of_Liberty-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">自由女神像</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="7|poi|10340693" data-tpid="40" data-tpp="CoverPage" href="/Attraction_Review-g60763-d10340693-Reviews-The_Oculus-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">The Oculus</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="7|poi|143372" data-tpid="40" data-tpp="CoverPage" href="/Attraction_Review-g60763-d143372-Reviews-Alice_in_Wonderland_Statue-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">爱丽丝梦游仙境雕塑</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="8|poi|103371" data-tpid="47" data-tpp="CoverPage" href="/Attraction_Review-g60763-d103371-Reviews-Grand_Central_Terminal-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">大中央车站</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="8|poi|104365" data-tpid="47" data-tpp="CoverPage" href="/Attraction_Review-g60763-d104365-Reviews-Empire_State_Building-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">帝国大厦</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="8|poi|8072300" data-tpid="47" data-tpp="CoverPage" href="/Attraction_Review-g60763-d8072300-Reviews-One_World_Observatory_World_Trade_Center-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">世贸一号观景台</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="8|poi|105123" data-tpid="47" data-tpp="CoverPage" href="/Attraction_Review-g60763-d105123-Reviews-Rockefeller_Center-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', 
event, this)" target="_blank">洛克菲勒中心</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="9|poi|105127" data-tpid="72" data-tpp="CoverPage" href="/Attraction_Review-g60763-d105127-Reviews-Central_Park-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">中央公园</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="9|poi|519474" data-tpid="72" data-tpp="CoverPage" href="/Attraction_Review-g60763-d519474-Reviews-The_High_Line-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">高线公园</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="9|poi|136347" data-tpid="72" data-tpp="CoverPage" href="/Attraction_Review-g60763-d136347-Reviews-Bryant_Park-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">布莱恩公园</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="9|poi|136354" data-tpid="72" data-tpp="CoverPage" href="/Attraction_Review-g60763-d136354-Reviews-Washington_Square_Park-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">华盛顿广场公园</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="10|poi|136053" data-tpid="28" data-tpp="CoverPage" href="/Attraction_Review-g60763-d136053-Reviews-St_Patrick_s_Cathedral-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">圣帕提克大教堂</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="10|poi|105055" data-tpid="28" data-tpp="CoverPage" href="/Attraction_Review-g60763-d105055-Reviews-St_Paul_s_Chapel-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">圣保罗教堂</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="10|poi|146337" data-tpid="28" data-tpp="CoverPage" 
href="/Attraction_Review-g60763-d146337-Reviews-St_Thomas_Church-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">圣·托马斯教堂</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="10|poi|3339075" data-tpid="28" data-tpp="CoverPage" href="/Attraction_Review-g60763-d3339075-Reviews-Times_Square_Church-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">时代广场教会</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="11|poi|9868012" data-tpid="57" data-tpp="CoverPage" href="/Attraction_Review-g60763-d9868012-Reviews-World_Trade_Center_Memorial_Foundation-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">World Trade Center Memorial Foundation</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="11|poi|3546757" data-tpid="57" data-tpp="CoverPage" href="/Attraction_Review-g60763-d3546757-Reviews-42nd_Street-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">42</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="11|poi|615518" data-tpid="57" data-tpp="CoverPage" href="/Attraction_Review-g60763-d615518-Reviews-Park_Avenue-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">帕克街</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="11|poi|615543" data-tpid="57" data-tpp="CoverPage" href="/Attraction_Review-g60763-d615543-Reviews-Duffy_Square-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">杜菲广场</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="12|poi|136066" data-tpid="87" data-tpp="CoverPage" href="/Attraction_Review-g60763-d136066-Reviews-Manhattan_Bridge-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" 
target="_blank">曼哈顿桥</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="12|poi|110276" data-tpid="87" data-tpp="CoverPage" href="/Attraction_Review-g60763-d110276-Reviews-New_York_City_Fire_Museum-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">纽约消防博物馆</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="12|poi|4358744" data-tpid="87" data-tpp="CoverPage" href="/Attraction_Review-g60763-d4358744-Reviews-Williamsburg_Bridge-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">威廉斯堡大桥</a>, <a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="12|poi|1800096" data-tpid="87" data-tpp="CoverPage" href="/Attraction_Review-g60763-d1800096-Reviews-Queensboro_Bridge-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">皇后区大桥</a>]

As you can see, this returns every title on the page.
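
To see the hand-written selector in isolation, here is a minimal sketch on a made-up snippet. The HTML, the link targets and the use of `html.parser` are hypothetical; only the selector `div.item.name > a` comes from the post.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup modelled on the listing structure.
html = '''
<div class="poi">
  <div class="item name"><a href="/central-park.html">Central Park</a></div>
  <div class="item rating"><a href="/reviews.html">Reviews</a></div>
</div>
<div class="poi">
  <div class="item name"><a href="/high-line.html">The High Line</a></div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# class="item name" means two classes, so both are chained with dots;
# ">" restricts the match to <a> tags that are direct children of that div.
links = soup.select('div.item.name > a')
titles = [a.get_text() for a in links]
hrefs = [a.get('href') for a in links]
print(titles)
print(hrefs)
```

The div with class "item rating" is skipped because it lacks the `name` class, which is the whole point of chaining both class names.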

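Filtering content by attribute value

Now for the images. The page contains many images of different sizes, and we want only those of one particular size. Passing select() an argument of the form 'tag[attribute="value"]' keeps only the tags whose attribute has exactly that value. For example, 'img[width="200"]' returns every image on the page whose width attribute is 200.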

img = soup.select('img[width="200"]')
print(img)
[<img alt="中央公园" class="photo_image" height="111" id="lazyload_508582835_2" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="9\/11纪念馆" class="photo_image" height="111" id="lazyload_508582835_3" src="https://static.tacdn.com/img2/x.gif" style="height: 133px; width: 200px;" width="200"/>, <img alt="大都会艺术博物馆" class="photo_image" height="111" id="lazyload_508582835_4" src="https://static.tacdn.com/img2/x.gif" style="height: 111px; width: 241px;" width="200"/>, <img alt="峭石之巅观景台" class="photo_image" height="111" id="lazyload_508582835_5" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="峭石之巅观景台" class="photo_image" height="111" id="lazyload_508582835_6" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="史泰登岛渡轮" class="photo_image" height="111" id="lazyload_508582835_7" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="罗斯福岛棕榈泉" class="photo_image" height="111" id="lazyload_508582835_8" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="世贸一号观景台" class="photo_image" height="111" id="lazyload_508582835_9" src="https://static.tacdn.com/img2/x.gif" style="height: 356px; width: 200px;" width="200"/>, <img alt="9\/11纪念馆" class="photo_image" height="111" id="lazyload_508582835_10" src="https://static.tacdn.com/img2/x.gif" style="height: 133px; width: 200px;" width="200"/>, <img alt="埃利斯岛" class="photo_image" height="111" id="lazyload_508582835_11" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="World Trade Center Memorial Foundation" class="photo_image" height="111" id="lazyload_508582835_12" src="https://static.tacdn.com/img2/x.gif" style="height: 266px; width: 200px;" width="200"/>, <img alt="总督岛国家纪念碑" class="photo_image" height="111" id="lazyload_508582835_13" src="https://static.tacdn.com/img2/x.gif" 
style="height: 150px; width: 200px;" width="200"/>, <img alt="温室花园" class="photo_image" height="111" id="lazyload_508582835_14" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="莎士比亚公园" class="photo_image" height="111" id="lazyload_508582835_15" src="https://static.tacdn.com/img2/x.gif" style="height: 267px; width: 200px;" width="200"/>, <img alt="玻璃花房" class="photo_image" height="111" id="lazyload_508582835_16" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="The Jefferson Market Garden" class="photo_image" height="111" id="lazyload_508582835_17" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="中央公园" class="photo_image" height="111" id="lazyload_508582835_18" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="高线公园" class="photo_image" height="111" id="lazyload_508582835_19" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="Seventh Avenue" class="photo_image" height="111" id="lazyload_508582835_20" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="The High Bridge" class="photo_image" height="111" id="lazyload_508582835_21" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="9\/11纪念馆" class="photo_image" height="111" id="lazyload_508582835_22" src="https://static.tacdn.com/img2/x.gif" style="height: 133px; width: 200px;" width="200"/>, <img alt="大都会艺术博物馆" class="photo_image" height="111" id="lazyload_508582835_23" src="https://static.tacdn.com/img2/x.gif" style="height: 111px; width: 241px;" width="200"/>, <img alt="弗里克美术收藏馆" class="photo_image" height="111" id="lazyload_508582835_24" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="世贸大厦遗址博物馆工作室" class="photo_image" height="111" 
id="lazyload_508582835_25" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="无线电城音乐大厅" class="photo_image" height="111" id="lazyload_508582835_26" src="https://static.tacdn.com/img2/x.gif" style="height: 303px; width: 200px;" width="200"/>, <img alt="林肯表演艺术中心" class="photo_image" height="111" id="lazyload_508582835_27" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="林肯中心爵士乐表演" class="photo_image" height="111" id="lazyload_508582835_28" src="https://static.tacdn.com/img2/x.gif" style="height: 133px; width: 200px;" width="200"/>, <img alt="盖西文剧院" class="photo_image" height="111" id="lazyload_508582835_29" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="9\/11纪念馆" class="photo_image" height="111" id="lazyload_508582835_30" src="https://static.tacdn.com/img2/x.gif" style="height: 133px; width: 200px;" width="200"/>, <img alt="自由女神像" class="photo_image" height="111" id="lazyload_508582835_31" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="The Oculus" class="photo_image" height="111" id="lazyload_508582835_32" src="https://static.tacdn.com/img2/x.gif" style="height: 266px; width: 200px;" width="200"/>, <img alt="爱丽丝梦游仙境雕塑" class="photo_image" height="111" id="lazyload_508582835_33" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="大中央车站" class="photo_image" height="111" id="lazyload_508582835_34" src="https://static.tacdn.com/img2/x.gif" style="height: 134px; width: 200px;" width="200"/>, <img alt="帝国大厦" class="photo_image" height="111" id="lazyload_508582835_35" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="世贸一号观景台" class="photo_image" height="111" id="lazyload_508582835_36" src="https://static.tacdn.com/img2/x.gif" style="height: 356px; width: 200px;" width="200"/>, <img alt="洛克菲勒中心" 
class="photo_image" height="111" id="lazyload_508582835_37" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="中央公园" class="photo_image" height="111" id="lazyload_508582835_38" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="高线公园" class="photo_image" height="111" id="lazyload_508582835_39" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="布莱恩公园" class="photo_image" height="111" id="lazyload_508582835_40" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="华盛顿广场公园" class="photo_image" height="111" id="lazyload_508582835_41" src="https://static.tacdn.com/img2/x.gif" style="height: 211px; width: 200px;" width="200"/>, <img alt="圣帕提克大教堂" class="photo_image" height="111" id="lazyload_508582835_42" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="圣保罗教堂" class="photo_image" height="111" id="lazyload_508582835_43" src="https://static.tacdn.com/img2/x.gif" style="height: 255px; width: 200px;" width="200"/>, <img alt="圣·托马斯教堂" class="photo_image" height="111" id="lazyload_508582835_44" src="https://static.tacdn.com/img2/x.gif" style="height: 267px; width: 200px;" width="200"/>, <img alt="时代广场教会" class="photo_image" height="111" id="lazyload_508582835_45" src="https://static.tacdn.com/img2/x.gif" style="height: 149px; width: 200px;" width="200"/>, <img alt="World Trade Center Memorial Foundation" class="photo_image" height="111" id="lazyload_508582835_46" src="https://static.tacdn.com/img2/x.gif" style="height: 266px; width: 200px;" width="200"/>, <img alt="第42街" class="photo_image" height="111" id="lazyload_508582835_47" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="帕克街" class="photo_image" height="111" id="lazyload_508582835_48" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; 
width: 200px;" width="200"/>, <img alt="杜菲广场" class="photo_image" height="111" id="lazyload_508582835_49" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="曼哈顿桥" class="photo_image" height="111" id="lazyload_508582835_50" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="纽约消防博物馆" class="photo_image" height="111" id="lazyload_508582835_51" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>, <img alt="威廉斯堡大桥" class="photo_image" height="111" id="lazyload_508582835_52" src="https://static.tacdn.com/img2/x.gif" style="height: 333px; width: 200px;" width="200"/>, <img alt="皇后区大桥" class="photo_image" height="111" id="lazyload_508582835_53" src="https://static.tacdn.com/img2/x.gif" style="height: 150px; width: 200px;" width="200"/>]
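
The attribute filter can be tried on a small made-up snippet. The sketch below uses hypothetical image tags and file names; it shows that only tags whose `width` attribute equals "200" are returned, and how to read other attributes from each match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two 200-pixel-wide photos and one small logo.
html = '''
<img alt="Central Park" width="200" src="a.jpg">
<img alt="Site logo" width="50" src="logo.png">
<img alt="The High Line" width="200" src="b.jpg">
'''

soup = BeautifulSoup(html, 'html.parser')

# [width="200"] matches on the attribute value exactly, as a string.
images = soup.select('img[width="200"]')
alts = [img.get('alt') for img in images]
srcs = [img.get('src') for img in images]
print(alts)
print(srcs)
```

Note that the match is made against the `width` attribute only; a width set through `style` would not be picked up by this selector.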

Selecting one-to-many relationships

Next we want to scrape the category labels listed for each attraction.

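As in the previous post, for a one-to-many relationship the selector should target the parent tag. The parent of the category tags is <div class="p13n_reasoning_v2">, so the code below collects the categories of every attraction. The commas between categories come back as strings of their own, so we filter them out of each list.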

cates = soup.select('div.p13n_reasoning_v2')
for cate in cates:
    # stripped_strings yields every text fragment under the parent tag;
    # building a new list avoids removing items from a list while iterating over it
    cate_list = [s for s in cate.stripped_strings if s != ',']
    print(cate_list)
['公园', '景观步行区', '景点与地标']
['历史景点', '纪念碑与雕像', '专业博物馆', '景点与地标']
['艺术博物馆', '景点与地标']
['观察台与观景塔', '瞭望台']
['景点与地标']
[]
['建筑物', '景点与地标']
['公园', '景观步行区']
['建筑物', '观察台与观景塔', '景点与地标']
['景点与地标']
['艺术博物馆']
[]
['公园']
['纪念碑与雕像', '景点与地标']
['教堂与大教堂']
[]
['建筑物', '观察台与观景塔']
['专业博物馆']
['轮渡']
['剧院', '景点与地标']
['专业博物馆']
['建筑物', '景点与地标']
[]
['区域', '景点与地标']
['竞技场与体育馆']
['圣地与宗教景点', '艺术博物馆']
[]
['专业博物馆']
[]
[]
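
The parent-tag approach can be checked on a small example. In the sketch below the HTML is hypothetical (only the class name `p13n_reasoning_v2` comes from the page), and the helper name `categories` is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each parent div holds zero or more category links,
# separated by literal commas in the text between the tags.
html = '''
<div class="p13n_reasoning_v2"><a>Parks</a>, <a>Scenic Walking Areas</a>, <a>Landmarks</a></div>
<div class="p13n_reasoning_v2"><a>Art Museums</a></div>
<div class="p13n_reasoning_v2"></div>
'''

soup = BeautifulSoup(html, 'html.parser')


def categories(parent):
    """Return the text of every child under one parent, minus comma separators."""
    return [text for text in parent.stripped_strings if text != ',']


all_categories = [categories(div) for div in soup.select('div.p13n_reasoning_v2')]
print(all_categories)
```

Selecting the parent keeps each attraction's categories grouped in one list, including an empty list for an attraction without any, which selecting the child tags directly would lose.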

Scraping paginated results

Next, let's see how to scrape every page of results. Clicking through to pages 2, 3 and 4 shows that the URLs follow a pattern.

Page 2: https://www.tripadvisor.cn/Attractions-g60763-Activities-oa30-New_York_City_New_York.html#FILTERED_LIST
Page 3: https://www.tripadvisor.cn/Attractions-g60763-Activities-oa60-New_York_City_New_York.html#FILTERED_LIST
Page 4: https://www.tripadvisor.cn/Attractions-g60763-Activities-oa90-New_York_City_New_York.html#FILTERED_LIST

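The URLs are identical except for the "oaxx" part. Our guess is that "oaxx" encodes the page offset: the number starts at 30 on page 2 and grows by 30 with each page. We can use Python string formatting to build the list of page URLs (the first page, whose URL has no "oa" segment, is not part of this list).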

urls = ['https://www.tripadvisor.cn/Attractions-g60763-Activities-oa{}-New_York_City_New_York.html#FILTERED_LIST'.format(str(i)) for i in range(30,1141,30)]
print(urls)
['https://www.tripadvisor.cn/Attractions-g60763-Activities-oa30-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa60-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa90-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa120-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa150-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa180-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa210-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa240-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa270-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa300-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa330-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa360-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa390-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa420-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa450-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa480-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa510-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa540-New_York_City_New_York.html#FILTERED_LIST', 
'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa570-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa600-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa630-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa660-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa690-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa720-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa750-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa780-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa810-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa840-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa870-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa900-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa930-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa960-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa990-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa1020-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa1050-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa1080-New_York_City_New_York.html#FILTERED_LIST', 
'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa1110-New_York_City_New_York.html#FILTERED_LIST', 'https://www.tripadvisor.cn/Attractions-g60763-Activities-oa1140-New_York_City_New_York.html#FILTERED_LIST']
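
If the first page should be part of the same list, the offset logic can be wrapped in a small helper. This is a sketch: the names `BASE` and `page_urls` are made up here, the page count of 39 is simply the first page plus the 38 offsets generated above, and the `#FILTERED_LIST` fragment is left out because URL fragments are not sent to the server.

```python
BASE = 'https://www.tripadvisor.cn/Attractions-g60763-Activities-{}New_York_City_New_York.html'


def page_urls(page_count, page_size=30):
    """Build the URL of every result page, starting with the first one."""
    urls = []
    for page in range(page_count):
        offset = page * page_size
        # The first page has no "oa" segment; later pages use oa30, oa60, ...
        segment = f'oa{offset}-' if offset else ''
        urls.append(BASE.format(segment))
    return urls


urls = page_urls(39)
print(len(urls))
print(urls[0])
print(urls[1])
```

Keeping the offset arithmetic in one place makes it easy to adjust if the site changes its page size or the number of pages.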

Next we visit each URL in the list and parse its content. First, wrap the scraping code from earlier in a function.

def get_attractions(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('div.listing_title > a')
    cates = soup.select('div.p13n_reasoning_v2')
    # zip pairs the n-th title with the n-th category block
    for title, cate in zip(titles, cates):
        # drop the comma separators without mutating a list during iteration
        cate_list = [s for s in cate.stripped_strings if s != ',']
        data = {
            'title': title.get_text(),
            'cate': cate_list
        }
        print(data)
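
One thing to watch with `zip(titles, cates)`: it pairs items purely by position, so if some listing has no category block, every later title is matched with the wrong categories. A sketch of a sturdier variant follows, assuming each listing sits in its own container element. The container class `div.listing`, the sample HTML and the helper name `parse_listings` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second listing has no category block at all.
html = '''
<div class="listing">
  <div class="listing_title"><a>Fifth Avenue</a></div>
  <div class="p13n_reasoning_v2"><a>Landmarks</a></div>
</div>
<div class="listing">
  <div class="listing_title"><a>Unlabelled Place</a></div>
</div>
<div class="listing">
  <div class="listing_title"><a>Museum of Modern Art</a></div>
  <div class="p13n_reasoning_v2"><a>Art Museums</a>, <a>Museums</a></div>
</div>
'''


def parse_listings(soup):
    """Read title and categories from inside each listing container."""
    records = []
    for listing in soup.select('div.listing'):  # hypothetical container class
        title = listing.select_one('div.listing_title > a')
        cate = listing.select_one('div.p13n_reasoning_v2')
        if title is None:
            continue  # skip containers that are not real listings
        cate_list = [s for s in cate.stripped_strings if s != ','] if cate else []
        records.append({'title': title.get_text(strip=True), 'cate': cate_list})
    return records


records = parse_listings(BeautifulSoup(html, 'html.parser'))
for record in records:
    print(record)
```

Because both lookups happen inside the same container, a missing category block yields an empty list for that listing only and cannot shift the others.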

Then scrape each page's URL in turn. Some sites have anti-scraping measures such as limits on request frequency, so we call time.sleep() to pause between consecutive requests.

import time

urls = ['https://www.tripadvisor.cn/Attractions-g60763-Activities-oa{}-New_York_City_New_York.html#FILTERED_LIST'.format(str(i)) for i in range(30,1141,30)]
for url in urls:
    get_attractions(url)
    time.sleep(2)  # wait 2 seconds between requests
{'title': '美国自然历史博物馆', 'cate': ['自然历史博物馆']}
{'title': '第五大道', 'cate': ['景点与地标']}
{'title': '现代艺术博物馆', 'cate': ['艺术博物馆']}
...
{'title': '天空剧场', 'cate': ['观景台与天文台']}
{'title': '香味博物馆', 'cate': ['特色博物馆']}
{'title': '美国圣经会图书馆', 'cate': ['图书馆', '特色博物馆']}
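
The crawl loop can be made a little more forgiving with a random extra delay and per-page error handling. This is a sketch under assumptions: the function name `crawl`, the `jitter` parameter and the injectable `fetch` and `sleep` arguments are illustrative choices, not something the original code uses.

```python
import random
import time


def crawl(urls, fetch, delay=2.0, jitter=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index:  # no need to wait before the very first request
            sleep(delay + random.uniform(0, jitter))
        try:
            results.append(fetch(url))
        except Exception as error:  # keep going if a single page fails
            print(f'skipped {url}: {error}')
    return results
```

With the function from this section it could be called as `crawl(urls, get_attractions)`. Passing `fetch` and `sleep` in as arguments also lets the loop be exercised with stand-ins, without touching the network or actually waiting.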

Simulating a mobile client to scrape images

If we scrape this site's images the same way as before, every image comes back with the same link.

url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')
imgs = soup.select('div.centering_wrapper > img')
for img in imgs:
    print(img.get('src'))
https://static.tacdn.com/img2/x.gif
https://static.tacdn.com/img2/x.gif
https://static.tacdn.com/img2/x.gif
...

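The reason is that the site uses JavaScript to control how images are displayed: the HTML that requests receives contains only a placeholder image, which in effect works as an anti-scraping measure. One way around this is to look at the mobile version of the page and scrape that instead, since mobile pages are often less heavily protected.
To view the mobile version of a page, right-click in Chrome and choose Inspect, click the device-toggle button in the top-left corner of the DevTools panel to switch to mobile emulation, then pick a device from the selector at the top left of the page. After a refresh, the page is served in its mobile version.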
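
Before switching to the mobile page, it can be worth checking whether the real image URL is already present in another attribute of the img tag, as some lazy-loading setups do. The original post does not take this route, and whether this site does so is not known; the attribute names, the sample HTML and the helper name `real_image_url` below are all hypothetical.

```python
from bs4 import BeautifulSoup

PLACEHOLDER = 'https://static.tacdn.com/img2/x.gif'
# Hypothetical attribute names; inspect the real page to see which, if any, it uses.
CANDIDATE_ATTRS = ('data-src', 'data-lazyurl', 'data-original')


def real_image_url(img):
    """Return the first non-placeholder URL found on an <img> tag, or None."""
    for attr in CANDIDATE_ATTRS:
        value = img.get(attr)
        if value:
            return value
    src = img.get('src')
    return None if src == PLACEHOLDER else src


# Hypothetical markup for demonstration only.
html = '''
<img src="https://static.tacdn.com/img2/x.gif" data-src="https://example.com/park.jpg">
<img src="https://static.tacdn.com/img2/x.gif">
<img src="https://example.com/direct.jpg">
'''
soup = BeautifulSoup(html, 'html.parser')
found = [real_image_url(img) for img in soup.select('img')]
print(found)
```

A `None` in the result marks an image whose URL is only filled in by script, which is the case the rest of this section tries to work around.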
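Then open the Network tab, pick any request, look at its Request Headers and copy the User-Agent value. In the code, create a headers dictionary with a 'User-Agent' key and paste in the copied value. Spoofing the User-Agent this way makes the request look as if it came from a phone.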

headers = {
    'User-Agent':'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1'
}
url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
# headers must be passed by keyword; the second positional argument of requests.get() is params
wb_data = requests.get(url, headers=headers)
soup = BeautifulSoup(wb_data.text, 'lxml')
imgs = soup.select('div.centering_wrapper > img')
for img in imgs:
    print(img.get('src'))

(The site now applies the same treatment to its mobile pages, so the links scraped this way are still all identical.)
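
One detail matters when sending custom headers: the signature is `requests.get(url, params=None, **kwargs)`, so a dictionary passed as the second positional argument is treated as query parameters, not headers. The sketch below builds both variants without sending them, so the difference can be inspected offline; the variable names `as_headers` and `as_params` are made up for illustration.

```python
import requests

url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1'
}

# Prepare the requests without sending anything over the network.
as_headers = requests.Request('GET', url, headers=headers).prepare()
as_params = requests.Request('GET', url, params=headers).prepare()

# Passed as headers: the User-Agent is set on the request.
print(as_headers.headers['User-Agent'] == headers['User-Agent'])
# Passed as params: no User-Agent header, and the value ends up in the query string.
print('User-Agent' in as_params.headers)
print('?User-Agent=' in as_params.url)
```

So the mobile User-Agent only takes effect when it is passed as `headers=headers`.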

Summary

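  • On the real network, we send a request with requests to get the page source, then parse it with BeautifulSoup.
  • When locating content with select(), there is no need to write out a long selector; it only has to describe exactly the content we want. In other words, we need to find a unique identifier for that content.
    The first approach filters by a div's class combined with the child tag name:

    soup.select('div.classname > child_tag')

    The second filters by a tag's attribute value:

    soup.select('tag[attribute="value"]')
  • For a one-to-many relationship, the selector should target the corresponding parent tag.
  • For paginated scraping, first work out the pattern in each page's URL, then build a list of URLs and scrape them one by one.
  • When a site is hard to scrape, consider simulating a mobile client and scraping the mobile version of the page.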
