Scraping Meituan

Contents
I. Background
II. Page analysis
III. Writing the crawler
1. Getting the city list
2. Building the list-page URL
3. Maximum page count (only the first 10 pages for now)
4. Building the detail-page URL and extracting data
IV. Saving the data
I. Background
For work I need to collect Meituan merchant information (name, phone number, address and so on) and save it in Excel format.

II. Page analysis
Following the write-up at https://blog.csdn.net/xing851483876/article/details/81842329, I work with the mobile version of Meituan directly.
On the desktop site, getting the phone number takes one extra mouse click; the mobile site skips that step, so it is a little simpler.

City list URL: http://i.meituan.com/index/changecity?cevent=imt%2Fhd%2FcityBottom
(this only needs to be fetched once to get every city)

Example merchant-list URL: http://i.meituan.com/s/wanzhou-教育?p=2
Three key parts: 1) the city pinyin abbreviation (wanzhou here, or e.g. bobaixian); 2) the keyword, 教育 (which must be URL-encoded); 3) the page number.

Example detail-page URL: http://i.meituan.com/poi/179243134
The merchant ID, 179243134, is taken from the list page.
ct_poi: 240684642564654412435083837672355283025_e8694741092540145794_v1070221787272473329__21_a%e6%95%99%e8%82%b2
There used to be a ct_poi parameter, but it later disappeared, so it is skipped for now; if it turns out to be needed again it can be picked up from the list page.
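
Since the keyword has to be percent-encoded and the merchant ID comes from the list page, here is a minimal sketch of how the two URL types are assembled (the city, keyword and page are illustrative values; urllib.parse.quote produces the %e6%95%99%e8%82%b2 form seen above):

from urllib.parse import quote

city = 'wanzhou'             # city pinyin abbreviation
keyword = quote('教育')       # -> '%E6%95%99%E8%82%B2'
page = 2
list_url = 'http://i.meituan.com/s/{}-{}?p={}'.format(city, keyword, page)
detail_url = 'http://i.meituan.com/poi/{}'.format(179243134)   # merchant ID taken from the list page
print(list_url)    # http://i.meituan.com/s/wanzhou-%E6%95%99%E8%82%B2?p=2
print(detail_url)  # http://i.meituan.com/poi/179243134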

The plan:
1) get the pinyin abbreviation of every city from the city list;
2) build the list-page URL from the city pinyin and collect the merchant links from it;
3) loop over the merchant links and extract the fields we need.

III. Writing the crawler
1. Getting the city list
(Screenshot 1)
Using the tool from https://blog.csdn.net/weixin_43420032/article/details/84646041, the captured request converts straight into Python, with the headers and cookies already filled in.
(Screenshot 2)
This gives a dict mapping each city name to its abbreviation, so later we can fetch data for whatever city and keyword the user supplies:
import requests
from bs4 import BeautifulSoup as bs

def get_cities_wap():   # fetch the cities and their pinyin abbreviations from the mobile site
    cookies = {
        '__mta': '209614381.1543978383220.1543978491990.1543978501965.5',
        '_lxsdk_cuid': '16666fc2e54c8-06bb633ea17d43-737356c-15f900-16666fc2e54c8',
        'oc': 'Ze9dLOWSIlgu7r7EbFMStrH7FxUq57MiiNsP2vGkntNcdKo_CV5R2rHC7W9jVd9dPbO4UY_R3GRmoZhCH62HUnibfEBt7ArKLhxtVp_F4MBIfn1mLfucCPiTqWKtLPjSb65K76r1y49Ol1tEWBAqjvuF08yuJ39OBE8LEAk1wYM',
        'uuid': '0089ef8aea0b44b28a39.1543568012.1.0.0',
        '_lx_utm': 'utm_source%3DBaidu%26utm_medium%3Dorganic',
        'JSESSIONID': '1xvxbfh2qrp7we79b6k37dz4f',
        'IJSESSIONID': '1xvxbfh2qrp7we79b6k37dz4f',
        'iuuid': '5AE1D264FD261C60A28BFD86F1659F01AB3097A4EC861FCCEC7662BDC2EE160F',
        '_lxsdk': '5AE1D264FD261C60A28BFD86F1659F01AB3097A4EC861FCCEC7662BDC2EE160F',
        'webp': '1',
        '__utmc': '74597006',
        '__utmz': '74597006.1543917113.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)',
        '_hc.v': 'e06c0122-49b5-981c-ec71-64d1b97be3c1.1543917118',
        'ci': '174',
        'rvct': '174%2C957%2C517%2C1',
        'cityname': '%E4%B8%83%E5%8F%B0%E6%B2%B3',
        '__utma': '74597006.1820410032.1543917113.1543917113.1543978384.2',
        'ci3': '1',
        'idau': '1',
        'i_extend': 'H__a100001__b2',
        'latlng': '39.90569,116.22299,1543978425426',
        '__utmb': '74597006.19.9.1543978504738',
        '_lxsdk_s': '1677c4847d9-0cf-fe9-ad0%7C%7C24',
    }
    
    headers = {
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Mobile Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Referer': 'http://i.meituan.com/index/changecity?cevent=imt%2Fhd%2FcityBottom',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
    }
    
    params = (
        ('cevent', 'imt/hd/cityBottom'),
    )
    
    response = requests.get('http://i.meituan.com/index/changecity', headers=headers, params=params, cookies=cookies)
    result = str(response.content,'utf-8')
    soup = bs(result,'html.parser')   # parse with BeautifulSoup
    #s1 = soup.find_all(name='a',attrs={'class':'react'})    # get all city nodes
    #https://www.cnblogs.com/cymwill/articles/7574479.html
    s1 = soup.find_all(lambda tag:tag.has_attr('class') and tag.has_attr('data-citypinyin'))
    dics = {}   # dict mapping city name to pinyin abbreviation
    for i in s1:
        city = i.text
        jianxie = i['data-citypinyin']
        dic = {city: jianxie}
        dics.update(dic)
    return dics

PS: s1 = soup.find_all(lambda tag: tag.has_attr('class') and tag.has_attr('data-citypinyin'))
Honestly, I don't fully understand how this line works!
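
For readers equally confused: BeautifulSoup's find_all also accepts a callable, which is applied to every tag and keeps the ones for which it returns True, here the tags that carry both a class and a data-citypinyin attribute. A tiny self-contained sketch of the same pattern (the HTML snippet is made up for illustration, not the real Meituan markup):

from bs4 import BeautifulSoup

html = '<a class="react" data-citypinyin="beijing">北京</a><a class="react">no pinyin</a>'
soup = BeautifulSoup(html, 'html.parser')
# keep only tags that have both a class and a data-citypinyin attribute
tags = soup.find_all(lambda tag: tag.has_attr('class') and tag.has_attr('data-citypinyin'))
print({t.text: t['data-citypinyin'] for t in tags})   # {'北京': 'beijing'}

With the function above, cities = get_cities_wap() should return a dict like {'北京': 'beijing', ...}, from which the abbreviation of the user's chosen city can be looked up.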

2. Building the list-page URL
With the city and keyword in hand, the only missing piece is the number of pages.
Once the page count is known, a for loop can fetch every page.
To be safe (read: lazy), I again used the tool above to generate the Python request code:

def getOrg(city,sw,page):   # fetch one page of the merchant list for a given city and keyword
    cookies = {
        '__mta': '209614381.1543978383220.1543997852031.1543998361599.17',
        '_lxsdk_cuid': '16666fc2e54c8-06bb633ea17d43-737356c-15f900-16666fc2e54c8',
        'oc': 'Ze9dLOWSIlgu7r7EbFMStrH7FxUq57MiiNsP2vGkntNcdKo_CV5R2rHC7W9jVd9dPbO4UY_R3GRmoZhCH62HUnibfEBt7ArKLhxtVp_F4MBIfn1mLfucCPiTqWKtLPjSb65K76r1y49Ol1tEWBAqjvuF08yuJ39OBE8LEAk1wYM',
        'uuid': '0089ef8aea0b44b28a39.1543568012.1.0.0',
        '_lx_utm': 'utm_source%3DBaidu%26utm_medium%3Dorganic',
        'JSESSIONID': '1xvxbfh2qrp7we79b6k37dz4f',
        'IJSESSIONID': '1xvxbfh2qrp7we79b6k37dz4f',
        'iuuid': '5AE1D264FD261C60A28BFD86F1659F01AB3097A4EC861FCCEC7662BDC2EE160F',
        '_lxsdk': '5AE1D264FD261C60A28BFD86F1659F01AB3097A4EC861FCCEC7662BDC2EE160F',
        'webp': '1',
        '__utmc': '74597006',
        '__utmz': '74597006.1543917113.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)',
        '_hc.v': 'e06c0122-49b5-981c-ec71-64d1b97be3c1.1543917118',
        'rvct': '174%2C957%2C517%2C1',
        'ci3': '1',
        'a2h': '3',
        'idau': '1',
        '__utma': '74597006.1820410032.1543917113.1543978512.1543997779.4',
        'i_extend': 'C_b3E240684642564654412435083837672355283025_e8694741092540145794_v1070221787272473329_a%e6%95%99%e8%82%b2GimthomepagesearchH__a100005__b4',
        'ci': '1',
        'cityname': '%E5%8C%97%E4%BA%AC',
        '__utmb': '74597006.4.9.1543997782892',
        'latlng': '39.90569,116.22299,1543998363681',
        '_lxsdk_s': '1677d706961-ed1-ece-8b8%7C%7C5',
    }
    
    headers = {
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Mobile Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
    }
    
    params = (
        ('p', page),
    )
    
    url = 'http://i.meituan.com/s/{}-{}'.format(city,sw)
    
    response = requests.get(url, headers=headers, params=params, cookies=cookies)
    result = str(response.content,'utf-8')
    soup = bs(result,'html.parser')   # parse with BeautifulSoup

    jigous = soup.find_all(name='dd',attrs={'class':'poi-list-item'})
    arrs = []   # collected [shop link, ctpoi] pairs
    for jigou in jigous:
        #jigou.find(name='span',attrs={'class':'poiname'}).text    # merchant name on the list page
        href = 'http:'+jigou.findChild('a').attrs['href']
        ctpoi = jigou.findChild('a').attrs['data-ctpoi']
        arr = [href,ctpoi]
        arrs.append(arr)
    return arrs

PS: ctpoi = jigou.findChild('a').attrs['data-ctpoi']
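
A quick usage sketch (my own example, assuming the imports from step 1 and a URL-encoded keyword as discussed in the page analysis):

from urllib.parse import quote

city = 'wanzhou'        # pinyin abbreviation looked up via get_cities_wap()
sw = quote('教育')       # search keyword, percent-encoded
arrs = getOrg(city, sw, 1)
print(len(arrs), arrs[:1])   # number of merchants on page 1 and the first [link, ctpoi] pair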

3. Maximum page count (only the first 10 pages for now)

from time import sleep

arrs = []
for page in range(1,11):    # iterate over pages 1 through 10
    sleep(2)
    pages = getOrg(city,sw,page)
    arrs += pages
    print('page {} done'.format(page))
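
The hard-coded 10 pages is just a shortcut. A hedged alternative (my own addition, assuming that a page number past the last one simply returns a list page with no poi-list-item entries rather than an error page) is to keep requesting until getOrg comes back empty:

arrs = []
page = 1
while True:
    sleep(2)
    pages = getOrg(city, sw, page)
    if not pages:                      # no merchants on this page: assume we ran past the end
        break
    arrs += pages
    print('page {} done'.format(page))
    page += 1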

4. Building the detail-page URL and extracting data
With arrs collected from the previous steps, we loop over it and take the detail link and ctpoi of each merchant in turn.
Then, same trick as before, generate the Python request code:

from openpyxl import Workbook

# arrs was assembled in step 3 above (all list pages for the chosen city and keyword)
wb = Workbook()
ws = wb.active
title = ['shop name','average price','address','phone','rating','wifi available','opening hours','city','detail dict','map link']
ws.append(title)
for arr in arrs:
#    url = arr[0]+'?ct_poi='+arr[1]
#arr = ['http://i.meituan.com/poi/79745525','035125882187008877213548425054591265479_e878979162135682910_v1070250989776906519__12_a%e6%95%99%e8%82%b2']
    url = arr[0]
    ct_poi = arr[1]
    cookies = {
        '__mta': '216309417.1543917117811.1544064676901.1544066177851.16',
        '_lxsdk_cuid': '16666fc2e54c8-06bb633ea17d43-737356c-15f900-16666fc2e54c8',
        'oc': 'Ze9dLOWSIlgu7r7EbFMStrH7FxUq57MiiNsP2vGkntNcdKo_CV5R2rHC7W9jVd9dPbO4UY_R3GRmoZhCH62HUnibfEBt7ArKLhxtVp_F4MBIfn1mLfucCPiTqWKtLPjSb65K76r1y49Ol1tEWBAqjvuF08yuJ39OBE8LEAk1wYM',
        'uuid': '0089ef8aea0b44b28a39.1543568012.1.0.0',
        '_lx_utm': 'utm_source%3DBaidu%26utm_medium%3Dorganic',
        'JSESSIONID': '1xvxbfh2qrp7we79b6k37dz4f',
        'IJSESSIONID': '1xvxbfh2qrp7we79b6k37dz4f',
        'iuuid': '5AE1D264FD261C60A28BFD86F1659F01AB3097A4EC861FCCEC7662BDC2EE160F',
        '_lxsdk': '5AE1D264FD261C60A28BFD86F1659F01AB3097A4EC861FCCEC7662BDC2EE160F',
        'webp': '1',
        '__utmc': '74597006',
        '__utmz': '74597006.1543917113.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)',
        '_hc.v': 'e06c0122-49b5-981c-ec71-64d1b97be3c1.1543917118',
        'rvct': '174%2C957%2C517%2C1',
        'ci3': '1',
        'a2h': '3',
        'ci': '1102',
        'cityname': '%E5%8D%9A%E7%99%BD%E5%8E%BF',
        'idau': '1',
        '__utma': '74597006.1820410032.1543917113.1544058293.1544064411.7',
        'latlng': '39.90569,116.22299,1544064411559',
        'i_extend': 'C_b3GimthomepagesearchH__a100016__b7',
        'webloc_geo': '39.906303%2C116.182617%2Cwgs84',
        '__utmb': '74597006.9.9.1544064692635',
        '_lxsdk_s': '167816923bd-53c-ae2-739%7C%7C15',
    }
    
    headers = {
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Mobile Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
    }
    
    params = (
        ('ct_poi', ct_poi),
    )
    
    response = requests.get(url, headers=headers, params=params, cookies=cookies)
    result = str(response.content,'utf-8')
    soup = bs(result,'html.parser')   # parse with BeautifulSoup

    poi_shopname = soup.find('h1',attrs={'class':'dealcard-brand'}).text   # shop name
    avg_price = soup.find('span',attrs={'class':'avg-price'}).text.split(':')[1]    # average price
    poi_address = soup.find('div',attrs={'class':'poi-address'}).text   # address
    phonecall = soup.find('a',attrs={'class':'react poi-info-phone'}).attrs['data-tele']    # phone number
    star_text = soup.find('em',attrs={'class':'star-text'}).text    # rating
    wifi = soup.find('dd',attrs={'class':'dd-padding kv-line'}).text[4:]    # wifi available or not
    open_time = soup.find('dd',attrs={'class':'dd-padding open-time kv-line'}).text.replace('营业时间','')  # opening hours (label stripped)
    citybtn = soup.find('a',attrs={'class':'btn btn-weak footer-citybtn'}).text     # city
    data_params = str(soup.find('div',attrs={'id':'poi-detail'}).attrs['data-params'])   # detail-info dict
    map_url = soup.find_all('a',attrs={'class':'react','rel':'nofollow'})[2].attrs['href']  # Tencent Maps link
    detail = [poi_shopname,avg_price,poi_address,phonecall,star_text,wifi,open_time,citybtn,data_params,map_url]
    ws.append(detail)
wb.save('{}-{}.xlsx'.format(city,sw))
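
One caveat: every soup.find(...) above assumes the element exists; a shop page missing, say, the phone link would raise AttributeError and abort the whole loop. A small defensive helper (my own sketch, not part of the original code) could guard against that:

def safe_text(soup, tag, attrs, default=''):
    # return the tag's text when found, otherwise a default instead of raising
    node = soup.find(tag, attrs=attrs)
    return node.text if node else default

# e.g. poi_shopname = safe_text(soup, 'h1', {'class': 'dealcard-brand'})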

The fields collected are: shop name, average price, address, phone, rating, wifi availability, opening hours, city, the detail dict, and the map link.
The map link turns out to call Tencent Maps, and the latitude/longitude can be extracted from it; the coordinates can also be found directly in the page, but since they are not needed for now they were left out.
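
If the coordinates are ever needed, one hedged way to get them (the parameter names below are guesses; adjust them to whatever the real Tencent Maps link actually contains) is to parse the map link's query string:

from urllib.parse import urlparse, parse_qs

def latlng_from_map_url(map_url):
    # look for coordinate-like parameters in the map link; the names here are hypothetical
    qs = parse_qs(urlparse(map_url).query)
    for key in ('coord', 'latlng', 'center'):
        if key in qs:
            return qs[key][0]
    return None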

IV. Saving the data
openpyxl is used: the column headers are appended once before the loop over the detail pages, and inside the loop each record's fields are put into a small list that is appended to the worksheet as one row. The code is shown above.
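
The openpyxl pattern used above, stripped to its essentials:

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(['shop name', 'phone'])            # header row
ws.append(['some shop', '010-12345678'])     # one data row per append
wb.save('demo.xlsx')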
