Scraping with BeautifulSoup

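Just before the New Year holiday a nasty requirement landed on my desk: pull all the merchant data from a certain review site and store it in a database. In practice it turned out to be fairly simple.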

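First, copy the city data from the review site. You could of course write a script to scrape it as well, but I didn't bother. The scripts I used for the actual scraping are shared below.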

City list:

alashan|57|阿拉善

anshan|58|鞍山

anqing|117|安庆

anhuisuzhou|121|宿州

anyang|164|安阳

aba|255|阿坝

anshun|261|安顺

ali|288|阿里

ankang|297|安康

akesudiqu|332|阿克苏地区

aletaidiqu|338|阿勒泰地区

macau|342|澳门

alaer|389|阿拉尔

australia|2318|澳大利亚其他

auckland|2384|奥克兰

orlando|2401|奥兰多

agra|2410|阿格拉

antwerp|2422|安特卫普

amsterdam|2428|阿姆斯特丹

antalya|2445|安塔丽亚

ankara|2446|安卡拉

athens|2455|雅典

edinburgh|2465|爱丁堡

alexandria|2473|亚历山大

aswan|2474|亚斯文

ethiopia|2496|埃塞俄比亚

alishan|2503|阿里山

beijing|2|北京

baoding|29|保定

baotou|47|包头

bayannaoer|56|巴彦淖尔

benxi|60|本溪

baishan|75|白山

baicheng|77|白城

bengbu|112|蚌埠

bozhou|124|亳州

binzhou|158|滨州

beihai|228|北海

baise|233|百色

bazhong|253|巴中

bijiediqu|264|毕节地区

baoshan|270|保山

baoji|291|宝鸡

baiyin|302|白银

boertala|330|博尔塔拉

bayinguoleng|331|巴音郭楞

beitun|346|北屯

baisha|390|白沙

baoting|391|保亭

bangkok|2342|曼谷

pattaya|2344|芭堤雅

pai|2349|拜县

bali|2351|巴厘岛

bandung|2352|万隆

boracay|2355|长滩岛

palawan|2357|巴拉望岛

bohol|2358|薄荷岛

busan|2370|釜山

hokkaido|2375|北海道

brisbane|2381|布里斯班

paris|2388|巴黎

boston|2403|波士顿

brussels|2420|布鲁塞尔

bruges|2421|布鲁日

berlin|2423|柏林

prague|2431|布拉格

brno|2433|布尔诺

porto|2435|波尔图

bern|2443|伯尔尼

barcelona|2449|巴塞罗纳

budapest|2457|布达佩斯

pisa|2462|比萨

pretoria|2480|比勒陀利亚

buenosaires|2485|布宜诺斯艾利斯

brunei|2492|文莱

chengdu|8|成都

chongqing|9|重庆

chengde|31|承德

cangzhou|32|沧州

changzhi|38|长治

chifeng|49|赤峰

chaoyang|68|朝阳

changchun|70|长春

changzhou|93|常州

chuzhou|119|滁州

chaohu|122|巢湖

chizhou|125|池州

changde|197|常德

chenzhou|200|郴州

chaozhou|221|潮州

chuxiongzhou|272|楚雄州

changdudiqu|284|昌都地区

changjizhou|329|昌吉州

changsha|344|长沙

changjiang|392|昌江

chengmai|393|澄迈县

chongzuo|394|崇左

cixi|421|慈溪

cangnan|911|苍南

changle|981|长乐

cambodia|2316|柬埔寨其他

chiangmai|2345|清迈

chiangrai|2348|清莱

boracay|2355|长滩岛

cebu|2356|宿雾

okinawa|2377|冲绳

canberra|2382|堪培拉

cairns|2383|凯恩斯

christchurch|2387|基督城

cannes|2391|戛纳

chicago|2400|芝加哥

cologne|2425|科隆

creteisland|2453|克里特

cambridge|2466|剑桥

cairo|2472|开罗

casablanca|2477|卡萨布兰卡

capetown|2478|开普敦

cancun|2482|坎坤

cuzco|2487|库斯科

costarica|2500|哥斯达黎加

dalian|19|大连

datong|36|大同

dandong|61|丹东

daqing|84|大庆

daxinganling|91|大兴安岭

dongying|147|东营

dezhou|156|德州

dongguan|219|东莞

deyang|241|德阳

dazhou|251|达州

dali|277|大理

dehong|278|德宏

diqing|281|迪庆

dingxi|309|定西

danzhou|358|儋州

dingan|395|定安县

dongfang|396|东方

tokyo|2372|东京

osaka|2374|大阪

dijon|2394|第戎

tahiti|2405|大溪地

delhi|2407|新德里

toronto|2413|多伦多

turin|2463|都灵

dublin|2470|都柏林

eerduosi|51|鄂尔多斯

ezhou|181|鄂州

enshizhou|188|恩施州

edinburgh|2465|爱丁堡

ethiopia|2496|埃塞俄比亚

fuzhou|14|福州

fushun|59|抚顺

fuxin|64|阜新

fuyang|120|阜阳

jiangxifuzhou|143|抚州

foshan|208|佛山

fangchenggang|229|防城港

fenghua|422|奉化

fuqing|433|福清

fuyangfy|869|富阳

fuding|1031|福鼎

philippines|2327|菲律宾其他

fiji|2328|斐济

france|2331|法国其他

busan|2370|釜山

fujisan|2376|富士山

frankfurt|2426|法兰克福

florence|2459|佛罗伦萨

fukuoka|2505|福冈

guangzhou|4|广州

ganzhou|140|赣州

guilin|226|桂林

guigang|231|贵港

guangxiyulin|232|玉林

guangyuan|243|广元

guangan|250|广安

ganzi|256|甘孜

guiyang|258|贵阳

gannanzhou|312|甘南

guoluo|318|果洛

guyuan|324|固原

guowai|343|国外其他

kaohsiung|2337|高雄

goldcoast|2380|黄金海岸

gothenburg|2437|哥德堡

geneva|2440|日内瓦

costarica|2500|哥斯达黎加

hangzhou|3|杭州

haikou|23|海口

handan|27|邯郸

hengshui|34|衡水

huhehaote|46|呼和浩特

hulunbeier|52|呼伦贝尔

huludao|69|葫芦岛

haerbin|79|哈尔滨

hegang|82|鹤岗

heihe|89|黑河

huaian|96|淮安

huzhou|103|湖州

hefei|110|合肥

huainan|113|淮南

huaibei|115|淮北

huangshan|118|黄山

heze|159|菏泽

hebi|165|鹤壁

huangshi|177|黄石

huanggang|185|黄冈

hengyang|194|衡阳

huaihua|202|怀化

huizhou|213|惠州

heyuan|216|河源

hezhou|234|贺州

hechi|235|河池

honghe|273|红河

hanzhong|295|汉中

haidong|314|海东

haibei|315|海北

huangnan|316|黄南

haixi|320|海西

hamidiqu|328|哈密地区

hetiandiqu|335|和田地区

hongkong|341|香港

hainanzhou|411|海南州

korea|2314|韩国其他

hualien|2336|花莲

hochiminh|2366|胡志明市

hanoi|2367|河内

haiphong|2368|海防市

hokkaido|2375|北海道

hakone|2378|箱根

goldcoast|2380|黄金海岸

queenstown|2385|皇后镇

wellington|2386|惠灵顿

hawaii|2404|夏威夷

hamburg|2424|汉堡

thehague|2430|海牙

jinan|22|济南

jincheng|39|晋城

jinzhong|41|晋中

jinzhou|62|锦州

jilin|71|吉林

jixi|81|鸡西

jiamusi|86|佳木斯

jiaxing|102|嘉兴

jinhua|105|金华

jingdezhen|135|景德镇

jiujiang|137|九江

jian|141|吉安

jiangxiyichun|142|宜春

jiangxifuzhou|143|抚州

jining|150|济宁

jiaozuo|167|焦作

jinmen|182|荆门

jingzhou|184|荆州

jiangmen|209|江门

jieyang|222|揭阳

jiayuguan|300|嘉峪关

jinchang|301|金昌

jiuquan|307|酒泉

jiyuan|397|济源

jingjian|853|靖江

jinjiang|1009|晋江

japan|2315|日本其他

cambodia|2316|柬埔寨其他

jakarta|2353|雅加达

kualalumpur|2359|吉隆坡

phnompenh|2364|金边

jeju|2371|济州岛

kyoto|2373|京都

christchurch|2387|基督城

sanfrancisco|2396|旧金山

jaipur|2411|斋浦尔

cambridge|2466|剑桥

killarney|2471|基拉尼

johannesburg|2479|约翰内斯堡

jordan|2495|约旦

jamaica|2501|牙买加

kaifeng|161|开封

kunming|267|昆明

kelamayi|326|克拉玛依

kezilesu|333|克孜勒苏

kashidiqu|334|喀什地区

kunshan|416|昆山

korea|2314|韩国其他

kaohsiung|2337|高雄

kohphiphi|2347|皮皮岛

kohsamet|2350|沙美岛

kualalumpur|2359|吉隆坡

kyoto|2373|京都

canberra|2382|堪培拉

cairns|2383|凯恩斯

kenting|2406|垦丁

cologne|2425|科隆

karlovyvary|2432|卡罗维瓦立

creteisland|2453|克里特

killarney|2471|基拉尼

cairo|2472|开罗

casablanca|2477|卡萨布兰卡

capetown|2478|开普敦

cancun|2482|坎坤

cuzco|2487|库斯科

kenya|2497|肯尼亚

langfang|33|廊坊

linfen|44|临汾

lvliang|45|吕梁

liaoyang|65|辽阳

liaoyuan|73|辽源

lianyuangang|95|连云港

lishui|109|丽水

liuan|123|六安

longyan|132|龙岩

laiwu|154|莱芜

linyi|155|临沂

liaocheng|157|聊城

luoyang|162|洛阳

luohe|170|漯河

loudi|203|娄底

liuzhou|225|柳州

luzhou|240|泸州

leshan|246|乐山

liangshan|257|凉山

liupanshui|259|六盘水

lijiang|279|丽江

linchang|282|临沧

lasa|283|拉萨

linzhi|289|林芝地区

lanzhou|299|兰州

longnan|310|陇南

linxiazhou|311|临夏州

laibin|398|来宾

ledong|399|乐东

lingao|400|临高县

lingshui|401|陵水

liyang|867|溧阳

linan|868|临安

yueqing|905|乐清

liuhai|1015|龙海

liuyang|1376|浏阳

langkawi|2361|兰卡威

lyon|2393|里昂

losangeles|2397|洛杉矶

lasvegas|2398|拉斯维加斯

rotterdam|2429|鹿特丹

lisbon|2434|里斯本

luzern|2442|卢塞恩

rome|2458|罗马

london|2464|伦敦

liverpool|2469|利物浦

luxor|2475|卢克索

riodejaneiro|2483|里约热内卢

lima|2488|利马

laos|2490|老挝

lebanon|2491|黎巴嫩

mudanjiang|88|牡丹江

maanshan|114|马鞍山

maoming|211|茂名

meizhou|214|梅州

mianyang|242|绵阳

meishan|248|眉山

macau|342|澳门

malaysia|2312|马来西亚其他

melbourne|2322|墨尔本

maldives|2324|马尔代夫

mauritius|2329|毛里求斯

unitedstates|2332|美国其他

bangkok|2342|曼谷

manila|2354|马尼拉

marseille|2392|马赛

miami|2402|迈阿密

mumbai|2409|孟买

montreal|2412|蒙特娄

munich|2427|慕尼黑

malmo|2438|马尔默

pamukkale|2448|棉花堡

madrid|2450|马德里

majorca|2452|马略卡岛

mykonos|2456|米科诺斯

manchester|2467|曼彻斯特

marrakech|2476|马拉喀什

mexicocity|2481|墨西哥城

machupicchu|2489|马丘比丘

myanmar|2493|缅甸

madagascar|2498|马达加斯加

nanjing|5|南京

ningbo|11|宁波

nantong|94|南通

nanping|131|南平

ningde|133|宁德

nanchang|134|南昌

nanyang|172|南阳

nanning|224|南宁

neijiang|245|内江

nanchong|247|南充

nujiang|280|怒江

naqu|287|那曲

ninghai|2308|宁海

newzealand|2319|新西兰其他

nepal|2333|尼泊尔

newtaipei|2340|新北

nice|2390|尼斯

newyork|2395|纽约

niagarafalls|2414|尼亚加拉瀑布

naples|2461|那不勒斯

oxford|2468|牛津

riyuetan|2504|南投

osaka|2374|大阪

okinawa|2377|冲绳

orlando|2401|奥兰多

ottawa|2416|渥太华

oxford|2468|牛津

panjin|66|盘锦

putian|127|莆田

pingxiang|136|萍乡

pingdingshan|163|平顶山

puyang|168|濮阳

panzhihua|239|攀枝花

puer|275|普洱

pingliang|306|平凉

pingyang|908|平阳

philippines|2327|菲律宾其他

phuket|2343|普吉岛

pattaya|2344|芭堤雅

kohphiphi|2347|皮皮岛

pai|2349|拜县

palawan|2357|巴拉望岛

penang|2360|槟城

phnompenh|2364|金边

paris|2388|巴黎

provence|2389|普罗旺斯

prague|2431|布拉格

porto|2435|波尔图

pamukkale|2448|棉花堡

pisa|2462|比萨

pretoria|2480|比勒陀利亚

palau|2502|帕劳

qingdao|21|青岛

qinghuangdao|26|秦皇岛

qiqihaer|80|齐齐哈尔

qitaihe|87|七台河

quzhou|106|衢州

quanzhou|129|泉州

qianjiang|190|潜江

qingyuan|218|清远

qinzhou|230|钦州

qianxinan|263|黔西南

qiandongnan|265|黔东南

qiannan|266|黔南

qujing|268|曲靖

qingyang|308|庆阳

qionghai|402|琼海

qiongzhong|403|琼中

chiangmai|2345|清迈

chiangrai|2348|清莱

queenstown|2385|皇后镇

rizhao|153|日照

rikazediqu|286|日喀则地区

ruian|904|瑞安

rongcheng|1161|荣成

japan|2315|日本其他

rotterdam|2429|鹿特丹

geneva|2440|日内瓦

rome|2458|罗马

riodejaneiro|2483|里约热内卢

riyuetan|2504|南投

shanghai|1|上海

suzhou|6|苏州

shenzhen|7|深圳

shenyang|18|沈阳

shijiazhuang|24|石家庄

shuozhou|40|朔州

siping|72|四平

songyuan|76|松原

shuangyashan|83|双鸭山

suihua|90|绥化

suqian|100|宿迁

shaoxing|104|绍兴

anhuisuzhou|121|宿州

sanming|128|三明

shangrao|144|上饶

sanmenxia|171|三门峡

shangqiu|173|商丘

shiyan|178|十堰

suizhou|187|随州

shaoyang|195|邵阳

shaoguan|205|韶关

shantou|207|汕头

shanwei|215|汕尾

suining|244|遂宁

shannan|285|山南

shangluo|298|商洛

shizuishan|322|石嘴山

shihezi|339|石河子

sanya|345|三亚

shennongjia|404|神农架林区

shishi|1008|石狮

sansha|2310|三沙

singapore|2311|新加坡

saipan|2326|塞班岛

srilanka|2330|斯里兰卡

seychelles|2334|塞舌尔

samui|2346|苏梅岛

kohsamet|2350|沙美岛

cebu|2356|宿雾

sabah|2362|沙巴

siemreap|2363|暹粒

sihanoukville|2365|西哈努克

seoul|2369|首尔

sydney|2379|悉尼

sanfrancisco|2396|旧金山

seattle|2399|西雅图

salzburg|2418|萨尔兹堡

stockholm|2436|斯德哥尔摩

zurich|2439|苏黎世

seville|2451|塞维利亚

santorini|2454|圣托里尼

saopaulo|2484|圣保罗

santiago|2486|圣地亚哥

tianjin|10|天津

tangshan|25|唐山

taiyuan|35|太原

tongliao|50|通辽

tieling|67|铁岭

tonghua|74|通化

taizhou|99|泰州

zhejiangtaizhou|108|台州

tongling|116|铜陵

taian|151|泰安

tianmen|191|天门

tongrendiqu|262|铜仁地区

tongchuan|290|铜川

tianshui|303|天水

tulufandiqu|327|吐鲁番地区

tachengdiqu|337|塔城地区

taiwan|340|台湾其他

tumushuke|405|图木舒克

tunchang|406|屯昌县

thailand|2313|泰国其他

taipei|2335|台北

tainan|2338|台南

taoyuan|2339|桃园

taichung|2341|台中

tokyo|2372|东京

tahiti|2405|大溪地

toronto|2413|多伦多

thehague|2430|海牙

turin|2463|都灵

tanzania|2499|坦桑尼亚

vietnam|2317|越南其他

varanasi|2408|瓦拉纳西

vancouver|2415|温哥华

vienna|2417|维也纳

venice|2460|威尼斯

wuxi|13|无锡

wuhan|16|武汉

wuhai|48|乌海

wulanchabu|55|乌兰察布

wenzhou|101|温州

wuhu|111|芜湖

weifang|149|潍坊

weihai|152|威海

wuzhou|227|梧州

wenshan|274|文山州

weinan|293|渭南

wuwei|304|武威

wuzhong|323|吴忠

wulumuqi|325|乌鲁木齐

wanning|407|万宁

wenchang|408|文昌

wujiaqu|409|五家渠

wuzhishan|410|五指山

wendeng|1163|文登

bandung|2352|万隆

wellington|2386|惠灵顿

varanasi|2408|瓦拉纳西

vancouver|2415|温哥华

vienna|2417|维也纳

venice|2460|威尼斯

brunei|2492|文莱

xiamen|15|厦门

xian|17|西安

xingtai|28|邢台

xinzhou|43|忻州

xingan|53|兴安盟

xilinguole|54|锡林郭勒

xuzhou|92|徐州

xuancheng|126|宣城

xinyu|138|新余

xinxiang|166|新乡

xuchang|169|许昌

xinyang|174|信阳

xiangyang|180|襄阳

xiaogan|183|孝感

xianning|186|咸宁

xiantao|189|仙桃

xiangtan|193|湘潭

xiangxi|204|湘西

xishuangbanna|276|西双版纳

xianyang|292|咸阳

xining|313|西宁

hongkong|341|香港

singapore|2311|新加坡

newzealand|2319|新西兰其他

newtaipei|2340|新北

sihanoukville|2365|西哈努克

hakone|2378|箱根

sydney|2379|悉尼

seattle|2399|西雅图

hawaii|2404|夏威夷

delhi|2407|新德里

yangzhou|12|扬州

yangquan|37|阳泉

yuncheng|42|运城

yingkou|63|营口

yanbian|78|延边

yichun|85|伊春

yancheng|97|盐城

yingtan|139|鹰潭

jiangxiyichun|142|宜春

yantai|148|烟台

yichang|179|宜昌

yueyang|196|岳阳

yiyang|199|益阳

yongzhou|201|永州

yangjiang|217|阳江

yunfu|223|云浮

guangxiyulin|232|玉林

yibin|249|宜宾

yaan|252|雅安

yuxi|269|玉溪

yanan|294|延安

yulin|296|榆林

yushu|319|玉树

yinchuan|321|银川

yili|336|伊犁

yiwu|385|义乌

yuyao|423|余姚

yongkang|893|永康

yueqing|905|乐清

vietnam|2317|越南其他

indonesia|2325|印度尼西亚其他

jakarta|2353|雅加达

innsbruck|2419|因斯布鲁克

interlaken|2441|因特拉肯

istanbul|2444|伊斯坦布尔

izmir|2447|伊兹密尔

athens|2455|雅典

alexandria|2473|亚历山大

aswan|2474|亚斯文

johannesburg|2479|约翰内斯堡

israel|2494|以色列

jordan|2495|约旦

jamaica|2501|牙买加

chongqing|9|重庆

zhangjiakou|30|张家口

zhengjiang|98|镇江

zhoushan|107|舟山

zhejiangtaizhou|108|台州

zhangzhou|130|漳州

zibo|145|淄博

zaozhuang|146|枣庄

zhengzhou|160|郑州

zhoukou|175|周口

zhumadian|176|驻马店

zhuzhou|192|株洲

zhangjiajie|198|张家界

zhuhai|206|珠海

zhanjiang|210|湛江

zhaoqing|212|肇庆

zhongshan|220|中山

zigong|238|自贡

ziyang|254|资阳

zunyi|260|遵义

zhaotong|271|昭通

zhangye|305|张掖

zhongwei|351|中卫

zhuji|883|诸暨

zhangqiu|1118|章丘

chicago|2400|芝加哥

jaipur|2411|斋浦尔

zurich|2439|苏黎世
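
Each line of the list has the form `slug|id|name`, and the listing script only needs the numeric id in the middle. Some entries appear more than once in the list (for example `boracay|2355`). The following is a minimal sketch of loading such a file, assuming Python 3; the helper name `parse_cities` and the `City` record are hypothetical and not part of the original scripts. It skips blank or malformed lines and drops repeated ids.

```python
from typing import Iterable, List, NamedTuple


class City(NamedTuple):
    slug: str      # pinyin or English key, e.g. "beijing"
    city_id: int   # numeric id used in the listing URL
    name: str      # display name


def parse_cities(lines: Iterable[str]) -> List[City]:
    """Parse 'slug|id|name' lines, skipping blanks and repeated ids."""
    seen = set()
    cities = []
    for line in lines:
        parts = line.strip().split("|")
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # blank or malformed line
        city_id = int(parts[1])
        if city_id in seen:
            continue  # the same city listed again
        seen.add(city_id)
        cities.append(City(parts[0], city_id, parts[2]))
    return cities


# Sample lines taken from the list above, plus a blank line and a repeat.
sample = ["beijing|2|北京", "", "boracay|2355|长滩岛", "boracay|2355|长滩岛"]
cities = parse_cities(sample)
print([c.city_id for c in cities])
```

For a real file, the same function can be fed an open file object, since iterating over it yields one line at a time.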

Scraping the listing pages:

# -*- coding: utf-8 -*- 

import codecs

import traceback

import urllib2

import re

from bs4 import BeautifulSoup

import sys

import MySQLdb

import string

import json

import time



URL_LIST = "http://www.dianping.com/search/category/%s/%s/p%s"  # listing page: city id, category id, page number

RUL_DETAIL = 'http://www.dianping.com/shop/%s'  # shop detail page



f1 = open("f1.log", "a", 1)

f2 = open("f2.log", "a", 1)



reload(sys)

sys.setdefaultencoding('utf-8')

type = sys.getfilesystemencoding()



def deal(city_id, category_id, p):

    url = URL_LIST % (city_id, category_id, p)

    print url

    opener = urllib2.build_opener()

    opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'), ('Accept', 'application/json, text/javascript'), ('Accept-Language', 'zh-CN,zh;q=0.8,en;q=0.6')]

    urlopen = opener.open(url, timeout=100)

    rsp = urlopen.read()

    

    if "404" in rsp:

        return 404 

    print "=====================start=========================="

    soup = BeautifulSoup(rsp)

    soup = soup.find("div", { "id" : "shop-all-list" })

    # print soup

    row = soup.find_all("li")

    for so in row:

        # print so

        get_business(so)

    print ''



# The placeholders are left unquoted: MySQLdb escapes and quotes each value itself.
INSERT_BUSINESS = "INSERT INTO tb_dianping_business_zx (businessID,NAME,Url,BranchName,Address,Regions,Categories,City,AvgRating,AvgPrice,ReviewCount,PhotoUrl,SPhotoUrl,HasCoupon,HasDeal,DealCount,Deals) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s);"

db_interest = MySQLdb.connect(host="ip", port=3306, user="xxx", passwd="xxx", db="db_xxx", charset="utf8")

cur_interest = db_interest.cursor()


def save(business_id, Name, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals):
    params = (business_id, Name, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)
    print "========================sql=========================="
    print INSERT_BUSINESS
    print params
    try:
        # Pass the values separately so a shop name containing a quote cannot break the statement.
        cur_interest.execute(INSERT_BUSINESS, params)
        db_interest.commit()
    except MySQLdb.IntegrityError:
        db_interest.rollback()
        print "*********************** duplicate business_id: %s" % business_id

    print ';'



def get_business(soup):

#     print soup

    

    business_id = get_business_id(soup)

    NAME = get_business_name(soup)

    Url = RUL_DETAIL % business_id

    BranchName = ''

    if "(" in NAME:

        BranchName = NAME[NAME.find("(") + 1:NAME.find(")")]

    Address = get_Address(soup)

    Regions = get_Regions(soup)

    Categories = get_Categories(soup)

    City = '北京'  # hard-coded to Beijing here; the detail script later resets City from each shop's breadcrumb

    AvgRating = get_AvgRating(soup)

    AvgPrice = get_AvgPrice(soup)

    ReviewCount = get_ReviewCount(soup)

    PhotoUrl = get_PhotoUrl(soup)

    SPhotoUrl = PhotoUrl;

    DealCount = get_DealCount(soup)

    HasCoupon = 1 if DealCount > 0 else 0

    HasDeal = HasCoupon

    Deals = get_Deals(soup)

    

    print business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, DealCount, Deals

    save(business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)



def get_business_id(soup):

    return soup.find("div", {"class":"tit"}).find("a")["href"].strip().replace("/shop/", "")

def get_business_name(soup):

    return soup.find("div", {"class":"tit"}).find("a")["title"].strip()

def get_Address(soup):

    if soup.find("span", {"class":"addr"}):

        return soup.find("span", {"class":"addr"}).get_text().strip()

    else:

        return ""

def get_Regions(soup):

    if soup.find("div", {"class":"tag-addr"}):

        return soup.find("div", {"class":"tag-addr"}).find_all("a")[0].find("span", {"class":"tag"}).get_text().strip()

    else:

        return ""

def get_Categories(soup):

    if soup.find("div", {"class":"tag-addr"}):

        return soup.find("div", {"class":"tag-addr"}).find_all("a")[1].find("span", {"class":"tag"}).get_text().strip()

    else:

        return ""

def get_AvgRating(soup):

    return soup.find("span", {"class":"sml-rank-stars"})["class"][1].strip().replace("sml-str", "")

def get_AvgPrice(soup):

    b = soup.find("a", {"class":"mean-price"}).find("b")

    if b:

        # Keep only the digits and decimal point of the price text, dropping any currency symbol.
        return re.sub(r"[^\d.]", "", b.get_text())

    return 0



def get_ReviewCount(soup):

    b = soup.find("a", {"class":"review-num"})

    if b:

        return soup.find("a", {"class":"review-num"}).find("b").get_text().strip()

    return 0

        

def get_PhotoUrl(soup):

    return soup.find("div", {"class":"pic"}).find("img")["data-src"].strip()

def get_DealCount(soup):

    soup = soup.find("div", {"class":"si-deal"})

    if soup :

        return len(soup.find_all("a", {"class":"J_dinfo"}))  # .count("a", {"class":"J_dinfo"})

    return 0

def get_Deals(soup):

    soup = soup.find("div", {"class":"si-deal"})

    if soup :

        rows = soup.find_all("a", {"class":"J_dinfo"})

        # Join the deal ids with commas, without a stray leading comma.
        return ','.join(so["data-deal-id"] for so in rows)

    return ''



if __name__ == "__main__":

    cities = []

    # Category ids to crawl. The full list is kept for reference; the shorter list below overrides it.
    cas = [10, 20, 25, 30, 45, 50, 60, 70]

    cas = [30, 45, 50, 60, 70]

    ct = codecs.open("cities", 'r', 'utf-8')
    for line in ct:
        # Each line is "slug|city_id|name"; skip blank or malformed lines.
        parts = line.strip().split("|")
        if len(parts) == 3:
            cities.append(parts[1])
    ct.close()

    

    for city in cities:

        for ca in cas:

            p = 0

            while p <= 50:

                try:

                    p = p + 1

                    # Log after incrementing so the message shows the page actually requested.
                    print 'deal(%s,%s,%s)' % (city, ca, p)

                    code = deal(city, ca, p)

#                     if 404==code:

#                         break

#                      2 25 12

                except Exception:

                    traceback.print_exc()

#                     print "*********************** duplicate business_id: %s" % sql

                print "sleeping 1 second ... "

                time.sleep(1)

            

#     f = codecs.open("li", 'r', 'utf-8')

#         soup = BeautifulSoup(f.read())

#     soup = BeautifulSoup(f.read())

#     get_business(soup)

    

    # save(business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)
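
Shop names can contain quote characters, so the values written by `save` are best handed to the database driver as parameters rather than formatted into the SQL text. The following is a minimal sketch of that idea, assuming Python 3 and using the standard-library `sqlite3` module as a stand-in for MySQLdb so that it runs without a server. The table here is a hypothetical, cut-down version of `tb_dianping_business_zx`, and the shop name is made-up sample data. Note that `sqlite3` uses `?` placeholders where MySQLdb uses `%s`.

```python
import sqlite3

# In-memory database standing in for the MySQL table (hypothetical, reduced schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE business (businessID INTEGER PRIMARY KEY, name TEXT, avg_price INTEGER)"
)

INSERT_SQL = "INSERT INTO business (businessID, name, avg_price) VALUES (?, ?, ?)"


def save(business_id, name, avg_price):
    """Insert one row; return False if the id already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(INSERT_SQL, (business_id, name, avg_price))
        return True
    except sqlite3.IntegrityError:
        return False


print(save(11566327, "Joe's Diner (Main St)", 85))   # the name contains a quote
print(save(11566327, "Joe's Diner (Main St)", 85))   # same id again: rejected
```

Because the driver receives the values separately from the statement, no manual quoting or escaping is needed for any column.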

Scraping the detail pages:

# -*- coding: utf-8 -*- 

import codecs

import traceback

import urllib2

import re

from bs4 import BeautifulSoup

import sys

import MySQLdb

import string

import json

import time



URL_LIST = "http://www.dianping.com/search/category/%s/%s/p%s"  # listing page: city id, category id, page number

RUL_DETAIL = 'http://www.dianping.com/shop/%s'  # shop detail page



f1 = open("f1.log", "a", 1)

f2 = open("f2.log", "a", 1)



reload(sys)

sys.setdefaultencoding('utf-8')

type = sys.getfilesystemencoding()



def deal(businessID, url):

    print url

    opener = urllib2.build_opener()

    opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'), ('Accept', 'application/json, text/javascript'), ('Accept-Language', 'zh-CN,zh;q=0.8,en;q=0.6')]

    urlopen = opener.open(url, timeout=30)

    rsp = urlopen.read()

    print "=====================start=========================="

    soup = BeautifulSoup(rsp)

    # Keep only the fragments that are parsed later: breadcrumb, basic info, deals, and the sidebar script.
    breadcrumb = soup.find("div", {"class":"breadcrumb"})
    basic_info = soup.find("div", {"id":"basic-info"})
    sales = soup.find("div", {"id":"sales"})
    aside = soup.find("div", {"id":"aside"})
    aside_script = aside.find("script") if aside else ""

    fragment = '%s%s%s%s' % (breadcrumb, basic_info, sales, aside_script)
    print "----------------"
    print fragment
    soup = BeautifulSoup(fragment)

#     print soup

    get_business(businessID, soup)



# The placeholders are left unquoted: MySQLdb escapes and quotes each value itself.
UPDATE_BUSINESS = "UPDATE tb_dianping_business_zx SET Address=%s,Regions=%s,Categories=%s,City=%s,lat=%s,lng=%s,Deals=%s where businessID=%s"

SELECT_BUSINESS = "SELECT businessID,url FROM tb_dianping_business_zx WHERE lat=0 and businessID > %d order by businessID asc LIMIT 100 "

db_interest = MySQLdb.connect(host="xxx", port=xxx, user="xxx", passwd="xxx", db="db_xxx", charset="utf8")

cur_interest = db_interest.cursor()


def save(Address, Regions, Categories, City, lat, lng, Deals, business_id):
    params = (Address, Regions, Categories, City, lat, lng, Deals, business_id)
    try:
        print UPDATE_BUSINESS
        print params
        # Pass the values separately so quotes inside the JSON strings cannot break the statement.
        cur_interest.execute(UPDATE_BUSINESS, params)
        db_interest.commit()
    except MySQLdb.IntegrityError:
        db_interest.rollback()
        print "*********************** duplicate business_id: %s" % business_id

    print ';'



def fetchall(cur, sql):

    cur.execute(sql)

    return cur.fetchall()



def fetchone(cur, sql):

    cur.execute(sql)

    return cur.fetchone()



def get_business(business_id, soup):

    

    business_id = business_id

    cs = get_Regions_Categories(soup)

    City = cs[0]

    Regions = cs[1]

    Categories = cs[2]

    

#     print '%s , %s , %s' % (City, Regions, Categories)

    Address = get_Address(soup)

    point = get_point(soup)

    lat = point[0]

    lng = point[1]

    Deals = get_Deals(soup)

#     print Address, Regions, Categories, City, lat, lng, Deals, business_id

    save(Address, Regions, Categories, City, lat, lng, Deals, business_id)



def get_Regions_Categories(soup):

    rows = soup.find("div", {"class":"breadcrumb"}).find_all("a")

    City = ''

    RegionsCs = []

    CategoriesCs = []

    i = 0

    length = len(rows)

    for row in rows :

        if i == 0:

            City = row.get_text().strip()

        elif length % 2 == 0 and i < length / 2:

            RegionsCs.append(row.get_text().strip())

        elif length % 2 == 0 and i >= length / 2:

            CategoriesCs.append(row.get_text().strip())

        elif length % 2 == 1 and i < length / 2 + 1:

            RegionsCs.append(row.get_text().strip())

        else:

            CategoriesCs.append(row.get_text().strip())

        i = i + 1

    

    # Serialize both lists as compact JSON arrays; json takes care of quoting and escaping.
    Regions = json.dumps(RegionsCs, ensure_ascii=False, separators=(',', ':'))

    Categories = json.dumps(CategoriesCs, ensure_ascii=False, separators=(',', ':'))

    

    return City, Regions, Categories

            

        

def get_Address(soup):

    return '%s %s' % (soup.find("div", {"class":"address"}).find("a").find("span").get_text().strip(), soup.find("div", {"itemprop":"street-address"}).find("span", {"class":"item"}).get_text().strip())

def get_point(soup):

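    # The script text contains a fragment of the form ({lng:<value>,lat:<value>}).
    text = soup.find("script").get_text().strip()
    text = text[text.find("({lng:") + 6:]
    lng = text[:text.find(",lat:")]
    lat = text[text.find(",lat:") + 5:text.find("}")]

    # Store the coordinates as integers scaled by 1e6.
    la = int(float(lat) * 1000000)
    ln = int(float(lng) * 1000000)
    return la, ln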

    

def get_Deals(soup):

    soup = soup.find("div", {"id":"sales"})

    if soup:

        Deals = []

        rows = soup.find_all("div", {"class":"item"})

        for row in rows:

            if row.find("span", {"class":"price"}):

                deal = {}

                title = row.find("p", {"class":"title"})

                url = ""

                if title:

                    deal["name"] = title.get_text().strip()

                    url = row.find("a", {"class":"block-link"})["href"]

                else:

                    deal["name"] = row.get_text().strip()

                    url = row["href"]

                deal["url"] = url

                deal["id"] = url.replace("http://t.dianping.com/deal/", "")

                deal["h5_url"] = url

                Deals.append(deal)

        

        # Serialize with json so that quotes inside deal names are escaped correctly.
        return json.dumps(Deals, ensure_ascii=False)

    return ''



if __name__ == "__main__":

#     deal('http://www.dianping.com/shop/11566327')

    maxId = 0

    SELECT_BUSINESS_NEXT = "";

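    while True:
        try:
            SELECT_BUSINESS_NEXT = SELECT_BUSINESS % maxId
            print SELECT_BUSINESS_NEXT
            rows = fetchall(cur_interest, SELECT_BUSINESS_NEXT)
        except Exception:
            traceback.print_exc()
            rows = []

        for row in rows:
            print row
            # Advance past this shop before fetching it, so a page that keeps failing is skipped rather than retried forever.
            maxId = row[0]
            try:
                deal(row[0], row[1])
            except Exception:
                traceback.print_exc()

        print "sleeping 5 seconds ... "
        time.sleep(5)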



#     f = codecs.open("detail", 'r', 'utf-8')

#     soup = BeautifulSoup(f.read())

#     get_business(soup, 11566327)

    

    # save(business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)
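
Two steps of the detail script are easy to get subtly wrong: pulling the coordinates out of the inline script text, and serializing the region and deal lists. The following is a minimal sketch of both, assuming Python 3 and only the standard library. The helper names `parse_point` and `to_json_list` are hypothetical, and the sample script text is an assumed shape based on the `({lng:...,lat:...})` pattern that the script searches for.

```python
import json
import re

# Matches the fragment the detail script looks for: ({lng:<number>,lat:<number>
POINT_RE = re.compile(r"\(\{lng:\s*(-?\d+(?:\.\d+)?),\s*lat:\s*(-?\d+(?:\.\d+)?)")


def parse_point(script_text):
    """Return (lat, lng) as integers scaled by 1e6, or None if absent."""
    m = POINT_RE.search(script_text)
    if m is None:
        return None
    lng, lat = float(m.group(1)), float(m.group(2))
    return int(round(lat * 1000000)), int(round(lng * 1000000))


def to_json_list(items):
    """Serialize a list compactly, escaping quotes and keeping non-ASCII text."""
    return json.dumps(items, ensure_ascii=False, separators=(",", ":"))


sample = 'map.init({lng:116.403874,lat:39.914889});'   # assumed shape, not real page content
print(parse_point(sample))
print(to_json_list(["朝阳区", 'say "hi"']))
```

`round` is used here rather than plain truncation so that floating-point error in the scaling cannot shift the last digit, and returning `None` lets the caller decide what to do with pages that carry no coordinates.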

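Note: keep the crawl rate as low as you reasonably can. Right, it's already the 16th and my long holiday is about to start. To everyone still finishing their overtime: get home safely and have a good New Year.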
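
One way to keep the request rate under control is to route every download through a small wrapper that pauses before each request and backs off after a failure. The following is a minimal sketch under stated assumptions: it is Python 3, the names `fetch_politely`, `min_delay`, and `retries` are hypothetical, and the fetch function is passed in so that the pacing logic can be tried without touching the network.

```python
import time


def fetch_politely(fetch, url, min_delay=1.0, retries=3, sleep=time.sleep):
    """Call fetch(url), pausing before each attempt and doubling the pause on failure.

    Re-raises the last exception if every attempt fails.
    """
    delay = min_delay
    for attempt in range(1, retries + 1):
        sleep(delay)                 # pause before every request
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise
            delay *= 2               # back off before retrying


# Demonstration with a fake fetcher that fails twice and then succeeds.
calls = []
pauses = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise IOError("temporary failure")
    return "<html>ok</html>"


result = fetch_politely(flaky_fetch, "http://www.dianping.com/shop/11566327",
                        min_delay=1.0, sleep=pauses.append)
print(result, pauses)
```

In the scripts above, the function passed as `fetch` would wrap the `opener.open(url, timeout=...)` call and its `read()`; the demonstration records the pauses in a list instead of actually sleeping.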

Parsing library used: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
