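Just before the New Year holiday a nasty requirement landed on my desk: fetch all the merchant data from a certain review site and store it in a database. Fine, it is actually pretty simple.
First, copy the city data from the review site. You could of course write a script to scrape it, but I didn't bother. Below are the scripts I used to scrape the data, shared here for everyone:
City list (each entry is pinyin|city ID|city name; the list-page script reads it from a file named cities):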
alashan|57|阿拉善 anshan|58|鞍山 anqing|117|安庆 anhuisuzhou|121|宿州 anyang|164|安阳 aba|255|阿坝 anshun|261|安顺 ali|288|阿里 ankang|297|安康 akesudiqu|332|阿克苏地区 aletaidiqu|338|阿勒泰地区 macau|342|澳门 alaer|389|阿拉尔 australia|2318|澳大利亚其他 auckland|2384|奥克兰 orlando|2401|奥兰多 agra|2410|阿格拉 antwerp|2422|安特卫普 amsterdam|2428|阿姆斯特丹 antalya|2445|安塔丽亚 ankara|2446|安卡拉 athens|2455|雅典 edinburgh|2465|爱丁堡 alexandria|2473|亚历山大 aswan|2474|亚斯文 ethiopia|2496|埃塞俄比亚 alishan|2503|阿里山 beijing|2|北京 baoding|29|保定 baotou|47|包头 bayannaoer|56|巴彦淖尔 benxi|60|本溪 baishan|75|白山 baicheng|77|白城 bengbu|112|蚌埠 bozhou|124|亳州 binzhou|158|滨州 beihai|228|北海 baise|233|百色 bazhong|253|巴中 bijiediqu|264|毕节地区 baoshan|270|保山 baoji|291|宝鸡 baiyin|302|白银 boertala|330|博尔塔拉 bayinguoleng|331|巴音郭楞 beitun|346|北屯 baisha|390|白沙 baoting|391|保亭 bangkok|2342|曼谷 pattaya|2344|芭堤雅 pai|2349|拜县 bali|2351|巴厘岛 bandung|2352|万隆 boracay|2355|长滩岛 palawan|2357|巴拉望岛 bohol|2358|薄荷岛 busan|2370|釜山 hokkaido|2375|北海道 brisbane|2381|布里斯班 paris|2388|巴黎 boston|2403|波士顿 brussels|2420|布鲁塞尔 bruges|2421|布鲁日 berlin|2423|柏林 prague|2431|布拉格 brno|2433|布尔诺 porto|2435|波尔图 bern|2443|伯尔尼 barcelona|2449|巴塞罗纳 budapest|2457|布达佩斯 pisa|2462|比萨 pretoria|2480|比勒陀利亚 buenosaires|2485|布宜诺斯艾利斯 brunei|2492|文莱 chengdu|8|成都 chongqing|9|重庆 chengde|31|承德 cangzhou|32|沧州 changzhi|38|长治 chifeng|49|赤峰 chaoyang|68|朝阳 changchun|70|长春 changzhou|93|常州 chuzhou|119|滁州 chaohu|122|巢湖 chizhou|125|池州 changde|197|常德 chenzhou|200|郴州 chaozhou|221|潮州 chuxiongzhou|272|楚雄州 changdudiqu|284|昌都地区 changjizhou|329|昌吉州 changsha|344|长沙 changjiang|392|昌江 chengmai|393|澄迈县 chongzuo|394|崇左 cixi|421|慈溪 cangnan|911|苍南 changle|981|长乐 cambodia|2316|柬埔寨其他 chiangmai|2345|清迈 chiangrai|2348|清莱 boracay|2355|长滩岛 cebu|2356|宿雾 okinawa|2377|冲绳 canberra|2382|堪培拉 cairns|2383|凯恩斯 christchurch|2387|基督城 cannes|2391|戛纳 chicago|2400|芝加哥 cologne|2425|科隆 creteisland|2453|克里特 cambridge|2466|剑桥 cairo|2472|开罗 casablanca|2477|卡萨布兰卡 capetown|2478|开普敦 cancun|2482|坎坤 cuzco|2487|库斯科 costarica|2500|哥斯达黎加 dalian|19|大连 datong|36|大同 dandong|61|丹东 daqing|84|大庆 
daxinganling|91|大兴安岭 dongying|147|东营 dezhou|156|德州 dongguan|219|东莞 deyang|241|德阳 dazhou|251|达州 dali|277|大理 dehong|278|德宏 diqing|281|迪庆 dingxi|309|定西 danzhou|358|儋州 dingan|395|定安县 dongfang|396|东方 tokyo|2372|东京 osaka|2374|大阪 dijon|2394|第戎 tahiti|2405|大溪地 delhi|2407|新德里 toronto|2413|多伦多 turin|2463|都灵 dublin|2470|都柏林 eerduosi|51|鄂尔多斯 ezhou|181|鄂州 enshizhou|188|恩施州 edinburgh|2465|爱丁堡 ethiopia|2496|埃塞俄比亚 fuzhou|14|福州 fushun|59|抚顺 fuxin|64|阜新 fuyang|120|阜阳 jiangxifuzhou|143|抚州 foshan|208|佛山 fangchenggang|229|防城港 fenghua|422|奉化 fuqing|433|福清 fuyangfy|869|富阳 fuding|1031|福鼎 philippines|2327|菲律宾其他 fiji|2328|斐济 france|2331|法国其他 busan|2370|釜山 fujisan|2376|富士山 frankfurt|2426|法兰克福 florence|2459|佛罗伦萨 fukuoka|2505|福冈 guangzhou|4|广州 ganzhou|140|赣州 guilin|226|桂林 guigang|231|贵港 guangxiyulin|232|玉林 guangyuan|243|广元 guangan|250|广安 ganzi|256|甘孜 guiyang|258|贵阳 gannanzhou|312|甘南 guoluo|318|果洛 guyuan|324|固原 guowai|343|国外其他 kaohsiung|2337|高雄 goldcoast|2380|黄金海岸 gothenburg|2437|哥德堡 geneva|2440|日内瓦 costarica|2500|哥斯达黎加 hangzhou|3|杭州 haikou|23|海口 handan|27|邯郸 hengshui|34|衡水 huhehaote|46|呼和浩特 hulunbeier|52|呼伦贝尔 huludao|69|葫芦岛 haerbin|79|哈尔滨 hegang|82|鹤岗 heihe|89|黑河 huaian|96|淮安 huzhou|103|湖州 hefei|110|合肥 huainan|113|淮南 huaibei|115|淮北 huangshan|118|黄山 heze|159|菏泽 hebi|165|鹤壁 huangshi|177|黄石 huanggang|185|黄冈 hengyang|194|衡阳 huaihua|202|怀化 huizhou|213|惠州 heyuan|216|河源 hezhou|234|贺州 hechi|235|河池 honghe|273|红河 hanzhong|295|汉中 haidong|314|海东 haibei|315|海北 huangnan|316|黄南 haixi|320|海西 hamidiqu|328|哈密地区 hetiandiqu|335|和田地区 hongkong|341|香港 hainanzhou|411|海南州 korea|2314|韩国其他 hualien|2336|花莲 hochiminh|2366|胡志明市 hanoi|2367|河内 haiphong|2368|海防市 hokkaido|2375|北海道 hakone|2378|箱根 goldcoast|2380|黄金海岸 queenstown|2385|皇后镇 wellington|2386|惠灵顿 hawaii|2404|夏威夷 hamburg|2424|汉堡 thehague|2430|海牙 jinan|22|济南 jincheng|39|晋城 jinzhong|41|晋中 jinzhou|62|锦州 jilin|71|吉林 jixi|81|鸡西 jiamusi|86|佳木斯 jiaxing|102|嘉兴 jinhua|105|金华 jingdezhen|135|景德镇 jiujiang|137|九江 jian|141|吉安 jiangxiyichun|142|宜春 jiangxifuzhou|143|抚州 jining|150|济宁 
jiaozuo|167|焦作 jinmen|182|荆门 jingzhou|184|荆州 jiangmen|209|江门 jieyang|222|揭阳 jiayuguan|300|嘉峪关 jinchang|301|金昌 jiuquan|307|酒泉 jiyuan|397|济源 jingjian|853|靖江 jinjiang|1009|晋江 japan|2315|日本其他 cambodia|2316|柬埔寨其他 jakarta|2353|雅加达 kualalumpur|2359|吉隆坡 phnompenh|2364|金边 jeju|2371|济州岛 kyoto|2373|京都 christchurch|2387|基督城 sanfrancisco|2396|旧金山 jaipur|2411|斋浦尔 cambridge|2466|剑桥 killarney|2471|基拉尼 johannesburg|2479|约翰内斯堡 jordan|2495|约旦 jamaica|2501|牙买加 kaifeng|161|开封 kunming|267|昆明 kelamayi|326|克拉玛依 kezilesu|333|克孜勒苏 kashidiqu|334|喀什地区 kunshan|416|昆山 korea|2314|韩国其他 kaohsiung|2337|高雄 kohphiphi|2347|皮皮岛 kohsamet|2350|沙美岛 kualalumpur|2359|吉隆坡 kyoto|2373|京都 canberra|2382|堪培拉 cairns|2383|凯恩斯 kenting|2406|垦丁 cologne|2425|科隆 karlovyvary|2432|卡罗维瓦立 creteisland|2453|克里特 killarney|2471|基拉尼 cairo|2472|开罗 casablanca|2477|卡萨布兰卡 capetown|2478|开普敦 cancun|2482|坎坤 cuzco|2487|库斯科 kenya|2497|肯尼亚 langfang|33|廊坊 linfen|44|临汾 lvliang|45|吕梁 liaoyang|65|辽阳 liaoyuan|73|辽源 lianyuangang|95|连云港 lishui|109|丽水 liuan|123|六安 longyan|132|龙岩 laiwu|154|莱芜 linyi|155|临沂 liaocheng|157|聊城 luoyang|162|洛阳 luohe|170|漯河 loudi|203|娄底 liuzhou|225|柳州 luzhou|240|泸州 leshan|246|乐山 liangshan|257|凉山 liupanshui|259|六盘水 lijiang|279|丽江 linchang|282|临沧 lasa|283|拉萨 linzhi|289|林芝地区 lanzhou|299|兰州 longnan|310|陇南 linxiazhou|311|临夏州 laibin|398|来宾 ledong|399|乐东 lingao|400|临高县 lingshui|401|陵水 liyang|867|溧阳 linan|868|临安 yueqing|905|乐清 liuhai|1015|龙海 liuyang|1376|浏阳 langkawi|2361|兰卡威 lyon|2393|里昂 losangeles|2397|洛杉矶 lasvegas|2398|拉斯维加斯 rotterdam|2429|鹿特丹 lisbon|2434|里斯本 luzern|2442|卢塞恩 rome|2458|罗马 london|2464|伦敦 liverpool|2469|利物浦 luxor|2475|卢克索 riodejaneiro|2483|里约热内卢 lima|2488|利马 laos|2490|老挝 lebanon|2491|黎巴嫩 mudanjiang|88|牡丹江 maanshan|114|马鞍山 maoming|211|茂名 meizhou|214|梅州 mianyang|242|绵阳 meishan|248|眉山 macau|342|澳门 malaysia|2312|马来西亚其他 melbourne|2322|墨尔本 maldives|2324|马尔代夫 mauritius|2329|毛里求斯 unitedstates|2332|美国其他 bangkok|2342|曼谷 manila|2354|马尼拉 marseille|2392|马赛 miami|2402|迈阿密 mumbai|2409|孟买 montreal|2412|蒙特娄 munich|2427|慕尼黑 
malmo|2438|马尔默 pamukkale|2448|棉花堡 madrid|2450|马德里 majorca|2452|马略卡岛 mykonos|2456|米科诺斯 manchester|2467|曼彻斯特 marrakech|2476|马拉喀什 mexicocity|2481|墨西哥城 machupicchu|2489|马丘比丘 myanmar|2493|缅甸 madagascar|2498|马达加斯加 nanjing|5|南京 ningbo|11|宁波 nantong|94|南通 nanping|131|南平 ningde|133|宁德 nanchang|134|南昌 nanyang|172|南阳 nanning|224|南宁 neijiang|245|内江 nanchong|247|南充 nujiang|280|怒江 naqu|287|那曲 ninghai|2308|宁海 newzealand|2319|新西兰其他 nepal|2333|尼泊尔 newtaipei|2340|新北 nice|2390|尼斯 newyork|2395|纽约 niagarafalls|2414|尼亚加拉瀑布 naples|2461|那不勒斯 oxford|2468|牛津 riyuetan|2504|南投 osaka|2374|大阪 okinawa|2377|冲绳 orlando|2401|奥兰多 ottawa|2416|渥太华 oxford|2468|牛津 panjin|66|盘锦 putian|127|莆田 pingxiang|136|萍乡 pingdingshan|163|平顶山 puyang|168|濮阳 panzhihua|239|攀枝花 puer|275|普洱 pingliang|306|平凉 pingyang|908|平阳 philippines|2327|菲律宾其他 phuket|2343|普吉岛 pattaya|2344|芭堤雅 kohphiphi|2347|皮皮岛 pai|2349|拜县 palawan|2357|巴拉望岛 penang|2360|槟城 phnompenh|2364|金边 paris|2388|巴黎 provence|2389|普罗旺斯 prague|2431|布拉格 porto|2435|波尔图 pamukkale|2448|棉花堡 pisa|2462|比萨 pretoria|2480|比勒陀利亚 palau|2502|帕劳 qingdao|21|青岛 qinghuangdao|26|秦皇岛 qiqihaer|80|齐齐哈尔 qitaihe|87|七台河 quzhou|106|衢州 quanzhou|129|泉州 qianjiang|190|潜江 qingyuan|218|清远 qinzhou|230|钦州 qianxinan|263|黔西南 qiandongnan|265|黔东南 qiannan|266|黔南 qujing|268|曲靖 qingyang|308|庆阳 qionghai|402|琼海 qiongzhong|403|琼中 chiangmai|2345|清迈 chiangrai|2348|清莱 queenstown|2385|皇后镇 rizhao|153|日照 rikazediqu|286|日喀则地区 ruian|904|瑞安 rongcheng|1161|荣成 japan|2315|日本其他 rotterdam|2429|鹿特丹 geneva|2440|日内瓦 rome|2458|罗马 riodejaneiro|2483|里约热内卢 riyuetan|2504|南投 shanghai|1|上海 suzhou|6|苏州 shenzhen|7|深圳 shenyang|18|沈阳 shijiazhuang|24|石家庄 shuozhou|40|朔州 siping|72|四平 songyuan|76|松原 shuangyashan|83|双鸭山 suihua|90|绥化 suqian|100|宿迁 shaoxing|104|绍兴 anhuisuzhou|121|宿州 sanming|128|三明 shangrao|144|上饶 sanmenxia|171|三门峡 shangqiu|173|商丘 shiyan|178|十堰 suizhou|187|随州 shaoyang|195|邵阳 shaoguan|205|韶关 shantou|207|汕头 shanwei|215|汕尾 suining|244|遂宁 shannan|285|山南 shangluo|298|商洛 shizuishan|322|石嘴山 shihezi|339|石河子 sanya|345|三亚 
shennongjia|404|神农架林区 shishi|1008|石狮 sansha|2310|三沙 singapore|2311|新加坡 saipan|2326|塞班岛 srilanka|2330|斯里兰卡 seychelles|2334|塞舌尔 samui|2346|苏梅岛 kohsamet|2350|沙美岛 cebu|2356|宿雾 sabah|2362|沙巴 siemreap|2363|暹粒 sihanoukville|2365|西哈努克 seoul|2369|首尔 sydney|2379|悉尼 sanfrancisco|2396|旧金山 seattle|2399|西雅图 salzburg|2418|萨尔兹堡 stockholm|2436|斯德哥尔摩 zurich|2439|苏黎世 seville|2451|塞维利亚 santorini|2454|圣托里尼 saopaulo|2484|圣保罗 santiago|2486|圣地亚哥 tianjin|10|天津 tangshan|25|唐山 taiyuan|35|太原 tongliao|50|通辽 tieling|67|铁岭 tonghua|74|通化 taizhou|99|泰州 zhejiangtaizhou|108|台州 tongling|116|铜陵 taian|151|泰安 tianmen|191|天门 tongrendiqu|262|铜仁地区 tongchuan|290|铜川 tianshui|303|天水 tulufandiqu|327|吐鲁番地区 tachengdiqu|337|塔城地区 taiwan|340|台湾其他 tumushuke|405|图木舒克 tunchang|406|屯昌县 thailand|2313|泰国其他 taipei|2335|台北 tainan|2338|台南 taoyuan|2339|桃园 taichung|2341|台中 tokyo|2372|东京 tahiti|2405|大溪地 toronto|2413|多伦多 thehague|2430|海牙 turin|2463|都灵 tanzania|2499|坦桑尼亚 vietnam|2317|越南其他 varanasi|2408|瓦拉纳西 vancouver|2415|温哥华 vienna|2417|维也纳 venice|2460|威尼斯 wuxi|13|无锡 wuhan|16|武汉 wuhai|48|乌海 wulanchabu|55|乌兰察布 wenzhou|101|温州 wuhu|111|芜湖 weifang|149|潍坊 weihai|152|威海 wuzhou|227|梧州 wenshan|274|文山州 weinan|293|渭南 wuwei|304|武威 wuzhong|323|吴忠 wulumuqi|325|乌鲁木齐 wanning|407|万宁 wenchang|408|文昌 wujiaqu|409|五家渠 wuzhishan|410|五指山 wendeng|1163|文登 bandung|2352|万隆 wellington|2386|惠灵顿 varanasi|2408|瓦拉纳西 vancouver|2415|温哥华 vienna|2417|维也纳 venice|2460|威尼斯 brunei|2492|文莱 xiamen|15|厦门 xian|17|西安 xingtai|28|邢台 xinzhou|43|忻州 xingan|53|兴安盟 xilinguole|54|锡林郭勒 xuzhou|92|徐州 xuancheng|126|宣城 xinyu|138|新余 xinxiang|166|新乡 xuchang|169|许昌 xinyang|174|信阳 xiangyang|180|襄阳 xiaogan|183|孝感 xianning|186|咸宁 xiantao|189|仙桃 xiangtan|193|湘潭 xiangxi|204|湘西 xishuangbanna|276|西双版纳 xianyang|292|咸阳 xining|313|西宁 hongkong|341|香港 singapore|2311|新加坡 newzealand|2319|新西兰其他 newtaipei|2340|新北 sihanoukville|2365|西哈努克 hakone|2378|箱根 sydney|2379|悉尼 seattle|2399|西雅图 hawaii|2404|夏威夷 delhi|2407|新德里 yangzhou|12|扬州 yangquan|37|阳泉 yuncheng|42|运城 yingkou|63|营口 yanbian|78|延边 yichun|85|伊春 
yancheng|97|盐城 yingtan|139|鹰潭 jiangxiyichun|142|宜春 yantai|148|烟台 yichang|179|宜昌 yueyang|196|岳阳 yiyang|199|益阳 yongzhou|201|永州 yangjiang|217|阳江 yunfu|223|云浮 guangxiyulin|232|玉林 yibin|249|宜宾 yaan|252|雅安 yuxi|269|玉溪 yanan|294|延安 yulin|296|榆林 yushu|319|玉树 yinchuan|321|银川 yili|336|伊犁 yiwu|385|义乌 yuyao|423|余姚 yongkang|893|永康 yueqing|905|乐清 vietnam|2317|越南其他 indonesia|2325|印度尼西亚其他 jakarta|2353|雅加达 innsbruck|2419|因斯布鲁克 interlaken|2441|因特拉肯 istanbul|2444|伊斯坦布尔 izmir|2447|伊兹密尔 athens|2455|雅典 alexandria|2473|亚历山大 aswan|2474|亚斯文 johannesburg|2479|约翰内斯堡 israel|2494|以色列 jordan|2495|约旦 jamaica|2501|牙买加 chongqing|9|重庆 zhangjiakou|30|张家口 zhengjiang|98|镇江 zhoushan|107|舟山 zhejiangtaizhou|108|台州 zhangzhou|130|漳州 zibo|145|淄博 zaozhuang|146|枣庄 zhengzhou|160|郑州 zhoukou|175|周口 zhumadian|176|驻马店 zhuzhou|192|株洲 zhangjiajie|198|张家界 zhuhai|206|珠海 zhanjiang|210|湛江 zhaoqing|212|肇庆 zhongshan|220|中山 zigong|238|自贡 ziyang|254|资阳 zunyi|260|遵义 zhaotong|271|昭通 zhangye|305|张掖 zhongwei|351|中卫 zhuji|883|诸暨 zhangqiu|1118|章丘 chicago|2400|芝加哥 jaipur|2411|斋浦尔 zurich|2439|苏黎世
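The list above packs many pinyin|id|name records onto each line and repeats some cities under more than one initial (chongqing|9 appears under both C and Z, for example). A minimal sketch of turning that text into unique records follows. It is written for Python 3, unlike the Python 2 scripts in this post, and the name parse_cities is made up for illustration:

```python
# -*- coding: utf-8 -*-
# Hypothetical helper, not part of the original scripts.


def parse_cities(text):
    """Parse whitespace-separated `pinyin|id|name` records into a list of dicts."""
    cities = []
    seen_ids = set()
    for token in text.split():
        parts = token.split("|")
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # skip anything that is not a well-formed record
        pinyin, city_id, name = parts
        if city_id in seen_ids:
            continue  # the list repeats some cities under several initials
        seen_ids.add(city_id)
        cities.append({"pinyin": pinyin, "id": int(city_id), "name": name})
    return cities


sample = "beijing|2|北京 baoding|29|保定 chongqing|9|重庆 chongqing|9|重庆"
cities = parse_cities(sample)
print([c["id"] for c in cities])
```

Deduplicating by ID means each city is crawled once even though it is listed several times.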
Scraping the list pages:
# -*- coding: utf-8 -*-
# Python 2 script: walks the search result pages and inserts one row per shop.
import codecs
import traceback
import urllib2
import re
from bs4 import BeautifulSoup
import sys
import MySQLdb
import string
import json
import time

URL_LIST = "http://www.dianping.com/search/category/%s/%s/p%s"  # list page
RUL_DETAIL = 'http://www.dianping.com/shop/%s'  # detail page

f1 = open("f1.log", "a", 1)
f2 = open("f2.log", "a", 1)

reload(sys)
sys.setdefaultencoding('utf-8')
type = sys.getfilesystemencoding()


def deal(city_id, category_id, p):
    """Fetch one list page and save every shop found on it."""
    url = URL_LIST % (city_id, category_id, p)
    print url
    opener = urllib2.build_opener()
    opener.addheaders = [
        ('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'),
        ('Accept', 'application/json, text/javascript'),
        ('Accept-Language', 'zh-CN,zh;q=0.8,en;q=0.6')]
    urlopen = opener.open(url, timeout=100)
    rsp = urlopen.read()
    if "404" in rsp:
        return 404
    print "=====================start=========================="
    soup = BeautifulSoup(rsp)
    soup = soup.find("div", {"id": "shop-all-list"})
    # print soup
    row = soup.find_all("li")
    for so in row:
        # print so
        get_business(so)
        print ''


# NOTE: values are interpolated straight into the SQL text, so a quote
# character in scraped data will break the statement.
INSERT_BUSINESS = "INSERT INTO tb_dianping_business_zx (businessID,NAME,Url,BranchName,Address,Regions,Categories,City,AvgRating,AvgPrice,ReviewCount,PhotoUrl,SPhotoUrl,HasCoupon,HasDeal,DealCount,Deals) VALUES (%s,'%s','%s','%s','%s','%s','%s','%s',%s,%s,%s,'%s','%s',%s,%s,%s,'%s');"

db_interest = MySQLdb.connect(host="ip", port=3306, user="xxx", passwd="xxx", db="db_xxx", charset="utf8")
cur_interest = db_interest.cursor()


def save(business_id, Name, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals):
    sql = INSERT_BUSINESS % (business_id, Name, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)
    print "========================sql=========================="
    print sql
    try:
        cur_interest.execute(sql)
        db_interest.commit()
    except MySQLdb.IntegrityError:
        # The shop is already in the table.
        db_interest.rollback()
        print "*********************** duplicate business_id: %s" % sql
        print ';'


def get_business(soup):
    """Extract all fields from one <li> of the list page and save them."""
    # print soup
    business_id = get_business_id(soup)
    NAME = get_business_name(soup)
    Url = RUL_DETAIL % business_id
    BranchName = ''
    if "(" in NAME:
        # The branch name is the text between the parentheses.
        BranchName = NAME[NAME.find("(") + 1:NAME.find(")")]
    Address = get_Address(soup)
    Regions = get_Regions(soup)
    Categories = get_Categories(soup)
    City = '北京'  # hard-coded (Beijing); the detail script overwrites it later
    AvgRating = get_AvgRating(soup)
    AvgPrice = get_AvgPrice(soup)
    ReviewCount = get_ReviewCount(soup)
    PhotoUrl = get_PhotoUrl(soup)
    SPhotoUrl = PhotoUrl
    DealCount = get_DealCount(soup)
    HasCoupon = 1 if DealCount > 0 else 0
    HasDeal = HasCoupon
    Deals = get_Deals(soup)
    print business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, DealCount, Deals
    save(business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)


def get_business_id(soup):
    return soup.find("div", {"class": "tit"}).find("a")["href"].strip().replace("/shop/", "")


def get_business_name(soup):
    return soup.find("div", {"class": "tit"}).find("a")["title"].strip()


def get_Address(soup):
    if soup.find("span", {"class": "addr"}):
        return soup.find("span", {"class": "addr"}).get_text().strip()
    else:
        return ""


def get_Regions(soup):
    if soup.find("div", {"class": "tag-addr"}):
        return soup.find("div", {"class": "tag-addr"}).find_all("a")[0].find("span", {"class": "tag"}).get_text().strip()
    else:
        return ""


def get_Categories(soup):
    if soup.find("div", {"class": "tag-addr"}):
        return soup.find("div", {"class": "tag-addr"}).find_all("a")[1].find("span", {"class": "tag"}).get_text().strip()
    else:
        return ""


def get_AvgRating(soup):
    # The rating is encoded in the second CSS class, e.g. "sml-str40".
    return soup.find("span", {"class": "sml-rank-stars"})["class"][1].strip().replace("sml-str", "")


def get_AvgPrice(soup):
    b = soup.find("a", {"class": "mean-price"}).find("b")
    if b:
        return b.get_text().strip().replace("¥", "")
    return 0


def get_ReviewCount(soup):
    b = soup.find("a", {"class": "review-num"})
    if b:
        return soup.find("a", {"class": "review-num"}).find("b").get_text().strip()
    return 0


def get_PhotoUrl(soup):
    return soup.find("div", {"class": "pic"}).find("img")["data-src"].strip()


def get_DealCount(soup):
    soup = soup.find("div", {"class": "si-deal"})
    if soup:
        return len(soup.find_all("a", {"class": "J_dinfo"}))
    return 0


def get_Deals(soup):
    """Return the deal IDs as a comma-prefixed string such as ",123,456"."""
    soup = soup.find("div", {"class": "si-deal"})
    if soup:
        data_deal_id = ''
        rows = soup.find_all("a", {"class": "J_dinfo"})
        for so in rows:
            data_deal_id = '%s,%s' % (data_deal_id, so["data-deal-id"])
        return data_deal_id
    return ''


if __name__ == "__main__":
    cities = []
    cas = [10, 20, 25, 30, 45, 50, 60, 70]
    cas = [30, 45, 50, 60, 70]  # overrides the full category list above
    # Expects one pinyin|id|name record per line; the ID is the middle field.
    ct = codecs.open("cities", 'r', 'utf-8')
    lines = ct.readlines()
    for word in lines:
        word = word[word.find("|") + 1:]
        word = word[0:word.find("|")]
        cities.append(word.strip())
    for city in cities:
        for ca in cas:
            p = 0
            while p <= 50:
                try:
                    p = p + 1
                    print 'deal(%s,%s,%s)' % (city, ca, p)
                    code = deal(city, ca, p)
                    # if 404==code:
                    #     break
                    # 2 25 12
                except Exception:
                    traceback.print_exc()
                    # print "*********************** duplicate business_id: %s" % sql
                print "Sleeping 1 second ... "
                time.sleep(1)
    # f = codecs.open("li", 'r', 'utf-8')
    # soup = BeautifulSoup(f.read())
    # get_business(soup)
    # save(business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)
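The save() function in the script above builds its SQL with string formatting, so a shop name containing a quote character would break the statement. One common alternative is to pass the values as query parameters. The sketch below is an illustration only: it uses Python 3 and the standard-library sqlite3 module as a stand-in so that it runs without a MySQL server, and the three-column business table is made up. With MySQLdb the placeholder would be %s rather than ?, with the values passed as the second argument to cursor.execute().

```python
import sqlite3

# In-memory stand-in for the real MySQL table (hypothetical, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE business (businessID INTEGER PRIMARY KEY, name TEXT, address TEXT)"
)

INSERT_SQL = "INSERT INTO business (businessID, name, address) VALUES (?, ?, ?)"


def save(conn, business_id, name, address):
    """Insert one row; return False if the primary key already exists."""
    try:
        # The driver escapes the values, so quotes in scraped text are safe.
        conn.execute(INSERT_SQL, (business_id, name, address))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()
        return False


first = save(conn, 1, "Joe's (East Branch)", "1 Main St")
second = save(conn, 1, "Joe's (East Branch)", "1 Main St")  # duplicate ID
stored_name = conn.execute(
    "SELECT name FROM business WHERE businessID = 1"
).fetchone()[0]
print(first, second, stored_name)
```

The duplicate-key branch mirrors the IntegrityError handling in the original script, so re-running the crawl over the same pages stays harmless.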
Scraping the detail pages:
# -*- coding: utf-8 -*-
# Python 2 script: visits each shop's detail page and fills in the address,
# breadcrumb regions/categories, coordinates and deals.
import codecs
import traceback
import urllib2
import re
from bs4 import BeautifulSoup
import sys
import MySQLdb
import string
import json
import time

URL_LIST = "http://www.dianping.com/search/category/%s/%s/p%s"  # list page
RUL_DETAIL = 'http://www.dianping.com/shop/%s'  # detail page

f1 = open("f1.log", "a", 1)
f2 = open("f2.log", "a", 1)

reload(sys)
sys.setdefaultencoding('utf-8')
type = sys.getfilesystemencoding()


def deal(businessID, url):
    """Fetch one detail page and update the matching row."""
    print url
    opener = urllib2.build_opener()
    opener.addheaders = [
        ('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'),
        ('Accept', 'application/json, text/javascript'),
        ('Accept-Language', 'zh-CN,zh;q=0.8,en;q=0.6')]
    urlopen = opener.open(url, timeout=30)
    rsp = urlopen.read()
    print "=====================start=========================="
    soup = BeautifulSoup(rsp)
    breadcrumb = soup.find("div", {"class": "breadcrumb"})
    basic_info = soup.find("div", {"id": "basic-info"})
    sales = soup.find("div", {"id": "sales"})
    aside_script = soup.find("div", {"id": "aside"})
    if aside_script:
        aside_script = soup.find("div", {"id": "aside"}).find("script")
    else:
        aside_script = ""
    print "----------------"
    print '%s%s%s%s' % (breadcrumb, basic_info, sales, aside_script)
    # Re-parse only the fragments we need.
    soup = BeautifulSoup('%s%s%s%s' % (breadcrumb, basic_info, sales, aside_script))
    # print soup
    get_business(businessID, soup)


UPDATE_BUSINESS = "UPDATE tb_dianping_business_zx SET Address='%s',Regions='%s',Categories='%s',City='%s',lat=%s,lng=%s,Deals='%s' where businessID= %s "
SELECT_BUSINESS = "SELECT businessID,url FROM tb_dianping_business_zx WHERE lat=0 and businessID > %d order by businessID asc LIMIT 100 "

db_interest = MySQLdb.connect(host="xxx", port=xxx, user="xxx", passwd="xxx", db="db_xxx", charset="utf8")
cur_interest = db_interest.cursor()


def save(Address, Regions, Categories, City, lat, lng, Deals, business_id):
    sql = UPDATE_BUSINESS % (Address, Regions, Categories, City, lat, lng, Deals, business_id)
    try:
        print sql
        cur_interest.execute(sql)
        db_interest.commit()
    except MySQLdb.IntegrityError:
        db_interest.rollback()
        print "*********************** duplicate business_id: %s" % sql
        print ';'


def fetchall(cur, sql):
    cur.execute(sql)
    return cur.fetchall()


def fetchone(cur, sql):
    cur.execute(sql)
    return cur.fetchone()


def get_business(business_id, soup):
    cs = get_Regions_Categories(soup)
    City = cs[0]
    Regions = cs[1]
    Categories = cs[2]
    # print '%s , %s , %s' % (City, Regions, Categories)
    Address = get_Address(soup)
    point = get_point(soup)
    lat = point[0]
    lng = point[1]
    Deals = get_Deals(soup)
    # print Address, Regions, Categories, City, lat, lng, Deals, business_id
    save(Address, Regions, Categories, City, lat, lng, Deals, business_id)


def get_Regions_Categories(soup):
    """Split the breadcrumb links into city, regions and categories.

    The first link is the city; the first half of the links (rounded up)
    goes to regions and the rest to categories.
    """
    rows = soup.find("div", {"class": "breadcrumb"}).find_all("a")
    City = ''
    RegionsCs = []
    CategoriesCs = []
    i = 0
    length = len(rows)
    for row in rows:
        if i == 0:
            City = row.get_text().strip()
        elif length % 2 == 0 and i < length / 2:
            RegionsCs.append(row.get_text().strip())
        elif length % 2 == 0 and i >= length / 2:
            CategoriesCs.append(row.get_text().strip())
        elif length % 2 == 1 and i < length / 2 + 1:
            RegionsCs.append(row.get_text().strip())
        else:
            CategoriesCs.append(row.get_text().strip())
        i = i + 1
    # Build JSON-style arrays by hand, e.g. ["a","b"].
    Regions = ""
    for c in RegionsCs:
        Regions = '%s,"%s"' % (Regions, c)
    Regions = '[%s]' % Regions
    Regions = Regions.replace("[,", "[")
    Categories = ""
    for c in CategoriesCs:
        Categories = '%s,"%s"' % (Categories, c)
    Categories = '[%s]' % Categories
    Categories = Categories.replace("[,", "[")
    return City, Regions, Categories


def get_Address(soup):
    return '%s %s' % (
        soup.find("div", {"class": "address"}).find("a").find("span").get_text().strip(),
        soup.find("div", {"itemprop": "street-address"}).find("span", {"class": "item"}).get_text().strip())


def get_point(soup):
    """Return (lat, lng) as integers scaled by 1,000,000."""
    text = soup.find("script").get_text().strip()
    # The script contains a fragment shaped like ({lng:<value>,lat:<value>}).
    text = text[text.find("({lng:") + 6:]
    # Fixed: the original assigned these two values the wrong way round.
    lng = text[:text.find(",lat:")]
    lat = text[text.find(",lat:") + 5:text.find("}")]
    la = int(float(lat) * 1000000)
    ln = int(float(lng) * 1000000)
    return la, ln


def get_Deals(soup):
    """Return the deals of the shop as a JSON-style array string."""
    soup = soup.find("div", {"id": "sales"})
    if soup:
        Deals = []
        rows = soup.find_all("div", {"class": "item"})
        for row in rows:
            if row.find("span", {"class": "price"}):
                deal = {}
                title = row.find("p", {"class": "title"})
                url = ""
                if title:
                    deal["name"] = title.get_text().strip()
                    url = row.find("a", {"class": "block-link"})["href"]
                else:
                    # Fixed: the original called get_text() on the whole result list.
                    deal["name"] = row.get_text().strip()
                    url = row["href"]
                deal["url"] = url
                deal["id"] = url.replace("http://t.dianping.com/deal/", "")
                deal["h5_url"] = url
                Deals.append(deal)
        deals = ""
        for c in Deals:
            deals = '%s,{"url":"%s", "name": "%s", "h5_url": "%s", "id": "%s"}' % (deals, c.get("url"), c.get("name"), c.get("h5_url"), c.get("id"))
        deals = '[%s]' % deals
        deals = deals.replace("[,", "[")
        return deals
    return ''


if __name__ == "__main__":
    # deal('http://www.dianping.com/shop/11566327')
    maxId = 0
    SELECT_BUSINESS_NEXT = ""
    while True:
        try:
            # Page through the rows that still lack coordinates, 100 at a time.
            SELECT_BUSINESS_NEXT = SELECT_BUSINESS % maxId
            print SELECT_BUSINESS_NEXT
            rows = fetchall(cur_interest, SELECT_BUSINESS_NEXT)
            for row in rows:
                print row
                deal(row[0], row[1])
                maxId = row[0]
        except Exception:
            traceback.print_exc()
        print "Sleeping 5 seconds ... "
        time.sleep(5)
    # f = codecs.open("detail", 'r', 'utf-8')
    # soup = BeautifulSoup(f.read())
    # get_business(soup, 11566327)
    # save(business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)
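The detail script pulls the coordinates out of a ({lng:...,lat:...}) fragment with string slicing and assembles its JSON arrays by hand. A minimal sketch of the same two steps with the standard re and json modules is shown below. It assumes Python 3, the helper names parse_point and deals_to_json are made up, and the sample script text is invented to match the shape the original code searches for. Unlike the original, it rounds rather than truncates when scaling by 1,000,000.

```python
import json
import re

# Matches the "({lng:<number>,lat:<number>}" fragment the original code looks for.
POINT_RE = re.compile(r"\(\{lng:(-?\d+(?:\.\d+)?),lat:(-?\d+(?:\.\d+)?)\}")


def parse_point(script_text):
    """Return (lat, lng) as integers scaled by 1e6, or None if not found."""
    match = POINT_RE.search(script_text)
    if match is None:
        return None
    lng = float(match.group(1))  # longitude comes first in the fragment
    lat = float(match.group(2))
    return int(round(lat * 1000000)), int(round(lng * 1000000))


def deals_to_json(deals):
    """Serialize a list of deal dicts; json.dumps escapes quotes in names."""
    return json.dumps(deals, ensure_ascii=False)


# Invented sample text, shaped like the fragment described above.
sample = "map.init({lng:116.397128,lat:39.916527});"
point = parse_point(sample)
deals_json = deals_to_json([{"id": "123", "name": 'A "quoted" deal'}])
print(point, deals_json)
```

Returning None when the fragment is missing lets the caller skip a shop instead of raising inside float().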
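Note: keep the scraping rate as low as you reasonably can. Right, it's already the 16th and I'm off on my long holiday. Everyone, once the overtime is done, head home and have a great New Year.
Parsing library used: http://www.crummy.com/software/BeautifulSoup/bs4/doc/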
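The advice about keeping the crawl rate down can be sketched as a small delay helper with random jitter, called once per request. This is an illustration in Python 3, not part of the original scripts. The name polite_sleep and the jitter idea are assumptions, and the sleep function is injectable only so the helper can be exercised without actually waiting.

```python
import random
import time


def polite_sleep(base_seconds, jitter_seconds=0.0, sleep=time.sleep):
    """Pause for base_seconds plus a random extra of up to jitter_seconds.

    Returns the delay that was used, which is handy for logging.
    """
    delay = base_seconds + random.uniform(0, jitter_seconds)
    sleep(delay)
    return delay


# Example: a very short pause so the demo finishes quickly.
used = polite_sleep(0.01, 0.01)
print("slept for %.3f seconds" % used)
```

In the crawl loops above, a call such as polite_sleep(1, 2) would take the place of the fixed time.sleep(1).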