Scraping Lianjia Rental Listings with a Python Crawler



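I'm new to Python and to web crawlers, and I happened to be looking for a place to rent in Beijing, so I decided to try scraping Lianjia's rental listings myself. I had seen people online scrape Lianjia's second-hand home sale data and used that as a reference. Scraping the rental data turned out to be fairly simple: there is no need to simulate a login, and Lianjia does little to block crawlers, so things went smoothly.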

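The overall approach: although Lianjia does not use many anti-crawler techniques, it does apply the basic limit on request frequency per IP, so proxies are needed. Proxies in turn slow the crawl down, so multithreading is needed as well. I implemented things in this order: first the proxy scraping, then single-threaded scraping of a single page, then the multithreaded version, and finally the combination with proxies.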

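First, the proxy part. A quick search turned up several free proxy sites that are updated continuously: Kuaidaili, Xicidaili, and proxy360. I wrapped each of these sites in its own class that extracts proxy IPs and ports. Since I'm a beginner and wanted to practice the basics, I used the clumsiest tool, regular expressions, to extract the data.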
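
As a minimal sketch of the regex approach (written for Python 3, unlike the Python 2 listing that follows), the snippet below pulls `ip:port` pairs out of an HTML table. The markup in `SAMPLE_HTML` and the names `ROW_PATTERN` and `extract_proxies` are made up for illustration and do not reproduce any of the proxy sites' real pages.

```python
import re

# Hypothetical markup: one <tr> per proxy, IP in the first <td>, port in the second.
SAMPLE_HTML = """
<table>
  <tr><td>10.0.0.1</td><td>8080</td><td>high anonymity</td></tr>
  <tr><td>10.0.0.2</td><td>3128</td><td>transparent</td></tr>
</table>
"""

# The two groups capture the IP cell and the port cell of each row.
ROW_PATTERN = re.compile(
    r'<tr>\s*<td>(\d{1,3}(?:\.\d{1,3}){3})</td>\s*<td>(\d+)</td>', re.S)


def extract_proxies(html):
    """Return a list of 'ip:port' strings found in the given HTML."""
    return ['%s:%s' % (ip, port) for ip, port in ROW_PATTERN.findall(html)]


print(extract_proxies(SAMPLE_HTML))  # ['10.0.0.1:8080', '10.0.0.2:3128']
```

A pattern like this is tied to the exact markup, so it has to be rewritten whenever a site changes its page layout.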

Here is the full code:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'ATP'
import urllib
import urllib2
import re
import time
import threading
import socket
from bs4 import BeautifulSoup
import sys
import random
import getProxy
reload(sys)
sys.setdefaultencoding('utf-8')

BSparser = 'html.parser'
UserAgents=['Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
            'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.12 Safari/535.11',
            'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)',
            'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/44.0.2403.89 Chrome/44.0.2403.89 Safari/537.36',

            'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
            'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
            'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',

            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
            'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
            'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
            'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
            'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36']


#obtain proxy from proxy360.cn
class proxy_proxy360:
    def __init__(self):
        self.region=['Brazil','China','America','Taiwan','Japan','Thailand','Vietnam','bahrein']
        self.root_url = 'http://www.proxy360.cn/Region/'
        self.qqproxy_url='http://www.proxy360.cn/QQ-Proxy'
        self.msnproxy_url = 'http://www.proxy360.cn/MSN-Proxy'

    def getUrlList(self,proxyType='normal',proxyLocation='China'):
        url = []
        if proxyType == 'normal' or proxyType == 'all':
            if proxyLocation == 'all':
                for ri in self.region:
                    url.append(self.root_url + ri)
            else:
                url.append(self.root_url + proxyLocation)
        if proxyType == 'qq' or proxyType == 'all':
            url.append(self.qqproxy_url)
        if proxyType == 'msn' or proxyType == 'all':
            url.append(self.msnproxy_url)
        return url

    def getProxy(self,url=[]):
        items = []
        try:
            for urlitem in url:
                print 'Get proxy from url: ' + urlitem
                #request = urllib2.Request(urlitem,headers=headers)
                request = urllib2.Request(urlitem)
                response = urllib2.urlopen(request)
                content = response.read().decode('utf-8')
                pattern = re.compile(r'.*?>'
                                        r'.*?>\s*(.*?)\s*'
                                        r'.*?>\s*(.*?)\s*'
                                        r'.*?>\s*(.*?)\s*'
                                        r'.*?>\s*(.*?)\s*'
                                        r'.*?>\s*(.*?)\s*'
                                        r'.*?>\s*(.*?)\s*'
                                        r'.*?>\s*(.*?)\s*'
                                        r'.*?>\s*(.*?)\s*'
                                        r'.*?/div>.*?>.*?', re.S)
                itemslist = re.findall(pattern,content)
                for itemsnew in itemslist:
                    itemproperty = {}
                    itemproperty['ip']=itemsnew[0]
                    itemproperty['port'] = itemsnew[1]
                    itemproperty['anony_degree'] = itemsnew[2]
                    itemproperty['location'] = itemsnew[3]
                    itemproperty['updata_time'] = itemsnew[4]
                    itemproperty['today_mark'] = itemsnew[5]
                    itemproperty['total_mark'] = itemsnew[6]
                    itemproperty['available_time'] = itemsnew[7]
                    itemproperty['speed'] = itemsnew[8]
                    items.append(itemproperty)
        except (urllib2.HTTPError, urllib2.URLError), e:
            print e
        except Exception,e:
            print e
        return items

#obtain proxy from kuaidaili.com
class proxy_kuaidaili:
    def __init__(self):
        self.root_latest = 'http://www.kuaidaili.com/proxylist/'
        self.root_free = 'http://www.kuaidaili.com/free/'
        self.freeType = ['latest','inha','intr','outha','outtr']

    def getUrlList(self,proxyType='latest',pageNum=10):
        url = []
        if proxyType == 'latest':
            for i in range(0,min(pageNum,10)):
                url.append(self.root_latest+str(i+1))
        elif proxyType in self.freeType:
            try:
                urlitem = self.root_free + proxyType
                request = urllib2.Request(urlitem)
                response = urllib2.urlopen(request)
                content = response.read().decode('utf-8')
                pattern = re.compile(u'(.*?)\u9875',re.S)
                lastPage = re.findall(pattern,content)
                if lastPage:
                    for i in range(0,min(pageNum,int(lastPage[0]))):
                        url.append(urlitem+'/'+str(i+1))
            except (urllib2.HTTPError, urllib2.URLError), e:
                print e
            except Exception,e:
                print e
        else:
            print "Input error no such type of proxy:"+proxyType
        return url

    def getProxy(self,url=[]):
        proxy = []
        for urlitem in url:
            try:
                print "Get proxy from url: " + urlitem
                request = urllib2.Request(urlitem)
                response = urllib2.urlopen(request)
                content = response.read().decode('utf-8')
                pattern = re.compile(r'IP<.*?PORT.*?\s*(.*?)',re.S)
                proxyTable = re.findall(pattern,content)
                if proxyTable:
                    content=proxyTable[0]
                    pattern = re.compile(r'get/post')
                    bGET_POST = re.findall(pattern,content)
                    if bGET_POST:
                        pattern = re.compile(r'.*?(.*?)'
                                             r'.*?(.*?)'
                                             r'.*?(.*?)'
                                             r'.*?(.*?)'
                                             r'.*?(.*?)'
                                             r'.*?(.*?)'
                                             r'.*?(.*?)'
                                             r'.*?(.*?)',re.S)
                        proxylistget = re.findall(pattern,content)
                        for proxyitem in proxylistget:
                            proxyitem_property = {}
                            proxyitem_property['ip'] = proxyitem[0]
                            proxyitem_property['port'] = proxyitem[1]
                            proxyitem_property['anony_degree'] = proxyitem[2]
                            proxyitem_property['proxy_type'] = proxyitem[3]
                            proxyitem_property['get_post_support'] = proxyitem[4]
                            proxyitem_property['location'] = proxyitem[5]
                            proxyitem_property['speed'] = proxyitem[6]
                            proxyitem_property['last_verify_time'] = proxyitem[7]
                            proxy.append(proxyitem_property)
                    else :
                        pattern = re.compile(r'.*?(.*?)'
                                             r'.*?(.*?)'
                                             r'.*?(.*?)'
                                             r'.*?(.*?)'
                                             r'.*?(.*?)'
                                             r'.*?(.*?)'
                                             r'.*?(.*?)', re.S)
                        proxylistget = re.findall(pattern, content)
                        for proxyitem in proxylistget:
                            proxyitem_property = {}
                            proxyitem_property['ip'] = proxyitem[0]
                            proxyitem_property['port'] = proxyitem[1]
                            proxyitem_property['anony_degree'] = proxyitem[2]
                            proxyitem_property['proxy_type'] = proxyitem[3]
                            proxyitem_property['location'] = proxyitem[4]
                            proxyitem_property['speed'] = proxyitem[5]
                            proxyitem_property['last_verify_time'] = proxyitem[6]
                            proxy.append(proxyitem_property)
            except (urllib2.HTTPError, urllib2.URLError), e:
                print e
            except Exception,e:
                print e
        return proxy

class proxy_xicidaili:
    def __init__(self):
        self.root_url = 'http://www.xicidaili.com/'
        self.proxy_type = ['nn','nt','wn','wt','qq']
        self.user_agent = UserAgents[14]

    def getUrlList(self,proxyType='nn',pageNum=10):
        url = []
        if proxyType == 'all':
            for item in self.proxy_type:
                url.extend(self.getUrlList(item,pageNum))
        elif proxyType in self.proxy_type:
            try:
                urlitem = self.root_url + proxyType
                headers = {'User-Agent': self.user_agent}
                request = urllib2.Request(urlitem,headers=headers)
                response = urllib2.urlopen(request)
                content = response.read().decode('utf-8')
                soup = BeautifulSoup(content,BSparser)
                atag = soup.find(class_='next_page')
                num = int(atag.find_previous_sibling().string)
                for i in range(0,min(num,pageNum)):
                    url.append(urlitem+'/'+str(i+1))
            except (urllib2.HTTPError, urllib2.URLError), e:
                print e
            except Exception,e:
                print e
        else:
            print "Input error no such type of proxy:" + proxyType
        return url

    def getProxy(self,url=[]):
        proxy = []
        for urlitem in url:
            try:
                print "get proxy from url: " + urlitem
                headers = {'User-Agent': self.user_agent}
                request = urllib2.Request(urlitem, headers=headers)
                response = urllib2.urlopen(request)
                content = response.read().decode('utf-8')
                soup = BeautifulSoup(content, BSparser)
                tabletag = soup.find(id='ip_list')
                thtag = tabletag.find('tr')
                trtags = thtag.find_next_siblings('tr')
                for trtag in trtags:
                    tdtags = trtag.find_all('td',recursive=False)
                    proxyitem_property = {}
                    if tdtags[0].img:
                        proxyitem_property['country']=tdtags[0].img['alt']
                    else:
                        proxyitem_property['country']=tdtags[0].string
                    proxyitem_property['ip'] = tdtags[1].string
                    proxyitem_property['port'] = tdtags[2].string
                    if tdtags[3].a :
                        proxyitem_property['location'] = tdtags[3].a.string
                        locatext = tdtags[3].a['href']
                        pattern = re.compile(r'/[\-0-9]/(.*?)')
                        loca = re.findall(pattern, locatext)
                        if loca:
                            proxyitem_property['location-pinyin'] = loca[0]
                    else:
                        proxyitem_property['location'] = tdtags[3].string
                    proxyitem_property['anony_degree'] = tdtags[4].string
                    proxyitem_property['proxy_type'] = tdtags[5].string
                    proxyitem_property['speed'] = tdtags[6].div['title']
                    proxyitem_property['connect_speed'] = tdtags[7].div['title']
                    proxyitem_property['alive_time'] = tdtags[8].string
                    proxyitem_property['last_verify_time'] = tdtags[9].string
                    proxy.append(proxyitem_property)
            except (urllib2.HTTPError, urllib2.URLError), e:
                print e
            except Exception,e:
                print e
        return proxy

def proxyValidate(target_url, key_word, proxylist,timeout=5):
    resultList = {}
    for proxy in proxylist:
        try:
            starttime = time.time()
            headers = {'User-Agent': UserAgents[random.randint(0, len(UserAgents) - 1)]}
            proxy_handler = urllib2.ProxyHandler({"http": proxy})
            opener = urllib2.build_opener(proxy_handler)
            urllib2.install_opener(opener)
            request = urllib2.Request(target_url, headers=headers)
            response = urllib2.urlopen(request,timeout=timeout)
            content = response.read().decode('utf-8')
            pattern = re.compile(unicode(key_word), re.S)
            keywordFind = re.findall(pattern, content)
            endtime = time.time()
            if keywordFind:
                proxyAvailable = 'true'
                print ' proxy: %s --%s thread: %s time %s'% (proxy,'success',threading.current_thread().name,endtime - starttime)
                plistlock.acquire()
                try:
                    successList[proxy]=endtime-starttime
                finally:
                    plistlock.release()
            else:
                proxyAvailable = 'false'
                print ' proxy: %s --%s thread: %s time %s' % (proxy,'failed',threading.current_thread().name,endtime - starttime)
                plistlock.acquire()
                try:
                    failList[proxy]=endtime-starttime
                finally:
                    plistlock.release()
        except (urllib2.HTTPError, urllib2.URLError), e:
            proxyAvailable = str(e)
            print ' proxy: %s -httperror-%s thread: %s'% (proxy,e,threading.current_thread().name)
        except Exception, e:
            proxyAvailable = str(e)
            print ' proxy: %s -other-%s thread: %s'% (proxy,e,threading.current_thread().name)
        if proxyAvailable != 'true' and 'false' != proxyAvailable:
            plistlock.acquire()
            try:
                errorList[proxy]=proxyAvailable
            finally:
                plistlock.release()
        resultList[str(proxy)] = proxyAvailable
    return resultList

#what if python2.7 support variable parameters, things will go easier...make it more universally
def multiThreadValidate(dealList,threadnum,func,args,joinImmediate = True):
    if len(dealList) == 0:
        return
    successList.clear()
    failList.clear()
    errorList.clear()
    listLen = len(dealList)
    threadlist = []
    apartNum = (len(dealList)-1)/threadnum+1
    if apartNum < 2:
        apartNum = 2
    ctn = apartNum -1
    while ctn < listLen:
        th = threading.Thread(target=func, args=(args[0],args[1],dealList[ctn-apartNum+1:ctn+1]),name=str(ctn / apartNum))
        th.start()
        threadlist.append(th)
        ctn = ctn + apartNum
    if listLen % apartNum != 0:
        th = threading.Thread(target=func, args=(args[0],args[1],dealList[ctn-apartNum+1:listLen]),name=str(ctn / apartNum))
        th.start()
        threadlist.append(th)
    if joinImmediate:
        for ti in threadlist:
            ti.join()
    return threadlist

successList = {}   #format: {'proxyip:port':'time of access',}
failList = {}
errorList = {}
plistlock = threading.RLock()

if __name__=="__main__":
    starttime = time.time()
    spider = proxy_kuaidaili()
    url = spider.getUrlList('outha',10)
    proxy_items = spider.getProxy(url)
    print 'num of proxy to be test: '+ str(len(proxy_items))
    proxylist = []
    for proxyi in proxy_items:
        proxylist.append(proxyi['ip'] + ':' + proxyi['port'])
    socket.setdefaulttimeout(5)
    multiThreadValidate(proxylist,20,proxyValidate,('http://bj.lianjia.com/','了解链家网'.decode()))
    # print len(successList)
    for i in successList:
        print '%s : %s ' % (i,successList[i])
    endtime = time.time()
    print 'total time cost: %s' % str(endtime-starttime)


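The classes are all similar; in C# they could implement a common interface. getUrlList first generates the URLs of the pages to scrape, and getProxy then scrapes the proxy IPs and ports from them. In addition, proxyValidate and multiThreadValidate verify the scraped proxies, since free proxies are of low quality: many do not work at all or stay usable only briefly. Validation downloads a given URL through the proxy and matches a keyword in the response; a match means the proxy works. This is the simplest approach. multiThreadValidate is the multithreaded version: it splits the proxy list into slices, and each thread runs proxyValidate on its own slice.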
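
The task splitting can be sketched in a simplified form as follows, in Python 3. The names `split_into_chunks`, `validate_in_threads`, and `fake_check` are illustrative only, and `check` stands in for the real download-and-match-keyword test, so this is an outline of the idea under those assumptions and not the listing's exact slicing logic.

```python
import threading


def split_into_chunks(items, thread_num):
    """Split items into at most thread_num contiguous chunks of near-equal size."""
    if not items:
        return []
    size = (len(items) - 1) // thread_num + 1   # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def validate_in_threads(proxies, thread_num, check):
    """Run check(proxy) -> bool for every proxy, one thread per chunk."""
    good = []
    lock = threading.Lock()

    def worker(chunk):
        for proxy in chunk:
            if check(proxy):
                with lock:              # the result list is shared between threads
                    good.append(proxy)

    threads = [threading.Thread(target=worker, args=(chunk,))
               for chunk in split_into_chunks(proxies, thread_num)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(good)


# Stand-in check: pretend only proxies on port 8080 respond correctly.
fake_check = lambda proxy: proxy.endswith(':8080')
print(validate_in_threads(['a:8080', 'b:3128', 'c:8080'], 2, fake_check))
# ['a:8080', 'c:8080']
```

Because each check spends most of its time waiting on the network, threads give a real speedup here despite Python's global interpreter lock.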

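For the actual Lianjia scraping, the data volume is fairly large (too much for plain text files, but not enough to justify a heavyweight high-performance database) and the data is all structured, so I store it in lightweight SQLite. Accessing the database from multiple threads raises consistency problems. Because network I/O is slow and database reads and writes are infrequent, a thread lock that lets only one thread access the database at a time is sufficient.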

The database wrapper is as follows:

    import sqlite3
    import threading

    class SQLiteWraper(object):
        def __init__(self,conn_arg,command='',*args,**kwargs):  
            self.lock = threading.RLock()
            self.path = conn_arg
            if command!='':
                conn=self.get_conn()
                cu=conn.cursor()
                cu.execute(command)
    
        def get_conn(self):  
            conn = sqlite3.connect(self.path)#,check_same_thread=False)  
            conn.text_factory=str
            return conn   
    
        def conn_close(self,conn=None):  
            conn.close()  
    
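        def conn_trans(func):
            # decorator: hold the lock and keep one connection open for a single call
            def connection(self,*args,**kwargs):
                self.lock.acquire()
                try:
                    conn = self.get_conn()
                    try:
                        kwargs['conn'] = conn
                        return func(self,*args,**kwargs)
                    finally:
                        # close the connection even if func raises
                        self.conn_close(conn)
                finally:
                    # always release the lock so other threads are not blocked forever
                    self.lock.release()
            return connection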
    
        @conn_trans
        def execute(self, command, conn=None):
            cu = conn.cursor()
            try:
                cu.execute(command)
                conn.commit()
            except sqlite3.IntegrityError, e:
                print e
                return -1
            except Exception, e:
                print e
                return -2
            return 0
    
        @conn_trans
        def fetchall(self,command="select count(*)",conn=None):
            cu=conn.cursor()
            lists=[]
            try:
                cu.execute(command)
                lists=cu.fetchall()
            except Exception,e:
                print e
            return lists

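To keep the thread lock from being misused and causing deadlocks, and to make sure database connections are closed promptly, I defined a higher-order function conn_trans, i.e. a decorator. A global function createDB initializes the database; it runs only once.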
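
For comparison, here is a minimal Python 3 sketch of the same decorator idea using `with` and `try`/`finally`. The class `LockedDB`, the decorator `with_conn`, and the `room` table are hypothetical names chosen for this example; they are not part of the crawler.

```python
import functools
import os
import sqlite3
import tempfile
import threading


class LockedDB(object):
    """Hypothetical helper: serializes access and opens one connection per call."""

    def __init__(self, path):
        self.path = path
        self.lock = threading.RLock()

    def with_conn(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            with self.lock:                      # released even if func raises
                conn = sqlite3.connect(self.path)
                try:
                    return func(self, *args, conn=conn, **kwargs)
                finally:
                    conn.close()                 # always close the connection
        return wrapper

    @with_conn
    def execute(self, sql, params=(), conn=None):
        conn.execute(sql, params)
        conn.commit()

    @with_conn
    def fetchall(self, sql, params=(), conn=None):
        return conn.execute(sql, params).fetchall()


# Use a file-backed database: each call opens a fresh connection,
# so an in-memory database would be empty on every call.
path = os.path.join(tempfile.mkdtemp(), 'demo.db')
db = LockedDB(path)
db.execute('CREATE TABLE room (hid TEXT PRIMARY KEY, price INTEGER)')
db.execute('INSERT INTO room VALUES (?, ?)', ('101', 5200))
print(db.fetchall('SELECT * FROM room'))  # [('101', 5200)]
```

Opening a connection per call is slower than sharing one, but it sidesteps sqlite3's default restriction that a connection may only be used from the thread that created it.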

The class that scrapes Lianjia data is structured as follows:

    class crawlLianJia:
        def __init__(self):
            pass
            
        def threadUpdateProxy(self):
            pass
        
        def getContentbyProxy(self,url,succPattern='',failPattern='',succCheckFirst= False,tryMax = 10,timeout = 10):
            pass
    
        #crawl with a single thread; after about 1,000 requests the lianjia website blocks the IP
        def crawAllRentInfo(self):
            pass
            
        #craw with multiple threads
        def crawAllRentInfo_multi(self,threadnum):
            pass
    
        #python 2.7 does not support variable parameters very well, pity
        def multiThreadCrawList(self,threadnum,deallist,func,threadNamePrefix='',joinImmediate = True,*argspass,**kwargs):
            pass
    
        #get regions from the website, then crawl by region, because the website only presents 100 pages per query
        def getRegion(self):
            pass
    
        #save regions and bizcircles(subRegions) crawed to database, because the regions only need to craw once
        def updateRegionInfo(self):
            pass
            
        def getRegionFromDB(self):
            pass
    
        #craw info of the given regions
        def crawRentInfo_Regions(self,subRegionList):
            pass
    
        #return a list of pages to be crawed in the region
        def crawRentInfo_certainRegion(self,region):
            pass
            
        #save all pages need be crawed to database
        def updatePageList(self):
            pass
            
        def getPageListFromDB(self):
            pass
            
        def crawRentInfo_Pages(self,pageList=[]):
            pass
            
        def crawRentInfo_certainPage(self,urlpage):
            pass
            
        def crawRentInfo_pageTagInfo(self,tagText):
            pass
            
        def crawRentInfo_certainRoom(self,hid,isRemoved=False):
            pass
            
        def crawRentInfo_community(self,cid):
            pass
            
        def crawRentInfo_json(self,HID,CID):
            pass
            
        #return True or False to indicate whether a room is already in the database (has been crawled)
        def dealRentRoomCrawed(self,hid):
            pass
            
        def dealRentCommunityCrawed(self,cid):
            pass
            
        #insert a new record into a certain table of database
        def insertInfo(self,table='',infoDic={}):
            pass
            
        #update the information of a certain record
        def updateInfo(self,table,primaryKey,infoDic={}):
            pass
            
        #insert some key-value pairs to the table
        def insertDicListInfo(self,table='',infoDicList=[]):
            pass
            
        def loadErrorListFromFile(self,filename = 'errorlist'):
            pass
            
        def dumpErrorListToFile(self,filename = 'errorlist'):
            pass
            
        def recrawError_Comms(self,commList):
            pass
            
        def recrawError_Rooms(self,roomList):
            pass
            
        def recrawErrorList_multi(self,threadnum=10):
            pass
            
        def recrawErrorList(self):
            pass
            
    def createDB():
        pass
        
    if __name__=="__main__":
        pass

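Several list variables are accessed from multiple threads: allPages holds the URLs of all pages still to be scraped, errorPageUrlList holds the URLs of pages that failed, errorRoomList holds the IDs of homes whose details failed to scrape, and errorCommList holds the IDs of communities whose details failed to scrape. proxyList is the list of usable proxies and is updated continuously. proxyForceoutList is the list of proxies that the site has blocked.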

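First come the functions that check whether proxies are usable. threadUpdateProxy is the main one; it starts when the program starts and is essentially an infinite loop: when the list of usable proxies, proxyListOld, drops below a threshold, it scrapes new proxies and runs multiThreadValidate from the getProxy module shown earlier.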
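
The body of threadUpdateProxy is not shown, so the following is only a rough Python 3 sketch of a refill loop of that kind. `ProxyPool`, `fetch_validated`, and the threshold and interval values are assumptions made for the example; `fetch_validated` stands in for scraping and validating a fresh batch of proxies.

```python
import threading


class ProxyPool(object):
    """Hypothetical pool that a background thread tops up when it runs low."""

    def __init__(self, fetch_validated, threshold=5, interval=0.01):
        self.fetch_validated = fetch_validated  # callable returning a list of 'ip:port'
        self.threshold = threshold
        self.interval = interval
        self.proxies = []
        self.lock = threading.Lock()
        self.stop_event = threading.Event()

    def refill_loop(self):
        while True:
            with self.lock:
                low = len(self.proxies) < self.threshold
            if low:
                fresh = self.fetch_validated()   # slow network work, done outside the lock
                with self.lock:
                    for proxy in fresh:
                        if proxy not in self.proxies:
                            self.proxies.append(proxy)
            # Sleep between checks; wait() returns True as soon as stop is requested.
            if self.stop_event.wait(self.interval):
                break

    def start(self):
        thread = threading.Thread(target=self.refill_loop)
        thread.daemon = True
        thread.start()
        return thread


# Stand-in fetcher that "finds" six proxies on every call.
pool = ProxyPool(lambda: ['10.0.0.%d:8080' % i for i in range(1, 7)])
worker = pool.start()
pool.stop_event.set()
worker.join()
print(len(pool.proxies))  # 6
```

An `Event` is used in place of a bare infinite loop so that the thread can be shut down cleanly.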

getContentbyProxy downloads a given page through a proxy. It only downloads the content; parsing is done by other functions, described below.
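
Since only the signature of getContentbyProxy is given, here is a hedged Python 3 sketch of what a retry-through-proxies download could look like. The function names, the `fetch` callable, and the policy of dropping a proxy after one failure are assumptions of this example, not the author's implementation.

```python
import random
import re
import urllib.request


def fetch_with_urllib(url, proxy, timeout=10):
    """Download url through an HTTP proxy given as 'ip:port'."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({'http': proxy}))
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode('utf-8')


def get_content_by_proxy(url, proxies, fetch, succ_pattern, try_max=10):
    """Try up to try_max proxies; return the page text once succ_pattern matches.

    fetch(url, proxy) returns the page text or raises on failure.
    Proxies that fail or return an unexpected page are removed from the list.
    """
    for _ in range(try_max):
        if not proxies:
            break
        proxy = random.choice(proxies)
        try:
            content = fetch(url, proxy)
        except Exception:
            content = None
        if content is not None and re.search(succ_pattern, content):
            return content
        proxies.remove(proxy)        # treat the proxy as dead or blocked
    return None


# Offline demonstration with a fake fetcher in place of fetch_with_urllib.
def fake_fetch(url, proxy):
    if proxy == 'bad:1':
        raise IOError('connection refused')
    return '<html>rental listings</html>'


pool = ['bad:1', 'good:2']
print(get_content_by_proxy('http://example.com/', pool, fake_fetch, r'rental'))
```

Building an opener per request, as `fetch_with_urllib` does, avoids `install_opener`, which replaces the process-wide default opener and is therefore shared by all threads.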

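A few words on this. Lianjia's rental listings are organized as a tree. Taking Beijing as an example, the top level is the major districts (Dongcheng, Xicheng, Haidian, Daxing, Shunyi, and so on; mainRegion in the code), below that are the neighborhoods within each district (for example Andingmen, Anzhen, Chaoyangmenwai, and Qianmen in Dongcheng; region or subRegion in the code), and below those are the rental listings. Lianjia shows at most 100 pages of results per query, with 30 listings per page, so the full result set cannot be obtained by building URLs from a district plus a page number. A neighborhood, however, usually has no more than 30 pages of results, and neighborhoods do not overlap, so building URLs per neighborhood and per page retrieves all the data. I first download all the neighborhoods, build the URLs page by page, and save them to the database, which avoids rebuilding the URLs every time the crawler is restarted during testing.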
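
The arithmetic and URL construction can be illustrated with a short Python 3 sketch. The URL template, the neighborhood slugs, and the listing counts below are assumptions for illustration; only the limits of 100 pages and 30 listings per page come from the text.

```python
MAX_PAGES = 100        # pages the site shows per query
PER_PAGE = 30          # listings per page

# Assumed URL layout; the real pattern may differ.
URL_TEMPLATE = 'http://bj.lianjia.com/zufang/{region}/pg{page}/'


def page_count(total_listings):
    """Pages needed for a result count, capped at what the site will show."""
    pages = (total_listings + PER_PAGE - 1) // PER_PAGE   # ceiling division
    return min(pages, MAX_PAGES)


def build_page_urls(region_totals):
    """region_totals maps a neighborhood slug to its number of listings."""
    urls = []
    for region, total in sorted(region_totals.items()):
        for page in range(1, page_count(total) + 1):
            urls.append(URL_TEMPLATE.format(region=region, page=page))
    return urls


# A query with 5000 results is cut off at 100 pages, i.e. 3000 visible listings.
print(page_count(5000) * PER_PAGE)   # 3000

urls = build_page_urls({'andingmen': 65, 'anzhen': 30})
print(len(urls))                     # 4
print(urls[0])                       # http://bj.lianjia.com/zufang/andingmen/pg1/
```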

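crawAllRentInfo is the single-threaded version, used only for testing. In practice crawAllRentInfo_multi is used. It calls multiThreadCrawList to collect the URLs of all pages to be scraped: multiThreadCrawList runs crawRentInfo_Regions in multiple threads, passing a subRegionList of neighborhoods, and crawRentInfo_Regions calls crawRentInfo_certainRegion to scrape and return the search result page URLs for each neighborhood. Once crawAllRentInfo_multi has all the page URLs, it calls multiThreadCrawList again to scrape all pages in multiple threads, this time passing crawRentInfo_Pages as the scraping function. After the first run has collected all page URLs, updatePageList() can save them to the database, and later runs can load them directly with the commented-out getPageListFromDB() call. If you do not want to collect the page URLs with multiple threads, you can call crawRentInfo_Regions(self.regionUnrepeat) directly.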

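crawRentInfo_certainPage scrapes the search result page at a given URL and uses regular expressions to extract the listings on it. crawRentInfo_pageTagInfo scrapes the tag information that appears only on the search result page (and not on the detail page). Then, using the home's ID, crawRentInfo_certainRoom scrapes the home's details, crawRentInfo_community scrapes the corresponding community's information, and crawRentInfo_json scrapes the viewing records, which are delivered as JSON.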

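insertInfo builds an SQL INSERT statement from a table name and a dictionary of fields and executes it. updateInfo builds and executes an UPDATE statement. insertDicListInfo inserts records in bulk. dumpErrorListToFile serializes the data that failed to scrape to a file as JSON, and loadErrorListFromFile deserializes it. recrawErrorList_multi re-scrapes, in multiple threads, everything that failed earlier: it calls recrawError_Comms for failed communities and recrawError_Rooms for failed homes.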
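
The bodies of these helpers are not shown, so the following Python 3 sketch only illustrates the two ideas: building an INSERT from a dictionary and round-tripping an error list through JSON. The function names, the `room` table, and the sample values are hypothetical, and the use of `?` placeholders is a choice made for this example.

```python
import json
import os
import sqlite3
import tempfile


def build_insert(table, info):
    """Return (sql, values) for a parameterized INSERT built from a dict."""
    columns = sorted(info)
    sql = 'INSERT INTO %s (%s) VALUES (%s)' % (
        table, ', '.join(columns), ', '.join('?' for _ in columns))
    return sql, [info[c] for c in columns]


def dump_error_list(errors, filename):
    """Serialize the failed IDs to a JSON file."""
    with open(filename, 'w') as f:
        json.dump(errors, f)


def load_error_list(filename):
    """Read the failed IDs back from a JSON file."""
    with open(filename) as f:
        return json.load(f)


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE room (hid TEXT PRIMARY KEY, price INTEGER)')
sql, values = build_insert('room', {'hid': '101', 'price': 5200})
conn.execute(sql, values)
print(sql)   # INSERT INTO room (hid, price) VALUES (?, ?)
print(conn.execute('SELECT hid, price FROM room').fetchall())   # [('101', 5200)]

path = os.path.join(tempfile.mkdtemp(), 'errorlist')
dump_error_list({'rooms': ['101'], 'comms': ['c9']}, path)
restored = load_error_list(path)
print(restored == {'rooms': ['101'], 'comms': ['c9']})   # True
```

Values go through `?` placeholders so that scraped text cannot break the statement; the table and column names are interpolated directly and must therefore come from the program itself, never from scraped data.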

There is not much else to say. The entry point runs these steps in order.

For more details, see the original post: http://www.straka.cn/blog/python_craw_lianjia/
