Implementing a web crawler in Python with pyquery

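PyQuery is a powerful and flexible HTML parsing library. If regular expressions feel too cumbersome to write and BeautifulSoup's syntax is too hard to remember, and you already know jQuery's syntax, PyQuery is an excellent choice.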

Code implementation

'''
Partial page source
<tr class="mBg2" onmouseover="jQuery(this).addClass('bgon');"
    onmouseout="jQuery(this).removeClass('bgon');">
	<td class="wd1">1</td>
	<td class="wd2 bOS">
		<a target="_blank" href="http://xf.house.163.com/gz/0SJN.html">南沙心意华庭</a></td>
	<td class="wd3"><a href="#" onclick="gotrend('南沙心意华庭')"></a></td>
	<td class="wd4">南沙</td>
	<td class="wc5 bgOnS">4</td>
	<td class="wc6">425</td>
	<td class="wc7">--</td>
	<td class="wc8">--</td>
	<td class="wc9">658</td>
	<td class="wc10">54202</td>
	<td class="wc11">18</td>
	<td class="wc12">1944</td>
'''
from pyquery import PyQuery as pq

url = 'http://data.house.163.com/'
while True:
    doc = None
    for j in range(3):  # retry the request up to 3 times
        try:
            doc = pq(url=url, encoding='utf-8')  # fetch and parse the HTML; retry on failure
            break
        except Exception:
            continue
    if doc is None or not doc('.pager_a.next-page'):  # stop if the fetch failed or there is no next page
        break
    else:
        # Select nodes by class; [1:] drops the first match (presumably a header cell)
        name = [i.text() for i in doc('.wd2.bOS a').items()]
        place = [i.text() for i in doc('.wd4').items()]
        number1 = [i.text() for i in doc('.wc5.bgOnS').items()][1:]
        area1 = [i.text() for i in doc('.wc6').items()][1:]
        number2 = [i.text() for i in doc('.wc9').items()][1:]
        area2 = [i.text() for i in doc('.wc10').items()][1:]
        number3 = [i.text() for i in doc('.wc11').items()][1:]
        area3 = [i.text() for i in doc('.wc12').items()][1:]
        quhua = [i.text() for i in doc('.wd14').items()]
        # The source does not show how url advances to the next page,
        # so stop after one page instead of re-fetching the same URL forever.
        break

pyquery is used in much the same way as BeautifulSoup, and it can also be combined with requests.
