Environment: Python 2.7.15
Today we'll scrape the broker listings on Anjuke. This time we won't use regular expressions; instead we'll use BeautifulSoup. If you're new to it, skim the documentation first to follow along: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
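As a quick, self-contained taste of the API before we start (the HTML fragment here is made up for illustration; the real Anjuke markup differs):

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating one broker card, NOT the real Anjuke source.
html = """
<div class="broker">
  <h3>Zhang San</h3>
  <p class="jjr-desc">Store: Chaoyang Branch</p>
  <p class="jjr-desc xq_tag">Familiar with: Wangjing</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')   # same parser used in this article
name = soup.find('h3').get_text(strip=True)
descs = soup.find_all('p', attrs={'class': 'jjr-desc'})

# BeautifulSoup treats class as multi-valued, so BOTH <p> tags match here:
# each of them carries jjr-desc among its classes.
print(name, len(descs))
```

Note that matching on a single class name also catches tags that carry extra classes; that detail matters later when we index into the result lists.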
for page in range(1, 8):
    url = "https://beijing.anjuke.com/tycoon/p" + str(page) + "/"
    response = urllib2.urlopen(url)
    content = response.read()
The same old urllib2 routine: fetch each listing page.
First, inspect the page source and find the tags that hold the broker information, then parse the page with BeautifulSoup; the 'html.parser' argument names the parser to use:
soup = BeautifulSoup(content,'html.parser')
a = soup.find_all('h3')
b = soup.find_all(class_=re.compile("brokercard-sd-cont clearfix"))
c = soup.find_all("p", attrs={"class": "jjr-desc"})
d = soup.find_all("p", attrs={"class": "jjr-desc xq_tag"})
e = soup.find_all(class_=re.compile("broker-tags clearfix"))
a, b, c, d, e hold, respectively, each broker's name, rating, store, familiar areas, and services. Each is a list. Note that the c query matches any <p> whose classes include jjr-desc, so it also catches the jjr-desc xq_tag paragraphs; the store entries therefore sit at the even indexes, which is why the loop below reads c[2*n]. Loop over the lists in parallel and output each record:
n = 0
for jjr in a:
    o = jjr.get_text(strip=True).encode('utf-8')
    p = b[n].get_text(strip=True).encode('utf-8')
    q = c[2*n].get_text(strip=True).encode('utf-8')
    r = d[n].get_text(strip=True).encode('utf-8')
    s = e[n].get_text(strip=True).encode('utf-8')
    n += 1
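The parallel-list walk can be sketched with toy data (values invented for illustration): b, d, e line up one-to-one with the names, while c holds two matches per broker, so the wanted entry sits at every even index.

```python
# Toy stand-ins for the scraped lists; the values are made up.
names = ['Zhang San', 'Li Si']
ratings = ['4.8', '4.6']                              # like b: one entry per broker
descs = ['Store A', 'tags A', 'Store B', 'tags B']    # like c: two matches per broker

rows = []
for n, name in enumerate(names):   # enumerate replaces the manual n counter
    rows.append((name, ratings[n], descs[2 * n]))

print(rows)  # picks out the store entries at indexes 0 and 2
```

Using enumerate instead of a hand-maintained counter avoids the easy off-by-one mistake of bumping n before all five lookups are done.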
Mind the encoding here: BeautifulSoup parses the document into Unicode, which garbles when printed directly under Python 2 and in that form cannot be written to a file or database, so each value gets a trailing encode('utf-8') to re-encode it as UTF-8 bytes.
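The character-versus-byte distinction in miniature (this sketch runs the same way under Python 3, where str is already Unicode):

```python
# -*- coding: utf-8 -*-
text = u'经纪人'             # what get_text() returns: a Unicode string
raw = text.encode('utf-8')   # UTF-8 bytes, safe to write to a file or database
print(len(text), len(raw))   # 3 characters become 9 bytes
```

Decoding the bytes with the same codec round-trips back to the original string, which is why the table's charset must also be utf8.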
insert_agent = ("INSERT INTO AGENT(姓名,评价,门店,熟悉,业务) "
                "VALUES(%s,%s,%s,%s,%s)")
data_agent = (o, p, q, r, s)
cursor.execute(insert_agent, data_agent)
Remember to open the database connection and create the target table first. The complete script:
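The table setup and parameterized INSERT can be sketched with the standard-library sqlite3 module standing in for MySQLdb, so it runs without a MySQL server (sqlite3 writes placeholders as ? where MySQLdb uses %s; the table and column names here are simplified stand-ins):

```python
# -*- coding: utf-8 -*-
import sqlite3

conn = sqlite3.connect(':memory:')   # stand-in for MySQLdb.connect(...)
cursor = conn.cursor()
cursor.execute("DROP TABLE IF EXISTS AGENT")
cursor.execute("CREATE TABLE AGENT(name TEXT, rating TEXT, store TEXT)")

# Parameterized query: the driver escapes the values itself, so stray
# quotes in scraped text cannot break the SQL statement.
insert_agent = "INSERT INTO AGENT(name, rating, store) VALUES(?,?,?)"
cursor.execute(insert_agent, (u'Zhang San', u'4.8', u'Chaoyang Branch'))
conn.commit()

count = cursor.execute("SELECT COUNT(*) FROM AGENT").fetchone()[0]
print(count)
```

Passing the values as a tuple to execute(), rather than formatting them into the string yourself, is the same pattern the MySQLdb code below uses.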
# coding=utf-8
from bs4 import BeautifulSoup
import urllib2
import re
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="199855pz", db="pz", charset='utf8')
print '连接成功'
cursor = conn.cursor()
cursor.execute("DROP TABLE IF EXISTS AGENT")
sql = '''CREATE TABLE AGENT(姓名 char(4), 评价 char(50), 门店 char(50), 熟悉 char(50), 业务 char(50))'''
cursor.execute(sql)

for page in range(1, 8):
    url = "https://beijing.anjuke.com/tycoon/p" + str(page) + "/"
    response = urllib2.urlopen(url)
    content = response.read()
    soup = BeautifulSoup(content, 'html.parser')
    a = soup.find_all('h3')
    b = soup.find_all(class_=re.compile("brokercard-sd-cont clearfix"))
    c = soup.find_all("p", attrs={"class": "jjr-desc"})
    d = soup.find_all("p", attrs={"class": "jjr-desc xq_tag"})
    e = soup.find_all(class_=re.compile("broker-tags clearfix"))
    n = 0
    for jjr in a:
        o = jjr.get_text(strip=True).encode('utf-8')
        p = b[n].get_text(strip=True).encode('utf-8')
        q = c[2*n].get_text(strip=True).encode('utf-8')
        r = d[n].get_text(strip=True).encode('utf-8')
        s = e[n].get_text(strip=True).encode('utf-8')
        n += 1
        insert_agent = ("INSERT INTO AGENT(姓名,评价,门店,熟悉,业务) "
                        "VALUES(%s,%s,%s,%s,%s)")
        data_agent = (o, p, q, r, s)
        cursor.execute(insert_agent, data_agent)
conn.commit()
PS: Anjuke has since updated its site, so the page source differs a little, but the same approach still scrapes the information.