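I needed the distances between the prefecture-level cities of Jiangsu, Zhejiang, Shanghai, and Anhui, so I scraped them from http://www.china6636.com/. The scraping code is as follows: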
### Site codes for the 41 cities in Jiangsu, Zhejiang, Shanghai, and Anhui
x=[27005208,27017237,27029767,27035786,27036716,27071629,27044783,27115330,27045424,27060216,
27060379,27059992,27065633,27085865,27074128,27017808,27071103,27045640,27003122,27011786,
27017472,27006461,27059466,27034352,27053195,27059466,27049842,27071035,27001264,27019067,
27016684,27034795,27053712,27023458,27141500,27040603,27021505,27044186,27061126,27125941,
27028433]
## City names corresponding to x, in the same order
x1=['常州','淮安','连云港','南京','南通','苏州','泰州','无锡','宿迁','徐州',
'盐城','扬州','镇江','上海','合肥','淮北','亳州','宿州','蚌埠','阜阳',
'淮南','滁州','六安','马鞍山','芜湖','宣城','铜陵','池州','安庆','黄山',
'杭州','宁波','温州','嘉兴','湖州','绍兴','金华','衢州','舟山','台州',
'丽水']
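Since `x` and `x1` are matched purely by position, it is worth checking that no code is listed twice before scraping. The sketch below groups names by code and reports any code shared by more than one city. It uses only the third row of each list above; `by_code` and `clashes` are names introduced here for illustration.

```python
# Third row of x and of x1, copied from the lists above.
codes = [27017472, 27006461, 27059466, 27034352, 27053195, 27059466]
names = ['淮南', '滁州', '六安', '马鞍山', '芜湖', '宣城']

# Group city names under the code they were given.
by_code = {}
for code, name in zip(codes, names):
    by_code.setdefault(code, []).append(name)

# Keep only the codes that were assigned to more than one city.
clashes = {code: cities for code, cities in by_code.items() if len(cities) > 1}
print(clashes)  # {27059466: ['六安', '宣城']}
```

The code 27059466 is assigned to both 六安 and 宣城, so one of the two entries is probably wrong and should be checked against the site before the distances for those cities are trusted.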
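#### Get the distance between each pair of cities. I wanted the output as a data frame, but nothing I tried worked!
from urllib import request,error
import re
datas=[]
x=[27005208,27017237,27029767]
y=[27005208,27017237,27029767]
for i in x:
    for j in y:
        url="http://www.china6636.com/distance/"+str(i)+"-"+str(j)
        data=request.urlopen(url).read().decode("utf-8",'ignore')
        # the HTML tags that separated these groups are missing from this listing
        pat='(.*?)(.*?)(.*?)(.*?)(.*?)或(.*?)(.*?)。'
        data_new=re.compile(pat).findall(data)
        datas.append(data_new)  ##### keep the result of every iteration; I could not work out how to unpack the tuples inside the lists
datas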
Result of the scrape:
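The nested shape of `datas` comes from `findall`, which returns a list of tuples for every page, so `datas` ends up as a list of lists of tuples. A minimal sketch of one way to unpack it into flat rows follows. It assumes the entries of `datas` are in the same order as the double loop over `x` and `y`, and that group index 1 holds the distance, following the `data_new[0][1]` indexing used in this post. The helper `to_rows` and the sample values are made up for illustration.

```python
from itertools import product

def to_rows(x, y, datas, group=1):
    """Pair each (i, j) from the double loop with one captured group."""
    rows = []
    for (i, j), matches in zip(product(x, y), datas):
        # findall() returns [] when the pattern did not match the page.
        value = matches[0][group] if matches else None
        rows.append({"from": i, "to": j, "distance": value})
    return rows

# Made-up stand-in for `datas`: one findall() result per (i, j) pair.
sample_x = [1, 2]
sample_y = [1, 2]
sample_datas = [[("a", "0", "c")], [("a", "120", "c")], [], [("a", "0", "c")]]

rows = to_rows(sample_x, sample_y, sample_datas)
for row in rows:
    print(row)
```

If pandas is installed, `pandas.DataFrame(rows)` should turn this list of dictionaries into a data frame with the columns `from`, `to`, and `distance`.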
Since I only need the distances between cities, I print just the city codes and the distance:
### The city order is already known, so only the distance values are needed
from urllib import request,error
import re
x=[27005208,27017237,27029767,27035786,27036716,27071629,27044783,27115330,27045424,27060216,
27060379,27059992,27065633,27085865,27074128,27017808,27071103,27045640,27003122,27011786,
27017472,27006461,27059466,27034352,27053195,27059466,27049842,27071035,27001264,27019067,
27016684,27034795,27053712,27023458,27141500,27040603,27021505,27044186,27061126,27125941,
27028433]
y=[27005208,27017237,27029767,27035786,27036716,27071629,27044783,27115330,27045424,27060216,
27060379,27059992,27065633,27085865,27074128,27017808,27071103,27045640,27003122,27011786,
27017472,27006461,27059466,27034352,27053195,27059466,27049842,27071035,27001264,27019067,
27016684,27034795,27053712,27023458,27141500,27040603,27021505,27044186,27061126,27125941,
27028433]
for i in x:
    for j in y:
        url="http://www.china6636.com/distance/"+str(i)+"-"+str(j)
        data=request.urlopen(url).read().decode("utf-8",'ignore')
        # the HTML tags that separated these groups are missing from this listing
        pat='(.*?)(.*?)(.*?)(.*?)(.*?)或(.*?)(.*?)。'
        data_new=re.compile(pat).findall(data)
        print(data_new[0][1]+","+str(i)+"-"+str(j))  # distance, then the two city codes
Screenshot of part of the scraped results:
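The double loop above requests every ordered pair, including each city against itself. If the distance from A to B can be assumed equal to the distance from B to A, each unordered pair only needs one request. The sketch below illustrates that idea under two assumptions that the post does not confirm: distances are symmetric, and a city is at distance 0 from itself. `distance_matrix` and `fetch` are hypothetical names; `fetch(i, j)` stands in for the download-and-regex step, and a fake version is used here so the example runs offline.

```python
from itertools import combinations

def distance_matrix(codes, fetch):
    """Build {code: {code: distance}} calling fetch once per unordered pair."""
    codes = list(dict.fromkeys(codes))  # drop repeated codes, keep order
    matrix = {i: {j: None for j in codes} for i in codes}
    for i in codes:
        matrix[i][i] = 0  # assumption: a city is at distance 0 from itself
    for i, j in combinations(codes, 2):
        d = fetch(i, j)   # one request per unordered pair
        matrix[i][j] = d
        matrix[j][i] = d  # assumption: the distance is symmetric
    return matrix

# Fake fetch that records its calls instead of hitting the website.
calls = []
def fake_fetch(i, j):
    calls.append((i, j))
    return abs(i - j)

m = distance_matrix([1, 2, 4], fake_fetch)
print(len(calls))  # 3
print(m[4][1])     # 3
```

For n distinct codes this makes n(n-1)/2 requests instead of n². With 41 codes that is at most 820 requests instead of 1681.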
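The double loop is very slow. If you only need a handful of straight-line distances, you can try the single loop below, which fixes one city and loops over the rest:
### The double loop is too slow, so use a single loop instead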
from urllib import request,error
import re
x=27017237
y=[27005208,27017237,27029767,27035786,27036716,27071629,27044783,27115330,27045424,27060216,
27060379,27059992,27065633,27085865,27074128,27017808,27071103,27045640,27003122,27011786,
27017472,27006461,27059466,27034352,27053195,27059466,27049842,27071035,27001264,27019067,
27016684,27034795,27053712,27023458,27141500,27040603,27021505,27044186,27061126,27125941,
27028433]
for j in y:
    url="http://www.china6636.com/distance/"+str(x)+"-"+str(j)
    data=request.urlopen(url).read().decode("utf-8",'ignore')
    pat='