Steps for scraping every car model on Autohome (汽车之家) with python3


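Preface:

There are already plenty of tutorials online on using python3 to scrape car data from Autohome (mainly the basic, configuration, color, and interior parameters), but they broadly fall into two approaches: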

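1. Fetch the page of one Autohome car model, use regular expressions to pull out the obfuscated data object and the obfuscated js, run the obfuscated js through pyv8 to get the normal characters back, and then match those characters against the data object. For the details see this fellow cnblogs user's post: https://www.cnblogs.com/my8100/p/js_qichezhijia.html (thanks to the author for the ideas behind the first half)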

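2. Fetch the page of one Autohome car model, use regular expressions to pull out the obfuscated data object and the obfuscated js, and then match the obfuscated js by hand. The obfuscated js falls into roughly 8 categories, among them: a no-argument function returning a constant, a no-argument function returning a function, a function returning its own argument, a no-argument function returning a constant with no obfuscation code in between, a no-argument constant used in string concatenation, and a function returning its argument used in string concatenation. Regular expressions pick out each of the 8 categories and substitute them one by one, which also yields an ordered string. Substituting that string back into the data object replaces every span in it with Chinese text. For the details see this fellow cnblogs user's post: https://www.cnblogs.com/dyfblog/p/6753251.html (thanks to the author for the ideas behind the first half)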
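To make the second approach more concrete, here is a minimal sketch of just one of its substitution passes: functions that take no arguments and return a constant. The js snippet, both regular expressions, and the name `inline_constant_functions` are made up for illustration; the real obfuscated js on Autohome is more involved and is not reproduced here.

```python
import re

# Hypothetical pattern for one category only:
#   function Name() { return 'constant'; }
CONST_FUNC = re.compile(
    r"function\s+(\w+)\s*\(\s*\)\s*\{\s*return\s+'([^']*)'\s*;?\s*\}"
)


def inline_constant_functions(js):
    """Replace calls to no-argument constant functions with the constant."""
    # Map every matching function name to the constant it returns.
    constants = dict(CONST_FUNC.findall(js))
    # Drop the function definitions themselves.
    stripped = CONST_FUNC.sub("", js)

    def replace_call(match):
        name = match.group(1)
        if name in constants:
            return "'" + constants[name] + "'"
        return match.group(0)  # leave unknown calls untouched

    # Substitute each `Name()` call that refers to a known constant function.
    return re.sub(r"\b(\w+)\(\s*\)", replace_call, stripped)


# Toy input, not real Autohome js.
toy_js = (
    "function Ab_(){return 'AB';} "
    "function Cd_(){return 'CD';} "
    "var s = Ab_() + Cd_();"
)
result = inline_constant_functions(toy_js).strip()
print(result)
```

Each of the other categories would need its own pattern and its own pass, which is where most of the effort in this approach goes.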

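Given my limited skills, I never got either of these two approaches to run end to end, even after a week of trying. But I am the kind of person who cannot let a problem go, so below I present an approach I worked out myself. The process is a little more involved, but it is steady and reliable, and it does get the data out. Enough talk, here are the steps: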

1. Fetch the pages of all car models and save them locally:

import os

import bs4
import requests as req

# Step 1: download the page of every car model.

def mainMethod():
    '''
    Fetch the Autohome page of every car model and save it to the D: drive.
    '''
    li = [chr(i) for i in range(ord("A"), ord("Z") + 1)]
    firstSite = "https://www.autohome.com.cn/grade/carhtml/"
    firstSiteSurfixe = ".html"
    secondSite = "https://car.autohome.com.cn/config/series/"
    secondSiteSurfixe = ".html"
    outDir = "d:\\autoHome\\html\\"
    os.makedirs(outDir, exist_ok=True)  # make sure the output directory exists

    for a in li:
        requestUrl = firstSite + a + firstSiteSurfixe
        print(requestUrl)
        # Start fetching the models of each brand under this initial letter
        resp = req.get(requestUrl)
        bs = bs4.BeautifulSoup(str(resp.content, "gbk"), "html.parser")
        bss = bs.find_all("li")
        con = 0
        for b in bss:
            d = b.h4
            # Skip list items that do not carry a model link
            if d is None or d.a is None:
                continue
            # Reduce the link to the bare model id
            her = str(d.a.attrs['href'])
            her = her.split("#")[0]
            her = her[her.index(".cn") + 3:].replace("/", '')
            secSite = secondSite + her + secondSiteSurfixe
            print("secSite=" + secSite)
            # e.g. Audi A3
            resp = req.get(secSite)
            text = str(resp.content, encoding="utf-8")
            print(a)
            # "w" so that a re-run overwrites the file instead of appending a second copy
            with open(outDir + str(her), "w", encoding="utf-8") as fil:
                fil.write(text)
            con = con + 1
        # Number of model pages saved for this letter
        print(con)

if __name__ == "__main__":
    mainMethod()
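The href handling in the middle of that loop is easy to try in isolation. The sketch below repeats the same string operations on a sample link. The helper names `series_id_from_href` and `config_url`, and the sample href itself, are assumptions for illustration and are not taken from a live page.

```python
CONFIG_SITE = "https://car.autohome.com.cn/config/series/"
CONFIG_SUFFIX = ".html"


def series_id_from_href(href):
    """Cut a model link down to its bare id, as the loop in step 1 does."""
    href = href.split("#")[0]           # drop the fragment after '#'
    tail = href[href.index(".cn") + 3:]  # keep what follows the ".cn" host
    return tail.replace("/", "")        # strip the surrounding slashes


def config_url(series_id):
    """Build the config page URL that step 1 downloads for one model."""
    return CONFIG_SITE + series_id + CONFIG_SUFFIX


# Hypothetical href in the shape the step 1 code expects.
sample_href = "//www.autohome.com.cn/3170/#levelsource=000000000_0"
series_id = series_id_from_href(sample_href)
print(series_id)
print(config_url(series_id))
```

Note that `href.index(".cn")` raises `ValueError` when the link does not contain ".cn", so a link in a different shape would stop the whole run unless it is caught.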
2. Extract the key js from each model's page, assemble it into an html file, and save it locally.

import os
import re

# Step 2: extract the key js from each model's page and assemble it into an html file.

if __name__ == "__main__":
    print("Start...")
    rootPath = "D:\\autoHome\\html\\"
    files = os.listdir(rootPath)
    for file in files:
        print("fileName==" + file)
        # Read the whole saved page
        with open(rootPath + file, 'r', encoding="utf-8") as f:
            text = f.read()
        # Stub out the browser objects the obfuscated js expects (document, window)
        # and collect the rules passed to insertRule, joined by '#'
        alljs = ("var rules = '2';"
                 "var document = {};"
                 "function getRules(){return rules}"
                 "document.createElement = function() {"
                 "    return {"
                 "        sheet: {"
                 "            insertRule: function(rule, i) {"
                 "                if (rules.length == 0) {"
                 "                    rules = rule;"
                 "                } else {"
                 "                    rules = rules + '#' + rule;"
                 "                }"
                 "            }"
                 "        }"
                 "    }"
                 "};"

                 "document.head = {};"
                 "document.head.appendChild = function() {};"

                 "var window = {};"
                 "window.decodeURIComponent = decodeURIComponent;")
        try:
            # Raw string so that the backslashes reach the regex engine unchanged
            js = re.findall(r'(\(function\([a-zA-Z]{2}.*?_\).*?\(document\);)', text)
            for item in js:
                alljs = alljs + item
        except Exception as e:
            print('makejs function exception')


        newHtml = "
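The regular expression used in this step can be tried on its own. The sketch below applies the same pattern to a made-up one-line page fragment. The sample html and the helper name `extract_js_chunks` are assumptions for illustration; real pages are far larger.

```python
import re
from typing import List

# Same pattern as in step 2, written as a raw string.
JS_CHUNK = re.compile(r'(\(function\([a-zA-Z]{2}.*?_\).*?\(document\);)')


def extract_js_chunks(html: str) -> List[str]:
    """Return every self-invoking `(function(xx_){...})(document);` chunk."""
    return JS_CHUNK.findall(html)


# Hypothetical page fragment with two obfuscated chunks and some noise.
sample = (
    "<script>(function(ab_){ab_.title='x';})(document);</script>"
    "<p>noise</p>"
    "<script>(function(cd_){cd_.title='y';})(document);</script>"
)
chunks = extract_js_chunks(sample)
for chunk in chunks:
    print(chunk)
```

By default `.` does not match a newline, so this pattern only finds a chunk when the whole chunk sits on a single line of the saved page.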
