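The previous post covered scraping a website by extracting page elements with Python and exporting the results to Excel. This one covers fetching JSON from an API and saving it to a MySQL database.
This post touches on three things:
1. Finding a site's API endpoints with a packet-capture tool
2. Parsing JSON data in Python
3. Connecting to a database from Python and writing the data into it
Packet capture is not the main topic here. You can read up on it here, or search Baidu for "fiddler手机抓包" (Fiddler mobile packet capture). Also, this Jianshu post publishes the API endpoints of several sites, so you can pick them up there directly.
OK, on to the main topic. First, how Python fetches the JSON and parses it:
Fetching the JSON data: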
import urllib.request

def getHtmlData(url):
    # Send the request with a mobile-browser User-Agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'}
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    data = response.read()
    # Decode the response bytes as UTF-8
    data = data.decode('utf-8')
    return data
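getHtmlData assumes the request always succeeds. As a hedged sketch only (the name fetch_text, the DEFAULT_HEADERS value, and the 10-second timeout are assumptions for illustration, not part of the original script), the same request can be wrapped so that network errors and bad URLs return None instead of crashing the crawler:

```python
import urllib.request

# Hypothetical default; the article itself sends a mobile-browser User-Agent.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_text(url, timeout=10, headers=None):
    """Return the response body as text, or None if the request fails."""
    try:
        request = urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)
        # The with-block closes the response even if reading fails.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Prefer the charset announced by the server, fall back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset)
    except (OSError, ValueError) as exc:
        # OSError covers URLError, HTTPError and socket timeouts;
        # ValueError covers malformed URLs and decoding errors.
        print("request failed:", exc)
        return None
```

A caller can then skip the insert step whenever fetch_text returns None.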
Before parsing, let's look at the JSON we get back (there is a lot of data, so some entries with the same structure are omitted):
{
"id": 1,
"label": "头条",
"prev": "https://api.dongqiudi.com/app/tabs/android/1.json?before=1658116800",
"next": "https://api.dongqiudi.com/app/tabs/android/1.json?after=1500443152&page=2",
"max": 1658116800,
"min": 1500443152,
"page": 1,
"articles": [
{
"id": 375248,
"title": "还记得他们吗?那些年,我们也有自己的留洋军团",
"share_title": "还记得他们吗?那些年,我们也有自己的留洋军团",
"description": "",
"comments_total": 1026,
"share": "https://www.dongqiudi.com/article/375248",
"thumb": "http://img1.dongqiudi.com/fastdfs1/M00/97/55/180x135/crop/-/pIYBAFlkjm-AMc7AAAL4n-oihZs769.jpg",
"top": true,
"top_color": "#4782c4",
"url": "https://api.dongqiudi.com/article/375248.html?from=tab_1",
"url1": "https://api.dongqiudi.com/article/375248.html?from=tab_1",
"scheme": "dongqiudi:///news/375248",
"is_video": false,
"new_video_detail": null,
"collection_type": null,
"add_to_tab": "0",
"show_comments": true,
"published_at": "2022-07-18 12:00:00",
"sort_timestamp": 1658116800,
"channel": "article",
"label": "深度",
"label_color": "#4782c4"
},
{
"id": 382644,
"title": "连续三年英超主场负于水晶宫,今晚克洛普的扑克牌怎么打呢?",
"share_title": "连续三年英超主场负于水晶宫,今晚克洛普的扑克牌怎么打呢?",
"comments_total": 0,
"share": "https://www.dongqiudi.com/article/382644",
"thumb": "",
"top": false,
"top_color": "",
"url": "https://api.dongqiudi.com/article/382644.html?from=tab_1",
"url1": "https://api.dongqiudi.com/article/382644.html?from=tab_1",
"scheme": null,
"is_video": true,
"new_video_detail": "1",
"collection_type": null,
"add_to_tab": null,
"show_comments": true,
"published_at": "2017-07-19 14:55:25",
"sort_timestamp": 1500447325,
"channel": "video"
},
{
"id": 382599,
"title": "梦想不会褪色!慈善机构圆孟买贫民区女孩儿的足球梦",
"share_title": "梦想不会褪色!慈善机构圆孟买贫民区女孩儿的足球梦",
"comments_total": 9,
"share": "https://www.dongqiudi.com/article/382599",
"thumb": "http://img1.dongqiudi.com/fastdfs1/M00/9C/D3/180x135/crop/-/o4YBAFlu8F2AcFtwAACX_DJbrwo612.jpg",
"top": false,
"top_color": "",
"url": "https://api.dongqiudi.com/article/382599.html?from=tab_1",
"url1": "https://api.dongqiudi.com/article/382599.html?from=tab_1",
"scheme": null,
"is_video": true,
"new_video_detail": "1",
"collection_type": null,
"add_to_tab": null,
"show_comments": true,
"published_at": "2017-07-19 14:45:20",
"sort_timestamp": 1500446720,
"channel": "video"
}
],
"hotwords": "JJ同学",
"ad": [],
"quora": [
{
"id": 182,
"type": "ask",
"title": "足坛历史上有哪些有名的更衣室故事?",
"ico": "",
"thumb": "http://img1.dongqiudi.com/fastdfs1/M00/9B/BE/pIYBAFlt3uyACqEnAADhb9FVavU28.jpeg",
"answer_total": 222,
"scheme": "dongqiudi:///ask/182",
"position": 7,
"sort_timestamp": 1500533674,
"published_at": "2017-07-20 14:54:34"
}
]
}
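First import the package used to parse JSON:
import json
Then parse:
dataList = json.loads(data)['articles']
That's right: this single step pulls out the articles JSON array.
Next, take each object out of articles and put its fields into a Python list, ready to be written to the database later: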
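To make that step concrete, here is a small self-contained sketch. The raw string is a trimmed-down copy of the response shown above, with most fields left out; it only illustrates how json.loads maps JSON values to Python values:

```python
import json

# Trimmed-down sample of the response above (most fields omitted).
raw = """
{
  "label": "头条",
  "articles": [
    {"id": 375248, "description": "", "comments_total": 1026, "top": true,
     "scheme": "dongqiudi:///news/375248"},
    {"id": 382644, "comments_total": 0, "top": false, "scheme": null}
  ]
}
"""

payload = json.loads(raw)        # JSON object -> Python dict
articles = payload["articles"]   # JSON array  -> Python list of dicts

for article in articles:
    # JSON true/false/null arrive as Python True/False/None.
    print(article["id"], article["top"], article["scheme"])

# .get() returns None for a missing key instead of raising KeyError;
# the second article has no "description" field at all.
print(articles[1].get("description"))
```

This is why the code in this post reads fields with .get(): entries that lack a field simply produce None, which the database driver later writes as NULL.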
first_label = json.loads(data)['label']  # top-level label of the feed, e.g. "头条"
for newsObj in dataList:
    # print(newsObj.get('title'))
    # .get() returns None when a key is missing (e.g. 'description');
    # the trailing comments_total fills the extra placeholder of the
    # ON DUPLICATE KEY UPDATE clause in the insert statement used later
    newsObjs = [newsObj.get('id'), newsObj.get('title'), newsObj.get('share_title'), newsObj.get('description'),
                newsObj.get('comments_total'), newsObj.get('share'), newsObj.get('thumb'), newsObj.get('top'),
                newsObj.get('top_color'), newsObj.get('url'), newsObj.get('url1'), newsObj.get('scheme'),
                newsObj.get('is_video'), newsObj.get('new_video_detail'), newsObj.get('collection_type'),
                newsObj.get('add_to_tab'), newsObj.get('show_comments'), newsObj.get('published_at'),
                newsObj.get('channel'), str(first_label), newsObj.get('comments_total')]
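The long list of .get() calls is easy to get out of order. One possible alternative, shown only as a sketch (FIELDS and article_to_row are hypothetical names, and the sample title is made up), is to keep the field names in a single tuple and build the row from it:

```python
# Field names in the same order as the columns of the news table.
FIELDS = (
    "id", "title", "share_title", "description", "comments_total",
    "share", "thumb", "top", "top_color", "url", "url1", "scheme",
    "is_video", "new_video_detail", "collection_type", "add_to_tab",
    "show_comments", "published_at", "channel",
)


def article_to_row(article, label):
    """Build the 21-value parameter list for one article dict."""
    row = [article.get(name) for name in FIELDS]   # 19 article fields
    row.append(str(label))                          # 20th column: feed label
    row.append(article.get("comments_total"))       # value for the UPDATE clause
    return row


# Made-up article with only a few fields present.
article = {"id": 382644, "title": "example title", "comments_total": 0, "channel": "video"}
row = article_to_row(article, "头条")
print(len(row))
```

Adding or removing a column then means editing one tuple rather than hunting through a multi-line list.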
That completes the JSON parsing. Next comes the database connection:
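import pymysql

# Execute a SQL statement
def executeSql(sql, values):
    # etAddress, etPort, etName, etPassWd and etDBName are assumed to be input
    # fields defined elsewhere in the program (not shown in this post)
    conn = pymysql.connect(host=str(etAddress.get()), port=int(etPort.get()), user=str(etName.get()),
                           password=str(etPassWd.get()), database=str(etDBName.get()),
                           charset='utf8')
    cursor = conn.cursor()
    try:
        effect_row = cursor.execute(sql, values)
        # Commit; otherwise inserted or modified data is not saved
        conn.commit()
    finally:
        # Close the cursor, then the connection
        cursor.close()
        conn.close()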
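Look familiar? Connecting to a database from Python is indeed much like doing it from Java and similar languages: open a connection with the MySQL host, port, database user name, and password, execute statements through a cursor, which returns the result, and finally close both the cursor and the connection. (Connecting requires the pymysql package: install it with pip, then import it.) The SQL statements are also written much as they are in Java. The whole process looks like this: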
import json
import pymysql

# Insert news articles
def insertNews(data):
    # Crude guard against an empty response body (for example "" or "{}")
    if len(data) > 2:
        payload = json.loads(data)  # parse once and reuse
        dataList = payload['articles']
        first_label = payload['label']
        # 20 columns, plus one extra placeholder for the UPDATE clause
        sql = "insert into news(id,title,share_title,description,comments_total," \
              "share,thumb,top,top_color,url,url1,scheme,is_video,new_video_detail," \
              "collection_type,add_to_tab,show_comments,published_at,channel,label) " \
              "values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s) " \
              "ON DUPLICATE KEY UPDATE comments_total = %s"
        for newsObj in dataList:
            # print(newsObj.get('title'))
            newsObjs = [newsObj.get('id'), newsObj.get('title'), newsObj.get('share_title'), newsObj.get('description'),
                        newsObj.get('comments_total'), newsObj.get('share'), newsObj.get('thumb'), newsObj.get('top'),
                        newsObj.get('top_color'), newsObj.get('url'), newsObj.get('url1'), newsObj.get('scheme'),
                        newsObj.get('is_video'), newsObj.get('new_video_detail'), newsObj.get('collection_type'),
                        newsObj.get('add_to_tab'), newsObj.get('show_comments'), newsObj.get('published_at'),
                        newsObj.get('channel'), str(first_label), newsObj.get('comments_total')]
            executeSql(sql=sql, values=newsObjs)
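
# Execute a SQL statement
def executeSql(sql, values):
    # etAddress, etPort, etName, etPassWd and etDBName are assumed to be input
    # fields defined elsewhere in the program (not shown in this post)
    conn = pymysql.connect(host=str(etAddress.get()), port=int(etPort.get()), user=str(etName.get()),
                           password=str(etPassWd.get()), database=str(etDBName.get()),
                           charset='utf8')
    cursor = conn.cursor()
    try:
        effect_row = cursor.execute(sql, values)
        # Commit; otherwise inserted or modified data is not saved
        conn.commit()
    finally:
        # Close the cursor, then the connection
        cursor.close()
        conn.close()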
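The insert statement has to carry exactly one %s per column plus one more for the UPDATE clause, which is easy to miscount by hand. As a sketch under assumptions (COLUMNS and build_upsert_sql are hypothetical helpers, not part of the original script), the statement can be generated from the column list:

```python
# Column names of the news table, in insert order.
COLUMNS = (
    "id", "title", "share_title", "description", "comments_total",
    "share", "thumb", "top", "top_color", "url", "url1", "scheme",
    "is_video", "new_video_detail", "collection_type", "add_to_tab",
    "show_comments", "published_at", "channel", "label",
)


def build_upsert_sql(table, columns, update_column):
    """Return a MySQL INSERT ... ON DUPLICATE KEY UPDATE statement.

    table and column names are interpolated directly, so they must be
    trusted constants and never user input; the values themselves still
    go through %s placeholders filled in by the database driver.
    """
    column_list = ",".join(columns)
    placeholders = ",".join(["%s"] * len(columns))
    return (
        f"INSERT INTO {table} ({column_list}) VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {update_column} = %s"
    )


sql = build_upsert_sql("news", COLUMNS, "comments_total")
print(sql)
```

The resulting string can be passed to executeSql together with the 21-value list built in insertNews.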
Finally, in main:
data = getHtmlData(url)
insertNews(data=data)
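A note on the ON DUPLICATE KEY UPDATE clause in insertNews: when a row with the same primary key already exists, MySQL updates comments_total instead of failing, so the crawler can be run repeatedly. The sketch below shows the same idea without a MySQL server. It swaps in Python's built-in sqlite3 module, whose equivalent syntax is ON CONFLICT ... DO UPDATE (SQLite 3.24 or newer) with ? placeholders, and it uses a cut-down three-column table with a made-up title, so it is a stand-in and not the MySQL code itself:

```python
import sqlite3

# In-memory database with a cut-down version of the news table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news (id INTEGER PRIMARY KEY, title TEXT, comments_total INTEGER)"
)

# SQLite counterpart of MySQL's INSERT ... ON DUPLICATE KEY UPDATE.
UPSERT = (
    "INSERT INTO news (id, title, comments_total) VALUES (?, ?, ?) "
    "ON CONFLICT(id) DO UPDATE SET comments_total = excluded.comments_total"
)

# First crawl inserts the row; second crawl sees a newer comment count.
conn.execute(UPSERT, (382599, "sample title", 9))
conn.execute(UPSERT, (382599, "sample title", 15))
conn.commit()

rows = conn.execute("SELECT id, title, comments_total FROM news").fetchall()
print(rows)
conn.close()
```

After both statements the table still holds a single row for id 382599, with comments_total updated to 15.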