A crawler has three basic stages: fetch the page, find the information, and collect it.
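These three stages can be sketched as a toy pipeline. Everything below is an illustrative placeholder (the fake "page", the function names), not part of any real crawler library:

```python
# A toy sketch of the three crawler stages; the fake page and function
# names are illustrative placeholders only.
def fetch(url):
    # stand-in for a real HTTP request, e.g. requests.get(url).text
    return '<a href="/track/123" title="Episode 1">Episode 1</a>'

def find(html):
    # stand-in for real parsing (e.g. BeautifulSoup); a naive string split here
    return html.split('title="')[1].split('"')[0]

def collect(item, store):
    # stand-in for saving to disk or a database
    store.append(item)

store = []
collect(find(fetch("https://example.com/page1")), store)
print(store)  # → ['Episode 1']
```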
## Analyzing the page to find the audio resource URL
Open the page https://www.ximalaya.com/youshengshu/2684034/ and press F12 to open the browser's developer tools.
Step through the requests until you find the URL of an audio resource:
https://aod.cos.tx.xmcdn.com/group80/M02/38/03/wKgPEV7WFIzQqI8UAHbux7JA91Y637.m4a
Searching the page source turns up no match for this URL, so the real URL is hidden somewhere else.
After some trial and error, the real resource address turns out to be served from https://www.ximalaya.com/revision/play/v1/audio?id=2725352&ptype=1, stored as a JSON string. The URL can be extracted with:

```python
src = 'https://www.ximalaya.com/revision/play/v1/audio?id=' + str(id1) + '&ptype=1'
audiodic = getHTML(src)
src = audiodic.json()['data']['src']
```
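The shape of that endpoint's response can be illustrated offline. The sample JSON below is a hand-written stand-in for a real response, keeping only the field used here:

```python
import json

# hand-written stand-in for the audio endpoint's JSON response;
# only the data.src field is used by the crawler
sample = '{"ret": 200, "data": {"src": "https://aod.cos.tx.xmcdn.com/group80/M02/38/03/example.m4a"}}'
audio_src = json.loads(sample)["data"]["src"]
print(audio_src)  # the direct .m4a link
```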
Once we have the URL, we can download the resource:
```python
def getMp3(name, name2, url):
    root = "E://python_resorce//" + name + "//"
    path = root + name2 + '.m4a'
    try:
        # create the album folder on first use
        if not os.path.exists(root):
            os.mkdir(root)
        if not os.path.exists(path):
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                              'AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362'}
            r = requests.get(url, headers=headers)
            with open(path, 'wb') as f:
                f.write(r.content)
            print("File saved")
        else:
            print("File already exists")
    except Exception:
        print("Download failed")
```
Since most sites refuse requests from obvious crawlers, we lightly disguise the request headers (the User-Agent above) to look like a normal browser.
With this we can download a single audio file, but for batch downloading we still need to work out the relationship between the page's source code and the audio resources we obtained.
https://www.ximalaya.com/revision/play/v1/audio?id=2725352&ptype=1
Opening several similar URLs in this way shows that only the id changes, and not in any linear pattern. The ids can, however, be found in the page source.
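For example, given a track link's href (the value below is a hypothetical example), the id is just the last path segment, which plugs straight into the audio endpoint:

```python
# hypothetical href value taken from a track <a> tag in the page source
href = "/youshengshu/2684034/2725352"
track_id = href.split("/")[-1]
audio_api = "https://www.ximalaya.com/revision/play/v1/audio?id=" + track_id + "&ptype=1"
print(audio_api)
```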
So we can parse the pages and extract exactly what we need:
```python
def get_M4a_Url(s):
    soup = BeautifulSoup(s, "html.parser")
    # each track sits in an element with class "text _Vc"
    for link in soup.find_all(attrs={'class': 'text _Vc'}):
        name1 = link.a.get('title')
        id1 = link.a.get('href').split('/')[-1]
        src = 'https://www.ximalaya.com/revision/play/v1/audio?id=' + str(id1) + '&ptype=1'
        audiodic = getHTML(src)
        src1 = audiodic.json()['data']['src']
        list1.append({"name": name1, 'id': id1, 'src': src1})
    return list1
```
That wraps it up. The complete code:
```python
from bs4 import BeautifulSoup
import requests
import os


def getHTML(url):
    """Fetch a URL with a browser-like User-Agent; return the response or None."""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362'}
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r
    except requests.RequestException:
        return None


def getMp3(name, name2, url):
    """Download one audio file into E://python_resorce//<name>//<name2>.m4a."""
    root = "E://python_resorce//" + name + "//"
    path = root + name2 + '.m4a'
    try:
        if not os.path.exists(root):
            os.mkdir(root)
        if not os.path.exists(path):
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                              'AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362'}
            r = requests.get(url, headers=headers)
            with open(path, 'wb') as f:
                f.write(r.content)
            print("File saved")
        else:
            print("File already exists")
    except Exception:
        print("Download failed")


list1 = []


def get_M4a_Url(s):
    """Parse a track-list page and resolve each track's real audio URL."""
    soup = BeautifulSoup(s, "html.parser")
    for link in soup.find_all(attrs={'class': 'text _Vc'}):
        name1 = link.a.get('title')
        id1 = link.a.get('href').split('/')[-1]
        src = 'https://www.ximalaya.com/revision/play/v1/audio?id=' + str(id1) + '&ptype=1'
        audiodic = getHTML(src)
        if audiodic is None:
            continue
        src1 = audiodic.json()['data']['src']
        list1.append({"name": name1, 'id': id1, 'src': src1})
    return list1


for i in range(1, 20):
    src = 'https://www.ximalaya.com/guangbojv/30816438/p' + str(i)
    r = getHTML(src)
    if r is None:
        continue
    get_M4a_Url(r.text)

for dict2 in list1:
    print(dict2)
    getMp3('雪中悍刀行', dict2['name'], dict2['src'])
```