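Not long ago, to win meme battles in a group chat, I learned a bit of Python web scraping so I could collect more memes, and ended up with over ten thousand images. I downloaded so many that I have no idea what is in there, so I still lose every battle!
Today I wanted to scrape Baidu search results as well. I am still a beginner, so I am writing this down before I forget it; corrections from more experienced readers are welcome.
Again I use image scraping as the example. The code is still crude and has little practical value.
Scraping mobile Baidu search differs somewhat from scraping the PC version, mainly because the HTML is different.
1. First, fetch the HTML of the Baidu search results page. Be sure to set the User-Agent header.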
import urllib.error
import urllib.request

# Fetch the HTML of the given URL; returns bytes, or None on failure
def getHtml(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8'
        }
        req = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(req)
        html = response.read()
        return html
    except (urllib.error.URLError, ValueError) as e:
        # URLError also covers HTTPError; ValueError is raised for a malformed URL
        return None
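The `url` passed to `getHtml` has to be a complete search address. As a minimal sketch, assuming the PC search page lives at `https://www.baidu.com/s` and takes the keyword in a `wd` parameter and the result offset in `pn` (ten results per page), the address could be built as follows. `build_search_url` is a hypothetical helper and is not part of the code in this post.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names for PC Baidu search
BAIDU_SEARCH = 'https://www.baidu.com/s'

def build_search_url(keyword, page=0):
    """Return the search URL for `keyword`; `page` is zero-based."""
    # urlencode percent-encodes non-ASCII keywords as UTF-8
    query = urlencode({'wd': keyword, 'pn': page * 10})
    return BAIDU_SEARCH + '?' + query

print(build_search_url('cat meme', page=1))
# prints https://www.baidu.com/s?wd=cat+meme&pn=10
```

The result can then be handed straight to `getHtml`.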
2. With the HTML in hand, iterate over the search results and collect the URL of the site behind each one.
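from bs4 import BeautifulSoup

# Get the URL of each result on a PC Baidu search page
def getPCItemUrl(html):
    urls = []
    if not html:
        return urls
    try:
        # Name the parser explicitly so results do not depend on what is installed
        bsObj = BeautifulSoup(html, 'html.parser')
        bq = bsObj.find('div', {'id': 'content_left'}).findAll('h3', {'class': 't'})
        for uu in bq:
            # Each h3 is already a Tag, so search it directly instead of re-parsing it
            link = uu.find('a')
            if link is not None and link.has_attr('href'):
                urls.append(link['href'])
        return urls
    except AttributeError as e:
        # find() returned None: the page has no div with id content_left
        return []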
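To try the extraction logic without BeautifulSoup or a network connection, here is a rough sketch of the same idea using only the standard library's `html.parser`. It is a swapped-in technique, not the code above: it keeps the first link inside every `<h3 class="t">` but, unlike `getPCItemUrl`, does not check that the heading sits inside `content_left`. `ResultLinkParser` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class ResultLinkParser(HTMLParser):
    """Collect the first <a href> inside every <h3 class="t">."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._in_h3 = False      # currently inside a matching h3
        self._took_link = False  # a link was already recorded for this h3

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h3' and 't' in (attrs.get('class') or '').split():
            self._in_h3, self._took_link = True, False
        elif tag == 'a' and self._in_h3 and not self._took_link:
            href = attrs.get('href')
            if href:
                self.urls.append(href)
                self._took_link = True

    def handle_endtag(self, tag):
        if tag == 'h3':
            self._in_h3 = False

# Hand-written sample markup, not real Baidu output
sample = '''
<div id="content_left">
  <h3 class="t"><a href="http://example.com/a">A</a></h3>
  <h3 class="t"><a href="http://example.com/b">B</a><a href="http://example.com/x">x</a></h3>
  <h3 class="other"><a href="http://example.com/c">C</a></h3>
</div>
'''
parser = ResultLinkParser()
parser.feed(sample)
print(parser.urls)
# prints ['http://example.com/a', 'http://example.com/b']
```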
3. With the list of URLs, fetch the HTML of each one and extract whatever information you need from it.
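# Extensions kept as-is (assumed values; the definition of img_ends is not shown in this post)
img_ends = ('jpg', 'gif', 'png', 'bmp')

# Visit each site URL and download the images found there
def getImage(urls):
    if urls is None:
        return
    # GetHtml.GetHtml and GetImg.GetImg hold the helper functions shown in this post
    get_html = GetHtml.GetHtml
    getImg = GetImg.GetImg
    for n, url in enumerate(urls):
        # Fetch the page HTML; skip pages that could not be downloaded
        one_html = get_html.getHtml(url)
        if one_html is None:
            continue
        # Collect the img tags on that page (getImgs may return None)
        images = getImg.getImgs(one_html) or []
        for i, img in enumerate(images):
            src = img['src']
            print(src)
            # src[-4:] takes the last four characters, e.g. '.png'
            endname = src[-4:]
            if endname[-3:] not in img_ends:
                endname = endname + '.jpg'
            # Drop characters that would break the file name
            endname = endname.replace('?', '').replace('/', '')
            # The '_' keeps names unique (otherwise n=1, i=11 collides with n=11, i=1)
            getImg.SaveImg(str(n) + '_' + str(i) + 'img' + endname, src)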
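Taking the last four characters of the address goes wrong when it carries a query string or has no extension at all. One possible alternative, sketched with only the standard library, is to parse the address first and read the extension from its path. `pick_extension` and its `.jpg` default are illustrative choices, not part of the original code.

```python
import os
from urllib.parse import urlparse

IMG_EXTS = ('.jpg', '.gif', '.png', '.bmp')  # the four types this post targets

def pick_extension(src, default='.jpg'):
    """Return the image extension found in the path of `src`, else `default`."""
    path = urlparse(src).path                # drops '?query' and '#fragment'
    ext = os.path.splitext(path)[1].lower()  # '' when the path has no extension
    return ext if ext in IMG_EXTS else default

print(pick_extension('http://example.com/a/b.PNG?size=large'))  # prints .png
print(pick_extension('http://example.com/image'))               # prints .jpg
```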
This is only a rough sketch of my approach, and it retrieves only a small number of images.
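4. Finally, the two helper functions used above: one saves an image to disk, the other collects the image tags from a page.
import re
import urllib.error
import urllib.request

from bs4 import BeautifulSoup

# Save the image at url to filename
def SaveImg(filename, url):
    print(filename)
    try:
        response = urllib.request.urlopen(url)
        cat_img = response.read()
        with open(filename, 'wb') as f:
            f.write(cat_img)
    except urllib.error.URLError as reason:
        # URLError also covers HTTPError
        print(reason)

# Get the img tags whose src is an image address (jpg|gif|png|bmp)
def getImgs(html):
    try:
        bsObj = BeautifulSoup(html, 'html.parser')
        # Raw string so the backslashes reach the regex engine unchanged
        bq = bsObj.findAll('img', {'src': re.compile(r'http[/:A-Za-z0-9\.]+\.(jpg|gif|png|bmp)')})
        return bq
    except AttributeError as e:
        return None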
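The pattern in `getImgs` only accepts `src` values that contain an absolute `http` address, so relative and protocol-relative addresses are skipped, which may be one reason so few images come back. Below is a minimal sketch of resolving such values against the page address with `urllib.parse.urljoin`. `absolute_src` is a hypothetical helper and the example.com addresses are made up.

```python
from urllib.parse import urljoin

def absolute_src(page_url, src):
    """Resolve an img src against the URL of the page it came from."""
    return urljoin(page_url, src.strip())

page = 'http://example.com/news/index.html'
for src in ('pic/a.png', '/static/b.gif', '//img.example.com/c.jpg',
            'http://other.example.com/d.bmp'):
    print(absolute_src(page, src))
# prints:
# http://example.com/news/pic/a.png
# http://example.com/static/b.gif
# http://img.example.com/c.jpg
# http://other.example.com/d.bmp
```

Using this would also mean loosening the `src` filter, for example matching on the extension alone and resolving the address afterwards.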
Personally, I think scraping mobile Baidu would be somewhat easier.