Python Crawler Bonus, Part 1: Scraping Images from the 校花校草 Tieba

1. Target site: Baidu Tieba

    Target content: images

For example, the images in the 校花 tieba shown below:

              (Figure 1: screenshot of image posts in the 校花 tieba)

2. Results:

                      (Figures 2 and 3: screenshots of the downloaded image folder)

3. Code and approach:

    3.1 The link for each tieba:

    Construct the route:

self.url = "https://tieba.baidu.com/f?kw="
kw = input("Enter the tieba to crawl: ")
urlTieBa = self.url + kw

    3.2 Tieba pagination rules:

        The URLs for page 1, page 2, and page 3 differ only in the pn query parameter:

    Pagination summary: pn starts at 0 and grows in steps of 50, i.e. every 50 posts form one page.

    构造翻页路由:       

for i in range(1, 1001):  # crawl 1,000 pages
    # Build each page URL from the base URL; reassigning urlTieBa itself
    # would keep accumulating pn parameters on every iteration.
    urlPage = urlTieBa + "&ie=utf-8&pn=" + str((i - 1) * 50)
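The offsets generated by `(i - 1) * 50` can be sanity-checked in isolation (a quick sketch, separate from the crawler):

```python
def pn_offset(page: int) -> int:
    # pn offset for a 1-indexed page: 50 posts per page, starting at 0.
    return (page - 1) * 50

offsets = [pn_offset(p) for p in (1, 2, 3)]
print(offsets)  # [0, 50, 100]
```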

    3.3 Use xpath to grab each post's link:

   (Figure 4: dev-tools view of a post link element)

    The final link for each post:

   So each post link is joined like this:

urlTieZiList = parseHtml.xpath('//div[@class="threadlist_title pull_left j_th_tit "]/a[@rel="noreferrer"]/@href')
for i in urlTieZiList:
    urlTieZi = 'http://tieba.baidu.com'
    urlTieZi += i
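String concatenation works here because the hrefs returned by the xpath are root-relative. A more defensive alternative is `urllib.parse.urljoin`, which also handles absolute hrefs correctly (the sample href below is illustrative, not scraped):

```python
from urllib.parse import urljoin

base = 'http://tieba.baidu.com'
# Example of a root-relative post href like those returned by the xpath above.
href = '/p/1234567890'
print(urljoin(base, href))  # http://tieba.baidu.com/p/1234567890
```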

     3.4 Each image's link:

    (Figure 5: image element in the post's page source)

    These can also be located directly with xpath:

imageurl = parseHtml.xpath('//div[@class = "d_post_content j_d_post_content  clearfix"]/img[@class="BDE_Image"]/@src')

     3.5 Download the images:

            for ima in imageurl:
                response = requests.get(ima, headers=self.headers)
                filename = ima[-12:]
                image = response.content  # raw bytes; no text decoding needed
                with open("./images/校花/" + filename, "wb") as f:
                    f.write(image)
                    print("%s downloaded" % filename)
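Slicing the last 12 characters (`ima[-12:]`) only works while Tieba image names happen to have a fixed-length tail. A length-independent alternative is to take the last path component of the URL (the sample URL below is illustrative, not scraped):

```python
import os
from urllib.parse import urlsplit

def image_filename(url: str) -> str:
    # Take the last path segment of the URL; robust to varying URL lengths,
    # unlike slicing a fixed number of trailing characters.
    return os.path.basename(urlsplit(url).path)

# Hypothetical image URL for demonstration.
print(image_filename("https://imgsa.baidu.com/forum/pic/item/abcdef123456.jpg"))
# abcdef123456.jpg
```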

4. Full code:

# -*- coding: UTF-8 -*-
'''
@Author :Jason
Scrape images from a user-specified tieba.
Note: first create the folder images\<specified tieba> under this directory.

'''
import requests
from lxml import etree

class getTiebaPic():
    def __init__(self):
        self.url = "https://tieba.baidu.com/f?kw="
        self.headers = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)"}


    # Fetch each listing page of the tieba and process the posts on it
    def get_urlTieBa(self):
        kw = input("Enter the tieba to crawl: ")
        page = int(input("Enter the number of pages to crawl: "))
        urlTieBa = self.url + kw
        for i in range(1, page + 1):  # crawl the requested number of pages
            # Build each page URL from the base URL; reassigning urlTieBa
            # here would accumulate pn parameters across iterations.
            urlPage = urlTieBa + "&ie=utf-8&pn=" + str((i - 1) * 50)
            print(urlPage)
            response = requests.get(urlPage, headers=self.headers)
            response.encoding = "utf-8"
            html = response.text
            urlTieZiList = self.get_urlTieZi(html)
            self.get_imageUrl(urlTieZiList)

    def get_urlTieZi(self, html):
        # Extract the link of every post on one listing page
        parseHtml = etree.HTML(html)
        urlTieZiList = parseHtml.xpath('//div[@class="threadlist_title pull_left j_th_tit "]/a[@rel="noreferrer"]/@href')
        return urlTieZiList

    def get_imageUrl(self, urlTieZiList):
        # Visit each post and download the images it contains
        for i in urlTieZiList:
            urlTieZi = 'http://tieba.baidu.com'
            urlTieZi += i
            response = requests.get(urlTieZi, headers=self.headers)
            response.encoding = "utf-8"
            html = response.text

            parseHtml = etree.HTML(html)
            imageurl = parseHtml.xpath('//div[@class = "d_post_content j_d_post_content  clearfix"]/img[@class="BDE_Image"]/@src')
            for ima in imageurl:
                response = requests.get(ima, headers=self.headers)
                filename = ima[-12:]
                image = response.content  # raw bytes; no text decoding needed
                # NOTE: the output folder is hardcoded and must exist beforehand
                with open("./images/校花/" + filename, "wb") as f:
                    f.write(image)
                    print("%s downloaded" % filename)

    def main(self):
        self.get_urlTieBa()

if __name__ == "__main__":
    html = getTiebaPic()
    html.main()
    print("Done crawling")
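The header comment asks the reader to create the `images\<specified tieba>` folder by hand. The script could create it itself at startup with `os.makedirs`; a minimal sketch, assuming the `images/<tieba name>` layout described in that note:

```python
import os

def ensure_image_dir(kw: str, root: str = "./images") -> str:
    # Create ./images/<tieba name>/ if it does not exist yet and return it.
    path = os.path.join(root, kw)
    os.makedirs(path, exist_ok=True)
    return path

# Usage: call once before downloading, then save files under the returned path.
out_dir = ensure_image_dir("校花")
print(out_dir)
```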
