Scrapy is made up of the following components:
spiders: the crawler module; it defines which data to crawl, the crawling rules, and how to parse responses into structured data
items: define the structured data we want; an Item is used much like a dict
pipelines: the pipeline module; it post-processes the structured data extracted by the spiders, e.g. saving it to a database
middlewares: middleware, essentially hooks that let you pre- and post-process requests and responses, e.g. modifying request headers or filtering URLs
References: http://python.gotrained.com/scrapy-tutorial-web-scraping-craigslist/
https://doc.scrapy.org/en/latest/
This post only covers common spider examples; the remaining parts (items, pipelines, settings, etc.) will be covered in later posts. A minimal sketch of an Item and a pipeline follows below.
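Here is that sketch (not part of the original examples; the class and field names are hypothetical): an Item declares the fields we want, and a pipeline's process_item() receives every item a spider yields. The pipeline only takes effect once it is enabled via the ITEM_PIPELINES setting in settings.py.

import scrapy

# Hypothetical item: declares the structured fields; an instance is used much like a dict.
class CourseItem(scrapy.Item):
    classname = scrapy.Field()
    classdate = scrapy.Field()
    imageaddr = scrapy.Field()

# Hypothetical pipeline: process_item() is called for every item a spider yields.
class SaveCoursePipeline:
    def process_item(self, item, spider):
        # e.g. save the item to a database here; returning it passes it to the next pipeline
        return item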
Example 1: Scraping content from a single page (the name, course info, and start date of the julyedu.com featured courses):
import scrapy

class julyClassSpider(scrapy.Spider):
    name = 'julyclass'
    start_urls = ['https://www.julyedu.com/category/index']

    def parse(self, response):
        for classinfo in response.xpath('//div[@class="item"]/div/div'):
            classname = classinfo.xpath('a[1]/h4/text()').extract_first()
            classdate = classinfo.xpath('a[1]/p[2]/text()').extract_first()
            imageaddr = response.url + classinfo.xpath('a[1]/img[1]/@src').extract_first()
            #print("classname:%s; classdate:%s; imageaddr: %s " %(classname,classdate,imageaddr))
            yield {"classname":classname, "classdate":classdate, "imageaddr":imageaddr}
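The image address above is built by concatenating the raw src value onto response.url, which only produces a correct URL when the src happens to be relative to the category page. A more robust variant (a sketch, not the original code; the spider name is made up) resolves it with response.urljoin():

import scrapy

class julyClassUrljoinSpider(scrapy.Spider):
    # hypothetical variant of the spider above that resolves image URLs properly
    name = 'julyclass_urljoin'
    start_urls = ['https://www.julyedu.com/category/index']

    def parse(self, response):
        for classinfo in response.xpath('//div[@class="item"]/div/div'):
            src = classinfo.xpath('a[1]/img[1]/@src').extract_first()
            yield {
                'classname': classinfo.xpath('a[1]/h4/text()').extract_first(),
                'classdate': classinfo.xpath('a[1]/p[2]/text()').extract_first(),
                # urljoin handles absolute, relative and protocol-relative src values
                'imageaddr': response.urljoin(src) if src else None,
            }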
Example 2: Scraping content across consecutive pages (the first 10 pages of picked posts on cnblogs.com):
import scrapy
import re

class cnblogsSpider(scrapy.Spider):
    name = "cnblogs"
    start_urls = ['https://www.cnblogs.com/pick/' + str(n) + '/' for n in range(1, 10)]

    def parse(self, response):
        for post in response.xpath('//div[@class="post_item_body"]'):
            title = post.xpath('h3/a/text()').extract_first()
            href = post.xpath('h3/a/@href').extract_first()
            pubdate = post.xpath('div[@class="post_item_foot"]/text()')[1].extract().strip()
            pubdate = re.split(' ', pubdate)[1] + ' ' + re.split(' ', pubdate)[2]
            comments = post.xpath('div[@class="post_item_foot"]/span[1]/a/text()').extract_first()
            comments = re.split(r'\(|\)', comments)[1]
            reads = post.xpath('div[@class="post_item_foot"]/span[2]/a/text()').extract_first()
            reads = re.split(r'\(|\)', reads)[1]
            #print(title,href,pubdate,comments,reads)
            yield {'title':title, 'url':href, 'pubdate':pubdate, 'comments':comments, 'reads':reads}
Run: scrapy runspider scrapy2.py
start_urls is a list built by string concatenation inside a list comprehension (note that range(1, 10) covers pages 1 through 9; use range(1, 11) for the full 10 pages). An equivalent approach is sketched below.
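Instead of pre-building start_urls, the same pagination can be expressed by overriding start_requests(), which Scrapy calls to produce the initial requests. A minimal sketch (hypothetical spider name, trimmed to the title and url fields):

import scrapy

class cnblogsPagesSpider(scrapy.Spider):
    # hypothetical variant of the spider above that generates page requests on the fly
    name = 'cnblogs_pages'

    def start_requests(self):
        # range(1, 11) covers pages 1 through 10
        for n in range(1, 11):
            yield scrapy.Request('https://www.cnblogs.com/pick/%d/' % n, callback=self.parse)

    def parse(self, response):
        for post in response.xpath('//div[@class="post_item_body"]'):
            yield {
                'title': post.xpath('h3/a/text()').extract_first(),
                'url': post.xpath('h3/a/@href').extract_first(),
            }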
Example 3: Following a specific button (Next) to scrape multiple pages in a row:
import scrapy
import re

class cnblogsSpider(scrapy.Spider):
    name = "cnblogs"
    start_urls = ['https://www.cnblogs.com/pick/']

    def parse(self, response):
        for post in response.xpath('//div[@class="post_item_body"]'):
            title = post.xpath('h3/a/text()').extract_first()
            href = post.xpath('h3/a/@href').extract_first()
            pubdate = post.xpath('div[@class="post_item_foot"]/text()')[1].extract().strip()
            pubdate = re.split(' ', pubdate)[1] + ' ' + re.split(' ', pubdate)[2]
            comments = post.xpath('div[@class="post_item_foot"]/span[1]/a/text()').extract_first()
            comments = re.split(r'\(|\)', comments)[1]
            reads = post.xpath('div[@class="post_item_foot"]/span[2]/a/text()').extract_first()
            reads = re.split(r'\(|\)', reads)[1]
            #print(title,href,pubdate,comments,reads)
            yield {'title':title, 'url':href, 'pubdate':pubdate, 'comments':comments, 'reads':reads}
        #print("========="+response.url+"==========")
        url = response.xpath('//div[@class="pager"]/a[last()]/@href').extract()[0]
        nexturl = response.urljoin(url)
        yield scrapy.Request(nexturl, callback=self.parse)
The URL of the next page is obtained via the "Next" button and then parsed with the same callback. Another example of the same pattern:
import scrapy
import re

class humorSpider(scrapy.Spider):
    name = 'humor'
    start_urls = ['http://quotes.toscrape.com/tag/humor/page/1/']

    def parse(self, response):
        for humor in response.xpath('//div[@class="quote"]'):
            sentence = humor.xpath('span[1]/text()').extract_first()
            author = humor.xpath('span[2]/small/text()').extract_first()
            yield {'sentence':sentence, 'author':author}
        next_url = response.xpath('//ul[@class="pager"]/li/a/@href').extract_first()
        pattern = re.compile(r'/')
        # compare the page numbers numerically so we only ever move forward
        if next_url is not None and int(pattern.split(next_url)[-2]) > int(pattern.split(response.url)[-2]):
            next_url = response.urljoin(next_url)
            #print(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
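Since Scrapy 1.4, response.follow() can replace the urljoin() + scrapy.Request pair and accepts a relative href directly. A sketch of the same quotes spider using it (hypothetical spider name; it assumes the "Next" link sits in an li with class "next", which is how quotes.toscrape.com marks it up):

import scrapy

class humorFollowSpider(scrapy.Spider):
    # hypothetical variant of the quotes spider above using response.follow()
    name = 'humor_follow'
    start_urls = ['http://quotes.toscrape.com/tag/humor/page/1/']

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'sentence': quote.xpath('span[1]/text()').extract_first(),
                'author': quote.xpath('span[2]/small/text()').extract_first(),
            }
        # follow the "Next" link if there is one; response.follow builds the absolute URL itself
        next_href = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_href is not None:
            yield response.follow(next_href, callback=self.parse)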
Example 4: Parsing different pages with multiple callback functions
scrapy startproject qqnews
tree
.
|____qqnews
| |______init__.py
| |______pycache__
| |____items.py
| |____middlewares.py
| |____pipelines.py
| |____settings.py
| |____spiders
| | |______init__.py
| | |______pycache__
| | |____qqnews.py
|____scrapy.cfg
cd qqnews/spiders/
cat qqnews.py
import scrapy

class qqNewsSpider(scrapy.Spider):
    name = 'qqnews'
    start_urls = ['http://news.qq.com/']

    def parse(self, response):
        # the front page only yields article links; each article is handled by parse_news()
        for url in response.xpath('//div[@class="text"]/em/a/@href').extract():
            yield scrapy.Request(url, callback=self.parse_news)

    def parse_news(self, response):
        try:
            title = response.xpath('//div[@class="hd"]/h1/text()').extract()[0]
            type = response.xpath('//div[@class="a_Info"]/span[1]/a/text()').extract()[0]
            source = response.xpath('//div[@class="a_Info"]/span[2]/a/text()').extract()[0]
            time = response.xpath('//span[@class="a_time"]/text()').extract()[0]
            print(title, type, source, time)
        except:
            print("exception")
Run: scrapy crawl qqnews -o news.csv
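Note that parse_news() above only prints the fields, and the -o feed export writes only what the callbacks yield, so news.csv will stay empty. A sketch of a variant that yields the same fields as a dict (hypothetical spider name, same XPaths as above):

import scrapy

class qqNewsItemsSpider(scrapy.Spider):
    # hypothetical variant of qqNewsSpider that yields dicts instead of printing,
    # so `scrapy crawl qqnews_items -o news.csv` actually writes rows
    name = 'qqnews_items'
    start_urls = ['http://news.qq.com/']

    def parse(self, response):
        for url in response.xpath('//div[@class="text"]/em/a/@href').extract():
            yield scrapy.Request(url, callback=self.parse_news)

    def parse_news(self, response):
        yield {
            'title': response.xpath('//div[@class="hd"]/h1/text()').extract_first(),
            'type': response.xpath('//div[@class="a_Info"]/span[1]/a/text()').extract_first(),
            'source': response.xpath('//div[@class="a_Info"]/span[2]/a/text()').extract_first(),
            'time': response.xpath('//span[@class="a_time"]/text()').extract_first(),
        }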