Scraping the Reviews of 《战狼2》 (Wolf Warrior 2)

Having just learned Python, I wanted to practice by scraping the reviews of 《战狼2》. The whole process is wrapped in a single class, which keeps things tidy.

First, list out the URLs of the review summary pages.

    def Url(self):
        # Base URL of the review list; each page shows 20 reviews, paged by ?start=.
        url_duan = "https://movie.douban.com/subject/26363254/reviews?start="
        # 342 pages in total, so offsets run 0, 20, ..., (342 - 1) * 20.
        # (The original range(1, 342) stopped one page short.)
        start_urls = [url_duan + str((i - 1) * 20) for i in range(1, 343)]
        for url in start_urls:
            self.spider(url)

Note: there are 342 summary pages in total, per the pagination at the bottom of the review list.


From the pagination you can see that subtracting one from the page number and multiplying by 20 gives the start offset, so every summary-page URL can be built by simple string concatenation.
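As a quick sanity check (a standalone sketch, not part of the class), you can print the first and last generated URLs and confirm the offsets match the site's pagination:

    url_duan = "https://movie.douban.com/subject/26363254/reviews?start="
    start_urls = [url_duan + str((i - 1) * 20) for i in range(1, 343)]

    print(start_urls[0])     # ...reviews?start=0    -> page 1
    print(start_urls[1])     # ...reviews?start=20   -> page 2
    print(start_urls[-1])    # ...reviews?start=6820 -> page 342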

Next, find the URL of each individual review.

import time
import random
import urllib.request
from bs4 import BeautifulSoup

    def spider(self, url):
        time.sleep(2)    # throttle requests so Douban is less likely to block us
        # The site is served over HTTPS, so register the proxy under the 'https'
        # scheme; 'sock5' is not a scheme urllib recognizes, so the original
        # handler was silently ignored. (Verifying that the proxy actually took
        # effect is another matter; see the check sketched below.)
        proxy_support = urllib.request.ProxyHandler({'https': random.choice(self.iplist)})
        opener = urllib.request.build_opener(proxy_support)
        # Attach the spoofed headers to a normal GET request; passing them as
        # data= would turn the request into a POST with the headers as its body.
        request = urllib.request.Request(url, headers=self.data())
        html = opener.open(request)
        htm = html.read().decode("utf-8")
        soup = BeautifulSoup(htm, "lxml")
        # On the summary page, every review title is an <a class="title-link">.
        reviewer_urls = soup.find_all('a', 'title-link')
        reviewer_url = [tag['href'] for tag in reviewer_urls]
        for link in reviewer_url:
            self.spider_1(link)

Each summary page is parsed to extract the links to its individual reviews, which are saved in reviewer_url. A proxy is added so the pages can be fetched successfully. Verifying that the proxy actually took effect was an open question in the original write-up; one way to check is sketched below.
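A minimal verification sketch (my own addition; httpbin.org is a third-party echo service, not part of the original script): route a request through the proxy and look at the origin IP the server reports. If the proxy is in effect, it will be the proxy's IP rather than yours.

    import urllib.request

    def proxy_works(proxy):    # proxy in "ip:port" form, e.g. an entry from iplist
        handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
        opener = urllib.request.build_opener(handler)
        try:
            # httpbin echoes back the origin IP of the request.
            with opener.open("https://httpbin.org/ip", timeout=10) as resp:
                print(resp.read().decode("utf-8"))
            return True
        except Exception:
            return False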

import random

    def data(self):
        # Pick a random User-Agent so successive requests look like they come
        # from different browsers, and return the headers as a dict. (The old
        # version urlencoded this dict and sent it as POST data, which is not
        # how request headers work.)
        ua = random.choice(self.user_agent_list)
        headers = {
            "Host": "movie.douban.com",
            "User-Agent": ua,
            }
        return headers

This mimics a browser. Without a User-Agent header, urllib announces itself as Python-urllib, which Douban may reject.
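A quick standalone check (sketch only; the User-Agent list is abbreviated here) that the headers really end up on the request:

    import random
    import urllib.request

    user_agent_list = ["Mozilla/5.0 (Windows NT 6.1; WOW64) ..."]    # abbreviated
    headers = {"Host": "movie.douban.com", "User-Agent": random.choice(user_agent_list)}
    req = urllib.request.Request("https://movie.douban.com/subject/26363254/reviews?start=0",
                                 headers=headers)
    print(req.get_header('User-agent'))    # the spoofed UA, not "Python-urllib/3.x"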

    def __init__(self):
        self.user_agent_list = ["Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
                            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
                            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
                            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
                            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
                            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
                            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
                            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
                            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
                            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
                            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
                            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
                            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
                            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
                            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
                            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
                            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
                            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"]   
    
        self.iplist = []    # Find some proxy IPs yourself; format is ip:port, e.g. 127.0.0.1:30.

Now for the real work: scraping the content of each review (images are skipped).

import string

    def spider_1(self, url):
        # Same proxy setup as in spider(); the review pages are HTTPS too.
        proxy_support = urllib.request.ProxyHandler({'https': random.choice(self.iplist)})
        opener = urllib.request.build_opener(proxy_support)
        request = urllib.request.Request(url, headers=self.data())
        f = opener.open(request, timeout=10)
        soup = BeautifulSoup(f.read().decode('utf-8'), 'lxml')
        # The reviewer's name sits in <span property="v:reviewer">.
        reviewer = soup.find_all('span', property='v:reviewer')[0].get_text()
        # Strip punctuation so the name is safe to use as a file name.
        for ch in string.punctuation:
            reviewer = reviewer.replace(ch, '')
        content = []
        # I used to reach everything with find_all alone; here find and find_all
        # are combined: the review body is <div property="v:description">, with
        # one <p> per paragraph, and each <p> must be turned into a string
        # before it can be written out.
        content_1 = soup.find('div', property='v:description').find_all('p')
        for p in content_1:
            content.append(p.string)
        with open(reviewer + '.txt', 'w+', encoding='utf-8') as out:
            for line in content:
                try:
                    out.write(line + '\n')
                except TypeError:
                    # p.string is None when a <p> holds nested tags; keep a blank line.
                    out.write('\n')
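The try/except above exists because Tag.string returns None whenever a tag has more than one child. A tiny standalone demo (made-up HTML, just for illustration):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p>plain text</p><p><b>bold</b> and plain</p>", "lxml")
    ps = soup.find_all("p")
    print(ps[0].string)      # 'plain text'
    print(ps[1].string)      # None -- nested tags defeat .string
    print(ps[1].get_text())  # 'bold and plain' -- get_text() flattens everything

Using get_text() instead of .string would make the try/except unnecessary.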

Chinese documentation round-up

Complete code: https://github.com/Hadesghost/zhanlang2/blob/master/zhanlang.py
re docs: http://python.usyiyi.cn/translate/python_352/library/re.html
time docs: http://python.usyiyi.cn/translate/python_352/library/time.html
urllib docs: http://python.usyiyi.cn/translate/python_352/library/urllib.html
bs4 docs: http://beautifulsoup.readthedocs.io/zh_CN/latest/
random docs: http://python.usyiyi.cn/translate/python_352/library/random.html
string docs: http://python.usyiyi.cn/translate/python_352/library/string.html
os.path docs: http://python.usyiyi.cn/translate/python_352/library/os.path.html

Other Chinese documentation can be found at:

https://readthedocs.org/    
http://python.usyiyi.cn/translate/python_352/library/index.html#library-index       
http://docs.pythontab.com/
