Switch the browser mode from PC to mobile.
First URL: http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=0&startTime=0
Second URL: http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=15&startTime=2018-10-11%2015%3A19%3A05
Third URL: http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=30&startTime=2018-10-11%2015%3A19%3A05
When offset is 0, startTime is also 0; after that, offset increases by 15 on each request while startTime stays at a fixed time. Looking at the first (non-hot) comment, its time shows "3 minutes ago", the startTime in the URL is 2018-10-11 15:19:05, and my computer's clock reads 15:22:04, so this startTime is simply the time of the most recent comment. We can therefore fetch the time of the most recent comment, fix it as a constant, and increase the offset by 15 each time to construct these requests.
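As a quick check of how that timestamp ends up in the URL, the small sketch below (my own illustration, not part of the original project) uses urllib.parse.quote to reproduce the encoded startTime seen in the second URL above.
from urllib.parse import quote

base_url = 'http://m.maoyan.com/mmdb/comments/movie/{}.json?_v_=yes&offset={}&startTime={}'
start_time = '2018-10-11 15:19:05'   # time of the most recent comment
print(quote(start_time))             # -> 2018-10-11%2015%3A19%3A05
print(base_url.format(1216446, 15, quote(start_time)))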
Below is one comment object from the JSON response when offset is 15.
{
"approve": 3913,
"approved": false,
"assistAwardInfo": {
"avatar": "",
"celebrityId": 0,
"celebrityName": "",
"rank": 0,
"title": ""
},
"authInfo": "",
"avatarurl": "https://img.meituan.net/avatar/7e9e9348115c451276afffda986929b311657.jpg",
"cityName": "深圳",
"content": "脑洞很大,有创意,笑点十足又有泪点,十分感动,十分推荐。怀着看喜剧电影去看的,最后哭了个稀里哗。确实值得一看,很多场景让我回忆青春,片尾的旧照片更是让我想起了小时候。",
"filmView": false,
"gender": 1,
"id": 1035829945,
"isMajor": false,
"juryLevel": 0,
"majorType": 0,
"movieId": 1216446,
"nick": "lxz367738371",
"nickName": "发白的牛仔裤",
"oppose": 0,
"pro": false,
"reply": 94,
"score": 5,
"spoiler": 0,
"startTime": "2018-08-17 03:30:37",
"supportComment": true,
"supportLike": true,
"sureViewed": 0,
"tagList": {},
"time": "2018-08-17 03:30",
"userId": 1326662323,
"userLevel": 2,
"videoDuration": 0,
"vipInfo": "",
"vipType": 0
},
Constructing the initial request URL
import json
import re
import scrapy
from datetime import datetime
from urllib.parse import quote
from scrapy import Request
from maoyan.items import MaoyanItem
from maoyan.settings import MOVIE_ID  # movie ID configured in settings.py

class Movie1Spider(scrapy.Spider):
    name = 'movie1'
    allowed_domains = ['m.maoyan.com']
    base_url = 'http://m.maoyan.com/mmdb/comments/movie/{}.json?_v_=yes&offset={}&startTime={}'
    def start_requests(self):
        time_now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        url = self.base_url.format(MOVIE_ID, 0, quote(time_now))
        yield Request(url=url)
The field values are taken from this JSON data. When crawling, a fixed startTime cannot be used indefinitely: with one fixed time, data is only returned up to offset=1005 and nothing comes back after that. Instead, each new request moves startTime back to the time of the last comment fetched; once the earliest comments are reached and we keep crawling backwards, we get comments from around the movie's release date. The termination condition in the code is therefore that a comment's time is greater than the request time in the URL.
    def parse(self, response):
        # the startTime used in this request's URL (still percent-encoded)
        last_time = re.search(r'startTime=(.*)', response.url).group(1)
        response = json.loads(response.text)
        cmts = response.get('cmts')  # list of (non-hot) comments
        for cmt in cmts:
            global time
            maoyan_item = MaoyanItem()
            maoyan_item['id'] = cmt.get('id')
            maoyan_item['nickname'] = cmt.get('nickName')
            maoyan_item['gender'] = cmt.get('gender')
            maoyan_item['cityname'] = cmt.get('cityName')
            maoyan_item['content'] = cmt.get('content')
            maoyan_item['score'] = cmt.get('score')
            time = cmt.get('startTime')
            maoyan_item['time'] = time
            maoyan_item['userlevel'] = cmt.get('userLevel')
            if quote(time) > last_time:  # the comment's time is later than the time in the URL
                break
            yield maoyan_item
        if quote(time) < last_time:  # the last comment is earlier than the time in the URL
            url = self.base_url.format(MOVIE_ID, 15, quote(time))  # use the last comment's time as the new startTime
            yield Request(url=url, meta={'next_time': time})
import scrapy
from scrapy import Field

class MaoyanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    table = 'movie'       # target MySQL table
    id = Field()          # comment ID
    nickname = Field()    # nickname
    gender = Field()      # gender
    cityname = Field()    # city name
    content = Field()     # comment text
    score = Field()       # rating
    time = Field()        # comment time
    userlevel = Field()   # commenter's level
Saving the data to a MySQL database
import pymysql

class MaoyanPipeline(object):
    def __init__(self, host, databases, user, password, port):
        self.host = host
        self.databases = databases
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            databases=crawler.settings.get('MYSQL_DATABASES'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        try:
            self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                      database=self.databases, charset='utf8', port=self.port)
            self.db.ping()  # make sure the connection is alive
        except Exception:
            # retry once if the first connection attempt failed
            self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                      database=self.databases, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # build an INSERT statement dynamically from the item's fields
        data = dict(item)
        keys = ','.join(data.keys())
        values = ','.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()
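The pipeline assumes the movie table already exists in the database. One possible way to create it is sketched below; the column types are my guesses based on the item fields (they are not given in the original), and the connection parameters are placeholders.
import pymysql

create_sql = """
CREATE TABLE IF NOT EXISTS movie (
    id        BIGINT PRIMARY KEY,   -- comment ID
    nickname  VARCHAR(255),         -- nickname
    gender    INT,                  -- gender
    cityname  VARCHAR(255),         -- city name
    content   TEXT,                 -- comment text
    score     FLOAT,                -- rating
    time      VARCHAR(32),          -- comment time
    userlevel INT                   -- commenter's level
) DEFAULT CHARSET=utf8;
"""

db = pymysql.connect(host='localhost', user='root', password='', database='movie', charset='utf8')
with db.cursor() as cursor:
    cursor.execute(create_sql)
db.commit()
db.close()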
Slightly lower the crawl delay, add the request Headers, configure the MySQL connection information, and enable the pipeline.
BOT_NAME = 'maoyan'
SPIDER_MODULES = ['maoyan.spiders']
NEWSPIDER_MODULE = 'maoyan.spiders'
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'http://m.maoyan.com/movie/1216446/comments?_v_=yes',
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) '
                  'Version/11.0 Mobile/15A372 Safari/604.1'
}
ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}
MYSQL_HOST = ''
MYSQL_DATABASES = 'movie'
MYSQL_PORT =
MYSQL_USER = 'root'
MYSQL_PASSWORD = ''
DOWNLOAD_DELAY = 0.1 # 每次下载请求的延迟
MOVIE_ID = '1216446' # 电影ID
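With these settings in place, the spider can be started from the project root in the usual Scrapy way:
scrapy crawl movie1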
Because there are so many comments, crawling them in a distributed fashion is faster.
from scrapy_redis.spiders import RedisSpider
Change the Spider to a RedisSpider. The start URL is now fetched from the redis database, so the start_requests method is no longer needed (it is commented out below) and a redis_key is added instead.
class Movie1Spider(RedisSpider):
    name = 'movie1'
    allowed_domains = ['m.maoyan.com']
    base_url = 'http://m.maoyan.com/mmdb/comments/movie/{}.json?_v_=yes&offset={}&startTime={}'
    redis_key = 'movie1:start_urls'
    # def start_requests(self):
    #     time_now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    #     url = self.base_url.format(MOVIE_ID, 0, quote(time_now))
    #     yield Request(url=url)
The scrapy-redis related settings are:
Redis connection parameters: REDIS_URL
The scrapy-redis scheduler: SCHEDULER
The scrapy-redis de-duplication filter: DUPEFILTER_CLASS
Keep the redis queue instead of clearing it on finish: SCHEDULER_PERSIST
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://:password@IP:6379'
SCHEDULER_PERSIST = True
MYSQL_HOST = 'address'  # MySQL server address
MYSQL_DATABASES = 'movie'
MYSQL_PORT = 62782
MYSQL_USER = 'root'
MYSQL_PASSWORD = ''
DOWNLOAD_DELAY = 0.1 # 每次下载请求的延迟
MOVIE_ID = '1216446' # 电影ID
After starting the program, we connect to the Redis database and run a single-machine test to check that everything works.
127.0.0.1:6379> lpush movie1:start_urls http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=0&startTime=2018-10-11%2018%3A14%3A17
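The same start URL can also be pushed from a short script instead of being typed into redis-cli; this is just a convenience sketch, and the connection URL below is a placeholder that should match REDIS_URL in settings.py.
from datetime import datetime
from urllib.parse import quote
import redis

r = redis.from_url('redis://:password@IP:6379')  # placeholder, same as REDIS_URL
base_url = 'http://m.maoyan.com/mmdb/comments/movie/{}.json?_v_=yes&offset={}&startTime={}'
start_url = base_url.format(1216446, 0, quote(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
r.lpush('movie1:start_urls', start_url)  # the key must match the spider's redis_key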
Batch deployment with Gerapy