[Python Web Scraping] MongoDB Scraping Practice: Scraping the Hupu Forum


The target forum is at https://bbs.hupu.com/bxj.

1. Site Analysis

First, locate where each piece of data lives on the page: the post title, post link, author, author link, creation time, reply count, view count, last-reply user, and last-reply time. We then use BeautifulSoup to extract these elements from the HTML.

Where each piece of data lives:

Data                     Location
All data for one post    li
Post title               div class="titlelink box" > a
Post link                div class="titlelink box" > a['href']
Author                   div class="author box" > a
Author link              div class="author box" > a['href']
Creation time            div class="author box" > contents[5]
Reply count              span class="ansour box" (before the '/')
View count               span class="ansour box" (after the '/')
Last-reply user          div class="endreply box" > span
Last-reply time          div class="endreply box" > a
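
As a concrete illustration of the mapping above, here is a minimal sketch of extracting a few of these fields from the first post on the list page; the selectors are the same ones used in the full script below:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://bbs.hupu.com/bxj",
                                  headers={'User-Agent': 'Mozilla/5.0'}).content, 'lxml')
post = soup.find('ul', class_='for-list').find('li')   # first <li> post on the page
title_div = post.find('div', class_='titlelink box')
print(title_div.a.text.strip())        # post title
print(title_div.a['href'])             # relative post link
reply, view = [s.strip() for s in post.find('span', class_='ansour box').text.split('/')]
print(reply, view)                     # reply count and view count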

Also, when you open the second page, the URL changes to https://bbs.hupu.com/bxj-2, and so on for subsequent pages.
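
A minimal sketch of building the list-page URLs from that pattern (the loop in section 3 uses the same construction):

# Page 1 lives at /bxj; page N (N >= 2) lives at /bxj-N
for page in range(1, 6):
    link = "https://bbs.hupu.com/bxj" if page == 1 else "https://bbs.hupu.com/bxj-" + str(page)
    print(link)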

2. Project Implementation

First, try fetching the first page of data. The code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@File  : Save05.py
@Author: Xinzhe.Pang
@Date  : 2019/7/10 0:14
@Desc  : 
"""
# Scrape Hupu forum data: https://bbs.hupu.com/bxj

import requests
from bs4 import BeautifulSoup
import datetime


# Download a page and return it as a parsed BeautifulSoup object
def get_page(link):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
    r = requests.get(link, headers=headers)
    html = r.content
    html = html.decode('UTF-8')
    soup = BeautifulSoup(html, 'lxml')
    return soup


# Parse the list of <li> post elements into rows of data
def get_data(post_list):
    data_list = []
    for post in post_list:
        title = post.find('div', class_='titlelink box').a.text.strip()
        post_link = post.find('div', class_='titlelink box').a['href']
        post_link = "https://bbs.hupu.com" + post_link

        author = post.find('div', class_='author box').a.text.strip()
        author_page = post.find('div', class_='author box').a['href']
        start_date = post.find('div', class_='author box').contents[5].text.strip()

        reply_view = post.find('span', class_='ansour box').text.strip()
        reply = reply_view.split('/')[0].strip()
        view = reply_view.split('/')[1].strip()

        reply_time = post.find('div', class_='endreply box').a.text.strip()
        last_reply = post.find('div', class_='endreply box').span.text.strip()

        if ':' in reply_time:  # a time like '11:27' means the last reply was today
            date_time = str(datetime.date.today()) + ' ' + reply_time
            date_time = datetime.datetime.strptime(date_time, '%Y-%m-%d %H:%M')
        else:  # otherwise it's a date like '07-09'; the year is hard-coded here
            date_time = datetime.datetime.strptime('2019-' + reply_time, '%Y-%m-%d').date()

        data_list.append([title, post_link, author, author_page, start_date, reply, last_reply, date_time])

    return data_list


# Fetch the first list page, parse its posts, and print them
link = "https://bbs.hupu.com/bxj"
soup = get_page(link)
post_all = soup.find('ul', class_="for-list")
post_list = post_all.find_all('li')
data_list = get_data(post_list)
for each in data_list:
    print(each)

3. Fetching the First 50 Pages

One caveat: by the time you request page 2, newly bumped threads may already have pushed posts from page 1 onto page 2. If every record were written with insert_one, the same post could end up stored twice in MongoDB, so an update (upsert) is used instead.
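
As a minimal sketch of that idea (host, database, and collection names simply mirror the full script below), an upsert keyed on the post link writes each post only once, no matter how many pages it appears on:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)   # assumes a MongoDB instance on localhost
table = client['hupu']['post']

post = {"post_link": "https://bbs.hupu.com/12345678.html", "title": "demo"}  # hypothetical record
# Keyed on post_link: inserted the first time it is seen, updated on later pages
table.update_one({"post_link": post["post_link"]}, {"$set": post}, upsert=True)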

The full script is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@File  : MongoAPI.py
@Author: Xinzhe.Pang
@Date  : 2019/7/10 22:15
@Desc  : 
"""
import requests
import datetime
import time
from pymongo import MongoClient
from bs4 import BeautifulSoup

# A small helper class wrapping MongoDB: connect to a database and read, insert, or update documents
class MongoAPI(object):
    def __init__(self, db_ip, db_port, db_name, table_name):
        self.db_ip = db_ip
        self.db_port = db_port
        self.db_name = db_name
        self.table_name = table_name
        self.conn = MongoClient(host=self.db_ip, port=self.db_port)
        self.db = self.conn[self.db_name]
        self.table = self.db[self.table_name]

    def get_one(self, query):
        return self.table.find_one(query, projection={"_id": False})

    def get_all(self, query):
        return self.table.find(query)

    def add(self, kv_dict):
        return self.table.insert_one(kv_dict)

    def delete(self, query):
        return self.table.delete_many(query)

    def check_exist(self, query):
        ret = self.table.find_one(query)
        return ret is not None

    # Upsert: updates the document if it exists, creates it otherwise
    def update(self, query, kv_dict):
        self.table.update_one(query, {'$set': kv_dict}, upsert=True)

# Download a page and return it as a parsed BeautifulSoup object
def get_page(link):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
    r = requests.get(link, headers=headers)
    html = r.content
    html = html.decode('UTF-8')
    soup = BeautifulSoup(html, 'lxml')
    return soup

# Parse the list of <li> post elements into rows of data
def get_data(post_list):
    data_list = []
    for post in post_list:
        title = post.find('div', class_='titlelink box').a.text.strip()
        post_link = post.find('div', class_='titlelink box').a['href']
        post_link = "https://bbs.hupu.com" + post_link

        author = post.find('div', class_='author box').a.text.strip()
        author_page = post.find('div', class_='author box').a['href']
        start_date = post.find('div', class_='author box').contents[5].text.strip()

        reply_view = post.find('span', class_='ansour box').text.strip()
        reply = reply_view.split('/')[0].strip()
        view = reply_view.split('/')[1].strip()

        reply_time = post.find('div', class_='endreply box').a.text.strip()
        last_reply = post.find('div', class_='endreply box').span.text.strip()

        if ':' in reply_time:  # a time like '11:27' means the last reply was today
            date_time = str(datetime.date.today()) + ' ' + reply_time
            date_time = datetime.datetime.strptime(date_time, '%Y-%m-%d %H:%M')
        else:  # otherwise it's a date like '07-09'; the year is hard-coded here
            date_time = datetime.datetime.strptime('2019-' + reply_time, '%Y-%m-%d').date()

        data_list.append([title, post_link, author, author_page, start_date, reply, last_reply, date_time])

    return data_list


# Connect to the 'hupu' database, 'post' collection, then crawl the list pages
hupu_post = MongoAPI('111.230.95.186', 27017, 'hupu', 'post')
for i in range(1, 51):  # pages 1 through 50
    link = "https://bbs.hupu.com/bxj-" + str(i)
    soup = get_page(link)

    post_all = soup.find('ul', class_="for-list")
    if post_all is None:
        continue
    post_list = post_all.find_all('li')
    data_list = get_data(post_list)

    for each in data_list:
        hupu_post.update({"post_link": each[1]}, {"title": each[0],
                                                  "post_link": each[1],
                                                  "author": each[2],
                                                  "author_page": each[3],
                                                  "start_date": str(each[4]),
                                                  "reply": each[5],
                                                  "last_reply": each[6],
                                                  "last_reply_time": str(each[7])})
    time.sleep(3)
    print('Page', i, 'saved; pausing for 3 seconds')

Check whether the data has been written to the database:
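
A quick way to check from Python, assuming the same host, database, and collection names as in the script above:

from pymongo import MongoClient

client = MongoClient('111.230.95.186', 27017)
table = client['hupu']['post']
print(table.count_documents({}))            # how many posts have been stored
print(table.find_one({}, {"_id": False}))   # peek at one stored document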


A problem encountered: 'NoneType' object has no attribute 'find_all'

Fix: add the following check before post_list = post_all.find_all('li'):

if post_all is None:
    continue

Reference: 《Python网络爬虫从入门到实践》
