最终效果(不看效果就讲过程都是耍流氓):
实现过程如下:
目标网站:
蚌埠医学院 学院新闻列表:http://www.bbmc.edu.cn/index.php/view/viewcate/0/
##第一步:数据抓取
在终端中执行命令
srapy startproject bynews
在爬虫目录spider中,新建具体爬虫的by_spider.py文件:
爬虫代码功能说明:
爬虫源代码内容:
import scrapy
from bynews.items import BynewsItem
import re
class BynewsSpider(scrapy.Spider):
name = 'news'
start_urls = ['http://www.bbmc.edu.cn/index.php/view/viewcate/0/']
allowed_domains = ['bbmc.edu.cn']
def parse(self, response):
news = response.xpath("//div[@class='cate_body']//ul/li")
next_page = response.xpath("//div[@class='pagination']/a[contains(text(),'>')]/@href").extract_first()
href_urls = response.xpath("//div[@class='cate_body']/ul/li/a/@href").extract()
for href in href_urls:
yield scrapy.Request(href,callback=self.parse_item)
if next_page is not None:
yield scrapy.Request(next_page)
def parse_item(self, response):
item = BynewsItem()
item['title'] = response.xpath("//div[@id='news_main']/div/h1/text()").extract_first()
# 获取 作者和发文日期
string = response.xpath("//div[@class='title_head']/span/text()").extract_first()
# 通过分割获取文章发布日期
full_date = string.split('/')
year = full_date[0][-4:]
month = full_date[1]
day = full_date[2]
item['post_date'] = year + month +day
# 通过正则表达式获取作者
matchObj = re.search(r'\[.*\]',string)
# 去除两边的中括号
string = matchObj.group()
string = string[1:]
string = string[:-1]
item['author'] = string
contents = response.xpath("//div[@id='news_main']/div/div/span/text()").extract_first()
text = ''
for content in contents :
text = text + content
item['content'] = text
yield item
提取的数据结构items.py文件:
import scrapy
class BynewsItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
title = scrapy.Field()
author = scrapy.Field()
post_date = scrapy.Field()
content = scrapy.Field()
爬虫项目的settings.py文件,增加请求头,关闭爬虫协议,仅放修改的部分,其他的默认即可:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
准备工作就绪后,执行scrapy crawl news -o news.csv
命令开始抓取数据,并且报存到文件news.csv中,方便后续建站导入数据:
爬虫过程结束,准备建站啦~
##第二步:数据呈现
在终端中执行命令:django-admin startproject 项目名
执行命令:python manage.py startapp bynews
新建应用
Django采用MVC架构,需要写的内容比较多,就不一一放图了,直接甩上代码:
urls.py文件:
from django.contrib import admin
from django.urls import include,path
urlpatterns = [
path('admin/', admin.site.urls),
path('blog/', include('blog.urls')),
# 蚌医新闻
path('bynews/',include('bynews.urls')),
]
settings.py文件(仅放修改部分):
INSTALLED_APPS = [
'blog.apps.BlogConfig',
'bynews.apps.BynewsConfig',
'django.contrib.admin',
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.messages',
'django.contrib.staticfiles',
]
模型models.py文件(模型文件弄好后,需要执行数据迁移什么的命令,Django保证了所有操作都是基于面向对象,十分强大):
from django.db import models
class Bynews(models.Model):
title = models.CharField(max_length=30, default='Title')
author = models.CharField(null = True,max_length=30)
content= models.TextField(null = True)
def __str__(self):
return self.title
视图views.py文件:
from django.shortcuts import render,get_object_or_404
from .models import Bynews
def index(request):
news = Bynews.objects.order_by('id')[:30]
return render(request, 'bynews/newslist.html',{'news_list':news})
def detail(request,news_id):
news = get_object_or_404(Bynews,pk=news_id)
return render(request, 'bynews/detail.html', {'news':news})
urls.py文件:
from django.urls import path
from . import views
app_name = 'bynews'
urlpatterns = [
# ex:/bynews/
path('', views.index, name='index'),
# ex:/bynews/4/
path('/', views.detail, name = 'detail'),
#path('/vote/', views.form ,name='form'),
]
放上一个新闻列表页的模板newslist
文件,需要注意其中的url生成方式,static的生成方式:
{% load static %}
蚌埠医学院 - 新闻列表
蚌埠医学院
新闻列表
{% for news in news_list %}
-
{{news.title}}
{% endfor %}
最终效果(请忽视正文内容不全问题,这是采集没弄好,懒得弄了,基本想法实现了就行):