Welcome to my personal blog: fizzyi
Project Introduction
Target URL: http://www.resgain.net/xmdq.html
What is crawled: every surname on the site, and the given names listed under each surname
Crawling steps:
- First crawl all the surnames: each surname's pinyin, its Chinese characters, and its URL.
- Then enter each surname's site and crawl the names under it. Each surname has ten name-list pages, though it turns out not every page actually contains names.
- Finally enter each name's detail page and crawl the number of people sharing that name, its Five Elements (wuxing) reading, and its Three Talents (sancai) reading.
Environment and framework: Python 3, Scrapy
Data volume: 435 surnames, about 1.94 million name records
Code
1 Preparation
Create a new Scrapy project, then generate the two spiders:
scrapy startproject baijiaxing1
scrapy genspider baijiaxing2 resgain.net/xmdq.html
scrapy genspider spider_xingming resgain.net/xmdq.html
Why create two spiders?
Scrapy crawls concurrently. When I originally put everything in one spider, surnames and names were scraped at the same time; since each name record carries its surname's database id, a name could be scraped before its surname had been saved to the database, which caused errors. There are surely cleaner ways to handle this (one idea is sketched below), but I had only just started with Scrapy, so I settled for this blunt approach.
baijiaxing2 crawls the surnames; spider_xingming crawls the names.
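For instance, a single spider could probably work if the pipeline resolved the surname's id at insert time, creating the surname row when it does not exist yet. A minimal sketch, assuming the baijiaxing table from section 7 and a live pymysql cursor:

def get_or_create_xingshi_id(cursor, xingshi):
    # look the surname up first; insert it if it is not there yet
    cursor.execute('SELECT id FROM baijiaxing WHERE xingshi = %s', (xingshi,))
    row = cursor.fetchone()
    if row:
        return row[0]
    cursor.execute('INSERT INTO baijiaxing(xingshi) VALUES (%s)', (xingshi,))
    return cursor.lastrowid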
2 Define the items
items.py
# -*- coding: utf-8 -*-
import scrapy


class Xingshi_Item(scrapy.Item):
    # one record per surname
    xingshi = scrapy.Field()           # surname in pinyin
    href = scrapy.Field()              # URL of the surname's own page
    xingshi_zhongwen = scrapy.Field()  # surname in Chinese characters


class Xingming_Item(scrapy.Item):
    # one record per name
    name = scrapy.Field()
    the_same_people_number = scrapy.Field()  # how many people share the name
    boy_ratio = scrapy.Field()
    girl_ratio = scrapy.Field()
    five_elements = scrapy.Field()  # wuxing reading
    three_talents = scrapy.Field()  # sancai reading
    xingshi = scrapy.Field()        # surname, used later to find the foreign key
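Scrapy items behave like dictionaries, which is what the pipeline relies on later. A quick illustration with made-up values:

item = Xingshi_Item()
item['xingshi'] = 'qian'                            # hypothetical pinyin surname
item['href'] = 'http://qian.resgain.net/name.html'  # hypothetical URL
item['xingshi_zhongwen'] = '钱'
print(dict(item))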
3 Crawl all surnames
# -*- coding: utf-8 -*-
import scrapy
from baijiaxing1.items import Xingshi_Item


class Baijiaxing2Spider(scrapy.Spider):
    name = 'baijiaxing2'
    # allowed_domains = ['resgain.net']
    start_urls = ('http://www.resgain.net/xmdq.html',)

    def parse(self, response):
        content = response.xpath('//div[@class="col-xs-12"]/a')
        for i in content:
            href_value = i.xpath('./@href').extract()[0]
            # the subdomain's first label is the pinyin surname,
            # e.g. an href like '//qian.resgain.net/name.html' yields 'qian'
            xingshi = href_value.split('.')[0].split('/')[-1]
            href = 'http:' + href_value
            item = Xingshi_Item()
            item['xingshi'] = xingshi
            item['href'] = href
            # the link text contains '姓名'; keep the characters before it
            item['xingshi_zhongwen'] = i.xpath('./text()').extract()[0].split('姓名')[0]
            yield item
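To make the slicing in parse() concrete, here is what it does to one link, assuming a protocol-relative href of the shape the code expects (the value below is made up):

href = '//qian.resgain.net/name.html'        # hypothetical a/@href value
xingshi = href.split('.')[0].split('/')[-1]  # '//qian' -> 'qian'
full_url = 'http:' + href                    # 'http://qian.resgain.net/name.html'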
4 Crawl all names
# -*- coding: utf-8 -*-
import scrapy
from baijiaxing1.items import Xingming_Item


class SpiderXingmingSpider(scrapy.Spider):
    name = 'spider_xingming'
    # allowed_domains = ['www.resgain.net']
    start_urls = ('http://www.resgain.net/xmdq.html',)

    def parse(self, response):
        content = response.xpath('//div[@class="col-xs-12"]/a/@href').extract()
        for i in content:
            href = 'http:' + i
            # each surname has ten name-list pages: name_list_0.html ... name_list_9.html
            base = href.split('/name')[0] + '/name_list_'
            for page in range(10):
                yield scrapy.Request(base + str(page) + '.html', callback=self.parse_in_html)

    # parse one name-list page
    def parse_in_html(self, response):
        person_info = response.xpath('//div[@class="col-xs-12"]/a')
        base_url = 'http://' + response.url.split('/')[2]
        # the subdomain's first label is the pinyin surname
        xingshi = response.url.split('/')[2].split('.')[0]
        for every_one in person_info:
            the_item = Xingming_Item()
            the_item['name'] = every_one.xpath('./text()').extract()[0]
            the_item['xingshi'] = xingshi
            the_person_info_url = base_url + every_one.xpath('./@href').extract()[0]
            yield scrapy.Request(the_person_info_url, meta={'the_item': the_item},
                                 callback=self.parse_every_html)

    # parse one name's detail page
    def parse_every_html(self, response):
        the_item = response.meta['the_item']
        # the shared-name count sits between '有' and '人' in the banner text
        brand = response.xpath('//div[@class="navbar-brand"]/text()').extract_first()
        the_item['the_same_people_number'] = brand.split('人')[0].split('有')[1]
        ratios = response.xpath(
            '//div[@class="progress"]/div[contains(@class, "progress-bar")]/text()').extract()
        the_item['boy_ratio'] = ratios[0].split('情况')[0]
        the_item['girl_ratio'] = ratios[1].split('情况')[0]
        quotes = response.xpath(
            '//div[@class="panel-body"]/div[@class="col-xs-6"]/blockquote/text()').extract()
        the_item['five_elements'] = quotes[0]
        the_item['three_talents'] = quotes[1]
        yield the_item
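The name-list URL construction in parse() can be traced the same way (again with a made-up surname page URL):

href = 'http://qian.resgain.net/name.html'     # hypothetical surname page URL
base = href.split('/name')[0] + '/name_list_'  # 'http://qian.resgain.net/name_list_'
urls = [base + str(page) + '.html' for page in range(10)]
# -> name_list_0.html through name_list_9.html on that subdomain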
5 pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
from baijiaxing1.items import Xingshi_Item, Xingming_Item


class XingShiPipeline(object):
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    def process_item(self, item, spider):
        if isinstance(item, Xingshi_Item):
            sql = 'INSERT INTO baijiaxing(xingshi, href, xingshi_zhongwen) VALUES (%s, %s, %s);'
            self.cursor.execute(sql, (item['xingshi'], str(item['href']), item['xingshi_zhongwen']))
            self.db.commit()
        elif isinstance(item, Xingming_Item):
            # look up the surname's primary key with a parameterized query
            self.cursor.execute('SELECT id FROM baijiaxing WHERE xingshi = %s', (item['xingshi'],))
            xingshi_id = self.cursor.fetchone()[0]
            sql = ('INSERT INTO xingming(name, the_same_people_number, boy_ratio, girl_ratio, '
                   'five_elements, three_talents, xingshi_id) VALUES (%s, %s, %s, %s, %s, %s, %s);')
            self.cursor.execute(sql, (item['name'], item['the_same_people_number'],
                                      item['boy_ratio'], item['girl_ratio'],
                                      item['five_elements'], item['three_talents'], xingshi_id))
            self.db.commit()
        return item

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('PYMYSQL_HOST'),
            database=crawler.settings.get('PYMYSQL_DATABASE'),
            user=crawler.settings.get('PYMYSQL_USER'),
            password=crawler.settings.get('PYMYSQL_PASSWORD'),
            port=crawler.settings.get('PYMYSQL_PORT'),
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()
Because there are two item types, process_item has to check which item class each piece of data came from.
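A possible refinement, not in the original code: with about 1.94 million name items but only 435 surnames, issuing one SELECT per name item is wasteful. A small in-memory cache, sketched here against the pipeline above, would cut that to one lookup per surname:

def get_xingshi_id(self, xingshi):
    # cache surname ids so each surname hits the database only once
    if xingshi not in self.id_cache:
        self.cursor.execute('SELECT id FROM baijiaxing WHERE xingshi = %s', (xingshi,))
        row = self.cursor.fetchone()
        self.id_cache[xingshi] = row[0] if row else None
    return self.id_cache[xingshi]

with self.id_cache = {} initialized in open_spider().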
6 settings.py
Finally, configure the request header and the database connection in settings.py:
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)'
ROBOTSTXT_OBEY = False
# MySQL database settings
PYMYSQL_HOST = '127.0.0.1'
PYMYSQL_DATABASE = 'test1'
PYMYSQL_USER = 'root'
PYMYSQL_PASSWORD = '123456'
PYMYSQL_PORT = 3306
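One thing to state explicitly: Scrapy only calls the pipeline if it is registered under ITEM_PIPELINES. Given the project name used throughout (baijiaxing1), the registration would be:

ITEM_PIPELINES = {
    'baijiaxing1.pipelines.XingShiPipeline': 300,
}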
7 Database tables
baijiaxing table
id | xingshi | href | xingshi_zhongwen |
---|---|---|---|
xingming table
id | name | the_same_people_number | boy_ratio | girl_ratio | five_elements | three_talents | xingshi_id |
---|---|---|---|---|---|---|---|
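The post does not include the DDL; for completeness, a sketch of the two tables with assumed column types (adjust lengths as needed):

import pymysql

DDL = [
    '''CREATE TABLE IF NOT EXISTS baijiaxing (
        id INT AUTO_INCREMENT PRIMARY KEY,
        xingshi VARCHAR(32),
        href VARCHAR(255),
        xingshi_zhongwen VARCHAR(32)
    )''',
    '''CREATE TABLE IF NOT EXISTS xingming (
        id INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(64),
        the_same_people_number VARCHAR(32),
        boy_ratio VARCHAR(32),
        girl_ratio VARCHAR(32),
        five_elements VARCHAR(64),
        three_talents VARCHAR(64),
        xingshi_id INT
    )''',
]

db = pymysql.connect(host='127.0.0.1', user='root', password='123456',
                     database='test1', port=3306)
with db.cursor() as cursor:
    for statement in DDL:
        cursor.execute(statement)
db.commit()
db.close()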
8 Run
There are two spiders, and baijiaxing2 must finish before spider_xingming starts, so I wrote a run.py:
run.py
import os
os.system("scrapy crawl baijiaxing2")
os.system("scrapy crawl spider_xingming")
GitHub: https://github.com/Fizzyi/baijiaxing/tree/master