Scraping the "Yaoyao proxy IP" site and storing the results in MySQL

Background

My crawler for the Xici proxy site (西刺代理) had long been written, waiting only for the proxy-IP data to flow into my MySQL server. Unfortunately my IP got banned while crawling, so I switched to another proxy-IP site: Yaoyao proxy IP (瑶瑶代理).

Configuration

Database work in the crawler is split between settings.py and pipelines.py: the former holds the configuration, the latter performs the operations. Note that the database referenced in the configuration must be created in advance.

  • settings.py
# Database configuration
DBKWARGS={'db':'ippool','user':'root', 'passwd':'root',
    'host':'127.0.0.1','use_unicode':True, 'charset':'utf8'}

# Pipeline configuration
ITEM_PIPELINES = {
   'httpsdaili.pipelines.HttpsdailiPipeline': 300,
}
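The pipeline below inserts into a `proxy` table, so that table has to exist before the crawl starts. A possible schema is sketched here; the column names come from the insert statement in pipelines.py, but the column types are assumptions:

```sql
-- Hypothetical schema matching the insert in pipelines.py;
-- the column types are assumptions, adjust as needed.
CREATE DATABASE IF NOT EXISTS ippool DEFAULT CHARACTER SET utf8;
USE ippool;
CREATE TABLE IF NOT EXISTS proxy (
    id INT AUTO_INCREMENT PRIMARY KEY,
    ip VARCHAR(32),
    port VARCHAR(8),
    anny VARCHAR(32),      -- anonymity level
    type VARCHAR(16),      -- HTTP / HTTPS
    position VARCHAR(64),  -- server location
    speed VARCHAR(32),
    time VARCHAR(32)       -- last-verified time
) DEFAULT CHARACTER SET utf8;
```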

Operations

  • pipelines.py
# -*- coding: utf-8 -*-

import MySQLdb

class HttpsdailiPipeline(object):
    def process_item(self, item, spider):
        DBKWARGS = spider.settings.get('DBKWARGS')  # read the DB config from settings
        con = MySQLdb.connect(**DBKWARGS)  # connect to the database
        cur = con.cursor()  # get a cursor
        sql = ("insert into proxy(ip,port,anny,type,position,speed,time) "
            "values(%s,%s,%s,%s,%s,%s,%s)")  # parameterized SQL statement
        lis = (item['ip'],item['port'],item['anny'],item['type'],
            item['position'],item['speed'],item['time'])  # values to insert
        try:
            cur.execute(sql, lis)
        except Exception as e:
            print("Insert error:", e)
            con.rollback()  # roll back the failed insert
        else:
            con.commit()  # the insert only takes effect after a commit
        cur.close()
        con.close()
        return item
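The pipeline uses a parameterized query (%s placeholders) rather than string concatenation, so odd characters in scraped values cannot break the SQL. The same try/rollback/commit pattern can be exercised locally with the stdlib sqlite3 module, standing in here for MySQLdb (sqlite3 uses ? placeholders instead of %s; the table schema is an assumption for the demo):

```python
import sqlite3

# In-memory SQLite table standing in for the MySQL 'proxy' table.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE proxy (ip TEXT, port TEXT, anny TEXT, type TEXT, "
            "position TEXT, speed TEXT, time TEXT)")

# Same parameterized-insert pattern as process_item; sqlite3 uses ?
# placeholders where MySQLdb uses %s.
sql = ("INSERT INTO proxy(ip, port, anny, type, position, speed, time) "
       "VALUES (?, ?, ?, ?, ?, ?, ?)")
lis = ("1.2.3.4", "8080", "anonymous", "HTTP", "somewhere", "0.5s", "2017-01-01")
try:
    cur.execute(sql, lis)
except Exception as e:
    print("Insert error:", e)
    con.rollback()  # roll back the failed insert
else:
    con.commit()  # the insert only takes effect after a commit

rows = cur.execute("SELECT ip, port FROM proxy").fetchall()
con.close()
```

Note that process_item opens and closes a fresh connection for every item; that is simple but slow, and a single connection shared across the spider's lifetime would be the usual improvement.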

Spider

  • items.py
# -*- coding: utf-8 -*-
import scrapy

class HttpsdailiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    ip = scrapy.Field()
    port = scrapy.Field()
    anny = scrapy.Field()
    type = scrapy.Field()
    position = scrapy.Field()
    speed = scrapy.Field()
    time = scrapy.Field()
  • spider.py (daili.py)
# -*- coding: utf-8 -*-
import scrapy
from httpsdaili.items import HttpsdailiItem

class DailiSpider(scrapy.Spider):
    name = 'daili'
    allowed_domains = ['httpsdaili.com']
    start_urls = ['http://httpsdaili.com/']

    def parse(self, response):
        for i in range(1,38):
            url = 'http://www.httpsdaili.com/free.asp?page=%s'%i
            print(url)
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.css('tr.odd'):
            item = HttpsdailiItem()
            item['ip'] = sel.css('td:nth-child(1)::text').extract_first()
            item['port'] = sel.css('td:nth-child(2)::text').extract_first()
            item['anny'] = sel.css('td:nth-child(3)::text').extract_first()
            item['type'] = sel.css('td:nth-child(4)::text').extract_first()
            item['position'] = sel.css('td:nth-child(5)::text').extract_first()
            item['speed'] = sel.css('td:nth-child(6)::text').extract_first()
            item['time'] = sel.css('td:nth-child(7)::text').extract_first()
            yield item
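parse_dir_contents assumes a fixed column order inside each tr.odd row: ip, port, anonymity, type, position, speed, time. That assumption can be checked against a sample row using the stdlib ElementTree, standing in for Scrapy's CSS selectors (the sample HTML below is made up for illustration):

```python
import xml.etree.ElementTree as ET

# A made-up table row in the column order the spider's selectors assume.
sample_row = """
<tr class="odd">
  <td>1.2.3.4</td><td>8080</td><td>anonymous</td><td>HTTP</td>
  <td>somewhere</td><td>0.5s</td><td>2017-01-01</td>
</tr>
"""

fields = ("ip", "port", "anny", "type", "position", "speed", "time")
cells = [td.text for td in ET.fromstring(sample_row).iter("td")]
item = dict(zip(fields, cells))  # a plain dict stands in for HttpsdailiItem
```

If the site reorders or adds columns, the nth-child selectors silently pick up the wrong values, so a sanity check like this is cheap insurance.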

Summary

Many small problems came up during debugging; most of them could be solved by reading the log files and searching the error messages online.
And of course, work on the Xici proxy crawler continues as well.
