My spider for the infuriating 西刺 (Xici) proxy site had long been written and was just waiting to feed proxy IP data into my MySQL server, but unfortunately the site banned my IP, so I turned to another proxy IP site, 瑶瑶代理IP.
The spider's database work lives mainly in settings.py and pipelines.py: the former holds the configuration, the latter performs the operations. Note that the database referenced in the code must be created and configured in advance.
# Database connection parameters
DBKWARGS = {'db': 'ippool', 'user': 'root', 'passwd': 'root',
            'host': '127.0.0.1', 'use_unicode': True, 'charset': 'utf8'}
# Enable the pipeline
ITEM_PIPELINES = {
    'httpsdaili.pipelines.HttpsdailiPipeline': 300,
}
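The pipeline below inserts into a `proxy` table inside the `ippool` database, so both must exist before the first crawl. The original post does not show the schema, so the DDL below is a sketch with assumed (all-VARCHAR) column types; it is checked here against an in-memory SQLite connection purely to validate the statement, and the same `CREATE TABLE` can be run against MySQL to create the real table.

```python
import sqlite3

# Assumed schema for the `proxy` table -- the column types are guesses,
# since the original post only shows the column names.
PROXY_DDL = """
CREATE TABLE proxy (
    ip       VARCHAR(64),
    port     VARCHAR(16),
    anny     VARCHAR(32),   -- anonymity level
    type     VARCHAR(16),   -- HTTP / HTTPS
    position VARCHAR(64),   -- geographic location
    speed    VARCHAR(32),
    time     VARCHAR(32)    -- last-verified timestamp
)
"""

# Sanity-check the DDL against in-memory SQLite (MySQL accepts the same statement).
con = sqlite3.connect(':memory:')
con.execute(PROXY_DDL)
cols = [row[1] for row in con.execute('PRAGMA table_info(proxy)')]
print(cols)  # ['ip', 'port', 'anny', 'type', 'position', 'speed', 'time']
con.close()
```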
# -*- coding: utf-8 -*-
import MySQLdb

class HttpsdailiPipeline(object):
    def process_item(self, item, spider):
        DBKWARGS = spider.settings.get('DBKWARGS')  # read the DB config from settings.py
        con = MySQLdb.connect(**DBKWARGS)  # connect to MySQL
        cur = con.cursor()
        sql = ("insert into proxy(ip,port,anny,type,position,speed,time) "
               "values(%s,%s,%s,%s,%s,%s,%s)")  # parameterized INSERT
        lis = (item['ip'], item['port'], item['anny'], item['type'],
               item['position'], item['speed'], item['time'])
        try:
            cur.execute(sql, lis)
        except Exception as e:
            print("Insert error:", e)
            con.rollback()  # roll back the failed insert
        else:
            con.commit()  # the insert is not persisted until committed
        cur.close()
        con.close()
        return item
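The try/except/else shape above is the standard DB-API transaction pattern: roll back on failure, commit on success. A minimal, self-contained sketch of the same pattern using the stdlib sqlite3 module (MySQLdb uses `%s` placeholders where sqlite3 uses `?`, but the flow is identical):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE proxy (ip TEXT NOT NULL, port TEXT)')
cur = con.cursor()

def insert_proxy(row):
    # Same shape as the pipeline: execute, roll back on error, else commit.
    try:
        cur.execute('INSERT INTO proxy(ip, port) VALUES (?, ?)', row)
    except Exception as e:
        print('Insert error:', e)
        con.rollback()
        return False
    else:
        con.commit()
        return True

print(insert_proxy(('119.28.1.1', '8080')))  # True
print(insert_proxy((None, '8080')))          # False -- NOT NULL violated
```

The `else` branch matters: committing inside `try` would also commit partial work from statements that succeeded before a later failure, whereas here only a fully successful `execute` is committed.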
# -*- coding: utf-8 -*-
import scrapy

class HttpsdailiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    ip = scrapy.Field()
    port = scrapy.Field()
    anny = scrapy.Field()
    type = scrapy.Field()
    position = scrapy.Field()
    speed = scrapy.Field()
    time = scrapy.Field()
# -*- coding: utf-8 -*-
import scrapy
from httpsdaili.items import HttpsdailiItem

class DailiSpider(scrapy.Spider):
    name = 'daili'
    allowed_domains = ['httpsdaili.com']
    start_urls = ['http://httpsdaili.com/']

    def parse(self, response):
        # the free-proxy list spans 37 pages
        for i in range(1, 38):
            url = 'http://www.httpsdaili.com/free.asp?page=%s' % i
            print(url)
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # each proxy row is a <tr class="odd">; the seven columns map to the item fields
        for sel in response.css('tr.odd'):
            item = HttpsdailiItem()
            item['ip'] = sel.css('td:nth-child(1)::text').extract_first()
            item['port'] = sel.css('td:nth-child(2)::text').extract_first()
            item['anny'] = sel.css('td:nth-child(3)::text').extract_first()
            item['type'] = sel.css('td:nth-child(4)::text').extract_first()
            item['position'] = sel.css('td:nth-child(5)::text').extract_first()
            item['speed'] = sel.css('td:nth-child(6)::text').extract_first()
            item['time'] = sel.css('td:nth-child(7)::text').extract_first()
            yield item
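The `td:nth-child(1..7)` selectors simply map the seven table columns, in order, onto the item's fields. Outside the Scrapy stack, the same column-to-field mapping can be sketched with the stdlib html.parser module; the sample row below is made up, and the real site's markup may differ.

```python
from html.parser import HTMLParser

# Field order mirrors the spider's td:nth-child(1..7) selectors.
FIELDS = ('ip', 'port', 'anny', 'type', 'position', 'speed', 'time')

class RowParser(HTMLParser):
    """Collects <td> text from <tr class="odd"> rows, like the spider's selectors."""
    def __init__(self):
        super().__init__()
        self.items = []       # one dict per matched row
        self._in_row = False
        self._in_td = False
        self._cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'tr' and ('class', 'odd') in attrs:
            self._in_row, self._cells = True, []
        elif tag == 'td' and self._in_row:
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False
        elif tag == 'tr' and self._in_row:
            self._in_row = False
            self.items.append(dict(zip(FIELDS, self._cells)))

    def handle_data(self, data):
        if self._in_td:
            self._cells.append(data.strip())

# Made-up sample row for illustration only.
sample = ('<table><tr class="odd"><td>119.28.1.1</td><td>8080</td>'
          '<td>高匿</td><td>HTTP</td><td>Guangdong</td><td>0.5s</td>'
          '<td>2018-01-01</td></tr></table>')
p = RowParser()
p.feed(sample)
print(p.items[0]['ip'], p.items[0]['port'])  # 119.28.1.1 8080
```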
Plenty of small problems surfaced during debugging; reading the log files and searching Baidu resolved most of them. Work on the 西刺 proxy spider, of course, is still ongoing.