币小站 Dev Log 1 -- Crawling Blockchain News with Python 3
Blockchain is all the rage lately, so I decided to build a media site that crawls and analyzes blockchain news. A media site needs a data source, though, and where does that come from? Writing original content can wait; the first step is to crawl. It's all public information anyway, so no personal privacy is involved.
To start with, I picked a handful of blockchain news sites:
- 链闻
- 8btc
- 区势传媒
- 金色财经
- 链向财经
The crawling logic is largely the same for every site, so I'll walk through just one of them in detail.
The code below crawls 金色财经:
```python
import urllib.request
import json
import _thread
import threading
import time
import mysql.connector
from pyquery import PyQuery as pq

import news_base


def url_open(url):
    """Fetch a URL as UTF-8 text, retrying up to 10 times before giving up."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    for i in range(10):
        try:
            response = urllib.request.urlopen(url=req, timeout=5).read().decode('utf-8')
            return response
        except Exception as e:
            print("chainnewscrawl except:", e)


def get_news(page_count, cb):
    """Crawl page_count pages of the 金色财经 list API and hand each article to cb."""
    time_utc = int(time.time())
    error_count = 0  # consecutive failures reported by the callback
    index = 0        # id of the last article seen, used for paging
    for i in range(1, page_count + 1):
        response = url_open(
            "https://api.jinse.com/v6/information/list?catelogue_key=www"
            "&limit=23&information_id=%d&flag=down&version=9.9.9&_source=www" % index)
        if response is None:
            continue  # every retry failed for this page; skip it
        json_data = json.loads(response)
        for item in json_data['list']:
            # types 1 and 2 are ordinary articles; skip everything else
            if item["type"] != 1 and item["type"] != 2:
                continue
            article_item = news_base.article_info(
                item["extra"]['author'],             # author
                int(item["extra"]["published_at"]),  # publish time (UTC timestamp)
                item['title'],                       # title
                item["extra"]['summary'],            # summary
                'content',                           # placeholder, filled in below
                item["extra"]['topic_url'],          # link to the full article
                "金色财经")                           # source media
            # fetch the article page itself and extract the body
            source_responce = url_open(article_item.source_addr)
            source_doc = pq(source_responce)
            detail_html = source_doc(".js-article-detail").html()
            article_item.content = detail_html if detail_html else source_doc(".js-article").html()
            index = item['id']
            # cb returns a falsy value when the article could not be stored;
            # five failures in a row means it is time to stop
            if not cb(article_item):
                error_count += 1
            else:
                error_count = 0
            if error_count >= 5:
                break
        if error_count >= 5:
            break
```
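For a quick standalone test of this module (my own addition, not part of the original code), the callback can simply print each article; note that it has to return something truthy, otherwise get_news counts it as a failure and stops after five in a row:

```python
# Hypothetical quick test: print each crawled article instead of storing it.
if __name__ == "__main__":
    # print() returns None, so "or True" keeps the callback truthy
    get_news(1, lambda article: print(article) or True)
```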
First, a few quick words about the imported libraries.
urllib.request is what actually fetches pages over HTTP/HTTPS. Because a single request fails fairly often, I wrote a small helper function for it that I find very handy:
```python
def url_open(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    for i in range(10):
        try:
            response = urllib.request.urlopen(url=req, timeout=5).read().decode('utf-8')
            return response
        except Exception as e:
            print("chainnewscrawl except:", e)
```
All it does is keep retrying the same URL, up to 10 times in a row, so in practice just about every page I want to crawl does get fetched. It has served me well.
PyQuery is a jQuery-like tool for querying the fetched HTML.
mysql.connector is used to persist the results into a MySQL database (a rough sketch of the db_base module that wraps it appears after the threading code below).
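For example, the crawler above relies on the .js-article-detail selector to pull out the article body; a minimal illustration of the same idea (the HTML snippet here is made up):

```python
from pyquery import PyQuery as pq

# Made-up fragment of an article page, just to show the selector style
html = '<div class="js-article-detail"><p>Hello blockchain</p></div>'
doc = pq(html)
print(doc(".js-article-detail").html())  # -> <p>Hello blockchain</p>
```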
As for news_base: since several sites are crawled at the same time, they all need a shared data structure, shown below:
```python
class article_info:
    def __init__(self, author, time_utc, title, desc, content, source_addr, source_media):
        self.author = author              # author name
        self.time_utc = time_utc          # publish time as a UTC timestamp
        self.title = title                # article title
        self.desc = desc                  # short summary
        self.content = content            # full HTML body
        self.source_addr = source_addr    # URL of the original article
        self.source_media = source_media  # which site it came from

    def __str__(self):
        # the literal string 'self.content' is printed instead of the
        # (often very long) body, to keep the output short
        return ("""==========================
author:%s
time_utc:%d
title:%s
desc:%s
content:%s
source_addr:%s
source_media:%s""" % (self.author, self.time_utc, self.title, self.desc,
                      'self.content', self.source_addr, self.source_media))
```
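As a quick illustration (all values made up), constructing and printing one of these objects produces the summary block defined by __str__, with the body replaced by a literal placeholder:

```python
import news_base

# Made-up values, purely to show the printable summary
a = news_base.article_info("some author", 1534000000, "Some title", "A short summary",
                           "<p>full html body</p>", "https://example.com/post/1", "金色财经")
print(a)
```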
As for the crawl itself: fetching articles one HTTP request at a time and waiting for each response is painfully slow, so the crawlers have to run in parallel. The work is almost entirely waiting on network I/O, so with one thread per site the total time drops to roughly that of the slowest site. The code is as follows:
```python
import db_base
import news_chainfor
import news_jinse
import news_8btc
import news_55coin
import news_chainnews
import threading


class myThread(threading.Thread):
    """Runs one crawler's get_news(page_count, callback) in its own thread."""
    def __init__(self, func, arg1, arg2):
        threading.Thread.__init__(self)
        self.func = func
        self.arg1 = arg1
        self.arg2 = arg2

    def run(self):
        print("Starting thread: " + self.name)
        self.func(self.arg1, self.arg2)
        print("Exiting thread: " + self.name)


def run():
    db_base.init_db()
    # one thread per site, each crawling 10 pages and storing via db_base.insert_article
    thread_list = [
        myThread(news_55coin.get_news, 10, db_base.insert_article),
        myThread(news_8btc.get_news, 10, db_base.insert_article),
        myThread(news_jinse.get_news, 10, db_base.insert_article),
        myThread(news_chainfor.get_news, 10, db_base.insert_article),
        myThread(news_chainnews.get_news, 10, db_base.insert_article)
    ]
    for t in thread_list:
        t.start()
    for t in thread_list:
        t.join()  # wait for every crawler to finish
```
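The db_base module referenced here isn't shown in this post (it lives in the repository), but from the way it is used, init_db sets up the database connection and insert_article stores one article_info, returning something falsy when the article can't be saved, which is what the error_count logic in get_news keys off. A very rough sketch along those lines, with connection settings and table/column names that are purely my own guesses, might look like this:

```python
# Hypothetical sketch of db_base -- the real module in the repository will differ.
import mysql.connector

_conn = None

def init_db():
    # Connection parameters are placeholders, not the project's real settings.
    global _conn
    _conn = mysql.connector.connect(
        host="127.0.0.1", user="root", password="password", database="coin")

def insert_article(article):
    """Store one news_base.article_info; return False if it cannot be saved."""
    # Note: sharing one connection across crawler threads would need a lock
    # (or one connection per thread) in a real implementation.
    try:
        cursor = _conn.cursor()
        cursor.execute(
            "INSERT INTO article (author, time_utc, title, descr, content, "
            "source_addr, source_media) VALUES (%s, %s, %s, %s, %s, %s, %s)",
            (article.author, article.time_utc, article.title, article.desc,
             article.content, article.source_addr, article.source_media))
        _conn.commit()
        cursor.close()
        return True
    except mysql.connector.Error:
        return False
```

To kick everything off, run() just needs to be called from the project's entry-point script.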
Since I had barely used Python before and picked it up on the fly for this project, the code is probably a bit ugly, but I'm sharing it anyway, hahaha.
币小站 is already live at http://www.bxiaozhan.com
All of the site's code (front end and back end) is open source at https://github.com/lihn1987/CoinCollector
Feedback and pointers are very welcome.