In my day-to-day studies I happened to need to collect and analyse a large amount of text from the web, and since I know a little Python, I used the Python scraping library Beautiful Soup together with the Chinese NLP library jieba to crawl the pages and analyse the text, and then tally the most frequent words for each year.
The program does the following. First, it scrapes one round of titles, dates and hyperlinks from the target pages (the news pages of the Shenzhen Municipal Transport Bureau's official website and Baidu News), then follows each hyperlink to fetch the full article text and writes everything to files. Second, it uses jieba to segment the stored text and count word frequencies year by year, which yields the high-frequency words in each year's news on the target site.
Parts of the code are adapted from various blogs found online, but that source code still ran into problems during actual scraping and text analysis. The code in this post is therefore the result of pulling those references together, wiring up the individual features and fixing the bugs, with the goal that anyone who has installed the required libraries can run it in one go. I would not call it the definitive version, but it genuinely works and has some advantages in speed and in keeping the crawl running without interruption, so let's call it the "modestly complete" edition. I am mainly writing this post to sum up the journey, so that future me can come back and see what I was thinking in each part :)
The rest of the post is split into three parts: required libraries, code walkthrough, and the complete code.
My environment is Ubuntu 16.04 with Python 2.7. For installing Ubuntu 16.04 you can refer to my earlier post on dual-booting Win10 and Ubuntu 16.04 on two hard drives (要点初见:双硬盘下的Win10+Ubuntu16.04双系统安装); Python 2.7 ships with the system.
The Python packages to install via pip install * or sudo apt-get install python-* are:
Beautiful Soup. Install with: pip install beautifulsoup4
requests. Install with: sudo apt-get install python-requests
pandas. Install with: pip install pandas
jieba. Install with: pip install jieba
lxml is also required, since the code below uses it as the Beautiful Soup parser. Install with: pip install lxml
The stop-word file stopwords.txt used by the word-frequency step can be downloaded here:
Link: https://pan.baidu.com/s/1FtlnF_oAXKnG7NIS6S6Ccw
Extraction code: 8pwn
Copy it into the same directory as the code file.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Chinese text encoding on Linux can be baffling, so pinning the default encoding to UTF-8 with the lines above is cheap insurance. In my tests, running the script from the command line and printing Chinese directly with print("中文") displays the characters correctly, but when the print mixes Chinese with something else (for example print("中文", some_variable)), the Chinese part shows up as u'......'. Don't panic when that happens: simply drop the ", some_variable" part and the Chinese prints fine.
Note: if the Python source contains Chinese text or Chinese comments, be sure to add the following at the top of the file:
#encoding:UTF-8
jieba.enable_parallel(4)
This speeds up jieba's segmentation, and there are not many libraries that let you switch on parallel processing with a single line! The 4 can be changed to the number of CPU cores shown in Ubuntu's System Monitor.
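For reference, a minimal sketch of what parallel segmentation looks like (the sample text is just a placeholder; jieba's parallel mode is based on multiprocessing and works fine here on Ubuntu):
import jieba

jieba.enable_parallel(4)                         # segment with 4 worker processes
text = u"深圳市交通运输局发布最新通告。" * 10000     # any long text will do
words = jieba.lcut(text)                         # exactly the same call as in single-process mode
jieba.disable_parallel()                         # switch back to single-process segmentation
print(len(words))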
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}
# Scrape proxy addresses from xicidaili.com and save them to a local CSV file
def get_proxy():
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}
proxy_url = 'http://www.xicidaili.com/nn/'
resp = requests.get(proxy_url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
ips = soup.find_all('tr')
proxy_list = []
for i in range(1, len(ips)):
ip_info = ips[i]
tds = ip_info.find_all('td')
proxy_list.append(tds[1].text + ':' + tds[2].text)
    if len(proxy_list) == 0:
        print("Probably blocked by the proxy site", '\n')
    else:
        print("Number of proxies fetched:")
        print(len(proxy_list))
with open("proxies.csv", 'wb+') as csvfile:
writer = csv.writer(csvfile)
for i in range(len(proxy_list)):
# print(proxy_list[i])
writer.writerow([proxy_list[i]])
# Running this once is enough;
# run it again whenever you want to refresh the local proxy list.
#get_proxy()
# Pick a random proxy from the CSV of saved proxies
def get_random_proxy():
column_names = ["proxy"]
proxy_csv = pd.read_csv("proxies.csv", names=column_names)
    # A list to hold the proxy addresses
proxy_arr = []
for i in range(len(proxy_csv["proxy"])):
proxy_arr.append(proxy_csv["proxy"][i])
    # Pick one proxy at random
random_proxy = proxy_arr[random.randint(0, len(proxy_arr)-1)]
print("当前代理:")
print(random_proxy)
return random_proxy
To be honest this part is not very useful. Its purpose is to fetch a list of IPs from a site that publishes free proxies and crawl through them, but in practice I found that doing so made the target site's anti-crawling measures more likely to block me; it only helped when crawling Baidu, and in the situations where I did get blocked it made no difference either way. So this part got carried off by Occam's razor.
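If you still want to try it, here is a minimal sketch of plugging the random proxy into a request (get_random_proxy returns strings of the form ip:port; free proxies die quickly, so keep a fallback to a direct connection):
import requests

url_news = "http://jtys.sz.gov.cn/hdjl/zxft/fthg/"      # the listing page used further below
proxy = get_random_proxy()                              # e.g. "1.2.3.4:8080"
try:
    resp = requests.get(url_news, headers=headers,      # headers is the User-Agent dict above
                        proxies={"http": "http://" + proxy}, timeout=10)
except requests.exceptions.RequestException:
    resp = requests.get(url_news, headers=headers)      # fall back to no proxy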
url_ori = "http://jtys.sz.gov.cn/zwgk/jtzx/tzgg/201905/t20190530_17745684.htm"
proxies = { "http": None, "https": None}
resp = requests.get(url_ori, proxies=proxies)
resp.encoding = 'utf-8'
# page content
bsObj = BeautifulSoup(resp.content, "lxml")
print(bsObj)
block_detail = bsObj.find_all("div", {"class": "xxxq_text_cont"})
print(block_detail[0].get_text())
This part is the core of the scraper. Every target site lays out its pages differently, so you need to print the page parsed by Beautiful Soup, look through it for the element and class names that wrap the content you want, and then extract and print those elements on their own to check that they really hold the target content.
The fragment first requests the target URL with the current IP, which can occasionally be blocked (it rarely happens; if it does, see (5)). It then parses and prints the page with Beautiful Soup using the lxml parser. After that comes selecting the target elements.
My rule of thumb:
First check whether the target text sits inside a <strong> tag; if it does, it can be pulled out with
current_time = (block_news[i].find('strong')).get_text()
Next check whether the target text sits inside an <a target="_blank"> tag; if it does, it can be pulled out with
current_title = (block_news[i].find('a', {'target': "_blank"})).get_text()
The lower half of this fragment prints the selected elements so you can confirm the selection is correct. During the real run these prints are usually commented out, because the output is really, really long...
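A quick check along these lines keeps the output readable; the [:200] slice is only there to shorten the dump:
# Print only what matched, instead of dumping the whole parsed page.
block_detail = bsObj.find_all("div", {"class": "xxxq_text_cont"})
print(len(block_detail))                     # how many matching <div> blocks were found
if block_detail:
    print(block_detail[0].get_text()[:200])  # first 200 characters of the article text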
while try_success_1 != 1:
try:
resp = requests.get(url_news, headers=headers, proxies=proxies)
try_success_1 = 1
except:
print("Connection refused by the server..")
print("Let me sleep for 5 seconds")
print("ZZzzzz...")
time.sleep(5)
print("Was a nice sleep, now let me continue...")
try_success_1 = 0
continue
A crawler will sooner or later be tripped up by anti-crawling measures, or maybe I was simply crawling too fast... This snippet saves you from restarting the program again and again: when a request fails, the program lies down for a few seconds where it fell, then gets back up and retries the target URL :)
Setting requests.adapters.DEFAULT_RETRIES = 5 is also supposed to help against such blocks in theory (although personally I found it less effective than the try/except above).
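If you would rather have requests retry on its own, another option is to mount an HTTPAdapter with max_retries on a Session; this is just an alternative sketch and is not what the final code below uses:
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Retry failed connection attempts up to 5 times before raising an exception.
session.mount('http://', HTTPAdapter(max_retries=5))
session.mount('https://', HTTPAdapter(max_retries=5))
# headers is the User-Agent dict defined at the top of the script.
resp = session.get("http://jtys.sz.gov.cn/hdjl/zxft/fthg/", headers=headers, timeout=10)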
for j in range(0,8):
try_success_1 = 0
if j==0:
url_news = url_part
else:
url_news = url_part + "index_" + str(j) + ".htm"
proxies = { "http": None, "https": None}
Take a good look in the browser at how the URL changes from one news page to the next; usually only a trailing number changes (on Baidu, for instance, only the number after pn= changes). Editing that part of the URL string and issuing the request is all it takes to page through the results dynamically.
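As an illustration, Baidu result pages follow the same pattern, with only the pn offset changing; the query parameters below are an assumed example rather than something taken from the final code:
# Hypothetical Baidu-style pagination: pn advances by 10 per results page.
base_url = "https://www.baidu.com/s?tn=news&wd=keyword&pn="
for page in range(0, 8):
    url_news = base_url + str(page * 10)
    #resp = requests.get(url_news, headers=headers, proxies=proxies)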
resp.encoding = 'utf-8'
    # page content
bsObj = BeautifulSoup(resp.content, "lxml")
#print(bsObj)
    # parse the news listing block
block = bsObj.find_all("div", {"class": "wendangListC"})
#print(block)
block_news = block[0].find_all("li")
# analysis every block
for i in range(len(block_news)):
try_success_2 = 0
# get Time
tmp = block_news[i].find_all('strong')
if tmp and i != 0 :
current_time = (block_news[i].find('strong')).get_text()
current_time_dec = current_time.decode('unicode_escape').encode('raw_unicode_escape')
current_times.append(current_time_dec)
# get Title
if block_news[i].find_all('a', {'target': "_blank"}):
current_title = (block_news[i].find('a', {'target': "_blank"})).get_text()
current_title_dec = current_title.decode('unicode_escape').encode('raw_unicode_escape')
current_titles.append(current_title_dec)
current_http = (block_news[i].find('a', {'target': "_blank"})).get('href')
current_http_dec = current_http.decode('unicode_escape').encode('raw_unicode_escape')
current_https.append(url_part + current_http_dec)
#resp_detail = requests.get(url_part + current_http_dec, proxies=proxies)
while try_success_2 != 1:
try:
resp_detail = requests.get(url_part + current_http_dec, proxies=proxies)
try_success_2 = 1
except:
print("Connection refused by the server..")
print("Let me sleep for 5 seconds")
print("ZZzzzz...")
time.sleep(5)
print("Was a nice sleep, now let me continue...")
try_success_2 = 0
continue
resp_detail.encoding = 'utf-8'
            # article page content
bsObj_detail = BeautifulSoup(resp_detail.content, "lxml")
#print(bsObj_detail)
            # parse the article page
block_detail = bsObj_detail.find_all("div", {"class": "xxxq_text_cont"})
#print(block_detail[0].get_text())
            # get Content
if bsObj_detail.find_all("div", {"class": "xxxq_text_cont"}):
current_contents.append(block_detail[0].get_text())
print("i_num is: " + str(i))
else:
print("i_num is: " + str(i))
print(current_title_dec)
print("j_num is: " + str(j))
print(len(current_contents))
print(len(current_times))
This part sorts the scraped articles by period: it opens one output file per time span and writes each article's content into the file that matches its publication date:
f_20189 = open('20189_content.txt','w')
f_20167 = open('20167_content.txt','w')
f_20145 = open('20145_content.txt','w')
f_20123 = open('20123_content.txt','w')
f_2011andb = open('2011andb_content.txt','w')
for i in range(len(current_times)):
#print(current_times[i])
#print(current_titles[i])
#print(current_https[i])
#print(current_contents[i])
if current_times[i][2] == '1' and (current_times[i][3] == '9' or current_times[i][3] == '8'): #2018-2019
f_20189.write(current_contents[i])
f_20189.write('\n')
if current_times[i][2] == '1' and (current_times[i][3] == '7' or current_times[i][3] == '6'): #2016-2017
f_20167.write(current_contents[i])
f_20167.write('\n')
if current_times[i][2] == '1' and (current_times[i][3] == '5' or current_times[i][3] == '4'): #2014-2015
f_20145.write(current_contents[i])
f_20145.write('\n')
if current_times[i][2] == '1' and (current_times[i][3] == '3' or current_times[i][3] == '2'): #2012-2013
f_20123.write(current_contents[i])
f_20123.write('\n')
if current_times[i][2] == '0' or (current_times[i][2] == '1' and (current_times[i][3] == '1' or current_times[i][3] == '0')): #Before 2011
f_2011andb.write(current_contents[i])
f_2011andb.write('\n')
f_20189.close()
f_20167.close()
f_20145.close()
f_20123.close()
f_2011andb.close()
Because every scraped date has the same neat format, the code simply checks individual characters of the date string to decide which file each article is written to.
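Concretely, the dates all look like "2019-05-30", so characters [2] and [3] of the string are enough to tell the periods apart:
# "2019-05-30": character [2] is '1' and character [3] is '9', so it lands in the 2018-2019 file;
# a '0' at position [2] means a year in 2000-2009, which goes into the "2011 and before" file.
date = "2019-05-30"
print(date[2] == '1' and date[3] in ('9', '8'))   # True -> written to 20189_content.txt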
f_20189_result = open('20189_result.txt','w')
f_20167_result = open('20167_result.txt','w')
f_20145_result = open('20145_result.txt','w')
f_20123_result = open('20123_result.txt','w')
f_2011andb_result = open('2011andb_result.txt','w')
for k in range(0,5):
if k == 0:
txt = open("20189_content.txt").read()
f_20189_result.write("2018-2019 Data from http://jtys.sz.gov.cn/hdjl/zxft/fthg/\n")
elif k == 1:
txt = open("20167_content.txt").read()
f_20167_result.write("2016-2017 Data from http://jtys.sz.gov.cn/hdjl/zxft/fthg/\n")
elif k == 2:
txt = open("20145_content.txt").read()
f_20145_result.write("2014-2015 Data from http://jtys.sz.gov.cn/hdjl/zxft/fthg/\n")
elif k == 3:
txt = open("20123_content.txt").read()
f_20123_result.write("2012-2013 Data from http://jtys.sz.gov.cn/hdjl/zxft/fthg/\n")
elif k == 4:
txt = open("2011andb_content.txt").read()
f_2011andb_result.write("Before 2011 Data from http://jtys.sz.gov.cn/hdjl/zxft/fthg/\n")
    # load the stop-word list
stopwords = [line.strip() for line in open("stopwords.txt").readlines()]
words = jieba.lcut(txt)
counts = {}
for word in words:
        # skip words that appear in the stop-word list
if word not in stopwords:
            # skip single-character words
if len(word) == 1:
continue
else:
counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(30):
word, count = items[i]
print ("{:<10}{:>7}".format(word, count))
if k == 0:
f_20189_result.write("{:<10}{:>7}".format(word, count))
f_20189_result.write("\n")
elif k == 1:
f_20167_result.write("{:<10}{:>7}".format(word, count))
f_20167_result.write("\n")
elif k == 2:
f_20145_result.write("{:<10}{:>7}".format(word, count))
f_20145_result.write("\n")
elif k == 3:
f_20123_result.write("{:<10}{:>7}".format(word, count))
f_20123_result.write("\n")
elif k == 4:
f_2011andb_result.write("{:<10}{:>7}".format(word, count))
f_2011andb_result.write("\n")
f_20189_result.close()
f_20167_result.close()
f_20145_result.close()
f_20123_result.close()
f_2011andb_result.close()
Apart from the file I/O, the core of this part is using jieba to segment each year's text and print the 30 most frequent words. Be sure to use a stop-word file and to discard single-character results, otherwise meaningless characters and filler words will swamp the statistics. Nobody wants a frequency list whose top three entries are the comma, the full stop and "我" ("I").
The stop-word file is linked in the "required libraries" section at the top of this post; just copy it into the code directory.
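Incidentally, the counting loop can be written more compactly with collections.Counter; the sketch below is equivalent to the logic in the full code:
from collections import Counter
import jieba

def top_words(text, stopwords, n=30):
    # Segment, drop stop words and single-character tokens, count the rest.
    words = [w for w in jieba.lcut(text) if w not in stopwords and len(w) > 1]
    return Counter(words).most_common(n)

stopwords = [line.strip() for line in open("stopwords.txt").readlines()]
for word, count in top_words(open("20189_content.txt").read(), stopwords):
    print("{:<10}{:>7}".format(word, count))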
The complete code follows. cd into the code directory and run it from the command line with "python2 <script name>.py":
#encoding:UTF-8
from bs4 import BeautifulSoup
import requests
import random
import datetime
import csv
import pandas as pd
import time
import jieba
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
jieba.enable_parallel(4)
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}
# Scrape proxy addresses from xicidaili.com and save them to a local CSV file
def get_proxy():
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}
proxy_url = 'http://www.xicidaili.com/nn/'
resp = requests.get(proxy_url, headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
ips = soup.find_all('tr')
proxy_list = []
for i in range(1, len(ips)):
ip_info = ips[i]
tds = ip_info.find_all('td')
proxy_list.append(tds[1].text + ':' + tds[2].text)
    if len(proxy_list) == 0:
        print("Probably blocked by the proxy site", '\n')
    else:
        print("Number of proxies fetched:")
        print(len(proxy_list))
with open("proxies.csv", 'wb+') as csvfile:
writer = csv.writer(csvfile)
for i in range(len(proxy_list)):
# print(proxy_list[i])
writer.writerow([proxy_list[i]])
# Running this once is enough;
# run it again whenever you want to refresh the local proxy list.
#get_proxy()
# Pick a random proxy from the CSV of saved proxies
def get_random_proxy():
column_names = ["proxy"]
proxy_csv = pd.read_csv("proxies.csv", names=column_names)
    # A list to hold the proxy addresses
proxy_arr = []
for i in range(len(proxy_csv["proxy"])):
proxy_arr.append(proxy_csv["proxy"][i])
    # Pick one proxy at random
random_proxy = proxy_arr[random.randint(0, len(proxy_arr)-1)]
print("当前代理:")
print(random_proxy)
return random_proxy
# Crawl the target site
def get_news():
#html = "http://www.sutpc.com/news/"#https://www.ithome.com/blog/
#resp = requests.get(html, headers=headers, proxies={'http' : get_random_proxy()})
current_titles = []
current_times = []
current_https = []
current_contents = []
'''
url_ori = "http://jtys.sz.gov.cn/zwgk/jtzx/tzgg/201905/t20190530_17745684.htm"
proxies = { "http": None, "https": None}
resp = requests.get(url_ori, proxies=proxies)
resp.encoding = 'utf-8'
    # page content
bsObj = BeautifulSoup(resp.content, "lxml")
#print(bsObj)
block_detail = bsObj.find_all("div", {"class": "xxxq_text_cont"})
print(block_detail[0].get_text())
'''
url_part = "http://jtys.sz.gov.cn/hdjl/zxft/fthg/"
requests.adapters.DEFAULT_RETRIES = 5
#url_part = "http://jtys.sz.gov.cn/zwgk/jtzx/gzdt/"
for j in range(0,8):
try_success_1 = 0
if j==0:
url_news = url_part
else:
url_news = url_part + "index_" + str(j) + ".htm"
proxies = { "http": None, "https": None}
#resp = requests.get(url_news, headers=headers, proxies=proxies)
while try_success_1 != 1:
try:
resp = requests.get(url_news, headers=headers, proxies=proxies)
try_success_1 = 1
except:
print("Connection refused by the server..")
print("Let me sleep for 5 seconds")
print("ZZzzzz...")
time.sleep(5)
print("Was a nice sleep, now let me continue...")
try_success_1 = 0
continue
resp.encoding = 'utf-8'
        # page content
bsObj = BeautifulSoup(resp.content, "lxml")
#print(bsObj)
        # parse the news listing block
block = bsObj.find_all("div", {"class": "wendangListC"})
#print(block)
block_news = block[0].find_all("li")
# analysis every block
for i in range(len(block_news)):
try_success_2 = 0
# get Time
tmp = block_news[i].find_all('strong')
if tmp and i != 0 :
current_time = (block_news[i].find('strong')).get_text()
current_time_dec = current_time.decode('unicode_escape').encode('raw_unicode_escape')
current_times.append(current_time_dec)
# get Title
if block_news[i].find_all('a', {'target': "_blank"}):
current_title = (block_news[i].find('a', {'target': "_blank"})).get_text()
current_title_dec = current_title.decode('unicode_escape').encode('raw_unicode_escape')
current_titles.append(current_title_dec)
current_http = (block_news[i].find('a', {'target': "_blank"})).get('href')
current_http_dec = current_http.decode('unicode_escape').encode('raw_unicode_escape')
current_https.append(url_part + current_http_dec)
#resp_detail = requests.get(url_part + current_http_dec, proxies=proxies)
while try_success_2 != 1:
try:
resp_detail = requests.get(url_part + current_http_dec, proxies=proxies)
try_success_2 = 1
except:
print("Connection refused by the server..")
print("Let me sleep for 5 seconds")
print("ZZzzzz...")
time.sleep(5)
print("Was a nice sleep, now let me continue...")
try_success_2 = 0
continue
resp_detail.encoding = 'utf-8'
                # article page content
bsObj_detail = BeautifulSoup(resp_detail.content, "lxml")
#print(bsObj_detail)
                # parse the article page
block_detail = bsObj_detail.find_all("div", {"class": "xxxq_text_cont"})
#print(block_detail[0].get_text())
                # get Content
if bsObj_detail.find_all("div", {"class": "xxxq_text_cont"}):
current_contents.append(block_detail[0].get_text())
print("i_num is: " + str(i))
else:
print("i_num is: " + str(i))
print(current_title_dec)
print("j_num is: " + str(j))
print(len(current_contents))
print(len(current_times))
f_20189 = open('20189_content.txt','w')
f_20167 = open('20167_content.txt','w')
f_20145 = open('20145_content.txt','w')
f_20123 = open('20123_content.txt','w')
f_2011andb = open('2011andb_content.txt','w')
for i in range(len(current_times)):
#print(current_times[i])
#print(current_titles[i])
#print(current_https[i])
#print(current_contents[i])
if current_times[i][2] == '1' and (current_times[i][3] == '9' or current_times[i][3] == '8'): #2018-2019
f_20189.write(current_contents[i])
f_20189.write('\n')
if current_times[i][2] == '1' and (current_times[i][3] == '7' or current_times[i][3] == '6'): #2016-2017
f_20167.write(current_contents[i])
f_20167.write('\n')
if current_times[i][2] == '1' and (current_times[i][3] == '5' or current_times[i][3] == '4'): #2014-2015
f_20145.write(current_contents[i])
f_20145.write('\n')
if current_times[i][2] == '1' and (current_times[i][3] == '3' or current_times[i][3] == '2'): #2012-2013
f_20123.write(current_contents[i])
f_20123.write('\n')
if current_times[i][2] == '0' or (current_times[i][2] == '1' and (current_times[i][3] == '1' or current_times[i][3] == '0')): #Before 2011
f_2011andb.write(current_contents[i])
f_2011andb.write('\n')
f_20189.close()
f_20167.close()
f_20145.close()
f_20123.close()
f_2011andb.close()
f_20189_result = open('20189_result.txt','w')
f_20167_result = open('20167_result.txt','w')
f_20145_result = open('20145_result.txt','w')
f_20123_result = open('20123_result.txt','w')
f_2011andb_result = open('2011andb_result.txt','w')
for k in range(0,5):
if k == 0:
txt = open("20189_content.txt").read()
f_20189_result.write("2018-2019 Data from http://jtys.sz.gov.cn/hdjl/zxft/fthg/\n")
elif k == 1:
txt = open("20167_content.txt").read()
f_20167_result.write("2016-2017 Data from http://jtys.sz.gov.cn/hdjl/zxft/fthg/\n")
elif k == 2:
txt = open("20145_content.txt").read()
f_20145_result.write("2014-2015 Data from http://jtys.sz.gov.cn/hdjl/zxft/fthg/\n")
elif k == 3:
txt = open("20123_content.txt").read()
f_20123_result.write("2012-2013 Data from http://jtys.sz.gov.cn/hdjl/zxft/fthg/\n")
elif k == 4:
txt = open("2011andb_content.txt").read()
f_2011andb_result.write("Before 2011 Data from http://jtys.sz.gov.cn/hdjl/zxft/fthg/\n")
        # load the stop-word list
stopwords = [line.strip() for line in open("stopwords.txt").readlines()]
words = jieba.lcut(txt)
counts = {}
for word in words:
            # skip words that appear in the stop-word list
if word not in stopwords:
                # skip single-character words
if len(word) == 1:
continue
else:
counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(30):
word, count = items[i]
print ("{:<10}{:>7}".format(word, count))
if k == 0:
f_20189_result.write("{:<10}{:>7}".format(word, count))
f_20189_result.write("\n")
elif k == 1:
f_20167_result.write("{:<10}{:>7}".format(word, count))
f_20167_result.write("\n")
elif k == 2:
f_20145_result.write("{:<10}{:>7}".format(word, count))
f_20145_result.write("\n")
elif k == 3:
f_20123_result.write("{:<10}{:>7}".format(word, count))
f_20123_result.write("\n")
elif k == 4:
f_2011andb_result.write("{:<10}{:>7}".format(word, count))
f_2011andb_result.write("\n")
f_20189_result.close()
f_20167_result.close()
f_20145_result.close()
f_20123_result.close()
f_2011andb_result.close()
get_news()
If you want to crawl and analyse a different site, be sure to first go through section 2 (4) above to inspect the page and identify the names of the target elements and sub-blocks, and adapt the page-turning logic in section 2 (6) accordingly.
Comments and discussion are welcome!