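A Beginner's Python Web Scraping Project

Because of a business requirement, this newbie decided to learn some Python.

Python is an interpreted language. Compared with the C family and Java, it is cheaper to develop in and easier to pick up, and it also integrates fairly easily with other languages. It has its shortcomings too, of course, but I don't know enough yet to comment on them.

For now I'm using Python to scrape the search terms of users in English-speaking countries, mainly from Google Trends: https://trends.google.com/trends/?geo=US

The business needs two thousand keywords a day. The implementation scrapes the keyword data with Python and stores it in a Redis set, so that duplicate keywords are never stored.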
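To show why a set is a good fit here, this minimal sketch mimics what Redis SADD does using a plain Python set, so it runs without a Redis server. The function name add_keywords and the sample keywords are made up for illustration; in the real script this job is done by r.sadd.

```python
def add_keywords(store, keywords):
    """Add lower-cased keywords to `store` and return how many were new.

    `store` is a plain Python set standing in for a Redis set: like SADD,
    adding a member that is already present changes nothing.
    """
    added = 0
    for keyword in keywords:
        normalized = keyword.lower()
        if normalized not in store:
            store.add(normalized)
            added += 1
    return added


store = set()
first = add_keywords(store, ["World Cup", "world cup", "Python"])
second = add_keywords(store, ["python", "Redis"])
print(first, second, sorted(store))
```

Lower-casing before the insert is what makes "World Cup" and "world cup" collapse into one member; the set itself only removes exact duplicates.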

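Step 1: Open the page you want to scrape: https://trends.google.com/trends/trendingsearches/daily?geo=EG

Step 2: Open Chrome's developer tools with F12, go to the Network tab, find the endpoint that requests the data, and click Headers... OK, this is getting long-winded. Enough chatter, on to the main part.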

Step 3: Now we start writing code. First, prepare a function that returns every date in a given range, in case it comes in handy later (╰_╯)

import datetime

def getEveryDay(begin_date, end_date):
    """Return every date from begin_date to end_date (inclusive) as YYYYMMDD strings."""
    date_list = []
    # Both arguments are strings in YYYY-MM-DD form.
    begin_date = datetime.datetime.strptime(begin_date, "%Y-%m-%d")
    end_date = datetime.datetime.strptime(end_date, "%Y-%m-%d")
    while begin_date <= end_date:
        date_str = begin_date.strftime("%Y%m%d")
        date_list.append(date_str)
        begin_date += datetime.timedelta(days=1)
    return date_list

The output format here looks like 20180101; change it to suit your actual needs.
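If a different output format is needed, one option is to make the format a parameter. The sketch below is a hypothetical variant (get_every_day and its fmt argument are not part of the original script) that defaults to the same YYYYMMDD output.

```python
import datetime


def get_every_day(begin_date, end_date, fmt="%Y%m%d"):
    """Return each date from begin_date to end_date (inclusive), formatted with fmt.

    Both bounds are strings in YYYY-MM-DD form.
    """
    current = datetime.datetime.strptime(begin_date, "%Y-%m-%d")
    last = datetime.datetime.strptime(end_date, "%Y-%m-%d")
    dates = []
    while current <= last:
        dates.append(current.strftime(fmt))
        current += datetime.timedelta(days=1)
    return dates


print(get_every_day("2018-01-01", "2018-01-03"))
print(get_every_day("2017-12-31", "2018-01-01", fmt="%Y-%m-%d"))
```

If the start date is later than the end date, the loop never runs and the result is an empty list.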

This is the most important part: pulling the endpoint data with GET requests.

import re
import sys

import redis
import requests

# ConnectionPool is lazy, so ping() is what actually tests the connection.
pool = redis.ConnectionPool(host='35.162.249.93', port=6311, db=7)
r = redis.Redis(connection_pool=pool)
try:
    r.ping()
    print("connected success.")
except redis.RedisError:
    print("could not connect to redis.")
    sys.exit(1)

# Start each run from an empty set.
r.delete("qt:data:hotkey")

countryarr = ['IN', 'CA', 'US', 'AU']
for country in countryarr:
    url = 'https://trends.google.com/trends/api/realtimetrends?hl=en-US&tz=-480&cat=all&fi=0&fs=0&geo=' + country + '&ri=300&rs=20&sort=0'
    # verify=False skips TLS certificate verification.
    KeyFile = requests.get(url, verify=False)
    KeyHtml = KeyFile.text
    # The response is plain text; pull the story ids out with a regex.
    KeyStrList = re.findall(r'"trendingStoryIds":\["(.*?)"\],"storySummaries":', KeyHtml)
    if not KeyStrList:
        # No story ids for this country; skip it instead of raising IndexError.
        continue
    keyStr = KeyStrList[0]
    keyList = keyStr.split('","')
    for key in keyList:
        singleReqUrl = 'https://trends.google.com/trends/api/stories/' + key + '?hl=en-US&tz=-480'
        HotFile = requests.get(singleReqUrl, verify=False)
        HotHtml = HotFile.text
        HotKeyStr = re.findall(r'"entityNames":\["(.*?)"\],"entityExploreLinks":', HotHtml)
        print(HotKeyStr)
        if HotKeyStr:
            HotKeyList = HotKeyStr[0].split('","')
            for HotKey in HotKeyList:
                # SADD ignores members already in the set, so duplicates are dropped.
                r.sadd('qt:data:hotkey', HotKey.lower())
        else:
            print("")

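The data this endpoint returns is just a text document. I don't know whether other sites' endpoints look like this too; I'll come back to that when I run into one. The main work here is using a regular expression to pull out the data I need. findall returns a list, so if the list has anything in it, the items get inserted into Redis. A buddy of mine says there is also a lightweight database meant specifically for holding scraped data, which can then be transferred in batches to some other persistent database.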
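The extraction step can be tried offline on a small string. The sample below is an invented fragment shaped like the text the regexes expect, not a real Google Trends response, and extract_list is a hypothetical helper name. Treat it as a sketch of the findall-then-split idea rather than a finished parser.

```python
import re


def extract_list(text, start_key, end_key):
    """Return the quoted strings in the array that sits between two keys.

    Mirrors the approach in the script: grab everything between
    "start_key":[" and "],"end_key": and split it on the "," separators.
    Returns an empty list when the pattern is not found.
    """
    pattern = r'"%s":\["(.*?)"\],"%s":' % (re.escape(start_key), re.escape(end_key))
    matches = re.findall(pattern, text)
    if not matches:
        return []
    return matches[0].split('","')


# Invented sample text, for illustration only.
sample = '{"entityNames":["Lionel Messi","FC Barcelona"],"entityExploreLinks":[]}'

names = extract_list(sample, "entityNames", "entityExploreLinks")
print([name.lower() for name in names])
print(extract_list("no match here", "entityNames", "entityExploreLinks"))
```

Matching on the neighbouring key like this breaks if the order of the keys ever changes. If the payload turns out to be JSON, parsing it with the json module would probably be sturdier, though that is not what the script here does.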

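Step 4: Deploy to the server

Today is December 21, 2018. Yesterday the project went onto the server to run on a schedule. Here are the problems I ran into:

1. Python did not have the Redis client installed. Install pip, then use it to install the client: pip install redis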

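2. I hit requests.exceptions.SSLError: hostname 'trends.google.com' doesn't match 'www.google.com'. This is a certificate mismatch, possibly because I had set up a certbot certificate on my server. The workaround visible in the script above is verify=False on each requests.get call, which skips certificate verification.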

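3. SyntaxError: Non-ASCII character '\xe6' in file hot_key_rds_google_trends_v2. Python 2.7.5 assumes the source file is ASCII, so it refuses to run a script containing Chinese characters unless an encoding is declared.

The usual fix is to add #coding=utf-8 at the top of the code, but I tried it and it did not work.

My own fix was simply to delete the Chinese comments.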
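One possible reason the declaration appeared not to work: under PEP 263, Python only honours an encoding comment on the first or second line of the file. The sketch below is a simplified check for that. find_source_encoding is a hypothetical helper, and the regular expression is adapted from the one given in PEP 263. Whether this was the actual cause here is only a guess.

```python
import re

# Adapted from PEP 263: a comment containing "coding:" or "coding=".
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def find_source_encoding(source):
    """Return the encoding declared on line 1 or 2 of `source`, or None.

    Simplified: it only looks at the first two lines, which is where
    Python itself looks for the declaration.
    """
    for line in source.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None


good = "#!/usr/bin/env python\n#coding=utf-8\nprint('hi')\n"
too_late = "import redis\n\n#coding=utf-8\nprint('hi')\n"

print(find_source_encoding(good))
print(find_source_encoding(too_late))
```

In the second sample the comment sits on line 3, so it is treated as an ordinary comment and the file is still read as ASCII under Python 2.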

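4. Next comes the scheduled task. I went with crontab, which ships with Linux, and set the script to run at 17:00 every day:

0 17 * * * python /data/quicktouch/reptile/hot_key_rds_google_trends_v2.py

You may run into insufficient permissions at this point. What I did was change the permissions on the py file I had uploaded, as root: chmod 777 -R /data/quicktouch/reptile/. At least I didn't hit any problems after that.

To edit the crontab, run crontab -e. When you are done editing, press Ctrl+X and choose yes to save, then restart cron once you are back at the shell: sudo service cron restart

Once all of that is done, everything is set. Pretty slick.

This little demo, which hardly counts as a project, ends up producing about 2,500 search keywords a day.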
