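1. Guidelines for web crawlers

For a detailed explanation, see: https://blog.csdn.net/lafengxiaoyu/article/details/77842362
In short, the data on a server belongs to someone, and scraping it for profit carries legal risk.
Most websites have anti-crawler mechanisms, or a robots.txt file in the site's root directory that states which pages may be crawled and which may not. Please follow these rules; unrestricted crawling puts a load on the server.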
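
As a minimal sketch of how a crawler can honor robots.txt, Python 3's standard urllib.robotparser module can answer whether a URL may be fetched. The rules below are a made-up sample parsed from a string so that the example runs offline; example.com, the /private/ path, and the helper names build_parser and is_allowed are assumptions for illustration. A real crawler would load the target site's own robots.txt with set_url() and read() instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = is_allowed(parser, "http://example.com/hot/page/1")
blocked = is_allowed(parser, "http://example.com/private/data")
print(allowed, blocked)
```

Checking each URL this way before requesting it keeps the crawler inside the limits the site has published.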

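2. Preparation before crawling

1) Python environment setup

    Omitted here; a quick web search covers it. This article uses Python 2.7. Use Python 2.7 or later, since some libraries do not work on older versions.
    Python 2.7 or Python 3.5 is recommended for stability. Note that the code in this article uses urllib2 and the print statement, so it runs as written only on Python 2.7.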

2) Install requests (optional) and pymysql; urllib2 is part of the Python 2 standard library and needs no installation

pip install requests
pip install pymysql

3. How it works

1) Simulate an HTTP request to fetch the page

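Request code:

# -*- coding:utf-8 -*-
import urllib2

# Send a browser-like User-Agent so the request looks like it comes from a browser.
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
url = 'http://www.qiushibaike.com/hot/page/1'
# Build the request with the custom headers, then fetch and print the page.
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
print response.read()

This returns the full HTML source of the page.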
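
On Python 3 the same request can be written with urllib.request, which replaces urllib2. The following is a minimal sketch that reuses the URL and User-Agent from the code above; build_request and fetch_page are hypothetical helper names, and the download is wrapped in a function so that nothing is fetched until it is called.

```python
import urllib.request

USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
BASE_URL = 'http://www.qiushibaike.com/hot/page/'


def build_request(page):
    """Build a GET request for one page with a browser-like User-Agent."""
    return urllib.request.Request(BASE_URL + str(page),
                                  headers={'User-Agent': USER_AGENT})


def fetch_page(page, timeout=10):
    """Download one page and return its HTML as text (performs network I/O)."""
    with urllib.request.urlopen(build_request(page), timeout=timeout) as response:
        return response.read().decode('utf-8')


# Only the request object is built here; no network access happens.
request = build_request(1)
print(request.full_url)
print(request.get_header('User-agent'))
```

Calling fetch_page(1) performs the actual download. Whether the site still serves this URL is not something the sketch checks.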

2) Analyze the page

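Inspecting the page source shows that each joke's content sits inside a specific element of the page markup.
Use a regular expression to match that element and extract each joke's author, like count, and content.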
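
To illustrate the matching step, here is a minimal sketch that runs a regular expression with four capture groups over a small HTML snippet. The element and class names in SAMPLE_HTML are invented for illustration and are not the site's real markup; PATTERN and extract_items are likewise hypothetical names. The re.S flag lets "." match newlines, so one pattern can span several lines of HTML.

```python
import re

# Invented markup that only mimics the shape of a joke list.
SAMPLE_HTML = """
<div class="article">
  <div class="author"><h2>alice</h2></div>
  <div class="content"><span>First joke</span></div>
  <div class="thumb"></div>
  <span class="stats"><i class="number">120</i></span>
</div>
<div class="article">
  <div class="author"><h2>bob</h2></div>
  <div class="content"><span>Second joke</span></div>
  <div class="thumb"><img src="x.jpg"></div>
  <span class="stats"><i class="number">45</i></span>
</div>
"""

# Groups: author, content, trailing markup (may hold an image), like count.
PATTERN = re.compile(
    r'<h2>(.*?)</h2>.*?'
    r'<div class="content"><span>(.*?)</span></div>'
    r'(.*?)<i class="number">(.*?)</i>',
    re.S,
)


def extract_items(html):
    """Return a list of (author, content, extra, likenum) tuples."""
    return PATTERN.findall(html)


items = extract_items(SAMPLE_HTML)
# Keep only the jokes whose trailing markup contains no image tag.
text_only = [(author, content, likenum)
             for author, content, extra, likenum in items
             if "img" not in extra]
print(text_only)
```

The third group exists only so that entries with an image can be filtered out, which mirrors the img check in the full code.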

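3) Insert into the database

Collect the matched data in a list and insert it into the MySQL database. This article uses pymysql for the database operations.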
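
The insert step can be sketched with a parameterized statement, which lets the driver escape the values instead of building the SQL by string concatenation. This is a sketch under assumptions: connection is expected to be a pymysql connection opened elsewhere, the table tb_qsbk with columns author, likenum and content is taken from the full code, and to_row and save_items are hypothetical helper names.

```python
INSERT_SQL = "INSERT INTO tb_qsbk (author, likenum, content) VALUES (%s, %s, %s)"


def to_row(item):
    """Reorder a matched (author, content, extra, likenum) tuple into column order."""
    author, content, _extra, likenum = item
    return (author.strip(), likenum, content.strip())


def save_items(connection, items):
    """Insert all items with one parameterized statement, then commit."""
    rows = [to_row(item) for item in items]
    with connection.cursor() as cursor:
        # The driver substitutes and escapes each %s placeholder.
        cursor.executemany(INSERT_SQL, rows)
    connection.commit()
    return len(rows)


# Only the row conversion is exercised here; save_items needs a live database.
print(to_row((" alice ", " First joke ", "", "120")))
```

Because the values travel separately from the statement, a quote character inside a joke cannot break the SQL.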

4. Full code

# -*- coding:utf-8 -*-
import urllib2
import re
import pymysql

class fullCode:
    def __init__(self):
        # Browser-like User-Agent sent with every request.
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        self.headers = {'User-Agent': self.user_agent}
        # utf8mb4 lets the connection carry 4-byte characters such as emoji.
        self.db = pymysql.connect(host='localhost',
                                  user='root',
                                  password='123456',
                                  database='test',
                                  port=3306,
                                  charset='utf8mb4',
                                  cursorclass=pymysql.cursors.DictCursor)

    def getCode(self,pageIndex):
        try:
            cursor = self.db.cursor()
            # Crawl pages 1 .. pageIndex-1 of the "hot" list.
            for i in range(1, pageIndex):
                url = 'http://www.qiushibaike.com/hot/page/' + str(i)
                request = urllib2.Request(url, headers=self.headers)
                response = urllib2.urlopen(request)
                # print response.read()
                content = response.read().decode('utf-8')
                pattern = re.compile(
                    '.*?(.*?)
.*?(.*?)(.*?).*?"number">(.*?)',
                    re.S)
                items = re.findall(pattern, content)
                print 'Page ' + str(i) + ' ' + '=' * 70

                # Each item is (author, content, trailing markup, like count).
                for item in items:
                    # Skip jokes whose trailing markup contains an image.
                    haveImg = re.search("img", item[2])
                    if not haveImg:
                        print item[0], item[1], item[3]

                        # Parameterized query: pymysql escapes the values, so quotes
                        # in the joke text cannot break the statement.
                        sql = "insert into tb_qsbk(author,likenum,content) values(%s,%s,%s)"
                        cursor.execute(sql, (item[0], item[3], item[1]))
                        self.db.commit()

        except urllib2.URLError as e:
            if hasattr(e, 'code'):
                print e.code
            if hasattr(e, 'reason'):
                print e.reason
        finally:
            # Close the connection whether or not the crawl succeeded.
            self.db.close()

    def start(self):
        self.getCode(100)

code = fullCode()
code.start()

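re is Python's built-in regular expression library.
This version skips jokes that contain images instead of storing them; I will revise it when I have time.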
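
The except branch in the full code checks the error for a code and a reason attribute. As a minimal Python 3 sketch of the same check, urllib.error provides HTTPError (which carries a status code) and its parent URLError (which carries only a reason). The errors below are constructed by hand so that the example runs without network access; describe_error is a hypothetical helper name and example.com is a placeholder.

```python
import io
import urllib.error


def describe_error(error):
    """Summarize a urllib error as (status code or None, reason text)."""
    code = getattr(error, 'code', None)      # only HTTPError carries a status code
    reason = getattr(error, 'reason', None)  # both error types carry a reason
    return code, str(reason)


# Hand-built errors, standing in for what urlopen() would raise.
http_error = urllib.error.HTTPError('http://example.com/', 404, 'Not Found',
                                    {}, io.BytesIO())
url_error = urllib.error.URLError('connection refused')

print(describe_error(http_error))
print(describe_error(url_error))
```

In a crawler, the same function would be called inside an except urllib.error.URLError block, which also catches HTTPError.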

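5. Problems encountered

1) pip reports an error when installing a library. The pip version may be too old (18.1 was current at the time of writing); upgrade pip:

python -m pip install --upgrade pip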

2) InternalError [encoding error]

pymysql.err.InternalError: (1366, u"Incorrect string value: '\\xF0\\x9F\\x90\\xB6\\xEF\\xBC...' for column 'content' at row 1")

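Cause:
A UTF-8 character takes between one and four bytes. Emoji take four bytes, while MySQL's utf8 character set stores at most three bytes per character, so the row cannot be inserted.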
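
The byte counts can be checked directly in Python 3. The sketch below encodes the emoji whose bytes F0 9F 90 B6 appear in the error message above and compares it with an ASCII letter and a Chinese character; utf8_length and needs_utf8mb4 are hypothetical helper names.

```python
def utf8_length(char):
    """Number of bytes needed to encode one character in UTF-8."""
    return len(char.encode('utf-8'))


def needs_utf8mb4(text):
    """True if text contains a character that takes 4 bytes in UTF-8."""
    return any(utf8_length(ch) == 4 for ch in text)


dog = '\U0001F436'  # dog-face emoji, encoded as F0 9F 90 B6
print(dog.encode('utf-8'))
print(utf8_length('a'), utf8_length('\u4e2d'), utf8_length(dog))
print(needs_utf8mb4('plain text'), needs_utf8mb4('woof ' + dog))
```

A check like needs_utf8mb4 could also be used to spot text that a 3-byte utf8 column would reject.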

Fix:
1. Convert the table to utf8mb4 so that it can store 4-byte Unicode characters.
Command: alter table TABLE_NAME convert to character set utf8mb4 collate utf8mb4_bin; (replace TABLE_NAME with your table name)
2. Set the same character set on the database connection:
conn=pymysql.connect(
    host='127.0.0.1',
    port=3306,
    user='root',
    passwd='123456',
    db='test',
    charset='utf8mb4',
)

-- End

Scraping Qiushibaike jokes with Python and saving them to a MySQL database