Web Scraping in Practice: Scraping Stories from 文章吧 (wenzhangba.com)

In the spirit of spending a whole day working over a single site.

When a friend who is building a website needs a pile of material, and you are the one learning web scraping, of course you scrape it for him without a second thought.

This post documents a small crawler for scraping touching stories, along with the pitfalls I hit and how I worked around them.

I expected a well-structured site where a plain XPath query would do the job, but things rarely go as planned:

resp = requests.get(url,headers)
text = etree.HTML(resp)
urls = text.xpath("//div[@class='subBox']//div[@class='cmt_pic']//a/@href")
for url in urls:
    print(url)

Parsing it with XPath like this throws the following error:

 File "src\lxml\etree.pyx", line 3170, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
ValueError: can only parse strings
So it can only parse string input. Fine, let's wrap it:

text = etree.HTML(str(resp))
urls = text.xpath("//div[@class='subBox']//div[@class='cmt_pic']//a/@href")
for url in urls:
    print(url)

[Figure 1]

Nothing at all.....

I poked at it for ages, even pulled out XPath Helper, and still got nothing. I couldn't find the problem, so if any expert knows the answer, please leave a comment.
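For the record, my best guess at the culprit, which I haven't re-tested against the site: str(resp) is just the repr string "<Response [200]>", not the page source, so lxml ends up parsing a nearly empty document and every XPath query comes back empty. Passing resp.text instead, and passing the headers as a keyword argument (a positional headers dict gets treated as query parameters by requests.get), should at least give XPath real markup to match. Whether this particular XPath fits the site's layout I can't confirm; the sketch below just shows the corrected calls:

import requests
from lxml import etree

# retrying the xpath route with the actual HTML text (a sketch, not verified against the live site)
url = 'http://www.wenzhangba.com/ganrengushi/list_20_1.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

resp = requests.get(url, headers=headers)   # headers as a keyword argument, not positional
text = etree.HTML(resp.text)                # resp.text is the HTML source; str(resp) is only "<Response [200]>"
urls = text.xpath("//div[@class='subBox']//div[@class='cmt_pic']//a/@href")
for u in urls:
    print(u)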

My classmate told me to scrape a different site instead. Are you kidding? Me, give up? Not a chance. If XPath won't cooperate, the all-purpose regular expression will.

Here is the code:

import pymysql
import requests
import re
import random
from lxml import etree

# connect to the database
db = pymysql.connect(host="localhost", user="root", password="123",
                     database="atricle", port=3306)
# get a cursor
cursor = db.cursor()

headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
    'Host' : 'www.wenzhangba.com',
    'Cookies' : '__music_index__=2; count_m=1; count_h=14; first_m=1555835146650; __cfduid=dea29049d424e6bc92238400a77ea4b3c1555831911; client=61311; client__ckMd5=c363d5fbc3ff40a6; first_h=1555831911990; count_h=1; first_m=1555831911993; count_m=1; __music_index__=1; __51cke__=; UM_distinctid=16a3ecef7231f4-04ba372bea2d7d-39395704-100200-16a3ecef724b3; Hm_lvt_c46d1db4b4fbc2bdfdfe9831ab31bf3e=1555831912; CNZZDATA5923499=cnzz_eid%3D2090141217-1555828637-null%26ntime%3D1555834044; __tins__14944291=%7B%22sid%22%3A%201555834266852%2C%20%22vd%22%3A%204%2C%20%22expires%22%3A%201555836946666%7D; __51laig__=21; Hm_lpvt_c46d1db4b4fbc2bdfdfe9831ab31bf3e=1555835147'
}

def parse_page(url):
    resp = requests.get(url, headers=headers)
    # the failed xpath attempt
    # text = etree.HTML(str(resp))
    # urls = text.xpath("//div[@class='subBox']//div[@class='cmt_pic']//a/@href")
    # for url in urls:
    #     print(url)
    # the regex that worked
    resp = resp.text
    # NOTE: the HTML literals inside the regex patterns below were stripped when this
    # post was published, so only the skeletons of the patterns survive here
    urls = re.findall(r'.*?', resp, re.S)
    for p_url in urls:
        parse_detaile_page(p_url)

def parse_detaile_page(p_url):
    a = {}
    resp = requests.get(p_url, headers=headers)
    resp = resp.text
    titles = re.findall(r'.*?(.*?)', resp, re.S)
    for title in titles:
        a['title'] = title
        # print(title)
    datas = re.findall(r'(.*?)', resp, re.S)
    for data in datas:
        a['data'] = data
        # print(data)
    authors = re.findall(r'(.*?)|(.*?)', resp, re.S)
    for author in authors:
        if author[0] == "":
            a['author'] = author[1]
        else:
            a['author'] = author[0]
        # print(author)
    images = re.findall(r'.*?', resp, re.S)
    for image in images:
        # print(image)
        a['image'] = image
    contents = re.findall(r'(.*?)', resp, re.S)
    for content in contents:
        # print("*" * 50)
        content = re.sub(r'<.*?>|“|”|ldquo;|rdquo;|mdash;|—', "", content).strip()
        # print(content)
        a['content'] = content
    save_to_sql(a)

def save_to_sql(a):
    # story categories, picked at random for each row
    s = ['感动心灵', '感人故事', '精美图文']
    sql = """
        insert into atricle(id, title, author, content, data, image, hits, comment, classfy)
        value (null, %s, %s, %s, %s, %s, %s, %s, %s)
    """
    title = a['title']
    author = a['author']
    content = a['content']
    data = a['data']
    image = a['image']
    hits = random.randint(0, 2000)
    comment = random.randint(0, 2000)
    classfy = s[random.randint(0, 2)]
    # print(title, author, content, data, image, hits, comment, classfy)
    try:
        cursor.execute(sql, (title, author, content, data, image, hits, comment, classfy))
        # commit is required for inserts and deletes to take effect
        db.commit()
        print("插入成功!")
    except Exception as e:
        print("Error", e.args)
        db.rollback()

def main():
    for x in range(1, 80):
        url = 'http://www.wenzhangba.com/ganrengushi/list_20_{}.html'.format(x)
        parse_page(url)

if __name__ == '__main__':
    main()

Only one part needs special attention: the author field. The site handles it in a rather peculiar way; if you are curious, compare the code against the page markup, I won't go into it here.

authors = re.findall(r'(.*?)|(.*?)', resp, re.S)
for author in authors:
    if author[0] == "":
        a['author'] = author[1]
    else:
        a['author'] = author[0]
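Why the author[0] == "" check is needed: when a regex combines two alternatives that each carry their own capture group, re.findall returns one tuple per match, and the group on the side that did not match comes back as an empty string, so you take whichever half is non-empty. A tiny self-contained demo with made-up markup (not the site's real HTML):

import re

# toy page where the author sits in one of two different wrappers
html = '<span class="a">张三</span> ... <em class="b">李四</em>'
authors = re.findall(r'<span class="a">(.*?)</span>|<em class="b">(.*?)</em>', html, re.S)
print(authors)                      # [('张三', ''), ('', '李四')]
for author in authors:
    name = author[1] if author[0] == "" else author[0]
    print(name)                     # 张三, then 李四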

There isn't much else to say about the code. The annoying part is simply that this site wouldn't yield to XPath. A few of the fields are filled with random numbers, which was my classmate's request, and everything is stored in a MySQL database.

One more note: if your first run writes data into the database, a later run then errors out while storing, and after fixing the bug you clear the table and keep storing:

You may get an error saying the column that stores the article body is not long enough. Re-configuring the table does not help; the length stubbornly stays at 0, which is apparently normal, and you have to drop the table and recreate it. I have no idea why; everyone seems to hit the same error, but the suggested fixes are all over the place...

[Figure 2]
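For what it's worth, the textbook fix for a "data too long for column" complaint is to widen the offending column rather than rebuild the whole table. A minimal sketch, assuming the atricle table and content column from the INSERT above (in my case only dropping and recreating the table actually worked):

import pymysql

# one-off schema tweak: make the content column TEXT (up to 64 KB), or LONGTEXT for longer stories
db = pymysql.connect(host="localhost", user="root", password="123",
                     database="atricle", port=3306)
cursor = db.cursor()
cursor.execute("ALTER TABLE atricle MODIFY content TEXT")
db.commit()
db.close()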

One more error worth recording:

Operand should contain 1 column(s).

Again, the first run stores fine; fix a bug, rerun, and the error shows up. The proposed solutions online vary wildly and none of them solved my case, so once again: drop the table and recreate it.

Let's take a look at the database:

[Figure 3]
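If you would rather verify from code than from a client screenshot, a quick query with the same connection settings does the job:

import pymysql

# spot-check a few of the stored rows
db = pymysql.connect(host="localhost", user="root", password="123",
                     database="atricle", port=3306)
cursor = db.cursor()
cursor.execute("SELECT id, title, author, classfy FROM atricle LIMIT 5")
for row in cursor.fetchall():
    print(row)
db.close()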
