P公子实战2：爬取言情小说吧top358

image

很多恋情以失败而告终，往往不是谁变了心，谁又批了腿，而是一方或双方对这段感情经营不善的结果。

而如何经营这段感情，关键则在于热恋中的双方对彼此的了解程度，看看下面的这些场景，你能做到几件事？

约会是两个人的事，决定当然也要两个人一起做。

比如说女友决定吃什么，而你决定用现金买单还是用手机买单。

女友决定要看哪场电影，而你决定是一颗一颗地喂给她爆米花还是全程抱着她让她自己拿。

image

面对女友的提问，我们一定要如实回答。

“你觉得对面那个女生好看吗？”

这时候你应该说：她背的包挺好看的，喜欢吗？咱们买一个。

女友犯错了，我们一定要狠狠的批评她。

“我不小心把口红涂到你那件新买的衬衫上了。”

这时候你应该说：你怎么不早说？我看到有一款新出的口红特别适合你。

怎么样，女生的心思你真的能懂吗？不懂也没关系，今天我们通过爬取言情小说网top358的书籍，看看女生到底是一种什么样的生物。

image

使用工具：pycharm

使用模块：requests、lxml、pandas

requests 请求网页

lxml 解析网页

pandas 存储数据

一、请求url

引入所需模块：

from requests.exceptions import RequestException
from lxml import etree
import requests
import pandas as pd
import time

先看一下原始网页及网页url：

image

如果你看了很多期我写的推送，一定能一眼就看出来 pageNum 这个参数负责翻页。

先定义一个请求网页的函数get_one_page():

def get_one_page(url,headers):
    try:
        response = requests.get(url=url,headers=headers)
        if response.status_code == 200:
            return response.text
    except RequestException:
        return None

这几行代码我们做了一些报错的处理方法，防患于未然。

二、解析网页并保存数据

我们通过 lxml 解析网页，利用 xpath 提取出自己想要的信息，最后通过 pandas 将提取的数据保存起来。

def parse_and_save(html):
    parse = etree.HTML(html)
    book = parse.xpath('//*[@id="rank-view-list"]/div[1]/ul/li')
    for i in book:
        id = i.xpath('./div[1]/span/text()')[0]
        title = i.xpath('./div[2]/h4/a/text()')[0]
        author = i.xpath('./div[2]/p[1]/a[1]/text()')[0]
        type = i.xpath('./div[2]/p[1]/a[2]/text()')[0]
        text = i.xpath('./div[2]/p[2]/text()')[0].strip()
        one.append(id)
        two.append(title)
        three.append(author)
        four.append(type)
        five.append(text)

三、定义一个主函数

def main():
    global one,two,three,four,five
    one = []
    two = []
    three = []
    four = []
    five = []
    for i in range(1,37):
        url = 'https://www.xs8.cn/rank/newbook?pageNum={}'.format(i)
        headers = {'User-Agent':'Mozilla/5.0'}
        html = get_one_page(url,headers)
        parse_and_save(html)
        time.sleep(1)
    data = {
        'id':one,
        'title':two,
        'author':three,
        'type':four,
        'text':five
    }
    novel = pd.DataFrame(data)
    novel.to_excel('d:/novel.xlsx',index = False)

最后就是调用主函数啦。

if __name__ == '__main__':
    main()

看一看我们努力后的成果吧

image

女生果然是天然多愁善感的生物，现代言情小说数量遥遥领先，为广大妇女同志们提供了丰富的情感战线。

男同胞们也要努力了，多看一些言情小说，培养一些反逻辑思维，将爱情保卫战进行到底。

完整代码如下，可复制粘贴。

# -*-coding:utf-8-*-
#言情小说网 top160

from requests.exceptions import RequestException
from lxml import etree
import requests
import pandas as pd
import time

def get_one_page(url,headers):
    try:
        response = requests.get(url=url,headers=headers)
        if response.status_code == 200:
            return response.text
    except RequestException:
        return None

def parse_and_save(html):
    parse = etree.HTML(html)
    book = parse.xpath('//*[@id="rank-view-list"]/div[1]/ul/li')
    for i in book:
        id = i.xpath('./div[1]/span/text()')[0]
        title = i.xpath('./div[2]/h4/a/text()')[0]
        author = i.xpath('./div[2]/p[1]/a[1]/text()')[0]
        type = i.xpath('./div[2]/p[1]/a[2]/text()')[0]
        text = i.xpath('./div[2]/p[2]/text()')[0].strip()
        one.append(id)
        two.append(title)
        three.append(author)
        four.append(type)
        five.append(text)

def main():
    global one,two,three,four,five
    one = []
    two = []
    three = []
    four = []
    five = []
    for i in range(1,37):
        url = 'https://www.xs8.cn/rank/newbook?pageNum={}'.format(i)
        headers = {'User-Agent':'Mozilla/5.0'}
        html = get_one_page(url,headers)
        parse_and_save(html)
        time.sleep(1)
    data = {
        'id':one,
        'title':two,
        'author':three,
        'type':four,
        'text':five
    }
    novel = pd.DataFrame(data)
    novel.to_excel('d:/novel.xlsx',index = False)

if __name__ == '__main__':
    main()

后台回复：我们结婚吧。可获得爬取的原始数据，看看这些原始数据，仔细揣摩女友内心的小秘密吧。

你可能还想看

现在90后，你们究竟在想些什么？
我是生活，我有10000种摧毁你的方法
感谢这个世界，一直有你们
何时你变得爱哭了？

等你很久啦，长按加入古同社区

image

P公子实战2：爬取言情小说吧top358

你可能感兴趣的:(P公子实战2：爬取言情小说吧top358)