Python scraper - book118

Py3: download book118 preview images and merge them into a docx file

Starting from the real preview URL of a 原创力文档 (max.book118.com) document, fetch the address of each preview image, download the images locally, and merge them into a docx file.

Based on another user's code (https://www.jianshu.com/p/8012edb46153) with slight modifications; I added the docx-merging feature.

Unresolved issues:

  1. The while-loop part is error-prone; when it fails, url_dict ends up missing some preview image URLs, and the resulting docx file lacks the corresponding pages.
  2. book118 caps the preview frequency; once over the limit, you must pass a captcha to keep previewing. The restriction seems to be lifted the next day, so this is not much of a problem, since you rarely need to download many files anyway. The book118 captcha URL is https://max.book118.com/index.php?m=Public&a=verify; if you are interested, try recognizing the captcha (a minimal fetch sketch follows the script below).
  3. If the script errors out mid-run, you basically have to start over. If you really need a file, retry a few times, or look for a download tool on 52pojie.
import requests
import json
import re
import time
import os
import sys
from bs4 import BeautifulSoup
from docx import Document
from docx.shared import Cm
# Get the real preview URL (turl)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
url = input('please input url:')
if len(url) < 10:  # fall back to a sample document when no URL is given
    url = 'https://max.book118.com/html/2019/1028/8143004105002060.shtm'
aid = re.search(r'\d+\.s', url).group()[:-2]  # document id from the .shtm filename
rep = requests.get(url, headers=headers)
soup = BeautifulSoup(rep.text, 'lxml')
title = soup.title.contents[0]
title = title[:title.find('.')]
turl = 'https://max.book118.com/index.php?g=Home&m=NewView&a=index&aid={}'.format(aid)
print('turl:',turl)
# Fetch the preview metadata and store it in book
rep = requests.get(turl, headers=headers)
if '验证' in rep.text:  # the page is asking for captcha verification
    print('need verify')
    print(rep.text)
    sys.exit(1)  # book is never built in this case, so stop here
else:
    bs = BeautifulSoup(rep.text, 'lxml')
    for sc in bs.find_all('script'):
        js = sc.get_text()
        if 'PREVIEW_PAGE' in js:
            p1 = re.compile(".+?'(.+?)'")  # capture every '...'-quoted value on a line
            js_line = js.splitlines(True)  # split the script into lines
            book = {
                'pageAll':p1.findall(js_line[1])[0],
                'pagePre':p1.findall(js_line[1])[1],
                'aid':p1.findall(js_line[6])[0],
                'viewToken':p1.findall(js_line[6])[1],
                'title':title
                }
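            # Parsing assumption (format observed when this was written, may change):
            # the script's 2nd line carries two quoted values (total pages and
            # previewable pages) and its 7th line carries the quoted aid and view token.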
print('book:',book)
# Use the info in book to fetch the preview image URLs into url_dict
page = {
    'max':int(book['pageAll']),
    'pre':int(book['pagePre']),
    'num':1,
    'repeat':0
    }
# Loop to fetch as many of the preview image URLs as possible
url_dict = {}  # initialize once; resetting it inside the loop discarded earlier batches
while page['num'] <= page['pre']:  # <= so the batch starting at the preview limit is still fetched
    url = 'https://openapi.book118.com/getPreview.html'
    payload = {
        'project_id': 1,
        'aid': book['aid'],  
        'view_token': book['viewToken'], 
        'page': page['num']
    }
    rep = requests.get(url, params=payload, headers=headers)
    rep_dict = json.loads(rep.text[12:-2])  # strip the JSONP wrapper, e.g. jsonpReturn({...});
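    # A less brittle alternative (sketch, assuming a name(...) or name(...); wrapper):
    # start, end = rep.text.index('(') + 1, rep.text.rindex(')')
    # rep_dict = json.loads(rep.text[start:end])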
    if rep_dict['data'].get(str(page['num'])):  # .get avoids a KeyError when the batch is missing
        url_dict.update(rep_dict['data'])
        page['num'] = page['num'] + 6  # the API returns pages in batches of 6
        page['repeat'] = 0
    else:
        if page['repeat'] > 3:
            sys.stdout.write('\r{0}'.format(str(page['num']) + " : repeated too often; got nothing, sleeping 5 seconds."))
            sys.stdout.flush()
            time.sleep(5)
        else:
            sys.stdout.write('\r{0}'.format(str(page['num']) + " : got nothing, sleeping 2 seconds."))
            sys.stdout.flush()
            time.sleep(2)
        page['repeat'] = page['repeat'] + 1
print('url_dict:',url_dict)
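# Sketch: list any previewable pages that never made it into url_dict (issue 1 above);
# assumes url_dict keys are page numbers as strings and only the first pagePre
# pages are previewable
missing = [n for n in range(1, page['pre'] + 1) if str(n) not in url_dict]
if missing:
    print('missing pages:', missing)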
# Output folder; adjust the base directory to your own machine
path = 'C:\\Users\\QQ\\Desktop\\ls\\py\\{}'.format(title)
if not os.path.exists(path):
    os.makedirs(path)
# Download the preview images into path and merge them into a docx file
myDocx = Document()
for section in myDocx.sections:
    section.page_width = Cm(21)
    section.page_height = Cm(29.7)
    section.left_margin = section.right_margin = section.top_margin = section.bottom_margin = Cm(0)
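# The sections above are set to A4 (21 x 29.7 cm) with zero margins, so each
# image added at width Cm(21) fills a full page.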
for item in sorted(url_dict, key=int):  # sort numerically so pages land in order
    try:
        num = 'Page{:0>3}'.format(item)
        url_item = url_dict[item]
        url_item = url_item[url_item.index('view'):]
        url = 'http://' + url_item  # adjust depending on how complete the URLs in url_dict are
        print('url:',url,';')
        rep = requests.get(url, headers=headers)
        img_filename = path + '\\{}.png'.format(num)
        with open(img_filename, 'wb') as img:
            img.write(rep.content)
        print('Saved locally img_filename:',img_filename)
        myDocx.add_picture(img_filename, width=Cm(21))
    except Exception:
        print('{} download failed'.format(item))
myDocx.save('{}.docx'.format(title))
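
For issue 2 above, a minimal sketch for grabbing the captcha image to experiment with recognition. This assumes the verify endpoint returns the image bytes directly; the actual response format is unconfirmed:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
verify_url = 'https://max.book118.com/index.php?m=Public&a=verify'
rep = requests.get(verify_url, headers=headers)
with open('captcha.png', 'wb') as f:  # inspect the saved file before building a recognizer
    f.write(rep.content)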
 

Besides using a scraper to grab the file content you need, here is a small trick that needs no programming at all. Take this document as an example: https://max.book118.com/html/2019/1029/5330243343002143.shtm

Press F12 to open the browser's developer tools, then follow the steps shown in the screenshots.

[Screenshot 1]

[Screenshot 2]

The remaining Word-related steps should need no explanation. This approach is more tedious, but it requires no programming background.
