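Learning Pyppeteer from Scratch: Study Notes

I previously used Selenium for web automation, but it depends on too much of the runtime environment, and Firefox is painfully slow to start every time. So I went looking for an alternative and found Pyppeteer. These are my notes from learning it.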

Setup

I'm using Ubuntu 18.04 LTS under WSL on Windows 10, with Python 3.6.8 installed via apt.

Installing Pyppeteer

$ pip3 install pyppeteer

Installing Chromium

Run the following command to install the Chromium build separately:

$ pyppeteer-install

Testing

Grab a simple test script from the web and run it:

import asyncio
from pyppeteer import launch
from pyquery import PyQuery as pq

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://quotes.toscrape.com/js/')
    doc = pq(await page.content())
    print('Quotes:', doc('.quote').length)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

This code also uses pyquery, a Python library for jQuery-style DOM manipulation. Just install it manually with pip.

Running the script fails with:

pyppeteer.errors.BrowserError: Browser closed unexpectedly:
/home/lpwm/.local/share/pyppeteer/local-chromium/575458/chrome-linux/chrome: error while loading shared libraries: libX11-xcb.so.1: cannot open shared object file: No such file or directory

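Although Pyppeteer doesn't need to launch Chromium with a GUI, the browser still needs the X11 graphics libraries, and WSL doesn't ship them by default. Install the missing Linux packages: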

$ sudo apt-get update
$ sudo apt install -y gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget
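Before installing a long package list, it can help to see which shared libraries are actually missing. The following is a minimal sketch, assuming the `ldd` tool is available; the helper names `missing_libs` and `check_binary` are made up for illustration and are not part of Pyppeteer.

```python
import subprocess

def missing_libs(ldd_output):
    """Return the names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # A missing dependency looks like: "libX11-xcb.so.1 => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing

def check_binary(path):
    """Run ldd on a binary and return its unresolved libraries (hypothetical helper)."""
    result = subprocess.run(["ldd", path], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    return missing_libs(result.stdout)

# Illustrative ldd output, not captured from a real run.
sample = ("\tlibX11-xcb.so.1 => not found\n"
          "\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f)\n")
print(missing_libs(sample))
```

To use it on the real browser, pass `check_binary` the path to the `chrome` binary shown in the error message; an empty list suggests all linked libraries resolve.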

Run it again, and a new error appears:

pyppeteer.errors.BrowserError: Browser closed unexpectedly:
[0926/101159.649519:FATAL:zygote_host_impl_linux.cc(116)] No usable sandbox! Update your kernel or see https://chromium.googlesource.com/chromium/src/+/master/docs/linux_suid_sandbox_development.md for more information on developing with the SUID sandbox. If you want to live dangerously and need an immediate workaround, you can try using --no-sandbox.

So the problem is Chromium's sandbox, which is enabled by default. Modify the test code above to disable it:

browser = await launch(args=['--no-sandbox', '--disable-setuid-sandbox'])
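If the same script also runs on machines where the sandbox works, one option is to add these flags only under WSL. This is a rough sketch, assuming that `/proc/version` mentions "Microsoft" under WSL (a common heuristic, not something Pyppeteer provides); the function names are hypothetical.

```python
def running_under_wsl(proc_version):
    """Heuristic: WSL kernels usually identify themselves as Microsoft builds."""
    return "microsoft" in proc_version.lower()

def build_launch_args(proc_version):
    """Return extra Chromium flags, disabling the sandbox only under WSL."""
    args = []
    if running_under_wsl(proc_version):
        args += ['--no-sandbox', '--disable-setuid-sandbox']
    return args

def read_proc_version(path="/proc/version"):
    """Read the kernel version string; return '' if the file is unavailable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""

# Inside main(), the result could be passed to launch like this:
#   browser = await launch(args=build_launch_args(read_proc_version()))
print(build_launch_args(read_proc_version()))
```

Keeping the detection in a plain function makes it easy to check against sample version strings without starting a browser.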

Run it again, and this time it works!

lpwm@DESKTOP-5RBREN9:~/myPy$ python3 t1.py
/usr/lib/python3/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.25.6) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Quotes: 10

Code Examples

Taking a screenshot of a website

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(args=['--no-sandbox', '--disable-setuid-sandbox'])
    page = await browser.newPage()
    # Set the viewport size
    await page.setViewport({'width':1500, 'height':2000})
    await page.goto('http://www.jd.com')
    await page.screenshot({'path': 'py_screenshot.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
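The script above captures only the 1500x2000 viewport. As a variation, here is a sketch that uses the `fullPage` screenshot option to capture the whole scrollable page and closes the browser even if navigation fails. The names `screenshot_options` and `capture` and the output filename are made up for illustration.

```python
import asyncio

def screenshot_options(path, full_page=True):
    """Build the options dict passed to page.screenshot()."""
    return {'path': path, 'fullPage': full_page}

async def capture(url, path):
    """Open url in headless Chromium and save a screenshot to path."""
    # Imported here so the helper above can be used without Pyppeteer installed.
    from pyppeteer import launch
    browser = await launch(args=['--no-sandbox', '--disable-setuid-sandbox'])
    try:
        page = await browser.newPage()
        await page.goto(url)
        await page.screenshot(screenshot_options(path))
    finally:
        # Always close the browser so no Chromium process is left behind.
        await browser.close()

# Example call (uncomment to run; requires Pyppeteer and Chromium):
# asyncio.get_event_loop().run_until_complete(
#     capture('http://www.jd.com', 'py_fullpage.png'))
```

The `try`/`finally` matters for longer scripts: if `goto` raises, the Chromium process would otherwise keep running in the background.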
