Python Series: Installing Crawler Software


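  • pycurl
    • Introduction
    • Installation
    • Common problems
      • Problem 1: SyntaxError: invalid syntax
      • Problem 2: libcurl link-time version (7.64.1) is older than compile-time version
        • Solution 1: uninstall and reinstall pycurl
        • Solution 2: recompile and reinstall pycurl (recommended)
      • Verifying the directory Python imports the library from
  • tesserocr
    • Introduction
    • Installation
  • RedisDump
    • Introduction
    • Installation
  • Flask
    • Introduction
    • Web library: installing Flask
  • Tornado
    • Introduction
    • Web library: installing Tornado
  • mitmproxy
    • Introduction
    • App crawling: installing mitmproxy
  • node
    • Introduction
    • Installing Node
  • appium
    • Introduction
    • App crawling: installing Appium
  • Installation command
  • Verifying the installation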

pycurl

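Introduction

Installation

# Install pycurl
pip install pycurl

# Install phantomjs
1) Download the macOS build of phantomjs from the official site (http://phantomjs.org/download.html).
2) Unzip the download and move the extracted phantomjs-2.1.1-macosx folder to any directory you like.


# Configure the environment variable: add the bin directory of phantomjs-2.1.1-macosx to PATH, then verify

phantomjs --version


# Install pyspider
pip install pyspider

# Verify the pyspider installation
pyspider all

# Check that pyspider has started
lsof -i:25555

# Kill the process (14211 is the PID reported by lsof)
kill -9 14211

Additional notes:

Installation reference: https://www.jianshu.com/p/e37603bc70c7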

Common problems

Problem 1: SyntaxError: invalid syntax

Traceback (most recent call last):
  File "/usr/local/bin/pyspider", line 5, in <module>
    from pyspider.run import main
  File "/usr/local/lib/python3.7/site-packages/pyspider/run.py", line 231
    async=True, get_object=False, no_input=False):
        ^
SyntaxError: invalid syntax

Analysis

The pyspider source uses async as a variable (parameter) name, but from Python 3.7 onward async is a reserved keyword, so the interpreter raises a SyntaxError. References on the parameter-name conflict:
https://blog.csdn.net/qq_26261381/article/details/86514138
https://www.jianshu.com/p/a0042a636229

Solution

In the files below, rename every use of async as a variable or parameter name to a name that is not a keyword:
/usr/local/lib/python3.7/site-packages/pyspider/run.py
/usr/local/lib/python3.7/site-packages/pyspider/webui/app.py
/usr/local/lib/python3.7/site-packages/pyspider/fetcher/tornado_fetcher.py
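
The keyword conflict described above can be illustrated with a minimal sketch. It assumes the fix is a plain rename of the identifier async; the helper rename_async and the replacement name async_mode are hypothetical and are not part of pyspider.

```python
import keyword
import re

# Hypothetical replacement name; any identifier that is not a keyword works.
NEW_NAME = "async_mode"


def rename_async(source, new_name=NEW_NAME):
    """Replace the standalone word `async` with `new_name`.

    The word boundaries leave longer names such as `async_mode` untouched.
    """
    return re.sub(r"\basync\b", new_name, source)


if __name__ == "__main__":
    # On Python 3.7 and later, `async` is a reserved keyword.
    print(keyword.iskeyword("async"))
    # The offending line from pyspider/run.py, as shown in the traceback.
    print(rename_async("async=True, get_object=False, no_input=False):"))
```

A blind rename like this also touches the word async inside string literals and any real `async def` statements, so each change in the three files is best reviewed by hand.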

Problem 2: libcurl link-time version (7.64.1) is older than compile-time version

ImportError: pycurl: libcurl link-time version (7.64.1) is older than compile-time version (7.65.3)

Analysis
https://www.cjjjs.com/article/201841813540391
Check the curl version (only the relevant output is shown):

curl -V
curl 7.65.3 (x86_64-apple-darwin13.4.0) libcurl/7.65.3 OpenSSL/1.1.1d zlib/1.2.11 libssh2/1.8.2

Find the libcurl.* files on the current system:

/usr/lib/libcurl.dylib
/usr/lib/libcurl.4.dylib
/usr/lib/libcurl.3.dylib

/Users/apple/opt/anaconda3/lib/libcurl.dylib
/Users/apple/opt/anaconda3/pkgs/libcurl-7.65.3-h051b688_0/lib/libcurl.dylib

/System/Volumes/Data/Users/apple/opt/anaconda3/lib/libcurl.dylib
/System/Volumes/Data/Users/apple/opt/anaconda3/pkgs/libcurl-7.65.3-h051b688_0/lib/libcurl.dylib
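
The two numbers in the error message are libcurl versions, not timestamps. As a rough sketch, they can be pulled out of the message and compared as tuples; the helper name parse_libcurl_versions is hypothetical and only the standard library is used.

```python
import re

# Matches the two version numbers in pycurl's ImportError message.
_PATTERN = re.compile(
    r"link-time version \(([\d.]+)\).*?compile-time version \(([\d.]+)\)",
    re.DOTALL,
)


def parse_libcurl_versions(message):
    """Return (link_time, compile_time) as tuples of ints, or None if no match."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    link, compiled = (
        tuple(int(part) for part in version.split("."))
        for version in match.groups()
    )
    return link, compiled


if __name__ == "__main__":
    msg = ("ImportError: pycurl: libcurl link-time version (7.64.1) "
           "is older than compile-time version (7.65.3)")
    link, compiled = parse_libcurl_versions(msg)
    # The libcurl found at load time is older than the one pycurl was built against.
    print(link, compiled, link < compiled)
```

Read against the file list above, 7.65.3 matches the libcurl shipped in the anaconda3 directory, which suggests the older 7.64.1 is the copy under /usr/lib.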

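Solution 1: uninstall and reinstall pycurl

# First confirm which Python runs the script, then use that interpreter's pip to uninstall and reinstall.
/usr/bin/python -m pip list
/usr/bin/python -m pip uninstall pycurl
/usr/bin/python -m pip install pycurl
# or
pip uninstall pycurl
pip install pycurl

Start pyspider again to check the result:

# Start pyspider
pyspider all

pyspider still fails with the same error, so Solution 1 does not work:

ImportError: pycurl: libcurl link-time version (7.64.1) is older than compile-time version (7.65.3)

Solution 2: recompile and reinstall pycurl (recommended)

# Recompile and install, bypassing the pip cache
pip3 install pycurl --compile --no-cache-dir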

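Verifying the directory Python imports the library from

After deleting libcurl.4.dylib from /usr/lib, the import fails with:

import pycurl # type: ignore
ImportError: dlopen(/usr/local/lib/python3.7/site-packages/pycurl.cpython-37m-darwin.so, 2): Library not loaded: @rpath/libcurl.4.dylib
  Referenced from: /usr/local/lib/python3.7/site-packages/pycurl.cpython-37m-darwin.so
  Reason: image not found

# Import pycurl in the Python interpreter

>>> import pycurl
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pycurl'

Analysis:

The error is global: the module that imports pycurl is shared library code, so once it fails, many places are affected. The failure happens on import pycurl, and the message says that the link-time and compile-time versions of libcurl differ: pycurl was compiled against libcurl 7.65.3, but the libcurl it links to when loaded is the older 7.64.1. In my case a newer curl had been compiled and installed, while the curl that Python uses is an older one installed earlier, so the conflict is between the newly built curl and the curl installed with Python.

Why does this message appear? The curl installed with Python was prebuilt and installed directly, whereas the new one was built from source, so a mismatch between the version pycurl was built against and the version it finds at load time is not surprising. This pins down the scenario and narrows the problem to this library.

Approach

  1. Reinstall pycurl: failed.
  2. Recompile pycurl.cpython-37m-darwin.so.

Conclusion:

The library file that Python imports is indeed the one in the site-packages directory.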
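
One way to check this conclusion without deleting system libraries is to ask Python where it would load a module from. The sketch below uses only the standard library; the helper name module_origin is hypothetical. For a top-level name, importlib.util.find_spec locates the module without importing it, so it works even when import pycurl itself fails.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # When pycurl is installed, this is expected to point into site-packages.
    print(module_origin("pycurl"))
    # A standard-library package, for comparison.
    print(module_origin("json"))
```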

tesserocr

Introduction

Installation

# Install imagemagick

brew install imagemagick

Output after a successful install

==> Caveats
==> libffi
libffi is keg-only, which means it was not symlinked into /usr/local,
because macOS already provides this software and installing another version in
parallel can cause all kinds of trouble.

For compilers to find libffi you may need to set:
  export LDFLAGS="-L/usr/local/opt/libffi/lib"
  export CPPFLAGS="-I/usr/local/opt/libffi/include"

==> python@3.8
Python has been installed as
  /usr/local/opt/python@3.8/bin/python3

You can install Python packages with
  /usr/local/opt/python@3.8/bin/pip3 install <package>
They will install into the site-package directory
  /usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages

See: https://docs.brew.sh/Homebrew-and-Python

python@3.8 is keg-only, which means it was not symlinked into /usr/local,
because this is an alternate version of another formula.

If you need to have python@3.8 first in your PATH run:
  echo 'export PATH="/usr/local/opt/python@3.8/bin:$PATH"' >> ~/.zshrc

For compilers to find python@3.8 you may need to set:
  export LDFLAGS="-L/usr/local/opt/python@3.8/lib"

==> glib
Bash completion has been installed to:
  /usr/local/etc/bash_completion.d
==> docbook
To use the DocBook package in your XML toolchain,
you need to add the following to your ~/.bashrc:

export XML_CATALOG_FILES="/usr/local/etc/xml/catalog"
==> gnu-getopt
gnu-getopt is keg-only, which means it was not symlinked into /usr/local,
because macOS already provides this software and installing another version in
parallel can cause all kinds of trouble.

If you need to have gnu-getopt first in your PATH run:
  echo 'export PATH="/usr/local/opt/gnu-getopt/bin:$PATH"' >> ~/.zshrc

Bash completion has been installed to:
  /usr/local/opt/gnu-getopt/etc/bash_completion.d
==> libtool
In order to prevent conflicts with Apple's own libtool we have prepended a "g"
so, you have instead: glibtool and glibtoolize.

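# Install tesseract together with its language packs
brew install tesseract-lang

# Install tesserocr and pillow
pip install tesserocr pillow

# Verify the installation

## Test image URL
https://raw.githubusercontent.com/Python3WebSpider/TestTess/master/image.png

tesseract image.png result -l eng && cat result.txt

# Show details such as the current version, dependencies, and caveats
brew info tesseract

Output

Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Python3WebSpider

Parameters:

First argument: the image file name
Second argument: the base name of the output file (tesseract writes result.txt)
Third argument, -l: the language pack to use; eng means English
cat: prints the recognized text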
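
The same command line can be assembled from Python. This is a minimal sketch using only the standard library; the helper names build_tesseract_command and run_ocr are hypothetical, and run_ocr assumes image.png has already been downloaded to the current directory.

```python
import shutil
import subprocess


def build_tesseract_command(image, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image, output_base, "-l", lang]


def run_ocr(image, output_base, lang="eng"):
    """Run tesseract and return the recognized text, or None if it is not installed."""
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    # tesseract appends .txt to the output base name.
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    print(build_tesseract_command("image.png", "result"))
```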

Common problems

On macOS, installing tesseract with brew reports that --all-languages is invalid:
https://blog.csdn.net/weixin_40368256/article/details/100624099

brew install tesseract --all-languages (failed; install tesseract-lang instead)

RedisDump

Introduction

Installation

# Installation command
# Prerequisite: install Ruby first
sudo gem install redis-dump

# Verify the installation
redis-dump
redis-load
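
The check above can also be scripted. The sketch below only tests whether the two commands are on PATH and does not run them; the helper name missing_tools is hypothetical.

```python
import shutil


def missing_tools(names):
    """Return the names that cannot be found as executables on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # An empty list means both commands are available.
    print(missing_tools(["redis-dump", "redis-load"]))
```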

Flask

Introduction

Web library: installing Flask

# Installation command
pip install flask

Verifying the installation

from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello():
	return 'Hello World!'

if __name__ == '__main__':
	app.run()

Tornado

Introduction

Web library: installing Tornado

# Installation command
pip install tornado

Verifying the installation

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
	def get(self):
		self.write("Hello, world")

def make_app():
	return tornado.web.Application([
		(r"/", MainHandler)
	])

if __name__ == "__main__":
	app = make_app()
	app.listen(8888)
	tornado.ioloop.IOLoop.current().start()

mitmproxy

Introduction

App crawling: installing mitmproxy

# Installation command
pip install mitmproxy

Verifying the installation

node

Introduction

Installing Node

# Installation command
brew install node

Output after a successful install

==> Caveats
==> icu4c
icu4c is keg-only, which means it was not symlinked into /usr/local,
because macOS provides libicucore.dylib (but nothing else).

If you need to have icu4c first in your PATH run:
  echo 'export PATH="/usr/local/opt/icu4c/bin:$PATH"' >> ~/.zshrc
  echo 'export PATH="/usr/local/opt/icu4c/sbin:$PATH"' >> ~/.zshrc

For compilers to find icu4c you may need to set:
  export LDFLAGS="-L/usr/local/opt/icu4c/lib"
  export CPPFLAGS="-I/usr/local/opt/icu4c/include"

==> node
Bash completion has been installed to:
  /usr/local/etc/bash_completion.d

Verifying the installation

node -v 

npm -v

appium

Introduction

App crawling: installing Appium

npm install -g appium

Installation command

pip install mitmproxy

Verifying the installation
