Python Scraping Tools

1. Tools Introduction

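Scrapy: an application framework for web scraping and crawling

BeautifulSoup: a library for parsing HTML

mechanize: a library for stateful, programmatic web browsing

lxml: a library for processing XML and HTML

Selenium/PhantomJS/CasperJS: for pages that require executing scripts (JavaScript)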


2. Install Scrapy on Windows 7 32-bit

(1) Install Python 2.7

Note: installing it requires administrator privileges.

(2) Install Microsoft Visual C++ Compiler for Python 2.7 (http://www.microsoft.com/en-us/download/details.aspx?id=44266)

Note 1: without this package you will get an error like: Microsoft Visual C++ 9.0 is required (Unable to find vcvarsall.bat)

Note 2: installing it does not require administrator privileges.

(3) Install lxml (https://pypi.python.org/pypi/lxml/3.5.0): download the installer and run it.

Note: if lxml is not installed in advance, the Scrapy installation will complain that it cannot find libxml2.

(4) From the Scripts directory of the Python 2.7 installation, run: pip.exe install Scrapy

(5) Install pywin32 (otherwise you will get the error: no module named win32api)

Following Reference [2], run from the command line: easy_install-2.7.exe e:\software\pywin32-219.win32-py2.7.exe

Download location: http://sourceforge.net/projects/pywin32/files/pywin32/

(6) Install selenium: pip.exe install selenium
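
Once steps (1) to (6) are done, a quick import check can confirm that nothing was missed. The script below is only a sketch, not part of the original procedure: the helper names check_modules and report are made up for illustration, and the list of module names assumes the import names lxml, scrapy, win32api (provided by pywin32) and selenium.

```python
import importlib

# Import names of the packages installed in steps (3) to (6).
# pywin32 is imported as win32api, not as pywin32.
REQUIRED_MODULES = ["lxml", "scrapy", "win32api", "selenium"]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except Exception:
            # A broken install can raise errors other than ImportError,
            # so any exception during import counts as a failure here.
            status[name] = False
    return status


def report(status):
    """Return one line per module, e.g. 'lxml: OK' or 'lxml: MISSING'."""
    return ["%s: %s" % (name, "OK" if ok else "MISSING")
            for name, ok in sorted(status.items())]


if __name__ == "__main__":
    for line in report(check_modules(REQUIRED_MODULES)):
        print(line)
```

Run it with the same python.exe that pip.exe belongs to. A MISSING line for win32api points back to step (5), and one for lxml to step (3).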


3. Set Up Eclipse PyDev for Scrapy

(1) Download Eclipse Luna (4.4)

(2) Install the PyDev plugin for Eclipse 4.4 and configure PyDev in the Eclipse preferences.

(3) Follow the steps from Reference [1]:

step 1: create a Scrapy project with the scrapy startproject command

step 2: create a PyDev project in Eclipse

step 3: copy the Scrapy project files into the PyDev project folder

After this step the folder hierarchy is 4 levels deep, because the Scrapy project itself already has 3.

step 4: in Eclipse, open Run -> Debug Configurations -> Main and set:

    Name: any configuration name

    Project: choose the Scrapy project

    Main Module: do not browse; enter the full path of cmdline.py (in my case: D:\Python\Python27\Lib\site-packages\scrapy\cmdline.py)

step 5: in Eclipse, open Run -> Debug Configurations -> Arguments and set:

    Program arguments: crawl spidername
    Working directory -> Other: choose the spider's working directory
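
The debug configuration in steps 4 and 5 can also be approximated by a small launcher script placed in the PyDev project, which can then be selected as the Main Module instead of cmdline.py. This is a sketch under assumptions rather than the method from Reference [1]: the helper names find_scrapy_root, build_argv and run_spider are made up for illustration, and it assumes the working directory should be the folder that holds scrapy.cfg. The only Scrapy call used is scrapy.cmdline.execute, the function that cmdline.py itself runs.

```python
import os


def find_scrapy_root(start_dir):
    """Return the first directory under start_dir that contains scrapy.cfg, or None."""
    for dirpath, dirnames, filenames in os.walk(start_dir):
        if "scrapy.cfg" in filenames:
            return dirpath
    return None


def build_argv(spider_name):
    """Build the argument list equivalent to 'scrapy crawl <spider_name>'."""
    return ["scrapy", "crawl", spider_name]


def run_spider(pydev_project_dir, spider_name):
    """Change into the Scrapy project folder and start the crawl."""
    # Imported here so the helpers above also work without Scrapy installed.
    from scrapy.cmdline import execute

    root = find_scrapy_root(pydev_project_dir)
    if root is None:
        raise RuntimeError("no scrapy.cfg found under %s" % pydev_project_dir)
    os.chdir(root)                    # plays the role of the Working directory setting
    execute(build_argv(spider_name))  # plays the role of the Program arguments setting
```

For example, calling run_spider(r"D:\workspace\myproject", "spidername") at the end of the script would start the crawl; both the path and the spider name here are placeholders. Breakpoints set in the spider code are hit the same way as with the cmdline.py configuration.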


References

[1] https://www.zhihu.com/question/28565716/answer/53736780

[2] http://stackoverflow.com/questions/26689371/scrapy-no-module-named-win32api-windows
