1. Tools Introduction
scrapy: application framework for web scraping and crawling
beautifulsoup: library for parsing HTML
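mechanize: library for stateful programmatic web browsing (forms, links, cookies)
lxml: library for processing XML and HTML, built on libxml2 and libxslt
selenium/PhantomJS/CasperJS: browser automation tools, for pages that need their scripts executed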
2. Install Scrapy on Windows 7 (32-bit)
(1) Install Python 2.7
Note: the installer requires administrator privileges.
(2) Install Microsoft Visual C++ Compiler for Python 2.7 (http://www.microsoft.com/en-us/download/details.aspx?id=44266)
Note 1: without this package, you will get an error like: Microsoft Visual C++ 9.0 is required (Unable to find vcvarsall.bat)
Note 2: this installer does not require administrator privileges.
(3) Install lxml (https://pypi.python.org/pypi/lxml/3.5.0): download the installer and run it.
Note: if lxml is not installed in advance, the Scrapy installation will complain that it cannot find libxml2.
(4) From the Scripts directory of Python 2.7, run: pip.exe install Scrapy
(5) Install pywin32 (otherwise you will get the error: no module named win32api). See Reference [2].
From the command line: easy_install-2.7.exe e:\software\pywin32-219.win32-py2.7.exe
Download location: http://sourceforge.net/projects/pywin32/files/pywin32/
(6) Install selenium: pip.exe install selenium
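The installation steps above can be checked with a short script. This is only a minimal sketch: the module list is taken from the steps above (win32api is the module provided by pywin32), and the names REQUIRED_MODULES, find_missing and report are made up for illustration.

```python
import importlib

# Module names taken from the installation steps above.
# win32api is the module that pywin32 provides.
REQUIRED_MODULES = ["lxml", "scrapy", "win32api", "selenium"]


def find_missing(module_names):
    """Return the names in module_names that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


def report(module_names=REQUIRED_MODULES):
    """Print a one-line summary and return the list of missing modules."""
    missing = find_missing(module_names)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All modules import correctly.")
    return missing
```

Calling report() from the Python 2.7 interpreter lists any module that still needs to be installed. The script only checks that each import succeeds. It does not check versions.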
3. Set up Eclipse PyDev for Scrapy
(1) Download Eclipse Luna (4.4)
(2) Install the PyDev plugin for Eclipse 4.4 and configure it in the Eclipse preferences.
(3) Create the project and its debug configuration, following Reference [1]:
Step 1: create a Scrapy project with the scrapy command (scrapy startproject).
Step 2: create a PyDev project in Eclipse.
Step 3: copy the Scrapy project files into the PyDev project folder.
After this step the folder hierarchy is 4 levels deep, because the Scrapy project itself has 3 levels.
Step 4: in Eclipse, open Run -> Debug Configurations -> Main and set:
Name: any configuration name
Project: choose the Scrapy project
Main Module: do not browse; type the full path of cmdline.py (in my case: D:\Python\Python27\Lib\site-packages\scrapy\cmdline.py)
Step 5: in Eclipse, open Run -> Debug Configurations -> Arguments and set:
Program arguments: crawl spidername
Working directory -> Other: choose the working directory of the spider (typically the folder that contains scrapy.cfg)
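The debug configuration above runs cmdline.py with the arguments "crawl spidername". A small launcher script can do the same thing from inside the project, which gives PyDev an ordinary module to debug. This is a sketch and not part of Reference [1]: the helper names build_crawl_argv and run_spider and the project_dir parameter are made up for illustration. It relies on scrapy.cmdline.execute, the function that the scrapy command itself calls.

```python
import os


def build_crawl_argv(spider_name):
    """Return the argument list equivalent to 'scrapy crawl <spider_name>'."""
    # The first element plays the role of the program name.
    return ["scrapy", "crawl", spider_name]


def run_spider(spider_name, project_dir=None):
    """Run a spider the way the debug configuration above does."""
    # Imported here so this file can be loaded even if Scrapy is missing.
    from scrapy.cmdline import execute

    if project_dir is not None:
        # Scrapy finds the project through scrapy.cfg, searching from the
        # current working directory upwards.
        os.chdir(project_dir)
    # execute() ends by calling sys.exit(), so nothing after it will run.
    execute(build_crawl_argv(spider_name))
```

With this file saved in the project folder, run_spider("spidername") starts the crawl. Setting a breakpoint in the spider and launching the script with Debug As -> Python Run should then stop inside the spider code.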
References:
[1] https://www.zhihu.com/question/28565716/answer/53736780
[2] http://stackoverflow.com/questions/26689371/scrapy-no-module-named-win32api-windows