Web crawlers automatically collect information from the web. Thanks to its rich ecosystem of third-party libraries and its readable syntax, Python has become a popular choice for crawler development.
1. Setting Up the Python Environment
Install Python 3.x and configure your environment. The installer can be downloaded from https://www.python.org/downloads/.
2. Commonly Used Libraries
requests: sends HTTP requests and handles responses.
BeautifulSoup (beautifulsoup4): parses HTML and extracts data from it.
lxml: a fast HTML/XML parser, used here as BeautifulSoup's backend.
Scrapy: a full-featured crawling framework, covered later in this article.
3. Installing the Libraries
pip install requests
pip install beautifulsoup4
pip install lxml
pip install scrapy
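To confirm the packages installed correctly, a quick import check can be run from a shell (a sanity check of my own, not from the original article; any ImportError points at a missing package):

python -c "import requests, bs4, lxml, scrapy; print('all imports OK')"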
1. HTTP Requests
A crawler talks to servers over the HTTP protocol; the most common request methods are GET and POST.
Example:
import requests
url = 'https://www.example.com'
response = requests.get(url)
print(response.status_code)  # print the status code
print(response.text)  # print the response body
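The text above mentions POST as well, but the example only shows GET. As a complementary sketch, here is a POST request carrying form data; the URL and field names are placeholders (httpbin.org simply echoes back what it receives, which makes it convenient for testing):

import requests

# form fields to submit; purely illustrative values
data = {'username': 'test', 'password': 'secret'}
response = requests.post('https://httpbin.org/post', data=data)
print(response.status_code)
print(response.json())  # httpbin returns the posted fields as JSON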
2. HTML Parsing
To extract information from a page, we need to parse its HTML. BeautifulSoup is an easy-to-use yet powerful HTML parsing library.
Example:
from bs4 import BeautifulSoup
html = '''
<html>
<head><title>Example Page</title></head>
<body>
<h1>Welcome to the Example Page</h1>
<p>This is a paragraph.</p>
<a href="https://www.example.com/page2">Go to page 2</a>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
title = soup.title.string  # the page title
h1 = soup.h1.string  # text of the <h1> tag
link = soup.a['href']  # the link's href attribute
print(title)
print(h1)
print(link)
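Besides accessing a single tag, BeautifulSoup can return every match. A minimal sketch against the same html string, using the standard find_all and select APIs:

# find_all returns a list of all matching tags
for p in soup.find_all('p'):
    print(p.get_text())

# select accepts CSS selectors, e.g. every <a> that has an href attribute
for a in soup.select('a[href]'):
    print(a['href'])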
3. Hands-On: Scraping the Douban Movie Top 250
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
}

def get_movie_info(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    movie_list = soup.find('ol', class_='grid_view')
    if movie_list is None:
        print("Movie list not found")
        return
    for movie in movie_list.find_all('li'):
        rank = movie.find('em').string
        title = movie.find('span', class_='title').string
        rating = movie.find('span', class_='rating_num').string
        link = movie.find('a')['href']
        print(f"Rank: {rank}")
        print(f"Title: {title}")
        print(f"Rating: {rating}")
        print(f"Link: {link}")
        print("-------")

def main():
    base_url = "https://movie.douban.com/top250?start="
    for i in range(0, 250, 25):
        url = base_url + str(i)
        print(f"Scraping page {i // 25 + 1}")
        get_movie_info(url)

if __name__ == '__main__':
    main()
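Douban throttles clients that request pages too quickly, so it is wise to pause between pages. A hedged variant of main() above (the one-second delay is an illustrative value, not from the original code):

import time

def main():
    base_url = "https://movie.douban.com/top250?start="
    for i in range(0, 250, 25):
        print(f"Scraping page {i // 25 + 1}")
        get_movie_info(base_url + str(i))
        time.sleep(1)  # be polite: pause one second between pages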
1. Exception Handling
Use Python's try-except statement to handle exceptions that occur while crawling.
Example:
import requests
from bs4 import BeautifulSoup
def get_page(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        print("Request failed")
    return None  # reached on a non-200 status or a network error

def parse_page(html):
    try:
        soup = BeautifulSoup(html, 'lxml')
        title = soup.title.string
        print(title)
    except Exception as e:
        print(f"Parsing failed: {e}")

def main():
    url = "https://www.example.com"
    html = get_page(url)
    if html:
        parse_page(html)

if __name__ == '__main__':
    main()
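A request can also hang indefinitely or fail transiently; adding a timeout and a simple retry loop makes the crawler more robust. A minimal sketch with illustrative values (three retries, five-second timeout); get_page_with_retry is a hypothetical helper, not part of the original code:

import requests

def get_page_with_retry(url, retries=3, timeout=5):
    # retry on network errors and non-200 responses
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=timeout)
            if response.status_code == 200:
                return response.text
            print(f"Attempt {attempt}: got status {response.status_code}")
        except requests.RequestException as e:
            print(f"Attempt {attempt} failed: {e}")
    return None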
2. Multithreaded Crawling
When scraping a large amount of data, multithreading can improve crawler throughput. Python's threading module makes a multithreaded crawler straightforward.
Example:
import requests
from bs4 import BeautifulSoup
import threading
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
}

def get_movie_info(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    movie_list = soup.find('ol', class_='grid_view')
    if movie_list is None:
        print("Movie list not found")
        return
    for movie in movie_list.find_all('li'):
        rank = movie.find('em').string
        title = movie.find('span', class_='title').string
        rating = movie.find('span', class_='rating_num').string
        link = movie.find('a')['href']
        print(f"Rank: {rank}")
        print(f"Title: {title}")
        print(f"Rating: {rating}")
        print(f"Link: {link}")
        print("-------")

def run(start):
    url = f"https://movie.douban.com/top250?start={start}"
    get_movie_info(url)

def main():
    threads = []
    for i in range(0, 250, 25):
        t = threading.Thread(target=run, args=(i,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()  # wait for every page to finish

if __name__ == '__main__':
    main()
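The standard-library concurrent.futures module offers a higher-level alternative that also caps how many requests run at once (the max_workers value of 5 is an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor

def main():
    # at most 5 pages are fetched concurrently, which is gentler
    # on the server than launching all 10 threads at once
    with ThreadPoolExecutor(max_workers=5) as pool:
        pool.map(run, range(0, 250, 25))  # reuses run() from above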
1. Creating a Scrapy Project
First, create a Scrapy project with the following command:
scrapy startproject myspider
This creates a Scrapy project named myspider with the following structure:
myspider/
    scrapy.cfg
    myspider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
2. Defining the Item
The items.py file defines the data structure to scrape. In this example, we scrape the Douban Movie Top 250.
Edit items.py as follows:
import scrapy
class DoubanMovieItem(scrapy.Item):
    rank = scrapy.Field()
    title = scrapy.Field()
    rating = scrapy.Field()
    link = scrapy.Field()
3. Writing the Spider
In the spiders directory, create a file named douban_spider.py and write the spider code:
import scrapy
from ..items import DoubanMovieItem
class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    start_urls = [f'https://movie.douban.com/top250?start={i}' for i in range(0, 250, 25)]

    def parse(self, response):
        movie_list = response.css('ol.grid_view li')
        for movie in movie_list:
            item = DoubanMovieItem()
            item['rank'] = movie.css('em::text').get()
            item['title'] = movie.css('span.title::text').get()
            item['rating'] = movie.css('span.rating_num::text').get()
            item['link'] = movie.css('div.hd a::attr(href)').get()
            yield item
4. Data Storage and Anti-Scraping Settings
Scrapy can save scraped data in several formats, such as JSON and CSV. In this example, we save the data as a JSON file.
Add the following settings to the settings.py file:
FEED_FORMAT = 'json'  # export format
FEED_URI = 'douban_top250.json'  # output file
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'  # present as a regular browser
DOWNLOAD_DELAY = 3  # wait 3 seconds between requests to reduce the risk of being blocked
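Note that FEED_FORMAT and FEED_URI are deprecated as of Scrapy 2.1 in favor of the single FEEDS setting; on a recent Scrapy, the equivalent configuration is:

FEEDS = {
    'douban_top250.json': {'format': 'json', 'encoding': 'utf8'},
}

Alternatively, the output file can be specified on the command line with scrapy crawl douban -o douban_top250.json, without touching settings.py.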
5. Running the Spider
From the project root directory, start the spider with:
scrapy crawl douban
When the run finishes, a file named douban_top250.json containing the scraped data will appear in the project root.
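To verify the result, the file can be loaded back with the standard json module (a quick check of my own, assuming the default output location above):

import json

with open('douban_top250.json', encoding='utf-8') as f:
    movies = json.load(f)
print(len(movies))  # should be 250 if every page parsed cleanly
print(movies[0])  # first entry: rank, title, rating, link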