Introduction: The web holds a huge amount of valuable data, and crawler (web-scraping) techniques give us an effective way to collect it. This article walks through 5 practical crawler examples, each with a short code walkthrough, so you can quickly see where crawlers are useful and how they are implemented.
Case 1: Crawl the content of a specified web page and save it as a local file.
Code implementation (Python):
import requests

def get_web_content(url, filename):
    response = requests.get(url)
    if response.status_code == 200:
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(response.text)
        print("Web page content saved to:", filename)
    else:
        print("Failed to fetch the web page")

if __name__ == "__main__":
    url = "https://www.example.com"  # replace with the target page URL
    filename = "web_content.txt"
    get_web_content(url, filename)
Code analysis: the requests library sends an HTTP GET request to fetch the page; if the status code is 200, the response text is written to a local file using UTF-8 encoding.
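In practice, many sites reject bare requests that carry no browser-like headers, and a slow server can hang the script indefinitely. The sketch below is a hardened variant of the same idea; the get_web_content_safe name, the User-Agent string, and the 10-second timeout are illustrative choices, not requirements.

import requests

def get_web_content_safe(url, filename, timeout=10):
    # A browser-like User-Agent; the exact value here is only a placeholder.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; demo-crawler/1.0)"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raises for 4xx/5xx status codes
    except requests.RequestException as exc:
        print("Request failed:", exc)
        return
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(response.text)
    print("Web page content saved to:", filename)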
Case 2: Crawl images from a web page and download them to local files.
Code implementation (Python):
import requests

def download_image(url, filename):
    response = requests.get(url)
    if response.status_code == 200:
        with open(filename, 'wb') as f:
            f.write(response.content)
        print("Image saved to:", filename)
    else:
        print("Image download failed")

if __name__ == "__main__":
    img_url = "https://www.example.com/images/example.jpg"  # replace with the target image URL
    img_filename = "example.jpg"
    download_image(img_url, img_filename)
Code analysis: the requests library downloads the image, and the binary response body (response.content) is written to a local file opened in binary mode ('wb').
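For large images (or any big binary file), loading the whole response into memory before writing it out can be wasteful. A minimal streaming variant is sketched below, assuming the same requests-based setup; the download_image_streaming name and the 8192-byte chunk size are arbitrary illustrative choices.

import requests

def download_image_streaming(url, filename, chunk_size=8192):
    # stream=True defers downloading the body until it is iterated over.
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    print("Image saved to:", filename)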
Case 3: Crawl product information from an online store and store the data in a CSV file.
Code implementation (Python):
import requests
import csv

def scrape_product_info(url, filename):
    response = requests.get(url)
    if response.status_code == 200:
        products = response.json()
        with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ['id', 'name', 'price', 'rating']  # columns expected in each product record
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            for product in products:
                writer.writerow(product)
        print("Product information saved to:", filename)
    else:
        print("Data collection failed")

if __name__ == "__main__":
    api_url = "https://www.example.com/api/products"  # replace with the target API endpoint
    csv_filename = "product_info.csv"
    scrape_product_info(api_url, csv_filename)
Code analysis: the requests library fetches the JSON data returned by the API endpoint, and csv.DictWriter writes the selected fields of each product record into a CSV file.
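Real product APIs are usually paginated rather than returning everything in one response. The sketch below shows one possible way to walk through the pages before writing the CSV; the page and per_page query parameters and the list-shaped JSON body are assumptions made for illustration, not the contract of any real endpoint.

import requests

def fetch_all_products(base_url, per_page=50):
    # The 'page'/'per_page' parameter names and the list-shaped responses are assumed.
    products = []
    page = 1
    while True:
        response = requests.get(base_url, params={"page": page, "per_page": per_page}, timeout=30)
        response.raise_for_status()
        batch = response.json()
        if not batch:  # an empty page marks the end of the data
            break
        products.extend(batch)
        page += 1
    return products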
Case 4: Set up a scheduled task that crawls and refreshes the data periodically.
Code implementation (Python):
import requests
import time

def crawl_and_update(url):
    while True:
        response = requests.get(url)
        if response.status_code == 200:
            # handle the data-update logic here
            print("Data updated successfully")
        else:
            print("Data update failed")
        time.sleep(3600)  # run once every hour

if __name__ == "__main__":
    target_url = "https://www.example.com/data"  # replace with the target data URL
    crawl_and_update(target_url)
Code analysis: the time library provides a simple scheduler; the loop sleeps for one hour between requests, so the data is crawled and refreshed periodically.
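A bare while True / time.sleep loop works, but it blocks the whole program and slowly drifts if each crawl takes noticeable time. One possible alternative, assuming the third-party schedule package (installed with pip install schedule) is acceptable in your project, is sketched below.

import time
import requests
import schedule

def crawl_once(url="https://www.example.com/data"):  # placeholder URL from the example above
    response = requests.get(url, timeout=30)
    print("Data updated successfully" if response.ok else "Data update failed")

schedule.every(1).hours.do(crawl_once)  # register the job to run once per hour
while True:
    schedule.run_pending()  # fires only the jobs whose interval has elapsed
    time.sleep(60)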
Case 5: Use Selenium to simulate browser behavior for data collection.
Code implementation (Python):
from selenium import webdriver

def scrape_with_selenium(url):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # headless mode: no browser window is opened
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    # parse the page and collect data here
    # ...
    driver.quit()
    print("Data collection finished")

if __name__ == "__main__":
    target_url = "https://www.example.com"  # replace with the target page URL
    scrape_with_selenium(target_url)
Code analysis: add your page-parsing and data-collection logic inside the scrape_with_selenium function; because Selenium drives a real (headless) Chrome browser, it can also scrape content rendered by JavaScript.
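As one possible way to fill in that parsing step, the sketch below waits for matching elements to appear and then collects their text. The scrape_titles name and the .product-title CSS selector are hypothetical placeholders and must be adapted to the actual target page.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_titles(url, css_selector=".product-title"):  # hypothetical selector
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait up to 10 seconds for at least one matching element to appear.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return [el.text for el in driver.find_elements(By.CSS_SELECTOR, css_selector)]
    finally:
        driver.quit()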