Below is a complete guide to building a Python web scraper that collects data from a customer order website, covering the technical implementation, practical caveats, and legal compliance notes:
Confirm legality first: check the site's robots.txt file (e.g. https://example.com/robots.txt) to see which paths crawlers are allowed to fetch. Recommended alternative: if the client can provide an official API or a data export, prefer that over scraping.
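A quick programmatic check of robots.txt can be done with the standard library's urllib.robotparser; a minimal sketch (the /orders path is just an example):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()
# True if a generic crawler ("*") may fetch the order list path
print(rp.can_fetch("*", "https://example.com/orders"))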
Walk through the site manually first: log in through a browser and use the developer tools to inspect the Headers, Cookies, and Form Data structure of the requests involved. Example URL patterns:
https://example.com/orders?page=1 (paginated order list)
https://example.com/order?id=123 (single order detail)
Use requests with a Session to keep the login state across requests:
import requests

session = requests.Session()
login_url = "https://example.com/login"
data = {
    "username": "your_username",
    "password": "your_password"
}
# POST the credentials; the session keeps the returned cookies for later requests
response = session.post(login_url, data=data)
if response.status_code == 200:
    print("Login succeeded")
else:
    print("Login failed")
from bs4 import BeautifulSoup

order_list_url = "https://example.com/orders"
response = session.get(order_list_url)
soup = BeautifulSoup(response.text, "html.parser")

# Parse the order list table (selectors are examples; adjust them to the real markup)
orders = []
for row in soup.select("table.orders tr"):
    order_id = row.select_one(".order-id")
    order_date = row.select_one(".order-date")
    if order_id is None or order_date is None:
        continue  # skip header or malformed rows
    orders.append({"id": order_id.text.strip(), "date": order_date.text.strip()})
print(orders)
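If per-order details are needed, the detail URL pattern noted earlier (https://example.com/order?id=123) can be fetched with the same session. A minimal sketch, assuming a hypothetical .order-total element on the detail page:

# Sketch: enrich each order with data from its detail page (selector is an assumption)
for order in orders:
    detail_resp = session.get("https://example.com/order", params={"id": order["id"]})
    detail_soup = BeautifulSoup(detail_resp.text, "html.parser")
    amount_el = detail_soup.select_one(".order-total")  # hypothetical selector
    if amount_el is not None:
        order["amount"] = amount_el.text.strip()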
If the pages are rendered by JavaScript, use Selenium or Playwright instead (a Playwright sketch follows the Selenium example):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")  # headless mode (Selenium 4 syntax)
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/orders")
driver.find_element(By.ID, "username").send_keys("your_username")
driver.find_element(By.ID, "password").send_keys("your_password")
driver.find_element(By.ID, "login-btn").click()

# Wait until the order items have actually been rendered
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".order-item"))
)
order_elements = driver.find_elements(By.CSS_SELECTOR, ".order-item")
for element in order_elements:
    print(element.text)
driver.quit()
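The same flow in Playwright, as a minimal sketch using its synchronous API and the same hypothetical element IDs:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/orders")
    page.fill("#username", "your_username")
    page.fill("#password", "your_password")
    page.click("#login-btn")
    page.wait_for_selector(".order-item")  # wait for the list to render
    for item in page.query_selector_all(".order-item"):
        print(item.inner_text())
    browser.close()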
Handle pagination by iterating over the page parameter:
base_url = "https://example.com/orders?page={}"
for page in range(1, 10):  # assumes at most 10 pages
    url = base_url.format(page)
    response = session.get(url)
    # parse the data as above...
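Hard-coding the page count is fragile; a sketch that instead stops once a page returns no more orders (the .order-item selector is carried over as an assumption):

page = 1
while True:
    response = session.get(f"https://example.com/orders?page={page}")
    soup = BeautifulSoup(response.text, "html.parser")
    items = soup.select(".order-item")
    if not items:
        break  # no more orders, stop paginating
    # ...parse items as above...
    page += 1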
Save to a CSV file:
import csv

with open("orders.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "date", "amount"])
    writer.writeheader()
    for order in orders:
        writer.writerow(order)  # fields missing from an order (e.g. amount) are left blank
Save to a database (SQLite example):
import sqlite3

conn = sqlite3.connect("orders.db")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        id TEXT PRIMARY KEY,
        date TEXT,
        amount REAL
    )
""")
for order in orders:
    cursor.execute("INSERT OR IGNORE INTO orders VALUES (?, ?, ?)",
                   (order["id"], order["date"], order.get("amount")))
conn.commit()
conn.close()
Set request headers so the requests look like a normal browser:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Referer": "https://example.com/"
}
response = session.get(url, headers=headers)
IP proxy pool (purchased or self-hosted):
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080"
}
response = session.get(url, proxies=proxies)
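To actually rotate through a pool rather than reuse one proxy, a minimal sketch (the addresses are placeholders):

import random

# Hypothetical pool; in practice these come from a provider or your own servers
proxy_pool = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
    "http://10.10.1.12:3128",
]
proxy = random.choice(proxy_pool)
response = session.get(url, proxies={"http": proxy, "https": proxy})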
Random delays between requests:
import time
import random

time.sleep(random.uniform(1, 3))  # wait a random 1-3 seconds
Rate control: if the server starts returning the HTTP 429 Too Many Requests status code, slow down or back off and retry later (a backoff sketch follows the error-handling snippet below). Error handling:
try:
    response = session.get(url, timeout=10)
    response.raise_for_status()  # raise on HTTP error status codes
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
Data encryption: some sites encrypt request parameters or attach dynamically generated tokens (e.g. a token field), which may require reverse engineering the site's JavaScript to reproduce; a sketch for the simpler case where the token is embedded in the login page HTML follows the complete example below.

Complete example putting the pieces together:
import requests
from bs4 import BeautifulSoup
import csv
import time
import random

# Initialise the session
session = requests.Session()
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# Log in
login_url = "https://example.com/login"
data = {"username": "user", "password": "pass"}
session.post(login_url, data=data, headers=headers)

# Scrape the orders page by page
orders = []
for page in range(1, 5):
    url = f"https://example.com/orders?page={page}"
    response = session.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    for item in soup.select(".order-item"):
        order_id = item.select_one(".id").text.strip()
        order_date = item.select_one(".date").text.strip()
        orders.append({"id": order_id, "date": order_date})
    time.sleep(random.uniform(1, 3))  # random delay between pages

# Save the data
with open("orders.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "date"])
    writer.writeheader()
    writer.writerows(orders)

print("Scraping finished!")
If you need to optimise further or deal with more complex anti-scraping mechanisms (such as CAPTCHAs or dynamic tokens), consider the Scrapy framework or a commercial scraping tool such as Octoparse; a minimal Scrapy spider is sketched below.
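A minimal Scrapy spider sketch for the same order list (login handling is omitted; the CSS selectors and the a.next pagination link are assumptions):

import scrapy

class OrdersSpider(scrapy.Spider):
    name = "orders"
    start_urls = ["https://example.com/orders?page=1"]

    def parse(self, response):
        for item in response.css(".order-item"):
            yield {
                "id": item.css(".id::text").get(default="").strip(),
                "date": item.css(".date::text").get(default="").strip(),
            }
        # Follow the "next page" link if one exists
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)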