A summary of this Python web-scraping tutorial, from beginner to advanced. It is organized into three parts (beginner, intermediate, and practical) and covers the following topics:
Crawler basics
Network requests
Parsing web pages
Data storage
Handling anti-crawler measures
Advanced techniques
By working through these topics you can go from beginner to advanced: you will learn the basic principles of web scraping, the common tools, and the advanced techniques needed to handle real-world crawling tasks. Keep in mind that crawling must be done legally and responsibly: respect each site's crawling restrictions, and protect personal privacy and network security. Good luck on your scraping journey!
A web crawler is an automated program that mimics the behavior of a human browser to collect data from web pages. It sends HTTP requests, receives the page content, parses the page, and extracts the data it needs.
Crawlers are used in many fields, for example search engine indexing, price and public-opinion monitoring, and gathering data for analysis and research.
The workflow of a crawler can be summarized in a few steps: send an HTTP request to the target URL, receive the response, parse the returned HTML, extract the target data, store the data, and optionally follow links to further pages.
Python has many excellent libraries for crawling. Commonly used ones include Requests for sending HTTP requests, BeautifulSoup and lxml for parsing HTML, Selenium for rendering dynamic pages, and Scrapy as a full crawling framework. A minimal end-to-end example is sketched below.
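To make the steps above concrete, here is a minimal end-to-end sketch that ties them together: it fetches a page with Requests, parses it with BeautifulSoup, and writes the extracted title to a text file. The URL and output filename are placeholders for illustration, not part of the original tutorial.

import requests
from bs4 import BeautifulSoup

# 1. Send the request and get the page content
url = "http://example.com"  # placeholder URL
response = requests.get(url)
response.raise_for_status()

# 2. Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# 3. Extract the data we need (here, just the page title)
title = soup.title.get_text() if soup.title else ""

# 4. Store the result
with open("titles.txt", "w", encoding="utf-8") as f:
    f.write(title + "\n")
print("Saved title:", title)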
Example code for sending a GET request with the Requests library:
import requests
url = "http://example.com"
response = requests.get(url)
if response.status_code == 200:
    print(response.text)
else:
    print("Request failed")
Example code for sending a POST request with the Requests library:
import requests
url = "http://example.com"
data = {"username": "admin", "password": "123456"}
response = requests.post(url, data=data)
if response.status_code == 200:
    print(response.text)
else:
    print("Request failed")
Example code for working with the response data in Requests:
import requests
url = "http://example.com"
response = requests.get(url)
if response.status_code == 200:
    # Response headers
    headers = response.headers
    print(headers)
    # Raw response body (bytes)
    content = response.content
    print(content)
    # Response body decoded as text
    text = response.text
    print(text)
    # Save the response body to a file
    with open("example.html", "wb") as f:
        f.write(content)
else:
    print("Request failed")
Example code for parsing a static page with BeautifulSoup:
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Example</title></head>
<body>
<h1>Hello, World!</h1>
<p>This is an example page.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")
# Page title
title = soup.title.string
print(title)
# All visible text in the body
body = soup.body.get_text()
print(body)
# List items
items = soup.find_all("li")
for item in items:
    print(item.get_text())
Example code for parsing a dynamic page with Selenium (written against the current Selenium 4 API; the older executable_path argument and find_element_by_* helpers have been removed from recent versions):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
chrome_path = "path/to/chromedriver"
driver = webdriver.Chrome(service=Service(chrome_path))
url = "http://example.com"
driver.get(url)
# Page title
title = driver.title
print(title)
# All text in the body
body = driver.find_element(By.TAG_NAME, "body").text
print(body)
# List items
items = driver.find_elements(By.TAG_NAME, "li")
for item in items:
    print(item.text)
driver.quit()
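The main reason to use Selenium is that content on dynamic pages may be injected by JavaScript only after the initial page load. A short sketch of waiting for such content with WebDriverWait; the element ID "content" is a hypothetical placeholder, not something taken from the tutorial:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://example.com")
# Wait up to 10 seconds for the element to appear before reading it;
# "content" stands in for whatever element the page actually loads late.
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "content"))
)
print(element.text)
driver.quit()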
Example code for parsing a page with XPath (via lxml):
from lxml import etree
html = """
<html>
<head><title>Example</title></head>
<body>
<h1>Hello, World!</h1>
<p>This is an example page.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
</body>
</html>
"""
tree = etree.HTML(html)
# Page title
title = tree.xpath("//title/text()")
print(title[0])
# All text nodes inside the body
body = tree.xpath("//body//text()")
print("".join(body))
# Text of each list item
items = tree.xpath("//li/text()")
for item in items:
    print(item)
Example code for storing data in a text file:
data = "Hello, World!"
# Write the data to a file
with open("data.txt", "w") as f:
    f.write(data)
# Read the data back from the file
with open("data.txt", "r") as f:
    content = f.read()
print(content)
Example code for storing data in a database (SQLite):
import sqlite3
# Connect to (or create) the database file
conn = sqlite3.connect("example.db")
# Create the table if it does not exist yet
conn.execute("CREATE TABLE IF NOT EXISTS data (id INTEGER PRIMARY KEY AUTOINCREMENT, content TEXT)")
# Insert a row
content = "Hello, World!"
conn.execute("INSERT INTO data (content) VALUES (?)", (content,))
conn.commit()  # commit so the insert is actually written to disk
# Query the data back
cursor = conn.execute("SELECT * FROM data")
for row in cursor:
    print(row)
# Close the connection
conn.close()
Example code for storing data in an Excel file (requires pandas plus an Excel engine such as openpyxl):
import pandas as pd
data = {"Name": ["Alice", "Bob", "Charlie"],
        "Age": [25, 30, 35],
        "City": ["New York", "London", "Paris"]}
df = pd.DataFrame(data)
# Write the DataFrame to an Excel file
df.to_excel("data.xlsx", index=False)
# Read the data back from the Excel file
df = pd.read_excel("data.xlsx")
print(df)
Example code for disguising the request headers so the crawler looks like a normal browser:
import requests
url = "http://example.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36",
    "Referer": "http://example.com",
    "Cookie": "sessionid=abc123"
}
response = requests.get(url, headers=headers)
print(response.text)
Example code for routing requests through an IP proxy:
import requests
url = "http://example.com"
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888"
}
response = requests.get(url, proxies=proxies)
print(response.text)
Example code for handling a captcha by showing it to a human and submitting the answer. The captcha must be fetched and submitted within the same session so the cookies match; the submit URL here is a placeholder:
import requests
from PIL import Image
from io import BytesIO
session = requests.Session()
# Fetch the captcha image
captcha_url = "http://example.com/captcha.jpg"
response = session.get(captcha_url)
# Display the image and ask the user to type in the code
image = Image.open(BytesIO(response.content))
image.show()
code = input("Enter the captcha code: ")
# Submit the code to the form that requires it (placeholder URL)
submit_url = "http://example.com/submit"
data = {"code": code}
response = session.post(submit_url, data=data)
print(response.text)
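For simple image captchas, OCR can sometimes replace the manual step. A hedged sketch using the pytesseract wrapper; it assumes the Tesseract OCR engine is installed on the system, and it only works on easy captchas:

import requests
import pytesseract
from PIL import Image
from io import BytesIO

# Download the captcha image (placeholder URL)
response = requests.get("http://example.com/captcha.jpg")
image = Image.open(BytesIO(response.content))
# Convert to grayscale to give the OCR engine a cleaner input
image = image.convert("L")
# Let Tesseract guess the characters; real captchas are often designed to defeat this
code = pytesseract.image_to_string(image).strip()
print("OCR guess:", code)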
Example code for handling a login:
import requests
login_url = "http://example.com/login"
data = {
    "username": "admin",
    "password": "123456"
}
# Use a Session so the login cookies are kept for later requests
session = requests.Session()
response = session.post(login_url, data=data)
if response.status_code == 200:
    # Continue crawling as the logged-in user
    print("Login successful")
else:
    # Handle the failed login
    print("Login failed")
Example code for respecting crawl-rate limits by pausing between requests (see also the robots.txt check sketched after this example):
import time
import requests
url = "http://example.com"
# Minimum interval between requests, in seconds
interval = 1
while True:
    response = requests.get(url)
    if response.status_code == 200:
        # Process the response
        print(response.text)
    else:
        # Handle the failed request
        print("Request failed")
    # Sleep before sending the next request
    time.sleep(interval)
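Respecting a site's crawling restrictions also means checking its robots.txt. A minimal sketch using the standard-library urllib.robotparser; the user-agent name and URLs are placeholders:

from urllib import robotparser

# Download and parse the site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# Only crawl a page if the rules allow it for our user agent
user_agent = "MyCrawler"  # placeholder user-agent name
page = "http://example.com/some/page"
if rp.can_fetch(user_agent, page):
    print("Allowed to fetch", page)
else:
    print("Disallowed by robots.txt:", page)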
Example code for exception handling:
import requests
url = "http://example.com"
try:
    response = requests.get(url)
    response.raise_for_status()
    print(response.text)
except requests.exceptions.RequestException as e:
    print("Request error:", e)
except Exception as e:
    print("Unexpected error:", e)
Example code for a multithreaded crawler (see also the thread-pool variant sketched after this example):
import requests
import threading
url = "http://example.com"
def fetch_data():
    response = requests.get(url)
    if response.status_code == 200:
        print(response.text)
    else:
        print("Request failed")
# Start 10 worker threads
threads = []
for _ in range(10):
    t = threading.Thread(target=fetch_data)
    threads.append(t)
    t.start()
# Wait for all threads to finish
for t in threads:
    t.join()
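The same idea is often written more idiomatically with concurrent.futures from the standard library, which manages the threads for you. A sketch that fetches a list of placeholder URLs with a thread pool:

import requests
from concurrent.futures import ThreadPoolExecutor

# Placeholder URLs; in a real crawler these would come from a seed list or a queue
urls = [f"http://example.com/page/{i}" for i in range(10)]

def fetch(url):
    response = requests.get(url)
    return url, response.status_code

# Run up to 5 downloads at a time and collect the results as they complete
with ThreadPoolExecutor(max_workers=5) as executor:
    for url, status in executor.map(fetch, urls):
        print(url, status)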
Example code for scaling out with multiple processes (a simple stand-in for a distributed crawler; a true distributed crawler would coordinate workers across machines, for example through a shared task queue):
import requests
from multiprocessing import Process, Queue

url = "http://example.com"

def fetch_data(queue):
    response = requests.get(url)
    if response.status_code == 200:
        queue.put(response.text)
    else:
        queue.put("Request failed")

# The __main__ guard is required so that child processes can import this module safely
if __name__ == "__main__":
    # Queue for passing results between processes
    queue = Queue()
    # Start multiple worker processes
    processes = []
    for _ in range(10):
        p = Process(target=fetch_data, args=(queue,))
        processes.append(p)
        p.start()
    # Collect one result per process
    results = []
    for _ in range(10):
        result = queue.get()
        results.append(result)
    # Wait for the processes to finish
    for p in processes:
        p.join()
    # Process the results
    for result in results:
        print(result)
The above covers the common crawling techniques together with example code. I hope it helps; if you have any other questions, feel free to ask.