In data-driven development, structured storage is a key link in the data-processing pipeline. CSV (Comma-Separated Values) is a lightweight, cross-platform file format widely used for data exchange, logging, and small-to-medium-scale data storage. Compared with a database or JSON, CSV is plain text, human-readable, and supported by virtually every tool.
Goal of this article: using Python's csv module, walk through reading and writing CSV files, handling complex data (such as nested fields and special characters), performance optimization, and directly reusable code templates.
Basic rules: each line is one record; fields are separated by commas; fields that contain commas, double quotes, or line breaks are wrapped in double quotes; the first line usually holds the column headers.
Sample file data.csv:
```
id,name,email,score
1,张三,[email protected],95
2,李四,"[email protected]",88
3,王五,[email protected],"92"
```
| Scenario | Description |
| --- | --- |
| Data export / backup | Bulk-export structured data from a database or API |
| Analysis preprocessing | Pair with Pandas for statistics and visualization |
| Cross-system data exchange | Compatible with tools such as Excel, R, and MATLAB |
Python ships with the csv module; no extra installation is required.
```python
import csv

headers = ["id", "name", "email"]
data = [
    [1, "张三", "[email protected]"],
    [2, "李四", "[email protected]"],
    [3, "王五", "[email protected]"],
]

# newline="" prevents extra blank lines on Windows
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(headers)   # write the header row
    writer.writerows(data)     # write all data rows in one call
```
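If more rows arrive later, the same file can be extended by opening it in mode `"a"` instead of `"w"`. A minimal sketch (it assumes output.csv was created by the snippet above, though `"a"` also creates the file if it does not exist):

```python
import csv

# Mode "a" appends to the existing file instead of truncating it;
# the header was already written by the original "w" pass.
with open("output.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow([4, "赵六", "[email protected]"])
```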
```python
with open("data.csv", "r", encoding="utf-8") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

# Output (note that the reader strips the quotes around "92"):
# ['id', 'name', 'email', 'score']
# ['1', '张三', '[email protected]', '95']
# ['2', '李四', '[email protected]', '88']
# ['3', '王五', '[email protected]', '92']
```
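When the header row is not wanted in the loop, calling next() on the reader consumes it first. A small self-contained sketch (it recreates an abbreviated data.csv so it runs on its own):

```python
import csv

# Recreate an abbreviated data.csv so this snippet is self-contained.
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([
        ["id", "name", "email", "score"],
        ["1", "张三", "[email protected]", "95"],
    ])

# next() consumes the header line, so the loop sees only data rows.
with open("data.csv", "r", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        print(row[1], row[3])  # name, score
```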
```python
# Write dictionary data
with open("dict_output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "email"])
    writer.writeheader()
    writer.writerow({"id": 101, "name": "赵六", "email": "[email protected]"})

# Read rows back as dictionaries
with open("dict_output.csv", "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row["name"], row["email"])
```
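One thing to keep in mind: DictReader always yields string values, so numeric fields must be converted back explicitly. A sketch reusing the dict_output.csv example from above:

```python
import csv

# Write the same single-row file as above.
with open("dict_output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "email"])
    writer.writeheader()
    writer.writerow({"id": 101, "name": "赵六", "email": "[email protected]"})

# DictReader yields strings; cast the id field back to int while reading.
with open("dict_output.csv", "r", encoding="utf-8") as f:
    records = [{**row, "id": int(row["id"])} for row in csv.DictReader(f)]

print(records[0]["id"] + 1)  # arithmetic works: 102
```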
```python
# Use a semicolon as the delimiter and wrap every field in double quotes
with open("custom.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";", quoting=csv.QUOTE_ALL)
    writer.writerow(["id", "name"])
    writer.writerow([1, "张三"])
```
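To read such a file back, the reader needs the same delimiter; alternatively, csv.Sniffer can detect the dialect from a sample of the file. A sketch under that assumption (it rewrites custom.csv first so it runs standalone):

```python
import csv

# Write the semicolon-delimited, fully quoted file from above.
with open("custom.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";", quoting=csv.QUOTE_ALL)
    writer.writerow(["id", "name"])
    writer.writerow([1, "张三"])

# Let Sniffer infer the delimiter and quoting from a sample, then
# rewind and read with the detected dialect.
with open("custom.csv", "r", newline="", encoding="utf-8") as f:
    dialect = csv.Sniffer().sniff(f.read(1024))
    f.seek(0)
    rows = list(csv.reader(f, dialect=dialect))

print(dialect.delimiter)
```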
```python
# Commas and line breaks inside fields are quoted automatically
data = [
    [4, "Alice, Smith", "[email protected]"],
    [5, "Bob\nJohnson", "[email protected]"],
]
with open("special_chars.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerows(data)
```
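A round trip confirms that the quoting works: the embedded comma and line break come back as part of a single field rather than splitting the record.

```python
import csv

# Write fields containing a comma and a line break, then read them back.
data = [
    [4, "Alice, Smith", "[email protected]"],
    [5, "Bob\nJohnson", "[email protected]"],
]
with open("special_chars.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, quoting=csv.QUOTE_NONNUMERIC).writerows(data)

with open("special_chars.csv", "r", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print(repr(rows[1][1]))  # the line break survives inside one field
```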
```python
import csv
import json

data = [
    {"id": 1, "info": '{"age": 30, "city": "北京"}'},
    {"id": 2, "info": '{"age": 25, "city": "上海"}'},
]

# Write rows whose "info" field is a JSON string
with open("nested_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "info"])
    writer.writeheader()
    writer.writerows(data)

# Read the file back and parse the JSON field
with open("nested_data.csv", "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        info = json.loads(row["info"])
        print(f"ID: {row['id']}, city: {info['city']}")
```
```python
# Read a large file line by line
with open("large_data.csv", "r", encoding="utf-8") as f:
    reader = csv.reader(f)
    for row in reader:
        process(row)  # custom per-row processing function
```
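process() above is just a placeholder. As a concrete sketch, here is a streaming aggregation that keeps only one row in memory at a time; the generated sample file and its column layout are illustrative assumptions:

```python
import csv

# Generate a sample file with 10,000 rows of (id, score).
with open("large_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "score"])
    writer.writerows([i, i % 100] for i in range(10_000))

# Stream the file: only the current row lives in memory, so the
# approach scales to files far larger than RAM.
total = 0
count = 0
with open("large_data.csv", "r", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header
    for row in reader:
        total += int(row[1])
        count += 1

print(count, total / count)
```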
For more complex operations (such as data cleaning and aggregation), Pandas can do the heavy lifting before exporting the result back to CSV.
```python
import pandas as pd

# Read the CSV into a DataFrame
df = pd.read_csv("data.csv")

# Keep only records with a score above 90
filtered = df[df["score"] > 90]

# Export without the index column
filtered.to_csv("high_score.csv", index=False)
```
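For files too large to load at once, pandas can also stream with the chunksize parameter of read_csv, which yields DataFrames of a fixed number of rows. A sketch with a generated sample file (the filename and columns are illustrative):

```python
import csv

import pandas as pd

# Generate a sample file with 5,000 rows of (id, score in 50..99).
with open("scores.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "score"])
    writer.writerows([i, 50 + i % 50] for i in range(5000))

# chunksize=1000 yields DataFrames of up to 1,000 rows each,
# so memory use stays bounded regardless of file size.
high = 0
for chunk in pd.read_csv("scores.csv", chunksize=1000):
    high += (chunk["score"] > 90).sum()

print(high)
```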
Goal: store the scraped book information as a CSV file.
```python
import csv

import requests
from bs4 import BeautifulSoup

url = "https://book.douban.com/top250"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

# Extract the title and rating of each book entry
books = []
for item in soup.select("tr.item"):
    title = item.select_one(".pl2 a")["title"]
    score = item.select_one(".rating_nums").text
    books.append({"title": title, "score": score})

# Write the scraped records to CSV
with open("douban_books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "score"])
    writer.writeheader()
    writer.writerows(books)

print("Data saved to douban_books.csv")
```
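One common pitfall when handing such a file to others: Excel on Windows may display Chinese text incorrectly if the CSV is saved as plain UTF-8. Writing with the "utf-8-sig" encoding prepends a BOM that Excel recognizes. A minimal sketch (the filename and sample row are illustrative):

```python
import csv

# "utf-8-sig" prepends a byte-order mark so Excel detects the
# encoding and renders non-ASCII text correctly.
with open("douban_books_excel.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "score"])
    writer.writeheader()
    writer.writerow({"title": "红楼梦", "score": "9.6"})
```

Python itself reads such files transparently when opened with encoding="utf-8-sig", which strips the BOM again.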
Earlier posts in this series:

| Topic | Article |
| --- | --- |
| Introduction to Python web scraping | Python爬虫(1)Python爬虫:从原理到实战,一文掌握数据采集核心技术 |
| HTTP protocol parsing | Python爬虫(2)Python爬虫入门:从HTTP协议解析到豆瓣电影数据抓取实战 |
| Core HTML techniques | Python爬虫(3)HTML核心技巧:从零掌握class与id选择器,精准定位网页元素 |
| Core CSS mechanisms | Python爬虫(4)CSS核心机制:全面解析选择器分类、用法与实战应用 |
| Static page scraping in practice | Python爬虫(5)静态页面抓取实战:requests库请求头配置与反反爬策略详解 |
| Static page parsing in practice | Python爬虫(6)静态页面解析实战:BeautifulSoup与lxml(XPath)高效提取数据指南 |