The previous post showed how to generate test data with pure SQL, but SQL alone is limited. This post introduces the Python Faker library.
from faker import Faker
fake = Faker()  # default locale is English (en_US)
fake_zh_cn = Faker(locale='zh_CN')  # Chinese locale
print(fake_zh_cn.name())  # random Chinese name
print(fake.file_path())  # random file path
print(fake_zh_cn.file_path())  # random Chinese file path
print(fake.ssn())  # random US Social Security number
print(fake_zh_cn.ssn())  # random Chinese resident ID number
For the other kinds of data Faker can generate, see its API documentation.
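With a data generator in hand, the next piece is the container for the data: dataclass (part of the standard library since Python 3.7; a backport package is available for 3.6). A dataclass implements the basics of a data class very concisely, simplifying initialization, display (repr, print), serialization (for example to JSON), and so on. Build a class with the fields from the requirements and specify how each field is initialized, as follows: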
import random
from dataclasses import dataclass, field, asdict, astuple
from faker import Faker

fake = Faker()                      # default locale (en_US)
fake_zh_cn = Faker(locale='zh_CN')  # Chinese locale

@dataclass
class Emp:
    # dataclass fields
    badge: str = field(default_factory=lambda: str(random.randrange(1, 8000)).zfill(5))
    name: str = field(default_factory=fake_zh_cn.name)
    ename: str = field(default_factory=fake.name)
    department: str = field(default_factory=lambda: random.choice(Emp.dept))
    job: str = field(default_factory=fake.job)
    idcard: str = field(default_factory=fake_zh_cn.ssn)
    phone: str = field(default_factory=fake_zh_cn.phone_number)
    gender: str = field(init=False)    # derived in __post_init__
    birthday: str = field(init=False)  # derived in __post_init__
    postcode: str = field(default_factory=fake.postalcode)
    email: str = field(default_factory=fake.email)
    workcity: str = field(default_factory=fake_zh_cn.city)
    address: str = field(default_factory=fake_zh_cn.address)

    # Helper data: no type annotation, so this is a class attribute, not a dataclass field
    dept = ['IT部', '人事部', '财务部', '采购部', '运营部', '市场部', '销售部', '客服部']

    # Derived fields: gender and birthday are read from the ID number
    def __post_init__(self):
        # The 17th digit is odd for men and even for women
        self.gender = '男' if int(self.idcard[16:17]) % 2 else '女'
        # Digits 7 to 14 hold the birth date as YYYYMMDD
        self.birthday = self.idcard[6:14]
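The slicing in `__post_init__` relies on the layout of an 18-character resident ID number: a 6-digit region code, an 8-digit birth date, a 3-digit sequence number whose last digit is odd for men, and a final check character. The sketch below isolates that step. The function name `parse_id_number` and the sample ID are made up for illustration, and converting the birthday to a `date` object is an optional extra, not something the class above does.

```python
from datetime import date, datetime
from typing import Tuple


def parse_id_number(id_number: str) -> Tuple[str, date]:
    """Return (gender, birthday) derived from an 18-character ID number."""
    if len(id_number) != 18:
        raise ValueError("expected an 18-character ID number")
    # Characters 7 to 14 (index 6..13) hold the birth date as YYYYMMDD.
    birthday = datetime.strptime(id_number[6:14], "%Y%m%d").date()
    # The 17th character (index 16) is odd for men, even for women.
    gender = "男" if int(id_number[16]) % 2 else "女"
    return gender, birthday


# Made-up ID number, used only to exercise the slicing.
sample_id = "110101199003071234"
gender, birthday = parse_id_number(sample_id)
print(gender, birthday.isoformat())
```

Parsing with `strptime` also rejects impossible dates, which a plain string slice would let through.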
Create an instance to try it out:
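With the `Emp` class above, `print(Emp())` displays one generated record. The same mechanics can be tried without Faker installed. The sketch below uses a hypothetical `MiniEmp` class with only standard-library generators. The class name, the `label` field, and the department list are assumptions made for this illustration.

```python
import random
from dataclasses import dataclass, field, asdict, astuple


@dataclass
class MiniEmp:
    # default_factory is called once per instance, so every object gets fresh values.
    badge: str = field(default_factory=lambda: str(random.randrange(1, 8000)).zfill(5))
    department: str = field(default_factory=lambda: random.choice(MiniEmp.dept))
    # Not accepted by __init__; filled in by __post_init__ below.
    label: str = field(init=False)

    # No annotation, so this is a plain class attribute, not a dataclass field.
    dept = ["IT", "HR", "Finance"]

    def __post_init__(self):
        self.label = f"{self.department}-{self.badge}"


emp = MiniEmp()
print(emp)           # repr generated by @dataclass
print(asdict(emp))   # dict keyed by field name
print(astuple(emp))  # values in field declaration order

# Any field with init=True can still be set explicitly.
fixed = MiniEmp(badge="00042", department="HR")
print(fixed.label)
```

Passing explicit values like this is handy when a test needs a few known records among the random ones.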
The result looks good. Next, generate records in bulk and persist them.
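import csv
# Save as a pipe-delimited CSV file.
# encoding is set explicitly so the Chinese text does not depend on the platform default.
with open(r'D:\Temp\emp_fake.csv', 'w', newline='', encoding='utf-8') as f:
    csv_writer = csv.writer(f, delimiter='|')
    for _ in range(5000):
        csv_writer.writerow(astuple(Emp()))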
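The CSV written above has no header row, so a reader has to know the column order in advance. One possible variation, sketched below, takes the header from the dataclass itself with `dataclasses.fields` and writes through `csv.DictWriter`. The `Row` class and its two records are hypothetical stand-ins for `Emp`, and an in-memory buffer replaces the file so the example runs anywhere.

```python
import csv
import io
from dataclasses import dataclass, fields, asdict


@dataclass
class Row:  # hypothetical stand-in for Emp
    badge: str
    name: str
    department: str


rows = [Row("00001", "张伟", "IT部"), Row("00002", "王芳", "财务部")]

buffer = io.StringIO(newline="")  # in-memory stand-in for the CSV file
writer = csv.DictWriter(
    buffer, fieldnames=[f.name for f in fields(Row)], delimiter="|"
)
writer.writeheader()                       # first line: badge|name|department
writer.writerows(asdict(r) for r in rows)  # one line per record

# Read the data back; DictReader picks the column names up from the header.
buffer.seek(0)
loaded = list(csv.DictReader(buffer, delimiter="|"))
print(loaded[0])
```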
import json
# Save all records as a single JSON array.
emp = [asdict(Emp()) for _ in range(5000)]
with open(r'D:\Temp\emp_fake_json_list.json', 'w', encoding='utf-8') as f:
    json.dump(emp, f, ensure_ascii=False, indent=4)
import json
# Save as JSON Lines: one JSON object per line.
with open(r'D:\Temp\emp_fake_json_line.json', 'w', encoding='utf-8') as f:
    for _ in range(5000):
        json.dump(asdict(Emp()), f, ensure_ascii=False)
        f.write('\n')
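The two JSON variants above differ in how they are read back. The array file is loaded with one `json.load` call, while the line-per-record file is parsed one line at a time, which also makes it easy to append to or stream. A minimal round-trip sketch of the second form follows, using an in-memory buffer and two made-up records instead of the real file.

```python
import io
import json

# Made-up records standing in for asdict(Emp()).
records = [
    {"badge": "00001", "name": "张伟"},
    {"badge": "00002", "name": "王芳"},
]

buffer = io.StringIO()  # in-memory stand-in for the output file
for record in records:
    # ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes.
    buffer.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read back: each non-empty line is an independent JSON document.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
print(loaded == records)
```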
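To save to a database, you can skip the dataclass above and instead use a Python ORM framework (such as SQLAlchemy) to define a class with the required fields, assign values, and save, all without writing any SQL. You can also build and execute INSERT statements in a loop. The shortcut used below converts the data to a pandas DataFrame as an intermediate step and then writes it to the database: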
import pandas as pd
from sqlalchemy.types import NVARCHAR, DATE
from sqlalchemy import create_engine
# Database connection string (ODBC DSN)
db_con_str = 'mssql+pyodbc://@AdventureWorks2012'
engine = create_engine(db_con_str)
# Generate the data and convert it to a list of dicts
emp = [asdict(Emp()) for _ in range(5000)]
columns = ['badge', 'ename', 'name', 'department', 'job', 'idcard', 'phone',
           'gender', 'birthday', 'postcode', 'email', 'workcity', 'address']
df = pd.DataFrame(emp, columns=columns)
# Column types: NVARCHAR for everything except birthday
dtype = {column: NVARCHAR(2000) for column in df.columns}
# birthday is a 'YYYYMMDD' string; this relies on the database converting it for the DATE column
dtype['birthday'] = DATE
# Save to the database, replacing the table if it already exists
df.to_sql('employee', con=engine, if_exists='replace', dtype=dtype, index=False)
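The direct INSERT route mentioned earlier could look roughly like the sketch below. It swaps SQL Server for the standard-library `sqlite3` module and an in-memory database so that it runs without a server. The table layout and the two rows are made up for illustration. With pyodbc the pattern is similar, though the connection setup differs.

```python
import sqlite3

# Made-up rows standing in for astuple(Emp()) output.
rows = [
    ("00001", "张伟", "IT部", "19900307"),
    ("00002", "王芳", "财务部", "19851120"),
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE employee (badge TEXT, name TEXT, department TEXT, birthday TEXT)"
)
# Placeholders (?) let the driver handle quoting instead of string formatting.
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(count)
conn.close()
```

Parameterized statements avoid quoting problems with generated names and addresses, which can contain apostrophes or other special characters.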