When working with data in Python, large files can take a long time to read. Processed data is also often reused as a database or base file for later programs and operations. Re-reading a huge file from scratch and processing it again wastes time and effort, so persisting structured data is a good habit to develop.
In Python, the following approaches can be used to persist structured data.
The pickle module can save multiple structured objects at once.
import pickle

# create objects
obj01, obj02, obj03 = [1, 2], [4, 5], [6, 7]

# save: the with statement closes the file automatically
with open("test.pickle", "wb") as fout:
    pickle.dump([obj01, obj02, obj03], fout)

# load
with open("test.pickle", "rb") as fin:
    l01, l02, l03 = pickle.load(fin)
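Besides writing to a file, pickle can also serialize to an in-memory bytes object with dumps and loads, which is convenient when the data is passed between processes or over a network rather than stored on disk:

```python
import pickle

# serialize an object to a bytes string in memory instead of a file
payload = pickle.dumps({"a": [1, 2], "b": (3, 4)})

# deserialize it back into an equal Python object
restored = pickle.loads(payload)
print(restored)  # {'a': [1, 2], 'b': (3, 4)}
```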
In Python 3, the cPickle module was renamed to _pickle. Its read and write speed is better than that of the pure-Python pickle implementation. (Note that the standard pickle module already uses the C implementation automatically when it is available, so importing _pickle directly is rarely necessary.)
import _pickle as cpickle

# create objects
obj01, obj02, obj03 = [1, 2], [4, 5], [6, 7]

# save
with open("test.pickle", "wb") as fout:
    cpickle.dump([obj01, obj02, obj03], fout)

# load
with open("test.pickle", "rb") as fin:
    l01, l02, l03 = cpickle.load(fin)
joblib is used to save trained model objects. It used to ship with scikit-learn as sklearn.externals.joblib, but that import path has been removed from recent versions of scikit-learn; joblib is now installed and imported as a standalone package.
import joblib
import pandas as pd
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# load data
iris_dataset = load_iris()
rawdata = pd.DataFrame(iris_dataset['data'], columns=['x0', 'x1', 'x2', 'x3'])
rawlabel = pd.DataFrame(iris_dataset['target'], columns=['label'])
dt_model = DecisionTreeClassifier()

# split the data into training and testing sets
train_X, test_X, train_y, test_y = train_test_split(rawdata, rawlabel, test_size=0.3, random_state=0)

# train the model (ravel turns the one-column label DataFrame into a 1-D array)
dt_model.fit(X=train_X, y=train_y.values.ravel())

# save
joblib.dump(dt_model, 'dt_model.model')

# load
dt_model_load = joblib.load('dt_model.model')
print(metrics.classification_report(test_y, dt_model_load.predict(X=test_X)))
Interested readers can also look into shelve, dill, PMML, and other tools for saving structured data objects, and compare their respective strengths and weaknesses.
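As a small taste of one of these alternatives, here is a minimal sketch using the standard-library shelve module (the file name test_shelve is arbitrary). shelve exposes a persistent, dictionary-like store whose keys are strings and whose values are pickled automatically:

```python
import shelve

# write: values are pickled transparently under string keys
with shelve.open("test_shelve") as db:
    db["obj01"] = [1, 2]
    db["obj02"] = [4, 5]

# read back in a later session
with shelve.open("test_shelve") as db:
    l01 = db["obj01"]
print(l01)  # [1, 2]
```

Compared with pickling a whole list, shelve lets you read or update a single keyed object without loading everything into memory.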