`data` is a pandas.DataFrame read from MySQL. Some of its columns are int in the database but contain NULLs; after extraction, pandas automatically upcasts those ints to float, so 10 becomes 10.0, 15902912345 becomes 1.5902912345E10, and NULL becomes NaN. Because of the NaN values, such a column cannot be converted back to int with DataFrame.astype().
Our goal is to make the data extracted by pandas "look" as close as possible to what is stored in the database: columns that were int but got upcast to float need to be converted back, and NULLs that pandas turned into NaN need to be replaced with empty strings. Since a pandas int column cannot hold missing values, these columns are uniformly converted to string (object) type.
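A quick way to see the upcast in action (a minimal sketch with made-up values):

```python
import pandas as pd

# An int column that contains NULLs comes back from the database as float64,
# because NaN (pandas' missing-value marker) is itself a float:
s = pd.Series([483, None, 8857])
print(s.dtype)    # float64 -- 483 is now displayed as 483.0

# Converting back with astype() fails while NaN is still present:
try:
    s.astype('int64')
except ValueError as e:
    print('astype failed:', e)
```

The `ValueError` ("Cannot convert non-finite values (NA or inf) to integer") is exactly why the cleanup below first fills the NaN cells with a placeholder.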
The structure of `data` is as follows:
id | switch | market | name | update_man | update_time |
---|---|---|---|---|---|
1 | 1 | 0027 | 成都 | 483 | 1541572909 |
To make these columns display the way the original int did (as object dtype, effectively str, so 10 still shows as 10), and to turn NaN into empty strings, take the following steps:
First take a NaN mask. The `na` frame below holds only True and False, but its shape is identical to `data`; the positions that were NaN hold True:
na = data.isna()
data['update_man'] = data['update_man'].fillna(0)
data['update_time'] = data['update_time'].fillna(0)
data['update_man'] = data['update_man'].astype('int64')
data['update_time'] = data['update_time'].astype('int64')
data['update_man'] = data['update_man'].astype('object')
data['update_time'] = data['update_time'].astype('object')
data.loc[na['update_man'], 'update_man'] = ''
data.loc[na['update_time'], 'update_time'] = ''
`data` before the replacement (note: the update_man and update_time columns are stored as int in the database, but because they contain NULLs, pandas reads them back as float):
[6 rows x 6 columns]
id switch ... update_man update_time
0 1 1 ... 483.0 1.541572e+09
1 2 0 ... 1571.0 1.541577e+09
2 3 0 ... NaN NaN
3 4 0 ... 8857.0 1.541678e+09
4 5 0 ... NaN NaN
5 6 0 ... NaN NaN
`data` after processing:
id switch ... update_man update_time
0 1 1 ... 483 1541571712
1 2 0 ... 1571 1541576866
2 3 0 ...
3 4 0 ... 8857 1541677813
4 5 0 ...
5 6 0 ...
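The whole sequence can be checked on a toy frame (a minimal sketch; the values are made up to match the example above):

```python
import pandas as pd

data = pd.DataFrame({'update_man': [483.0, 1571.0, None],
                     'update_time': [1541572909.0, 1541576866.0, None]})

na = data.isna()                              # True where the value was NaN
for col in ['update_man', 'update_time']:
    data[col] = data[col].fillna(0)           # placeholder so astype() can run
    data[col] = data[col].astype('int64')     # drop the trailing .0
    data[col] = data[col].astype('object')    # keep the plain "483" look
    data.loc[na[col], col] = ''               # restore NULL cells as ''

print(data)
```

The mask must be taken before `fillna`, otherwise the information about which cells were originally NULL is lost.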
The full implementation:
@staticmethod
def cleanpddata(pddata, coltype):
    """
    :usage: if a column in the db is int and contains NULL, it comes back as
            float in the pandas dataframe after extraction. This causes
            failures when comparing dataframes between pandas and spark.
            This function changes the dataframe back toward the original
            'look' of the table in the db.
            Currently it only reverses the automatic int-to-float change.
    exp:
        data of pandas dataframe before the change:
           id  switch  ...  update_man   update_time
        0   1       1  ...       483.0  1.541572e+09
        1   2       0  ...      1571.0  1.541577e+09
        2   3       0  ...         NaN           NaN
        3   4       0  ...      8857.0  1.541678e+09
        data of pandas dataframe after the reverse:
           id  switch  ...  update_man  update_time
        0   1       1  ...         483   1541571712
        1   2       0  ...        1571   1541576866
        2   3       0  ...
        3   4       0  ...        8857   1541677813
    :param pddata: pandas dataframe of table data
    :param coltype: column types of the table schema, see DataBase.get_origin_table_schema
    :return: cleaned pandas dataframe
    """
    na = pddata.isna()
    pdtype = pddata.dtypes.astype(str).to_frame('DATA_TYPE').reset_index()
    pdtype.columns = ['COLUMN_NAME', 'DATA_TYPE']
    # columns whose type in the db is int but whose dtype became float after
    # extraction by pandas; note this relies on coltype and pdtype listing
    # the columns in the same order
    col_needclean = coltype.loc[
        coltype['DATA_TYPE'].str.contains('int') &
        pdtype['DATA_TYPE'].str.contains('float'),
        'COLUMN_NAME']
    # restore the int look and put '' back where the db had NULL
    for col in col_needclean.values:
        col_name = str(col)
        pddata[col_name] = pddata[col_name].fillna(0)
        pddata[col_name] = pddata[col_name].astype('int64')
        pddata[col_name] = pddata[col_name].astype('object')
        pddata.loc[na[col_name], col_name] = ''
    # for object-dtype columns, replace NaN with '' and strip embedded line breaks
    col_needclean = coltype.loc[
        pdtype['DATA_TYPE'].str.contains('object'),
        'COLUMN_NAME']
    for col in col_needclean.values:
        col_name = str(col)
        pddata[col_name] = pddata[col_name].fillna('')
        pddata[col_name] = pddata[col_name].apply(
            lambda x: x.replace('\n', '').replace('\r', '') if isinstance(x, str) else x)
    return pddata
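To see how the dtype matching works in isolation, here is a standalone sketch with a hand-built `coltype` mimicking the output of DataBase.get_origin_table_schema. Like the function above, it assumes `coltype` and the dataframe list the columns in the same order:

```python
import pandas as pd

pddata = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b'],
                       'update_man': [483.0, None]})
coltype = pd.DataFrame({'COLUMN_NAME': ['id', 'name', 'update_man'],
                        'DATA_TYPE': ['int', 'varchar', 'int']})

# dtypes of the dataframe, in the same shape as coltype
pdtype = pddata.dtypes.astype(str).to_frame('DATA_TYPE').reset_index()
pdtype.columns = ['COLUMN_NAME', 'DATA_TYPE']

# int in the database but float in the dataframe -> needs cleaning
mask = (coltype['DATA_TYPE'].str.contains('int')
        & pdtype['DATA_TYPE'].str.contains('float'))
print(coltype.loc[mask, 'COLUMN_NAME'].tolist())  # ['update_man']
```

`id` is int in both places, so it is skipped; only `update_man` was upcast and gets selected.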
For reference, here is DataBase.get_origin_table_schema, whose return value is what the coltype parameter of cleanpddata expects; it should help in understanding the code above:
def get_origin_table_schema(self, schema, tablename):
    """
    :usage: get the original table schema, used to check whether data
            properties have been changed by the pandas dataframe
    :param schema: schema name of the table
    :param tablename: table name
    :return: dataframe of the table schema, exp:
          COLUMN_NAME DATA_TYPE
        0          id       int
        1      switch       int
        2    supplier   varchar
    """
    if isinstance(tablename, list):
        tablename = tablename[0]
    sql = ("select COLUMN_NAME, DATA_TYPE from information_schema.columns "
           "where table_schema = '" + schema + "' "
           "and table_name = '" + tablename + "';")
    tb_type = self.get_data_dataframe(sql)
    return tb_type
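String concatenation like the above is fragile if an identifier ever contains a quote. If get_data_dataframe cannot be switched to parameterized queries, a small guard helps; `build_schema_sql` below is a hypothetical helper, not part of the original class:

```python
def build_schema_sql(schema, tablename):
    # hypothetical helper: reject quotes in identifiers before interpolating
    # them into the SQL text (a cheap defence, not a substitute for real
    # parameterized queries)
    for ident in (schema, tablename):
        if "'" in ident or '"' in ident:
            raise ValueError('suspicious identifier: %r' % ident)
    return ("select COLUMN_NAME, DATA_TYPE from information_schema.columns "
            "where table_schema = '%s' and table_name = '%s';"
            % (schema, tablename))

print(build_schema_sql('mydb', 'orders'))
```

Where the driver supports it, passing `schema` and `tablename` as bound parameters is the safer option.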