Using pandas.DataFrame.isna for value replacement (a neat trick)

Background

data is a pandas.DataFrame read from MySQL. Some of its columns are int in the database and contain NULLs, so when pandas extracts them into the DataFrame it automatically promotes int to float: 10 becomes 10.0, 15902912345 becomes 1.5902912345E10, and NULL becomes NaN. Because such a column contains NaN, it cannot be converted back to int with DataFrame.astype().

Our goal is to make the extracted data "look" as close as possible to the database: columns that were int but got promoted to float should be converted back, and NULLs that pandas turned into NaN should become empty strings. Since a plain pandas int column cannot hold missing values, the cleaned columns are stored uniformly as strings.
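This coercion is easy to reproduce in isolation; the sketch below (with made-up values) shows how a single NULL changes a column's dtype:

```python
import pandas as pd

# Without missing values the column stays int64.
s_full = pd.Series([10, 15902912345])
print(s_full.dtype)   # int64

# One NULL forces the whole column to float64: 10 becomes 10.0 and
# 15902912345 is displayed as 1.5902912345e+10.
s_null = pd.Series([10, 15902912345, None])
print(s_null.dtype)   # float64

# astype('int64') now fails, because NaN has no integer representation.
try:
    s_null.astype('int64')
except (ValueError, TypeError) as err:
    print('cannot cast:', err)
```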

data has the following structure:

id  switch  market  name     update_man  update_time
1   1       0027    Chengdu  483         1541572909

To restore an int column's original presentation (as object dtype, effectively str, so 10 still displays as 10) and turn NaN into empty strings, take the following steps:

  1. Build a new DataFrame recording which cells of data are NaN
  2. In each column that needs cleaning, replace NaN with a float-compatible placeholder such as 0 (used below)
  3. Cast the column to int: 10.0 becomes 10, 1.5902912345E10 becomes 15902912345
  4. Cast the column to object, so every value is kept as-is like a str
  5. Using the NaN mask as an index, replace the placeholder 0 cells in the original DataFrame with empty strings

Implementation:

na below contains only True and False, but its shape is identical to data; the positions that were NaN in data hold True:

na = data.isna()

# Fill NaN with a numeric placeholder, cast back to int64 (10.0 -> 10,
# 1.5902912345E10 -> 15902912345), then to object so the values keep
# their integer presentation.
data['update_man'] = data['update_man'].fillna(0).astype('int64').astype('object')
data['update_time'] = data['update_time'].fillna(0).astype('int64').astype('object')

# Use the saved NaN mask to blank out the placeholder cells.
data.loc[na['update_man'], 'update_man'] = ''
data.loc[na['update_time'], 'update_time'] = ''
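The whole pipeline can be run end-to-end on a self-contained toy frame (column values here are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`.
data = pd.DataFrame({'update_man': [483.0, np.nan, 8857.0]})

# Step 1: remember where the NaNs were.
na = data.isna()

# Steps 2-4: placeholder fill, cast to int, then to object.
data['update_man'] = (data['update_man']
                      .fillna(0)
                      .astype('int64')
                      .astype('object'))

# Step 5: blank out the cells that were NaN.
data.loc[na['update_man'], 'update_man'] = ''

print(data['update_man'].tolist())   # [483, '', 8857]
```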

data before the replacement (note: update_man and update_time are stored as int in the database, but pandas read them as float because of the NULLs):

[6 rows x 6 columns]
   id  switch  ...  update_man   update_time
0   1       1  ...       483.0  1.541572e+09
1   2       0  ...      1571.0  1.541577e+09
2   3       0  ...         NaN           NaN
3   4       0  ...      8857.0  1.541678e+09
4   5       0  ...         NaN           NaN
5   6       0  ...         NaN           NaN

data after the cleanup:

   id  switch  ...  update_man  update_time
0   1       1  ...         483   1541571712
1   2       0  ...        1571   1541576866
2   3       0  ...
3   4       0  ...        8857   1541677813
4   5       0  ...
5   6       0  ...
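As an aside, pandas 0.24+ offers a nullable 'Int64' extension dtype that holds integers and missing values in the same column, which avoids the placeholder-and-mask round trip entirely; whether it fits depends on the downstream comparison logic. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([483.0, np.nan, 8857.0])

# The nullable extension dtype keeps real integers next to missing values.
s_int = s.astype('Int64')
print(s_int.dtype)        # Int64

# Render missing values as empty strings, as in the article.
s_str = s_int.astype(object).where(s_int.notna(), '')
print(s_str.tolist())     # [483, '', 8857]
```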

The full implementation:

    @staticmethod
    def cleanpddata(pddata, coltype):
        """
        :usage: if a column is int in the database and contains NULLs, pandas
                promotes it to float when the data is extracted, which makes
                comparisons between pandas and Spark dataframes fail.
                This function changes the dataframe back toward the original
                'sight' of the table in the database.
                Currently it only reverses the automatic int-to-float change.
                exp:
                data of pandas dataframe before change:
                      id       switch      ...      update_man   update_time
                0      1            1      ...           483.0  1.541572e+09
                1      2            0      ...          1571.0  1.541577e+09
                2      3            0      ...             NaN           NaN
                3      4            0      ...          8857.0  1.541678e+09
                data of pandas dataframe after reverse:
                      id       switch     ...     update_man update_time
                0      1            1     ...            483  1541571712
                1      2            0     ...           1571  1541576866
                2      3            0     ...
                3      4            0     ...           8857  1541677813

        :param pddata: pandas dataframe of the extracted data
        :param coltype: column types of the table schema, see DataBase.get_origin_table_schema
        :return: cleaned pandas dataframe
        """
        na = pddata.isna()
        pdtype = pddata.dtypes.to_frame('DATA_TYPE').reset_index()
        pdtype.columns = ['COLUMN_NAME', 'DATA_TYPE']

        # columns that are int in the db but float in the dataframe after
        # extraction; note: the boolean masks align coltype and pdtype by row
        # position, so both must list the columns in the same order
        col_needclean = coltype.loc[
            coltype['DATA_TYPE'].astype(str).str.contains('int') &
            pdtype['DATA_TYPE'].astype(str).str.contains('float'),
            'COLUMN_NAME']

        # restore the int presentation and turn the former NULL cells into ""
        for col in col_needclean.values:
            col_name = str(col)
            pddata[col_name] = (pddata[col_name]
                                .fillna(0)
                                .astype('int64')
                                .astype('object'))
            pddata.loc[na[col_name], col_name] = ''

        # for "object" columns, replace NULLs with "" and strip line breaks
        col_needclean = coltype.loc[
            pdtype['DATA_TYPE'].astype(str).str.contains('object'),
            'COLUMN_NAME']

        for col in col_needclean.values:
            col_name = str(col)
            pddata[col_name] = pddata[col_name].fillna('')
            pddata[col_name] = pddata[col_name].apply(lambda x:
                x.replace('\n', '').replace('\r', '') if isinstance(x, str) else x)

        return pddata
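The column-selection step can be exercised in isolation with hand-built inputs (toy values below; as noted, the boolean masks align coltype and pdtype by row position, so both must list columns in the same order):

```python
import numpy as np
import pandas as pd

# Hypothetical extracted data and its DB schema.
pddata = pd.DataFrame({'id': [1, 2], 'update_man': [483.0, np.nan]})
coltype = pd.DataFrame({'COLUMN_NAME': ['id', 'update_man'],
                        'DATA_TYPE':   ['int', 'int']})

# Rebuild pdtype exactly as cleanpddata does.
pdtype = pddata.dtypes.to_frame('DATA_TYPE').reset_index()
pdtype.columns = ['COLUMN_NAME', 'DATA_TYPE']

# int in the db, but float after extraction -> needs cleaning.
mask = (coltype['DATA_TYPE'].astype(str).str.contains('int') &
        pdtype['DATA_TYPE'].astype(str).str.contains('float'))
col_needclean = coltype.loc[mask, 'COLUMN_NAME'].tolist()
print(col_needclean)   # ['update_man']
```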

Below is DataBase.get_origin_table_schema, whose return value feeds cleanpddata's coltype parameter; it is included for reference only, to help understand the code above:

    def get_origin_table_schema(self, schema, tablename):
        """
        :usage: get the original table schema, used to check whether column
                properties have been changed by the pandas dataframe
        :param schema: schema name of the table
        :param tablename: table name
        :return: dataframe of the table schema, exp:
                        COLUMN_NAME DATA_TYPE
                0             id          int
                1         switch          int
                2       supplier      varchar
        """
        if isinstance(tablename, list):
            tablename = tablename[0]
        # note: for production use, prefer parameter binding over string
        # concatenation to avoid SQL injection
        sql = "select COLUMN_NAME, DATA_TYPE from information_schema.columns " + \
              " where table_schema = '" + schema + "'" + \
              " and table_name = '" + tablename + "';"
        tb_type = self.get_data_dataframe(sql)
        return tb_type
