ValueError: Input contains NaN, infinity or a value too large for dtype('float32')

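Recently I worked on Kaggle's Home Credit Default Risk competition, and while stacking models I ran into ValueError: Input contains NaN, infinity or a value too large for dtype('float32'). After consulting a lot of references, I first ruled out the plain NaN and infinity cases. In the end, by commenting out the feature engineering code and testing it one section at a time, I found that some values were overflowing float32.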
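Before commenting out code section by section, it can help to count the three conditions named in the error message for every numeric column. The following is a minimal sketch, not the code used in this post; the helper name report_bad_values and the demo frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def report_bad_values(frame: pd.DataFrame) -> pd.DataFrame:
    """Count NaN, +/-inf and float32-overflowing values per numeric column."""
    numeric = frame.select_dtypes(include=[np.number]).astype("float64")
    f32_max = np.finfo(np.float32).max
    report = pd.DataFrame({
        "nan": numeric.isna().sum(),
        "inf": np.isinf(numeric).sum(),
        # Finite in float64, but too large to be stored as float32.
        "over_float32": (np.isfinite(numeric) & (numeric.abs() > f32_max)).sum(),
    })
    # Keep only the columns that have at least one problem.
    return report[report.sum(axis=1) > 0]


# Hypothetical demo data, not taken from the competition files.
demo = pd.DataFrame({
    "ok": [1.0, 2.0, 3.0],
    "has_nan": [1.0, np.nan, 3.0],
    "has_inf": [1.0, np.inf, -np.inf],
    "too_big": [1.0, 1e39, 3.0],
})
result = report_bad_values(demo)
print(result)
```

Running the helper on the feature matrix right before it is passed to the model should point at the offending columns directly.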

  1. df['phone_to_employ_ratio'] = df['DAYS_LAST_PHONE_CHANGE'] / df['DAYS_EMPLOYED']

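Testing showed that this 'phone_to_employ_ratio' feature probably overflows when one continuous value is divided by the other. The cause is that these two continuous columns contain infinity.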
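One more way a ratio feature can end up holding infinity is a zero in the denominator: pandas does not raise on floating-point division by zero, it returns inf (or NaN for 0 / 0). Whether the competition data actually contains such zeros is not checked in this post, so the sketch below uses made-up toy values only.

```python
import numpy as np
import pandas as pd

# Toy values for illustration, not taken from the competition data.
toy = pd.DataFrame({
    "DAYS_LAST_PHONE_CHANGE": [-100.0, -50.0, 0.0],
    "DAYS_EMPLOYED": [-200.0, 0.0, 0.0],
})
toy["phone_to_employ_ratio"] = toy["DAYS_LAST_PHONE_CHANGE"] / toy["DAYS_EMPLOYED"]

ratio = toy["phone_to_employ_ratio"]
n_inf = int(np.isinf(ratio).sum())  # rows where x / 0 produced +/-inf
n_nan = int(ratio.isna().sum())     # rows where 0 / 0 produced NaN
print(toy)
print("inf rows:", n_inf, "NaN rows:", n_nan)
```

Checking the ratio column itself, and not only its two inputs, covers both sources of infinity.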

Solution:

In both the training set and the test set, drop the rows with anomalous values in 'DAYS_LAST_PHONE_CHANGE' and 'DAYS_EMPLOYED'.

import pandas as pd

# df is the training DataFrame loaded earlier.
# Drop the rows where either column is +inf or -inf.
df.drop(df[df['DAYS_LAST_PHONE_CHANGE'] == float("inf")].index, inplace=True)
df.drop(df[df['DAYS_LAST_PHONE_CHANGE'] == float("-inf")].index, inplace=True)
df.drop(df[df['DAYS_EMPLOYED'] == float("inf")].index, inplace=True)
df.drop(df[df['DAYS_EMPLOYED'] == float("-inf")].index, inplace=True)
#df['phone_to_employ_ratio'] = df['DAYS_LAST_PHONE_CHANGE'] / df['DAYS_EMPLOYED']
df.to_csv('F:\\Jupyter\\Game\\Home Credit Default Risk\\data\\application_train_2.csv', index=False)

test_df = pd.read_csv('F:\\Jupyter\\Game\\Home Credit Default Risk\\data\\application_test.csv')
#prev = pd.read_csv('F:\\Jupyter\\Game\\Home Credit Default Risk\\data\\previous_application.csv')
# Build each mask from test_df itself; using df here would drop test rows by training-set index.
test_df.drop(test_df[test_df['CODE_GENDER'] == 'XNA'].index, inplace=True)
test_df.drop(test_df[test_df['DAYS_LAST_PHONE_CHANGE'] == float("inf")].index, inplace=True)
test_df.drop(test_df[test_df['DAYS_LAST_PHONE_CHANGE'] == float("-inf")].index, inplace=True)
test_df.drop(test_df[test_df['DAYS_EMPLOYED'] == float("inf")].index, inplace=True)
test_df.drop(test_df[test_df['DAYS_EMPLOYED'] == float("-inf")].index, inplace=True)
test_df.to_csv('F:\\Jupyter\\Game\\Home Credit Default Risk\\data\\application_test_2.csv', index=False)
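Dropping rows from the test set changes how many rows get a prediction, which may matter if the submission needs one prediction per test row. A possible alternative, shown here only as a sketch and not as what was done in this post, is to keep every row and turn infinite ratios into NaN. The function name add_ratio_keep_rows and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def add_ratio_keep_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Add the ratio feature and replace +/-inf with NaN instead of dropping rows."""
    out = frame.copy()
    ratio = out["DAYS_LAST_PHONE_CHANGE"] / out["DAYS_EMPLOYED"]
    out["phone_to_employ_ratio"] = ratio.replace([np.inf, -np.inf], np.nan)
    return out


# Toy test frame for illustration only.
toy_test = pd.DataFrame({
    "DAYS_LAST_PHONE_CHANGE": [-100.0, -50.0, -10.0],
    "DAYS_EMPLOYED": [-200.0, 0.0, -20.0],
})
cleaned = add_ratio_keep_rows(toy_test)
print(cleaned)
```

The NaN values still have to be imputed, or handled by a model that accepts missing values, because the same ValueError is also raised for NaN.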

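PS: Once data preprocessing is finished, it is best to save the result to a separate CSV and do feature engineering from that CSV afterwards. This avoids unnecessary trouble.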

  2. The same problem showed up again when building aggregate features

aggregations = {
    'PAYMENT_PERC': ['max', 'mean', 'var', 'min', 'std'],
    # 'max', 'mean' and 'min' are the three that cause the problem
}

Solution:

I removed those three and kept only var and std.
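If the underlying issue is that 'PAYMENT_PERC' contains infinity (an assumption, since this post does not show how that column is built), another option is to clean the column before aggregating so that statistics such as max stay usable. A minimal sketch on toy data follows; the grouping key SK_ID_CURR and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy installments table; the key column and the values are assumptions.
ins = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2],
    "PAYMENT_PERC": [0.5, np.inf, 1.0, 2.0],
})
aggregations = {"PAYMENT_PERC": ["max", "mean", "var", "min", "std"]}

# Aggregating the raw column lets inf leak into the result (at least via max).
raw = ins.groupby("SK_ID_CURR").agg(aggregations)

# Replacing +/-inf with NaN first makes the aggregations skip those entries.
ins_clean = ins.replace([np.inf, -np.inf], np.nan)
clean = ins_clean.groupby("SK_ID_CURR").agg(aggregations)

raw_has_inf = bool(np.isinf(raw.to_numpy()).any())
clean_has_inf = bool(np.isinf(clean.to_numpy()).any())
print("inf in raw aggregates:", raw_has_inf)
print("inf in cleaned aggregates:", clean_has_inf)
```

Groups that end up with a single valid value still yield NaN for var and std, so the aggregated frame needs a NaN check as well.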
