Python Training Camp Check-in, Day 5 (2025.4.24)

One-hot encoding of discrete features

# Read the data
import pandas as pd
data = pd.read_csv('data.csv')  # data is now a DataFrame object


# As covered in Day 4, the data.columns attribute shows a DataFrame's column names.
data.columns


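# Print the names of all discrete variables
# In Python, variables are usually named with English words joined by underscores rather than pinyin; this habit makes the code easier for others to read and understand.
# "continuous" and "discrete" are the English terms for the two kinds of variable
for discrete_features in data.columns:
    if data[discrete_features].dtype == 'object':
        print(discrete_features)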
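
The loop above can also be written without explicit iteration. The following is a minimal sketch on a small made-up DataFrame (the values are hypothetical stand-ins for data.csv). It uses `select_dtypes(exclude="number")`, which assumes that every non-numeric column is a discrete feature; `include="object"` would mirror the loop's dtype test more literally.

```python
import pandas as pd

# Toy frame standing in for data.csv (hypothetical values)
toy = pd.DataFrame({
    "Home Ownership": ["Rent", "Own Home", "Rent"],
    "Annual Income": [50000.0, 72000.0, None],
    "Term": ["Short Term", "Long Term", "Short Term"],
})

# Keep only the non-numeric columns and collect their names
discrete_features = toy.select_dtypes(exclude="number").columns.tolist()
print(discrete_features)
```

The result is a plain list of column names, so it can be passed directly to the `columns` argument of `pd.get_dummies`.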

Take Home Ownership as an example

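data['Home Ownership']
# This feature needs to be encoded; print its values first
# The value_counts() method counts the occurrences of each category and returns a Series. It gives a quick view of how the categories are distributed in the dataset.
data['Home Ownership'].value_counts()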

The categories have no inherent order, so one-hot encoding is appropriate

# One-hot encode the Home Ownership column
data = pd.get_dummies(data, columns=['Home Ownership'])
data.columns
# The original Home Ownership column has been replaced by 'Home Ownership_Have Mortgage', 'Home Ownership_Home Mortgage', 'Home Ownership_Own Home', 'Home Ownership_Rent'


data.head()


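# The one-hot encoded columns above are of bool type; convert them to int, because some later functions may not support bool values
# This is also a chance to practice type conversion
data['Home Ownership_Have Mortgage'] = data['Home Ownership_Have Mortgage'].astype(int)
data['Home Ownership_Have Mortgage']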
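
As a side note, `pd.get_dummies` accepts a `dtype` argument, so the encoding and the conversion to int can happen in one call. The sketch below uses a small made-up frame (the `Id` and `Home Ownership` values are hypothetical) rather than the real data.csv.

```python
import pandas as pd

# Toy frame standing in for data.csv (hypothetical values)
toy = pd.DataFrame({
    "Id": [0, 1, 2],
    "Home Ownership": ["Rent", "Own Home", "Rent"],
})

# dtype=int makes the indicator columns 0/1 integers instead of bool
encoded = pd.get_dummies(toy, columns=["Home Ownership"], dtype=int)
print(encoded)
```

The separate `astype(int)` step is still worth knowing, since it applies to any column and not only to freshly encoded ones.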

Next, combine the earlier code to one-hot encode all discrete features in a single pass

# Read the data again
data = pd.read_csv("data.csv")
# Find the discrete variables
discrete_lists = []  # an empty list to hold the names of the discrete variables
for discrete_features in data.columns:
    if data[discrete_features].dtype == 'object':
        discrete_lists.append(discrete_features)

# One-hot encode the discrete variables
data = pd.get_dummies(data, columns=discrete_lists, drop_first=True)

data.columns

The output is below:

Index(['Id', 'Annual Income', 'Tax Liens', 'Number of Open Accounts',
       'Years of Credit History', 'Maximum Open Credit',
       'Number of Credit Problems', 'Months since last delinquent',
       'Bankruptcies', 'Current Loan Amount', 'Current Credit Balance',
       'Monthly Debt', 'Credit Score', 'Credit Default',
       'Home Ownership_Home Mortgage', 'Home Ownership_Own Home',
       'Home Ownership_Rent', 'Years in current job_10+ years',
       'Years in current job_2 years', 'Years in current job_3 years',
       'Years in current job_4 years', 'Years in current job_5 years',
       'Years in current job_6 years', 'Years in current job_7 years',
       'Years in current job_8 years', 'Years in current job_9 years',
       'Years in current job_< 1 year', 'Purpose_buy a car',
       'Purpose_buy house', 'Purpose_debt consolidation',
       'Purpose_educational expenses', 'Purpose_home improvements',
       'Purpose_major purchase', 'Purpose_medical bills', 'Purpose_moving',
       'Purpose_other', 'Purpose_renewable energy', 'Purpose_small business',
       'Purpose_take a trip', 'Purpose_vacation', 'Purpose_wedding',
       'Term_Short Term'],
      dtype='object')
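
Note that 'Home Ownership_Have Mortgage' is absent from this output, which is consistent with `drop_first=True` dropping the first category of each encoded feature. A minimal sketch on a made-up `Term` column (hypothetical values) shows the difference:

```python
import pandas as pd

# Toy column with two categories (hypothetical values)
toy = pd.DataFrame({"Term": ["Short Term", "Long Term", "Short Term"]})

# Default: one indicator column per category
full = pd.get_dummies(toy, columns=["Term"])
# drop_first=True: the first category (in sorted order) is dropped
reduced = pd.get_dummies(toy, columns=["Term"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The dropped category is still recoverable: a row belongs to it exactly when all remaining indicator columns for that feature are 0.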

Find the names of all new features created by the one-hot encoding

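# Comparing the column names before and after one-hot encoding is enough
data2 = pd.read_csv("data.csv")
list_final = []  # an empty list to hold the feature names added by one-hot encoding
for i in data.columns:
    if i not in data2.columns:
        list_final.append(i)  # i is a feature name that exists only after one-hot encoding
list_final

# The same result can also be obtained with the data.columns.difference() method; try it yourself
# As you can see, there are many different ways to reach the same result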
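
For reference, here is a minimal sketch of the `Index.difference()` approach on a small made-up frame (the names `before` and `after` are hypothetical). One difference from the loop: `difference()` returns its result sorted by default, so the order may not match the column order of the encoded DataFrame.

```python
import pandas as pd

# Toy frame before and after encoding (hypothetical values)
before = pd.DataFrame({"Id": [0, 1], "Term": ["Short Term", "Long Term"]})
after = pd.get_dummies(before, columns=["Term"])

# Columns present after encoding but not before
new_features = after.columns.difference(before.columns).tolist()
print(new_features)
```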

Output:

['Home Ownership_Home Mortgage',
 'Home Ownership_Own Home',
 'Home Ownership_Rent',
 'Years in current job_10+ years',
 'Years in current job_2 years',
 'Years in current job_3 years',
 'Years in current job_4 years',
 'Years in current job_5 years',
 'Years in current job_6 years',
 'Years in current job_7 years',
 'Years in current job_8 years',
 'Years in current job_9 years',
 'Years in current job_< 1 year',
 'Purpose_buy a car',
 'Purpose_buy house',
 'Purpose_debt consolidation',
 'Purpose_educational expenses',
 'Purpose_home improvements',
 'Purpose_major purchase',
 'Purpose_medical bills',
 'Purpose_moving',
 'Purpose_other',
 'Purpose_renewable energy',
 'Purpose_small business',
 'Purpose_take a trip',
 'Purpose_vacation',
 'Purpose_wedding',
 'Term_Short Term']

# Continuing from before, convert the bool features to int
for i in list_final:
    data[i] = data[i].astype(int)  # i is a feature name created by one-hot encoding
data.head()


# Fill the missing values in every column
data.dtypes


data.isnull().sum()  # count the missing values in each column


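# Fill with the mean
# Loop over every column
for i in data.columns:
    if data[i].isnull().sum() > 0:  # find the columns that contain missing values
        # compute the mean of the column
        mean_value = data[i].mean()
        # fill the missing values with the mean; assigning the result back avoids
        # the chained-assignment warning that fillna(..., inplace=True) raises on data[i]
        data[i] = data[i].fillna(mean_value)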

data.isnull().sum()

The final output is below:

Id                                0
Annual Income                     0
Tax Liens                         0
Number of Open Accounts           0
Years of Credit History           0
Maximum Open Credit               0
Number of Credit Problems         0
Months since last delinquent      0
Bankruptcies                      0
Current Loan Amount               0
Current Credit Balance            0
Monthly Debt                      0
Credit Score                      0
Credit Default                    0
Home Ownership_Home Mortgage      0
Home Ownership_Own Home           0
Home Ownership_Rent               0
Years in current job_10+ years    0
Years in current job_2 years      0
Years in current job_3 years      0
Years in current job_4 years      0
Years in current job_5 years      0
Years in current job_6 years      0
Years in current job_7 years      0
Years in current job_8 years      0
...
Purpose_take a trip               0
Purpose_vacation                  0
Purpose_wedding                   0
Term_Short Term                   0
dtype: int64
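
The column-by-column loop can also be condensed. The sketch below, on a made-up frame with hypothetical values, passes a Series of column means to `fillna`, which fills each column with its own mean in one call.

```python
import pandas as pd

# Toy frame with one missing value per column (hypothetical values)
toy = pd.DataFrame({
    "Annual Income": [50000.0, None, 70000.0],
    "Credit Score": [700.0, 720.0, None],
})

# toy.mean(numeric_only=True) is a Series indexed by column name;
# fillna matches it against the columns and fills each one separately
filled = toy.fillna(toy.mean(numeric_only=True))
print(filled.isnull().sum())
```

This returns a new DataFrame and leaves `toy` unchanged, so no in-place modification is involved.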

And with that, the processing is complete

Note that the code may emit warnings when it runs; they can be suppressed with the following code

import warnings
pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore")
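
Suppressing every warning globally can also hide useful ones. As an alternative, here is a minimal sketch using only the standard `warnings` module that silences a single warning category within a limited block. The `noisy` function is a hypothetical stand-in for a library call that warns.

```python
import warnings

def noisy():
    # Hypothetical helper that emits a warning, standing in for a library call
    warnings.warn("this behaviour will change", FutureWarning)
    return 42

# Filters set inside catch_warnings() are undone when the block exits
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy()

print(result)
```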

 

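Exercise:

In a .py file, process all the continuous and discrete variables in data in one go

1. Read the data

2. One-hot encode the discrete variables

3. Convert the one-hot encoded variables to int type

4. Fill all missing values

Note that this is a .py file, so use the debugger to step through and check that the output of each step is correct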

# Read the data
import pandas as pd
data = pd.read_csv(r'data.csv')
# Handle the continuous variables: fill missing values with the column mean
c = data.columns.tolist()
for i in c:
    if data[i].dtype != 'object':
        if data[i].isnull().sum() > 0:
            mean_value = data[i].mean()
            data[i] = data[i].fillna(mean_value)
# Handle the discrete variables
# (do not re-read data.csv into data here, or the filled values above would be lost)
discrete_lists = []
for discrete_features in data.columns:
    if data[discrete_features].dtype == 'object':
        discrete_lists.append(discrete_features)
data = pd.get_dummies(data, columns=discrete_lists, drop_first=True)
# Compare with the original columns to find the features added by one-hot encoding
data2 = pd.read_csv("data.csv")
list_final = []
for i in data.columns:
    if i not in data2.columns:
        list_final.append(i)
# Convert the new bool features to int
for i in list_final:
    data[i] = data[i].astype(int)

That is the complete workflow
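
Since the exercise targets a .py file, the steps can also be wrapped in a function, which makes them easier to step through in a debugger. This is only a sketch under assumptions: `preprocess` is a hypothetical name, the toy frame replaces data.csv, and every non-numeric column is treated as a discrete feature.

```python
import pandas as pd

def preprocess(df):
    """One-hot encode non-numeric columns as 0/1 ints, then mean-fill missing values."""
    # Assumption: every non-numeric column is a discrete feature
    discrete = df.select_dtypes(exclude="number").columns.tolist()
    # Encode, drop the first category of each feature, and emit ints directly
    out = pd.get_dummies(df, columns=discrete, drop_first=True, dtype=int)
    # Fill each remaining missing value with the mean of its column
    return out.fillna(out.mean(numeric_only=True))

# Toy frame standing in for data.csv (hypothetical values)
toy = pd.DataFrame({
    "Annual Income": [50000.0, None, 70000.0],
    "Term": ["Short Term", "Long Term", "Short Term"],
})
clean = preprocess(toy)
print(clean)
```

Setting a breakpoint on each line inside `preprocess` lets you inspect `discrete` and `out` after every step.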

 @浙大疏锦行
