[Exercise] Red and White Wine Case 1: Combining the Datasets


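Problem:
Combine the red and white wine datasets (winequality-red.csv, winequality-white.csv) and add a column that records the color, so that each row can be identified as red or white wine.
Approach:
Add a color column to the red wine dataset and to the white wine dataset separately, then combine the two.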

 

 

 

Assess the data:

import pandas as pd
pd_red = pd.read_csv('winequality-red.csv', sep=';')
pd_white = pd.read_csv('winequality-white.csv', sep=';')

pd_red.info()

The output is as follows:

 


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed_acidity           1599 non-null float64
volatile_acidity        1599 non-null float64
citric_acid             1599 non-null float64
residual_sugar          1599 non-null float64
chlorides               1599 non-null float64
free_sulfur_dioxide     1599 non-null float64
total_sulfur-dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

 

pd_white.info()

 


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
fixed_acidity           4898 non-null float64
volatile_acidity        4898 non-null float64
citric_acid             4898 non-null float64
residual_sugar          4898 non-null float64
chlorides               4898 non-null float64
free_sulfur_dioxide     4898 non-null float64
total_sulfur_dioxide    4898 non-null float64
density                 4898 non-null float64
pH                      4898 non-null float64
sulphates               4898 non-null float64
alcohol                 4898 non-null float64
quality                 4898 non-null int64
dtypes: float64(11), int64(1)
memory usage: 459.3 KB
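
Comparing the two info() listings, the red dataset labels one column total_sulfur-dioxide while the white dataset uses total_sulfur_dioxide. A mismatch like this is easy to miss by eye, so here is a minimal sketch of checking it programmatically. The tiny frames red_demo and white_demo are hypothetical stand-ins with made-up values; with the real data you would compare the columns of the two loaded DataFrames instead.

```python
import pandas as pd

# Hypothetical stand-in frames (made-up values) that reuse a few of the
# column labels shown in the info() output above.
red_cols = ["free_sulfur_dioxide", "total_sulfur-dioxide", "density"]
white_cols = ["free_sulfur_dioxide", "total_sulfur_dioxide", "density"]
red_demo = pd.DataFrame([[11.0, 34.0, 0.9978]], columns=red_cols)
white_demo = pd.DataFrame([[45.0, 170.0, 1.001]], columns=white_cols)

# Labels that appear in only one of the two frames.
only_in_red = red_demo.columns.difference(white_demo.columns)
only_in_white = white_demo.columns.difference(red_demo.columns)

print(list(only_in_red))
print(list(only_in_white))
```

If both lists come back empty, the column labels already match and the frames can be stacked without a rename.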

 

 

Implementation (run in a Jupyter notebook):

# Import numpy and pandas
import numpy as np
import pandas as pd
# Load the red and white wine datasets
red_df = pd.read_csv('winequality-red.csv', sep=';')
white_df = pd.read_csv('winequality-white.csv', sep=';')


# Create the color array for the red wine DataFrame (it has 1599 samples)
color_red = pd.Series(np.repeat('red', 1599))
# Create the color array for the white wine DataFrame (it has 4898 samples)
color_white = pd.Series(np.repeat('white', 4898))


red_df['color'] = color_red
white_df['color'] = color_white
# Inspect the DataFrames to check that the new column was added
red_df.head()
white_df.head()


# For the rows to line up, both DataFrames need identical column names.
# Rename the total_sulfur-dioxide column in red_df to total_sulfur_dioxide.
red_df = red_df.rename(columns={'total_sulfur-dioxide' : 'total_sulfur_dioxide'})

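# Combine the two DataFrames.
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead;
# ignore_index=True renumbers the rows 0..n-1 instead of repeating index values.
wine_df = pd.concat([white_df, red_df], ignore_index=True)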

# Inspect the combined DataFrame to check that it worked
wine_df.head()
wine_df.tail()

# Save the combined dataset; index=False avoids writing the index as an unnamed column
wine_df.to_csv('winequality_edited.csv', index=False)
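
The same idea can be written without hard-coding the row counts, since assigning a scalar to a new column fills every row. The sketch below is one possible variant, not the exercise's reference solution: combine_wines is a hypothetical helper name, and red_demo and white_demo are small made-up frames standing in for the real CSV data.

```python
import pandas as pd


def combine_wines(red_df, white_df):
    """Hypothetical helper: label each frame by color, then stack them."""
    # Align the one mismatched column label before stacking.
    red = red_df.rename(columns={"total_sulfur-dioxide": "total_sulfur_dioxide"})
    # assign() broadcasts the scalar to every row, so no row count is needed.
    red = red.assign(color="red")
    white = white_df.assign(color="white")
    # White rows first, then red, with a fresh 0..n-1 index.
    return pd.concat([white, red], ignore_index=True)


# Made-up stand-in data with the same column naming mismatch as the real files.
red_demo = pd.DataFrame({"total_sulfur-dioxide": [34.0, 67.0], "quality": [5, 5]})
white_demo = pd.DataFrame(
    {"total_sulfur_dioxide": [170.0, 132.0, 97.0], "quality": [6, 6, 6]}
)

wine_demo = combine_wines(red_demo, white_demo)
print(wine_demo.shape)
print(wine_demo["color"].value_counts())
```

Checking the shape and the per-color counts afterwards is a quick way to confirm that no rows were lost and that the rename prevented a stray extra column full of missing values.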





 
