Intuitively, a DataFrame is a data structure built by placing several Series side by side, and it looks much like an Excel sheet. Here is a quick way to create one.
import pandas as pd

marvel_data = [
['Spider-Man', 'male', 1962],
['Captain America', 'male', 1941],
['Wolverine', 'male', 1974],
['Iron Man', 'male', 1963],
['Thor', 'male', 1963],
['Thing', 'male', 1961],
['Mister Fantastic', 'male', 1961],
]
pd.DataFrame(marvel_data)
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Spider-Man | male | 1962 |
| 1 | Captain America | male | 1941 |
| 2 | Wolverine | male | 1974 |
| 3 | Iron Man | male | 1963 |
| 4 | Thor | male | 1963 |
| 5 | Thing | male | 1961 |
| 6 | Mister Fantastic | male | 1961 |
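To make the "several Series side by side" picture concrete, here is a minimal sketch that builds a small table from a dict of Series. The names `names`, `sexes`, `years`, and `heroes_df` are illustrative only, and just the first three rows of the data above are used.

```python
import pandas as pd

# Each column is a Series; the shared default index (0, 1, 2) aligns them.
names = pd.Series(['Spider-Man', 'Captain America', 'Wolverine'])
sexes = pd.Series(['male', 'male', 'male'])
years = pd.Series([1962, 1941, 1974])

# The dict keys become the column names.
heroes_df = pd.DataFrame({'name': names, 'sex': sexes, 'first_appearance': years})

# Selecting a single column gives back a Series.
print(type(heroes_df['name']))
print(heroes_df.shape)  # (3, 3)
```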
marvel_df = pd.DataFrame(marvel_data)
col_names = ['name', 'sex', 'first_appearance']
marvel_df.columns = col_names
| | name | sex | first_appearance |
|---|---|---|---|
| 0 | Spider-Man | male | 1962 |
| 1 | Captain America | male | 1941 |
| 2 | Wolverine | male | 1974 |
| 3 | Iron Man | male | 1963 |
| 4 | Thor | male | 1963 |
| 5 | Thing | male | 1961 |
| 6 | Mister Fantastic | male | 1961 |
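As a possible shortcut, the column names can also be passed when the DataFrame is constructed, using the `columns` parameter of `pd.DataFrame`. The sketch below uses only two rows of the data, and `named_df` is an illustrative name.

```python
import pandas as pd

marvel_data = [
    ['Spider-Man', 'male', 1962],
    ['Captain America', 'male', 1941],
]

# Pass the column names at construction time instead of assigning .columns later.
named_df = pd.DataFrame(marvel_data, columns=['name', 'sex', 'first_appearance'])
print(named_df.columns.tolist())  # ['name', 'sex', 'first_appearance']
```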
marvel_df.index = marvel_df['name']
| name (index) | name | sex | first_appearance |
|---|---|---|---|
| Spider-Man | Spider-Man | male | 1962 |
| Captain America | Captain America | male | 1941 |
| Wolverine | Wolverine | male | 1974 |
| Iron Man | Iron Man | male | 1963 |
| Thor | Thor | male | 1963 |
| Thing | Thing | male | 1961 |
| Mister Fantastic | Mister Fantastic | male | 1961 |
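Assigning a column to `.index` keeps that column in the data as well, which is why `name` appears twice in the table above. If the duplicate is not wanted, one option is `set_index`, sketched here on two rows; `df` and `indexed_df` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    [['Spider-Man', 'male', 1962], ['Thor', 'male', 1963]],
    columns=['name', 'sex', 'first_appearance'],
)

# set_index returns a new DataFrame indexed by the 'name' column;
# by default (drop=True) the column is removed from the data columns.
indexed_df = df.set_index('name')
print(indexed_df.columns.tolist())  # ['sex', 'first_appearance']
print(indexed_df.index.name)        # name
```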
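To drop a column or a row, use drop. The inplace parameter controls whether the original data is modified directly; the default is False, which leaves the original untouched and returns a modified copy.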
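# with inplace=False, drop returns a copy, so assign the result back
marvel_df = marvel_df.drop("sex", axis=1, inplace=False)
# or, as an alternative, modify marvel_df directly
marvel_df.drop("sex", axis=1, inplace=True)
# drop a row
marvel_df.drop("Thor", axis=0, inplace=True)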
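The difference between the two `inplace` settings can be checked on a small example. This is a minimal sketch on a two-row frame; `df`, `dropped`, and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {'sex': ['male', 'male'], 'first_appearance': [1962, 1963]},
    index=['Spider-Man', 'Thor'],
)

# inplace=False (the default): df is untouched and a new DataFrame is returned.
dropped = df.drop('sex', axis=1)
print('sex' in df.columns)       # True
print('sex' in dropped.columns)  # False

# inplace=True: df itself is modified and the call returns None.
result = df.drop('Thor', axis=0, inplace=True)
print(result)             # None
print(df.index.tolist())  # ['Spider-Man']
```

Because the in-place call returns None, writing `df = df.drop(..., inplace=True)` would replace the DataFrame with None.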
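The axis argument used when dropping rows and columns is very easy to mix up, so make sure you have it straight. In numpy, computing a sum or variance for each row uses axis=1, yet dropping an entire row uses axis=0. It is best to find a mnemonic that works for you.
My mnemonic: axis=1 means the operation is applied to every row, so dropping with axis=1 removes that attribute from every row, which amounts to deleting the whole column.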
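The two uses of axis can be put side by side in a short sketch. The array values and the labels `r0`, `r1`, `a`, `b`, `c` are made up for illustration. The `index=` and `columns=` keywords of `drop` are shown as an alternative that avoids `axis` altogether.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# axis=1 collapses the columns: one sum per row.
row_sums = arr.sum(axis=1)  # [6, 15]
# axis=0 collapses the rows: one sum per column.
col_sums = arr.sum(axis=0)  # [5, 7, 9]

df = pd.DataFrame(arr, index=['r0', 'r1'], columns=['a', 'b', 'c'])

# In drop, axis names the axis whose labels are searched:
# axis=0 looks in the row labels, axis=1 in the column labels.
without_row = df.drop('r0', axis=0)
without_col = df.drop('a', axis=1)
print(without_row.shape)  # (1, 3)
print(without_col.shape)  # (2, 2)

# Equivalent calls that name the target explicitly.
same_row = df.drop(index='r0')
same_col = df.drop(columns='a')
```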
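# .loc[rows you want, columns you want]
marvel_df.loc[:, "first_appearance"]
# returns every row of the first_appearance column
marvel_df.loc["Spider-Man": "Iron Man"]
# returns all rows from Spider-Man through Iron Man (label slices include both endpoints)
# the second argument, the columns, is omitted here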
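The iloc method differs from loc in that it locates elements by integer position: positions start at 0, and slices are half-open (the start is included, the stop is excluded).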
marvel_df.iloc[:, 1]  # selects first_appearance
# (the name column was dropped in the previous subsection)
marvel_df.iloc[0:4, :]
# returns the rows from Spider-Man through Iron Man (positions 0 to 3)
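The different endpoint rules of `loc` and `iloc` are easy to verify. The sketch below rebuilds a five-row frame with the hero names as index; `df`, `by_label`, and `by_position` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {'first_appearance': [1962, 1941, 1974, 1963, 1963]},
    index=['Spider-Man', 'Captain America', 'Wolverine', 'Iron Man', 'Thor'],
)

by_label = df.loc['Spider-Man':'Iron Man']  # label slice: both endpoints included
by_position = df.iloc[0:4]                  # position slice: stop (4) is excluded

print(by_label.index.tolist())
print(by_label.equals(by_position))  # True
```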
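# loc can also be used to assign a new value to a single cell
marvel_df.loc['Thor', 'first_appearance'] = 111111
# you can also add a new column to create a new feature
marvel_df['years_since'] = 2022 - marvel_df['first_appearance']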
marvel_df['sex'] == 'female'
'''
name
Spider-Man False
Captain America False
Wolverine False
Iron Man False
Thor False
Thing False
Mister Fantastic False
Name: sex, dtype: bool
'''
A small exercise: convert the male and female values in the sex column of marvel_df to 0 and 1 (without using map or replace). Hint: True can be treated as 1 and False as 0.
marvel_df['sex'] = (marvel_df['sex'] == 'female').astype("int64")
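Since every row in the data above is male, the result there is all zeros. To see both values, here is a minimal sketch on a three-row frame; the Black Widow row is an illustrative addition that is not part of the original data, and `df` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Spider-Man', 'Black Widow', 'Thor'],  # Black Widow row added for illustration
    'sex': ['male', 'female', 'male'],
})

# The comparison yields a boolean Series; casting to int maps False -> 0, True -> 1.
df['sex'] = (df['sex'] == 'female').astype('int64')
print(df['sex'].tolist())  # [0, 1, 0]
```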
Logical operations
marvel_df['first_appearance'] > 1970
# returns a boolean mask
marvel_df[marvel_df['first_appearance'] > 1970]
# uses the mask to select the rows of marvel_df that satisfy the condition
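Masks can also be combined. The sketch below assumes a four-row subset of the data with the hero names as index; `df`, `sixties`, and `not_sixties` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {'sex': ['male'] * 4, 'first_appearance': [1962, 1941, 1974, 1963]},
    index=['Spider-Man', 'Captain America', 'Wolverine', 'Iron Man'],
)

# Use & (and), | (or), ~ (not). Each comparison needs its own parentheses
# because these operators bind more tightly than >= and <.
sixties = (df['first_appearance'] >= 1960) & (df['first_appearance'] < 1970)
print(df[sixties].index.tolist())      # ['Spider-Man', 'Iron Man']

not_sixties = ~sixties
print(df[not_sixties].index.tolist())  # ['Captain America', 'Wolverine']
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the bitwise operators are used here.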