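A DataFrame is a two-dimensional, table-like structure: a block of data (comparable to a 2-D NumPy array) plus two sets of labels, the row index and the column index.
DataFrame(data [, index=row labels [, columns=column labels [, dtype=data type]]])
Note: data can be a nested list, a 2-D NumPy array, or a dict (each dict value is a one-dimensional list and the dict keys become the column labels). The data must not have more than two dimensions, otherwise the constructor raises an error (the message recorded in these notes is: Data must be 2-dimensional; the exact wording varies by pandas version).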
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
print(DataFrame(np.arange(6).reshape([2, 3])))
print("#" * 30)
print(DataFrame(np.arange(6).reshape([2, 3]), index=["row1", "row2"], columns=["col1", "col2", "col3"], dtype=float))
0 1 2
0 0 1 2
1 3 4 5
##############################
col1 col2 col3
row1 0.0 1.0 2.0
row2 3.0 4.0 5.0
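The cell above only builds a frame from a NumPy array. As a minimal sketch of the other two inputs mentioned in the note (a dict and a nested list), the following builds the same table both ways. The names scores and same and the English labels are made up for illustration, and the column order of the dict version assumes a pandas release that preserves dict insertion order.

```python
import pandas as pd

# From a dict: keys become column labels, each value is one column.
scores = pd.DataFrame(
    {"math": [98, 94, 97], "chinese": [80, 87, 76]},
    index=["li", "zhang", "hong"],
)

# From a nested list: each inner list is one row, so the labels
# have to be supplied explicitly.
same = pd.DataFrame(
    [[98, 80], [94, 87], [97, 76]],
    index=["li", "zhang", "hong"],
    columns=["math", "chinese"],
)

# Same values, same labels, same dtypes.
print(scores.equals(same))
```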
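Commonly used DataFrame attributes are dtypes, index, columns and values. dtype and name are Series attributes; a DataFrame does not have them.
dtype can be passed to the constructor, but reading df.dtype afterwards raises AttributeError. Assigning df.dtype = ... only attaches an ordinary Python attribute to the object so that it can be printed; it does not convert the data. Use df.dtypes to inspect the column types.
name cannot be passed to the constructor, and df.name is likewise undefined until it is assigned as an ordinary attribute. The index, by contrast, has a real name attribute (df.index.name) that is shown when the frame is printed.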
df0 = DataFrame(np.arange(6).reshape([2, 3]), index=["row1", "row2"], columns=["col1", "col2", "col3"], dtype=float)
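# print(df0.dtype) # AttributeError: 'DataFrame' object has no attribute 'dtype' (use df0.dtypes)
df0.dtype = int # attaches a plain Python attribute; the data stays float
print(df0.dtype)
print("##############################")
print(df0.index)
print("##############################")
print(df0.values) # NumPy ndarray
print("#" * 30)
# print(df0.name) # AttributeError: 'DataFrame' object has no attribute 'name'
df0.name = "first dataframe" # again a plain Python attribute, not part of the data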
print(df0.name)
print(df0.index.name)
df0.index.name = "idx"
print("##############################")
print(df0)
<class 'int'>
##############################
Index(['row1', 'row2'], dtype='object')
##############################
[[ 0. 1. 2.]
[ 3. 4. 5.]]
##############################
first dataframe
None
##############################
col1 col2 col3
idx
row1 0.0 1.0 2.0
row2 3.0 4.0 5.0
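For comparison, here is a small sketch of how column types are usually inspected and converted, using dtypes and astype() rather than an ad hoc dtype attribute. The variable names df and as_int and the axis name fields are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["row1", "row2"],
    columns=["col1", "col2", "col3"],
    dtype=float,
)

print(df.dtypes)    # one dtype per column
print(df.columns)   # the column labels

# astype() returns a converted copy; df itself stays float.
as_int = df.astype("int64")
print(as_int.dtypes)

# Both axes can carry a name that is shown when the frame is printed.
df.index.name = "idx"
df.columns.name = "fields"
print(df)
```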
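1. Plain indexing df[col] selects by column label. To select rows, use loc or iloc. Older pandas releases also offered ix, which accepted both labels and positions; it was deprecated in 0.20 and removed in 1.0.
Note: plain indexing accepts neither a position (df1[0] raises KeyError) nor a [row, col] pair.
Plain chained indexing, df[col][row], is the only form that goes column first, then row.
2. With loc, the first position is the row label and the second is the column label.
Note: loc accepts labels only, not positions.
With iloc, the first position is the row position and the second is the column position; labels are not accepted.
3. Fast access to a single value
df.iat[row position, column position], row first, then column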
df1 = DataFrame({"语文": [80, 87, 76], "数学": [98, 94, 97]}, index=["小李", "小张", "小红"], columns=["数学", "语文"]) # columns= pins the column order shown in the output
print("df1: \n", df1)
print("#" * 30)
print("df1['语文']: \n", df1['语文']) # Series
# print("#" * 30)
# print("df1[0]: \n", df1[0]) # KeyError: 0
print("#" * 30)
print("df1.loc['小红', '语文']: \n", df1.loc['小红', '语文']) # scalar; replaces the removed df1.ix['小红', '语文']
print("#" * 30)
print("df1.iloc[2, 1]: \n", df1.iloc[2, 1]) # scalar; replaces the removed df1.ix[2, 1]
print("#" * 30)
print("df1.loc['小张']: \n", df1.loc['小张'])
print("#" * 30)
print("df1.iloc[1]: \n", df1.iloc[1])
print("#" * 30)
print("df1.iat[0, 0]: \n", df1.iat[0, 0])
df1:
数学 语文
小李 98 80
小张 94 87
小红 97 76
##############################
df1['语文']:
小李 80
小张 87
小红 76
Name: 语文, dtype: int64
##############################
df1.loc['小红', '语文']:
76
##############################
df1.iloc[2, 1]:
76
##############################
df1.loc['小张']:
数学 94
语文 87
Name: 小张, dtype: int64
##############################
df1.iloc[1]:
数学 94
语文 87
Name: 小张, dtype: int64
##############################
df1.iat[0, 0]:
98
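The selection rules above can be condensed into one short sketch that touches each accessor once. The frame and its English labels are made up for illustration, and at (the label-based counterpart of iat) is added here even though the notes only mention iat.

```python
import pandas as pd

df = pd.DataFrame(
    {"math": [98, 94, 97], "chinese": [80, 87, 76]},
    index=["li", "zhang", "hong"],
)

col = df["chinese"]                       # column by label -> Series
row_by_label = df.loc["zhang"]            # row by label -> Series
row_by_pos = df.iloc[1]                   # row by position -> Series
cell_by_label = df.at["hong", "chinese"]  # single value by labels
cell_by_pos = df.iat[0, 0]                # single value by positions

print(col, row_by_label, row_by_pos, cell_by_label, cell_by_pos, sep="\n")
```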
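1. Slicing always selects rows, with or without loc/iloc (or the legacy ix).
Note: 1. Plain slicing df[a:b] accepts both label slices and integer position slices; the legacy ix did too.
2. Position slices are half-open (start included, end excluded), whereas label slices include both endpoints.
2. With loc, the first position is the row label and the second is the column label. If both positions are slices or lists, the result is a DataFrame; if one is a slice or list and the other a single label, the result is a Series; if both are single labels, the result is a scalar.
Note: loc takes labels only, for slices as well as single values, not positions. iloc is the positional counterpart.
Difference: selecting by label is a closed interval; selecting by position is half-open.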
df1 = DataFrame({"语文": [80, 87, 76], "数学": [98, 94, 97]}, index=["小李", "小张", "小红"], columns=["数学", "语文"]) # columns= pins the column order shown in the output
print("df1: \n", df1)
print("#" * 30)
print("df1['小张':'小红']: \n", df1["小张":"小红"]) # DataFrame
print("#" * 40)
print('df1[1:2]: \n', df1[1:2]) # DataFrame
print("#" * 40)
print("df1.loc['小张':'小红', '数学']: \n", df1.loc["小张":'小红', '数学']) # Series; replaces the removed ix
print("#" * 40)
print("df1.iloc[1:, 0]: \n", df1.iloc[1:, 0]) # Series; replaces the removed ix
np2 = np.random.random([5, 4])
df2 = DataFrame(np2, index=['idx1', 'idx2', 'idx3', 'idx4', 'idx5'], columns=['col1', 'col2', 'col3', 'col4'])
print("df2: \n", df2, "\n")
print("df2.loc['idx2': 'idx4', ['col3']]: \n", df2.loc['idx2': 'idx4', ["col3"]], "\n") # DataFrame
print("df2.loc['idx4', ['col3']]: \n", df2.loc['idx4', ["col3"]], "\n") # Series
print("df2.loc['idx4', 'col3']: \n", df2.loc['idx4', "col3"], "\n") # scalar
print("#" * 30)
print(df2.iloc[2])
print(df2.iloc[1:3])
print(df2.iloc[1:3, 1:3])
df1:
数学 语文
小李 98 80
小张 94 87
小红 97 76
##############################
df1['小张':'小红']:
数学 语文
小张 94 87
小红 97 76
########################################
df1[1:2]:
数学 语文
小张 94 87
########################################
df1.loc['小张':'小红', '数学']:
小张 94
小红 97
Name: 数学, dtype: int64
########################################
df1.iloc[1:, 0]:
小张 94
小红 97
Name: 数学, dtype: int64
df2:
col1 col2 col3 col4
idx1 0.075806 0.946859 0.039281 0.319763
idx2 0.725321 0.588884 0.297942 0.218208
idx3 0.664639 0.750553 0.666202 0.401805
idx4 0.588873 0.679760 0.463870 0.016034
idx5 0.794644 0.337072 0.804746 0.734267
df2.loc['idx2': 'idx4', ['col3']]:
col3
idx2 0.297942
idx3 0.666202
idx4 0.463870
df2.loc['idx4', ['col3']]:
col3 0.46387
Name: idx4, dtype: float64
df2.loc['idx4', 'col3']:
0.463869942466
##############################
col1 0.664639
col2 0.750553
col3 0.666202
col4 0.401805
Name: idx3, dtype: float64
col1 col2 col3 col4
idx2 0.725321 0.588884 0.297942 0.218208
idx3 0.664639 0.750553 0.666202 0.401805
col2 col3
idx2 0.588884 0.297942
idx3 0.750553 0.666202
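The endpoint rule is the easiest part of slicing to get wrong, so here is a minimal sketch that puts the two conventions side by side. The frame and its labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"math": [98, 94, 97], "chinese": [80, 87, 76]},
    index=["li", "zhang", "hong"],
)

by_label = df.loc["li":"zhang"]    # label slice: both endpoints included
by_pos = df.iloc[0:1]              # position slice: end excluded
mixed = df.loc["zhang":, "math"]   # slice + single label -> Series

print(len(by_label), len(by_pos))
print(mixed)
```

The label slice returns two rows and the position slice only one, even though both start at the first row.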
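1. Filter rows by a condition on one column
2. Filter element-wise over the whole frame (values that fail the condition become NaN)
3. Filter with df.isin() (matching is exact; the floats printed below are rounded for display, so the literal values in this example match no rows)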
print("df2: \n", df2, "\n")
print("df2[df2.col1>0.6]: \n", df2[df2.col1>0.6])
print("df2[df2 > 0.6]: \n", df2[df2 > 0.6], '\n')
print("df2[df2.col3.isin([0.666202: 0.804746])]: \n", df2[df2.col3.isin([0.666202, 0.804746])])
df2:
col1 col2 col3 col4
idx1 0.075806 0.946859 0.039281 0.319763
idx2 0.725321 0.588884 0.297942 0.218208
idx3 0.664639 0.750553 0.666202 0.401805
idx4 0.588873 0.679760 0.463870 0.016034
idx5 0.794644 0.337072 0.804746 0.734267
df2[df2.col1>0.6]:
col1 col2 col3 col4
idx2 0.725321 0.588884 0.297942 0.218208
idx3 0.664639 0.750553 0.666202 0.401805
idx5 0.794644 0.337072 0.804746 0.734267
df2[df2 > 0.6]:
col1 col2 col3 col4
idx1 NaN 0.946859 NaN NaN
idx2 0.725321 NaN NaN NaN
idx3 0.664639 0.750553 0.666202 NaN
idx4 NaN 0.679760 NaN NaN
idx5 0.794644 NaN 0.804746 0.734267
df2[df2.col3.isin([0.666202: 0.804746])]:
Empty DataFrame
Columns: [col1, col2, col3, col4]
Index: []
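The empty result above comes from exact float comparison. One possible workaround, sketched here under the assumption that a fixed absolute tolerance is acceptable, is to compare with np.isclose instead of isin(). The values, the tolerance of 1e-6 and the names targets, exact, close and approx are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Stored values carry more digits than the six that get displayed.
df = pd.DataFrame({"col3": [0.0392811, 0.6662023, 0.8047461]})
targets = [0.666202, 0.804746]

# Exact matching finds nothing.
exact = df[df["col3"].isin(targets)]

# Compare every stored value with every target within a tolerance,
# then keep the rows that are close to at least one target.
values = df["col3"].to_numpy()[:, None]
close = np.isclose(values, targets, atol=1e-6, rtol=0).any(axis=1)
approx = df[close]

print(exact.empty)
print(approx)
```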
Adding a column: df[new_col] = value, where value is a scalar or a one-dimensional object (its length must match the number of rows)
df1["英语"] = 1
print(df1)
数学 语文 英语
小李 98 80 1
小张 94 87 1
小红 97 76 1
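The cell above assigns a scalar. As a sketch of the one-dimensional cases, the following adds a column from a list and from a Series. Note that a Series is aligned on the index labels rather than by position, so rows it does not mention become NaN. All names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"math": [98, 94, 97]}, index=["li", "zhang", "hong"])

df["english"] = 1               # scalar: broadcast to every row
df["physics"] = [70, 75, 80]    # list: assigned by position, length must match

# Series: matched on index labels, order does not matter.
df["art"] = pd.Series({"hong": 90, "li": 85})

print(df)
```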
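reindex() conforms the data to a new set of labels on the given axis; it can add, drop and reorder labels.
Note: reindex() matches on the existing labels and does not rename them. Any requested label that is absent from the original axis (for example integers when the original labels are strings) produces a row or column filled with NaN.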
np4 = np.random.random([5, 4])
df4 = DataFrame(np4, index=['idx1', 'idx2', 'idx3', 'idx4', 'idx5'], columns=['col1', 'col2', 'col3', 'col4'])
print("df4: \n", df4, "\n")
df5 = df4.reindex(index=[i for i in range(5)], columns=(list(df4.columns)+['other'])) # none of the integer row labels exist in df4, so every value is NaN
print("df5: \n", df5, "\n")
df5 = df4.reindex(index=['idx1', 'idx2', 'idx3', 'idx4', 'idx5', 'idx6'], columns=list(df4.columns)+["other"]) # existing labels keep their data; only the new row idx6 and the new column other are NaN
print("df5: \n", df5, "\n")
df4:
col1 col2 col3 col4
idx1 0.904730 0.083906 0.355150 0.158472
idx2 0.173810 0.902371 0.795704 0.302842
idx3 0.875474 0.940805 0.772010 0.909598
idx4 0.338701 0.384996 0.397991 0.673116
idx5 0.607232 0.821274 0.885462 0.590894
df5:
col1 col2 col3 col4 other
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
df5:
col1 col2 col3 col4 other
idx1 0.904730 0.083906 0.355150 0.158472 NaN
idx2 0.173810 0.902371 0.795704 0.302842 NaN
idx3 0.875474 0.940805 0.772010 0.909598 NaN
idx4 0.338701 0.384996 0.397991 0.673116 NaN
idx5 0.607232 0.821274 0.885462 0.590894 NaN
idx6 NaN NaN NaN NaN NaN
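When NaN is not the desired placeholder for new labels, reindex() accepts a fill_value argument. A minimal sketch with a hypothetical two-row frame, which also shows that labels left out of the request are dropped:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1.0, 2.0]}, index=["idx1", "idx2"])

# idx1 is not requested, so it is dropped; idx3 and other are new,
# so they are created and filled with 0.0 instead of NaN.
out = df.reindex(index=["idx2", "idx3"], columns=["col1", "other"], fill_value=0.0)

print(out)
```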
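1. Drop rows that contain np.nan: df.dropna(how='any' or 'all')
2. Fill np.nan: df.fillna(value=XX). value can also be a dict that maps column labels to fill values, so that each column gets its own replacement
3. Build boolean masks for np.nan: df.isnull() and df.notnull()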
df5 = DataFrame(np.array(
[
[1,np.nan,2,3],
[2,3,4,np.nan],
[3, 4, 5, 6]
]
))
df51 = df5.dropna(how='any')
print("df5.dropna(how='any'): \n", df51)
df52 = df5.dropna(how='all') # drop a row only if all of its values are NaN
print("df5.dropna(how='all'): \n", df52)
df5.dropna(how='any'):
0 1 2 3
2 3.0 4.0 5.0 6.0
df5.dropna(how='all'):
0 1 2 3
0 1.0 NaN 2.0 3.0
1 2.0 3.0 4.0 NaN
2 3.0 4.0 5.0 6.0
df5 = DataFrame(np.array(
[
[1,np.nan,2,3],
[2,3,4,np.nan],
[3, 4, 5, 6]
]
))
df53 = df5.fillna(value=0)
print("df5.fillna(value=0): \n", df53)
df54 = df5.fillna(value={1:1, 3:2})
print("df5.fillna(value={1:1, 3:2}): \n", df54)
df5.fillna(value=0):
0 1 2 3
0 1.0 0.0 2.0 3.0
1 2.0 3.0 4.0 0.0
2 3.0 4.0 5.0 6.0
df5.fillna(value={1:1, 3:2}):
0 1 2 3
0 1.0 1.0 2.0 3.0
1 2.0 3.0 4.0 2.0
2 3.0 4.0 5.0 6.0
print("df5.isnull(): \n", df5.isnull())
print("df5.notnull(): \n", df5.notnull())
df5.isnull():
0 1 2 3
0 False True False False
1 False False False True
2 False False False False
df5.notnull():
0 1 2 3
0 True False True True
1 True True True False
2 True True True True
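The three tools above combine naturally. The sketch below counts missing values per column from the boolean mask, drops incomplete columns instead of rows, and fills each gap with its column mean. The choice of the mean is only an example, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1, np.nan, 2, 3],
    [2, 3, 4, np.nan],
    [3, 4, 5, 6],
])

# True counts as 1, so summing the mask counts NaN per column.
missing_per_col = df.isnull().sum()

# axis=1 drops columns (rather than rows) that contain NaN.
complete_cols = df.dropna(axis=1)

# A Series passed to fillna is matched on column labels.
filled = df.fillna(df.mean())

print(missing_per_col, complete_cols, filled, sep="\n")
```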
Note: the by argument of df.sort_values() must be a label (a column label when axis=0, a row label when axis=1), not a position.
np2 = np.random.random((6, 4))
df2 = DataFrame(np2)
print("df2.head(): \n", df2.head())
print("df2.tail(): \n", df2.tail())
print("df2.describe(): \n", df2.describe())
df2.columns = ['a','c','d','b']
print("df2: \n", df2)
print("df2.T: \n", df2.T)
df2 = df2.sort_index(axis=1, ascending=True)
print("df2.sort_index(axis=1, ascending=True): \n", df2)
print("df2.sort_values(by='b', ascending=True, axis=0): \n", df2.sort_values(by='b', ascending=True, axis=0))
df2.index = ['idx1', 'idx2', 'idx3', 'idx4', 'idx5', 'idx6']
print("df2.sort_values(by='idx3', ascending=True, axis=1): \n", df2.sort_values(by='idx3', ascending=True, axis=1))
df2.head():
0 1 2 3
0 0.713102 0.809143 0.511408 0.671801
1 0.344587 0.866291 0.332197 0.526176
2 0.661047 0.016959 0.391796 0.038333
3 0.437430 0.527497 0.506293 0.949712
4 0.096558 0.413851 0.572861 0.257213
df2.tail():
0 1 2 3
1 0.344587 0.866291 0.332197 0.526176
2 0.661047 0.016959 0.391796 0.038333
3 0.437430 0.527497 0.506293 0.949712
4 0.096558 0.413851 0.572861 0.257213
5 0.718657 0.023692 0.722784 0.459101
df2.describe():
0 1 2 3
count 6.000000 6.000000 6.000000 6.000000
mean 0.495230 0.442905 0.506223 0.483723
std 0.248941 0.368390 0.137655 0.317681
min 0.096558 0.016959 0.332197 0.038333
25% 0.367797 0.121232 0.420420 0.307685
50% 0.549238 0.470674 0.508850 0.492638
75% 0.700088 0.738731 0.557498 0.635394
max 0.718657 0.866291 0.722784 0.949712
df2:
a c d b
0 0.713102 0.809143 0.511408 0.671801
1 0.344587 0.866291 0.332197 0.526176
2 0.661047 0.016959 0.391796 0.038333
3 0.437430 0.527497 0.506293 0.949712
4 0.096558 0.413851 0.572861 0.257213
5 0.718657 0.023692 0.722784 0.459101
df2.T:
0 1 2 3 4 5
a 0.713102 0.344587 0.661047 0.437430 0.096558 0.718657
c 0.809143 0.866291 0.016959 0.527497 0.413851 0.023692
d 0.511408 0.332197 0.391796 0.506293 0.572861 0.722784
b 0.671801 0.526176 0.038333 0.949712 0.257213 0.459101
df2.sort_index(axis=1, ascending=True):
a b c d
0 0.713102 0.671801 0.809143 0.511408
1 0.344587 0.526176 0.866291 0.332197
2 0.661047 0.038333 0.016959 0.391796
3 0.437430 0.949712 0.527497 0.506293
4 0.096558 0.257213 0.413851 0.572861
5 0.718657 0.459101 0.023692 0.722784
df2.sort_values(by='b', ascending=True, axis=0):
a b c d
2 0.661047 0.038333 0.016959 0.391796
4 0.096558 0.257213 0.413851 0.572861
5 0.718657 0.459101 0.023692 0.722784
1 0.344587 0.526176 0.866291 0.332197
0 0.713102 0.671801 0.809143 0.511408
3 0.437430 0.949712 0.527497 0.506293
df2.sort_values(by='idx3', ascending=True, axis=1):
c b d a
idx1 0.809143 0.671801 0.511408 0.713102
idx2 0.866291 0.526176 0.332197 0.344587
idx3 0.016959 0.038333 0.391796 0.661047
idx4 0.527497 0.949712 0.506293 0.437430
idx5 0.413851 0.257213 0.572861 0.096558
idx6 0.023692 0.459101 0.722784 0.718657
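Since by only takes labels, sorting by a column position means translating the position into its label first. A short sketch with made-up data follows; the list form of by for multi-key sorting is included as well.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [0.5, 0.9, 0.1]})

by_label = df.sort_values(by="b")

# df.columns[1] turns position 1 into the label "b".
by_position = df.sort_values(by=df.columns[1], ascending=False)

# Several keys: sort by a, break ties with b.
by_two = df.sort_values(by=["a", "b"])

print(by_label, by_position, by_two, sep="\n")
```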
df6 = DataFrame(np.array(
[
[1,np.nan,2,3],
[2,3,4,np.nan],
[3, 4, 5, 6]
]
), columns=['col1', 'col2', 'col3', 'col4'])
print("df6.mean(): \n", df6.mean(axis=0)) # axis=0 is the default
print("df6.apply(np.cumsum): \n", df6.apply(np.cumsum))
print("df6.apply(lambda x: x.max()-x.min()): \n", df6.apply(lambda x: x.max()-x.min()))
df6.mean():
col1 2.000000
col2 3.500000
col3 3.666667
col4 4.500000
dtype: float64
df6.apply(np.cumsum):
col1 col2 col3 col4
0 1.0 NaN 2.0 3.0
1 3.0 3.0 6.0 NaN
2 6.0 7.0 11.0 9.0
df6.apply(lambda x: x.max()-x.min()):
col1 2.0
col2 1.0
col3 3.0
col4 3.0
dtype: float64
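All the calls above work column by column. As a sketch of the other direction, passing axis=1 applies the same function to each row instead. The frame repeats the df6 data under the hypothetical name df.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, 2, 3], [2, 3, 4, np.nan], [3, 4, 5, 6]],
    columns=["col1", "col2", "col3", "col4"],
)

# Default axis=0: the function receives one column at a time.
col_range = df.apply(lambda s: s.max() - s.min())

# axis=1: the function receives one row at a time.
row_range = df.apply(lambda s: s.max() - s.min(), axis=1)

# Reductions take the same axis argument; NaN is skipped by default.
row_mean = df.mean(axis=1)

print(col_range, row_range, row_mean, sep="\n")
```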
series7 = Series(np.random.randint(0, 7, size=10))
print("series7: \n", series7)
print("series7.value_counts(): \n", series7.value_counts()) # returns a Series
series7:
0 3
1 3
2 5
3 5
4 4
5 4
6 3
7 4
8 1
9 5
dtype: int32
series7.value_counts():
5 3
4 3
3 3
1 1
dtype: int64
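Two common variations on value_counts() are sketched below: relative frequencies via normalize=True, and ordering the result by value rather than by count. The input reuses the numbers printed above, with the fixed list standing in for the random draw.

```python
import pandas as pd

s = pd.Series([3, 3, 5, 5, 4, 4, 3, 4, 1, 5])

counts = s.value_counts()                # ordered by count, descending
shares = s.value_counts(normalize=True)  # fractions that sum to 1
by_value = counts.sort_index()           # ordered by the value itself

print(counts, shares, by_value, sep="\n")
```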