Basic usage of DataFrame

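Definition

A DataFrame resembles a two-dimensional array (a table). It consists of a block of data (similar to a 2-D NumPy array) and two sets of labels: the row index and the column index.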

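Creation

DataFrame(data [, index=row labels [, columns=column labels [, dtype=data type]]])

Note: data may be a nested (two-dimensional) list, a 2-D NumPy array, or a dict. With a dict, each value is a one-dimensional list holding one column, and the keys become the column labels. The data must describe a table: input with more than two dimensions, for example a 3-D array, raises an error (the original reports "Data must be 2-dimensional"; the exact wording depends on the pandas version).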

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
print(DataFrame(np.arange(6).reshape([2, 3])))
print("#" * 30)
print(DataFrame(np.arange(6).reshape([2, 3]), index=["row1", "row2"], columns=["col1", "col2", "col3"], dtype=float))
   0  1  2
0  0  1  2
1  3  4  5
##############################
      col1  col2  col3
row1   0.0   1.0   2.0
row2   3.0   4.0   5.0
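
The note above also allows a nested list or a dict as input. The following is a minimal sketch of all three forms, assuming a reasonably recent pandas; the labels r1, r2, a, b and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a nested list: each inner list is one row.
from_list = pd.DataFrame([[1, 2], [3, 4]], index=["r1", "r2"], columns=["a", "b"])

# From a dict: keys become column labels, each value holds one column.
from_dict = pd.DataFrame({"a": [1, 3], "b": [2, 4]}, index=["r1", "r2"])

# From a 2-D NumPy array: same layout as the nested list.
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]),
                          index=["r1", "r2"], columns=["a", "b"])

print(from_list)
print(from_dict)
print(from_array)
```

All three calls describe the same table; the dict form is the only one that is written column by column.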

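Attributes

A DataFrame exposes index, columns, values, and dtypes (one type per column). It has no built-in dtype or name attribute.

The dtype passed at creation is applied to the data, but reading df.dtype raises AttributeError; read df.dtypes instead. Assigning df0.dtype = int, as the code here does, only attaches an ordinary Python attribute to the object and does not convert the data.

Likewise, name is not a constructor parameter, and reading df.name raises AttributeError until it has been assigned. An assigned name is an ad hoc Python attribute that pandas does not manage, so it is generally not carried over to new DataFrames derived from this one. The index, by contrast, has a real name attribute.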

df0 = DataFrame(np.arange(6).reshape([2, 3]), index=["row1", "row2"], columns=["col1", "col2", "col3"], dtype=float)

# print(df0.dtype)  # AttributeError: 'DataFrame' object has no attribute 'dtype'
df0.dtype = int  # attaches a plain Python attribute; the data itself stays float64
print(df0.dtype)

print("##############################")
print(df0.index)

print("##############################")
print(df0.values)  # NumPy array

print("#" * 30)
# print(df0.name) # 'DataFrame' object has no attribute 'name'
df0.name = "first dataframe"
print(df0.name)

print(df0.index.name)
df0.index.name = "idx"
print("##############################")
print(df0)

<class 'int'>
##############################
Index(['row1', 'row2'], dtype='object')
##############################
[[ 0.  1.  2.]
 [ 3.  4.  5.]]
##############################
first dataframe
None
##############################
      col1  col2  col3
idx                   
row1   0.0   1.0   2.0
row2   3.0   4.0   5.0
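
To inspect or change the column types without the ad hoc attribute assignment used above, pandas offers dtypes and astype. A minimal sketch, reusing the shape and labels of df0; the variable names df and as_int are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=["row1", "row2"],
                  columns=["col1", "col2", "col3"],
                  dtype=float)

print(df.dtypes)    # one dtype per column
print(df.columns)   # column labels
print(df.shape)     # (number of rows, number of columns)

# astype returns a converted copy; the original keeps its float columns.
as_int = df.astype(int)
print(as_int.dtypes)
```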

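Create, read, update, delete

Basic lookup

1. Plain indexing, df[key], selects a column by label. To start from a row instead, the original uses ix.
Note: ix was deprecated in pandas 0.20 and removed in 1.0; on current versions use loc (labels) or iloc (positions).
Note: plain indexing accepts neither a position (unless the column labels themselves are integers) nor the form df[xx, xx]; ix accepted positions.
           Only plain indexing goes column first, then row.
2. With loc, the first position is the row label and the second is the column label.
Note: loc accepts labels only, not positions.
With iloc, the first position is the row position and the second is the column position; labels are not accepted.
3. Fast access to a single value
df.iat[row position, column position]     row first, then column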

df1 = DataFrame({"语文": [80, 87, 76], "数学": [98, 94, 97]}, index=["小李", "小张", "小红"])
print("df1: \n", df1)

print("#" * 30)
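print("df1['语文']: \n", df1['语文'])  # Series

# print("#" * 30)
# print("df1[0]: \n", df1[0]) # KeyError: 0

print("#" * 30)
# ix needs pandas < 1.0; on current versions use df1.loc['小红', '语文']
print("df1.ix['小红', '语文']: \n", df1.ix['小红', '语文'])  # scalar, not a Series

print("#" * 30)
# positional form; on current versions use df1.iloc[2, 1]
print("df1.ix[2, 1]: \n", df1.ix[2, 1])  # scalar, not a Series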

print("#" * 30)
print("df1.loc['小张']: \n", df1.loc['小张'])

print("#" * 30)
print("df1.iloc[1]: \n", df1.iloc[1])

print("#" * 30)
print("df1.iat[0, 0]: \n", df1.iat[0, 0])
df1: 
     数学  语文
小李  98  80
小张  94  87
小红  97  76
##############################
df1['语文']: 
 小李    80
小张    87
小红    76
Name: 语文, dtype: int64
##############################
df1.ix['小红', '语文']: 
 76
##############################
df1.ix[2, 1]: 
 76
##############################
df1.loc['小张']: 
 数学    94
语文    87
Name: 小张, dtype: int64
##############################
df1.iloc[1]: 
 数学    94
语文    87
Name: 小张, dtype: int64
##############################
df1.iat[0, 0]: 
 98
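
For pandas versions without ix, the same lookups can be written with loc, iloc, at, and iat. This is a sketch under assumptions: the frame scores and its English labels are invented stand-ins for df1, and the column order follows the dict, as it does on recent pandas.

```python
import pandas as pd

scores = pd.DataFrame({"math": [98, 94, 97], "chinese": [80, 87, 76]},
                      index=["li", "zhang", "hong"])

col = scores["chinese"]                        # column by label -> Series
row_by_label = scores.loc["zhang"]             # row by label -> Series
row_by_pos = scores.iloc[1]                    # row by position -> Series
cell_by_label = scores.at["hong", "chinese"]   # single value, label-based
cell_by_pos = scores.iat[2, 1]                 # single value, position-based

print(col)
print(row_by_label)
print(cell_by_label, cell_by_pos)
```

at is the label-based counterpart of iat; both return one value and skip the overhead of building a Series.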

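Slice queries

1. A slice always selects rows, with or without ix.
Note: 1. A plain slice accepts both labels and positions, as in df1["小张":"小红"] and df1[1:2]; ix accepted both as well.
           2. A positional slice is half-open (start included, stop excluded), whereas a label slice includes both ends.

2. With loc, the first position is the row label and the second is the column label. If both positions are slices (or lists), the result is a DataFrame; if one is a slice or list and the other a single label, the result is a Series; if both are single labels, the result is a scalar.
Note: slices and keys passed to loc must be labels, not positions.
Difference: selecting by label gives a closed interval; selecting by position gives a half-open interval.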

df1 = DataFrame({"语文": [80, 87, 76], "数学": [98, 94, 97]}, index=["小李", "小张", "小红"])
print("df1: \n", df1)

print("#" * 30)
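print("df1['小张':'小红']: \n", df1["小张":"小红"])  # DataFrame

print("#" * 40)
print('df1[1:2]: \n', df1[1:2])  # DataFrame

print("#" * 40)
# ix needs pandas < 1.0; on current versions use df1.loc["小张":"小红", "数学"]
print("df1.ix['小张':'小红']: \n", df1.ix["小张":'小红', '数学'])  # Series

print("#" * 40)
# on current versions use df1.iloc[1:, 0]
print("df1.ix[1:]: \n", df1.ix[1:, 0])  # Series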

np2 = np.random.random([5, 4])
df2 = DataFrame(np2, index=['idx1', 'idx2', 'idx3', 'idx4', 'idx5'], columns=['col1', 'col2', 'col3', 'col4'])
print("df2: \n", df2, "\n")
print("df2.loc['idx2': 'idx4', ['col3']]: \n", df2.loc['idx2': 'idx4', ["col3"]], "\n")  # DataFrame
print("df2.loc['idx4', ['col3']]: \n", df2.loc['idx4', ["col3"]], "\n")  # Series
print("df2.loc['idx4', 'col3']: \n", df2.loc['idx4', "col3"], "\n")  # scalar

print("#" * 30)
print(df2.iloc[2])
print(df2.iloc[1:3])
print(df2.iloc[1:3, 1:3])
df1: 
     数学  语文
小李  98  80
小张  94  87
小红  97  76
##############################
df1['小张':'小红']: 
     数学  语文
小张  94  87
小红  97  76
########################################
df1[1:2]: 
     数学  语文
小张  94  87
########################################
df1.ix['小张':'小红']: 
 小张    94
小红    97
Name: 数学, dtype: int64
########################################
df1.ix[1:]: 
 小张    94
小红    97
Name: 数学, dtype: int64
df2: 
           col1      col2      col3      col4
idx1  0.075806  0.946859  0.039281  0.319763
idx2  0.725321  0.588884  0.297942  0.218208
idx3  0.664639  0.750553  0.666202  0.401805
idx4  0.588873  0.679760  0.463870  0.016034
idx5  0.794644  0.337072  0.804746  0.734267 

df2.loc['idx2': 'idx4', ['col3']]: 
           col3
idx2  0.297942
idx3  0.666202
idx4  0.463870 

df2.loc['idx4', ['col3']]: 
 col3    0.46387
Name: idx4, dtype: float64 

df2.loc['idx4', 'col3']: 
 0.463869942466 

##############################
col1    0.664639
col2    0.750553
col3    0.666202
col4    0.401805
Name: idx3, dtype: float64
          col1      col2      col3      col4
idx2  0.725321  0.588884  0.297942  0.218208
idx3  0.664639  0.750553  0.666202  0.401805
          col2      col3
idx2  0.588884  0.297942
idx3  0.750553  0.666202
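
The closed versus half-open difference is easy to get wrong, so here is a small sketch with fixed values instead of random ones. The frame and its contents are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [10, 20, 30, 40]},
                  index=["idx1", "idx2", "idx3", "idx4"])

by_label = df.loc["idx2":"idx3"]   # label slice: both ends are included
by_pos = df.iloc[1:3]              # positional slice: the stop position is excluded
one_row = df.iloc[1:2]             # still a DataFrame, with a single row

print(by_label.index.tolist())
print(by_pos.index.tolist())
print(one_row.index.tolist())

# The return type depends on how many slices are used.
print(type(df.loc["idx2":"idx3", "col1"]).__name__)   # slice + label
print(df.loc["idx2", "col1"])                         # label + label
```

Here loc["idx2":"idx3"] and iloc[1:3] select the same two rows, even though the positional stop is one past the last row returned.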

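Conditional queries

1. Filter rows by a condition on one column: df[df.col > x] keeps the rows where the condition holds.

2. Filter the whole frame element by element: df[df > x] keeps the values that satisfy the condition and replaces the rest with NaN.

3. Filter by membership with isin().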

print("df2: \n", df2, "\n")
print("df2[df2.col1>0.6]: \n", df2[df2.col1>0.6])
print("df2[df2 > 0.6]: \n", df2[df2 > 0.6], '\n')
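# Empty result: the printed values are rounded, so an exact comparison with the stored floats finds no match.
print("df2[df2.col3.isin([0.666202: 0.804746])]: \n", df2[df2.col3.isin([0.666202, 0.804746])])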
df2: 
           col1      col2      col3      col4
idx1  0.075806  0.946859  0.039281  0.319763
idx2  0.725321  0.588884  0.297942  0.218208
idx3  0.664639  0.750553  0.666202  0.401805
idx4  0.588873  0.679760  0.463870  0.016034
idx5  0.794644  0.337072  0.804746  0.734267 

df2[df2.col1>0.6]: 
           col1      col2      col3      col4
idx2  0.725321  0.588884  0.297942  0.218208
idx3  0.664639  0.750553  0.666202  0.401805
idx5  0.794644  0.337072  0.804746  0.734267
df2[df2 > 0.6]: 
           col1      col2      col3      col4
idx1       NaN  0.946859       NaN       NaN
idx2  0.725321       NaN       NaN       NaN
idx3  0.664639  0.750553  0.666202       NaN
idx4       NaN  0.679760       NaN       NaN
idx5  0.794644       NaN  0.804746  0.734267 

df2[df2.col3.isin([0.666202: 0.804746])]: 
 Empty DataFrame
Columns: [col1, col2, col3, col4]
Index: []
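
The empty result above comes from comparing floats for exact equality. One possible workaround, sketched here with invented values, is a tolerance-based comparison with np.isclose or a range filter with between; neither is taken from the original text.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col3": [0.1 + 0.2, 0.5, 0.75]})

# 0.1 + 0.2 is stored as a value slightly different from 0.3, so isin finds nothing.
exact = df[df["col3"].isin([0.3])]

# Tolerance-based match: values within a small distance of 0.3 are accepted.
close = df[np.isclose(df["col3"], 0.3)]

# Range filter: both bounds are included by default.
in_range = df[df["col3"].between(0.4, 0.8)]

print(len(exact), len(close), len(in_range))
```

isin remains a good fit for integers, strings, and other values that compare exactly.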

Adding

Adding a column: df[new_col] = value, where value is a scalar or a one-dimensional object whose length must match the number of rows.

df1["英语"] = 1 
print(df1)
     数学  语文  英语
小李  98  80   1
小张  94  87   1
小红  97  76   1
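
The kinds of value accepted on the right-hand side behave differently, which the following sketch illustrates. The frame, the column names, and the numbers are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"math": [98, 94, 97]}, index=["li", "zhang", "hong"])

df["english"] = 1                # scalar: repeated for every row
df["chinese"] = [80, 87, 76]     # list: length must equal the number of rows

# A Series is aligned on the index: order does not matter, missing labels become NaN.
df["art"] = pd.Series({"hong": 90, "li": 85})

try:
    df["bad"] = [1, 2]           # wrong length for a three-row frame
except ValueError as exc:
    print("ValueError:", exc)

print(df)
```

Because a Series is aligned by label rather than by position, it is the one case where the length does not have to match.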

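Handling missing values

reindex() conforms the data to a new list of labels on the given axis, which lets you add, drop, or reorder labels.

Note: reindex matches by label. Any new label that does not exist in the original, such as the integers 0 to 4 used in place of 'idx1' to 'idx5', yields a row or column filled with NaN.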

np4 = np.random.random([5, 4])
df4 = DataFrame(np4, index=['idx1', 'idx2', 'idx3', 'idx4', 'idx5'], columns=['col1', 'col2', 'col3', 'col4'])
print("df4: \n", df4, "\n")
df5 = df4.reindex(index=[i for i in range(5)], columns=(list(df4.columns)+['other'])) # none of the integer row labels exist in df4, so every value is NaN
print("df5: \n", df5, "\n")
df5 = df4.reindex(index=['idx1', 'idx2', 'idx3', 'idx4', 'idx5', 'idx6'], columns=list(df4.columns)+["other"]) # only 'idx6' and 'other' are new, so only that row and that column are NaN
print("df5: \n", df5, "\n")
df4: 
           col1      col2      col3      col4
idx1  0.904730  0.083906  0.355150  0.158472
idx2  0.173810  0.902371  0.795704  0.302842
idx3  0.875474  0.940805  0.772010  0.909598
idx4  0.338701  0.384996  0.397991  0.673116
idx5  0.607232  0.821274  0.885462  0.590894 

df5: 
    col1  col2  col3  col4  other
0   NaN   NaN   NaN   NaN    NaN
1   NaN   NaN   NaN   NaN    NaN
2   NaN   NaN   NaN   NaN    NaN
3   NaN   NaN   NaN   NaN    NaN
4   NaN   NaN   NaN   NaN    NaN 

df5: 
           col1      col2      col3      col4  other
idx1  0.904730  0.083906  0.355150  0.158472    NaN
idx2  0.173810  0.902371  0.795704  0.302842    NaN
idx3  0.875474  0.940805  0.772010  0.909598    NaN
idx4  0.338701  0.384996  0.397991  0.673116    NaN
idx5  0.607232  0.821274  0.885462  0.590894    NaN
idx6       NaN       NaN       NaN       NaN    NaN 
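
If the goal is a default other than NaN, or new labels for existing rows, reindex alone is not enough. A minimal sketch with a made-up two-row frame, using the fill_value parameter of reindex and the rename method:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1.0, 2.0]}, index=["idx1", "idx2"])

# reindex looks labels up; 'idx3' does not exist, so it receives fill_value.
extended = df.reindex(index=["idx1", "idx2", "idx3"], fill_value=0.0)

# To relabel rows while keeping their data, use rename instead of reindex.
renamed = df.rename(index={"idx1": 0, "idx2": 1})

# For comparison: reindexing to labels that do not exist loses all the data.
all_nan = df.reindex(index=[0, 1])

print(extended)
print(renamed)
print(all_nan)
```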

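Handling NaN values

1. Drop rows containing NaN: df.dropna(how='any' or 'all')

2. Fill NaN: df.fillna(value='XX'). The value argument may also be a dict that gives a different fill value for each column.

3. Build a boolean mask of NaN positions: df.isnull() and df.notnull()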

df5 = DataFrame(np.array(
    [
        [1,np.nan,2,3],
        [2,3,4,np.nan],
        [3, 4, 5, 6]
    ]
))
df51 = df5.dropna(how='any')
print("df5.dropna(how='any'): \n", df51)
df52 = df5.dropna(how='all') # a row is dropped only if all of its values are NaN
print("df5.dropna(how='all'): \n", df52)
df5.dropna(how='any'): 
      0    1    2    3
2  3.0  4.0  5.0  6.0
df5.dropna(how='all'): 
      0    1    2    3
0  1.0  NaN  2.0  3.0
1  2.0  3.0  4.0  NaN
2  3.0  4.0  5.0  6.0
df5 = DataFrame(np.array(
    [
        [1,np.nan,2,3],
        [2,3,4,np.nan],
        [3, 4, 5, 6]
    ]
))
df53 = df5.fillna(value=0)
print("df5.fillna(value=0): \n", df53)
df54 = df5.fillna(value={1:1, 3:2})
print("df5.fillna(value={1:1, 3:2}): \n", df54)
df5.fillna(value=0): 
      0    1    2    3
0  1.0  0.0  2.0  3.0
1  2.0  3.0  4.0  0.0
2  3.0  4.0  5.0  6.0
df5.fillna(value={1:1, 3:2}): 
      0    1    2    3
0  1.0  1.0  2.0  3.0
1  2.0  3.0  4.0  2.0
2  3.0  4.0  5.0  6.0
print("df5.isnull(): \n", df5.isnull())
print("df5.notnull(): \n", df5.notnull())
df5.isnull(): 
        0      1      2      3
0  False   True  False  False
1  False  False  False   True
2  False  False  False  False
df5.notnull(): 
       0      1     2      3
0  True  False  True   True
1  True   True  True  False
2  True   True  True   True
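
A few variations on the same three operations may be useful in practice. The sketch below reuses the values of df5 but adds column names a to d, which are an assumption made for readability.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, 3],
                   [2, 3, 4, np.nan],
                   [3, 4, 5, 6]], columns=["a", "b", "c", "d"])

cols_without_nan = df.dropna(axis=1)          # drop columns instead of rows
rows_checked_on_b = df.dropna(subset=["b"])   # only column 'b' decides
filled_with_mean = df.fillna(df.mean())       # each column filled with its own mean
nan_per_column = df.isna().sum()              # number of missing values per column

print(cols_without_nan)
print(rows_checked_on_b)
print(filled_with_mean)
print(nan_per_column)
```

isna is an alias of isnull, so either spelling produces the same mask.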

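Statistics on the data

Note: the by argument of df.sort_values() must be a label (a column label when axis=0, a row label when axis=1), not a position.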

np2 = np.random.random((6, 4))
df2 = DataFrame(np2)
print("df2.head(): \n", df2.head())
print("df2.tail(): \n", df2.tail())
print("df2.describe(): \n", df2.describe())
df2.columns = ['a','c','d','b']
print("df2: \n", df2)
print("df2.T: \n", df2.T)
df2 = df2.sort_index(axis=1, ascending=True)
print("df2.sort_index(axis=1, ascending=True): \n", df2)

print("df2.sort_values(by='b', ascending=True, axis=0): \n", df2.sort_values(by='b', ascending=True, axis=0)) 

df2.index = ['idx1', 'idx2', 'idx3', 'idx4', 'idx5', 'idx6']
print("df2.sort_values(by='idx3', ascending=True, axis=1): \n", df2.sort_values(by='idx3', ascending=True, axis=1))
df2.head(): 
           0         1         2         3
0  0.713102  0.809143  0.511408  0.671801
1  0.344587  0.866291  0.332197  0.526176
2  0.661047  0.016959  0.391796  0.038333
3  0.437430  0.527497  0.506293  0.949712
4  0.096558  0.413851  0.572861  0.257213
df2.tail(): 
           0         1         2         3
1  0.344587  0.866291  0.332197  0.526176
2  0.661047  0.016959  0.391796  0.038333
3  0.437430  0.527497  0.506293  0.949712
4  0.096558  0.413851  0.572861  0.257213
5  0.718657  0.023692  0.722784  0.459101
df2.describe(): 
               0         1         2         3
count  6.000000  6.000000  6.000000  6.000000
mean   0.495230  0.442905  0.506223  0.483723
std    0.248941  0.368390  0.137655  0.317681
min    0.096558  0.016959  0.332197  0.038333
25%    0.367797  0.121232  0.420420  0.307685
50%    0.549238  0.470674  0.508850  0.492638
75%    0.700088  0.738731  0.557498  0.635394
max    0.718657  0.866291  0.722784  0.949712
df2: 
           a         c         d         b
0  0.713102  0.809143  0.511408  0.671801
1  0.344587  0.866291  0.332197  0.526176
2  0.661047  0.016959  0.391796  0.038333
3  0.437430  0.527497  0.506293  0.949712
4  0.096558  0.413851  0.572861  0.257213
5  0.718657  0.023692  0.722784  0.459101
df2.T: 
           0         1         2         3         4         5
a  0.713102  0.344587  0.661047  0.437430  0.096558  0.718657
c  0.809143  0.866291  0.016959  0.527497  0.413851  0.023692
d  0.511408  0.332197  0.391796  0.506293  0.572861  0.722784
b  0.671801  0.526176  0.038333  0.949712  0.257213  0.459101
df2.sort_index(axis=1, ascending=True): 
           a         b         c         d
0  0.713102  0.671801  0.809143  0.511408
1  0.344587  0.526176  0.866291  0.332197
2  0.661047  0.038333  0.016959  0.391796
3  0.437430  0.949712  0.527497  0.506293
4  0.096558  0.257213  0.413851  0.572861
5  0.718657  0.459101  0.023692  0.722784
df2.sort_values(by='b', ascending=True, axis=0): 
           a         b         c         d
2  0.661047  0.038333  0.016959  0.391796
4  0.096558  0.257213  0.413851  0.572861
5  0.718657  0.459101  0.023692  0.722784
1  0.344587  0.526176  0.866291  0.332197
0  0.713102  0.671801  0.809143  0.511408
3  0.437430  0.949712  0.527497  0.506293
df2.sort_values(by='idx3', ascending=True, axis=1): 
              c         b         d         a
idx1  0.809143  0.671801  0.511408  0.713102
idx2  0.866291  0.526176  0.332197  0.344587
idx3  0.016959  0.038333  0.391796  0.661047
idx4  0.527497  0.949712  0.506293  0.437430
idx5  0.413851  0.257213  0.572861  0.096558
idx6  0.023692  0.459101  0.722784  0.718657
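
Because the data above is random, the sort order is hard to check by eye. The following sketch repeats the idea on a small made-up frame with fixed values, and also shows sorting by several columns and the error raised for a positional by.

```python
import pandas as pd

df = pd.DataFrame({"a": [2.0, 1.0, 2.0], "b": [0.5, 0.9, 0.1]},
                  index=["idx1", "idx2", "idx3"])

by_b_desc = df.sort_values(by="b", ascending=False)   # rows ordered by column 'b'
by_a_then_b = df.sort_values(by=["a", "b"])           # ties in 'a' are broken by 'b'
cols_by_row = df.sort_values(by="idx2", axis=1)       # columns ordered by row 'idx2'

print(by_b_desc.index.tolist())
print(by_a_then_b.index.tolist())
print(cols_by_row.columns.tolist())

try:
    df.sort_values(by=0)   # 0 is a position, not a column label of this frame
except KeyError as exc:
    print("KeyError:", exc)
```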
df6 = DataFrame(np.array(
    [
        [1,np.nan,2,3],
        [2,3,4,np.nan],
        [3, 4, 5, 6]
    ]
), columns=['col1', 'col2', 'col3', 'col4'])
print("df6.mean(): \n", df6.mean(axis=0)) # axis=0 is the default
print("df6.apply(np.cumsum): \n", df6.apply(np.cumsum))
print("df6.apply(lambda x: x.max()-x.min()): \n", df6.apply(lambda x: x.max()-x.min()))
df6.mean(): 
 col1    2.000000
col2    3.500000
col3    3.666667
col4    4.500000
dtype: float64
df6.apply(np.cumsum): 
    col1  col2  col3  col4
0   1.0   NaN   2.0   3.0
1   3.0   3.0   6.0   NaN
2   6.0   7.0  11.0   9.0
df6.apply(lambda x: x.max()-x.min()): 
 col1    2.0
col2    1.0
col3    3.0
col4    3.0
dtype: float64
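
The calls above work column by column. The same reductions can run across each row with axis=1, and skipna controls whether NaN is ignored. A sketch reusing the values of df6; the result variable names are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, 3],
                   [2, 3, 4, np.nan],
                   [3, 4, 5, 6]], columns=["col1", "col2", "col3", "col4"])

row_mean = df.mean(axis=1)                        # NaN is skipped by default
row_mean_strict = df.mean(axis=1, skipna=False)   # any NaN makes the result NaN

# With axis=1 the function receives one row at a time as a Series.
row_range = df.apply(lambda r: r.max() - r.min(), axis=1)

print(row_mean)
print(row_mean_strict)
print(row_range)
```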
series7 = Series(np.random.randint(0, 7, size=10))
print("series7: \n", series7)
print("series7.value_counts(): \n", series7.value_counts()) # returns a Series
series7: 
 0    3
1    3
2    5
3    5
4    4
5    4
6    3
7    4
8    1
9    5
dtype: int32
series7.value_counts(): 
 5    3
4    3
3    3
1    1
dtype: int64
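
value_counts can also report relative frequencies or be reordered by value. A short sketch using the ten values printed above as fixed input, so the result does not depend on the random draw:

```python
import pandas as pd

s = pd.Series([3, 3, 5, 5, 4, 4, 3, 4, 1, 5])

counts = s.value_counts()                  # ordered by count, largest first
shares = s.value_counts(normalize=True)    # fractions of the total instead of counts
by_value = s.value_counts().sort_index()   # ordered by the value itself

print(counts)
print(shares)
print(by_value)
```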
