pandas Basics Tutorial

First, import the required modules:

import numpy as np
import pandas as pd

1. DataFrame

1.1 Creating a Series

If index is not specified, the entries are numbered starting from 0.

s = pd.Series([1, 2, 3, np.nan], index=['A', 'B', 'C', 'D'])
print(s)

Output
A 1.0
B 2.0
C 3.0
D NaN
dtype: float64
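
As a small illustration of the default numbering, the sketch below builds one Series without an index and one with labels. The names s_default and s_labeled and the sample values are made up for this example.

```python
import pandas as pd

# No index passed: pandas assigns the default positions 0, 1, 2, ...
s_default = pd.Series([10, 20, 30])
print(s_default.index.tolist())   # [0, 1, 2]

# Explicit labels: values are then looked up by label
s_labeled = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
print(s_labeled['y'])             # 20
```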

1.2 Creating a date range

At least two of start, end, and periods must be given; freq defaults to 'D' (one day).

dates = pd.date_range('20180101', periods=5)
print(dates)

Output
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05'],
              dtype='datetime64[ns]', freq='D')
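
The following sketch shows two ways of describing the same five days, plus an explicit freq. The variable names are illustrative only.

```python
import pandas as pd

# start + periods (freq defaults to 'D', one calendar day)
by_periods = pd.date_range(start='2018-01-01', periods=5)

# start + end describes the same five days
by_end = pd.date_range(start='2018-01-01', end='2018-01-05')
print(by_periods.equals(by_end))  # True

# An explicit freq changes the step; '2D' means every two days
every_two_days = pd.date_range(start='2018-01-01', periods=3, freq='2D')
print(every_two_days.strftime('%Y-%m-%d').tolist())
# ['2018-01-01', '2018-01-03', '2018-01-05']
```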

1.3 Creating a DataFrame

df = pd.DataFrame(np.random.rand(3, 4), columns=['a', 'b', 'c', 'd'])
print(df)

Output
a b c d
0 0.233310 0.170256 0.036988 0.697916
1 0.159580 0.287814 0.528123 0.956051
2 0.815038 0.438103 0.143477 0.769143

Creating from a dictionary (the keys become the column names):

df = pd.DataFrame({'A': 1,
                   'B': pd.Timestamp('20171208'),
                   'C': pd.Series(np.arange(4)),
                   'D': pd.Categorical(['test', 'train', 'test', 'train'])})
print(df)

Output
A B C D
0 1 2017-12-08 0 test
1 1 2017-12-08 1 train
2 1 2017-12-08 2 test
3 1 2017-12-08 3 train

1.4 DataFrame attributes

df = pd.DataFrame({'A': 1,
                   'B': pd.Timestamp('20171208'),
                   'C': pd.Series(np.arange(4)),
                   'D': pd.Categorical(['test', 'train', 'test', 'train'])})

View the data type of each column

print(df.dtypes)

Output
A int64
B datetime64[ns]
C int32
D category
dtype: object

View the index

print(df.index)

Output
RangeIndex(start=0, stop=4, step=1)

View the column names

print(df.columns)

Output
Index(['A', 'B', 'C', 'D'], dtype='object')

View the data

print(df.values)

Output
[[1 Timestamp('2017-12-08 00:00:00') 0 'test']
 [1 Timestamp('2017-12-08 00:00:00') 1 'train']
 [1 Timestamp('2017-12-08 00:00:00') 2 'test']
 [1 Timestamp('2017-12-08 00:00:00') 3 'train']]

DataFrame summary statistics

print(df.describe())

Output
A C
count 4.0 4.000000
mean 1.0 1.500000
std 0.0 1.290994
min 1.0 0.000000
25% 1.0 0.750000
50% 1.0 1.500000
75% 1.0 2.250000
max 1.0 3.000000

1.5 Common DataFrame operations

Transpose

df = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=['a', 'b', 'c', 'd'])
print(df.T)

Output
0 1 2
a 0 4 8
b 1 5 9
c 2 6 10
d 3 7 11

Sort by index (axis=1 sorts the column labels)

df = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=['a', 'b', 'c', 'd'])
print(df.sort_index(axis=1, ascending=False))

Output
d c b a
0 3 2 1 0
1 7 6 5 4
2 11 10 9 8

Sort by values

df = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=['a', 'b', 'c', 'd'])
print(df.sort_values(by='b', ascending=False))

Output
a b c d
2 8 9 10 11
1 4 5 6 7
0 0 1 2 3
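
sort_values also accepts a list of columns, and it returns a new object rather than changing the original. A minimal sketch, using a made-up frame with hypothetical columns group and score:

```python
import pandas as pd

# Hypothetical data for illustration
scores = pd.DataFrame({'group': ['x', 'y', 'x', 'y'],
                       'score': [3, 1, 2, 4]})

# Sort by group ascending, then by score descending within each group
ordered = scores.sort_values(by=['group', 'score'], ascending=[True, False])
print(ordered)

# The original frame keeps its row order
print(scores.index.tolist())   # [0, 1, 2, 3]
```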

2. Selecting data

Create the following DataFrame:

dates = pd.date_range('20180101', periods=3)
df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=dates, columns=['a', 'b', 'c', 'd'])
print(df)

Output
a b c d
2018-01-01 0 1 2 3
2018-01-02 4 5 6 7
2018-01-03 8 9 10 11

2.1 Selecting columns

print(df['a'])  # equivalent: print(df.a)

Output
2018-01-01 0
2018-01-02 4
2018-01-03 8

2.2 Selecting rows

print(df[0:2])  # equivalent: print(df['20180101':'20180102'])

Output
a b c d
2018-01-01 0 1 2 3
2018-01-02 4 5 6 7

2.3 Selection by label: loc

print(df.loc[:, ['a', 'b']])

Output
a b
2018-01-01 0 1
2018-01-02 4 5
2018-01-03 8 9

print(df.loc['20180102'])

Output
a 4
b 5
c 6
d 7
Name: 2018-01-02 00:00:00, dtype: int32

2.4 Selection by position: iloc

print(df.iloc[[0, 2], 2:4])

Output
c d
2018-01-01 2 3
2018-01-03 10 11
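
One difference between the two indexers is easy to miss: a loc slice includes both end labels, while an iloc slice excludes the end position, like ordinary Python slicing. A short sketch on the same kind of frame (by_label and by_position are illustrative names):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180101', periods=3)
df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=dates, columns=['a', 'b', 'c', 'd'])

# Label slices include both ends: 2 rows, columns a, b and c
by_label = df.loc['2018-01-01':'2018-01-02', 'a':'c']
print(by_label.shape)      # (2, 3)

# Position slices exclude the end: rows 0-1, columns 0-1
by_position = df.iloc[0:2, 0:2]
print(by_position.shape)   # (2, 2)
```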

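2.5 Mixing labels and positions

The ix indexer that once handled this was deprecated in pandas 0.20 and removed in 1.0; chain iloc and loc instead.

print(df.iloc[:2].loc[:, ['c', 'd']])

Output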
c d
2018-01-01 2 3
2018-01-02 6 7

2.6 Boolean selection

print(df[df.a < 5])

Output
a b c d
2018-01-01 0 1 2 3
2018-01-02 4 5 6 7
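
Boolean conditions can be combined. The sketch below assumes the same frame as above and uses & (and), ~ (not), and isin; the parentheses around each comparison are required because & binds more tightly than the comparison operators.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180101', periods=3)
df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=dates, columns=['a', 'b', 'c', 'd'])

# Both conditions must hold
both = df[(df['a'] > 0) & (df['d'] < 11)]
print(both['a'].tolist())       # [4]

# Rows whose b value is in a given list
in_list = df[df['b'].isin([1, 9])]
print(in_list['a'].tolist())    # [0, 8]

# Negate a condition with ~
negated = df[~(df['a'] < 5)]
print(negated['a'].tolist())    # [8]
```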

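3. Handling NaN data

First, create a DataFrame that contains NaN:

dates = pd.date_range('20180101', periods=3)
df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=dates, columns=['a', 'b', 'c', 'd'])
# NaN is a float, so convert the target columns before assigning it
df = df.astype({'b': 'float64', 'c': 'float64'})
df.iloc[1, 1], df.iloc[2, 2] = np.nan, np.nan
print(df)

Output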
a b c d
2018-01-01 0 1.0 2.0 3
2018-01-02 4 NaN 6.0 7
2018-01-03 8 9.0 NaN 11

3.1 Dropping NaN data

print(df.dropna(axis=1))  # how: 'any' (default) or 'all'

Output
a d
2018-01-01 0 3
2018-01-02 4 7
2018-01-03 8 11
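
To make the how options concrete, here is a sketch that drops rows rather than columns. It rebuilds a similar frame with float values so that NaN can be assigned directly; the subset argument restricts the check to the listed columns.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180101', periods=3)
df = pd.DataFrame(np.arange(12, dtype=float).reshape((3, 4)),
                  index=dates, columns=['a', 'b', 'c', 'd'])
df.iloc[1, 1] = np.nan
df.iloc[2, 2] = np.nan

# Default: axis=0, how='any' drops every row containing at least one NaN
any_dropped = df.dropna()
print(any_dropped.shape)   # (1, 4)

# how='all' drops only rows where every value is NaN (none here)
all_dropped = df.dropna(how='all')
print(all_dropped.shape)   # (3, 4)

# subset limits the check to column b
b_checked = df.dropna(subset=['b'])
print(b_checked.shape)     # (2, 4)
```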

3.2 Filling NaN data

print(df.fillna(value='*'))

Output
a b c d
2018-01-01 0 1 2 3
2018-01-02 4 * 6 7
2018-01-03 8 9 * 11
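
Filling with a string turns the column into object dtype, so numeric fills are usually more practical. The sketch below is one possible approach, not the only one: a per-column dictionary of fill values, and a forward fill that copies the previous row's value.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180101', periods=3)
df = pd.DataFrame(np.arange(12, dtype=float).reshape((3, 4)),
                  index=dates, columns=['a', 'b', 'c', 'd'])
df.iloc[1, 1] = np.nan
df.iloc[2, 2] = np.nan

# A dict gives each column its own fill value; mean() skips NaN
per_column = df.fillna({'b': 0, 'c': df['c'].mean()})
print(per_column.loc['2018-01-02', 'b'])      # 0.0
print(per_column.loc['2018-01-03', 'c'])      # 4.0

# Forward fill: carry the last valid value down each column
forward_filled = df.ffill()
print(forward_filled.loc['2018-01-02', 'b'])  # 1.0
```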

3.3 Checking for NaN

print(df.isnull())

Output
a b c d
2018-01-01 False False False False
2018-01-02 False True False False
2018-01-03 False False True False
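
On a larger frame the full Boolean table is hard to read, so it is common to reduce it. A small sketch, assuming the same kind of frame as above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180101', periods=3)
df = pd.DataFrame(np.arange(12, dtype=float).reshape((3, 4)),
                  index=dates, columns=['a', 'b', 'c', 'd'])
df.iloc[1, 1] = np.nan
df.iloc[2, 2] = np.nan

# Number of NaN values per column
nan_per_column = df.isnull().sum()
print(nan_per_column.to_dict())   # {'a': 0, 'b': 1, 'c': 1, 'd': 0}

# True if the frame contains any NaN at all
has_nan = bool(df.isnull().any().any())
print(has_nan)                    # True
```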

4. Import and export

Import function    Export function
read_csv           to_csv
read_excel         to_excel
read_sql           to_sql
read_json          to_json
read_msgpack       to_msgpack (both removed in pandas 1.0)
read_html          to_html
read_gbq           to_gbq
read_stata         to_stata
read_sas           (no export counterpart)
read_clipboard     to_clipboard
read_pickle        to_pickle

For details, see: input-output

Take the following test.txt as an example:

A B
Tom 21
Joe 26
Sam 55
Kerry 27

Skip the first row and set the column names to 'name' and 'age':

data = pd.read_csv('test.txt', sep=' ', skiprows=1, names=['name', 'age'])
print(data)

Output
name age
0 Tom 21
1 Joe 26
2 Sam 55
3 Kerry 27
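
The same read can be tried without a file on disk. In the sketch below, io.StringIO stands in for test.txt, and the result is written back out with to_csv; index=False leaves out the row index.

```python
import io
import pandas as pd

# In-memory stand-in for the contents of test.txt
text = "A B\nTom 21\nJoe 26\nSam 55\nKerry 27\n"

data = pd.read_csv(io.StringIO(text), sep=' ', skiprows=1,
                   names=['name', 'age'])
print(data['age'].tolist())   # [21, 26, 55, 27]

# Export to CSV text; to_csv writes to any file-like object
buffer = io.StringIO()
data.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines[0])           # name,age
```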

5. Combining DataFrames

5.1 The concat function

df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['a', 'b', 'c', 'd'])
df3 = pd.DataFrame(np.ones((3, 4))*2, columns=['a', 'b', 'c', 'd'])
# ignore_index=True renumbers the index of the result from 0
print(pd.concat([df1, df2, df3], axis=0, ignore_index=True))

Output
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0

Using the join parameter

df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])
# join defaults to 'outer'; columns that are not shared are filled with NaN
print(pd.concat([df1, df2], join='outer'))
# join='inner' keeps only the shared columns
print(pd.concat([df1, df2], join='inner'))

Output 1
a b c d e
1 0.0 0.0 0.0 0.0 NaN
2 0.0 0.0 0.0 0.0 NaN
3 0.0 0.0 0.0 0.0 NaN
2 NaN 1.0 1.0 1.0 1.0
3 NaN 1.0 1.0 1.0 1.0
4 NaN 1.0 1.0 1.0 1.0
Output 2
b c d
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
2 1.0 1.0 1.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0

Aligning on a given index (formerly the join_axes parameter)

df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])
# join_axes was removed in pandas 1.0; reindex the result to df1's index instead
print(pd.concat([df1, df2], axis=1).reindex(df1.index))

Output
a b c d b c d e
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0

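5.2 Appending rows

DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; use pd.concat instead.

Appending one DataFrame to another

df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['a', 'b', 'c', 'd'])

print(pd.concat([df1, df2], ignore_index=True))

Output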
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0

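Appending a single row

df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'])
s = pd.Series([4, 4, 4, 4], index=['a', 'b', 'c', 'd'])

# DataFrame.append was removed in pandas 2.0; turn the Series into a one-row frame and concat
print(pd.concat([df1, s.to_frame().T], ignore_index=True))

Output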
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 4.0 4.0 4.0 4.0

5.3 The merge function

Merging on a single column

df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'KEY': ['K1', 'K2', 'K3']})
df2 = pd.DataFrame({'C': ['C1', 'C2', 'C3'],
                    'D': ['D1', 'D2', 'D3'],
                    'KEY': ['K1', 'K2', 'K3']})

print(pd.merge(df1, df2, on='KEY'))

Output
A B KEY C D
0 A1 B1 K1 C1 D1
1 A2 B2 K2 C2 D2
2 A3 B3 K3 C3 D3

Merging on two columns

df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'KEY1': ['K1', 'K2', 'K0'],
                    'KEY2': ['K0', 'K1', 'K3']})
df2 = pd.DataFrame({'C': ['C1', 'C2', 'C3'],
                    'D': ['D1', 'D2', 'D3'],
                    'KEY1': ['K0', 'K2', 'K1'],
                    'KEY2': ['K1', 'K1', 'K0']})
# how: one of 'left', 'right', 'outer', 'inner'
print(pd.merge(df1, df2, on=['KEY1', 'KEY2'], how='inner'))

Output
A B KEY1 KEY2 C D
0 A1 B1 K1 K0 C3 D3
1 A2 B2 K2 K1 C2 D2
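
To compare the four how values side by side, the sketch below merges two small made-up frames (left and right are hypothetical names) and counts the resulting rows. It also uses indicator=True, which adds a _merge column recording which side each row came from.

```python
import pandas as pd

# Hypothetical frames: K0 and K1 are on both sides, K2 only left, K3 only right
left = pd.DataFrame({'KEY': ['K0', 'K1', 'K2'], 'A': ['A1', 'A2', 'A3']})
right = pd.DataFrame({'KEY': ['K0', 'K1', 'K3'], 'C': ['C1', 'C2', 'C3']})

# Row count of the result for each join type
row_counts = {how: len(pd.merge(left, right, on='KEY', how=how))
              for how in ['left', 'right', 'outer', 'inner']}
print(row_counts)  # {'left': 3, 'right': 3, 'outer': 4, 'inner': 2}

# indicator=True labels each row as both, left_only, or right_only
merged = pd.merge(left, right, on='KEY', how='outer', indicator=True)
origin = dict(zip(merged['KEY'], merged['_merge'].astype(str)))
print(origin)
```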

Merging on the index

df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3']},
                   index=['K0', 'K1', 'K2'])
df2 = pd.DataFrame({'C': ['C1', 'C2', 'C3'],
                    'D': ['D1', 'D2', 'D3']},
                   index=['K0', 'K1', 'K3'])

print(pd.merge(df1, df2, left_index=True, right_index=True, how='outer'))

Output
A B C D
K0 A1 B1 C1 D1
K1 A2 B2 C2 D2
K2 A3 B3 NaN NaN
K3 NaN NaN C3 D3

Adding suffixes to overlapping column names

df_boys = pd.DataFrame({'id': ['1', '2', '3'],
                        'age': ['23', '25', '18']})
df_girls = pd.DataFrame({'id': ['1', '2', '3'],
                         'age': ['18', '18', '18']})

print(pd.merge(df_boys, df_girls, on='id', suffixes=('_boys', '_girls')))

Output
id age_boys age_girls
0 1 23 18
1 2 25 18
2 3 18 18

6. Plotting

Import the required modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Plotting a Series

data = pd.Series(np.random.randn(1000))
data = data.cumsum()

data.plot()
plt.show()


[Figure 1: line plot of the cumulative-sum Series]

Plotting a DataFrame

data = pd.DataFrame(np.random.randn(1000, 4), columns=['A', 'B', 'C', 'D'])
data = data.cumsum()
data.plot()
plt.show()


[Figure 2: line plot of the four cumulative-sum columns A to D]

data = pd.DataFrame(np.random.randn(1000, 4), columns=['A', 'B', 'C', 'D'])
data = data.cumsum()
ax = data.plot.scatter(x='A', y='B', color='DarkRed', label='class 1')
data.plot.scatter(x='C', y='D', color='DarkGreen', label='class 2', ax=ax)
plt.show()


[Figure 3: scatter plot of A vs. B (class 1) and C vs. D (class 2)]

For more on plotting, see: visualization