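A Quick Introduction to Pandas
Pandas keeps the data it works on in memory. When a dataset is too large to be loaded in one go, it has to be imported in chunks, and the queries have to be adapted accordingly.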
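A minimal sketch of chunked importing, assuming the data is a CSV file: passing chunksize to pd.read_csv returns an iterator of DataFrames instead of one large DataFrame, so aggregates and filters are computed chunk by chunk. The in-memory csv_text and the column names id and value are made up for illustration; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file on disk; in practice pass a file path.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

total = 0
matching_rows = []
# chunksize makes read_csv yield DataFrames of at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()                      # aggregate incrementally
    matching_rows.append(chunk[chunk["value"] > 10])   # filter each chunk on its own

# Only the (small) filtered pieces are combined at the end.
result = pd.concat(matching_rows, ignore_index=True)
print(total)
print(result)
```

The query changes shape as a consequence: instead of one expression over the whole table, each chunk contributes a partial result that is combined afterwards.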
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dates = pd.date_range('20121201', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print(df)
Date | A | B | C | D
2012-12-01 | 1.167517 | -0.814091 | -0.908612 | -0.599967
2012-12-02 | -0.767541 | 0.072700 | 0.985450 | -1.838382
2012-12-03 | -1.078247 | -0.413168 | -1.899446 | -0.150623
2012-12-04 | -0.974840 | 0.114777 | 0.952938 | 2.034717
2012-12-05 | -0.689099 | -1.102233 | 0.227212 | 1.241322
2012-12-06 | -0.288585 | 1.363764 | 0.230803 | -1.884838
1.1 Selecting rows
rows = df[0:3]
print(rows)
Date | A | B | C | D
2012-12-01 | 1.167517 | -0.814091 | -0.908612 | -0.599967
2012-12-02 | -0.767541 | 0.072700 | 0.985450 | -1.838382
2012-12-03 | -1.078247 | -0.413168 | -1.899446 | -0.150623
1.2 Selecting columns
cols = df[['A', 'B', 'C']]
print(cols)
Date | A | B | C
2012-12-01 | 1.167517 | -0.814091 | -0.908612
2012-12-02 | -0.767541 | 0.072700 | 0.985450
2012-12-03 | -1.078247 | -0.413168 | -1.899446
2012-12-04 | -0.974840 | 0.114777 | 0.952938
2012-12-05 | -0.689099 | -1.102233 | 0.227212
2012-12-06 | -0.288585 | 1.363764 | 0.230803
1.3 Selecting blocks, that is, a block of data made up of chosen rows and columns
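One way to select such a block is with the loc (label-based) and iloc (position-based) indexers. The sketch below is only an illustration: it rebuilds a frame with the same dates and column names as above, but fills it with np.arange values instead of random numbers so the result is reproducible.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20121201', periods=6)
# Deterministic values 0..23 (an assumption made for reproducibility).
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label-based: rows 2012-12-02 through 2012-12-04 (both ends included), columns A and B.
block_by_label = df.loc['2012-12-02':'2012-12-04', ['A', 'B']]

# Position-based: row positions 1..3 and column positions 0..1 (end positions excluded).
block_by_position = df.iloc[1:4, 0:2]

print(block_by_label)
print(block_by_label.equals(block_by_position))  # True: both select the same block
```

Note the difference in slicing: loc includes the end label, while iloc follows the usual Python rule and excludes the end position.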
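Pandas has two basic data structures, Series and DataFrame. A Series is a one-dimensional labeled array. A DataFrame holds a two-dimensional block of data, in other words a matrix or table.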
2.1 Series operations
s = pd.Series([1,2,3,4])
print(s)
Output:
0    1
1    2
2    3
3    4
dtype: int64
2.2 DataFrame
s = pd.DataFrame(np.random.randn(6,4),columns=list('ABCD'))
print(s)
Index | A | B | C | D
0 | 1.167517 | -0.814091 | -0.908612 | -0.599967
1 | -0.767541 | 0.072700 | 0.985450 | -1.838382
2 | -1.078247 | -0.413168 | -1.899446 | -0.150623
3 | -0.974840 | 0.114777 | 0.952938 | 2.034717
4 | -0.689099 | -1.102233 | 0.227212 | 1.241322
5 | -0.288585 | 1.363764 | 0.230803 | -1.884838
print(s.index)
Output:
RangeIndex(start=0, stop=6, step=1)
# Add derived columns; arithmetic on columns is element-wise and keeps df's index
df['sumAB'] = df['A'] + df['B']
df['10A'] = df['A'] * 10
print(df)
Date | A | B | C | D | sumAB | 10A
2012-12-01 | 1.167517 | -0.814091 | -0.908612 | -0.599967 | 0.353426 | 11.675168
2012-12-02 | -0.767541 | 0.072700 | 0.985450 | -1.838382 | -0.694840 | -7.675406
2012-12-03 | -1.078247 | -0.413168 | -1.899446 | -0.150623 | -1.491414 | -10.782469
2012-12-04 | -0.974840 | 0.114777 | 0.952938 | 2.034717 | -0.860063 | -9.748398
2012-12-05 | -0.689099 | -1.102233 | 0.227212 | 1.241322 | -1.791332 | -6.890987
2012-12-06 | -0.288585 | 1.363764 | 0.230803 | -1.884838 | 1.075178 | -2.885852
2.3 Filtering rows by condition
s1 = df[(df.index>='20121201')&(df.index<='20121203')]
print(s1)
Date | A | B | C | D | sumAB | 10A
2012-12-01 | 1.167517 | -0.814091 | -0.908612 | -0.599967 | 0.353426 | 11.675168
2012-12-02 | -0.767541 | 0.072700 | 0.985450 | -1.838382 | -0.694840 | -7.675406
2012-12-03 | -1.078247 | -0.413168 | -1.899446 | -0.150623 | -1.491414 | -10.782469
s2 = df[df['A']>0]
print(s2)
Date | A | B | C | D | sumAB | 10A
2012-12-01 | 1.167517 | -0.814091 | -0.908612 | -0.599967 | 0.353426 | 11.675168
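The same kinds of filter can be written in a few other ways. The sketch below is an illustration only: it uses a small frame with fixed, made-up values (not the random data above) and shows a date-range slice with loc, a combination of two conditions, and the equivalent query string.

```python
import pandas as pd

dates = pd.date_range('20121201', periods=6)
# Fixed, made-up values so the result is reproducible.
df = pd.DataFrame({'A': [1.0, -2.0, 3.0, -4.0, 5.0, -6.0],
                   'B': [0.5, 0.5, -0.5, -0.5, 0.5, -0.5]}, index=dates)

# Date range via label slicing on the DatetimeIndex (both ends included).
by_date = df.loc['2012-12-01':'2012-12-03']

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
both_positive = df[(df['A'] > 0) & (df['B'] > 0)]

# The same filter expressed as a query string.
same_filter = df.query('A > 0 and B > 0')

print(by_date)
print(both_positive)
```

The parentheses around each comparison matter because & binds more tightly than > in Python.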
2.4 Peeking at the data
df.head(5)
Date | A | B | C | D | sumAB | 10A
2012-12-01 | 1.167517 | -0.814091 | -0.908612 | -0.599967 | 0.353426 | 11.675168
2012-12-02 | -0.767541 | 0.072700 | 0.985450 | -1.838382 | -0.694840 | -7.675406
2012-12-03 | -1.078247 | -0.413168 | -1.899446 | -0.150623 | -1.491414 | -10.782469
2012-12-04 | -0.974840 | 0.114777 | 0.952938 | 2.034717 | -0.860063 | -9.748398
2012-12-05 | -0.689099 | -1.102233 | 0.227212 | 1.241322 | -1.791332 | -6.890987
df.tail(5)
Date | A | B | C | D | sumAB | 10A
2012-12-02 | -0.767541 | 0.072700 | 0.985450 | -1.838382 | -0.694840 | -7.675406
2012-12-03 | -1.078247 | -0.413168 | -1.899446 | -0.150623 | -1.491414 | -10.782469
2012-12-04 | -0.974840 | 0.114777 | 0.952938 | 2.034717 | -0.860063 | -9.748398
2012-12-05 | -0.689099 | -1.102233 | 0.227212 | 1.241322 | -1.791332 | -6.890987
2012-12-06 | -0.288585 | 1.363764 | 0.230803 | -1.884838 | 1.075178 | -2.885852
df.values
array([[  1.16751676,  -0.81409105,  -0.90861201,  -0.59996719,
          0.35342571,  11.6751676 ],
       [ -0.76754063,   0.07270018,   0.98545024,  -1.83838166,
         -0.69484045,  -7.67540633],
       [ -1.07824687,  -0.41316755,  -1.89944615,  -0.15062331,
         -1.49141442, -10.78246866],
       [ -0.97483975,   0.11477693,   0.95293849,   2.03471652,
         -0.86006282,  -9.74839753],
       [ -0.68909873,  -1.10223307,   0.22721154,   1.24132162,
         -1.7913318 ,  -6.89098733],
       [ -0.2885852 ,   1.3637637 ,   0.23080346,  -1.88483769,
          1.0751785 ,  -2.88585202]])
df.sort_values(by='A')
Date | A | B | C | D | sumAB | 10A
2012-12-03 | -1.078247 | -0.413168 | -1.899446 | -0.150623 | -1.491414 | -10.782469
2012-12-04 | -0.974840 | 0.114777 | 0.952938 | 2.034717 | -0.860063 | -9.748398
2012-12-02 | -0.767541 | 0.072700 | 0.985450 | -1.838382 | -0.694840 | -7.675406
2012-12-05 | -0.689099 | -1.102233 | 0.227212 | 1.241322 | -1.791332 | -6.890987
2012-12-06 | -0.288585 | 1.363764 | 0.230803 | -1.884838 | 1.075178 | -2.885852
2012-12-01 | 1.167517 | -0.814091 | -0.908612 | -0.599967 | 0.353426 | 11.675168
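Two related points are worth a short sketch: sort_values returns a new DataFrame and leaves the original untouched, and ascending=False reverses the order. The tiny frame below, with index labels x, y and z, is made up for illustration.

```python
import pandas as pd

# Small made-up frame; the labels x, y, z are illustrative only.
df = pd.DataFrame({'A': [3, 1, 2], 'B': [9, 8, 7]}, index=['x', 'y', 'z'])

descending = df.sort_values(by='A', ascending=False)  # largest A first
restored = descending.sort_index()                    # back to index order

print(descending)
print(df)  # the original frame is unchanged
```

To keep a sorted result, assign it to a variable as done here.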
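3 Plotting
Used together with matplotlib, pandas supports nearly every kind of chart.
First, turn on inline display of plots (the %matplotlib inline magic below works only in Jupyter/IPython):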
%matplotlib inline
nd = pd.Series(np.random.randn(600))
nd.hist(bins=100)
Output:
<matplotlib.axes._subplots.AxesSubplot at 0x7f54c76043c8>
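Besides the histogram above, a DataFrame can be plotted directly. The following is a minimal sketch for running outside a notebook, assuming the non-interactive Agg backend is acceptable; the output file name cumsum.png and the plot title are made up for illustration.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dates = pd.date_range('20121201', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# DataFrame.plot draws one line per column and returns a matplotlib Axes.
ax = df.cumsum().plot(title='Cumulative sum of A-D')
ax.set_xlabel('Date')
ax.figure.savefig('cumsum.png')  # hypothetical output file name
plt.close(ax.figure)
```

Because plot returns an ordinary matplotlib Axes, any further matplotlib customization (labels, limits, saving) can be applied to it.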
Notes on using the read_csv() function in pandas:
import pandas as pd
# Default: the first line of the file is used as the column names
data = pd.read_csv("iris_training.csv")
print(data)
'''
   120    4  setosa  versicolor  virginica
0  6.4  2.8     5.6         2.2          2
1  5.0  2.3     3.3         1.0          1
2  4.9  2.5     4.5         1.7          2
3  4.9  3.1     1.5         0.1          0
'''
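# Column names to assign (the same names that appear in the output)
CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
# names only: the file's first line is kept as an ordinary data row
data = pd.read_csv("iris_training.csv", names=CSV_COLUMN_NAMES)
'''
   SepalLength  SepalWidth PetalLength  PetalWidth    Species
0        120.0         4.0      setosa  versicolor  virginica
1          6.4         2.8         5.6         2.2          2
2          5.0         2.3         3.3         1.0          1
'''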
# names plus header=0: the file's first line is discarded and replaced by names
data = pd.read_csv("iris_training.csv", names=CSV_COLUMN_NAMES, header=0)
'''
   SepalLength  SepalWidth  PetalLength  PetalWidth  Species
0          6.4         2.8          5.6         2.2        2
1          5.0         2.3          3.3         1.0        1
2          4.9         2.5          4.5         1.7        2
3          4.9         3.1          1.5         0.1        0
'''
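- header : int or list of ints, default 'infer'
Row number(s) to use as the column names, and the start of the data. By default the behavior is inferred: if names is not passed, it is the same as header=0 and the column names are taken from the first line of the file; if names is passed, it is the same as header=None. Pass header=0 explicitly together with names to replace the column names already in the file. The header parameter can also be a list,
for example [0,1,3], meaning that those lines of the file are used as column headers (so each column has several header levels); lines in between that are not listed are skipped (line 2 in this example). Counting from 1, lines 1, 2 and 4 become the multi-level header, line 3 is discarded, and the DataFrame's data starts at line 5.
Note: if skip_blank_lines=True, the header parameter ignores commented lines and blank lines, so header=0 refers to the first line of data rather than the first line of the file.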
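- names : array-like, default None
List of column names to use for the result. If the data file has no header row, pass header=None explicitly. Duplicates in this list are not allowed unless mangle_dupe_cols=True (this parameter exists only in older pandas versions; it was removed in pandas 2.0).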
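The interaction between header and names described above can be checked on a tiny example. This is a sketch only: the in-memory csv_text and the replacement names x, y, z are made up for illustration and stand in for a real file such as iris_training.csv.

```python
import io
import pandas as pd

# Small in-memory stand-in for a CSV file whose first line is a header row.
csv_text = "a,b,c\n1,2,3\n4,5,6\n"
new_names = ['x', 'y', 'z']  # illustrative replacement names

# No names passed: header is inferred as 0, so the first line gives the column names.
inferred = pd.read_csv(io.StringIO(csv_text))

# names passed, header not set: behaves like header=None, so the old header becomes data.
as_data = pd.read_csv(io.StringIO(csv_text), names=new_names)

# names passed with header=0: the old header line is dropped and replaced.
replaced = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)

# Leading blank lines are skipped (skip_blank_lines=True is the default),
# so header=0 still finds the "a,b,c" line.
blank_first = pd.read_csv(io.StringIO("\n\n" + csv_text), header=0)

print(list(inferred.columns), len(inferred))    # ['a', 'b', 'c'] 2
print(as_data.iloc[0].tolist(), len(as_data))   # ['a', 'b', 'c'] 3
print(list(replaced.columns), len(replaced))    # ['x', 'y', 'z'] 2
print(list(blank_first.columns))                # ['a', 'b', 'c']
```

The middle case is the usual pitfall: the old header ends up as the first data row, which also turns the numeric columns into strings.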