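Preface
Pandas is one of the best-known data analysis packages for Python. It is built on NumPy and adds higher-level data structures and tools for data analysis. Pandas revolves around two core data structures, Series and DataFrame. This article covers the basic ways to create and access both.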
Series
A Series is an object similar to a one-dimensional array. It consists of a set of data (a one-dimensional ndarray) and an associated set of data labels (the index).
Note: NumPy (Numerical Python) gives Python support for multidimensional array objects through ndarray, which offers vectorized operations and is fast and memory-efficient.
(1) The pandas documentation describes Series as follows:
""" One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the two indexes.

Parameters
----------
data : array-like, dict, or scalar value
    Contains data stored in Series
index : array-like or Index (1d)
    Values must be hashable and have the same length as `data`.
    Non-unique index values are allowed. Will default to
    RangeIndex(len(data)) if not provided. If both a dict and index
    sequence are used, the index will override the keys found in the
    dict.
dtype : numpy.dtype or None
    If None, dtype will be inferred
copy : boolean, default False
    Copy input data
"""
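The label alignment described in the docstring can be seen in a small sketch. The two Series and their values are made up for illustration and do not come from the article.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative values).
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0], index=['b', 'd'])

# Addition matches values by label, not by position.
total = s1 + s2

# Only 'b' is present in both operands; every other label in the
# union of the two indexes has no partner and becomes NaN.
print(total)
```

Because the operands need not have the same length, the result can be longer than either input.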
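(2) The basic way to create a Series is shown below. The data can be array-like (a list or an ndarray), a dict, or a scalar value. The examples assume NumPy and pandas are imported as np and pd.
import numpy as np
import pandas as pd
s = pd.Series(data, index=index)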
s = pd.Series([-1.55666192,-0.75414753,0.47251231,-1.37775038,-1.64899442], index=['a', 'b', 'c', 'd', 'e'],dtype='int8' )
a -1
b 0
c 0
d -1
e -1
dtype: int8
s = pd.Series(['a',-0.75414753,123,66666,-1.64899442], index=['a', 'b', 'c', 'd', 'e'],)
a a
b -0.754148
c 123
d 66666
e -1.64899
dtype: object
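Note: A Series supports the numpy.dtype types, including integers, floats, complex numbers, booleans, and strings. As when creating an ndarray, if no type is specified pandas tries to infer a suitable one: in the examples above, data that mixes numbers and strings is inferred as object. When int8 is specified, the data is stored and displayed as int8, so the fractional parts of the floats are dropped. The output shown comes from an older pandas release; recent releases may raise a ValueError for this lossy conversion instead of truncating.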
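A minimal sketch of dtype inference, using made-up values. The last line uses astype as one way to request the float-to-int8 truncation explicitly; that choice is an assumption of this sketch, not something the article itself does.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])                     # all integers -> integer dtype
floats = pd.Series([1, 2.5, 3])                 # any float -> float64
mixed = pd.Series(['a', 2.5, 3])                # numbers and strings -> object
explicit = pd.Series([1, 2, 3], dtype='int8')   # dtype given explicitly

print(ints.dtype, floats.dtype, mixed.dtype, explicit.dtype)

# Explicit conversion of floats to int8 drops the fractional part.
truncated = pd.Series([-1.55666192, 0.47251231]).astype('int8')
print(truncated.tolist())
```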
s = pd.Series(np.random.randn(5))
0 0.485468
1 -0.912130
2 0.771970
3 -1.058117
4 0.926649
dtype: float64
s.index
RangeIndex(start=0, stop=5, step=1)
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
a 0.485468
b -0.912130
c 0.771970
d -1.058117
e 0.926649
dtype: float64
Note: When no index is specified for the data, Series automatically creates an integer index (a RangeIndex starting at 0).
s = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.})
a 0.0
b 1.0
c 2.0
dtype: float64
s = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.}, index=['b', 'c', 'd', 'a'])
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
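Note: A Series created from a Python dict can be viewed as a fixed-length, ordered dict. If only a dict is passed, the dict's keys become the index of the Series. If an index is also passed, the values whose keys match the index labels are placed at the corresponding positions, and any label with no matching key gets NaN.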
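One way to read the dict-plus-index behavior is as building the Series from the dict and then reindexing it. The sketch below checks that the two routes agree on the article's data; the variable names are illustrative.

```python
import pandas as pd

d = {'a': 0., 'b': 1., 'c': 2.}
labels = ['b', 'c', 'd', 'a']

# Route 1: pass the dict and the index together.
from_constructor = pd.Series(d, index=labels)

# Route 2: build from the dict alone, then reorder and extend the labels.
from_reindex = pd.Series(d).reindex(labels)

# Labels that had no matching key in the dict end up as NaN.
missing = from_constructor[from_constructor.isna()].index.tolist()
print(from_constructor.equals(from_reindex), missing)
```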
s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
Note: A scalar value is repeated to match the length of the index.
(3) Accessing the elements and index of a Series
s = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.}, index=['b', 'c', 'd', 'a'])
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
s.values
[ 1. 2. nan 0.]
s.index
Index([u'b', u'c', u'd', u'a'], dtype='object')
Note: The values and index attributes of a Series return its array representation and its index object.
s['a']
0.0
s[['a','b']]
a 0.0
b 1.0
dtype: float64
s[['a','b','c']]
a 0.0
b 1.0
c 2.0
dtype: float64
s[:2]
b 1.0
c 2.0
dtype: float64
Note: A single value or a group of values in a Series can be selected by indexing with a label, a list of labels, or a slice.
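Plain square brackets accept both labels and positional slices, which can be ambiguous. A hedged sketch of the same selections written with the explicit loc (label) and iloc (position) accessors, using the Series from this section:

```python
import pandas as pd

s = pd.Series({'a': 0., 'b': 1., 'c': 2.}, index=['b', 'c', 'd', 'a'])

by_label = s.loc['a']           # value stored under label 'a'
by_position = s.iloc[0]         # first element, whatever its label is
label_slice = s.loc['b':'d']    # label slices include both endpoints
position_slice = s.iloc[:2]     # position slices exclude the stop
not_missing = s[s.notna()]      # boolean mask: keep non-NaN entries

print(by_label, by_position, len(label_slice), len(position_slice))
```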
DataFrame
A DataFrame is a tabular (two-dimensional) data structure. It holds an ordered set of columns, and each column can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that share the same index.
(1) The pandas documentation describes DataFrame as follows:
""" Two-dimensional size-mutable, potentially heterogeneous tabular
data structure with labeled axes (rows and columns). Arithmetic
operations align on both row and column labels. Can be thought of as a
dict-like container for Series objects. The primary pandas data
structure

Parameters
----------
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects
index : Index or array-like
    Index to use for resulting frame. Will default to np.arange(n) if
    no indexing information part of input data and no index provided
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    np.arange(n) if no column labels are provided
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input
"""
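(2) The basic way to create a DataFrame is shown below. The data can be a dict of lists, one-dimensional ndarrays, or Series (lists and ndarrays must all have the same length), a two-dimensional ndarray, a dict of dicts, and so on.
df = pd.DataFrame(data, index=index)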
df = pd.DataFrame({'one': [1., 2., 3., 5], 'two': [1., 2., 3., 4.]})
one two
0 1.0 1.0
1 2.0 2.0
2 3.0 3.0
3 5.0 4.0
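Note: When created from a dict of lists, each sequence becomes one column of the DataFrame. Passing bare lists inside braces, as in df = pd.DataFrame({[1., 2., 3., 5], [1., 2., 3., 4.]}), does not work: without keys the braces form a set literal, and a list is unhashable, so Python raises a TypeError.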
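The failure comes from Python itself, before pandas is involved. A small sketch that reproduces it with set(), which builds the same kind of object as the brace literal, and then shows the keyed form that works:

```python
import pandas as pd

col_one = [1., 2., 3., 5.]
col_two = [1., 2., 3., 4.]

# {col_one, col_two} is a set literal; building a set of lists fails
# because lists are mutable and therefore unhashable.
try:
    set([col_one, col_two])
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Giving each list a key turns the braces into a dict, which works.
df = pd.DataFrame({'one': col_one, 'two': col_two})
print(df.shape)
```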
df = pd.DataFrame([[1., 2., 3., 5],[1., 2., 3., 4.]],index=['a', 'b'],columns=['one','two','three','four'])
one two three four
a 1.0 2.0 3.0 5.0
b 1.0 2.0 3.0 4.0
Note: A nested list creates a table with 2 rows and 4 columns; the index and columns parameters specify the row labels and column names.
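A two-dimensional ndarray is handled the same way as a nested list. A minimal sketch with generated values (the array contents are illustrative, not from the article):

```python
import numpy as np
import pandas as pd

# A 2 x 4 array holding 0.0 .. 7.0, one row per future DataFrame row.
values = np.arange(8, dtype=float).reshape(2, 4)

df = pd.DataFrame(values,
                  index=['a', 'b'],
                  columns=['one', 'two', 'three', 'four'])
print(df)
```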
data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'a10')])
[(0, 0., '') (0, 0., '')]
Note: zeros(shape, dtype=float, order='C') returns an array of the given shape and type, filled with zeros.
data[:] = [(1,2.,'Hello'), (2,3.,"World")]
df = pd.DataFrame(data)
A B C
0 1 2.0 Hello
1 2 3.0 World
df = pd.DataFrame(data, index=['first', 'second'])
A B C
first 1 2.0 Hello
second 2 3.0 World
df = pd.DataFrame(data, columns=['C', 'A', 'B'])
C A B
0 Hello 1 2.0
1 World 2 3.0
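Note: As with Series, a DataFrame adds an index automatically when none is specified; when columns is specified, the columns are arranged in the given order.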
data = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
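Note: When created from a dict of Series, each Series becomes a column. If no index is specified explicitly, the indexes of the Series are merged (as a union) to form the row index of the result. NaN stands in for the missing column data.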
df = pd.DataFrame(data,index=['d', 'b', 'a'])
one two
d NaN 4.0
b 2.0 2.0
a 1.0 1.0
df = pd.DataFrame(data,index=['d', 'b', 'a'], columns=['two', 'three'])
two three
d 4.0 NaN
b 2.0 NaN
a 1.0 NaN
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data2)
a b c
0 1 2 NaN
1 5 10 20.0
Note: When created from a list of dicts, each dict becomes one row of the DataFrame, and the union of the dict keys becomes the column labels.
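The output above shows column c as 20.0 rather than 20. A short sketch of why: the missing key in the first dict introduces NaN, which forces that column to a float dtype. The fillna call is one possible way to replace the gap and is an addition of this sketch.

```python
import pandas as pd

data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data2)

# 'a' and 'b' stay integer; 'c' holds a NaN and is therefore float.
print(df.dtypes)

# Replace the missing value in column 'c' with 0 (returns a new frame).
filled = df.fillna({'c': 0})
print(filled)
```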
df = pd.DataFrame(data2, index=['first', 'second'])
a b c
first 1 2 NaN
second 5 10 20.0
df = pd.DataFrame(data2, columns=['a', 'b'])
a b
0 1 2
1 5 10
df = pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
a b
a b c a b
A B 4.0 1.0 5.0 8.0 10.0
C 3.0 2.0 6.0 7.0 NaN
D NaN NaN NaN NaN 9.0
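Note: When created from a dict of dicts, the outer keys become the column index of the result, each inner dict becomes one column, and the inner keys are merged to form the row index. With tuple keys, as here, both indexes are hierarchical (MultiIndex).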
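A sketch of how such a frame might be read back, using only the first two columns of the example to keep it short. Selecting with tuples through loc is one option among several.

```python
import pandas as pd

# Subset of the article's data: tuple keys on both levels.
df = pd.DataFrame({('a', 'a'): {('A', 'B'): 4, ('A', 'C'): 3},
                   ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2}})

# Both axes are hierarchical.
print(type(df.columns).__name__, type(df.index).__name__)

# Selecting the outer column key returns the sub-frame beneath it.
under_a = df['a']

# A full (row tuple, column tuple) pair addresses a single cell.
cell = df.loc[('A', 'B'), ('a', 'b')]
print(cell)
```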
(3) Accessing the elements and index of a DataFrame
data = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
df['one']  # or df.one
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
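Note: A DataFrame column can be retrieved as a Series using either dict-like notation or attribute access. The returned Series has the same index as the original DataFrame, and its name attribute is set accordingly.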
df[0:1]
one two
a 1.0 1.0
Note: Slicing with df[0:1] selects rows by position, so this returns the first row (not columns).
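Square brackets on a DataFrame behave differently depending on what is passed. A minimal sketch contrasting a slice (rows) with a list of names (columns), plus iloc for selecting columns by position:

```python
import pandas as pd

data = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)

first_row = df[0:1]                  # a slice selects rows by position
one_column = df[['one']]             # a list of names selects columns
first_two_columns = df.iloc[:, 0:2]  # all rows, first two columns

print(first_row.shape, one_column.shape, first_two_columns.shape)
```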
df.loc['a']
one 1.0
two 1.0
Name: a, dtype: float64
df.loc[:,['one','two'] ]
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
df.loc[['a',],['one','two']]
one two
a 1.0 1.0
df.loc['a','one']
1.0
Note: loc selects data by label.
df.iloc[0:2,0:1]
one
a 1.0
b 2.0
df.iloc[0:2]
one two
a 1.0 1.0
b 2.0 2.0
df.iloc[[0,2],[0,1]]  # pick the data at arbitrary row positions and column positions
one two
a 1.0 1.0
c 3.0 3.0
Note: iloc selects data by integer position.
df.ix['a']
one 1.0
two 1.0
Name: a, dtype: float64
df.ix['a',['one','two']]
one 1.0
two 1.0
Name: a, dtype: float64
df.ix['a',[0,1]]
one 1.0
two 1.0
Name: a, dtype: float64
df.ix[['a','b'],[0,1]]
one two
a 1.0 1.0
b 2.0 2.0
df.ix[1,[0,1]]
one 2.0
two 2.0
Name: b, dtype: float64
df.ix[[0,1],[0,1]]
one two
a 1.0 1.0
b 2.0 2.0
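Note: ix accepts labels and integer positions, alone or combined, to select rows. ix was deprecated in pandas 0.20 and removed in pandas 1.0, so these examples only run on older versions; use loc and iloc instead.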
df.ix[df.one>1,:1]
one
b 2.0
c 3.0
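Note: A condition can also be used for selection; this picks the rows where column one is greater than 1, together with the first column.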
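For pandas versions without ix, the selections above can be approximated with loc and iloc. This is a sketch of one possible translation, not the only one; the variable names are illustrative.

```python
import pandas as pd

data = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)

# df.ix['a', ['one', 'two']]: labels on both axes
row_a = df.loc['a', ['one', 'two']]

# df.ix[1, [0, 1]]: positions on both axes
row_b = df.iloc[1, [0, 1]]

# df.ix['a', [0, 1]]: a row label with column positions;
# convert the positions to labels first, then use loc
row_a_by_pos = df.loc['a', df.columns[[0, 1]]]

# df.ix[df.one > 1, :1]: boolean row mask plus the first column
filtered = df.loc[df['one'] > 1, df.columns[:1]]
print(filtered)
```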
df['one']=16.8
one two
a 16.8 1.0
b 16.8 2.0
c 16.8 3.0
d 16.8 4.0
val = pd.Series([2,2,2],index=['b', 'c', 'd'])
df['one']=val
one two
a NaN 1.0
b 2.0 2.0
c 2.0 3.0
d 2.0 4.0
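Note: A column can be modified by assignment. When a list or array is assigned to a column, its length must match the length of the DataFrame. When a Series is assigned, it is aligned exactly to the DataFrame's index, and any gaps are filled with NaN.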
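The two assignment rules can be checked side by side. In this sketch the column name 'bad' is hypothetical and exists only to trigger the length error.

```python
import pandas as pd

df = pd.DataFrame({'two': [1., 2., 3., 4.]}, index=['a', 'b', 'c', 'd'])

# A Series is matched to the rows by label; row 'a' has no match.
df['one'] = pd.Series([2, 2, 2], index=['b', 'c', 'd'])

# A plain list has no labels, so its length must equal the row count.
try:
    df['bad'] = [1, 2, 3]   # 3 values for 4 rows
    message = None
except ValueError as exc:
    message = str(exc)

print(df)
print(message)
```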
df['four']=[3,3,3,3]
one two four
a NaN 1.0 3
b 2.0 2.0 3
c 2.0 3.0 3
d 2.0 4.0 3
Note: Assigning to a column that does not exist creates a new column.
df.index.get_loc('a')
0
df.index.get_loc('b')
1
df.columns.get_loc('one')
0
Note: get_loc returns the integer position of a row or column label.
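One use for these positions is to move from label-based to position-based access. A minimal sketch, assuming the two-column frame used earlier in this section:

```python
import pandas as pd

data = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)

# Look up the integer positions of a row label and a column label.
row_pos = df.index.get_loc('b')
col_pos = df.columns.get_loc('two')

# The positions address the same cell through iloc as the labels do
# through loc.
value = df.iloc[row_pos, col_pos]
print(row_pos, col_pos, value)
```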