A Series consists of a set of data together with an associated set of data labels (the index).
Creation
In [27]: obj = pd.Series([4, 5, -6, 2])
In [28]: obj
Out[28]:
0 4
1 5
2 -6
3 2
dtype: int64
If no index is given, labels 0, 1, ... are generated automatically. To set the labels yourself, pass the index parameter:
In [29]: obj = pd.Series([2,3,1,-5], index=['a','b','c','d'])
In [30]: obj
Out[30]:
a 2
b 3
c 1
d -5
dtype: int64
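The two constructor calls above can be compared side by side in a script. This is a minimal sketch; the variable names `default_idx` and `labeled` are illustrative and do not appear in the original session.

```python
import pandas as pd

# Without index, labels default to 0, 1, 2, ...
default_idx = pd.Series([4, 5, -6, 2])

# With index, each value gets the label at the same position.
labeled = pd.Series([2, 3, 1, -5], index=["a", "b", "c", "d"])

print(list(default_idx.index))  # [0, 1, 2, 3]
print(labeled["d"])             # -5
```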
If the data is already stored in a dict, the dict can be used directly as the data source:
In [31]: sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
In [32]: obj = pd.Series(sdata)
In [33]: obj
Out[33]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
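If an index is passed as well, each label is looked up in the dict; labels with no matching key get NaN, and dict keys that are not in the index are dropped: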
In [34]: states = ['California', 'Ohio', 'Oregon', 'Texas']
In [35]: obj = pd.Series(sdata, index=states)
In [36]: obj
Out[36]:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
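The lookup behavior can be checked programmatically. The sketch below reuses `sdata` and `states` from the session above; `by_state` is an illustrative name.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]

by_state = pd.Series(sdata, index=states)

# 'California' has no key in sdata, so its value is NaN;
# 'Utah' is not listed in states, so it is left out of the result.
print(by_state.isna()["California"])  # True
print("Utah" in by_state.index)       # False
print(by_state.dtype)                 # float64
```

The dtype changes from int64 to float64 because NaN is a floating-point value.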
Reading
The values and index attributes of a Series return its array representation and its index object:
In [37]: obj = pd.Series([1,2,4,-3])
In [38]: obj.values
Out[38]: array([ 1, 2, 4, -3], dtype=int64)
In [39]: obj.index
Out[39]: RangeIndex(start=0, stop=4, step=1)
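If the Series has a custom index, the index attribute returns the Index object that actually labels each value (obj2 here is a Series holding the values 1, 3, -5, 3 under the labels 'a' through 'd'):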
In [40]: obj2.index
Out[40]: Index(['a', 'b', 'c', 'd'], dtype='object')
One or more values can be read either through values or by index label:
In [42]: obj2.values[0]
Out[42]: 1
In [43]: obj2['a']
Out[43]: 1
In [45]: obj2.values[0:2]
Out[45]: array([1, 3], dtype=int64)
In [46]: obj2[['a','b','c']]
Out[46]:
a 1
b 3
c -5
dtype: int64
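The access patterns above, gathered into one script. The Series `s` is a stand-in for obj2, rebuilt from the values shown in the outputs; its construction is an assumption, since the original session does not show how obj2 was created.

```python
import pandas as pd

# Assumed reconstruction of obj2 from the outputs shown above.
s = pd.Series([1, 3, -5, 3], index=["a", "b", "c", "d"])

first = s.values[0]          # positional access on the underlying NumPy array
by_label = s["a"]            # label-based access to a single value
subset = s[["a", "b", "c"]]  # a list of labels returns a new Series

print(first, by_label)       # 1 1
print(subset.tolist())       # [1, 3, -5]
```

In recent pandas versions, `s.to_numpy()` is the documented alternative to `s.values`.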
Operations
In [47]: obj2
Out[47]:
a 1
b 3
c -5
d 3
dtype: int64
In [51]: obj3
Out[51]:
b 2
c 3
d -9
e 2
dtype: int64
Filtering:
In [48]: obj2[obj2 > 0]
Out[48]:
a 1
b 3
d 3
dtype: int64
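Filtering works in two steps: the comparison builds a boolean Series, and indexing with it keeps the entries marked True. A small sketch, with `s` again standing in for obj2 (an assumed reconstruction):

```python
import pandas as pd

s = pd.Series([1, 3, -5, 3], index=["a", "b", "c", "d"])

mask = s > 0        # boolean Series with the same index as s
positive = s[mask]  # keeps only the entries where mask is True

print(mask.tolist())         # [True, True, False, True]
print(list(positive.index))  # ['a', 'b', 'd']
```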
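Addition: in arithmetic operations, Series automatically align data by index label; a label present on only one side yields NaN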
In [52]: obj4 = obj2 + obj3
In [53]: obj4
Out[53]:
a NaN
b 5.0
c -2.0
d -6.0
e NaN
dtype: float64
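If NaN for unmatched labels is not wanted, `Series.add` accepts a `fill_value` that stands in for the missing side. The sketch below rebuilds the two operands from the outputs above; the names `left` and `right` are illustrative.

```python
import pandas as pd

left = pd.Series([1, 3, -5, 3], index=["a", "b", "c", "d"])
right = pd.Series([2, 3, -9, 2], index=["b", "c", "d", "e"])

total = left + right                    # unmatched labels become NaN
filled = left.add(right, fill_value=0)  # a missing side is treated as 0

print(total["b"], total["c"], total["d"])  # 5.0 -2.0 -6.0
print(filled["a"], filled["e"])            # 1.0 2.0
```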
Multiplication:
In [54]: obj5 = obj2 * 2
In [55]: obj5
Out[55]:
a 2
b 6
c -10
d 6
dtype: int64
Checking for missing values:
In [56]: pd.isnull(obj4)
Out[56]:
a True
b False
c False
d False
e True
dtype: bool
In [57]: pd.notnull(obj4)
Out[57]:
a False
b True
c True
d True
e False
dtype: bool
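The two checks are complements of each other, and the `notnull` result can be used directly as a filter. A minimal sketch, with `s` built to match obj4 above (an assumed reconstruction):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 5.0, -2.0, -6.0, np.nan], index=list("abcde"))

missing = pd.isnull(s)   # same result as the method form s.isnull()
present = pd.notnull(s)  # same result as the method form s.notnull()

print(missing.tolist())     # [True, False, False, False, True]
print(s[present].tolist())  # [5.0, -2.0, -6.0]
```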
Other
The name attribute:
Both the Series object itself and its index have a name attribute:
In [58]: obj2.name = 'population'
In [60]: obj2.index.name = 'index'
In [61]: obj2
Out[61]:
index
a 1
b 3
c -5
d 3
Name: population, dtype: int64
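One place where the names become visible is when the Series is converted to a DataFrame. The sketch below assumes the same values as obj2; `to_frame` is used only for illustration and is not part of the original session.

```python
import pandas as pd

s = pd.Series([1, 3, -5, 3], index=["a", "b", "c", "d"])
s.name = "population"
s.index.name = "index"

# to_frame() uses the Series name as the column label
# and carries the index name over unchanged.
frame = s.to_frame()

print(list(frame.columns))  # ['population']
print(frame.index.name)     # index
```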
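Quoted from Python for Data Analysis (利用Python进行数据分析):
A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type. A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that all share the same index.
Creation
Pass a dict of equal-length lists or NumPy arrays: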
In [1]: data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
...: 'year':[2000,2001,2002,2001,2002],
...: 'pop':[1.5,1.7,3.6,2.4,2.9]}
In [5]: frame = pd.DataFrame(data)
In [6]: frame
Out[6]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
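The columns above do not follow the order in which the dict was written: this older pandas version sorts them by name, while recent versions keep the dict's insertion order. Use columns to specify the column order explicitly: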
In [7]: frame2 = pd.DataFrame(data, columns=['year','state','pop'])
In [8]: frame2
Out[8]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
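Passing `columns` gives the same layout regardless of pandas version, which can be verified as follows. This is a sketch reusing the `data` dict from the session above.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}

# The columns argument fixes the column order explicitly.
frame = pd.DataFrame(data, columns=["year", "state", "pop"])

print(list(frame.columns))  # ['year', 'state', 'pop']
print(frame.shape)          # (5, 3)
```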
If a requested column does not exist in the data source, its entries are set to NaN:
In [9]: frame2 = pd.DataFrame(data, columns=['year','state','pop','num'])
In [10]: frame2
Out[10]:
year state pop num
0 2000 Ohio 1.5 NaN
1 2001 Ohio 1.7 NaN
2 2002 Ohio 3.6 NaN
3 2001 Nevada 2.4 NaN
4 2002 Nevada 2.9 NaN
As with Series, index can be used to specify the row labels:
In [11]: frame3 = pd.DataFrame(data, columns=['year','state','pop','num'], index=['1','2','3','4','5'])
In [12]: frame3
Out[12]:
year state pop num
1 2000 Ohio 1.5 NaN
2 2001 Ohio 1.7 NaN
3 2002 Ohio 3.6 NaN
4 2001 Nevada 2.4 NaN
5 2002 Nevada 2.9 NaN
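Both effects, the all-NaN column and the custom row labels, can be confirmed in a short script. A sketch under the same `data` dict as above; note that the row labels here are strings, not integers.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}

frame = pd.DataFrame(
    data,
    columns=["year", "state", "pop", "num"],  # 'num' is not a key of data
    index=["1", "2", "3", "4", "5"],          # string row labels
)

print(frame["num"].isna().all())  # True
print(list(frame.index))          # ['1', '2', '3', '4', '5']
```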
Reading
Since a DataFrame can be thought of as a dict of Series, a column is read dict-style by its label, and the result is a Series:
In [13]: frame2['state']
Out[13]:
0 Ohio
1 Ohio
2 Ohio
3 Nevada
4 Nevada
Name: state, dtype: object
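The returned object's type and name can be inspected directly. The sketch below uses a smaller, hypothetical frame; selecting with a list of labels is shown for contrast, since it returns a DataFrame rather than a Series.

```python
import pandas as pd

# Hypothetical small frame, used only for illustration.
frame = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "state": ["Ohio", "Ohio", "Nevada"],
})

col = frame["state"]            # a single label returns a Series
sub = frame[["year", "state"]]  # a list of labels returns a DataFrame

print(type(col).__name__, col.name)  # Series state
print(type(sub).__name__)            # DataFrame
```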
Columns are read as shown above. Rows can be fetched with ix, by either position or label:
In [14]: frame3.ix[0]
C:\ProgramData\Anaconda3\Scripts\ipython-script.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
if __name__ == '__main__':
Out[14]:
year 2000
state Ohio
pop 1.5
num NaN
Name: 1, dtype: object
In [15]: frame3.ix['1']
Out[15]:
year 2000
state Ohio
pop 1.5
num NaN
Name: 1, dtype: object
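As the warning on the positional lookup shows, ix is deprecated in favor of loc (lookup by row label) and iloc (lookup by integer position); ix was removed entirely in pandas 1.0: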
In [20]: frame3.loc['1']
Out[20]:
year 2000
state Ohio
pop 1.5
num NaN
Name: 1, dtype: object
In [21]: frame3.iloc[1]
Out[21]:
year 2001
state Ohio
pop 1.7
num NaN
Name: 2, dtype: object
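The difference between the two accessors is easiest to see when the row labels look like numbers. In the sketch below (a hypothetical three-row frame), the label '1' is the first row, while position 1 is the second row.

```python
import pandas as pd

# Hypothetical frame with string row labels '1', '2', '3'.
frame = pd.DataFrame(
    {"year": [2000, 2001, 2002], "state": ["Ohio", "Ohio", "Ohio"]},
    index=["1", "2", "3"],
)

by_label = frame.loc["1"]    # row whose index label is the string '1'
by_position = frame.iloc[1]  # second row (positions start at 0)

print(by_label["year"])         # 2000
print(by_position.name)         # 2
print(frame.loc["2", "state"])  # Ohio
```

`loc` also accepts a row label and a column label together, as in the last line, which returns a single cell.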