Data Analysis (1): pandas Data Structures

Series

A Series consists of a set of values together with an associated set of data labels (the index).


Creation

In [27]: obj = pd.Series([4, 5, -6, 2])

In [28]: obj
Out[28]:
0    4
1    5
2   -6
3    2
dtype: int64

If no index is given, pandas fills one in automatically with 0, 1, …; to set the index yourself, use the index parameter:

In [29]: obj = pd.Series([2,3,1,-5], index=['a','b','c','d'])

In [30]: obj
Out[30]:
a    2
b    3
c    1
d   -5
dtype: int64
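To make the two cases concrete, here is a minimal sketch that builds the same kind of Series with and without the index parameter. The variable names (default_idx, labeled, length_mismatch_raises) are my own and do not appear in the session above.

```python
import pandas as pd

# Without an index argument, pandas generates the positions 0..n-1 as labels.
default_idx = pd.Series([4, 5, -6, 2])

# With an index argument, each value is paired with the label at the same position.
labeled = pd.Series([2, 3, 1, -5], index=['a', 'b', 'c', 'd'])

print(list(default_idx.index))  # [0, 1, 2, 3]
print(list(labeled.index))      # ['a', 'b', 'c', 'd']
print(labeled['d'])             # -5

# For list data, the index must have exactly as many labels as there are values.
try:
    pd.Series([1, 2, 3], index=['a', 'b'])
    length_mismatch_raises = False
except ValueError:
    length_mismatch_raises = True
print(length_mismatch_raises)   # True
```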

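If the data is already stored in a dict, the dict can be used directly as the data source. The keys become the index. The output below comes from an older pandas that sorts the keys; recent versions keep the dict's insertion order: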

In [31]: sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}

In [32]: obj = pd.Series(sdata)

In [33]: obj
Out[33]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

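If an index is passed as well, pandas looks up each index label in the dict. Labels with no matching key get NaN, and dict keys that are not in the index are dropped: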

In [34]: states = ['California', 'Ohio', 'Oregon', 'Texas']

In [35]: obj = pd.Series(sdata, index=states)

In [36]: obj
Out[36]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
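The lookup behavior can be checked directly. This sketch reuses sdata and states from the session above; by_state is a name I introduced for illustration.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

by_state = pd.Series(sdata, index=states)

# 'California' has no key in sdata, so its entry is NaN.
print(by_state.isna().tolist())   # [True, False, False, False]

# NaN is a float, so the whole Series is stored as float64 instead of int64.
print(by_state.dtype)             # float64

# 'Utah' is in the dict but not in the requested index, so it is left out.
print('Utah' in by_state.index)   # False
```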

Reading
The values and index attributes of a Series return its array representation and its index object:

In [37]: obj = pd.Series([1,2,4,-3])

In [38]: obj.values
Out[38]: array([ 1,  2,  4, -3], dtype=int64)

In [39]: obj.index
Out[39]: RangeIndex(start=0, stop=4, step=1)

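If the Series has a custom index, the index attribute returns an Index object holding the actual labels attached to each value (obj2 here holds 1, 3, -5, 3 under the labels a to d):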

In [40]: obj2.index
Out[40]: Index(['a', 'b', 'c', 'd'], dtype='object')

One or more values can be read either through values or by indexing with labels:

In [42]: obj2.values[0]
Out[42]: 1

In [43]: obj2['a']
Out[43]: 1

In [45]: obj2.values[0:2]
Out[45]: array([1, 3], dtype=int64)

In [46]: obj2[['a','b','c']]
Out[46]:
a    1
b    3
c   -5
dtype: int64
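The session never shows how obj2 was built, so the sketch below reconstructs it from its printed contents (an assumption on my part) and repeats the reads as a script. The last line goes slightly beyond the session: a label slice through loc, which includes its end label, unlike a positional slice.

```python
import pandas as pd

# Reconstructed from the printed output of obj2; the original definition is not shown.
obj2 = pd.Series([1, 3, -5, 3], index=['a', 'b', 'c', 'd'])

first_by_position = obj2.values[0]     # positional access on the underlying NumPy array
first_by_label = obj2['a']             # label-based access on the Series itself
first_two = obj2.values[0:2]           # positional slice: end position excluded
subset = obj2[['a', 'b', 'c']]         # a list of labels returns a new Series
label_slice = obj2.loc['a':'c']        # label slice: end label 'c' included

print(first_by_position, first_by_label)  # 1 1
print(first_two.tolist())                 # [1, 3]
print(subset.tolist())                    # [1, 3, -5]
print(label_slice.tolist())               # [1, 3, -5]
```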

Operations

In [47]: obj2
Out[47]:
a    1
b    3
c   -5
d    3
dtype: int64

In [51]: obj3
Out[51]:
b    2
c    3
d   -9
e    2
dtype: int64

Filtering:
In [48]: obj2[obj2 > 0]
Out[48]:
a    1
b    3
d    3
dtype: int64
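Filtering works because the comparison itself produces a boolean Series, which is then used as a mask. A small sketch, with obj2 reconstructed from its printed contents; the names mask, positives and small_positives are mine:

```python
import pandas as pd

obj2 = pd.Series([1, 3, -5, 3], index=['a', 'b', 'c', 'd'])

# The comparison is evaluated element by element and keeps the index.
mask = obj2 > 0
print(mask.tolist())            # [True, True, False, True]

# Indexing with the mask keeps only the rows where it is True.
positives = obj2[mask]
print(list(positives.index))    # ['a', 'b', 'd']

# Conditions are combined with & and |, each wrapped in parentheses.
small_positives = obj2[(obj2 > 0) & (obj2 < 3)]
print(small_positives.tolist()) # [1]
```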

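Addition: in arithmetic operations, Series align automatically on their index labels; a label present in only one operand yields NaN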
In [52]: obj4 = obj2 + obj3

In [53]: obj4
Out[53]:
a    NaN
b    5.0
c   -2.0
d   -6.0
e    NaN
dtype: float64
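If NaN for unmatched labels is not what you want, one option is Series.add with fill_value, which treats a label missing from one side as that value. This is an addition to the session, not something the original shows; obj2 and obj3 are reconstructed from their printed contents.

```python
import pandas as pd

obj2 = pd.Series([1, 3, -5, 3], index=['a', 'b', 'c', 'd'])
obj3 = pd.Series([2, 3, -9, 2], index=['b', 'c', 'd', 'e'])

# The + operator aligns on labels; 'a' and 'e' exist on one side only, so they become NaN.
summed = obj2 + obj3
print(summed.isna().tolist())   # [True, False, False, False, True]

# add(..., fill_value=0) substitutes 0 for the missing side before adding.
filled = obj2.add(obj3, fill_value=0)
print(filled.tolist())          # [1.0, 5.0, -2.0, -6.0, 2.0]
```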

Multiplication:
In [54]: obj5 = obj2 * 2

In [55]: obj5
Out[55]:
a     2
b     6
c   -10
d     6
dtype: int64

Checking whether values are missing:
In [56]: pd.isnull(obj4)
Out[56]:
a     True
b    False
c    False
d    False
e     True
dtype: bool

In [57]: pd.notnull(obj4)
Out[57]:
a    False
b     True
c     True
d     True
e    False
dtype: bool
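The same checks are also available as methods on the Series, and the resulting mask is usually fed into a follow-up step. The sketch below rebuilds obj4 from its printed values (an assumption); dropna and fillna are standard pandas methods that the session above does not use.

```python
import numpy as np
import pandas as pd

# Rebuilt from the printed output of obj4.
obj4 = pd.Series([np.nan, 5.0, -2.0, -6.0, np.nan], index=['a', 'b', 'c', 'd', 'e'])

# Method form of pd.isnull(obj4); summing the booleans counts the missing entries.
missing_count = obj4.isnull().sum()
print(missing_count)              # 2

# Drop the missing entries, or replace them with a chosen value.
present = obj4.dropna()
zero_filled = obj4.fillna(0)
print(list(present.index))        # ['b', 'c', 'd']
print(zero_filled.tolist())       # [0.0, 5.0, -2.0, -6.0, 0.0]
```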

Other
The name attribute:
Both the Series object itself and its index have a name attribute:

In [58]: obj2.name = 'population'
In [60]: obj2.index.name = 'index'
In [61]: obj2
Out[61]:
index
a    1
b    3
c   -5
d    3
Name: population, dtype: int64
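One place these names become visible is when the Series is turned into a table. The sketch below is my own illustration, not part of the original session: to_frame uses the Series name as the column header, and reset_index uses the index name for the column it creates.

```python
import pandas as pd

obj2 = pd.Series([1, 3, -5, 3], index=['a', 'b', 'c', 'd'])
obj2.name = 'population'
obj2.index.name = 'index'

# The Series name becomes the column label.
as_frame = obj2.to_frame()
print(list(as_frame.columns))     # ['population']

# The index name labels the column that reset_index moves the index into.
flat = obj2.reset_index()
print(list(flat.columns))         # ['index', 'population']
```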

DataFrame

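Quoted from Python for Data Analysis (《利用Python进行数据分析》):
A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type. A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that all share the same index.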


Creation
Pass a dict of equal-length lists or NumPy arrays:

In [1]: data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
   ...: 'year':[2000,2001,2002,2001,2002],
   ...: 'pop':[1.5,1.7,3.6,2.4,2.9]}

In [5]: frame  = pd.DataFrame(data)

In [6]: frame
Out[6]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

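The columns above are not in the order they were written: this older pandas sorts the dict keys by name (recent versions keep the dict's insertion order). Use columns to specify the column order explicitly: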

In [7]: frame2  = pd.DataFrame(data, columns=['year','state','pop'])

In [8]: frame2
Out[8]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
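As a script, the construction with an explicit column order looks like the sketch below. It uses the same data dict as the session; passing columns makes the result independent of how a given pandas version orders dict keys.

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
}

# columns fixes the left-to-right order of the resulting frame.
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop'])

print(list(frame2.columns))   # ['year', 'state', 'pop']
print(frame2.shape)           # (5, 3)
```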

If a requested column cannot be found in the data source, its entries are set to NaN:

In [9]: frame2  = pd.DataFrame(data, columns=['year','state','pop','num'])

In [10]: frame2
Out[10]:
   year   state  pop  num
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN

As with Series, index can be used to specify the row index:

In [11]: frame3  = pd.DataFrame(data, columns=['year','state','pop','num'], index=['1','2','3','4','5'])

In [12]: frame3
Out[12]:
   year   state  pop  num
1  2000    Ohio  1.5  NaN
2  2001    Ohio  1.7  NaN
3  2002    Ohio  3.6  NaN
4  2001  Nevada  2.4  NaN
5  2002  Nevada  2.9  NaN
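A short sketch to confirm both effects at once: the unknown column num is entirely NaN, and the row labels are the strings passed in index (note they are '1' to '5', not the integers 1 to 5).

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
}

frame3 = pd.DataFrame(
    data,
    columns=['year', 'state', 'pop', 'num'],   # 'num' is not a key of data
    index=['1', '2', '3', '4', '5'],           # one string label per row
)

# Every entry of the column that was missing from data is NaN.
print(frame3['num'].isna().all())   # True

# The row labels are strings, exactly as passed in.
print(list(frame3.index))           # ['1', '2', '3', '4', '5']
```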

Reading
As noted above, a DataFrame can be viewed as a dict of Series, so a column is read dict-style by its label, which returns a Series:

In [13]: frame2['state']
Out[13]:
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object
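The dict analogy can be verified, and it also explains one pitfall with attribute-style access. The sketch below is my own illustration: a single label returns a Series, a list of labels returns a DataFrame, and a column named pop cannot be reached as frame2.pop because that name is already taken by the DataFrame.pop method.

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop'])

state_col = frame2['state']            # single label -> Series named after the column
two_cols = frame2[['year', 'pop']]     # list of labels -> DataFrame

print(type(state_col).__name__, state_col.name)   # Series state
print(type(two_cols).__name__)                    # DataFrame

# frame2.pop is the built-in method, not the 'pop' column; use brackets instead.
print(callable(frame2.pop))            # True
print(frame2['pop'].tolist())          # [1.5, 1.7, 3.6, 2.4, 2.9]
```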

Columns are read as shown above. Rows can be fetched with ix, either by position or by label:

In [14]: frame3.ix[0]
C:\ProgramData\Anaconda3\Scripts\ipython-script.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  if __name__ == '__main__':
Out[14]:
year     2000
state    Ohio
pop       1.5
num       NaN
Name: 1, dtype: object

In [15]: frame3.ix['1']
Out[15]:
year     2000
state    Ohio
pop       1.5
num       NaN
Name: 1, dtype: object

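As the warning on the positional read shows, ix is deprecated (it was removed entirely in pandas 1.0). Use loc for label-based access and iloc for position-based access instead. Positions are zero-based, so iloc[1] below returns the second row, the one labeled '2':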

In [20]: frame3.loc['1']
Out[20]:
year     2000
state    Ohio
pop       1.5
num       NaN
Name: 1, dtype: object

In [21]: frame3.iloc[1]
Out[21]:
year     2001
state    Ohio
pop       1.7
num       NaN
Name: 2, dtype: object
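The difference between the two accessors is easiest to see side by side. A minimal sketch using the same frame3 as above; the variable names on the left are mine.

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
}
frame3 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'num'],
                      index=['1', '2', '3', '4', '5'])

row_by_label = frame3.loc['1']        # label '1' -> first row
row_by_position = frame3.iloc[1]      # position 1 -> second row, labeled '2'
print(row_by_label['year'], row_by_position['year'])   # 2000 2001
print(row_by_position.name)                            # 2

# Both accessors also take a column selector for a single cell.
print(frame3.loc['1', 'year'])        # 2000
print(frame3.iloc[0, 0])              # 2000

# Slices differ: loc includes the end label, iloc excludes the end position.
print(list(frame3.loc['1':'2'].index))   # ['1', '2']
print(list(frame3.iloc[0:2].index))      # ['1', '2']
```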
