Python: Basics of the pandas DataFrame Data Structure

Introduction to DataFrame

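A DataFrame represents a rectangular table of data. Each column can hold a different value type, and the whole can be viewed as a dict of Series that share the same index. Internally, the data is stored as one or more two-dimensional blocks.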
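As a minimal sketch of the "dict of Series sharing one index" view, the snippet below builds a tiny frame (the two-row sample data and the name `df` are made up for illustration) and checks that each column is a Series with its own dtype and the same row index.

```python
import pandas as pd

# Hypothetical sample data: one text, one integer, and one float column.
df = pd.DataFrame({
    'state': ['ohio', 'Nevada'],
    'year': [2000, 2001],
    'pop': [1.5, 2.4],
})

# Each column carries its own dtype.
print(df.dtypes)

# Every column is a Series whose index is the DataFrame's row index.
for name in df.columns:
    column = df[name]
    assert isinstance(column, pd.Series)
    assert column.index.equals(df.index)
```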

1. Building a DataFrame from a dict of equal-length lists or NumPy arrays

import pandas as pd
import numpy as np
data = {
    'state': ['ohio', 'ohio', 'ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame = pd.DataFrame(data)
frame
    state  year  pop
0    ohio  2000  1.5
1    ohio  2001  1.7
2    ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2

2. For a large DataFrame, the head() method selects only the first five rows

frame.head()
    state  year  pop
0    ohio  2000  1.5
1    ohio  2001  1.7
2    ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
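Five is only the default row count. The sketch below uses a made-up ten-row frame to show that head() accepts an explicit count, and that tail() is its counterpart for the end of the frame.

```python
import pandas as pd

# Hypothetical ten-row frame, used only to make the row counts visible.
df = pd.DataFrame({'n': range(10)})

first_five = df.head()     # default: first 5 rows
first_three = df.head(3)   # explicit row count
last_two = df.tail(2)      # last 2 rows

print(len(first_five), len(first_three), len(last_two))
```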

3. If you specify a column order, the DataFrame's columns are arranged in that order

pd.DataFrame(data, columns = ['year', 'state', 'pop'])
   year   state  pop
0  2000    ohio  1.5
1  2001    ohio  1.7
2  2002    ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2
# If a requested column is not in the dict, it appears in the result filled with missing values
frame2 = pd.DataFrame(data, columns = ['year', 'state', 'pop', 'debt'], index = ['one', 'two', 'three', 'four', 'five', 'six'])
frame2
       year   state  pop debt
one    2000    ohio  1.5  NaN
two    2001    ohio  1.7  NaN
three  2002    ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')

4. A column of a DataFrame can be retrieved as a Series, either with dict-like notation or as an attribute

frame2['state']
one        ohio
two        ohio
three      ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
frame2.year
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64
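Attribute access only reaches a column when its name is a valid Python identifier and does not collide with an existing DataFrame attribute or method. The 'pop' column used in this post is such a collision, because DataFrame already has a pop() method. A small sketch with a made-up two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]})

# 'year' clashes with nothing, so attribute access returns the column.
year_column = df.year

# 'pop' is also a DataFrame method, so df.pop is the method, not the column.
pop_attribute = df.pop

# Dict-like notation always returns the column.
pop_column = df['pop']

print(type(year_column).__name__, callable(pop_attribute), type(pop_column).__name__)
```

Dict-like notation is therefore the safer choice whenever column names are not under your control.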

5. Selecting rows by label with the loc attribute

frame2.loc['three']
year     2002
state    ohio
pop       3.6
debt      NaN
Name: three, dtype: object
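loc selects by index label. Its position-based counterpart is iloc, which takes integer positions counted from 0. The sketch below uses a made-up three-row frame to show both reaching the same row.

```python
import pandas as pd

df = pd.DataFrame(
    {'year': [2000, 2001, 2002], 'pop': [1.5, 1.7, 3.6]},
    index=['one', 'two', 'three'],
)

by_label = df.loc['three']   # row whose index label is 'three'
by_position = df.iloc[2]     # third row, counting from 0

print(by_label)
print(by_position)
```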

6. Modifying column values

frame2['debt'] = 16.5
frame2
       year   state  pop  debt
one    2000    ohio  1.5  16.5
two    2001    ohio  1.7  16.5
three  2002    ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5
frame2['ddebt'] = np.arange(6.)
frame2
       year   state  pop  debt  ddebt
one    2000    ohio  1.5  16.5    0.0
two    2001    ohio  1.7  16.5    1.0
three  2002    ohio  3.6  16.5    2.0
four   2001  Nevada  2.4  16.5    3.0
five   2002  Nevada  2.9  16.5    4.0
six    2003  Nevada  3.2  16.5    5.0

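7. When you assign a list or array to a column, its length must match the length of the DataFrame. If you assign a Series, its labels are realigned to the DataFrame's index, and missing values are filled in wherever a label has no match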

val = pd.Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])
frame2['debt'] = val
frame2
       year   state  pop  debt  ddebt
one    2000    ohio  1.5   NaN    0.0
two    2001    ohio  1.7  -1.2    1.0
three  2002    ohio  3.6   NaN    2.0
four   2001  Nevada  2.4  -1.5    3.0
five   2002  Nevada  2.9  -1.7    4.0
six    2003  Nevada  3.2   NaN    5.0
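The two rules above can be contrasted in one short sketch. The three-row frame and the label 'nine' are made up for illustration: a list of the wrong length raises a ValueError, while a Series is matched on labels, so unmatched rows become NaN and labels that the DataFrame does not have are ignored.

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001, 2002]}, index=['one', 'two', 'three'])

# A list with the wrong number of elements is rejected.
length_error = None
try:
    df['debt'] = [1.0, 2.0]
except ValueError as exc:
    length_error = exc
    print('ValueError:', exc)

# A Series is aligned on labels instead: 'one' and 'three' get NaN,
# and the label 'nine' (absent from df) is dropped.
df['debt'] = pd.Series([-1.2, -9.9], index=['two', 'nine'])
print(df)
```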

Assigning to a column that does not exist creates a new column. The del keyword deletes a column.

frame2['eastern'] = frame2.state == 'ohio'
frame2
       year   state  pop  debt  ddebt  eastern
one    2000    ohio  1.5   NaN    0.0     True
two    2001    ohio  1.7  -1.2    1.0     True
three  2002    ohio  3.6   NaN    2.0     True
four   2001  Nevada  2.4  -1.5    3.0    False
five   2002  Nevada  2.9  -1.7    4.0    False
six    2003  Nevada  3.2   NaN    5.0    False
del frame2['eastern']
frame2
       year   state  pop  debt  ddebt
one    2000    ohio  1.5   NaN    0.0
two    2001    ohio  1.7  -1.2    1.0
three  2002    ohio  3.6   NaN    2.0
four   2001  Nevada  2.4  -1.5    3.0
five   2002  Nevada  2.9  -1.7    4.0
six    2003  Nevada  3.2   NaN    5.0
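Two details are easy to miss here, sketched below on a made-up two-row frame. First, a new column can only be created with bracket notation; assigning through an attribute just sets a plain Python attribute (pandas normally emits a UserWarning, which is silenced in this sketch). Second, del changes the frame in place, whereas drop(columns=...) returns a new frame and leaves the original untouched.

```python
import warnings

import pandas as pd

df = pd.DataFrame({'state': ['ohio', 'Nevada']})

# Attribute assignment does NOT create a column.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.eastern = [True, False]
created_by_attribute = 'eastern' in df.columns
print(created_by_attribute)

# Bracket assignment does create the column.
df['eastern'] = df['state'] == 'ohio'
created_by_brackets = 'eastern' in df.columns

# drop() returns a new DataFrame; df still has the column afterwards.
trimmed = df.drop(columns=['eastern'])
still_in_original = 'eastern' in df.columns

# del removes the column from df itself.
del df['eastern']
```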

8. Nested dict of dicts

If a nested dict is passed to DataFrame, pandas uses the outer dict's keys as the columns and the inner dicts' keys as the row index

pop = {
    'Nevada': {2001: 2.4, 2002: 2.9},
    'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6},
}
frame3 = pd.DataFrame(pop)
frame3
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5
# Explicitly specify the order of the index
pd.DataFrame(pop, index = [2001, 2002, 2003])
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
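If the outer keys should become the rows instead, the frame can be transposed with the T attribute. A minimal sketch using the same nested dict as above (the names `df` and `transposed` are chosen only for this example):

```python
import pandas as pd

pop = {
    'Nevada': {2001: 2.4, 2002: 2.9},
    'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6},
}
df = pd.DataFrame(pop)

# Swap rows and columns: states become the index, years the columns.
transposed = df.T
print(transposed)
```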

A dict of Series can also be used to construct a DataFrame

pdata = {
    'Ohio': frame3['Ohio'][:-1],
    'Nevada': frame3['Nevada'][:2],
}
pd.DataFrame(pdata)
      Ohio  Nevada
2001   1.7     2.4
2002   3.6     2.9

A DataFrame's index and columns each have a name attribute

frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
state  Nevada  Ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5

The values attribute returns the data in a DataFrame as a two-dimensional ndarray

frame3.values
array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])
frame2.values
array([[2000, 'ohio', 1.5, nan, 0.0],
       [2001, 'ohio', 1.7, -1.2, 1.0],
       [2002, 'ohio', 3.6, nan, 2.0],
       [2001, 'Nevada', 2.4, -1.5, 3.0],
       [2002, 'Nevada', 2.9, -1.7, 4.0],
       [2003, 'Nevada', 3.2, nan, 5.0]], dtype=object)
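The two arrays above differ in dtype because a single ndarray needs one dtype that fits every column. The sketch below uses two made-up frames and the to_numpy() method, which returns the same kind of array as values, to show an all-float frame yielding a float64 array and a frame mixing numbers and text falling back to object.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
mixed = pd.DataFrame({'year': [2000, 2001], 'state': ['ohio', 'Nevada']})

numeric_array = numeric.to_numpy()
mixed_array = mixed.to_numpy()

print(numeric_array.dtype, numeric_array.shape)
print(mixed_array.dtype, mixed_array.shape)
```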
