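A DataFrame represents a rectangular table of data. Each column can hold a different value type, and the DataFrame can be thought of as a dict of Series that all share the same index. Internally, the data is stored as one or more two-dimensional blocks.
1. Build a DataFrame from a dict of equal-length lists or NumPy arrays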
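As a minimal sketch of the "dict of Series" view, the snippet below checks that every column is a Series with its own dtype and that all columns share the frame's index. The column names `city`, `size`, and `count` are made up for illustration.

```python
import pandas as pd

# Hypothetical columns, each holding a different value type.
df = pd.DataFrame({"city": ["a", "b"], "size": [1.5, 2.5], "count": [1, 2]})

# Each column is retrieved as a Series with its own dtype.
size_dtype = str(df["size"].dtype)
count_dtype = str(df["count"].dtype)

# Every column shares the DataFrame's index.
shares_index = all(df[col].index.equals(df.index) for col in df.columns)
```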
import pandas as pd
import numpy as np
data = {
'state' : ['ohio', 'ohio', 'ohio', 'Nevada', 'Nevada', 'Nevada'],
'year' : [2000, 2001, 2002, 2001, 2002, 2003],
'pop' : [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame
| | state | year | pop |
|---|---|---|---|
| 0 | ohio | 2000 | 1.5 |
| 1 | ohio | 2001 | 1.7 |
| 2 | ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |
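The same constructor also accepts NumPy arrays as the dict values. The sketch below uses made-up column names `x` and `y`, and also shows that values of unequal length are rejected with a `ValueError`.

```python
import numpy as np
import pandas as pd

# Dict of equal-length NumPy arrays (hypothetical column names).
arrays = {"x": np.arange(3), "y": np.array([0.5, 1.5, 2.5])}
df = pd.DataFrame(arrays)

# Values of different lengths cannot form a rectangular table.
try:
    pd.DataFrame({"x": [1, 2, 3], "y": [1, 2]})
    mismatch = None
except ValueError as exc:
    mismatch = exc
```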
2. For a large DataFrame, the head() method selects only the first five rows
frame.head()
| | state | year | pop |
|---|---|---|---|
| 0 | ohio | 2000 | 1.5 |
| 1 | ohio | 2001 | 1.7 |
| 2 | ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
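Five is only the default. As a small sketch (the ten-row frame here is made up for illustration), `head` takes an optional row count, and `tail` does the same from the other end:

```python
import pandas as pd

# A hypothetical frame with ten rows.
df = pd.DataFrame({"year": list(range(2000, 2010))})

first_five = df.head()    # default: first 5 rows
first_three = df.head(3)  # explicit row count
last_two = df.tail(2)     # last 2 rows
```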
3. If you specify a sequence of columns, the DataFrame's columns are arranged in that order
pd.DataFrame(data, columns = ['year', 'state', 'pop'])
| | year | state | pop |
|---|---|---|---|
| 0 | 2000 | ohio | 1.5 |
| 1 | 2001 | ohio | 1.7 |
| 2 | 2002 | ohio | 3.6 |
| 3 | 2001 | Nevada | 2.4 |
| 4 | 2002 | Nevada | 2.9 |
| 5 | 2003 | Nevada | 3.2 |
# If a requested column is not in the dict, it appears in the result filled with missing values (NaN)
frame2 = pd.DataFrame(data, columns = ['year', 'state', 'pop', 'debt'], index = ['one', 'two', 'three', 'four', 'five', 'six'])
frame2
| | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | ohio | 1.5 | NaN |
| two | 2001 | ohio | 1.7 | NaN |
| three | 2002 | ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | NaN |
| five | 2002 | Nevada | 2.9 | NaN |
| six | 2003 | Nevada | 3.2 | NaN |
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
4. A column of a DataFrame can be retrieved as a Series, either with dict-like notation or as an attribute
frame2['state']
one ohio
two ohio
three ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
frame2.year
one 2000
two 2001
three 2002
four 2001
five 2002
six 2003
Name: year, dtype: int64
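The two access styles are not fully interchangeable. Attribute access only works when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute. The sketch below uses a cut-down version of the data above to show that a column named `pop` is shadowed by the `DataFrame.pop` method, so bracket notation is the safer choice:

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})

by_key = df["pop"]    # the column, as a Series
by_attr = df.pop      # the bound method DataFrame.pop, not the column

# For a non-colliding name, both styles return the same Series.
same_year = df.year.equals(df["year"])
```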
5. Select rows by label with the loc attribute (use iloc to select by integer position)
frame2.loc['three']
year 2002
state ohio
pop 3.6
debt NaN
Name: three, dtype: object
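To make the label/position distinction concrete, here is a minimal sketch on a small frame modeled on the one above: `loc` looks a row up by its index label, while `iloc` uses its integer position.

```python
import pandas as pd

df = pd.DataFrame(
    {"year": [2000, 2001, 2002], "pop": [1.5, 1.7, 3.6]},
    index=["one", "two", "three"],
)

by_label = df.loc["three"]   # row whose index label is 'three'
by_position = df.iloc[2]     # third row, counting from 0
```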
6. Modify a column's values by assigning a scalar or an array
frame2['debt'] = 16.5
frame2
| | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | ohio | 1.5 | 16.5 |
| two | 2001 | ohio | 1.7 | 16.5 |
| three | 2002 | ohio | 3.6 | 16.5 |
| four | 2001 | Nevada | 2.4 | 16.5 |
| five | 2002 | Nevada | 2.9 | 16.5 |
| six | 2003 | Nevada | 3.2 | 16.5 |
frame2['ddebt'] = np.arange(6.)
frame2
| | year | state | pop | debt | ddebt |
|---|---|---|---|---|---|
| one | 2000 | ohio | 1.5 | 16.5 | 0.0 |
| two | 2001 | ohio | 1.7 | 16.5 | 1.0 |
| three | 2002 | ohio | 3.6 | 16.5 | 2.0 |
| four | 2001 | Nevada | 2.4 | 16.5 | 3.0 |
| five | 2002 | Nevada | 2.9 | 16.5 | 4.0 |
| six | 2003 | Nevada | 3.2 | 16.5 | 5.0 |
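7. When you assign a list or array to a column, its length must match the length of the DataFrame. If you assign a Series, its labels are realigned to the DataFrame's index, and missing values are inserted wherever the Series has no matching label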
val = pd.Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])
frame2['debt'] = val
frame2
| | year | state | pop | debt | ddebt |
|---|---|---|---|---|---|
| one | 2000 | ohio | 1.5 | NaN | 0.0 |
| two | 2001 | ohio | 1.7 | -1.2 | 1.0 |
| three | 2002 | ohio | 3.6 | NaN | 2.0 |
| four | 2001 | Nevada | 2.4 | -1.5 | 3.0 |
| five | 2002 | Nevada | 2.9 | -1.7 | 4.0 |
| six | 2003 | Nevada | 3.2 | NaN | 5.0 |
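Both rules can be checked on a tiny frame. In this sketch the names `x`, `y`, and `z` are made up. A list of the wrong length is rejected with a `ValueError`, while a Series is matched by label rather than by order, with NaN filling the unmatched row:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])

# A list must have exactly as many values as the frame has rows.
try:
    df["y"] = [10, 20]
    length_error = None
except ValueError as exc:
    length_error = exc

# A Series is aligned on its labels: 'c' and 'a' match, 'b' becomes NaN.
df["z"] = pd.Series([1.0, 2.0], index=["c", "a"])
```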
Assigning to a column that does not exist creates a new column. The del keyword deletes a column
frame2['eastern'] = frame2.state == 'ohio'
frame2
| | year | state | pop | debt | ddebt | eastern |
|---|---|---|---|---|---|---|
| one | 2000 | ohio | 1.5 | NaN | 0.0 | True |
| two | 2001 | ohio | 1.7 | -1.2 | 1.0 | True |
| three | 2002 | ohio | 3.6 | NaN | 2.0 | True |
| four | 2001 | Nevada | 2.4 | -1.5 | 3.0 | False |
| five | 2002 | Nevada | 2.9 | -1.7 | 4.0 | False |
| six | 2003 | Nevada | 3.2 | NaN | 5.0 | False |
del frame2['eastern']
frame2
| | year | state | pop | debt | ddebt |
|---|---|---|---|---|---|
| one | 2000 | ohio | 1.5 | NaN | 0.0 |
| two | 2001 | ohio | 1.7 | -1.2 | 1.0 |
| three | 2002 | ohio | 3.6 | NaN | 2.0 |
| four | 2001 | Nevada | 2.4 | -1.5 | 3.0 |
| five | 2002 | Nevada | 2.9 | -1.7 | 4.0 |
| six | 2003 | Nevada | 3.2 | NaN | 5.0 |
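Note that `del` modifies the frame in place. When the original should stay untouched, one possible alternative is `drop(columns=...)`, which returns a new DataFrame. A minimal sketch on a cut-down version of the data above:

```python
import pandas as pd

df = pd.DataFrame({"state": ["ohio", "Nevada"], "pop": [1.5, 2.4]})
df["eastern"] = df["state"] == "ohio"   # creates the new column

# drop returns a new frame; df still has 'eastern' at this point.
without_eastern = df.drop(columns=["eastern"])
still_present = "eastern" in df.columns

# del removes the column from df itself.
del df["eastern"]
```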
8. Nested dict of dicts
If a nested dict is passed to DataFrame, pandas uses the outer dict keys as the columns and the inner dict keys as the row index
pop = {
'Nevada': {2001: 2.4, 2002: 2.9},
'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3
| | Nevada | Ohio |
|---|---|---|
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2000 | NaN | 1.5 |
# Explicitly specify the index labels
pd.DataFrame(pop, index = [2001, 2002, 2003])
| | Nevada | Ohio |
|---|---|---|
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2003 | NaN | NaN |
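If the opposite orientation is wanted, with the outer keys as rows, one option is `DataFrame.from_dict` with `orient="index"`. The sketch below uses the same nested dict under the illustrative name `populations`. It also sorts the year index, since the inner keys otherwise appear in insertion order:

```python
import pandas as pd

populations = {
    "Nevada": {2001: 2.4, 2002: 2.9},
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
}

# Outer keys as columns (the default), with the year index sorted.
by_column = pd.DataFrame(populations).sort_index()

# Outer keys as rows instead.
by_row = pd.DataFrame.from_dict(populations, orient="index")
```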
A dict of Series can also be used to construct a DataFrame
# iloc makes the positional slicing explicit, since the index holds integer year labels
pdata = {
'Ohio': frame3['Ohio'].iloc[:-1],
'Nevada': frame3['Nevada'].iloc[:2]}
pd.DataFrame(pdata)
| | Ohio | Nevada |
|---|---|---|
| 2001 | 1.7 | 2.4 |
| 2002 | 3.6 | 2.9 |
A DataFrame's index and columns each have a name attribute, which is shown in the output when set
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
| state | Nevada | Ohio |
|---|---|---|
| year | | |
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2000 | NaN | 1.5 |
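The axis names are more than display labels. As a sketch (not the only way to do this), `rename_axis` sets both names in one call, and `reset_index` then turns the named index into a regular column called `year`:

```python
import pandas as pd

frame = pd.DataFrame(
    {"Nevada": [2.4, 2.9], "Ohio": [1.7, 3.6]},
    index=[2001, 2002],
)

# Set the index name and the columns name together.
named = frame.rename_axis(index="year", columns="state")

# The index name becomes the column label after reset_index.
flat = named.reset_index()
```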
The values attribute returns the data in the DataFrame as a two-dimensional ndarray
frame3.values
array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])
# When the columns have different dtypes, the array's dtype is chosen to accommodate all of them (here, object)
frame2.values
array([[2000, 'ohio', 1.5, nan, 0.0],
       [2001, 'ohio', 1.7, -1.2, 1.0],
       [2002, 'ohio', 3.6, nan, 2.0],
       [2001, 'Nevada', 2.4, -1.5, 3.0],
       [2002, 'Nevada', 2.9, -1.7, 4.0],
       [2003, 'Nevada', 3.2, nan, 5.0]], dtype=object)
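The dtype behavior can be reproduced on two small frames with made-up columns `a`, `b`, and `s`. This sketch uses `to_numpy()`, a method that returns the same kind of array as `values`:

```python
import pandas as pd

numeric = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
mixed = pd.DataFrame({"a": [1.0, 2.0], "s": ["x", "y"]})

# All-float columns give a float64 array.
numeric_arr = numeric.to_numpy()

# Float and string columns have no common numeric dtype, so the result is object.
mixed_arr = mixed.to_numpy()
```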