数据结构Series
构造和初始化Series
import pandas as pd
import numpy as np
Series是一个一维的数据结构
s = pd.Series([7, 'Beijing', 2.17, -1232, 'Happy birthday!'])
s
0 7
1 Beijing
2 2.17
3 -1232
4 Happy birthday!
dtype: object
pandas会默认用0到n来作为Series的index,但是我们也可以自己指定index。
s = pd.Series([7, 'Beijing', 2.17, -1232, 'Happy birthday!'],
index = ['A', 'B', 'C', 'D', 'E'])
type(s)
pandas.core.series.Series
还可以用dictionary来构造一个Series,因为Series本来就是key value pairs。
cities = {'Beijing': 55000, 'Shanghai': 60000, 'Shenzhen': 50000, 'Hangzhou':20000, \
'Guangzhou': 25000, 'Suzhou': None}
apts = pd.Series(cities)
print(apts)
print(type(apts))
Beijing 55000.0
Guangzhou 25000.0
Hangzhou 20000.0
Shanghai 60000.0
Shenzhen 50000.0
Suzhou NaN
dtype: float64
选择数据
前面定义的index就是用来选择数据的
apts["Hangzhou"]
20000.0
apts[["Hangzhou", "Beijing", "Shenzhen"]]
Hangzhou 20000.0
Beijing 55000.0
Shenzhen 50000.0
dtype: float64
boolean indexing在pandas当中也可以使用。
下面我再详细展示一下这个boolean indexing是如何工作的
apts[apts < 50000]
Guangzhou 25000.0
Hangzhou 20000.0
dtype: float64
less_than_50000 = apts < 50000
print(less_than_50000)
Beijing False
Guangzhou True
Hangzhou True
Shanghai False
Shenzhen False
Suzhou False
dtype: bool
apts[less_than_50000]
Guangzhou 25000.0
Hangzhou 20000.0
dtype: float64
Series元素赋值
Series的元素可以被赋值
print("Old value:", apts['Shenzhen'])
Old value: 50000.0
apts['Shenzhen'] = 55000
apts['Shenzhen']
55000.0
前面的boolean indexing在赋值的时候也可以用
print(apts[apts < 50000])
Guangzhou 25000.0
Hangzhou 20000.0
dtype: float64
apts[apts <= 50000] = 40000
apts
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 40000.0
Shanghai 60000.0
Shenzhen 55000.0
Suzhou NaN
dtype: float64
数学运算
下面我们来讲一些基本的数学运算。
apts / 2
Beijing 27500.0
Guangzhou 20000.0
Hangzhou 20000.0
Shanghai 30000.0
Shenzhen 27500.0
Suzhou NaN
dtype: float64
apts * 2
Beijing 110000.0
Guangzhou 80000.0
Hangzhou 80000.0
Shanghai 120000.0
Shenzhen 110000.0
Suzhou NaN
dtype: float64
np.square(apts)
Beijing 3.025000e+09
Guangzhou 1.600000e+09
Hangzhou 1.600000e+09
Shanghai 3.600000e+09
Shenzhen 3.025000e+09
Suzhou NaN
dtype: float64
square也可以写成 **
apts ** 2
Beijing 3.025000e+09
Guangzhou 1.600000e+09
Hangzhou 1.600000e+09
Shanghai 3.600000e+09
Shenzhen 3.025000e+09
Suzhou NaN
dtype: float64
我们再定义一个新的Series做加法
cars = pd.Series({'Beijing': 300000, 'Shanghai': 400000, 'Shenzhen': 300000, \
'Tianjin': 200000, 'Guangzhou': 200000, 'Chongqing': 150000})
cars
Beijing 300000
Chongqing 150000
Guangzhou 200000
Shanghai 400000
Shenzhen 300000
Tianjin 200000
dtype: int64
print(cars + apts * 100)
Beijing 5800000.0
Chongqing NaN
Guangzhou 4200000.0
Hangzhou NaN
Shanghai 6400000.0
Shenzhen 5800000.0
Suzhou NaN
Tianjin NaN
dtype: float64
数据缺失
'Hangzhou' in apts
True
'Hangzhou' in cars
False
apts.notnull()
Beijing True
Guangzhou True
Hangzhou True
Shanghai True
Shenzhen True
Suzhou False
dtype: bool
apts.isnull()
Beijing False
Guangzhou False
Hangzhou False
Shanghai False
Shenzhen False
Suzhou True
dtype: bool
apts[apts.isnull()]
Suzhou NaN
dtype: float64
apts[apts.notnull()]
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 40000.0
Shanghai 60000.0
Shenzhen 55000.0
dtype: float64
apts[apts.isnull() == False]
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 40000.0
Shanghai 60000.0
Shenzhen 55000.0
dtype: float64
数据结构Dataframe
一个Dataframe就是一张表格,Series表示的是一维数组,Dataframe则是一个二维数组,可以类比成一张excel的spreadsheet。也可以把Dataframe当做一组Series的集合。
创建一个DataFrame
dataframe可以由一个dictionary构造得到。
data = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou', 'Chongqing'],
'year': [2016,2017,2016,2017,2016, 2016],
'population': [2100, 2300, 1000, 700, 500, 500]}
pd.DataFrame(data)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
city |
population |
year |
0 |
Beijing |
2100 |
2016 |
1 |
Shanghai |
2300 |
2017 |
2 |
Guangzhou |
1000 |
2016 |
3 |
Shenzhen |
700 |
2017 |
4 |
Hangzhou |
500 |
2016 |
5 |
Chongqing |
500 |
2016 |
columns的名字和顺序可以指定
pd.DataFrame(data, columns = ['year', 'city', 'population'])
|
year |
city |
population |
0 |
2016 |
Beijing |
2100 |
1 |
2017 |
Shanghai |
2300 |
2 |
2016 |
Guangzhou |
1000 |
3 |
2017 |
Shenzhen |
700 |
4 |
2016 |
Hangzhou |
500 |
5 |
2016 |
Chongqing |
500 |
pd.DataFrame(data, columns = ['year', 'city', 'population', 'debt'])
|
year |
city |
population |
debt |
0 |
2016 |
Beijing |
2100 |
NaN |
1 |
2017 |
Shanghai |
2300 |
NaN |
2 |
2016 |
Guangzhou |
1000 |
NaN |
3 |
2017 |
Shenzhen |
700 |
NaN |
4 |
2016 |
Hangzhou |
500 |
NaN |
5 |
2016 |
Chongqing |
500 |
NaN |
frame2 = pd.DataFrame(data, columns = ['year', 'city', 'population', 'debt'],
index=['one', 'two', 'three', 'four', 'five', 'six'])
print(frame2)
year city population debt
one 2016 Beijing 2100 NaN
two 2017 Shanghai 2300 NaN
three 2016 Guangzhou 1000 NaN
four 2017 Shenzhen 700 NaN
five 2016 Hangzhou 500 NaN
six 2016 Chongqing 500 NaN
从DataFrame里选择数据
frame2['city']
one Beijing
two Shanghai
three Guangzhou
four Shenzhen
five Hangzhou
six Chongqing
Name: city, dtype: object
type(frame2['city'])
pandas.core.series.Series
print(frame2.city)
one Beijing
two Shanghai
three Guangzhou
four Shenzhen
five Hangzhou
six Chongqing
Name: city, dtype: object
frame2.ix['three']
year 2016
city Guangzhou
population 1000
debt NaN
Name: three, dtype: object
type(frame2.ix['three'])
pandas.core.series.Series
print(frame2.ix[2])
year 2016
city Guangzhou
population 1000
debt NaN
Name: three, dtype: object
下面这种方法默认用来选列而不是选行
DataFrame元素赋值
frame2["population"]["one"] = 2200
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
"""Entry point for launching an IPython kernel.
frame2
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
year |
city |
population |
debt |
one |
2016 |
Beijing |
2200 |
NaN |
two |
2017 |
Shanghai |
2300 |
NaN |
three |
2016 |
Guangzhou |
1000 |
NaN |
four |
2017 |
Shenzhen |
700 |
NaN |
five |
2016 |
Hangzhou |
500 |
NaN |
six |
2016 |
Chongqing |
500 |
NaN |
可以给一整列赋值
frame2['debt'] = 100000000
frame2
|
year |
city |
population |
debt |
one |
2016 |
Beijing |
2200 |
100000000 |
two |
2017 |
Shanghai |
2300 |
100000000 |
three |
2016 |
Guangzhou |
1000 |
100000000 |
four |
2017 |
Shenzhen |
700 |
100000000 |
five |
2016 |
Hangzhou |
500 |
100000000 |
six |
2016 |
Chongqing |
500 |
100000000 |
frame2.ix['six'] = 0
frame2
|
year |
city |
population |
debt |
one |
2016 |
Beijing |
2200 |
100000000 |
two |
2017 |
Shanghai |
2300 |
100000000 |
three |
2016 |
Guangzhou |
1000 |
100000000 |
four |
2017 |
Shenzhen |
700 |
100000000 |
five |
2016 |
Hangzhou |
500 |
100000000 |
six |
0 |
0 |
0 |
0 |
frame2 = pd.DataFrame(data, \
columns = ['year', 'city', 'population', 'debt'],
index = ['one', 'two', 'three', 'four', 'five', 'six'])
print(frame2)
year city population debt
one 2016 Beijing 2100 NaN
two 2017 Shanghai 2300 NaN
three 2016 Guangzhou 1000 NaN
four 2017 Shenzhen 700 NaN
five 2016 Hangzhou 500 NaN
six 2016 Chongqing 500 NaN
frame2.debt = np.arange(6)
frame2
|
year |
city |
population |
debt |
one |
2016 |
Beijing |
2100 |
0 |
two |
2017 |
Shanghai |
2300 |
1 |
three |
2016 |
Guangzhou |
1000 |
2 |
four |
2017 |
Shenzhen |
700 |
3 |
five |
2016 |
Hangzhou |
500 |
4 |
six |
2016 |
Chongqing |
500 |
5 |
还可以用Series来指定需要修改的index以及相对应的value,没有指定的默认用NaN.
val = pd.Series([200, 300, 500], index=['two', 'three', 'five'])
val
frame2['debt'] = val
print(frame2)
year city population debt
one 2016 Beijing 2100 NaN
two 2017 Shanghai 2300 200.0
three 2016 Guangzhou 1000 300.0
four 2017 Shenzhen 700 NaN
five 2016 Hangzhou 500 500.0
six 2016 Chongqing 500 NaN
frame2['western'] = (frame2.city == 'Chongqing')
print(frame2)
year city population debt western
one 2016 Beijing 2100 NaN False
two 2017 Shanghai 2300 200.0 False
three 2016 Guangzhou 1000 300.0 False
four 2017 Shenzhen 700 NaN False
five 2016 Hangzhou 500 500.0 False
six 2016 Chongqing 500 NaN True
frame2.columns
Index(['year', 'city', 'population', 'debt', 'western'], dtype='object')
frame2.index
Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')
一个DataFrame就和一个numpy 2d array一样,可以被转置
frame2.T
|
one |
two |
three |
four |
five |
six |
year |
2016 |
2017 |
2016 |
2017 |
2016 |
2016 |
city |
Beijing |
Shanghai |
Guangzhou |
Shenzhen |
Hangzhou |
Chongqing |
population |
2100 |
2300 |
1000 |
700 |
500 |
500 |
debt |
NaN |
200 |
300 |
NaN |
500 |
NaN |
western |
False |
False |
False |
False |
False |
True |
pop = {'Beijing': {2016: 2100, 2017:2200},
'Shanghai': {2015:2400, 2016:2500, 2017:2600}}
frame3 = pd.DataFrame(pop)
print(frame3)
Beijing Shanghai
2015 NaN 2400
2016 2100.0 2500
2017 2200.0 2600
print(frame3.T)
2015 2016 2017
Beijing NaN 2100.0 2200.0
Shanghai 2400.0 2500.0 2600.0
指定index的顺序,以及使用切片初始化数据
frame3['Beijing'][1:3]
2016 2100.0
2017 2200.0
Name: Beijing, dtype: float64
frame3['Shanghai'][:-1]
2015 2400
2016 2500
Name: Shanghai, dtype: int64
pdata = {'Beijing': frame3['Beijing'][:-1], 'Shanghai':frame3['Shanghai'][:-1]}
print(pd.DataFrame(pdata))
Beijing Shanghai
2015 NaN 2400
2016 2100.0 2500
我们还可以指定index的名字和列的名字
frame3.index.name = 'year'
frame3.columns.name = 'city'
frame3
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
city |
Beijing |
Shanghai |
year |
|
|
2015 |
NaN |
2400 |
2016 |
2100.0 |
2500 |
2017 |
2200.0 |
2600 |
print(frame3.values)
type(frame3.values)
[[ nan 2400.]
[2100. 2500.]
[2200. 2600.]]
numpy.ndarray
Index
index object
obj = pd.Series(range(3), index=['a', 'b', 'c'])
print(obj)
index = obj.index
print(index)
print(index[1:])
a 0
b 1
c 2
dtype: int64
Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')
index的值是不能被更改的
index[1] = 'd'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in ()
----> 1 index[1] = 'd'
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
1722
1723 def __setitem__(self, key, value):
-> 1724 raise TypeError("Index does not support mutable operations")
1725
1726 def __getitem__(self, key):
TypeError: Index does not support mutable operations
index = pd.Index(np.arange(3))
obj2 = pd.Series([2,5,7], index=index)
print(obj2)
print(obj2.index is index)
0 2
1 5
2 7
dtype: int64
True
pop = {'Beijing': {2016: 2100, 2017:2200},
'Shanghai': {2015:2400, 2016:2500, 2017:2600}}
frame3 = pd.DataFrame(pop)
print('Shanghai' in frame3.columns)
print(2015 in frame3.columns)
True
False
针对index进行索引和切片
obj = pd.Series(np.arange(4), index=['a','b','c','d'])
print(obj)
a 0
b 1
c 2
d 3
dtype: int32
obj['b']
obj[['b', 'a']]
b 1
a 0
dtype: int32
默认的数字index依旧可以使用
obj[[0, 2]]
a 0
c 2
dtype: int32
下面介绍如何对Series进行切片
obj[1:3]
b 1
c 2
dtype: int32
obj['b':'d'] = 5
print(obj)
a 0
b 5
c 5
d 5
dtype: int32
对DataFrame进行Indexing与Series基本相同
frame = pd.DataFrame(np.arange(9).reshape(3,3),
index = ['a', 'c', 'd'],
columns = ['Hangzhou', 'Shenzhen', 'Nanjing'])
print(frame)
Hangzhou Shenzhen Nanjing
a 0 1 2
c 3 4 5
d 6 7 8
frame['Hangzhou']
a 0
c 3
d 6
Name: Hangzhou, dtype: int32
frame[:2]
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
Hangzhou |
Shenzhen |
Nanjing |
a |
0 |
1 |
2 |
c |
3 |
4 |
5 |
frame.ix['a']
Hangzhou 0
Shenzhen 1
Nanjing 2
Name: a, dtype: int32
frame.ix['a':'d']
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
Hangzhou |
Shenzhen |
Nanjing |
a |
0 |
1 |
2 |
c |
3 |
4 |
5 |
d |
6 |
7 |
8 |
frame['a':'d']
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
Hangzhou |
Shenzhen |
Nanjing |
a |
0 |
1 |
2 |
c |
3 |
4 |
5 |
d |
6 |
7 |
8 |
frame['Hangzhou':'Najing']
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
Hangzhou |
Shenzhen |
Nanjing |
frame.ix[['a','d'], ['Shenzhen', 'Nanjing']]
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
Shenzhen |
Nanjing |
a |
1 |
2 |
d |
7 |
8 |
frame.ix[:, 'Shenzhen':'Nanjing']
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
Shenzhen |
Nanjing |
a |
1 |
2 |
c |
4 |
5 |
d |
7 |
8 |
frame.ix[:'c', 'Hangzhou']
a 0
c 3
Name: Hangzhou, dtype: int32
DataFrame也可以用condition selection
frame[frame.Hangzhou > 1]
|
Hangzhou |
Shenzhen |
Nanjing |
c |
3 |
4 |
5 |
d |
6 |
7 |
8 |
frame[frame < 5] = 0
print(frame)
Hangzhou Shenzhen Nanjing
a 0 0 0
c 0 0 5
d 6 7 8
reindex
把一个Series或者DataFrame按照新的index顺序进行重排
obj = pd.Series([4.5, 7.2, -5.3, 3.2], index=['d', 'b', 'a', 'c'])
print(obj)
d 4.5
b 7.2
a -5.3
c 3.2
dtype: float64
obj.reindex(['a', 'b', 'c', 'd', 'e'])
a -5.3
b 7.2
c 3.2
d 4.5
e NaN
dtype: float64
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value = 0)
a -5.3
b 7.2
c 3.2
d 4.5
e 0.0
dtype: float64
obj3 = pd.Series(['blue', 'purple', 'yellow'], index = [0,2,4])
print(obj3)
0 blue
2 purple
4 yellow
dtype: object
obj3.reindex(range(6), method='ffill')
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
obj3.reindex(range(6), method='bfill')
0 blue
1 purple
2 purple
3 yellow
4 yellow
5 NaN
dtype: object
既然我们可以对Series进行reindex,相应地,我们也可以用同样的方法对DataFrame进行reindex。
frame = pd.DataFrame(np.arange(9).reshape(3,3),
index = ['a', 'c', 'd'],
columns = ['Hangzhou', 'Shenzhen', 'Nanjing'])
print(frame)
Hangzhou Shenzhen Nanjing
a 0 1 2
c 3 4 5
d 6 7 8
frame.reindex(['a' , 'b', 'c', 'd'])
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
Hangzhou |
Shenzhen |
Nanjing |
a |
0.0 |
1.0 |
2.0 |
b |
NaN |
NaN |
NaN |
c |
3.0 |
4.0 |
5.0 |
d |
6.0 |
7.0 |
8.0 |
在reindex的同时,我们还可以重新指定columns
frame.reindex(columns = ['Shenzhen', 'Hangzhou', 'Chongqing'])
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
Shenzhen |
Hangzhou |
Chongqing |
a |
1 |
0 |
NaN |
c |
4 |
3 |
NaN |
d |
7 |
6 |
NaN |
frame.reindex(index =['a', 'b', 'c', 'd'],
columns = ['Shenzhen', 'Hangzhou', 'Chongqing'])
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
Shenzhen |
Hangzhou |
Chongqing |
a |
1.0 |
0.0 |
NaN |
b |
NaN |
NaN |
NaN |
c |
4.0 |
3.0 |
NaN |
d |
7.0 |
6.0 |
NaN |
下面介绍如何用drop来删除Series和DataFrame中的index
print(obj3)
print('*************')
obj4 = obj3.drop(2)
print(obj4)
0 blue
2 purple
4 yellow
dtype: object
*************
0 blue
4 yellow
dtype: object
obj3.drop([2, 4])
0 blue
dtype: object
删除DataFrame中的内容
frame
|
Hangzhou |
Shenzhen |
Nanjing |
a |
0 |
1 |
2 |
c |
3 |
4 |
5 |
d |
6 |
7 |
8 |
frame.drop(['a', 'c'])
|
Hangzhou |
Shenzhen |
Nanjing |
d |
6 |
7 |
8 |
drop不仅仅可以删除行,还可以删除列
frame.drop('Shenzhen', axis=1)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
Hangzhou |
Nanjing |
a |
0 |
2 |
c |
3 |
5 |
d |
6 |
8 |
frame.drop(['Shenzhen', 'Hangzhou'], axis=1)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
hierarchical index
Series的hierarchical indexing
data = pd.Series(np.random.randn(10),
index=[['a','a','a','b','b','c','c','c','d','d'], \
[1,2,3,1,2,1,2,3,1,2]])
data
a 1 -0.535940
2 0.190444
3 0.137448
b 1 2.442986
2 0.211423
c 1 1.439100
2 -1.690219
3 1.338686
d 1 1.274225
2 -0.238439
dtype: float64
data.index
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 2, 2, 2, 3, 3], [0, 1, 2, 0, 1, 0, 1, 2, 0, 1]])
print(data["b"])
1 2.442986
2 0.211423
dtype: float64
data['b':'d']
b 1 2.442986
2 0.211423
c 1 1.439100
2 -1.690219
3 1.338686
d 1 1.274225
2 -0.238439
dtype: float64
data[1:4]
a 2 0.190444
3 0.137448
b 1 2.442986
dtype: float64
unstack和stack可以帮助我们在hierarchical indexing和DataFrame之间进行切换。
data.unstack()
|
1 |
2 |
3 |
a |
-0.535940 |
0.190444 |
0.137448 |
b |
2.442986 |
0.211423 |
NaN |
c |
1.439100 |
-1.690219 |
1.338686 |
d |
1.274225 |
-0.238439 |
NaN |
type(data.unstack())
pandas.core.frame.DataFrame
data.unstack().stack()
a 1 -0.535940
2 0.190444
3 0.137448
b 1 2.442986
2 0.211423
c 1 1.439100
2 -1.690219
3 1.338686
d 1 1.274225
2 -0.238439
dtype: float64
DataFrame的hierarchical indexing
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
index = [['a','a','b','b'], [1,2,1,2]],
columns = [['Beijing', 'Beijing', 'Shanghai'], ['apts', 'cars', 'apts']])
print(frame)
Beijing Shanghai
apts cars apts
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.index.names = ['alpha', 'number']
frame.columns.names = ['city', 'type']
print(frame)
city Beijing Shanghai
type apts cars apts
alpha number
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.ix['a', 1]
city type
Beijing apts 0
cars 1
Shanghai apts 2
Name: (a, 1), dtype: int32
frame.ix['a', 2]['Beijing']['apts']
3
关于Merge, Join和Concatenate
concatenate
df1 = pd.DataFrame({'apts': [55000, 60000],
'cars': [200000, 300000],},
index = ['Shanghai', 'Beijing'])
df1
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
apts |
cars |
Shanghai |
55000 |
200000 |
Beijing |
60000 |
300000 |
df2 = pd.DataFrame({'apts': [25000, 20000],
'cars': [150000, 120000],},
index = ['Hangzhou', 'Najing'])
print(df2)
apts cars
Hangzhou 25000 150000
Najing 20000 120000
df3 = pd.DataFrame({'apts': [30000, 10000],
'cars': [180000, 100000],},
index = ['Guangzhou', 'Chongqing'])
print(df3)
apts cars
Guangzhou 30000 180000
Chongqing 10000 100000
result = pd.concat([df1, df2, df3])
result
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
apts |
cars |
Shanghai |
55000 |
200000 |
Beijing |
60000 |
300000 |
Hangzhou |
25000 |
150000 |
Najing |
20000 |
120000 |
Guangzhou |
30000 |
180000 |
Chongqing |
10000 |
100000 |
在concatenate的时候可以指定keys,这样可以给每一个部分加上一个Key。
以下的例子就构造了一个hierarchical index。
frames = [df1, df2, df3]
frames
[ apts cars
Shanghai 55000 200000
Beijing 60000 300000, apts cars
Hangzhou 25000 150000
Najing 20000 120000, apts cars
Guangzhou 30000 180000
Chongqing 10000 100000]
result2 = pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])
result2
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
|
apts |
cars |
x |
Shanghai |
55000 |
200000 |
Beijing |
60000 |
300000 |
y |
Hangzhou |
25000 |
150000 |
Najing |
20000 |
120000 |
z |
Guangzhou |
30000 |
180000 |
Chongqing |
10000 |
100000 |
result2.ix['y']
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
apts |
cars |
Hangzhou |
25000 |
150000 |
Najing |
20000 |
120000 |
result2.ix[3:6]
|
|
apts |
cars |
y |
Najing |
20000 |
120000 |
z |
Guangzhou |
30000 |
180000 |
Chongqing |
10000 |
100000 |
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
index = ['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
df4
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
salaries |
Suzhou |
10000 |
Beijing |
30000 |
Shanghai |
30000 |
Guangzhou |
20000 |
Tianjin |
15000 |
result3 = pd.concat([result, df4], axis=1)
result3
|
apts |
cars |
salaries |
Beijing |
60000.0 |
300000.0 |
30000.0 |
Chongqing |
10000.0 |
100000.0 |
NaN |
Guangzhou |
30000.0 |
180000.0 |
20000.0 |
Hangzhou |
25000.0 |
150000.0 |
NaN |
Najing |
20000.0 |
120000.0 |
NaN |
Shanghai |
55000.0 |
200000.0 |
30000.0 |
Suzhou |
NaN |
NaN |
10000.0 |
Tianjin |
NaN |
NaN |
15000.0 |
复习一下前面讲过的stack
result3.stack()
Beijing apts 60000.0
cars 300000.0
salaries 30000.0
Chongqing apts 10000.0
cars 100000.0
Guangzhou apts 30000.0
cars 180000.0
salaries 20000.0
Hangzhou apts 25000.0
cars 150000.0
Najing apts 20000.0
cars 120000.0
Shanghai apts 55000.0
cars 200000.0
salaries 30000.0
Suzhou salaries 10000.0
Tianjin salaries 15000.0
dtype: float64
用inner可以去掉NaN的数据,只留下匹配的数据
result3 = pd.concat([result, df4], axis=1, join='inner')
result3
|
apts |
cars |
salaries |
Beijing |
60000 |
300000 |
30000 |
Shanghai |
55000 |
200000 |
30000 |
Guangzhou |
30000 |
180000 |
20000 |
用append来做concatenation
df1
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
apts |
cars |
Shanghai |
55000 |
200000 |
Beijing |
60000 |
300000 |
df1.append(df2)
|
apts |
cars |
Shanghai |
55000 |
200000 |
Beijing |
60000 |
300000 |
Hangzhou |
25000 |
150000 |
Najing |
20000 |
120000 |
df1.append(df4)
|
apts |
cars |
salaries |
Shanghai |
55000.0 |
200000.0 |
NaN |
Beijing |
60000.0 |
300000.0 |
NaN |
Suzhou |
NaN |
NaN |
10000.0 |
Beijing |
NaN |
NaN |
30000.0 |
Shanghai |
NaN |
NaN |
30000.0 |
Guangzhou |
NaN |
NaN |
20000.0 |
Tianjin |
NaN |
NaN |
15000.0 |
Series和DataFrame还可以被一起concatenate,这时候Series会先被转成DataFrame然后做Join,因为Series本来就是一个只有一维的DataFrame对吧。
s1 = pd.Series([60, 50], index=['Shanghai', 'Beijing'], name='meal')
print(s1)
Shanghai 60
Beijing 50
Name: meal, dtype: int64
df1
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
apts |
cars |
Shanghai |
55000 |
200000 |
Beijing |
60000 |
300000 |
pd.concat([df1, s1], axis=1)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
apts |
cars |
meal |
Shanghai |
55000 |
200000 |
60 |
Beijing |
60000 |
300000 |
50 |
下面讲如何append一个row到DataFrame里。
s2 = pd.Series([18000, 120000], index=['apts', 'cars'], name='Xiamen')
s2
apts 18000
cars 120000
Name: Xiamen, dtype: int64
df1.append(s2)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
apts |
cars |
Shanghai |
55000 |
200000 |
Beijing |
60000 |
300000 |
Xiamen |
18000 |
120000 |
pd.concat([df1,s2],axis=0)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
apts |
cars |
0 |
Shanghai |
55000.0 |
200000.0 |
NaN |
Beijing |
60000.0 |
300000.0 |
NaN |
apts |
NaN |
NaN |
18000.0 |
cars |
NaN |
NaN |
120000.0 |
Merge(Join)
df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
'cars': [200000, 300000,250000],
'cities': ['Shanghai', 'Beijing','Shenzhen']})
df1
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
apts |
cars |
cities |
0 |
55000 |
200000 |
Shanghai |
1 |
60000 |
300000 |
Beijing |
2 |
58000 |
250000 |
Shenzhen |
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000],
'cities': ['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin']})
df4
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
cities |
salaries |
0 |
Suzhou |
10000 |
1 |
Beijing |
30000 |
2 |
Shanghai |
30000 |
3 |
Guangzhou |
20000 |
4 |
Tianjin |
15000 |
result = pd.merge(df1, df4, on='cities')
result
|
apts |
cars |
cities |
salaries |
0 |
55000 |
200000 |
Shanghai |
30000 |
1 |
60000 |
300000 |
Beijing |
30000 |
result2 = pd.merge(df1, df4, on='cities', how='outer')
result2
|
apts |
cars |
cities |
salaries |
0 |
55000.0 |
200000.0 |
Shanghai |
30000.0 |
1 |
60000.0 |
300000.0 |
Beijing |
30000.0 |
2 |
58000.0 |
250000.0 |
Shenzhen |
NaN |
3 |
NaN |
NaN |
Suzhou |
10000.0 |
4 |
NaN |
NaN |
Guangzhou |
20000.0 |
5 |
NaN |
NaN |
Tianjin |
15000.0 |
join on index
df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
'cars': [200000, 300000,250000]},
index=['Shanghai', 'Beijing','Shenzhen'])
df1
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
apts |
cars |
Shanghai |
55000 |
200000 |
Beijing |
60000 |
300000 |
Shenzhen |
58000 |
250000 |
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
index=['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
df4
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
salaries |
Suzhou |
10000 |
Beijing |
30000 |
Shanghai |
30000 |
Guangzhou |
20000 |
Tianjin |
15000 |
df1.join(df4)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
apts |
cars |
salaries |
Shanghai |
55000 |
200000 |
30000.0 |
Beijing |
60000 |
300000 |
30000.0 |
Shenzhen |
58000 |
250000 |
NaN |
df4.join(df1)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
salaries |
apts |
cars |
Suzhou |
10000 |
NaN |
NaN |
Beijing |
30000 |
60000.0 |
300000.0 |
Shanghai |
30000 |
55000.0 |
200000.0 |
Guangzhou |
20000 |
NaN |
NaN |
Tianjin |
15000 |
NaN |
NaN |
df1.join(df4, how='outer')
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
apts |
cars |
salaries |
Beijing |
60000.0 |
300000.0 |
30000.0 |
Guangzhou |
NaN |
NaN |
20000.0 |
Shanghai |
55000.0 |
200000.0 |
30000.0 |
Shenzhen |
58000.0 |
250000.0 |
NaN |
Suzhou |
NaN |
NaN |
10000.0 |
Tianjin |
NaN |
NaN |
15000.0 |
也可以用merge来写
pd.merge(df1, df4,left_index=True,right_index=True,how='outer')
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
|
apts |
cars |
salaries |
Beijing |
60000.0 |
300000.0 |
30000.0 |
Guangzhou |
NaN |
NaN |
20000.0 |
Shanghai |
55000.0 |
200000.0 |
30000.0 |
Shenzhen |
58000.0 |
250000.0 |
NaN |
Suzhou |
NaN |
NaN |
10000.0 |
Tianjin |
NaN |
NaN |
15000.0 |