The introductory study is almost finished.
1. The Pandas library
import pandas as pd
Two data types: Series and DataFrame
Series type: data + index
Custom index
b = pd.Series([9,8,7,6],index=['a','b','c','d'])
b
Out[3]:
a    9
b    8
c    7
d    6
dtype: int64
Creating from a scalar value
s = pd.Series(25,index=['a','b','c'])  # the index sets the length; without it the result has a single element
s
Out[7]:
a    25
b    25
c    25
dtype: int64
Creating from a dict
d = pd.Series({'a':9,'b':8,'c':7})
d
Out[9]:
a    9
b    8
c    7
dtype: int64
Creating from an ndarray
import numpy as np
n = pd.Series(np.arange(5))
n
Out[12]:
0    0
1    1
2    2
3    3
4    4
dtype: int32
Basic operations
b = pd.Series([9,8,7,6],['a','b','c','d'])
b
Out[14]:
a    9
b    8
c    7
d    6
dtype: int64
b.index
Out[15]: Index(['a', 'b', 'c', 'd'], dtype='object')
b.values
Out[17]: array([9, 8, 7, 6], dtype=int64)
b.get('d',100)
Out[18]: 6
Both a Series object and its index can have a name, stored in the .name attribute
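A small sketch of the .name attribute and of .get() with a default value. The names 'scores' and 'label' are made up for illustration and do not come from the notes.

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

# Both the Series and its index carry an optional name.
b.name = 'scores'        # hypothetical name, chosen for illustration
b.index.name = 'label'   # hypothetical name, chosen for illustration

# .get() returns the stored value if the label exists,
# and the supplied default otherwise.
present = b.get('d', 100)
missing = b.get('z', 100)

print(b)
print(present, missing)
```

Here 'd' exists, so the default is ignored; 'z' does not, so the default 100 comes back.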
DataFrame type: multiple columns of data sharing the same index
Creating from a 2-D ndarray
import pandas as pd
import numpy as np
d = pd.DataFrame(np.arange(10),reshape(2,5))  # typo: a comma instead of a dot before reshape
Traceback (most recent call last):
  File "", line 1, in
    d = pd.DataFrame(np.arange(10),reshape(2,5))
NameError: name 'reshape' is not defined
d = pd.DataFrame(np.arange(10).reshape(2,5))
d
Out[5]:
   0  1  2  3  4
0  0  1  2  3  4
1  5  6  7  8  9
Creating from a dict of 1-D objects (Series in this example)
dt = {'one':pd.Series([1,2,3],index=['a','b','c']),
      'two':pd.Series([9,8,7,6],index=['a','b','c','d'])}
d = pd.DataFrame(dt)
d
Out[11]:
   one  two
a  1.0    9
b  2.0    8
c  3.0    7
d  NaN    6
pd.DataFrame(dt,index=['b','c','d'],columns=['two','three'])
Out[13]:
   two three
b    8   NaN
c    7   NaN
d    6   NaN
Creating from a dict of lists
d1 = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(d1,index=['a','b','c','d'])
d
Out[16]:
   one  two
a    1    9
b    2    8
c    3    7
d    4    6
Operations on the data types
How do you change a Series or DataFrame object?
Adding or reordering: reindexing
.reindex()
import pandas as pd
# columns: 城市 = city, 环比 = month-on-month, 同比 = year-on-year, 定基 = fixed-base
d1 = {'城市':['北京','上海','广州','深圳','沈阳'],
      '环比':[101.5,101.2,101.3,102.0,100.1],
      '同比':[101.5,101.2,101.3,102.0,100.1],
      '定基':[101.5,101.2,101.3,102.0,100.1]}
d = pd.DataFrame(d1,index=[1,2,3,4,5])
d
Out[4]:
      同比  城市     定基     环比
1  101.5  北京  101.5  101.5
2  101.2  上海  101.2  101.2
3  101.3  广州  101.3  101.3
4  102.0  深圳  102.0  102.0
5  100.1  沈阳  100.1  100.1
d = d.reindex(index=[5,4,3,2,1])
d
Out[6]:
      同比  城市     定基     环比
5  100.1  沈阳  100.1  100.1
4  102.0  深圳  102.0  102.0
3  101.3  广州  101.3  101.3
2  101.2  上海  101.2  101.2
1  101.5  北京  101.5  101.5
d = d.reindex(columns=['城市','同比','环比','定基'])
d
Out[8]:
   城市     同比     环比     定基
5  沈阳  100.1  100.1  100.1
4  深圳  102.0  102.0  102.0
3  广州  101.3  101.3  101.3
2  上海  101.2  101.2  101.2
1  北京  101.5  101.5  101.5
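Other parameters:
fill_value: the value used to fill missing positions created by reindexing
method: fill method; ffill propagates the previous value forward, bfill fills backward from the next value
limit: maximum number of consecutive positions to fill
copy: default True, which creates a new object; with False, no copy is made when the new and old indexes are equal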
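The parameters above can be tried on a small Series. This is only a sketch on made-up data; note that method= needs the existing index to be sorted, which is the case here.

```python
import pandas as pd

s = pd.Series([9, 8, 7], index=[1, 2, 3])

# fill_value: new labels receive a constant instead of NaN.
filled_const = s.reindex([1, 2, 3, 4], fill_value=0)

# method='ffill': each new label takes the value of the previous label.
# This requires a monotonic (sorted) existing index, which s has.
filled_fwd = s.reindex([1, 2, 3, 4, 5], method='ffill')

# limit caps how many consecutive new labels get filled.
filled_lim = s.reindex([1, 2, 3, 4, 5], method='ffill', limit=1)

print(filled_const)
print(filled_fwd)
print(filled_lim)
```

With limit=1 only label 4 is filled from label 3; label 5 stays NaN.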
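Common methods of index types:
.append(idx): concatenates another Index object, producing a new Index object
.difference(idx): computes the set difference, producing a new Index object
.intersection(idx): computes the intersection
.union(idx): computes the union
.delete(loc): removes the element at position loc
.insert(loc,e): inserts element e at position loc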
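A quick sketch of these Index methods on two made-up indexes. Every call returns a new Index; the original is left unchanged because Index objects are immutable.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

joined = left.append(right)          # concatenation, duplicates are kept
only_left = left.difference(right)   # labels in left but not in right
common = left.intersection(right)    # labels present in both
either = left.union(right)           # labels present in at least one

shorter = left.delete(0)             # remove the element at position 0
longer = left.insert(1, 'x')         # put 'x' at position 1

for name, idx in [('append', joined), ('difference', only_left),
                  ('intersection', common), ('union', either),
                  ('delete', shorter), ('insert', longer)]:
    print(name, list(idx))
```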
nc = d.columns.delete(2)
ni = d.index.insert(5,6)
nd = d.reindex(index=ni,columns=nc,method='ffill')
Traceback (most recent call last):
  ...
  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2794, in _reindex_columns
    tolerance=tolerance)
  ...
  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3394, in _searchsorted_monotonic
    raise ValueError('index must be monotonic increasing or decreasing')
ValueError: index must be monotonic increasing or decreasing
ni = d.index.insert(5,0)
nd = d.reindex(index=ni,columns=nc,method='ffill')
(same traceback as above)
ValueError: index must be monotonic increasing or decreasing
nd = d.reindex(index=ni,columns=nc).ffill()
nd
Out[15]:
   城市     同比     定基
5  沈阳  100.1  100.1
4  深圳  102.0  102.0
3  广州  101.3  101.3
2  上海  101.2  101.2
1  北京  101.5  101.5
0  北京  101.5  101.5
reindex(method='ffill') requires the existing labels on each reindexed axis to be monotonic. Judging by the traceback, which passes through _reindex_columns, it is the column labels (not in sorted order) that trigger the ValueError, so changing the inserted row label from 6 to 0 makes no difference.
Workaround, as in the last command above: reindex without method and then call .ffill() on the result.
Deleting: drop
a = pd.Series([9,8,7,6],index=['a','b','c','d'])
a
Out[17]:
a    9
b    8
c    7
d    6
dtype: int64
a.drop(['b','c'])
Out[18]:
a    9
d    6
dtype: int64
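The session above drops labels from a Series. On a DataFrame, .drop() works the same way on rows and takes axis=1 for columns. A minimal sketch on made-up data:

```python
import pandas as pd

d = pd.DataFrame({'one': [1, 2, 3], 'two': [9, 8, 7]}, index=['a', 'b', 'c'])

rows_dropped = d.drop(['b'])           # axis=0 by default: removes row 'b'
cols_dropped = d.drop('two', axis=1)   # axis=1: removes column 'two'

print(rows_dropped)
print(cols_dropped)
print(d.shape)   # the original is unchanged, drop returns a new object
```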
Arithmetic on pandas data types:
import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12),reshape(3,4))
Traceback (most recent call last):
  File "", line 1, in
    a = pd.DataFrame(np.arange(12),reshape(3,4))
NameError: name 'reshape' is not defined
a = pd.DataFrame(np.arange(12).reshape(3,4))
a
Out[23]:
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
b = pd.DataFrame(np.arange(20).reshape(4,5))
b
Out[25]:
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
a+b
Out[26]:
      0     1     2     3   4
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
b.add(a,fill_value = 0)
Out[27]:
      0     1     2     3     4
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
a.mul(b,fill_value = 0)
Out[28]:
      0     1      2      3    4
0   0.0   1.0    4.0    9.0  0.0
1  20.0  30.0   42.0   56.0  0.0
2  80.0  99.0  120.0  143.0  0.0
3   0.0   0.0    0.0    0.0  0.0
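Operations between objects of different dimensions are broadcast. By default a Series is aligned with the DataFrame's columns and applied to every row; pass axis=0 to align it with the row index instead: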
b = pd.DataFrame(np.arange(20).reshape(4,5))
b
Out[31]:
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
c = pd.Series(np.arange(4))
c
Out[33]:
0    0
1    1
2    2
3    3
dtype: int32
c-10
Out[34]:
0   -10
1    -9
2    -8
3    -7
dtype: int32
b-c
Out[35]:
      0     1     2     3   4
0   0.0   0.0   0.0   0.0 NaN
1   5.0   5.0   5.0   5.0 NaN
2  10.0  10.0  10.0  10.0 NaN
3  15.0  15.0  15.0  15.0 NaN
b.sub(c,axis=0)
Out[36]:
    0   1   2   3   4
0   0   1   2   3   4
1   4   5   6   7   8
2   8   9  10  11  12
3  12  13  14  15  16
Sorting:
The .sort_index() method sorts by index labels along the given axis, in ascending order by default.
.sort_index(axis=0,ascending=True)
import pandas as pd
import numpy as np
b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])
b
Out[4]:
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19
b.sort_index()
Out[5]:
    0   1   2   3   4
a   5   6   7   8   9
b  15  16  17  18  19
c   0   1   2   3   4
d  10  11  12  13  14
b.sort_index(ascending=False)
Out[6]:
    0   1   2   3   4
d  10  11  12  13  14
c   0   1   2   3   4
b  15  16  17  18  19
a   5   6   7   8   9
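The .sort_values() method sorts by values along the given axis, in ascending order by default
Series.sort_values(axis=0,ascending=True)
DataFrame.sort_values(by,axis=0,ascending=True)
by: a label or list of labels to sort by (column labels when axis=0, row labels when axis=1)
NaN values are placed at the end of the result by default, whatever the sort direction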
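The notes give no session for .sort_values(), so here is a small sketch. It reuses the same DataFrame b as the .sort_index() example; the Series with a NaN is made up for illustration.

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'a', 'd', 'b'])

# Sort rows by the values in column 2, largest first.
by_col = b.sort_values(2, ascending=False)

# Sort columns by the values in row 'a' (axis=1), largest first.
by_row = b.sort_values('a', axis=1, ascending=False)

# NaN goes last in both directions (default na_position='last').
s = pd.Series([3.0, np.nan, 1.0])
asc = s.sort_values()
desc = s.sort_values(ascending=False)

print(by_col)
print(by_row)
print(asc)
print(desc)
```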
Basic statistical analysis:
.describe()
a = pd.Series([9,8,7,6])
a
Out[8]:
0    9
1    8
2    7
3    6
dtype: int64
a.describe()
Out[9]:
count    4.000000
mean     7.500000
std      1.290994
min      6.000000
25%      6.750000
50%      7.500000
75%      8.250000
max      9.000000
dtype: float64
a.describe()['count']
Out[10]: 4.0
b.describe()
Out[11]:
               0          1          2          3          4
count   4.000000   4.000000   4.000000   4.000000   4.000000
mean    7.500000   8.500000   9.500000  10.500000  11.500000
std     6.454972   6.454972   6.454972   6.454972   6.454972
min     0.000000   1.000000   2.000000   3.000000   4.000000
25%     3.750000   4.750000   5.750000   6.750000   7.750000
50%     7.500000   8.500000   9.500000  10.500000  11.500000
75%    11.250000  12.250000  13.250000  14.250000  15.250000
max    15.000000  16.000000  17.000000  18.000000  19.000000
b.describe()[2]
Out[12]:
count     4.000000
mean      9.500000
std       6.454972
min       2.000000
25%       5.750000
50%       9.500000
75%      13.250000
max      17.000000
Name: 2, dtype: float64
Cumulative statistics:
.cumsum(): running sum of the first 1, 2, ..., n values
.cumprod(): running product
.cummax(): running maximum
.cummin(): running minimum
b.cumsum()
Out[13]:
    0   1   2   3   4
c   0   1   2   3   4
a   5   7   9  11  13
d  15  18  21  24  27
b  30  34  38  42  46
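The session only shows .cumsum(). The other three cumulative methods behave the same way, as this sketch on a small made-up Series suggests:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 2])

running_sum = s.cumsum()    # 3, 4, 8, 10
running_prod = s.cumprod()  # 3, 3, 12, 24
running_max = s.cummax()    # 3, 3, 4, 4
running_min = s.cummin()    # 3, 1, 1, 1

print(pd.DataFrame({'cumsum': running_sum, 'cumprod': running_prod,
                    'cummax': running_max, 'cummin': running_min}))
```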
Rolling computations
.rolling(w).sum(): sum of each run of w adjacent elements
.rolling(w).mean(): arithmetic mean
.rolling(w).var(): variance
.rolling(w).std(): standard deviation
.rolling(w).min() .max(): minimum, maximum
b.rolling(2).sum()
Out[14]:
      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   5.0   7.0   9.0  11.0  13.0
d  15.0  17.0  19.0  21.0  23.0
b  25.0  27.0  29.0  31.0  33.0
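The first row above is NaN because a window of 2 is not yet full there. A sketch on a made-up Series shows the same thing for several aggregations. The min_periods argument is not covered in these notes; it is added here only to show how that leading NaN can be avoided.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])
win = s.rolling(2)

sums = win.sum()     # NaN, 3, 6, 12
means = win.mean()   # NaN, 1.5, 3, 6
maxes = win.max()    # NaN, 2, 4, 8

# min_periods=1 lets an incomplete first window still produce a value.
sums_partial = s.rolling(2, min_periods=1).sum()   # 1, 3, 6, 12

print(pd.DataFrame({'sum': sums, 'mean': means, 'max': maxes,
                    'sum_min_periods_1': sums_partial}))
```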
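Correlation analysis
.cov(): computes the covariance matrix
.corr(): computes the correlation matrix; the Pearson, Spearman, or Kendall coefficient can be selected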
hprice = pd.Series([3.04,22.93,12.75,22.6,12.33],index=['2008','2009','2010','2011','2012'])
m2 = pd.Series([8.18,18.38,9.13,7.82,6.69],index=['2008','2009','2010','2011','2012'])
hprice.corr(m2)
Out[18]: 0.5239439145220387
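As a follow-up sketch with the same two series: .cov() on a pair of Series returns a single number, while on a DataFrame both .cov() and .corr() return full matrices. The Spearman variant is requested through the method argument; the DataFrame form is used for it here.

```python
import pandas as pd

years = ['2008', '2009', '2010', '2011', '2012']
hprice = pd.Series([3.04, 22.93, 12.75, 22.6, 12.33], index=years)
m2 = pd.Series([8.18, 18.38, 9.13, 7.82, 6.69], index=years)

pearson = hprice.corr(m2)      # Pearson is the default method
covariance = hprice.cov(m2)    # sample covariance of the two series

# On a DataFrame, .corr() and .cov() return matrices over all column pairs.
df = pd.DataFrame({'hprice': hprice, 'm2': m2})
corr_matrix = df.corr()
cov_matrix = df.cov()
spearman_matrix = df.corr(method='spearman')   # rank-based coefficient

print(pearson, covariance)
print(corr_matrix)
print(spearman_matrix)
```

Dividing the covariance by the product of the two standard deviations gives back the Pearson coefficient, which is a handy consistency check.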