Once a rolling, expanding, or ewm object has been created, several methods are available for aggregating the data.
>>>import pandas as pd
>>>import numpy as np
>>>df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2019', periods=10),
columns = ['A', 'B', 'C', 'D'])
>>>df
A B C D
2019-01-01 -0.901602 -1.778484 0.728295 -0.758108
2019-01-02 -0.826162 0.994140 0.976164 -0.918249
2019-01-03 0.260841 0.905993 1.505967 -0.124883
2019-01-04 -0.112230 -0.111885 0.702712 -0.871768
2019-01-05 -0.239969 1.435918 -0.160140 -0.547702
2019-01-06 -0.126897 -2.628206 -0.280658 0.167422
2019-01-07 0.367903 0.994337 -0.529830 0.195990
2019-01-08 -0.530872 -0.384915 -0.397150 -0.024074
2019-01-09 -0.418925 0.049046 -0.816616 0.308107
2019-01-10 -0.176857 2.573145 0.010211 -1.427078
>>>r = df.rolling(window=3,min_periods=1)
>>>r
Rolling [window=3,min_periods=1,center=False,axis=0]
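As a small illustration of the three object types mentioned above, the following sketch builds a rolling, an expanding, and an ewm object on a fixed Series and calls `mean()` on each. The data and variable names are made up for this example, and `span=2, adjust=False` is just one possible ewm configuration.

```python
import pandas as pd

# Hypothetical data, chosen so the results are easy to check by hand.
s = pd.Series([1.0, 2.0, 3.0, 4.0])

rolling_mean = s.rolling(window=2, min_periods=1).mean()  # mean of the last 2 values
expanding_mean = s.expanding(min_periods=1).mean()        # mean of all values so far
ewm_mean = s.ewm(span=2, adjust=False).mean()             # exponentially weighted mean

print(rolling_mean.tolist())    # [1.0, 1.5, 2.5, 3.5]
print(expanding_mean.tolist())  # [1.0, 1.5, 2.0, 2.5]
print(ewm_mean.tolist())
```

With `min_periods=1`, the first window already produces a value even though it holds fewer than `window` observations.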
An aggregation can be applied by passing a function to the whole DataFrame, or to a column selected with the standard getitem syntax.
>>>import pandas as pd
>>>import numpy as np
>>>df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2020', periods=10),
columns = ['A', 'B', 'C', 'D'])
>>>df
A B C D
2020-01-01 1.069090 -0.802365 -0.323818 -1.994676
2020-01-02 0.190584 0.328272 -0.550378 0.559738
2020-01-03 0.044865 0.478342 -0.976129 0.106530
2020-01-04 -1.349188 -0.391635 -0.292740 1.412755
2020-01-05 0.057659 -1.331901 -0.297858 -0.500705
2020-01-06 2.651680 -1.459706 -0.726023 0.294283
2020-01-07 0.666481 0.679205 -1.511743 2.093833
2020-01-08 -0.284316 -1.079759 1.433632 0.534043
2020-01-09 1.115246 -0.268812 0.190440 -0.712032
2020-01-10 -0.121008 0.136952 1.279354 0.275773
>>>r = df.rolling(window=3,min_periods=1)
>>>r.aggregate(np.sum)
A B C D
2020-01-01 1.069090 -0.802365 -0.323818 -1.994676
2020-01-02 1.259674 -0.474093 -0.874197 -1.434938
2020-01-03 1.304539 0.004249 -1.850326 -1.328409
2020-01-04 -1.113739 0.414979 -1.819248 2.079023
2020-01-05 -1.246664 -1.245194 -1.566728 1.018580
2020-01-06 1.360151 -3.183242 -1.316621 1.206333
2020-01-07 3.375821 -2.112402 -2.535624 1.887411
2020-01-08 3.033846 -1.860260 -0.804134 2.922160
2020-01-09 1.497411 -0.669366 0.112329 1.915845
2020-01-10 0.709922 -1.211619 2.903427 0.097785
Apply the aggregation to a single column of the DataFrame:
>>>import pandas as pd
>>>import numpy as np
>>>df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2000', periods=10),
columns = ['A', 'B', 'C', 'D'])
>>>df
A B C D
2000-01-01 -1.095530 -0.415257 -0.446871 -1.267795
2000-01-02 -0.405793 -0.002723 0.040241 -0.131678
2000-01-03 -0.136526 0.742393 -0.692582 -0.271176
2000-01-04 0.318300 -0.592146 -0.754830 0.239841
2000-01-05 -0.125770 0.849980 0.685083 0.752720
2000-01-06 1.410294 0.054780 0.297992 -0.034028
2000-01-07 0.463223 -1.239204 -0.056420 0.440893
2000-01-08 -2.244446 -0.516937 -2.039601 -0.680606
2000-01-09 0.991139 0.026987 -2.391856 0.585565
2000-01-10 0.112228 -0.701284 -1.139827 1.484032
>>>r = df.rolling(window=3,min_periods=1)
>>>r['A'].aggregate(np.sum)
2000-01-01 -1.095530
2000-01-02 -1.501323
2000-01-03 -1.637848
2000-01-04 -0.224018
2000-01-05 0.056004
2000-01-06 1.602824
2000-01-07 1.747747
2000-01-08 -0.370928
2000-01-09 -0.790084
2000-01-10 -1.141079
Freq: D, Name: A, dtype: float64
Apply the aggregation to multiple columns of the DataFrame:
>>>import pandas as pd
>>>import numpy as np
>>>df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2018', periods=10),
columns = ['A', 'B', 'C', 'D'])
>>>df
A B C D
2018-01-01 0.518897 0.988917 0.435691 -1.005703
2018-01-02 1.793400 0.130314 2.313787 0.870057
2018-01-03 -0.297601 0.504137 -0.951311 -0.146720
2018-01-04 0.282177 0.142360 -0.059013 0.633174
2018-01-05 2.095398 -0.153359 0.431514 -1.185657
2018-01-06 0.134847 0.188138 0.828329 -1.035120
2018-01-07 0.780541 0.138942 -1.001229 0.714896
2018-01-08 0.579742 -0.642858 0.835013 -1.504110
2018-01-09 -1.692986 -0.861327 -1.125359 0.006687
2018-01-10 -0.263689 1.182349 -0.916569 0.617476
>>>r = df.rolling(window=3,min_periods=1)
>>>r[['A','B']].aggregate(np.sum)
A B
2018-01-01 0.518897 0.988917
2018-01-02 2.312297 1.119232
2018-01-03 2.014697 1.623369
2018-01-04 1.777976 0.776811
2018-01-05 2.079975 0.493138
2018-01-06 2.512422 0.177140
2018-01-07 3.010786 0.173722
2018-01-08 1.495130 -0.315777
2018-01-09 -0.332703 -1.365242
2018-01-10 -1.376932 -0.321836
Apply multiple functions to a single column of the DataFrame:
>>>import pandas as pd
>>>import numpy as np
>>>df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('2019/01/01', periods=10),
columns = ['A', 'B', 'C', 'D'])
>>>df
A B C D
2019-01-01 1.022641 -1.431910 0.780941 -0.029811
2019-01-02 -0.302858 0.009886 -0.359331 -0.417708
2019-01-03 -1.396564 0.944374 -0.238989 -1.873611
2019-01-04 0.396995 -1.152009 -0.560552 -0.144212
2019-01-05 -2.513289 -1.085277 -1.016419 -1.586994
2019-01-06 -0.513179 0.823411 0.670734 1.196546
2019-01-07 -0.363239 -0.991799 0.587564 -1.100096
2019-01-08 1.474317 1.265496 -0.216486 -0.224218
2019-01-09 2.235798 -1.381457 -0.950745 -0.209564
2019-01-10 -0.061891 -0.025342 0.494245 -0.081681
>>>r = df.rolling(window=3,min_periods=1)
>>>r['A'].aggregate([np.sum,np.mean])
sum mean
2019-01-01 1.022641 1.022641
2019-01-02 0.719784 0.359892
2019-01-03 -0.676780 -0.225593
2019-01-04 -1.302427 -0.434142
2019-01-05 -3.512859 -1.170953
2019-01-06 -2.629473 -0.876491
2019-01-07 -3.389707 -1.129902
2019-01-08 0.597899 0.199300
2019-01-09 3.346876 1.115625
2019-01-10 3.648224 1.216075
Apply multiple functions to multiple columns of the DataFrame:
>>>import pandas as pd
>>>import numpy as np
>>>df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('2020/01/01', periods=10),
columns = ['A', 'B', 'C', 'D'])
>>>df
A B C D
2020-01-01 1.053702 0.355985 0.746638 -0.233968
2020-01-02 0.578520 -1.171843 -1.764249 -0.709913
2020-01-03 -0.491185 0.975212 0.200139 -3.372621
2020-01-04 -1.331328 0.776316 0.216623 0.202313
2020-01-05 -1.023147 -0.913686 1.457512 0.999232
2020-01-06 0.995328 -0.979826 -1.063695 0.057925
2020-01-07 0.576668 1.065767 -0.270744 -0.513707
2020-01-08 0.520258 0.969043 -0.119177 -0.125620
2020-01-09 -0.316480 0.549085 1.862249 1.091265
2020-01-10 0.461321 -0.368662 -0.988323 0.543011
>>>r = df.rolling(window=3,min_periods=1)
>>>r[['A','B']].aggregate([np.sum,np.mean])
A B
sum mean sum mean
2020-01-01 1.053702 1.053702 0.355985 0.355985
2020-01-02 1.632221 0.816111 -0.815858 -0.407929
2020-01-03 1.141037 0.380346 0.159354 0.053118
2020-01-04 -1.243993 -0.414664 0.579686 0.193229
2020-01-05 -2.845659 -0.948553 0.837843 0.279281
2020-01-06 -1.359146 -0.453049 -1.117195 -0.372398
2020-01-07 0.548849 0.182950 -0.827744 -0.275915
2020-01-08 2.092254 0.697418 1.054985 0.351662
2020-01-09 0.780445 0.260148 2.583896 0.861299
2020-01-10 0.665099 0.221700 1.149466 0.383155
Apply different functions to different columns of the DataFrame:
>>>import pandas as pd
>>>import numpy as np
>>>df = pd.DataFrame(np.random.randn(3, 4),
index = pd.date_range('2020/01/01', periods=3),
columns = ['A', 'B', 'C', 'D'])
>>>df
A B C D
2020-01-01 -0.246302 -0.057202 0.923807 -1.019698
2020-01-02 0.285287 1.467206 -0.368735 -0.397260
2020-01-03 -0.163219 -0.401368 1.254569 0.580188
>>>r = df.rolling(window=3,min_periods=1)
>>>r.aggregate({'A' : np.sum,'B' : np.mean})
A B
2020-01-01 -0.246302 -0.057202
2020-01-02 0.038985 0.705002
2020-01-03 -0.124234 0.336212
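The aggregation patterns above can also be written with string function names such as `"sum"` and `"mean"` instead of NumPy callables. Some recent pandas releases may warn when `np.sum` or `np.mean` is passed, so the string form is a reasonable alternative. The sketch below uses fixed, made-up values rather than random numbers so that the results can be checked by hand.

```python
import pandas as pd

# Hypothetical frame with fixed values instead of random numbers.
df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2020-01-01", periods=4),
)

r = df.rolling(window=3, min_periods=1)

# A dict maps each column to its own aggregation function.
result = r.aggregate({"A": "sum", "B": "mean"})
print(result["A"].tolist())  # [1.0, 3.0, 6.0, 9.0]
print(result["B"].tolist())  # [10.0, 15.0, 20.0, 30.0]

# A list of functions on one column gives one result column per function.
stats = r["A"].aggregate(["sum", "mean"])
print(list(stats.columns))   # ['sum', 'mean']
```

The last row of column `A` is 9.0 rather than 10.0 because the window of size 3 only covers the values 2, 3, and 4.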
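Missing data is always a problem in real-world settings. In fields such as machine learning and data mining, poor data quality caused by missing values seriously undermines the accuracy of model predictions. In these fields, handling missing values is a key part of making models more accurate and effective.
To make missing values easier to detect (across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also available as methods on Series and DataFrame objects.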
>>>import pandas as pd
>>>import numpy as np
>>>df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])
>>>df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
>>>df['one'].isnull()
a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool
>>>df['one'].notnull()
a True
b False
c True
d False
e True
f True
g False
h True
Name: one, dtype: bool
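Because `isnull()` and `notnull()` return boolean objects, they can be combined with `sum()`, `any()`, and `all()` to count or select missing entries. The following is a minimal sketch on a made-up frame; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, np.nan, 6.0]},
    index=["a", "b", "c"],
)

missing_per_column = df.isnull().sum()            # number of NaN values per column
rows_with_missing = df[df.isnull().any(axis=1)]   # rows with at least one NaN
complete_rows = df[df.notnull().all(axis=1)]      # rows without any NaN

print(int(missing_per_column["one"]), int(missing_per_column["two"]))  # 1 2
print(list(rows_with_missing.index))  # ['a', 'b']
print(list(complete_rows.index))      # ['c']
```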
- When summing data, NA values are treated as 0
- If the data are all NA, the result is 0 in pandas 0.22 and later (earlier versions returned NA); pass min_count=1 to sum() to get NA instead
>>>import pandas as pd
>>>import numpy as np
>>>df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])
>>>df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
>>>df['one'].sum()
-2.6163354325445014
>>>df = pd.DataFrame(index=[0,1,2,3,4,5],columns=['one','two'])
>>>df['one'].sum()
0
>>>df['one'].sum(min_count=1)
nan
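How `sum()` treats missing values depends on the pandas version and on the `skipna` and `min_count` arguments. The sketch below shows the behavior in pandas 0.22 and later, where the sum of all-NA data is 0 unless `min_count` requires at least one valid value. The two Series are made-up examples.

```python
import numpy as np
import pandas as pd

partly_missing = pd.Series([1.0, np.nan, 2.0])
all_missing = pd.Series([np.nan, np.nan])

print(partly_missing.sum())              # 3.0, the NaN is skipped
print(partly_missing.sum(skipna=False))  # nan, the NaN propagates
print(all_missing.sum())                 # 0.0 in pandas 0.22 and later
print(all_missing.sum(min_count=1))      # nan, at least one valid value is required
```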
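Pandas provides various methods for cleaning missing values. The fillna() function can "fill" NA values with non-null data in several ways.
1. Replace NaN with a scalar value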
>>>import pandas as pd
>>>import numpy as np
>>>df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one','two', 'three'])
>>>df = df.reindex(['a', 'b', 'c'])
>>>df
one two three
a -0.479425 -1.711840 -1.453384
b NaN NaN NaN
c -0.733606 -0.813315 0.476788
>>>df.fillna(0)
one two three
a -0.479425 -1.711840 -1.453384
b 0.000000 0.000000 0.000000
c -0.733606 -0.813315 0.476788
Here the missing values are filled with zero; any other value could be used instead.
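Besides a single scalar, `fillna()` also accepts a dict or Series that supplies a separate fill value for each column. The following sketch, on a made-up frame, fills each column with its own constant and then with its own mean.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
df = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [np.nan, 5.0, 7.0]})

by_column = df.fillna({"one": 0.0, "two": -1.0})  # a different scalar per column
by_mean = df.fillna(df.mean())                    # each column's own mean

print(by_column["one"].tolist(), by_column["two"].tolist())  # [1.0, 0.0, 3.0] [-1.0, 5.0, 7.0]
print(by_mean["one"].tolist(), by_mean["two"].tolist())      # [1.0, 2.0, 3.0] [6.0, 5.0, 7.0]
```

`fillna()` returns a new object here, so `df` itself still contains its NaN values.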
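2. Fill NA forward and backward
Using the fill concepts discussed in the reindexing chapter, we can fill in the missing values.
Method | Action |
---|---|
pad/ffill | Fill values forward |
bfill/backfill | Fill values backward |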
>>>import pandas as pd
>>>import numpy as np
>>>df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])
>>>df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
>>>df.fillna(method='pad')
one two three
a 0.614938 -0.452498 -2.113057
b 0.614938 -0.452498 -2.113057
c -0.118390 1.333962 -0.037907
d -0.118390 1.333962 -0.037907
e 0.699733 0.502142 -0.243700
f 0.544225 -0.923116 -1.123218
g 0.544225 -0.923116 -1.123218
h -0.669783 1.187865 1.112835
>>>df.fillna(method='backfill')
one two three
a 0.614938 -0.452498 -2.113057
b -0.118390 1.333962 -0.037907
c -0.118390 1.333962 -0.037907
d 0.699733 0.502142 -0.243700
e 0.699733 0.502142 -0.243700
f 0.544225 -0.923116 -1.123218
g -0.669783 1.187865 1.112835
h -0.669783 1.187865 1.112835
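Forward and backward filling are also available as the dedicated methods `ffill()` and `bfill()`, which recent pandas releases recommend over the `method` argument of `fillna()`. The sketch below uses a made-up Series and also shows the `limit` argument, which caps how many consecutive NaN values are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with gaps in the middle and at the end.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan], index=list("abcde"))

forward = s.ffill()         # carry the last valid value forward
backward = s.bfill()        # pull the next valid value backward
limited = s.ffill(limit=1)  # fill at most one consecutive NaN

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0, nan]
print(limited.tolist())   # [1.0, 1.0, nan, 4.0, 4.0]
```

The trailing value stays NaN under `bfill()` because no later valid value exists to copy from.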
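3. Drop missing values
To simply exclude missing values, use the dropna function together with the axis argument. By default axis=0, meaning it operates on rows: if any value in a row is NA, the whole row is dropped.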
>>>import pandas as pd
>>>import numpy as np
>>>df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])
>>>df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
>>>df.dropna()
one two three
a -0.719623 0.028103 -1.093178
c 0.040312 1.729596 0.451805
e -1.029418 1.920933 1.289485
f 1.217967 1.368064 0.527406
h 0.667855 0.147989 -1.035978
>>>df.dropna(axis=1)
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
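`dropna()` takes further arguments that control how strict the filter is. The following sketch on a made-up frame compares the default behavior with `how="all"`, `thresh`, and `subset`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row a is complete, row b has one NaN, row c is all NaN.
df = pd.DataFrame(
    {
        "one": [1.0, np.nan, np.nan],
        "two": [2.0, 5.0, np.nan],
        "three": [3.0, 6.0, np.nan],
    },
    index=["a", "b", "c"],
)

any_na = df.dropna()                 # drop rows containing any NaN
all_na = df.dropna(how="all")        # drop only rows that are entirely NaN
at_least_two = df.dropna(thresh=2)   # keep rows with at least 2 non-NaN values
by_subset = df.dropna(subset=["one"])  # only look at column "one"

print(list(any_na.index))        # ['a']
print(list(all_na.index))        # ['a', 'b']
print(list(at_least_two.index))  # ['a', 'b']
print(list(by_subset.index))     # ['a']
```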
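4. Replace missing (or generic) values
Often a generic placeholder value has to be replaced with a specific one. This can be done with the replace method.
Replacing NA with a scalar value is equivalent to what the fillna() function does.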
>>>import pandas as pd
>>>import numpy as np
>>>df = pd.DataFrame({'one':[10,20,30,40,50,2000],'two':[1000,0,30,40,50,60]})
>>>df.replace({1000:10,2000:60})
one two
0 10 10
1 20 0
2 30 30
3 40 40
4 50 50
5 60 60
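A common use of `replace()` is turning a sentinel value into NaN so that the missing-data tools can handle it. In the sketch below, the sentinel `-999` and the frame are assumptions made for illustration. It also shows that replacing the sentinel with NaN and then calling `fillna(0)` gives the same values as replacing it with 0 directly.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which -999 marks a missing measurement.
df = pd.DataFrame({"one": [10, -999, 30], "two": [-999, 20, 40]})

cleaned = df.replace(-999, np.nan)  # sentinel becomes NaN (columns turn into floats)
filled = cleaned.fillna(0)          # NaN filled with a scalar
direct = df.replace(-999, 0)        # same values in a single step

print(int(cleaned.isnull().sum().sum()))  # 2
print(filled["one"].tolist())             # [10.0, 0.0, 30.0]
print(direct["two"].tolist())             # [0, 20, 40]
```

The two-step route changes the column dtype to float because NaN is a floating-point value, while the direct replacement keeps the integer dtype.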