Note: this article is compiled from the book "Python for Data Analysis".
(1) The sum method
In [198]: df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
   .....:                index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

In [199]: df
Out[199]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
Use the sum method to compute the total of each column:
In [200]: df.sum()
Out[200]:
one    9.25
two   -5.80
Passing the axis argument to sum changes the axis along which the statistic is computed; axis=1 sums across each row:
In [201]: df.sum(axis=1)
Out[201]:
a    1.40
b    2.60
c     NaN
d   -0.55
NaN values are skipped by default; pass skipna=False to keep them, so that any NaN in a row or column makes the result NaN:
In [202]: df.mean(axis=1, skipna=False)
Out[202]:
a      NaN
b    1.300
c      NaN
d   -0.275
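The calls above can be reproduced as a standalone script. This is a minimal sketch that assumes a recent pandas release imported as pd; the variable names are only illustrative. One assumption worth flagging: recent pandas versions return 0.0 rather than NaN for the sum of an all-NaN row such as row c, and the min_count argument is one way to get NaN back.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

col_totals = df.sum()                          # one total per column, NaN skipped
row_totals = df.sum(axis=1)                    # one total per row, NaN skipped
strict_means = df.mean(axis=1, skipna=False)   # any NaN in a row gives NaN

# min_count=1 requires at least one non-NaN value, otherwise the result is NaN.
row_totals_nan = df.sum(axis=1, min_count=1)

print(col_totals)
print(row_totals_nan)
print(strict_means)
```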
(2) Other methods, such as idxmax and idxmin, return the index label at which the maximum or minimum value occurs
In [203]: df.idxmax()
Out[203]:
one    b
two    d
Other methods are accumulations, such as cumsum:
In [204]: df.cumsum()
Out[204]:
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8
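As a rough illustration of the difference between the two kinds of method, the sketch below rebuilds the same df (assuming pandas is imported as pd) and shows that idxmax and idxmin return labels while cumsum returns an object of the same shape as its input.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

max_labels = df.idxmax()   # index label of the largest value in each column
min_labels = df.idxmin()   # index label of the smallest value in each column
running = df.cumsum()      # running totals; NaN positions stay NaN

print(max_labels)
print(min_labels)
print(running)
```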
The describe method produces several summary statistics in one call (it reminds me of summary in R).
For a numeric DataFrame:
In [205]: df.describe()
Out[205]:
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
For non-numeric data, such as a Series of strings, describe reports a different set of statistics:
In [206]: obj = Series(['a', 'a', 'b', 'c'] * 4)

In [207]: obj.describe()
Out[207]:
count     16
unique     3
top        a
freq       8
(3) Similar methods are listed in the table below:
Method          Description
count           Number of non-NA values
describe        Compute set of summary statistics for Series or each DataFrame column
min, max        Compute minimum and maximum values
argmin, argmax  Compute index locations (integers) at which minimum or maximum value obtained, respectively
idxmin, idxmax  Compute index values at which minimum or maximum value obtained, respectively
quantile        Compute sample quantile ranging from 0 to 1
sum             Sum of values
mean            Mean of values
median          Arithmetic median (50% quantile) of values
mad             Mean absolute deviation from mean value
var             Sample variance of values
std             Sample standard deviation of values
skew            Sample skewness (3rd moment) of values
kurt            Sample kurtosis (4th moment) of values
cumsum          Cumulative sum of values
cummin, cummax  Cumulative minimum or maximum of values, respectively
cumprod         Cumulative product of values
diff            Compute 1st arithmetic difference (useful for time series)
pct_change      Compute percent changes
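Some summary statistics, such as correlation and covariance, are computed from pairs of arguments.
The following code fetches stock data for a few companies from Yahoo! Finance: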
import pandas.io.data as web

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2000', '1/1/2010')

price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.iteritems()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.iteritems()})
Use pct_change to compute the percent change of the prices from one row to the next:
In [209]: returns = price.pct_change()

In [210]: returns.tail()
Out[210]:
                AAPL      GOOG       IBM      MSFT
Date
2009-12-24  0.034339  0.011117  0.004420  0.002747
2009-12-28  0.012294  0.007098  0.013282  0.005479
2009-12-29 -0.011861 -0.005571 -0.003474  0.006812
2009-12-30  0.012147  0.005376  0.005468 -0.013532
2009-12-31 -0.004300 -0.004416 -0.012609 -0.015432
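To make the arithmetic behind pct_change concrete, here is a small sketch on made-up prices. The numbers and dates are hypothetical and are not taken from the Yahoo! data above. Each value is the current price divided by the previous price, minus one, so the first entry has no predecessor and is NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, invented purely for illustration.
price = pd.Series(
    [100.0, 102.0, 99.96],
    index=pd.to_datetime(["2009-12-29", "2009-12-30", "2009-12-31"]),
)

returns = price.pct_change()          # built-in percent change
manual = price / price.shift(1) - 1   # the same computation spelled out

print(returns)
print(manual)
```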
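(1) For a Series, corr computes the correlation between two Series using only the overlapping, non-NaN values aligned by index; cov computes the covariance in the same way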
In [211]: returns.MSFT.corr(returns.IBM)
Out[211]: 0.49609291822168838

In [212]: returns.MSFT.cov(returns.IBM)
Out[212]: 0.00021600332437329015
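The alignment and NaN handling described above can be checked on a toy example. The sketch below uses two hypothetical Series, x and y, with partly different index labels; the helper frame named common is only there to recompute the correlation by hand as covariance divided by the product of the standard deviations.

```python
import numpy as np
import pandas as pd

# Hypothetical data: label "e" is NaN in x, and label "f" exists only in y.
x = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan], index=list("abcde"))
y = pd.Series([2.0, 4.0, 7.0, 8.0, 10.0, 12.0], index=list("abcdef"))

cov_xy = x.cov(y)     # uses only labels a-d, where both values are present
corr_xy = x.corr(y)   # Pearson correlation on the same four pairs

# Recompute by hand on the overlapping, non-NaN rows.
common = pd.concat([x, y], axis=1, keys=["x", "y"]).dropna()
manual_corr = cov_xy / (common["x"].std() * common["y"].std())

print(cov_xy, corr_xy, manual_corr)
```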
(2) The corr and cov methods of DataFrame return the full correlation or covariance matrix for all pairs of columns
In [213]: returns.corr()
Out[213]:
          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.470660  0.410648  0.424550
GOOG  0.470660  1.000000  0.390692  0.443334
IBM   0.410648  0.390692  1.000000  0.496093
MSFT  0.424550  0.443334  0.496093  1.000000

In [214]: returns.cov()
Out[214]:
          AAPL      GOOG       IBM      MSFT
AAPL  0.001028  0.000303  0.000252  0.000309
GOOG  0.000303  0.000580  0.000142  0.000205
IBM   0.000252  0.000142  0.000367  0.000216
MSFT  0.000309  0.000205  0.000216  0.000516
Suppose we have the following Series:
In [217]: obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
unique returns an array of the distinct values in obj:
In [218]: uniques = obj.unique()

In [219]: uniques
Out[219]: array([c, a, d, b], dtype=object)
To count how many times each distinct value occurs, use value_counts:
In [220]: obj.value_counts()
Out[220]:
c    3
a    3
b    2
d    1
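isin performs a vectorized set-membership test; the resulting boolean mask can be used to filter the data down to a subset of values: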
In [222]: mask = obj.isin(['b', 'c'])

In [223]: mask
Out[223]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True

In [224]: obj[mask]
Out[224]:
0    c
5    b
6    b
7    c
8    c
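The three methods fit together naturally, as the following sketch shows. It assumes pandas is imported as pd and reuses the same obj as above.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

uniques = obj.unique()        # distinct values, in order of first appearance
counts = obj.value_counts()   # frequency of each value, most frequent first
mask = obj.isin(["b", "c"])   # boolean Series with the same index as obj
subset = obj[mask]            # keeps only the rows where mask is True

print(list(uniques))
print(counts)
print(subset)
```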
pandas represents missing data with NaN, and Python's built-in None value is also treated as missing.
In [229]: string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [232]: string_data[0] = None

In [233]: string_data.isnull()
Out[233]:
0     True
1    False
2     True
3    False
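A quick way to confirm that None and NaN are handled alike is to check both a string Series and a numeric one. This is a minimal sketch; the numeric Series is a hypothetical addition, included to show that None becomes NaN there and the values are stored as floats.

```python
import numpy as np
import pandas as pd

string_data = pd.Series(["aardvark", "artichoke", np.nan, "avocado"])
string_data[0] = None
missing = string_data.isnull()   # True for both the None and the NaN entry

# Hypothetical numeric example: None is converted to NaN on construction.
numeric = pd.Series([1, None, 3])

print(missing)
print(numeric)
```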
(1) Handling missing data: dropping
Use dropna(), or equivalently boolean indexing with data[data.notnull()]:
In [234]: from numpy import nan as NA

In [235]: data = Series([1, NA, 3.5, NA, 7])

In [236]: data.dropna()
Out[236]:
0    1.0
2    3.5
4    7.0
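On a DataFrame, dropna() by default drops every row that contains at least one missing value. The how argument changes this: how='all' drops only the rows in which every value is missing. In the example below, data is a 4x3 DataFrame rather than the Series above: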
In [242]: data.dropna(how='all')
Out[242]:
    0    1    2
0   1  6.5    3
1   1  NaN  NaN
3 NaN  6.5    3
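Passing axis=1 applies the same logic to columns; combined with how='all', only the columns that are entirely missing are dropped: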
In [243]: data[4] = NA

In [244]: data
Out[244]:
    0    1    2   4
0   1  6.5    3 NaN
1   1  NaN  NaN NaN
2 NaN  NaN  NaN NaN
3 NaN  6.5    3 NaN

In [245]: data.dropna(axis=1, how='all')
Out[245]:
    0    1    2
0   1  6.5    3
1   1  NaN  NaN
2 NaN  NaN  NaN
3 NaN  6.5    3
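The variants of dropna can be compared side by side. The sketch below rebuilds the DataFrame from the values shown in Out[244], assuming pandas is imported as pd; the result variable names are illustrative.

```python
import numpy as np
import pandas as pd

NA = np.nan
data = pd.DataFrame(
    [[1.0, 6.5, 3.0], [1.0, NA, NA], [NA, NA, NA], [NA, 6.5, 3.0]]
)

complete_rows = data.dropna()             # default: drop rows with any NaN
not_all_na = data.dropna(how="all")       # drop only rows that are all NaN

data[4] = NA                              # add a column that is entirely NaN
kept_cols = data.dropna(axis=1, how="all")  # drop only all-NaN columns

print(complete_rows)
print(not_all_na)
print(kept_cols)
```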
(2) Handling missing data: filling
Use the fillna method.
Given the data object df:
In [248]: df
          0         1         2
0 -0.577087       NaN       NaN
1  0.523772       NaN       NaN
2 -0.713544       NaN       NaN
3 -1.860761       NaN  0.560145
4 -1.265934       NaN -1.063512
5  0.332883 -2.359419 -0.199543
6 -1.541996 -0.970736 -1.307030
Fill the NaN values with 0:
In [250]: df.fillna(0)
Out[250]:
          0         1         2
0 -0.577087  0.000000  0.000000
1  0.523772  0.000000  0.000000
2 -0.713544  0.000000  0.000000
3 -1.860761  0.000000  0.560145
4 -1.265934  0.000000 -1.063512
5  0.332883 -2.359419 -0.199543
6 -1.541996 -0.970736 -1.307030
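You can also pass a dict to use a different fill value for each column. In the code below, NaN values in column 1 are filled with 0.5; the entry for column 3 has no effect because that column does not exist, and column 2 is left unchanged: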
In [251]: df.fillna({1: 0.5, 3: -1})
Out[251]:
          0         1         2
0 -0.577087  0.500000       NaN
1  0.523772  0.500000       NaN
2 -0.713544  0.500000       NaN
3 -1.860761  0.500000  0.560145
4 -1.265934  0.500000 -1.063512
5  0.332883 -2.359419 -0.199543
6 -1.541996 -0.970736 -1.307030
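The fill value can also be the result of a computation, such as the mean of the data. fillna additionally accepts an axis argument that selects whether filling runs along rows or columns: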
In [259]: data = Series([1., NA, 3.5, NA, 7])

In [260]: data.fillna(data.mean())
Out[260]:
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
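The dict and computed-value strategies also work on a DataFrame. The sketch below uses a small hypothetical frame with columns named x and y (made up for illustration). Passing df.mean() fills each column with its own mean, and the dict key "z" names a column that does not exist and is therefore ignored.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the column names are invented for this example.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 4.0, np.nan]})

by_column = df.fillna({"x": 0.5, "z": -1})  # per-column values; "z" is ignored
col_means = df.fillna(df.mean())            # each column filled with its mean

print(by_column)
print(col_means)
```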
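Hierarchical indexing is an important feature of pandas: it lets you work with higher-dimensional data in a lower-dimensional form.
(1) Basic usage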
In [261]: data = Series(np.random.randn(10), index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],[1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
In [262]: data
Out[262]:
a  1    0.670216
   2    0.852965
   3   -0.955869
b  1   -0.023493
   2   -2.304234
   3   -0.652469
c  1   -1.218302
   2   -1.332610
d  2    1.074623
   3    0.723642
(2) With a hierarchical index, subsets of the data can be selected concisely by outer-level label (partial indexing)
In [265]: data['b':'c']
Out[265]:
b  1   -0.023493
   2   -2.304234
   3   -0.652469
c  1   -1.218302
   2   -1.332610
You can also select on the inner level:
In [267]: data[:, 2]
Out[267]:
a    0.852965
b   -2.304234
c   -1.332610
d    1.074623
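The selections above can be written with .loc, which is the more explicit spelling in current pandas. This sketch is an illustration rather than the book's exact code: it replaces the random values with np.arange(10.0) so that the results are predictable.

```python
import numpy as np
import pandas as pd

# Same two-level index as above, with predictable values 0.0 .. 9.0.
data = pd.Series(
    np.arange(10.0),
    index=[list("aaabbbccdd"), [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]],
)

group_b = data.loc["b"]        # partial indexing: all rows under outer label b
b_to_c = data.loc["b":"c"]     # label slice on the outer level
inner_2 = data.loc[:, 2]       # every row whose inner label is 2

print(group_b)
print(b_to_c)
print(inner_2)
```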
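(3) The unstack method rearranges hierarchically indexed data into a DataFrame, and stack is the inverse operation that turns it back into hierarchically indexed data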
In [268]: data.unstack()
Out[268]:
          1         2         3
a  0.670216  0.852965 -0.955869
b -0.023493 -2.304234 -0.652469
c -1.218302 -1.332610       NaN
d       NaN  1.074623  0.723642
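A round trip shows that the two operations are inverses. The sketch below again uses predictable values in place of the random ones. The explicit dropna() after stack() is an assumption made for portability, because pandas versions differ on whether stack itself discards the NaN cells that unstack introduced.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(10.0),
    index=[list("aaabbbccdd"), [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]],
)

wide = data.unstack()              # inner index level becomes the columns
restored = wide.stack().dropna()   # back to two index levels, NaN cells removed

print(wide)
print(restored)
```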
(4) In a DataFrame, either axis can have a hierarchical index
In [270]: frame = DataFrame(np.arange(12).reshape((4, 3)),
   .....:                   index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
   .....:                   columns=[['Ohio', 'Ohio', 'Colorado'],
   .....:                            ['Green', 'Red', 'Green']])

In [271]: frame
Out[271]:
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
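With hierarchical columns, partial indexing works on the column axis as well. The sketch below rebuilds the same frame and selects a group of columns, a group of rows, and a single cell. The level names key1, key2, state and color are hypothetical labels added for readability and do not appear in the output above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

# Hypothetical level names, chosen only to make the printed frame easier to read.
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

ohio = frame["Ohio"]                              # all columns under Ohio
rows_b = frame.loc["b"]                           # all rows under outer label b
cell = frame.loc[("b", 2), ("Colorado", "Green")]  # one value via full keys

print(ohio)
print(rows_b)
print(cell)
```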