Lipgrant_python

Essential Basic Functionality in pandas

Overview

Head and Tail

Attributes and the raw ndarray(s)

Accelerated operations

Flexible binary operations

Matching / broadcasting behavior

Missing data / operations with fill values

Flexible comparisons

Boolean reductions

Comparing if objects are equivalent

Comparing array-like objects

Combining overlapping data sets

General DataFrame combine

Descriptive statistics

Summarizing data: describe

Index of min/max values

Value counts (histogramming) / mode

Discretization and quantiling

Function application

Tablewise function application

Row or column-wise function application

Applying elementwise functions

Aggregation API

Aggregating with multiple functions

Aggregating with a dict

Mixed dtypes

Custom describe

Transform API

Transform with multiple functions

Transforming with a dict

Reindexing and altering labels

Reindexing to align with another object

Aligning objects with each other with align

Filling while reindexing

Limits on filling while reindexing

Dropping labels from an axis

Renaming / mapping labels

Iteration

iteritems

iterrows

itertuples

.dt accessor

Vectorized string methods

Sorting

By index

By values

By indexes and values

searchsorted

smallest / largest values

Sorting by a multi-index column

Copying

dtypes

defaults

upcasting

astype

object conversion

gotchas

Selecting columns based on dtype

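Overview

This article introduces some of the essential basic functionality in pandas, again covering only what relates to the two data structures Series and DataFrame.

First, construct the objects used for demonstration (the examples assume numpy is imported as np and pandas as pd):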

In [1]: index = pd.date_range('1/1/2000', periods=8)

In [2]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [3]: df = pd.DataFrame(np.random.randn(8, 3), index=index,
   ...:                   columns=['A', 'B', 'C'])
   ...:

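Head and Tail

These two methods give a quick look at a small sample of a dataset. By default they show 5 rows, but you can pass a different number.

head shows the first rows and tail shows the last rows.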

In [5]: long_series = pd.Series(np.random.randn(1000))

In [6]: long_series.head()
Out[6]: 
0    0.229453
1    0.304418
2    0.736135
3   -0.859631
4   -0.424100
dtype: float64

In [7]: long_series.tail(3)
Out[7]: 
997   -0.351587
998    1.136249
999   -0.448789
dtype: float64

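Attributes and the raw ndarray(s)

pandas objects have a number of attributes for inspecting the underlying data structure.

shape: the dimensions of the object along each axis

Axis labels: for a Series this is the index; for a DataFrame it is the index and the columns.

The axis-label attributes can be assigned to directly. Assignment modifies the original object in place rather than returning a new object.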

In [8]: df[:2]
Out[8]: 
                   A         B        C
2000-01-01  0.048869 -1.360687 -0.47901
2000-01-02 -0.859661 -0.231595 -0.52775

In [9]: df.columns = [x.lower() for x in df.columns]

In [10]: df
Out[10]: 
                   a         b         c
2000-01-01  0.048869 -1.360687 -0.479010
2000-01-02 -0.859661 -0.231595 -0.527750
2000-01-03 -1.296337  0.150680  0.123836
2000-01-04  0.571764  1.555563 -0.823761
2000-01-05  0.535420 -1.032853  1.469725
2000-01-06  1.304124  1.449735  0.203109
2000-01-07 -1.032011  0.969818 -0.962723
2000-01-08  1.382083 -0.938794  0.669142

To get the actual data inside a pandas object, access the values attribute:

In [11]: s.values
Out[11]: array([-1.9339,  0.3773,  0.7341,  2.1416, -0.0112])

In [12]: df.values
Out[12]: 
array([[ 0.0489, -1.3607, -0.479 ],
       [-0.8597, -0.2316, -0.5278],
       [-1.2963,  0.1507,  0.1238],
       [ 0.5718,  1.5556, -0.8238],
       [ 0.5354, -1.0329,  1.4697],
       [ 1.3041,  1.4497,  0.2031],
       [-1.032 ,  0.9698, -0.9627],
       [ 1.3821, -0.9388,  0.6691]])

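As the results show, accessing values returns an ndarray. An ndarray has a dtype attribute (there are many dtypes; see the NumPy documentation).

If a DataFrame contains columns of several data types, the dtype of the returned ndarray is chosen to accommodate all of them.

If any column contains strings, the dtype of the ndarray returned by values is object.

If the columns contain only integers and floats, the dtype is upcast to float64.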

In [11]: df= pd.DataFrame({'a' : [1, 2, 1], 'b' : [1, 'B', 3] })

In [12]: df.values.dtype

Out[12]: object


In [13]: df= pd.DataFrame({'a' : [1, 1.8, 1], 'b' : [1, 2.5, 3] })

In [14]: df.values.dtype

Out[14]: float64
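
As a small illustration of the upcasting rule described above, the sketch below builds two hypothetical frames (the names mixed and numeric are not from the original article) and inspects the dtype of the array returned by values. It assumes NumPy and pandas are installed.

```python
import numpy as np
import pandas as pd

# A string in any column forces the combined array to the generic object dtype.
mixed = pd.DataFrame({"a": [1, 2, 1], "b": [1, "B", 3]})

# One integer column and one float column: the combined array is upcast to float64.
numeric = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})

mixed_dtype = mixed.values.dtype
numeric_dtype = numeric.values.dtype
print(mixed_dtype, numeric_dtype)
```

Note that column a of numeric is still stored as integers inside the DataFrame; only the combined array returned by values is upcast.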

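Accelerated operations

pandas can use the third-party numexpr and bottleneck libraries to accelerate certain types of binary numerical and boolean operations.

These libraries are especially useful when working with very large datasets, where they provide large speedups.

Benchmark results on a dataset of 100 columns × 100,000 rows: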

Operation	0.11.0 (ms)	Prior Version (ms)	Ratio to Prior
`df1 > df2`	13.32	125.35	0.1063
`df1 * df2`	21.71	36.63	0.5928
`df1 + df2`	22.04	36.50	0.6039

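Because of the substantial performance gain, installing both libraries is strongly recommended. They are used by default when installed, and can be turned off through options: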

pd.set_option('compute.use_bottleneck', False)
pd.set_option('compute.use_numexpr', False)

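Flexible binary operations

When performing binary operations between pandas data structures, two points deserve special attention:

1. Broadcasting behavior between higher-dimensional objects (e.g. DataFrame) and lower-dimensional objects (e.g. Series)

2. How missing values are handled in computations

Matching / broadcasting behavior

DataFrame provides the methods add(), sub(), mul(), div() and the related reversed versions radd(), rsub(), and so on, for carrying out binary operations.

For broadcasting behavior, Series input is of primary interest.

With these methods, the axis parameter selects whether to match on the index or on the columns.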

In [14]: df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   ....:                    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   ....:                    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
   ....: 

In [15]: df
Out[15]: 
        one       two     three
a -1.101558  1.124472       NaN
b -0.177289  2.487104 -0.634293
c  0.462215 -0.486066  1.931194
d       NaN -0.456288 -1.222918

In [16]: row = df.iloc[1]

In [17]: column = df['two']

In [18]: df.sub(row, axis='columns')
Out[18]: 
        one       two     three
a -0.924269 -1.362632       NaN
b  0.000000  0.000000  0.000000
c  0.639504 -2.973170  2.565487
d       NaN -2.943392 -0.588625

In [19]: df.sub(row, axis=1)
Out[19]: 
        one       two     three
a -0.924269 -1.362632       NaN
b  0.000000  0.000000  0.000000
c  0.639504 -2.973170  2.565487
d       NaN -2.943392 -0.588625

In [20]: df.sub(column, axis='index')
Out[20]: 
        one  two     three
a -2.226031  0.0       NaN
b -2.664393  0.0 -3.121397
c  0.948280  0.0  2.417260
d       NaN  0.0 -0.766631

In [21]: df.sub(column, axis=0)
Out[21]: 
        one  two     three
a -2.226031  0.0       NaN
b -2.664393  0.0 -3.121397
c  0.948280  0.0  2.417260
d       NaN  0.0 -0.766631

For data with a MultiIndex, an additional level parameter controls which level to match on:

In [22]: dfmi = df.copy()

In [23]: dfmi.index = pd.MultiIndex.from_tuples([(1,'a'),(1,'b'),(1,'c'),(2,'a')],
   ....:                                        names=['first','second'])
   ....: 

In [24]: dfmi.sub(column, axis=0, level='second')
Out[24]: 
                   one      two     three
first second                             
1     a      -2.226031  0.00000       NaN
      b      -2.664393  0.00000 -3.121397
      c       0.948280  0.00000  2.417260
2     a            NaN -1.58076 -2.347391

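Series and Index also support Python's built-in divmod(), which returns the quotient and the remainder as a tuple.

For a Series or Index, the two elements of the tuple have the same type as the input (and, for a Series, the same index):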

In [28]: s = pd.Series(np.arange(10))

In [29]: s
Out[29]: 
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [30]: div, rem = divmod(s, 3)

In [31]: div
Out[31]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    2
9    3
dtype: int64

In [32]: rem
Out[32]: 
0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
9    0
dtype: int64

In [33]: idx = pd.Index(np.arange(10))

In [34]: idx
Out[34]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [35]: div, rem = divmod(idx, 3)

In [36]: div
Out[36]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')

In [37]: rem
Out[37]: Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')

divmod() can also be applied elementwise against a sequence of divisors:

In [38]: div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

In [39]: div
Out[39]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    1
7    1
8    1
9    1
dtype: int64

In [40]: rem
Out[40]: 
0    0
1    1
2    2
3    0
4    0
5    1
6    1
7    2
8    2
9    3
dtype: int64

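Missing data / operations with fill values

The arithmetic methods on Series and DataFrame accept a fill_value parameter that specifies a value to substitute for missing values.

Two points are worth stressing. First, fill_value only takes effect at positions where at most one of the two aligned operands is missing.

When both values at a position are missing, the result is still missing.

Second, fill_value is not applied in computations that involve broadcasting.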

In [41]: df
Out[41]: 
        one       two     three
a -1.101558  1.124472       NaN
b -0.177289  2.487104 -0.634293
c  0.462215 -0.486066  1.931194
d       NaN -0.456288 -1.222918

In [42]: df2
Out[42]: 
        one       two     three
a -1.101558  1.124472  1.000000
b -0.177289  2.487104 -0.634293
c  0.462215 -0.486066  1.931194
d       NaN -0.456288 -1.222918

In [43]: df + df2
Out[43]: 
        one       two     three
a -2.203116  2.248945       NaN
b -0.354579  4.974208 -1.268586
c  0.924429 -0.972131  3.862388
d       NaN -0.912575 -2.445837

In [44]: df.add(df2, fill_value=0)
Out[44]: 
        one       two     three
a -2.203116  2.248945  1.000000
b -0.354579  4.974208 -1.268586
c  0.924429 -0.972131  3.862388
d       NaN -0.912575 -2.445837
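
The session above relies on a df2 whose construction is not shown. As a self-contained sketch of the same rule, the example below uses two small hypothetical frames (left and right are illustrative names only) to show that fill_value fills a position only when one side is missing.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"x": [10.0, 20.0, np.nan]})

# Plain addition: the result is NaN wherever either operand is NaN.
plain = left + right

# With fill_value=0, a single missing operand is treated as 0;
# the last row stays NaN because both operands are missing there.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```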

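Flexible comparisons

pandas provides the comparison methods eq, ne, lt, gt, le and ge. They behave much like the arithmetic methods described above and return boolean data with the same shape as the compared objects.

In these comparisons, an ordering comparison (lt, gt, le, ge) between NaN and any value, whether a scalar or another NaN, always returns False.

Only the not-equal comparison (ne) involving NaN always returns True.

Comparing np.nan with np.nan for equality always returns False.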

In [45]: df.gt(df2)
Out[45]: 
     one    two  three
a  False  False  False
b  False  False  False
c  False  False  False
d  False  False  False

In [46]: df2.ne(df)
Out[46]: 
     one    two  three
a  False  False   True
b  False  False  False
c  False  False  False
d   True  False  False
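
The NaN rules above can be checked on a tiny example. This is a minimal sketch with two hypothetical Series, s and other, that share a NaN at the same position.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
other = pd.Series([1.0, np.nan, 2.0])

equal = s.eq(other)          # NaN never equals NaN
not_equal = s.ne(other)      # so "not equal" is True at the NaN position
greater = s.gt(other)        # ordering comparisons with NaN are False
less_or_equal = s.le(other)  # ... in both directions

print(pd.DataFrame({"eq": equal, "ne": not_equal, "gt": greater, "le": less_or_equal}))
```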

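Boolean reductions

pandas provides empty, any(), all() and bool() for reducing an object to a boolean result.

On a DataFrame the reduction is performed column by column by default; the axis parameter controls the direction.

Note that a boolean reduction does not evaluate the truthiness of arbitrary scalars; it reduces data that already consists of boolean values.

The examples below likewise operate on objects that contain only boolean values.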

In [47]: (df > 0).all()
Out[47]: 
one      False
two      False
three    False
dtype: bool

In [48]: (df > 0).any()
Out[48]: 
one      True
two      True
three    True
dtype: bool

The reductions can be chained to arrive at a single boolean value:

In [49]: (df > 0).any().any()
Out[49]: True

The empty property tests whether a pandas object is empty:

In [50]: df.empty
Out[50]: False

In [51]: pd.DataFrame(columns=list('ABC')).empty
Out[51]: True

To evaluate a pandas object that contains a single boolean element:

In [52]: pd.Series([True]).bool()
Out[52]: True

In [53]: pd.Series([False]).bool()
Out[53]: False

In [54]: pd.DataFrame([[True]]).bool()
Out[54]: True

In [55]: pd.DataFrame([[False]]).bool()
Out[55]: False

Warning:

Trying to test a pandas object the way you would in plain Python, as below, raises an error.

if df:
     ...

df and df2

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
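
A minimal sketch of the warning above: using a DataFrame in an if statement raises ValueError, while the explicit alternatives state what is actually being asked. The frame and variable names here are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# Implicit truth testing is ambiguous for a DataFrame and raises ValueError.
try:
    if df:
        pass
    raised = False
except ValueError:
    raised = True

# Explicit alternatives: is the frame empty, or is any value positive?
is_empty = df.empty
any_positive = bool((df > 0).any().any())

print(raised, is_empty, any_positive)
```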

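Comparing if objects are equivalent

Often the same result can be computed in different ways, for example df+df and df*2. Using the boolean reductions above,

you might test whether the two results are equal like this (the result may be surprising):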

In [56]: df+df == df*2
Out[56]: 
     one   two  three
a   True  True  False
b   True  True   True
c   True  True   True
d  False  True   True

In [57]: (df+df == df*2).all()
Out[57]: 
one      False
two       True
three    False
dtype: bool

This is because the objects contain missing values, and missing values never compare as equal.

In [58]: np.nan == np.nan
Out[58]: False

For this reason pandas provides the equals() method for testing equality. Objects are considered equal as long as their missing values are in the same locations.

In [59]: (df+df).equals(df*2)
Out[59]: True

Note that for the comparison to succeed, the index of the two objects must also be in the same order.

In [60]: df1 = pd.DataFrame({'col':['foo', 0, np.nan]})

In [61]: df2 = pd.DataFrame({'col':[np.nan, 0, 'foo']}, index=[2,1,0])

In [62]: df1.equals(df2)
Out[62]: False

In [63]: df1.equals(df2.sort_index())
Out[63]: True

Comparing array-like objects

Note that this section is about array-like objects, which in pandas chiefly means Series and Index.

Comparing an array-like object with a scalar elementwise is straightforward:

In [64]: pd.Series(['foo', 'bar', 'baz']) == 'foo'
Out[64]: 
0     True
1    False
2    False
dtype: bool

In [65]: pd.Index(['foo', 'bar', 'baz']) == 'foo'
Out[65]: array([ True, False, False], dtype=bool)

Elementwise comparison between two array-like objects of the same length is also supported:

In [66]: pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])
Out[66]: 
0     True
1     True
2    False
dtype: bool

In [67]: pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])
Out[67]: 
0     True
1     True
2    False
dtype: bool

Comparing two array-like objects of different lengths raises an error:

In [55]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
ValueError: Series lengths must match to compare

In [56]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])
ValueError: Series lengths must match to compare

Note that this differs from NumPy, where a comparison between ndarrays can be broadcast:

In [68]: np.array([1, 2, 3]) == np.array([2])
Out[68]: array([False,  True, False], dtype=bool)

In the NumPy version shown here, a comparison whose operands cannot be broadcast returns False:

In [69]: np.array([1, 2, 3]) == np.array([1, 2])
Out[69]: False

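Combining overlapping data sets

When combining two similar datasets that overlap, you often want to fill missing values in the first dataset with known values from the second.

The combine_first method does this: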

 In [70]: df1 = pd.DataFrame({'A' : [1., np.nan, 3., 5., np.nan],
   ....:                     'B' : [np.nan, 2., 3., np.nan, 6.]})
   ....: 

In [71]: df2 = pd.DataFrame({'A' : [5., 2., 4., np.nan, 3., 7.],
   ....:                     'B' : [np.nan, np.nan, 3., 4., 6., 8.]})
   ....: 

In [72]: df1
Out[72]: 
     A    B
0  1.0  NaN
1  NaN  2.0
2  3.0  3.0
3  5.0  NaN
4  NaN  6.0

In [73]: df2
Out[73]: 
     A    B
0  5.0  NaN
1  2.0  NaN
2  4.0  3.0
3  NaN  4.0
4  3.0  6.0
5  7.0  8.0

In [74]: df1.combine_first(df2)
Out[74]: 
     A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0

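General DataFrame combine

Under the hood, combine_first calls the DataFrame combine method.

combine takes another DataFrame and a combiner function.

It first aligns the input DataFrame, then calls the combiner function on pairs of columns to merge the data: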

combiner = lambda x, y: np.where(pd.isna(x), y, x)

df1.combine(df,combiner)
Out[139]: 
    one  two  three
a   2.0    9   10.0
b   6.0    5    1.0
c   5.0    8  100.0
d  10.0    6    1.0
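
The output above comes from frames that are not defined in this article. As a self-contained sketch, the example below uses two small hypothetical frames and a combiner that keeps the value from the first frame where it is present and falls back to the second otherwise, which mirrors what combine_first does. The combiner is written with Series.where rather than np.where; that is a choice made for this sketch.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 2.0, 3.0]})
df2 = pd.DataFrame({"A": [5.0, 2.0, 4.0], "B": [7.0, np.nan, 9.0]})

def combiner(x, y):
    # x and y are the aligned columns (Series) from df1 and df2.
    # Keep x where it is not missing, otherwise take the value from y.
    return x.where(x.notna(), y)

combined = df1.combine(df2, combiner)
reference = df1.combine_first(df2)

print(combined)
```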

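Descriptive statistics

pandas provides a large number of descriptive statistics methods. Some return an aggregated result, such as sum and mean.

Others return a dataset with the same shape as the original, such as cumsum and cumprod.

In general, these methods accept an axis parameter that controls the axis along which the computation runs.

The axis parameter can be an axis name, such as 'index' or 'columns', or an integer, such as 0 or 1.

In [77]: df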
Out[77]: 
        one       two     three
a -1.101558  1.124472       NaN
b -0.177289  2.487104 -0.634293
c  0.462215 -0.486066  1.931194
d       NaN -0.456288 -1.222918

In [78]: df.mean(0)
Out[78]: 
one     -0.272211
two      0.667306
three    0.024661
dtype: float64

In [79]: df.mean(1)
Out[79]: 
a    0.011457
b    0.558507
c    0.635781
d   -0.839603
dtype: float64

All of these methods have a skipna parameter, True by default, that specifies whether missing values are ignored.

In [80]: df.sum(0, skipna=False)
Out[80]: 
one           NaN
two      2.669223
three         NaN
dtype: float64

In [81]: df.sum(axis=1, skipna=True)
Out[81]: 
a    0.022914
b    1.675522
c    1.907343
d   -1.679206
dtype: float64

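Combined with broadcasting and arithmetic behavior, various statistical procedures can be written very concisely, such as standardization (giving the data a mean of zero and a standard deviation of 1):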

In [82]: ts_stand = (df - df.mean()) / df.std()

In [83]: ts_stand.std()
Out[83]: 
one      1.0
two      1.0
three    1.0
dtype: float64

In [84]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [85]: xs_stand.std(1)
Out[85]: 
a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64
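
The column-wise standardization can be verified on a small deterministic frame. This is a minimal sketch with made-up numbers; the frame and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0],
                   "two": [10.0, 20.0, 40.0, 80.0]})

# df.mean() and df.std() are Series indexed by column name, so the
# subtraction and division broadcast across the rows of each column.
standardized = (df - df.mean()) / df.std()

column_means = standardized.mean()
column_stds = standardized.std()
print(column_means, column_stds, sep="\n")
```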

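Note that cumsum() and cumprod() preserve the locations of NaN values, whereas expanding() and rolling() behave differently.

For more on this topic, see: pandas computational tools.

Commonly used methods of this kind include count, sum, mean, median, min, max, std, cumsum and cumprod. In the pandas version described here, the reduction methods among them also accept an optional level parameter, which applies only when the object has a hierarchical index.

Note that some NumPy functions, such as mean, std and sum, exclude missing values by default when given a Series, but not when given the underlying ndarray: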

In [87]: np.mean(df['one'])
Out[87]: -0.27221094480450114

In [88]: np.mean(df['one'].values)
Out[88]: nan
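
A short sketch of the same difference on a hypothetical three-element Series: the pandas method skips NaN, NumPy applied to the raw array does not, and np.nanmean is the NaN-aware NumPy counterpart.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

pandas_mean = s.mean()             # skipna=True by default
raw_mean = np.mean(s.values)       # plain ndarray: NaN propagates
nan_aware = np.nanmean(s.values)   # NumPy function that ignores NaN

print(pandas_mean, raw_mean, nan_aware)
```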

Series.nunique() returns the number of unique non-missing values in a Series.

In [89]: series = pd.Series(np.random.randn(500))

In [90]: series[20:500] = np.nan

In [91]: series[10:20]  = 5

In [92]: series.nunique()
Out[92]: 11

Summarizing data: describe

The describe() method quickly computes summary statistics for a Series or DataFrame:

In [93]: series = pd.Series(np.random.randn(1000))

In [94]: series[::2] = np.nan

In [95]: series.describe()
Out[95]: 
count    500.000000
mean      -0.032127
std        1.067484
min       -3.463789
25%       -0.725523
50%       -0.053230
75%        0.679790
max        3.120271
dtype: float64

In [96]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])

In [97]: frame.iloc[::2] = np.nan

In [98]: frame.describe()
Out[98]: 
                a           b           c           d           e
count  500.000000  500.000000  500.000000  500.000000  500.000000
mean    -0.045109   -0.052045    0.024520    0.006117    0.001141
std      1.029268    1.002320    1.042793    1.040134    1.005207
min     -2.915767   -3.294023   -3.610499   -2.907036   -3.010899
25%     -0.763783   -0.720389   -0.609600   -0.665896   -0.682900
50%     -0.086033   -0.048843    0.006093    0.043191   -0.001651
75%      0.663399    0.620980    0.728382    0.735973    0.656439
max      3.400646    2.925597    3.416896    3.331522    3.007143

You can select specific percentiles to include with the percentiles parameter (the median is always included by default):

In [99]: series.describe(percentiles=[.05, .25, .75, .95])
Out[99]: 
count    500.000000
mean      -0.032127
std        1.067484
min       -3.463789
5%        -1.733545
25%       -0.725523
50%       -0.053230
75%        0.679790
95%        1.854383
max        3.120271
dtype: float64

For a non-numeric Series, describe() gives a summary of the number of unique values and the most frequently occurring value:

In [100]: s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

In [101]: s.describe()
Out[101]: 
count     9
unique    4
top       a
freq      5
dtype: object

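Note that on a DataFrame with mixed types, describe() summarizes only the numeric columns.

If there are no numeric columns, it behaves as for a non-numeric Series and reports the number of unique values and the most frequent value.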

In [102]: frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

In [103]: frame.describe()
Out[103]: 
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

This behavior can be controlled by passing lists of types to the include and exclude parameters.

The special value 'all' can also be passed as include:

In [104]: frame.describe(include=['object'])
Out[104]: 
          a
count     4
unique    2
top     Yes
freq      2

In [105]: frame.describe(include=['number'])
Out[105]: 
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

In [106]: frame.describe(include='all')
Out[106]: 
          a         b
count     4  4.000000
unique    2       NaN
top     Yes       NaN
freq      2       NaN
mean    NaN  1.500000
std     NaN  1.290994
min     NaN  0.000000
25%     NaN  0.750000
50%     NaN  1.500000
75%     NaN  2.250000
max     NaN  3.000000

Index of min/max values

The idxmin() and idxmax() methods on Series and DataFrame return the index labels of the minimum and maximum values:

In [107]: s1 = pd.Series(np.random.randn(5))

In [108]: s1
Out[108]: 
0   -1.649461
1    0.169660
2    1.246181
3    0.131682
4   -2.001988
dtype: float64

In [109]: s1.idxmin(), s1.idxmax()
Out[109]: (4, 2)

In [110]: df1 = pd.DataFrame(np.random.randn(5,3), columns=['A','B','C'])

In [111]: df1
Out[111]: 
          A         B         C
0 -1.273023  0.870502  0.214583
1  0.088452 -0.173364  1.207466
2  0.546121  0.409515 -0.310515
3  0.585014 -0.490528 -0.054639
4 -0.239226  0.701089  0.228656

In [112]: df1.idxmin(axis=0)
Out[112]: 
A    0
B    3
C    2
dtype: int64

In [113]: df1.idxmax(axis=1)
Out[113]: 
0    B
1    C
2    A
3    A
4    B
dtype: object

When several rows or columns match the minimum or maximum, only the first matching index is returned:

In [114]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

In [115]: df3
Out[115]: 
     A
e  2.0
d  1.0
c  1.0
b  3.0
a  NaN

In [116]: df3['A'].idxmin()
Out[116]: 'd'

idxmin and idxmax correspond to what NumPy calls argmin and argmax, except that they return index labels rather than integer positions.
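
To make the label-versus-position distinction concrete, here is a minimal sketch on a hypothetical Series with string labels.

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 2.0], index=["x", "y", "z"])

label = s.idxmin()             # index label of the smallest value
position = s.values.argmin()   # integer position in the underlying array

print(label, position)
```

Looking up s.index[position] gives back the same label that idxmin returns.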

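Value counts (histogramming) / mode

The value_counts method counts how often each value occurs. It works on a Series or, as a top-level function, on a one-dimensional array; in the pandas version used here it is not available on DataFrame: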

In [117]: data = np.random.randint(0, 7, size=50)

In [118]: data
Out[118]: 
array([3, 3, 0, 2, 1, 0, 5, 5, 3, 6, 1, 5, 6, 2, 0, 0, 6, 3, 3, 5, 0, 4, 3,
       3, 3, 0, 6, 1, 3, 5, 5, 0, 4, 0, 6, 3, 6, 5, 4, 3, 2, 1, 5, 0, 1, 1,
       6, 4, 1, 4])

In [119]: s = pd.Series(data)

In [120]: s.value_counts()
Out[120]: 
3    11
0     9
5     8
6     7
1     7
4     5
2     3
dtype: int64

In [121]: pd.value_counts(data)
Out[121]: 
3    11
0     9
5     8
6     7
1     7
4     5
2     3
dtype: int64

The mode method returns the most frequently occurring value(s), and it is available on both Series and DataFrame:

s = pd.Series([1, 1, 3, 3, 3, 3, 5, 7, 7, 7])

s.mode()
Out[189]: 
0    3
dtype: int64

df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
                    "B": np.random.randint(-10, 15, size=50)})

df5.mode()
Out[193]: 
   A  B
0  2 -7
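
Because the examples above use random data, the following sketch repeats the idea on a small fixed Series (illustrative values only). It also shows that mode returns every value that ties for the highest frequency.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 3])

counts = s.value_counts()   # value -> number of occurrences
modes = s.mode()            # all values that share the highest count

print(counts)
print(modes)
```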

Discretization and quantiling

Continuous data can be discretized with cut (bins based on values) and qcut (bins based on sample quantiles).

In [126]: arr = np.random.randn(20)

In [127]: factor = pd.cut(arr, 4)

In [128]: factor
Out[128]: 
[(-2.611, -1.58], (0.473, 1.499], (-2.611, -1.58], (-1.58, -0.554], (-0.554, 0.473], ..., (0.473, 1.499], (0.473, 1.499], (-0.554, 0.473], (-0.554, 0.473], (-0.554, 0.473]]
Length: 20
Categories (4, interval[float64]): [(-2.611, -1.58] < (-1.58, -0.554] < (-0.554, 0.473] <
                                    (0.473, 1.499]]

In [129]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])

In [130]: factor
Out[130]: 
[(-5, -1], (0, 1], (-5, -1], (-1, 0], (-1, 0], ..., (1, 5], (1, 5], (-1, 0], (-1, 0], (-1, 0]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]

qcut computes sample quantiles. For example, we can slice some normally distributed data into equal-size quartiles:

In [131]: arr = np.random.randn(30)

In [132]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])

In [133]: factor
Out[133]: 
[(0.544, 1.976], (0.544, 1.976], (-1.255, -0.375], (0.544, 1.976], (-0.103, 0.544], ..., (-0.103, 0.544], (0.544, 1.976], (-0.103, 0.544], (-1.255, -0.375], (-0.375, -0.103]]
Length: 30
Categories (4, interval[float64]): [(-1.255, -0.375] < (-0.375, -0.103] < (-0.103, 0.544] <
                                    (0.544, 1.976]]

In [134]: pd.value_counts(factor)
Out[134]: 
(0.544, 1.976]      8
(-1.255, -0.375]    8
(-0.103, 0.544]     7
(-0.375, -0.103]    7
dtype: int64

You can also pass infinite values as bin edges:

In [135]: arr = np.random.randn(20)

In [136]: factor = pd.cut(arr, [-np.inf, 0, np.inf])

In [137]: factor
Out[137]: 
[(0.0, inf], (0.0, inf], (0.0, inf], (0.0, inf], (-inf, 0.0], ..., (-inf, 0.0], (-inf, 0.0], (0.0, inf], (-inf, 0.0], (0.0, inf]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
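
The random inputs above make the bins hard to check by eye. As a deterministic sketch, the example below bins a few fixed values with cut, using the labels parameter to name the bins (the label names are made up for illustration), and splits the integers 0 to 7 into quartiles with qcut.

```python
import numpy as np
import pandas as pd

# cut: bins are defined by value edges; intervals are closed on the right.
values = np.array([1, 5, 9])
binned = pd.cut(values, bins=[0, 3, 6, 10], labels=["low", "mid", "high"])

# qcut: bins are defined by quantiles, so each bin holds the same number of items here.
quartiles = pd.qcut(np.arange(8), 4)
sizes = pd.Series(quartiles).value_counts().tolist()

print(list(binned), sizes)
```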

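Function application

When applying your own function or a third-party library's function to pandas objects, whether it is applied appropriately

depends on the scope the function expects to operate on. The following methods each cover a different scope:

1. Tablewise function application: pipe()

2. Row or column-wise function application: apply()

3. Elementwise function application: applymap()

4. Aggregation API: agg() and transform()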

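Tablewise function application

Series and DataFrame objects can of course be passed as arguments to other functions.

When several functions need to be called in a chain, however, the code looks like this pseudocode: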

# f, g, and h are functions taking and returning ``DataFrames``
 f(g(h(df), arg1=1), arg2=2, arg3=3)

In that case consider using the pipe() method. The following is equivalent to the pseudocode above:

(df.pipe(h)
       .pipe(g, arg1=1)
       .pipe(f, arg2=2, arg3=3)
    )

pandas encourages the second style: chaining calls with pipe is simpler and clearer.
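
The f, g and h in the snippets above are placeholders. As a runnable sketch of the same pattern, the example below defines two hypothetical helper functions (add_total and scale are not part of pandas) and checks that the nested call and the pipe chain give the same result.

```python
import pandas as pd

def add_total(frame):
    """Hypothetical helper: return a copy with a 'total' column."""
    return frame.assign(total=frame["a"] + frame["b"])

def scale(frame, factor=1):
    """Hypothetical helper: multiply every value by factor."""
    return frame * factor

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Nested calls read inside-out ...
nested = scale(add_total(df), factor=10)

# ... while the pipe chain reads top to bottom in execution order.
piped = (df.pipe(add_total)
           .pipe(scale, factor=10))

print(piped)
```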

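In the example above, the functions f, g and h each expect the DataFrame as the first positional argument.

What if the function you want to apply takes its data as, say, the second argument?

In that case, pass pipe a tuple (callable, data_keyword); pipe routes the DataFrame to the argument named in the tuple.

For example, we can fit a regression using statsmodels. Its API expects a formula first and a DataFrame as the second argument, data.

We pass the tuple (sm.ols, 'data') to pipe: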

In [138]: import statsmodels.formula.api as sm

In [139]: bb = pd.read_csv('data/baseball.csv', index_col='id')

In [140]: (bb.query('h > 0')
   .....:    .assign(ln_h = lambda df: np.log(df.h))
   .....:    .pipe((sm.ols, 'data'), 'hr ~ ln_h + year + g + C(lg)')
   .....:    .fit()
   .....:    .summary()
   .....: )
   .....: 
Out[140]: 

"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     hr   R-squared:                       0.685
Model:                            OLS   Adj. R-squared:                  0.665
Method:                 Least Squares   F-statistic:                     34.28
Date:                Sun, 05 Aug 2018   Prob (F-statistic):           3.48e-15
Time:                        11:57:36   Log-Likelihood:                -205.92
No. Observations:                  68   AIC:                             421.8
Df Residuals:                      63   BIC:                             432.9
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept   -8484.7720   4664.146     -1.819      0.074   -1.78e+04     835.780
C(lg)[T.NL]    -2.2736      1.325     -1.716      0.091      -4.922       0.375
ln_h           -1.3542      0.875     -1.547      0.127      -3.103       0.395
year            4.2277      2.324      1.819      0.074      -0.417       8.872
g               0.1841      0.029      6.258      0.000       0.125       0.243
==============================================================================
Omnibus:                       10.875   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               17.298
Skew:                           0.537   Prob(JB):                     0.000175
Kurtosis:                       5.225   Cond. No.                     1.49e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

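The pipe method is inspired by Unix pipes and, more recently, by the data-manipulation grammars dplyr and magrittr. If you are interested, read the source code of pipe.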
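
The statsmodels example needs an external data file, so here is a dependency-free sketch of the (callable, data_keyword) form. The helper label_rows is hypothetical; what matters is that its DataFrame argument is named data and is not the first positional parameter.

```python
import pandas as pd

def label_rows(prefix, data):
    """Hypothetical helper: prefix every index label of `data`."""
    return data.rename(index=lambda i: f"{prefix}{i}")

df = pd.DataFrame({"a": [1, 2]})

# pipe passes df as the keyword argument 'data' and forwards "row_"
# as the first positional argument, i.e. label_rows("row_", data=df).
labeled = df.pipe((label_rows, "data"), "row_")

print(labeled)
```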

Row or column-wise function application

Arbitrary functions can be applied along the axes of a DataFrame with the apply() method which, like the descriptive statistics methods, takes an optional axis parameter:

In [141]: df.apply(np.mean)
Out[141]: 
one     -0.272211
two      0.667306
three    0.024661
dtype: float64

In [142]: df.apply(np.mean, axis=1)
Out[142]: 
a    0.011457
b    0.558507
c    0.635781
d   -0.839603
dtype: float64

In [143]: df.apply(lambda x: x.max() - x.min())
Out[143]: 
one      1.563773
two      2.973170
three    3.154112
dtype: float64

In [144]: df.apply(np.cumsum)
Out[144]: 
        one       two     three
a -1.101558  1.124472       NaN
b -1.278848  3.611576 -0.634293
c -0.816633  3.125511  1.296901
d       NaN  2.669223  0.073983

In [145]: df.apply(np.exp)
Out[145]: 
        one        two    three
a  0.332353   3.078592      NaN
b  0.837537  12.026397  0.53031
c  1.587586   0.615041  6.89774
d       NaN   0.633631  0.29437

apply can also take the name of a descriptive statistics method as a string:

In [146]: df.apply('mean')
Out[146]: 
one     -0.272211
two      0.667306
three    0.024661
dtype: float64

In [147]: df.apply('mean', axis=1)
Out[147]: 
a    0.011457
b    0.558507
c    0.635781
d   -0.839603
dtype: float64

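The return type of the function passed to apply() affects the type of the final output. The default rules are:

1. If the applied function returns a Series, the final output is a DataFrame. The columns match the index of the Series returned by the applied function.

2. If the applied function returns any other type, the final output is a Series.

This default behavior can be overridden with the result_type parameter, which accepts three options: reduce, broadcast and expand.

These determine how list-like return values expand (or do not expand) into a DataFrame.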
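
As a sketch of how result_type changes the output, the example below applies a hypothetical row function that returns a plain list. With the default setting the lists are kept as elements of a Series; with result_type="expand" they are spread into columns.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 4], "b": [2, 5], "c": [3, 6]})

def min_and_max(row):
    # Returns a plain list, i.e. a list-like that is not a Series.
    return [row.min(), row.max()]

# Default: one list per row, stored in a Series.
as_series = df.apply(min_and_max, axis=1)

# expand: each list element becomes its own column (labelled 0 and 1).
as_frame = df.apply(min_and_max, axis=1, result_type="expand")

print(as_series)
print(as_frame)
```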

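With a little ingenuity, apply() can be used to answer many questions about a dataset. For example, suppose we want to extract the date on which the maximum value of each column occurred: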

In [148]: tsdf = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=1000))
   .....: 

In [149]: tsdf.apply(lambda x: x.idxmax())
Out[149]: 
A   2001-04-25
B   2002-05-31
C   2002-09-25
dtype: datetime64[ns]

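You can also pass additional positional and keyword arguments to apply(). For instance, consider the following function you would like to apply: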

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

You can then apply it as follows:

df.apply(subtract_and_divide, args=(5,), divide=3)
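
Put together as a runnable sketch on a small made-up frame, the call above behaves as follows: args supplies the extra positional argument sub, and divide is forwarded as a keyword argument.

```python
import pandas as pd

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

df = pd.DataFrame({"a": [5.0, 8.0], "b": [11.0, 14.0]})

# Each column is passed as x; sub=5 comes from args, divide=3 from the keyword.
result = df.apply(subtract_and_divide, args=(5,), divide=3)

print(result)
```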

Another useful feature is the ability to pass Series methods to carry out some Series operation on each column or row:

In [150]: tsdf
Out[150]: 
                   A         B         C
2000-01-01 -0.720299  0.546303 -0.082042
2000-01-02  0.200295 -0.577554 -0.908402
2000-01-03  0.102533  1.653614  0.303319
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.532566  0.341548  0.150493
2000-01-09  0.330418  1.761200  0.567133
2000-01-10 -0.251020  1.020099  1.893177

In [151]: tsdf.apply(pd.Series.interpolate)
Out[151]: 
                   A         B         C
2000-01-01 -0.720299  0.546303 -0.082042
2000-01-02  0.200295 -0.577554 -0.908402
2000-01-03  0.102533  1.653614  0.303319
2000-01-04  0.188539  1.391201  0.272754
2000-01-05  0.274546  1.128788  0.242189
2000-01-06  0.360553  0.866374  0.211624
2000-01-07  0.446559  0.603961  0.181059
2000-01-08  0.532566  0.341548  0.150493
2000-01-09  0.330418  1.761200  0.567133
2000-01-10 -0.251020  1.020099  1.893177

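Finally, apply() takes a raw parameter, False by default, in which case each row or column is converted to a Series before the function is applied. When set to True, the passed function receives an ndarray instead, which improves performance if you do not need the indexing functionality of Series.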
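
A minimal sketch of the raw parameter on a hypothetical two-column frame: the same range computation gives identical results either way, and a small check confirms that with raw=True the function really receives ndarray objects.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 4.0], "b": [2.0, 8.0]})

# Default (raw=False): each column arrives as a Series.
as_series = df.apply(lambda col: col.max() - col.min())

# raw=True: each column arrives as a NumPy array, skipping Series construction.
as_array = df.apply(lambda arr: arr.max() - arr.min(), raw=True)

# Check what type the function actually receives when raw=True.
got_ndarray = df.apply(lambda arr: isinstance(arr, np.ndarray), raw=True)

print(as_series, as_array, got_ndarray, sep="\n")
```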

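Applying elementwise functions

Not all functions can be vectorized (accept a NumPy array and return another array or value), so the applymap() method on DataFrame and, analogously, the map() method on Series accept any Python function that takes a single value and returns a single value. For example: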

In [188]: df4
Out[188]: 
        one       two     three
a -1.101558  1.124472       NaN
b -0.177289  2.487104 -0.634293
c  0.462215 -0.486066  1.931194
d       NaN -0.456288 -1.222918

In [189]: f = lambda x: len(str(x))

In [190]: df4['one'].map(f)
Out[190]: 
a    19
b    20
c    18
d     3
Name: one, dtype: int64

In [191]: df4.applymap(f)
Out[191]: 
   one  two  three
a   19   18      3
b   20   18     19
c   18   20     18
d    3   19     19

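Series.map() has an additional feature:

it can be used to "link" or "map" values defined by a second Series. This is closely related to pandas' merging/joining functionality: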

In [192]: s = pd.Series(['six', 'seven', 'six', 'seven', 'six'],
   .....:               index=['a', 'b', 'c', 'd', 'e'])
   .....: 

In [193]: t = pd.Series({'six' : 6., 'seven' : 7.})

In [194]: s
Out[194]: 
a      six
b    seven
c      six
d    seven
e      six
dtype: object

In [195]: s.map(t)
Out[195]: 
a    6.0
b    7.0
c    6.0
d    7.0
e    6.0
dtype: float64
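
Series.map also accepts a plain dict as the lookup table. The sketch below, with made-up values, additionally shows what happens to a value that has no entry in the mapping: it becomes NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series(["six", "seven", "eight"])
lookup = {"six": 6, "seven": 7}   # no entry for "eight"

mapped = s.map(lookup)

print(mapped)
```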

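Aggregation API

The aggregation API lets you express possibly multiple aggregation operations in a single concise way.

The entry point for aggregation is DataFrame.aggregate(), or its alias DataFrame.agg().

We use data similar to the earlier examples: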

In [152]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=10))
   .....: 

In [153]: tsdf.iloc[3:7] = np.nan

In [154]: tsdf
Out[154]: 
                   A         B         C
2000-01-01  0.170247 -0.916844  0.835024
2000-01-02  1.259919  0.801111  0.445614
2000-01-03  1.453046  2.430373  0.653093
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08 -1.874526  0.569822 -0.609644
2000-01-09  0.812462  0.565894 -1.461363
2000-01-10 -0.985475  1.388154 -0.078747

As with apply(), you can pass a function or a function name as a string; the result is a Series of the aggregated output.

In [155]: tsdf.agg(np.sum)
Out[155]: 
A    0.835673
B    4.838510
C   -0.216025
dtype: float64

In [156]: tsdf.agg('sum')
Out[156]: 
A    0.835673
B    4.838510
C   -0.216025
dtype: float64

# The two calls above are equivalent to:
In [157]: tsdf.sum()
Out[157]: 
A    0.835673
B    4.838510
C   -0.216025
dtype: float64

Aggregating a single Series returns a scalar value:

In [158]: tsdf.A.agg('sum')
Out[158]: 0.83567297915820504

Aggregating with multiple functions

You can pass multiple function names as a list. The result of each function becomes a row in the resulting DataFrame, and the rows are named after the functions.

In [159]: tsdf.agg(['sum'])
Out[159]: 
            A        B         C
sum  0.835673  4.83851 -0.216025

In [160]: tsdf.agg(['sum', 'mean'])
Out[160]: 
             A         B         C
sum   0.835673  4.838510 -0.216025
mean  0.139279  0.806418 -0.036004

On a Series, the result is also a Series, indexed by the function names.

In [161]: tsdf.A.agg(['sum', 'mean'])
Out[161]: 
sum     0.835673
mean    0.139279
Name: A, dtype: float64

If a lambda function is passed, it is named <lambda>:

In [162]: tsdf.A.agg(['sum', lambda x: x.mean()])
Out[162]: 
sum         0.835673
<lambda>    0.139279
Name: A, dtype: float64

If a user-defined function is passed, its name is used for the row:

In [163]: def mymean(x):
   .....:    return x.mean()
   .....: 

In [164]: tsdf.A.agg(['sum', mymean])
Out[164]: 
sum       0.835673
mymean    0.139279
Name: A, dtype: float64

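Aggregating with a dict

Passing a dict that maps column labels to functions lets you customize which functions are applied to which columns.

Note that, in the version described here, the results are not guaranteed to be in any particular order; use an OrderedDict to guarantee ordering.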

In [165]: tsdf.agg({'A': 'mean', 'B': 'sum'})
Out[165]: 
A    0.139279
B    4.838510
dtype: float64

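If a list of functions is given for a column, the result is a DataFrame with the function names as row labels.

Entries for functions that were not applied to a particular column are marked as missing.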

In [166]: tsdf.agg({'A': ['mean', 'min'], 'B': 'sum'})
Out[166]: 
             A        B
mean  0.139279      NaN
min  -1.874526      NaN
sum        NaN  4.83851
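
The two dict forms can be reproduced on a small fixed frame, which makes the numbers easy to verify. This is a sketch with made-up data; results are read back with .loc so that it does not depend on the row order.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

# One function per column: the result is a Series indexed by column label.
per_column = df.agg({"A": "mean", "B": "sum"})

# Lists of functions: the result is a DataFrame with one row per function name,
# and NaN where a function was not requested for that column.
table = df.agg({"A": ["mean", "min"], "B": ["sum"]})

print(per_column)
print(table)
```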

Mixed dtypes

When aggregating data with mixed dtypes, agg only takes the aggregations that are valid for each dtype; those that cannot be computed are ignored.

In [167]: mdf = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [1., 2., 3.],
   .....:                     'C': ['foo', 'bar', 'baz'],
   .....:                     'D': pd.date_range('20130101', periods=3)})
   .....: 

In [168]: mdf.dtypes
Out[168]: 
A             int64
B           float64
C            object
D    datetime64[ns]
dtype: object

In [169]: mdf.agg(['min', 'sum'])
Out[169]: 
     A    B          C          D
min  1  1.0        bar 2013-01-01

Custom describe

By choosing the functions passed to agg() yourself, you can build a custom summary similar to the one produced by describe():

In [170]: from functools import partial

In [171]: q_25 = partial(pd.Series.quantile, q=0.25)

In [172]: q_25.__name__ = '25%'

In [173]: q_75 = partial(pd.Series.quantile, q=0.75)

In [174]: q_75.__name__ = '75%'

In [175]: tsdf.agg(['count', 'mean', 'std', 'min', q_25, 'median', q_75, 'max'])
Out[175]: 
               A         B         C
count   6.000000  6.000000  6.000000
mean    0.139279  0.806418 -0.036004
std     1.323362  1.100830  0.874990
min    -1.874526 -0.916844 -1.461363
25%    -0.696544  0.566876 -0.476920
median  0.491354  0.685467  0.183433
75%     1.148055  1.241393  0.601223
max     1.453046  2.430373  0.835024
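An alternative to setting __name__ on partial objects is to define small named functions, since agg() labels each row with the function's name. The sketch below assumes a hypothetical frame df_demo and hypothetical helper names q_25 and q_75; it is one possible variation, not the approach used in the transcript.

```python
import pandas as pd

# Hypothetical data with quartiles that are easy to check by hand
df_demo = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]})


def q_25(x):
    """Lower quartile of a Series."""
    return x.quantile(0.25)


def q_75(x):
    """Upper quartile of a Series."""
    return x.quantile(0.75)


# Rows are labelled 'count', 'mean', 'q_25', 'median', 'q_75'
summary = df_demo.agg(['count', 'mean', q_25, 'median', q_75])
print(summary)
```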

Transform API

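transform() returns an object with the same index and the same size as the original.

This API lets you apply several operations at once rather than one by one, and it closely mirrors the agg API.

Again, create a frame to use in the examples: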

In [176]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=10))
   .....: 

In [177]: tsdf.iloc[3:7] = np.nan

In [178]: tsdf
Out[178]: 
                   A         B         C
2000-01-01 -0.578465 -0.503335 -0.987140
2000-01-02 -0.767147 -0.266046  1.083797
2000-01-03  0.195348  0.722247 -0.894537
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08 -0.556397  0.542165 -0.308675
2000-01-09 -1.010924 -0.672504 -1.139222
2000-01-10  0.354653  0.563622 -0.365106

As with agg, you can pass a function name as a string, a NumPy function, a user-defined function or a lambda:

In [179]: tsdf.transform(np.abs)
Out[179]: 
                   A         B         C
2000-01-01  0.578465  0.503335  0.987140
2000-01-02  0.767147  0.266046  1.083797
2000-01-03  0.195348  0.722247  0.894537
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.556397  0.542165  0.308675
2000-01-09  1.010924  0.672504  1.139222
2000-01-10  0.354653  0.563622  0.365106

In [180]: tsdf.transform('abs')
Out[180]: 
                   A         B         C
2000-01-01  0.578465  0.503335  0.987140
2000-01-02  0.767147  0.266046  1.083797
2000-01-03  0.195348  0.722247  0.894537
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.556397  0.542165  0.308675
2000-01-09  1.010924  0.672504  1.139222
2000-01-10  0.354653  0.563622  0.365106

In [181]: tsdf.transform(lambda x: x.abs())
Out[181]: 
                   A         B         C
2000-01-01  0.578465  0.503335  0.987140
2000-01-02  0.767147  0.266046  1.083797
2000-01-03  0.195348  0.722247  0.894537
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.556397  0.542165  0.308675
2000-01-09  1.010924  0.672504  1.139222
2000-01-10  0.354653  0.563622  0.365106

In the examples above transform() received a single function, which is equivalent to applying a NumPy ufunc directly:

In [182]: np.abs(tsdf)
Out[182]: 
                   A         B         C
2000-01-01  0.578465  0.503335  0.987140
2000-01-02  0.767147  0.266046  1.083797
2000-01-03  0.195348  0.722247  0.894537
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.556397  0.542165  0.308675
2000-01-09  1.010924  0.672504  1.139222
2000-01-10  0.354653  0.563622  0.365106

Likewise, passing a single function to transform() on a Series returns a Series:

In [183]: tsdf.A.transform(np.abs)
Out[183]: 
2000-01-01    0.578465
2000-01-02    0.767147
2000-01-03    0.195348
2000-01-04         NaN
2000-01-05         NaN
2000-01-06         NaN
2000-01-07         NaN
2000-01-08    0.556397
2000-01-09    1.010924
2000-01-10    0.354653
Freq: D, Name: A, dtype: float64

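Transform with multiple functions

Passing multiple functions to transform() yields a DataFrame with MultiIndex columns.

The first level holds the original column names; the second level holds the names of the transforming functions.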

In [184]: tsdf.transform([np.abs, lambda x: x+1])
Out[184]: 
                   A                   B                   C          
            absolute  <lambda>  absolute  <lambda>  absolute  <lambda>
2000-01-01  0.578465  0.421535  0.503335  0.496665  0.987140  0.012860
2000-01-02  0.767147  0.232853  0.266046  0.733954  1.083797  2.083797
2000-01-03  0.195348  1.195348  0.722247  1.722247  0.894537  0.105463
2000-01-04       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-08  0.556397  0.443603  0.542165  1.542165  0.308675  0.691325
2000-01-09  1.010924 -0.010924  0.672504  0.327496  1.139222 -0.139222
2000-01-10  0.354653  1.354653  0.563622  1.563622  0.365106  0.634894

Passing multiple functions to transform() on a Series yields a DataFrame whose column names are the names of the transforming functions.

In [185]: tsdf.A.transform([np.abs, lambda x: x+1])
Out[185]: 
            absolute  <lambda>
2000-01-01  0.578465  0.421535
2000-01-02  0.767147  0.232853
2000-01-03  0.195348  1.195348
2000-01-04       NaN       NaN
2000-01-05       NaN       NaN
2000-01-06       NaN       NaN
2000-01-07       NaN       NaN
2000-01-08  0.556397  0.443603
2000-01-09  1.010924 -0.010924
2000-01-10  0.354653  1.354653

Transforming with a dict

As with agg, passing a dict that maps column labels to functions lets you choose which function is applied to which column.

In [186]: tsdf.transform({'A': np.abs, 'B': lambda x: x+1})
Out[186]: 
                   A         B
2000-01-01  0.578465  0.496665
2000-01-02  0.767147  0.733954
2000-01-03  0.195348  1.722247
2000-01-04       NaN       NaN
2000-01-05       NaN       NaN
2000-01-06       NaN       NaN
2000-01-07       NaN       NaN
2000-01-08  0.556397  1.542165
2000-01-09  1.010924  0.327496
2000-01-10  0.354653  1.563622

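If a dict value is a list of functions, the result is a DataFrame with MultiIndex columns holding the selected transforms: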

In [187]: tsdf.transform({'A': np.abs, 'B': [lambda x: x+1, 'sqrt']})
Out[187]: 
                   A         B          
            absolute  <lambda>      sqrt
2000-01-01  0.578465  0.496665       NaN
2000-01-02  0.767147  0.733954       NaN
2000-01-03  0.195348  1.722247  0.849851
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.556397  1.542165  0.736318
2000-01-09  1.010924  0.327496       NaN
2000-01-10  0.354653  1.563622  0.750748
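The key difference between transform() and agg() is the shape of the result. A minimal sketch, using a hypothetical three-row frame df_demo:

```python
import pandas as pd

# Hypothetical data with a labelled index
df_demo = pd.DataFrame({'A': [-1.0, 2.0, -3.0]}, index=['x', 'y', 'z'])

# transform keeps the original index and shape
transformed = df_demo.transform('abs')

# agg reduces each column to a single value
aggregated = df_demo.agg('sum')

print(transformed)
print(aggregated)
```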

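Reindexing and altering labels

reindex() is the fundamental data alignment method in pandas. Nearly every other feature that relies on label alignment is built on it.

To reindex means to conform the data to a given set of labels along a particular axis.

reindex() accomplishes three things:

1. Reorders the existing data to match a new set of labels

2. Inserts missing-value (NaN) markers at label locations where no data exists for that label

3. If requested, fills data for missing labels using a fill logic (especially relevant for time series data)

Here is an example: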

In [216]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [217]: s
Out[217]: 
a   -0.454087
b   -0.360309
c   -0.951631
d   -0.535459
e    0.835231
dtype: float64

In [218]: s.reindex(['e', 'b', 'f', 'd'])
Out[218]: 
e    0.835231
b   -0.360309
f         NaN
d   -0.535459
dtype: float64

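Here the label f was not present in the original Series, so it appears as NaN in the result.

For a DataFrame, you can reindex the index and the columns at the same time: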

In [219]: df
Out[219]: 
        one       two     three
a -1.101558  1.124472       NaN
b -0.177289  2.487104 -0.634293
c  0.462215 -0.486066  1.931194
d       NaN -0.456288 -1.222918

In [220]: df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])
Out[220]: 
      three       two       one
c  1.931194 -0.486066  0.462215
f       NaN       NaN       NaN
b -0.634293  2.487104 -0.177289

You can also use reindex() with the axis keyword:

In [221]: df.reindex(['c', 'f', 'b'], axis='index')
Out[221]: 
        one       two     three
c  0.462215 -0.486066  1.931194
f       NaN       NaN       NaN
b -0.177289  2.487104 -0.634293

Note that the Index objects holding the actual axis labels can be shared between objects. So if we have a Series and a DataFrame, the following can be done:

In [222]: rs = s.reindex(df.index)

In [223]: rs
Out[223]: 
a   -0.454087
b   -0.360309
c   -0.951631
d   -0.535459
dtype: float64

In [224]: rs.index is df.index
Out[224]: True

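This means that the index of the reindexed Series is the very same Python object as the index of the DataFrame.

DataFrame.reindex() also supports an "axis-style" calling convention, where you pass the labels and use the axis argument to say which axis they apply to.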

In [225]: df.reindex(['c', 'f', 'b'], axis='index')
Out[225]: 
        one       two     three
c  0.462215 -0.486066  1.931194
f       NaN       NaN       NaN
b -0.177289  2.487104 -0.634293

In [226]: df.reindex(['three', 'two', 'one'], axis='columns')
Out[226]: 
      three       two       one
a       NaN  1.124472 -1.101558
b -0.634293  2.487104 -0.177289
c  1.931194 -0.486066  0.462215
d -1.222918 -0.456288       NaN
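The reordering and NaN insertion described above can be checked on fixed data. In this sketch s_demo is a hypothetical Series, and fill_value is a reindex() parameter not used in the transcript; it replaces the NaN placeholder for labels that are new.

```python
import pandas as pd

# Hypothetical data
s_demo = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

# Existing labels are reordered; the unknown label 'x' becomes NaN
reordered = s_demo.reindex(['c', 'x', 'a'])

# fill_value supplies a default for labels that did not exist before
with_default = s_demo.reindex(['c', 'x', 'a'], fill_value=0.0)

print(reordered)
print(with_default)
```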

Reindexing to align with another object

When you want to reindex an object so that its labels match those of another object, reindex_like() is a convenient shortcut:

In [227]: df2
Out[227]: 
        one       two
a -1.101558  1.124472
b -0.177289  2.487104
c  0.462215 -0.486066

In [228]: df3
Out[228]: 
        one       two
a -0.829347  0.082635
b  0.094922  1.445267
c  0.734426 -1.527903

In [229]: df.reindex_like(df2)
Out[229]: 
        one       two
a -1.101558  1.124472
b -0.177289  2.487104
c  0.462215 -0.486066

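Aligning objects with each other with align

align() is the fastest way to align two objects simultaneously. It accepts a join argument with the following options:

1. join='outer': take the union of the indexes (the default)

2. join='left': use the calling object's index

3. join='right': use the passed object's index

4. join='inner': take the intersection of the indexes

For Series, align() returns a tuple containing both realigned Series.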

In [230]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [231]: s1 = s[:4]

In [232]: s2 = s[1:]

In [233]: s1.align(s2)
Out[233]: 
(a    0.505453
 b    1.788110
 c   -0.405908
 d   -0.801912
 e         NaN
 dtype: float64, a         NaN
 b    1.788110
 c   -0.405908
 d   -0.801912
 e    0.768460
 dtype: float64)

In [234]: s1.align(s2, join='inner')
Out[234]: 
(b    1.788110
 c   -0.405908
 d   -0.801912
 dtype: float64, b    1.788110
 c   -0.405908
 d   -0.801912
 dtype: float64)

In [235]: s1.align(s2, join='left')
Out[235]: 
(a    0.505453
 b    1.788110
 c   -0.405908
 d   -0.801912
 dtype: float64, a         NaN
 b    1.788110
 c   -0.405908
 d   -0.801912
 dtype: float64)

For DataFrames, the join is applied to both the index and the columns by default.

In [236]: df.align(df2, join='inner')
Out[236]: 
(        one       two
 a -1.101558  1.124472
 b -0.177289  2.487104
 c  0.462215 -0.486066,         one       two
 a -1.101558  1.124472
 b -0.177289  2.487104
 c  0.462215 -0.486066)

You can also pass the axis argument to align on one axis only:

In [237]: df.align(df2, join='inner', axis=0)
Out[237]: 
(        one       two     three
 a -1.101558  1.124472       NaN
 b -0.177289  2.487104 -0.634293
 c  0.462215 -0.486066  1.931194,         one       two
 a -1.101558  1.124472
 b -0.177289  2.487104
 c  0.462215 -0.486066)

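When aligning a DataFrame with a Series, you must pass the axis argument to say whether the Series is aligned with the DataFrame's index or with its columns.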

In [238]: df.align(df2.iloc[0], axis=1)
Out[238]: 
(        one     three       two
 a -1.101558       NaN  1.124472
 b -0.177289 -0.634293  2.487104
 c  0.462215  1.931194 -0.486066
 d       NaN -1.222918 -0.456288, one     -1.101558
 three         NaN
 two      1.124472
 Name: a, dtype: float64)
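The join options are easier to compare on two small Series with partially overlapping labels. A sketch with hypothetical names left and right:

```python
import pandas as pd

# Hypothetical data: labels 'b' and 'c' are shared
left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([20, 30, 40], index=['b', 'c', 'd'])

# Default join='outer': both results carry the union of the labels
l_outer, r_outer = left.align(right)

# join='inner': both results carry only the shared labels
l_inner, r_inner = left.align(right, join='inner')

print(l_outer.index.tolist(), l_inner.index.tolist())
```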

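Filling while reindexing

reindex() takes an optional method argument that selects how values are filled:

Method	Action
pad / ffill	Fill values forward
bfill / backfill	Fill values backward
nearest	Fill from the nearest index value

A simple Series illustrates these fill methods: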

In [239]: rng = pd.date_range('1/3/2000', periods=8)

In [240]: ts = pd.Series(np.random.randn(8), index=rng)

In [241]: ts2 = ts.iloc[[0, 3, 6]]

In [242]: ts
Out[242]: 
2000-01-03    0.466284
2000-01-04   -0.457411
2000-01-05   -0.364060
2000-01-06    0.785367
2000-01-07   -1.463093
2000-01-08    1.187315
2000-01-09   -0.493153
2000-01-10   -1.323445
Freq: D, dtype: float64

In [243]: ts2
Out[243]: 
2000-01-03    0.466284
2000-01-06    0.785367
2000-01-09   -0.493153
dtype: float64

In [244]: ts2.reindex(ts.index)
Out[244]: 
2000-01-03    0.466284
2000-01-04         NaN
2000-01-05         NaN
2000-01-06    0.785367
2000-01-07         NaN
2000-01-08         NaN
2000-01-09   -0.493153
2000-01-10         NaN
Freq: D, dtype: float64

In [245]: ts2.reindex(ts.index, method='ffill')
Out[245]: 
2000-01-03    0.466284
2000-01-04    0.466284
2000-01-05    0.466284
2000-01-06    0.785367
2000-01-07    0.785367
2000-01-08    0.785367
2000-01-09   -0.493153
2000-01-10   -0.493153
Freq: D, dtype: float64

In [246]: ts2.reindex(ts.index, method='bfill')
Out[246]: 
2000-01-03    0.466284
2000-01-04    0.785367
2000-01-05    0.785367
2000-01-06    0.785367
2000-01-07   -0.493153
2000-01-08   -0.493153
2000-01-09   -0.493153
2000-01-10         NaN
Freq: D, dtype: float64

In [247]: ts2.reindex(ts.index, method='nearest')
Out[247]: 
2000-01-03    0.466284
2000-01-04    0.466284
2000-01-05    0.785367
2000-01-06    0.785367
2000-01-07    0.785367
2000-01-08   -0.493153
2000-01-09   -0.493153
2000-01-10   -0.493153
Freq: D, dtype: float64

Note that these methods require the index to be monotonically increasing or decreasing.

Except for method='nearest', the same results can be obtained with fillna() or interpolate():

In [248]: ts2.reindex(ts.index).fillna(method='ffill')
Out[248]: 
2000-01-03    0.466284
2000-01-04    0.466284
2000-01-05    0.466284
2000-01-06    0.785367
2000-01-07    0.785367
2000-01-08    0.785367
2000-01-09   -0.493153
2000-01-10   -0.493153
Freq: D, dtype: float64

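There is one catch: when a fill method is given, reindex() raises a ValueError if the index is not monotonically increasing or decreasing.

fillna() and interpolate(), on the other hand, do not check the order of the index.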
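The difference can be seen with an index that is deliberately out of order. This sketch uses a hypothetical Series s_unsorted and the ffill() method as the fill step after a plain reindex():

```python
import pandas as pd

# Hypothetical data with a non-monotonic index
s_unsorted = pd.Series([1.0, 2.0, 3.0], index=[3, 1, 2])

# reindex with a fill method refuses to work on a non-monotonic index
try:
    s_unsorted.reindex([1, 2, 3, 4], method='ffill')
    raised = False
except ValueError:
    raised = True

# Reindexing first and filling afterwards performs no such check;
# it simply propagates the previous row's value
filled = s_unsorted.reindex([1, 2, 3, 4]).ffill()

print(raised)
print(filled)
```

Note that the second form fills by position in the new order, so the two approaches only agree when the index is sorted.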

Limits on filling while reindexing

The limit and tolerance arguments give extra control over filling while reindexing.

limit specifies the maximum number of consecutive values to fill:

In [249]: ts2.reindex(ts.index, method='ffill', limit=1)
Out[249]: 
2000-01-03    0.466284
2000-01-04    0.466284
2000-01-05         NaN
2000-01-06    0.785367
2000-01-07    0.785367
2000-01-08         NaN
2000-01-09   -0.493153
2000-01-10   -0.493153
Freq: D, dtype: float64

In contrast, tolerance specifies the maximum distance between the requested index label and the label of the value used to fill it:

In [250]: ts2.reindex(ts.index, method='ffill', tolerance='1 day')
Out[250]: 
2000-01-03    0.466284
2000-01-04    0.466284
2000-01-05         NaN
2000-01-06    0.785367
2000-01-07    0.785367
2000-01-08         NaN
2000-01-09   -0.493153
2000-01-10   -0.493153
Freq: D, dtype: float64

Note that when used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance is coerced to a Timedelta if possible.

This is why tolerance can be given as a suitable string such as '1 day'.
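In the two transcripts above limit=1 and tolerance='1 day' happen to give the same result because the target index is evenly spaced. With an unevenly spaced target the two arguments behave differently. A sketch on a hypothetical integer-indexed Series s_sparse:

```python
import pandas as pd

# Hypothetical data: values exist only at labels 0 and 10
s_sparse = pd.Series([1.0, 2.0], index=[0, 10])
target = [0, 5, 10]

# limit counts consecutive filled positions: label 5 is the first
# (and only) position filled from label 0, so it is allowed
by_limit = s_sparse.reindex(target, method='ffill', limit=1)

# tolerance measures label distance: 5 is more than 1 away from 0,
# so it stays NaN
by_tolerance = s_sparse.reindex(target, method='ffill', tolerance=1)

print(by_limit)
print(by_tolerance)
```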

Dropping labels from an axis

A method closely related to reindex() is drop(). It removes a set of labels from an axis:

In [251]: df
Out[251]: 
        one       two     three
a -1.101558  1.124472       NaN
b -0.177289  2.487104 -0.634293
c  0.462215 -0.486066  1.931194
d       NaN -0.456288 -1.222918

In [252]: df.drop(['a', 'd'], axis=0)
Out[252]: 
        one       two     three
b -0.177289  2.487104 -0.634293
c  0.462215 -0.486066  1.931194

In [253]: df.drop(['one'], axis=1)
Out[253]: 
        two     three
a  1.124472       NaN
b  2.487104 -0.634293
c -0.486066  1.931194
d -0.456288 -1.222918

Note that the following also works, but is less obvious:

In [254]: df.reindex(df.index.difference(['a', 'd']))
Out[254]: 
        one       two     three
b -0.177289  2.487104 -0.634293
c  0.462215 -0.486066  1.931194

Renaming / mapping labels

rename() relabels an axis based on a mapping (a dict or Series) or an arbitrary function.

In [255]: s
Out[255]: 
a    0.505453
b    1.788110
c   -0.405908
d   -0.801912
e    0.768460
dtype: float64

In [256]: s.rename(str.upper)
Out[256]: 
A    0.505453
B    1.788110
C   -0.405908
D   -0.801912
E    0.768460
dtype: float64

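If you pass a function, it must return a value when called with any of the labels, and the returned values must be unique. A dict or Series can also be used: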

In [257]: df.rename(columns={'one': 'foo', 'two': 'bar'},
   .....:           index={'a': 'apple', 'b': 'banana', 'd': 'durian'})
   .....: 
Out[257]: 
             foo       bar     three
apple  -1.101558  1.124472       NaN
banana -0.177289  2.487104 -0.634293
c       0.462215 -0.486066  1.931194
durian       NaN -0.456288 -1.222918

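If the mapping does not contain a given index or column label, that label is left unchanged. Extra keys in the mapping that match no label are ignored and do not raise an error.

DataFrame.rename() also supports the axis argument, which says which axis the mapping applies to.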

In [258]: df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')
Out[258]: 
        foo       bar     three
a -1.101558  1.124472       NaN
b -0.177289  2.487104 -0.634293
c  0.462215 -0.486066  1.931194
d       NaN -0.456288 -1.222918

In [259]: df.rename({'a': 'apple', 'b': 'banana', 'd': 'durian'}, axis='index')
Out[259]: 
             one       two     three
apple  -1.101558  1.124472       NaN
banana -0.177289  2.487104 -0.634293
c       0.462215 -0.486066  1.931194
durian       NaN -0.456288 -1.222918
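The handling of unmatched keys, and the fact that rename() returns a new object, can be checked with a short sketch. The frame df_demo and the key 'missing' are hypothetical.

```python
import pandas as pd

# Hypothetical data
df_demo = pd.DataFrame({'one': [1, 2], 'two': [3, 4]}, index=['a', 'b'])

# 'missing' matches no column: it is ignored without an error,
# and 'two' keeps its name because it is not in the mapping
renamed = df_demo.rename(columns={'one': 'foo', 'missing': 'bar'})

print(renamed.columns.tolist())
print(df_demo.columns.tolist())  # the original is unchanged
```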

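By default rename() does not modify the original object; it returns a new one.

It also provides an inplace parameter, which is False by default. Pass inplace=True to rename the data in place.

Finally, rename() also accepts a scalar, which changes the name attribute of a Series.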

In [260]: s.rename("scalar-name")
Out[260]: 
a    0.505453
b    1.788110
c   -0.405908
d   -0.801912
e    0.768460
Name: scalar-name, dtype: float64

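Iteration

The behavior of basic iteration over pandas objects depends on the type.

Iterating over a Series is like iterating over an array: it yields the values one by one.

Iterating over a DataFrame follows the dict-like convention: it yields the keys, which are the column labels.

For example: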

In [261]: df = pd.DataFrame({'col1' : np.random.randn(3), 'col2' : np.random.randn(3)},
   .....:                   index=['a', 'b', 'c'])
   .....: 

In [262]: for col in df:
   .....:     print(col)
   .....: 
col1
col2

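pandas objects also have a method analogous to dict.items() for iterating over (key, value) pairs. For a DataFrame it yields tuples whose first element is the column label and whose second element is that column as a Series.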

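To iterate over the rows of a DataFrame, you can use the following two methods:

iterrows(): yields (index, Series) tuples. Each row is converted into a Series indexed by the column labels, and this conversion has a performance cost.

itertuples(): yields each row as a namedtuple whose fields are named after the column labels, with the row's index value as the first field. It is considerably faster than iterrows() and is in most cases the preferable way to iterate over the values of a DataFrame.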

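In general, though, iterating over pandas objects is slow. In many cases it is neither necessary nor advisable, and one of the following approaches can be used instead:

1. Look for a vectorized solution: many operations can be performed with built-in methods or NumPy functions.

2. When a function has to be applied to a whole DataFrame or Series, prefer apply() over iterating over the values.

3. If you really need to iterate over the values and performance matters, consider writing the inner loop with cython or numba.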
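As an illustration of the first suggestion, the same row-wise sum can be written as an explicit loop or as a vectorized expression. The frame df_demo is hypothetical; no timing claims are made here, only that both forms give the same values.

```python
import pandas as pd

# Hypothetical data
df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Explicit iteration over rows
totals_loop = [row.a + row.b for row in df_demo.itertuples()]

# Vectorized equivalent: one operation on whole columns
totals_vectorized = df_demo['a'] + df_demo['b']

print(totals_loop)
print(totals_vectorized.tolist())
```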

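Warning: never modify something you are iterating over, because this is not guaranteed to work in all cases. Depending on the data types, the iterator may return a copy rather than a view, and writing to it has no effect. For example: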

In [263]: df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

In [264]: for index, row in df.iterrows():
   .....:     row['a'] = 10
   .....: 

In [265]: df
Out[265]: 
   a  b
0  1  a
1  2  b
2  3  c

iteritems

iteritems() iterates over key-value pairs, consistent with the dict-like interface:

Series: (index, scalar value) pairs
DataFrame: (column label, column as a Series) pairs


In [266]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

In [267]: for col, series in df.iteritems():
   .....:     print(col)
   .....:     print(series)
   .....: 

a
0    1
1    2
2    3
Name: a, dtype: int64
b
0    4
1    5
2    6
Name: b, dtype: int64
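Readers on recent pandas releases should be aware that iteritems() was deprecated and then removed in pandas 2.0; items() yields the same (label, Series) pairs. A sketch of the same loop with items(), on a hypothetical frame df_demo:

```python
import pandas as pd

# Hypothetical data
df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# items() yields (column label, column as a Series), like iteritems()
column_sums = [(label, int(column.sum())) for label, column in df_demo.items()]

print(column_sums)
```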

iterrows

iterrows() iterates over the rows of a DataFrame. It yields a tuple of the row's index value and a Series holding the row's data:

for row_index, row in df.iterrows():
   .....:     print('%s\n%s' % (row_index, row))
   .....: 
0
a    1
b    a
Name: 0, dtype: object
1
a    2
b    b
Name: 1, dtype: object
2
a    3
b    c
Name: 2, dtype: object

Note: because iterrows() returns a Series for each row, it does not preserve dtypes across the row. For example:

In [268]: df_orig = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])

In [269]: df_orig.dtypes
Out[269]: 
int        int64
float    float64
dtype: object

In [270]: row = next(df_orig.iterrows())[1]

In [271]: row
Out[271]: 
int      1.0
float    1.5
Name: 0, dtype: float64

Because each row is returned as a single Series, all of its values are upcast to a common dtype, here float, including the value from the int column:

In [272]: row['int'].dtype
Out[272]: dtype('float64')

In [273]: df_orig['int'].dtype
Out[273]: dtype('int64')

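To preserve dtypes while iterating over rows, it is better to use itertuples(), which returns namedtuples of the values and is generally much faster than iterrows().

As a contrived example, a DataFrame can be transposed by iterating over its rows: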

In [274]: df2 = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

In [275]: print(df2)
   x  y
0  1  4
1  2  5
2  3  6

In [276]: print(df2.T)
   0  1  2
x  1  2  3
y  4  5  6

In [277]: df2_t = pd.DataFrame(dict((idx,values) for idx, values in df2.iterrows()))

In [278]: print(df2_t)
   0  1  2
x  1  2  3
y  4  5  6

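itertuples

itertuples() returns an iterator that yields a namedtuple for each row of the DataFrame. The first element of the tuple is the row's index value, and the remaining elements are the row's values.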

In [279]: for row in df.itertuples():
   .....:     print(row)
   .....: 
Pandas(Index=0, a=1, b='a')
Pandas(Index=1, a=2, b='b')
Pandas(Index=2, a=3, b='c')

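This method does not convert the row to a Series; it just returns the values inside a namedtuple. itertuples() therefore preserves the data type of each value and is generally faster than iterrows().

Note: column names are renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.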
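Both points can be checked side by side: the dtype handling of iterrows() versus itertuples(), and the positional renaming of invalid field names. The frames df_types and df_spaces are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one integer column and one float column
df_types = pd.DataFrame({'n': [1], 'x': [1.5]})

# iterrows packs the row into one Series, so the integer becomes a float
row_series = next(df_types.iterrows())[1]

# itertuples keeps each value's own type
row_tuple = next(df_types.itertuples())

# A column name containing a space is not a valid identifier,
# so its field gets a positional name
df_spaces = pd.DataFrame({'my col': [1]})
fields = next(df_spaces.itertuples())._fields

print(row_series.dtype, type(row_tuple.n), fields)
```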

.dt accessor

A Series of datetime/period-like values has an accessor, .dt, that succinctly returns datetime-like properties of its values. The result is a Series with the same index as the original.

# datetime
In [280]: s = pd.Series(pd.date_range('20130101 09:10:12', periods=4))

In [281]: s
Out[281]: 
0   2013-01-01 09:10:12
1   2013-01-02 09:10:12
2   2013-01-03 09:10:12
3   2013-01-04 09:10:12
dtype: datetime64[ns]

In [282]: s.dt.hour
Out[282]: 
0    9
1    9
2    9
3    9
dtype: int64

In [283]: s.dt.second
Out[283]: 
0    12
1    12
2    12
3    12
dtype: int64

In [284]: s.dt.day
Out[284]: 
0    1
1    2
2    3
3    4
dtype: int64

This enables concise expressions such as:

In [285]: s[s.dt.day==2]
Out[285]: 
1   2013-01-02 09:10:12
dtype: datetime64[ns]

You can easily produce timezone-aware transformations:

In [286]: stz = s.dt.tz_localize('US/Eastern')

In [287]: stz
Out[287]: 
0   2013-01-01 09:10:12-05:00
1   2013-01-02 09:10:12-05:00
2   2013-01-03 09:10:12-05:00
3   2013-01-04 09:10:12-05:00
dtype: datetime64[ns, US/Eastern]

In [288]: stz.dt.tz
Out[288]: <DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>

These operations can also be chained:

In [289]: s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Out[289]: 
0   2013-01-01 04:10:12-05:00
1   2013-01-02 04:10:12-05:00
2   2013-01-03 04:10:12-05:00
3   2013-01-04 04:10:12-05:00
dtype: datetime64[ns, US/Eastern]

You can also format datetime values as strings with Series.dt.strftime(), which supports the same format codes as the standard strftime():

# DatetimeIndex
In [290]: s = pd.Series(pd.date_range('20130101', periods=4))

In [291]: s
Out[291]: 
0   2013-01-01
1   2013-01-02
2   2013-01-03
3   2013-01-04
dtype: datetime64[ns]

In [292]: s.dt.strftime('%Y/%m/%d')
Out[292]: 
0    2013/01/01
1    2013/01/02
2    2013/01/03
3    2013/01/04
dtype: object

# PeriodIndex
In [293]: s = pd.Series(pd.period_range('20130101', periods=4))

In [294]: s
Out[294]: 
0   2013-01-01
1   2013-01-02
2   2013-01-03
3   2013-01-04
dtype: object

In [295]: s.dt.strftime('%Y/%m/%d')
Out[295]: 
0    2013/01/01
1    2013/01/02
2    2013/01/03
3    2013/01/04
dtype: object

The .dt accessor also works for period and timedelta dtypes:

# period
In [296]: s = pd.Series(pd.period_range('20130101', periods=4, freq='D'))

In [297]: s
Out[297]: 
0   2013-01-01
1   2013-01-02
2   2013-01-03
3   2013-01-04
dtype: object

In [298]: s.dt.year
Out[298]: 
0    2013
1    2013
2    2013
3    2013
dtype: int64

In [299]: s.dt.day
Out[299]: 
0    1
1    2
2    3
3    4
dtype: int64

# timedelta
In [300]: s = pd.Series(pd.timedelta_range('1 day 00:00:05', periods=4, freq='s'))

In [301]: s
Out[301]: 
0   1 days 00:00:05
1   1 days 00:00:06
2   1 days 00:00:07
3   1 days 00:00:08
dtype: timedelta64[ns]

In [302]: s.dt.days
Out[302]: 
0    1
1    1
2    1
3    1
dtype: int64

In [303]: s.dt.seconds
Out[303]: 
0    5
1    6
2    7
3    8
dtype: int64

In [304]: s.dt.components
Out[304]: 
   days  hours  minutes  seconds  milliseconds  microseconds  nanoseconds
0     1      0        0        5             0             0            0
1     1      0        0        6             0             0            0
2     1      0        0        7             0             0            0
3     1      0        0        8             0             0            0

Finally, note that using the .dt accessor on values that are not datetime-like raises an error.
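A short sketch of that failure mode and the usual remedy. The Series s_text is hypothetical; in the pandas versions I am aware of the error is an AttributeError, and converting with pd.to_datetime() first makes the accessor available.

```python
import pandas as pd

# Hypothetical data: date-like strings, not datetime values
s_text = pd.Series(['2013-01-01', '2013-01-02'])

# Strings are not datetime-like, so .dt is not available
try:
    s_text.dt
    dt_failed = False
except AttributeError:
    dt_failed = True

# After conversion the accessor works as usual
days = pd.to_datetime(s_text).dt.day

print(dt_failed)
print(days.tolist())
```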

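Vectorized string methods

Series is equipped with a set of string processing methods that make it easy to operate on each element of the array. Most importantly, these methods automatically skip missing/NA values. They are accessed through the Series's str attribute and generally have names matching the built-in string methods. For example: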

In [305]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [306]: s.str.lower()
Out[306]: 
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

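Powerful pattern-matching methods are provided as well; note that pattern matching generally uses regular expressions. For details, see the article on text processing in pandas.

Sorting

pandas supports three kinds of sorting: sorting by index labels, sorting by values, and sorting by a combination of both.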

By Index

Series.sort_index() and DataFrame.sort_index() sort a pandas object by its index.

The ascending argument controls the sort direction, and for a DataFrame the axis argument selects the axis to sort along.

In [307]: df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   .....:                    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   .....:                    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
   .....: 

In [308]: unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
   .....:                          columns=['three', 'two', 'one'])
   .....: 

In [309]: unsorted_df
Out[309]: 
      three       two       one
a       NaN  0.708543  0.036274
d -0.540166  0.586626       NaN
c  0.410238  1.121731  1.044630
b -0.282532 -2.038777 -0.490032

# DataFrame
In [310]: unsorted_df.sort_index()
Out[310]: 
      three       two       one
a       NaN  0.708543  0.036274
b -0.282532 -2.038777 -0.490032
c  0.410238  1.121731  1.044630
d -0.540166  0.586626       NaN

In [311]: unsorted_df.sort_index(ascending=False)
Out[311]: 
      three       two       one
d -0.540166  0.586626       NaN
c  0.410238  1.121731  1.044630
b -0.282532 -2.038777 -0.490032
a       NaN  0.708543  0.036274

In [312]: unsorted_df.sort_index(axis=1)
Out[312]: 
        one     three       two
a  0.036274       NaN  0.708543
d       NaN -0.540166  0.586626
c  1.044630  0.410238  1.121731
b -0.490032 -0.282532 -2.038777

# Series
In [313]: unsorted_df['three'].sort_index()
Out[313]: 
a         NaN
b   -0.282532
c    0.410238
d   -0.540166
Name: three, dtype: float64

By Values

Series.sort_values() sorts a Series by its values.

DataFrame.sort_values() sorts a DataFrame by its column or row values. The by argument specifies one or more columns that determine the sort order.

In [314]: df1 = pd.DataFrame({'one':[2,1,1,1],'two':[1,3,2,4],'three':[5,4,3,2]})

In [315]: df1.sort_values(by='two')
Out[315]: 
   one  two  three
0    2    1      5
2    1    2      3
1    1    3      4
3    1    4      2

The by argument can take a list of column labels:

In [316]: df1[['one', 'two', 'three']].sort_values(by=['one','two'])
Out[316]: 
   one  two  three
2    1    2      3
1    1    3      4
3    1    4      2
0    2    1      5

The na_position argument gives special treatment to NA values:

In [317]: s[2] = np.nan

In [318]: s.sort_values()
Out[318]: 
0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
2     NaN
5     NaN
dtype: object

In [319]: s.sort_values(na_position='first')
Out[319]: 
2     NaN
5     NaN
0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
dtype: object

By Indexes and Values

Strings passed as the by argument to DataFrame.sort_values() may refer to either column names or index level names.

# Build MultiIndex
In [320]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
   .....:                                 ('b', 2), ('b', 1), ('b', 1)])
   .....: 

In [321]: idx.names = ['first', 'second']

# Build DataFrame
In [322]: df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
   .....:                         index=idx)
   .....: 

In [323]: df_multi
Out[323]: 
              A
first second   
a     1       6
      2       5
      2       4
b     2       3
      1       2
      1       1

Sort by the 'second' index level and the 'A' column:

In [324]: df_multi.sort_values(by=['second', 'A'])
Out[324]: 
              A
first second   
b     1       1
      1       2
a     1       6
b     2       3
a     2       4
      2       5

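Note: if a string matches both a column name and an index level name, a warning is issued and the column takes precedence.

This will result in an ambiguity error in a future version.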

searchsorted

Series has a searchsorted() method that works like numpy.ndarray.searchsorted(): it finds the positions at which elements should be inserted to maintain order.

In [325]: ser = pd.Series([1, 2, 3])

In [326]: ser.searchsorted([0, 3])
Out[326]: array([0, 2])

In [327]: ser.searchsorted([0, 4])
Out[327]: array([0, 3])

In [328]: ser.searchsorted([1, 3], side='right')
Out[328]: array([1, 3])

In [329]: ser.searchsorted([1, 3], side='left')
Out[329]: array([0, 2])

In [330]: ser = pd.Series([3, 1, 2])

In [331]: ser.searchsorted([0, 3], sorter=np.argsort(ser))
Out[331]: array([0, 2])

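smallest / largest values

Series has the nsmallest() and nlargest() methods, which return the n smallest or largest values. For a large Series this can be much faster than sorting the entire Series and calling head(n) on the result.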

In [332]: s = pd.Series(np.random.permutation(10))

In [333]: s
Out[333]: 
0    8
1    2
2    9
3    5
4    6
5    0
6    1
7    7
8    4
9    3
dtype: int64

In [334]: s.sort_values()
Out[334]: 
5    0
6    1
1    2
9    3
8    4
3    5
4    6
7    7
0    8
2    9
dtype: int64

In [335]: s.nsmallest(3)
Out[335]: 
5    0
6    1
1    2
dtype: int64

In [336]: s.nlargest(3)
Out[336]: 
2    9
0    8
7    7
dtype: int64

DataFrame also has the nlargest() and nsmallest() methods:

In [337]: df = pd.DataFrame({'a': [-2, -1, 1, 10, 8, 11, -1],
   .....:                    'b': list('abdceff'),
   .....:                    'c': [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})
   .....: 

In [338]: df.nlargest(3, 'a')
Out[338]: 
    a  b    c
5  11  f  3.0
3  10  c  3.2
4   8  e  NaN

In [339]: df.nlargest(5, ['a', 'c'])
Out[339]: 
    a  b    c
6  -1  f  4.0
5  11  f  3.0
3  10  c  3.2
4   8  e  NaN
2   1  d  4.0

In [340]: df.nsmallest(3, 'a')
Out[340]: 
   a  b    c
0 -2  a  1.0
1 -1  b  2.0
6 -1  f  4.0

In [341]: df.nsmallest(5, ['a', 'c'])
Out[341]: 
   a  b    c
0 -2  a  1.0
2  1  d  4.0
4  8  e  NaN
1 -1  b  2.0
6 -1  f  4.0

Sorting by a multi-index column

When the columns are a MultiIndex, you must be explicit and fully specify all levels in by.

In [342]: df1.columns = pd.MultiIndex.from_tuples([('a','one'),('a','two'),('b','three')])

In [343]: df1.sort_values(by=('a','two'))
Out[343]: 
    a         b
  one two three
0   2   1     5
2   1   2     3
1   1   3     4
3   1   4     2

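Copying

The copy() method on pandas objects copies the underlying data and returns a new object. Note that it is seldom necessary to copy objects.

There are only a handful of ways to alter a DataFrame in place:

Inserting, deleting, or modifying a column
Assigning to the index or columns attributes
For homogeneous data, directly modifying the values via the values attribute or advanced indexing

To be clear, no pandas method has the side effect of modifying your data; almost every method returns a new object and leaves the original untouched. If the data is modified, it is because you did so explicitly.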
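A minimal sketch of both points, using a hypothetical frame original: an explicit assignment changes an object in place, and a copy made beforehand is independent of later changes.

```python
import pandas as pd

# Hypothetical data
original = pd.DataFrame({'a': [1, 2, 3]})

# copy() duplicates the underlying data
duplicate = original.copy()

# An explicit assignment modifies the copy in place...
duplicate.loc[0, 'a'] = 100

# ...while a method call such as sort_values returns a new object
sorted_copy = duplicate.sort_values('a')

print(original['a'].tolist())
print(duplicate['a'].tolist())
print(sorted_copy['a'].tolist())
```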

Dtype

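The main types stored in pandas objects are float, int, bool, datetime64[ns] and datetime64[ns, tz], timedelta64[ns], category and object.

In addition, these dtypes have item sizes, e.g. int64 and int32; an int64 value takes more storage than an int32 value.

For details on the datetime64[ns, tz] dtype, see the article on time series / date functionality in pandas.

The dtypes attribute of a DataFrame returns a Series with the data type of each column.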

In [344]: dft = pd.DataFrame(dict(A = np.random.rand(3),
   .....:                         B = 1,
   .....:                         C = 'foo',
   .....:                         D = pd.Timestamp('20010102'),
   .....:                         E = pd.Series([1.0]*3).astype('float32'),
   .....:                                     F = False,
   .....:                                     G = pd.Series([1]*3,dtype='int8')))
   .....: 

In [345]: dft
Out[345]: 
          A  B    C          D    E      F  G
0  0.809585  1  foo 2001-01-02  1.0  False  1
1  0.128238  1  foo 2001-01-02  1.0  False  1
2  0.775752  1  foo 2001-01-02  1.0  False  1

In [346]: dft.dtypes
Out[346]: 
A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

For a Series, use the dtype attribute:

In [347]: dft['A'].dtype
Out[347]: dtype('float64')

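If a pandas object contains data of multiple types in a single column, the dtype of the column is chosen to accommodate all of them.

If no more specific type fits, object is used, as it is the most general.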

# these ints are coerced to floats
In [348]: pd.Series([1, 2, 3, 4, 5, 6.])
Out[348]: 
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

# string data forces an ``object`` dtype
In [349]: pd.Series([1, 2, 3, 6., 'foo'])
Out[349]: 
0      1
1      2
2      3
3      6
4    foo
dtype: object

The number of columns of each type in a DataFrame can be found by calling get_dtype_counts().

In [350]: dft.get_dtype_counts()
Out[350]: 
float64           1
float32           1
int64             1
int8              1
datetime64[ns]    1
bool              1
object            1
dtype: int64
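On recent pandas releases get_dtype_counts() is no longer available (it was deprecated and then removed in pandas 1.0). A similar count can be obtained from the dtypes Series itself. The sketch below uses a hypothetical frame df_types and converts the dtypes to strings so that the counts can be looked up by name.

```python
import pandas as pd

# Hypothetical data: two float columns and one integer column
df_types = pd.DataFrame({'A': [1.0, 2.0], 'B': [1, 2], 'D': [3.0, 4.0]})

# dtypes is a Series, so value_counts() tallies the columns per dtype
dtype_counts = df_types.dtypes.astype(str).value_counts()

print(dtype_counts)
```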

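Numeric dtypes propagate and can coexist in a DataFrame. If a dtype is specified (via the dtype keyword, a passed ndarray, or a passed Series), it is preserved in DataFrame operations. Furthermore, different numeric dtypes are not combined. The following example gives a taste.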

In [351]: df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')

In [352]: df1
Out[352]: 
          A
0  0.890400
1  0.283331
2 -0.303613
3 -1.192210
4  0.065420
5  0.455918
6  2.008328
7  0.188942

In [353]: df1.dtypes
Out[353]: 
A    float32
dtype: object

In [354]: df2 = pd.DataFrame(dict( A = pd.Series(np.random.randn(8), dtype='float16'),
   .....:                         B = pd.Series(np.random.randn(8)),
   .....:                         C = pd.Series(np.array(np.random.randn(8), dtype='uint8')) ))
   .....: 

In [355]: df2
Out[355]: 
          A         B    C
0 -0.454346  0.200071  255
1 -0.916504 -0.557756  255
2  0.640625 -0.141988    0
3  2.675781 -0.174060    0
4 -0.007866  0.258626    0
5 -0.204224  0.941688    0
6 -0.100098 -1.849045    0
7 -0.402100 -0.949458    0

In [356]: df2.dtypes
Out[356]: 
A    float16
B    float64
C      uint8
dtype: object

defaults

By default integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit). The following all result in int64 dtypes.

In [357]: pd.DataFrame([1, 2], columns=['a']).dtypes
Out[357]: 
a    int64
dtype: object

In [358]: pd.DataFrame({'a': [1, 2]}).dtypes
Out[358]: 
a    int64
dtype: object

In [359]: pd.DataFrame({'a': 1 }, index=list(range(2))).dtypes
Out[359]: 
a    int64
dtype: object

Note that NumPy chooses platform-dependent types when creating arrays. The following results in int32 on a 32-bit platform.

In [360]: frame = pd.DataFrame(np.array([1, 2]))

upcasting

Types can be upcast when combined with other types, meaning they are promoted from the current type (e.g. int to float).

In [361]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2

In [362]: df3
Out[362]: 
          A         B      C
0  0.436054  0.200071  255.0
1 -0.633173 -0.557756  255.0
2  0.337012 -0.141988    0.0
3  1.483571 -0.174060    0.0
4  0.057555  0.258626    0.0
5  0.251695  0.941688    0.0
6  1.908231 -1.849045    0.0
7 -0.213158 -0.949458    0.0

In [363]: df3.dtypes
Out[363]: 
A    float32
B    float64
C    float64
dtype: object

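The values attribute of a DataFrame returns the lowest common denominator of the column dtypes, meaning the dtype that can accommodate ALL of the types in the resulting homogeneous NumPy array. This can force some upcasting.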

In [364]: df3.values.dtype
Out[364]: dtype('float64')
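
To make the "lowest common denominator" rule concrete, here is a small sketch under assumed data: a float32 column next to an int64 column collapses to float64, and adding a text column forces object. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

mixed = pd.DataFrame({
    "f32": np.array([1.5, 2.5], dtype="float32"),
    "i64": np.array([1, 2], dtype="int64"),
})

# float32 and int64 can both be represented by float64
common = mixed.values.dtype

# once a column of Python strings is present, only object can hold everything
with_text = mixed.assign(label=pd.Series(["x", "y"], dtype=object)).values.dtype

print(common, with_text)
```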

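The astype method

You can use the astype() method to explicitly convert dtypes from one to another. By default it returns a copy, even if the dtype was unchanged (pass copy=False to change this behavior). In addition, it raises an exception if the astype operation is invalid.

Upcasting always follows the NumPy rules. If two different dtypes are involved in an operation, the more general one is used as the result of the operation.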

In [365]: df3
Out[365]: 
          A         B      C
0  0.436054  0.200071  255.0
1 -0.633173 -0.557756  255.0
2  0.337012 -0.141988    0.0
3  1.483571 -0.174060    0.0
4  0.057555  0.258626    0.0
5  0.251695  0.941688    0.0
6  1.908231 -1.849045    0.0
7 -0.213158 -0.949458    0.0

In [366]: df3.dtypes
Out[366]: 
A    float32
B    float64
C    float64
dtype: object

# conversion of dtypes
In [367]: df3.astype('float32').dtypes
Out[367]: 
A    float32
B    float32
C    float32
dtype: object
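
The statement that an invalid astype raises an exception can be illustrated with a short sketch. The sample Series are made up; the exact exception message depends on the pandas version, so the code only checks that an error occurs.

```python
import pandas as pd

numbers_as_text = pd.Series(["1", "2", "3"], dtype=object)
converted = numbers_as_text.astype("int64")  # valid: every element parses as an integer

bad_text = pd.Series(["1", "two", "3"], dtype=object)
try:
    bad_text.astype("int64")  # "two" cannot be interpreted as an integer
    failed = False
except (ValueError, TypeError) as exc:
    failed = True
    print("astype failed:", exc)
```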

Convert a subset of columns to a specified type using astype().

In [368]: dft = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7, 8, 9]})

In [369]: dft[['a','b']] = dft[['a','b']].astype(np.uint8)

In [370]: dft
Out[370]: 
   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9

In [371]: dft.dtypes
Out[371]: 
a    uint8
b    uint8
c    int64
dtype: object

Convert certain columns to a specific dtype by passing a dict to astype().

In [372]: dft1 = pd.DataFrame({'a': [1,0,1], 'b': [4,5,6], 'c': [7, 8, 9]})

In [373]: dft1 = dft1.astype({'a': bool, 'c': np.float64})

In [374]: dft1
Out[374]: 
       a  b    c
0   True  4  7.0
1  False  5  8.0
2   True  6  9.0

In [375]: dft1.dtypes
Out[375]: 
a       bool
b      int64
c    float64
dtype: object

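When trying to convert a subset of columns to a specified type using astype() and .loc, upcasting occurs. Assignment through .loc tries to fit the assigned values into the existing dtypes, while [] overwrites them, taking the dtype from the right-hand side.

Therefore, the following piece of code produces an unintended result.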

In [376]: dft = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7, 8, 9]})

In [377]: dft.loc[:, ['a', 'b']].astype(np.uint8).dtypes
Out[377]: 
a    uint8
b    uint8
dtype: object

In [378]: dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8)

In [379]: dft.dtypes
Out[379]: 
a    int64
b    int64
c    int64
dtype: object
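
One way to sidestep the .loc behaviour shown above is to let astype() build the converted frame itself by passing a dict of column names to dtypes. This is a sketch of that alternative, not the only possible fix:

```python
import numpy as np
import pandas as pd

dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# astype() returns a new frame; rebinding the name replaces the columns wholesale,
# so the uint8 dtype is not fitted back into the existing int64 columns.
dft = dft.astype({"a": np.uint8, "b": np.uint8})

print(dft.dtypes)
```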

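Object conversion

pandas offers various functions to try to force conversion of types from the object dtype to other types. In cases where the data is already of the correct type but stored in an object array, the DataFrame.infer_objects() and Series.infer_objects() methods can be used to soft convert it to the correct type.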

In [380]: import datetime

In [381]: df = pd.DataFrame([[1, 2],
   .....:                    ['a', 'b'],
   .....:                    [datetime.datetime(2016, 3, 2), datetime.datetime(2016, 3, 2)]])
   .....: 

In [382]: df = df.T

In [383]: df
Out[383]: 
   0  1                    2
0  1  a  2016-03-02 00:00:00
1  2  b  2016-03-02 00:00:00

In [384]: df.dtypes
Out[384]: 
0    object
1    object
2    object
dtype: object

Because the data was transposed, the original inference stored all columns as object, which infer_objects will correct.

In [385]: df.infer_objects().dtypes
Out[385]: 
0             int64
1            object
2    datetime64[ns]
dtype: object
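
"Soft" conversion means that only values which already have the right Python type are re-typed; text is not parsed. A minimal sketch on two made-up Series:

```python
import pandas as pd

# Real integers that happen to sit in an object array: infer_objects() recovers int64.
raw = pd.Series([1, 2, 3], dtype=object)
inferred = raw.infer_objects()

# Digits stored as strings are left alone; parsing them is a job for pd.to_numeric.
text = pd.Series(["1", "2", "3"], dtype=object)
still_text = text.infer_objects()

print(inferred.dtype, still_text.dtype)
```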

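The following functions are available for one-dimensional object arrays or scalars to perform hard conversion of objects to a specified type:

to_numeric: conversion to numeric dtypes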

In [386]: m = ['1.1', 2, 3]

In [387]: pd.to_numeric(m)
Out[387]: array([ 1.1,  2. ,  3. ])

to_datetime: conversion to datetime objects

In [388]: import datetime

In [389]: m = ['2016-07-09', datetime.datetime(2016, 3, 2)]

In [390]: pd.to_datetime(m)
Out[390]: DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)

to_timedelta: conversion to timedelta objects

In [391]: m = ['5us', pd.Timedelta('1day')]

In [392]: pd.to_timedelta(m)
Out[392]: TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)

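To force a conversion, we can pass in an errors argument, which specifies how pandas should deal with elements that cannot be converted to the desired dtype or object. By default, errors='raise', meaning that any errors encountered are raised during the conversion process. However, if errors='coerce', these errors are ignored and pandas converts problematic elements to pd.NaT (for datetime and timedelta) or np.nan (for numeric). This can be useful if you are reading in data that is mostly of the desired dtype (e.g. numeric, datetime) but occasionally has non-conforming elements intermixed that you want to represent as missing: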

In [393]: import datetime

In [394]: m = ['apple', datetime.datetime(2016, 3, 2)]

In [395]: pd.to_datetime(m, errors='coerce')
Out[395]: DatetimeIndex(['NaT', '2016-03-02'], dtype='datetime64[ns]', freq=None)

In [396]: m = ['apple', 2, 3]

In [397]: pd.to_numeric(m, errors='coerce')
Out[397]: array([ nan,   2.,   3.])

In [398]: m = ['apple', pd.Timedelta('1day')]

In [399]: pd.to_timedelta(m, errors='coerce')
Out[399]: TimedeltaIndex([NaT, '1 days'], dtype='timedelta64[ns]', freq=None)
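
A typical follow-up to errors='coerce' is to count how many elements failed to convert. A small sketch with an invented input Series:

```python
import pandas as pd

raw = pd.Series(["1.5", "oops", "3"])

# Unparseable entries become NaN instead of raising.
numbers = pd.to_numeric(raw, errors="coerce")

# Number of elements that could not be converted.
n_bad = int(numbers.isna().sum())

print(numbers.dtype, n_bad)
```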

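The errors parameter has a third option, errors='ignore', which simply returns the passed-in data if any error is encountered while converting to the desired data type: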

In [400]: import datetime

In [401]: m = ['apple', datetime.datetime(2016, 3, 2)]

In [402]: pd.to_datetime(m, errors='ignore')
Out[402]: array(['apple', datetime.datetime(2016, 3, 2, 0, 0)], dtype=object)

In [403]: m = ['apple', 2, 3]

In [404]: pd.to_numeric(m, errors='ignore')
Out[404]: array(['apple', 2, 3], dtype=object)

In [405]: m = ['apple', pd.Timedelta('1day')]

In [406]: pd.to_timedelta(m, errors='ignore')
Out[406]: array(['apple', Timedelta('1 days 00:00:00')], dtype=object)
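
As far as we are aware, recent pandas releases deprecate errors='ignore' in favour of catching the exception explicitly. The helper below is a hypothetical sketch of that pattern (to_numeric_or_original is not a pandas function); it returns the input unchanged when conversion fails.

```python
import pandas as pd

def to_numeric_or_original(values):
    """Hypothetical helper: convert to numeric, or hand back the input untouched."""
    try:
        return pd.to_numeric(values)
    except (ValueError, TypeError):
        # Conversion failed for at least one element; keep the original data.
        return values

bad = ["apple", 2, 3]
good = ["1.1", 2, 3]

unchanged = to_numeric_or_original(bad)
converted = to_numeric_or_original(good)
print(unchanged, converted)
```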

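In addition to object conversion, to_numeric() provides another argument, downcast, which gives the option of downcasting numeric data to a smaller dtype, which can conserve memory: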

In [407]: m = ['1', 2, 3]

In [408]: pd.to_numeric(m, downcast='integer')   # smallest signed int dtype
Out[408]: array([1, 2, 3], dtype=int8)

In [409]: pd.to_numeric(m, downcast='signed')    # same as 'integer'
Out[409]: array([1, 2, 3], dtype=int8)

In [410]: pd.to_numeric(m, downcast='unsigned')  # smallest unsigned int dtype
Out[410]: array([1, 2, 3], dtype=uint8)

In [411]: pd.to_numeric(m, downcast='float')     # smallest float dtype
Out[411]: array([ 1.,  2.,  3.], dtype=float32)
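
The memory effect of downcast can be checked directly with nbytes. A minimal sketch on assumed data (100 small integers stored as int64):

```python
import numpy as np
import pandas as pd

wide = pd.Series(np.arange(100, dtype="int64"))

# All values fit in one byte, so the smallest signed integer dtype is int8.
narrow = pd.to_numeric(wide, downcast="integer")

print(wide.dtype, wide.nbytes, "->", narrow.dtype, narrow.nbytes)
```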

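Because these methods apply only to one-dimensional arrays, lists or scalars, they cannot be used directly on multi-dimensional objects such as DataFrames. However, with apply(), we can efficiently "apply" the function to each column: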

In [412]: import datetime

In [413]: df = pd.DataFrame([['2016-07-09', datetime.datetime(2016, 3, 2)]] * 2, dtype='O')

In [414]: df
Out[414]: 
            0                    1
0  2016-07-09  2016-03-02 00:00:00
1  2016-07-09  2016-03-02 00:00:00

In [415]: df.apply(pd.to_datetime)
Out[415]: 
           0          1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02

In [416]: df = pd.DataFrame([['1.1', 2, 3]] * 2, dtype='O')

In [417]: df
Out[417]: 
     0  1  2
0  1.1  2  3
1  1.1  2  3

In [418]: df.apply(pd.to_numeric)
Out[418]: 
     0  1  2
0  1.1  2  3
1  1.1  2  3

In [419]: df = pd.DataFrame([['5us', pd.Timedelta('1day')]] * 2, dtype='O')

In [420]: df
Out[420]: 
     0                1
0  5us  1 days 00:00:00
1  5us  1 days 00:00:00

In [421]: df.apply(pd.to_timedelta)
Out[421]: 
                0      1
0 00:00:00.000005 1 days
1 00:00:00.000005 1 days
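
Keyword arguments passed to apply() are forwarded to the applied function, so the column-wise conversion can be combined with errors='coerce'. A sketch on an invented frame:

```python
import pandas as pd

raw = pd.DataFrame({"x": ["1", "2", "bad"], "y": ["4.5", "5", "6"]}, dtype=object)

# errors="coerce" is handed through to pd.to_numeric for every column.
cleaned = raw.apply(pd.to_numeric, errors="coerce")

print(cleaned.dtypes)
```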

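Gotchas

Performing selection operations on integer data can easily upcast the data to floating point. The dtype of the input data is preserved in cases where NaNs are not introduced.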

In [422]: dfi = df3.astype('int32')

In [423]: dfi['E'] = 1

In [424]: dfi
Out[424]: 
   A  B    C  E
0  0  0  255  1
1  0  0  255  1
2  0  0    0  1
3  1  0    0  1
4  0  0    0  1
5  0  0    0  1
6  1 -1    0  1
7  0  0    0  1

In [425]: dfi.dtypes
Out[425]: 
A    int32
B    int32
C    int32
E    int64
dtype: object

In [426]: casted = dfi[dfi>0]

In [427]: casted
Out[427]: 
     A   B      C  E
0  NaN NaN  255.0  1
1  NaN NaN  255.0  1
2  NaN NaN    NaN  1
3  1.0 NaN    NaN  1
4  NaN NaN    NaN  1
5  NaN NaN    NaN  1
6  1.0 NaN    NaN  1
7  NaN NaN    NaN  1

In [428]: casted.dtypes
Out[428]: 
A    float64
B    float64
C    float64
E      int64
dtype: object
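
If integer columns must stay integer even when values are masked out, one option in newer pandas versions is the nullable "Int64" extension dtype, which stores missing values as <NA> instead of NaN. This is a sketch under that assumption; it is not part of the session above.

```python
import pandas as pd

plain = pd.Series([1, -2, 3], dtype="int64")

# Masking a NumPy-backed integer Series introduces NaN and upcasts to float64.
masked = plain.where(plain > 0)

# The nullable extension dtype keeps integers and marks the gap as <NA>.
nullable = plain.astype("Int64")
kept = nullable.where(plain > 0)

print(masked.dtype, kept.dtype)
```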

Float dtypes, by contrast, are unchanged by the same kind of selection.

In [429]: dfa = df3.copy()

In [430]: dfa['A'] = dfa['A'].astype('float32')

In [431]: dfa.dtypes
Out[431]: 
A    float32
B    float64
C    float64
dtype: object

In [432]: casted = dfa[df2>0]

In [433]: casted
Out[433]: 
          A         B      C
0       NaN  0.200071  255.0
1       NaN       NaN  255.0
2  0.337012       NaN    NaN
3  1.483571       NaN    NaN
4       NaN  0.258626    NaN
5       NaN  0.941688    NaN
6       NaN       NaN    NaN
7       NaN       NaN    NaN

In [434]: casted.dtypes
Out[434]: 
A    float32
B    float64
C    float64
dtype: object

Selecting columns based on dtype

The select_dtypes() method implements subsetting of columns based on their dtype.

In [435]: df = pd.DataFrame({'string': list('abc'),
   .....:                    'int64': list(range(1, 4)),
   .....:                    'uint8': np.arange(3, 6).astype('u1'),
   .....:                    'float64': np.arange(4.0, 7.0),
   .....:                    'bool1': [True, False, True],
   .....:                    'bool2': [False, True, False],
   .....:                    'dates': pd.date_range('now', periods=3).values,
   .....:                    'category': pd.Series(list("ABC")).astype('category')})
   .....: 

In [436]: df['tdeltas'] = df.dates.diff()

In [437]: df['uint64'] = np.arange(3, 6).astype('u8')

In [438]: df['other_dates'] = pd.date_range('20130101', periods=3).values

In [439]: df['tz_aware_dates'] = pd.date_range('20130101', periods=3, tz='US/Eastern')

In [440]: df
Out[440]: 
  string  int64  uint8  float64  bool1  bool2                      dates category tdeltas  uint64 other_dates            tz_aware_dates
0      a      1      3      4.0   True  False 2018-08-05 11:57:39.507525        A     NaT       3  2013-01-01 2013-01-01 00:00:00-05:00
1      b      2      4      5.0  False   True 2018-08-06 11:57:39.507525        B  1 days       4  2013-01-02 2013-01-02 00:00:00-05:00
2      c      3      5      6.0   True  False 2018-08-07 11:57:39.507525        C  1 days       5  2013-01-03 2013-01-03 00:00:00-05:00

The dtypes of all columns:

In [441]: df.dtypes
Out[441]: 
string                                object
int64                                  int64
uint8                                  uint8
float64                              float64
bool1                                   bool
bool2                                   bool
dates                         datetime64[ns]
category                            category
tdeltas                      timedelta64[ns]
uint64                                uint64
other_dates                   datetime64[ns]
tz_aware_dates    datetime64[ns, US/Eastern]
dtype: object

select_dtypes() has two parameters, include and exclude: the former lists the dtypes to keep, the latter the dtypes to leave out.

For example, to select bool columns:

In [442]: df.select_dtypes(include=[bool])
Out[442]: 
   bool1  bool2
0   True  False
1  False   True
2   True  False

You can also pass the name of a dtype from the NumPy dtype hierarchy:

In [443]: df.select_dtypes(include=['bool'])
Out[443]: 
   bool1  bool2
0   True  False
1  False   True
2   True  False

select_dtypes() also works with generic dtypes.

For example, to select all numeric and boolean columns while excluding unsigned integers:

In [444]: df.select_dtypes(include=['number', 'bool'], exclude=['unsignedinteger'])
Out[444]: 
   int64  float64  bool1  bool2 tdeltas
0      1      4.0   True  False     NaT
1      2      5.0  False   True  1 days
2      3      6.0   True  False  1 days
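
The same include/exclude combination can be tried on a small frame whose dtypes are pinned explicitly. The column names here are invented for illustration:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    "i": np.array([1, 2], dtype="int64"),
    "u": np.array([3, 4], dtype="uint8"),
    "f": np.array([0.5, 1.5], dtype="float64"),
    "b": np.array([True, False]),
    "s": ["x", "y"],
})

# Keep numeric and boolean columns, but drop anything that is an unsigned integer.
selected = small.select_dtypes(include=["number", "bool"], exclude=["unsignedinteger"])

print(list(selected.columns))
```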

To select string columns you must use the object dtype:

In [445]: df.select_dtypes(include=['object'])
Out[445]: 
  string
0      a
1      b
2      c

To see all the child dtypes of a generic dtype such as numpy.number, you can define a function that returns a tree of child dtypes:

In [446]: def subdtypes(dtype):
   .....:     subs = dtype.__subclasses__()
   .....:     if not subs:
   .....:         return dtype
   .....:     return [dtype, [subdtypes(dt) for dt in subs]]
   .....:

All NumPy dtypes are subclasses of numpy.generic:

In [447]: subdtypes(np.generic)
Out[447]: 
[numpy.generic,
 [[numpy.number,
   [[numpy.integer,
     [[numpy.signedinteger,
       [numpy.int8,
        numpy.int16,
        numpy.int32,
        numpy.int64,
        numpy.int64,
        numpy.timedelta64]],
      [numpy.unsignedinteger,
       [numpy.uint8,
        numpy.uint16,
        numpy.uint32,
        numpy.uint64,
        numpy.uint64]]]],
    [numpy.inexact,
     [[numpy.floating,
       [numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
      [numpy.complexfloating,
       [numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
  [numpy.flexible,
   [[numpy.character, [numpy.bytes_, numpy.str_]],
    [numpy.void, [numpy.record]]]],
  numpy.bool_,
  numpy.datetime64,
  numpy.object_]]
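
For a single membership question, walking the whole tree is not necessary: np.issubdtype answers whether one type sits below another in the hierarchy. A short sketch:

```python
import numpy as np

# int8 sits under signedinteger, which sits under integer and number.
int8_is_signed = np.issubdtype(np.int8, np.signedinteger)

# uint8 is a number, but not a signed integer.
uint8_is_number = np.issubdtype(np.uint8, np.number)
uint8_is_signed = np.issubdtype(np.uint8, np.signedinteger)

# Floating types are not integers.
float32_is_integer = np.issubdtype(np.float32, np.integer)

print(int8_is_signed, uint8_is_number, uint8_is_signed, float32_is_integer)
```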

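Note: pandas also defines the category and datetime64[ns, tz] types, which are not integrated into the normal NumPy hierarchy and will not show up in the output of the function above.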
