pandas-excel

import pandas as pd
import numpy as np
df = pd.DataFrame({'total_bill': [16.99, 10.34, 23.68, 23.68, 24.59],
                   'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
                   'sex': ['Female', 'Male', 'Male', 'Male', 'Female']})

A DataFrame exposes several built-in attributes:

# data type of each column
df.dtypes
# row index
df.index
# column names (labels), returned as a pandas.Index
df.columns
# the underlying data as a 2-D NumPy array, one inner array per row
df.values
# a tuple giving the dimensions of df as (rows, columns)
df.shape
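As a quick check of what these attributes hold for the sample data, here is a minimal sketch. The variable names (n_rows, column_names, and so on) are illustrative only and do not come from the original post.

```python
import pandas as pd

df = pd.DataFrame({'total_bill': [16.99, 10.34, 23.68, 23.68, 24.59],
                   'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
                   'sex': ['Female', 'Male', 'Male', 'Male', 'Female']})

# shape is (number of rows, number of columns)
n_rows, n_cols = df.shape

# columns and index are pandas.Index objects; list() gives plain Python lists
column_names = list(df.columns)
row_labels = list(df.index)

# dtypes drives helpers such as select_dtypes, which filters columns by type
numeric_cols = df.select_dtypes(include='number').columns.tolist()
```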

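select:
In SQL, select picks columns by name. Pandas is more flexible: columns can be selected by name or by their position. The relevant function:
loc: label-based column selection; specific rows can also be selected by row index.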

df.loc[1:3, ['total_bill', 'tip']]
# a label slice follows the column order, so it must run from 'total_bill' to 'tip'
df.loc[1:3, 'total_bill':'tip']
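The text mentions selecting by position but only shows loc. The sketch below contrasts loc with iloc, the standard position-based indexer in Pandas, on the same sample data. Note that a loc slice includes its end label, while an iloc slice excludes its end position.

```python
import pandas as pd

df = pd.DataFrame({'total_bill': [16.99, 10.34, 23.68, 23.68, 24.59],
                   'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
                   'sex': ['Female', 'Male', 'Male', 'Male', 'Female']})

# label-based: rows labelled 1, 2 and 3 (end label included)
by_label = df.loc[1:3, ['total_bill', 'tip']]

# position-based: rows at positions 1 and 2 (end position excluded),
# columns at positions 0 and 1
by_position = df.iloc[1:3, 0:2]
```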

There is also a more concise way to select rows or columns:

df[1: 3]
df[['total_bill', 'tip']]
# df[1:2, ['total_bill', 'tip']]  # raises an error (TypeError: unhashable type); use df.loc to select rows and columns together

where:

# and
df[(df['sex'] == 'Female') & (df['total_bill'] > 20)]
# or
df[(df['sex'] == 'Female') | (df['total_bill'] > 20)]
# in
df[df['total_bill'].isin([21.01, 23.68, 24.59])]
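# not (~ is the idiomatic operator for negating a boolean mask)
df[~(df['sex'] == 'Male')]
df[~df['total_bill'].isin([21.01, 23.68, 24.59])]
# string function (this line assumes a different DataFrame with an 'app' column, and a list sys_app)
df = df[(~df['app'].isin(sys_app)) & (~df.app.str.contains(r'^微信\d+$'))]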
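The string filter above refers to an 'app' column and a sys_app list that are not defined in this post. The sketch below is a self-contained version with made-up data: the apps frame and the contents of sys_app are assumptions for illustration only.

```python
import pandas as pd

# hypothetical data: the original post does not show this DataFrame
apps = pd.DataFrame({'app': ['微信', '微信2', 'settings', 'camera', '微信10']})
sys_app = ['settings', 'camera']  # hypothetical list of system apps

# boolean masks, one per condition
is_system = apps['app'].isin(sys_app)
is_clone = apps['app'].str.contains(r'^微信\d+$')  # "微信" followed by digits only

# keep rows that are neither system apps nor numbered clones
kept = apps[~is_system & ~is_clone]
```

Naming each mask before combining them keeps long conditions readable and makes each one easy to inspect on its own.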

join:
Pandas also offers two ways to perform a join:

# 1.
df.join(df2, how='left'...)
# 2. 
pd.merge(df1, df2, how='left', left_on='app', right_on='app')

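The first method joins on the DataFrame index, whereas the second joins on the columns specified by on (or left_on/right_on). Pandas supports four join types: left, right, inner, and full outer (how='outer').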
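The difference between the two methods is easier to see on concrete data. This is a minimal sketch: df1, df2 and their columns (installs, rating) are hypothetical and not taken from the original post.

```python
import pandas as pd

# hypothetical tables sharing an 'app' key column
df1 = pd.DataFrame({'app': ['a', 'b', 'c'], 'installs': [10, 20, 30]})
df2 = pd.DataFrame({'app': ['a', 'b', 'd'], 'rating': [4.5, 3.9, 4.1]})

# merge joins on the named column
merged = pd.merge(df1, df2, how='left', on='app')

# join matches on the index, so the key column is moved into the index first
joined = df1.set_index('app').join(df2.set_index('app'), how='left')

# full outer join keeps keys from both sides
outer = pd.merge(df1, df2, how='outer', on='app')
```

In the left joins, app 'c' has no match in df2, so its rating is NaN. The outer join additionally keeps app 'd', with a NaN installs value.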

order:
Pandas supports ordering by multiple columns, with ascending or descending order set per column, which gives more freedom in sorting:

df.sort_values(['total_bill', 'tip'], ascending=[False, True])
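To make the effect of the mixed sort order concrete, the sketch below runs the same call on the sample data. The two rows with total_bill 23.68 are tied on the first key, so tip (ascending) decides their order.

```python
import pandas as pd

df = pd.DataFrame({'total_bill': [16.99, 10.34, 23.68, 23.68, 24.59],
                   'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
                   'sex': ['Female', 'Male', 'Male', 'Male', 'Female']})

# total_bill descending; ties broken by tip ascending
sorted_df = df.sort_values(['total_bill', 'tip'], ascending=[False, True])

# original row labels in their new order
order = sorted_df.index.tolist()  # [4, 3, 2, 0, 1]

# sort_values keeps the old labels; reset_index renumbers the rows from 0
renumbered = sorted_df.reset_index(drop=True)
```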

replace:
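The replace function modifies values across the whole DataFrame; values can also be changed only for rows that match a where-style condition (in combination with loc):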

# overall replace
df.replace(to_replace='Female', value='Sansa', inplace=True)

# dict replace
df.replace({'sex': {'Female': 'Sansa', 'Male': 'Leone'}}, inplace=True)

# replace on where condition 
df.loc[df.sex == 'Male', 'sex'] = 'Leone'
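The calls above all modify df in place. As a sketch of the non-mutating alternative, replace without inplace=True returns a new DataFrame, and a conditional update can be applied to a copy. The variable names renamed and masked are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'total_bill': [16.99, 10.34, 23.68, 23.68, 24.59],
                   'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
                   'sex': ['Female', 'Male', 'Male', 'Male', 'Female']})

# without inplace=True, replace returns a new DataFrame and leaves df untouched
renamed = df.replace({'sex': {'Female': 'Sansa', 'Male': 'Leone'}})

# conditional update on a copy: only rows matching both conditions change
masked = df.copy()
masked.loc[(masked['sex'] == 'Male') & (masked['total_bill'] > 20), 'sex'] = 'Leone'
```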

distinct:

df.drop_duplicates(subset=['sex'], keep='first', inplace=True)
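# subset: the columns to deduplicate on; defaults to all columns
# keep: one of {'first', 'last', False}; keeps the first or the last of each set of duplicates, or drops them all
# inplace: defaults to False, which returns a new DataFrame; if True, the original DataFrame is deduplicated in place and None is returned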
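The three keep options and the inplace behavior can be compared side by side on the sample data. This is a minimal sketch with illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame({'total_bill': [16.99, 10.34, 23.68, 23.68, 24.59],
                   'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
                   'sex': ['Female', 'Male', 'Male', 'Male', 'Female']})

# keep the first row seen for each distinct value of sex
first = df.drop_duplicates(subset=['sex'], keep='first')

# keep the last row seen for each distinct value of sex
last = df.drop_duplicates(subset=['sex'], keep='last')

# keep=False drops every row whose total_bill appears more than once
no_dupes = df.drop_duplicates(subset=['total_bill'], keep=False)

# inplace=True modifies the DataFrame itself and returns None
tmp = df.copy()
returned = tmp.drop_duplicates(subset=['sex'], inplace=True)
```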

group:
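group is usually combined with aggregate functions such as count and avg. In Pandas, SQL's count is covered by two functions: size counts all rows in each group, while count counts the non-null values in each column: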

df.groupby('sex').size()
df.groupby('sex').count()
df.groupby('sex')['tip'].count()
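size and count only differ when values are missing. The sketch below uses the sample data with one tip replaced by NaN (a modification made here for illustration) to show the difference.

```python
import numpy as np
import pandas as pd

# sample data with one tip set to NaN for illustration
df = pd.DataFrame({'total_bill': [16.99, 10.34, 23.68, 23.68, 24.59],
                   'tip': [1.01, np.nan, 3.50, 3.31, 3.61],
                   'sex': ['Female', 'Male', 'Male', 'Male', 'Female']})

# size counts rows per group, like SQL count(*)
sizes = df.groupby('sex').size()

# count skips missing values, like SQL count(tip)
counts = df.groupby('sex')['tip'].count()
```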

sql:

select sex, max(tip), sum(total_bill) as total
from tips_tb
group by sex;

In Pandas, the same result is obtained by passing a dict to agg():

df.groupby('sex').agg({'tip': 'max', 'total_bill': 'sum'})

# count(distinct **)
df.groupby('tip').agg({'sex': pd.Series.nunique})
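The dict form of agg() does not reproduce the "as total" alias from the SQL query. One way to get named output columns is Pandas' named aggregation syntax, sketched below. The output name max_tip is an illustrative choice; total follows the alias in the SQL.

```python
import pandas as pd

df = pd.DataFrame({'total_bill': [16.99, 10.34, 23.68, 23.68, 24.59],
                   'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
                   'sex': ['Female', 'Male', 'Male', 'Male', 'Female']})

# named aggregation: output_name=(input_column, aggregate_function)
result = df.groupby('sex').agg(max_tip=('tip', 'max'),
                               total=('total_bill', 'sum'))

# reset_index turns the group key back into an ordinary column, like a SQL result set
as_table = result.reset_index()

# count(distinct tip) per sex
distinct_tips = df.groupby('sex')['tip'].nunique()
```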

as:

df.rename(columns={'total_bill': 'total', 'tip': 'pit', 'sex': 'xes'}, inplace=True)

Author: Treant

Source: http://www.cnblogs.com/en-heng/
