Pandas Exercises


import numpy as np
from numpy import random
import pandas as pd

Indexing Exercises

Nothing is more common than indexing. By default, indexing selects columns, e.g. df['Date']; for rows, use label-based indexing with ix, e.g. df.ix['2015-01-05']. There are also loc and iloc; their differences are described below. (Note: .ix has since been deprecated in newer pandas in favor of .loc and .iloc.)

loc is label-based indexing, basically looking up a row by its label; iloc is integer position-based row indexing; ix is a general method that first tries label-based lookup and, if that fails, falls back to integer-based lookup.

.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type.

df = pd.DataFrame({'A':['a', 'b', 'c'], 'B':[54, 67, 89]}, index=['100', '200', '300'])
df
A B
100 a 54
200 b 67
300 c 89
df['A'] # default: column indexing
100    a
200    b
300    c
Name: A, dtype: object
df.loc['100'] # row indexing by label
A     a
B    54
Name: 100, dtype: object
df.iloc[0] # more like numpy indexing: rows by integer position
A     a
B    54
Name: 100, dtype: object
df.ix['200'] # more general label-based indexing
A     b
B    67
Name: 200, dtype: object
df.ix[1] # combines the functionality of loc and iloc
# note: this fails when the index labels are themselves integers
A     b
B    67
Name: 200, dtype: object
df
A B
100 a 54
200 b 67
300 c 89
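The integer-label pitfall noted above can be sketched with loc and iloc alone (a minimal hypothetical example; the frame and names are assumptions, not from the original):

```python
import pandas as pd

# A frame whose index labels are integers but NOT 0-based positions.
df_int = pd.DataFrame({'A': ['x', 'y', 'z']}, index=[10, 20, 30])

first_by_label = df_int.loc[10, 'A']   # loc looks up the label 10
first_by_pos = df_int.iloc[0]['A']     # iloc looks up position 0

# With integer labels, loc[0] raises KeyError (no label equals 0),
# while iloc[0] always means "first row" -- the ambiguity ix had to guess at.
try:
    df_int.loc[0]
    ambiguous = False
except KeyError:
    ambiguous = True
```

This is why explicit loc/iloc eventually replaced ix: each one is unambiguous on its own.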

Text Exercises

String operations on Series objects. Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods:

Series

s = pd.Series(['A', 'B', 'Aaba', 'Baca', np.nan, 'CABA', 'dog'])
# built-in str methods can be used directly
s.str.lower()
0       a
1       b
2    aaba
3    baca
4     NaN
5    caba
6     dog
dtype: object
# element-wise operation
s.str.len()
0    1.0
1    1.0
2    4.0
3    4.0
4    NaN
5    4.0
6    3.0
dtype: float64

Index

df.columns is an Index object, so column names can be cleaned up conveniently

idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])
# strip() / lstrip() / rstrip()
idx.str.strip()
Index([u'jack', u'jill', u'jesse', u'frank'], dtype='object')
df_str = pd.DataFrame(np.random.random((3, 2)), columns=[' Column A ', ' Column B '],index=range(3))
df_str.columns
Index([u' Column A ', u' Column B '], dtype='object')
df_str.columns.str.strip() # 依然是Index 对象
Index([u'Column A', u'Column B'], dtype='object')
df_str.columns = df_str.columns.str.strip().str.lower().str.replace(' ','_')
df_str
column_a column_b
0 0.444976 0.441526
1 0.846533 0.549270
2 0.492742 0.363660

Replace

replace and findall accept regular expressions

s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'])
s3.str.replace('^A|dog','XX')
0       XX
1        B
2        C
3    XXaba
4     Baca
5         
6      NaN
7     CABA
8       XX
9      cat
dtype: object
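The heading above also mentions findall; a small sketch (the series is shortened here to keep it self-contained):

```python
import numpy as np
import pandas as pd

s3 = pd.Series(['A', 'B', 'Aaba', 'Baca', np.nan, 'CABA'])

# findall returns, for each element, the list of all non-overlapping
# regex matches; missing values stay NaN.
matches = s3.str.findall('[Aa]')
```

'Aaba' yields ['A', 'a', 'a'], 'B' yields an empty list, and the NaN entry is passed through untouched.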

Pattern Match (contains)

The distinction between match and contains is strictness: match relies on strict re.match, while contains relies on re.search.

pattern = r'[a-z][0-9]'
s = pd.Series(['1', 'b2', '3a', '3b', 'c2c'])
s.str.contains(pattern)
0    False
1     True
2    False
3    False
4     True
dtype: bool
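On the series above, contains and match happen to agree; a pair chosen so the two disagree makes the re.search vs re.match distinction visible (a hypothetical example, not from the original):

```python
import pandas as pd

pattern = r'[a-z][0-9]'
s2 = pd.Series(['b2', 'xb2'])

has_anywhere = s2.str.contains(pattern)  # re.search: pattern anywhere
has_at_start = s2.str.match(pattern)     # re.match: pattern at the start

# 'xb2' contains 'b2' in the middle, so contains is True but match is
# False (at position 0, 'x' matches [a-z] but 'b' is not a digit).
```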

Summary of str methods

url = 'http://pandas.pydata.org/pandas-docs/stable/text.html'
df_str_method = pd.read_html(url,header=0,attrs={'id' : 'text-summary'})[0].set_index('Method')
df_str_method.head()
Method Description
cat() Concatenate strings
split() Split strings on delimiter
rsplit() Split strings on delimiter working from the en...
get() Index into each element (retrieve i-th element)
join() Join strings in each element of the Series wit...

Tutorial

Reference

Creating a DataFrame

names = ['Bob','Jessica','Mary','John','Mel']
# Make a random list of 1,000 baby names using the five above
random.seed(500)
random_names = [names[random.randint(low=0,high=len(names))] for i in range(1000)]
random_names[:5]
['Mary', 'Jessica', 'Jessica', 'Bob', 'Jessica']
births = [random.randint(low=0,high=1000) for i in range(1000)]
births[:5]
[905, 52, 796, 217, 874]
# To merge these two lists together we will use the zip function.
BabyDataSet = list(zip(random_names,births)) # zip returns an iterator in Python 3, so materialize it as a list
BabyDataSet[:3]
[('Mary', 905), ('Jessica', 52), ('Jessica', 796)]
# data accepts lists, dicts, and other data types
df_baby = pd.DataFrame(data=BabyDataSet,columns=['Names','Births'])
df_baby.head()
Names Births
0 Mary 905
1 Jessica 52
2 Jessica 796
3 Bob 217
4 Jessica 874
df_baby['Names'].unique()
array(['Mary', 'Jessica', 'Bob', 'John', 'Mel'], dtype=object)
df_baby['Names'].describe()
count     1000
unique       5
top        Bob
freq       206
Name: Names, dtype: object

Reading and Writing CSV

df_baby.to_csv('births1880.csv',index=False,header=False)
df_baby_read = pd.read_csv('births1880.csv',names=['Names','Births'])
df_baby_read.head()
Names Births
0 Mary 905
1 Jessica 52
2 Jessica 796
3 Bob 217
4 Jessica 874

Reading and Writing Excel

d = {'Channel':[1], 'Number':[255]}
df = pd.DataFrame(d)
df
Channel Number
0 1 255
df.to_excel('test1.xlsx',sheet_name='test1',index=False)
df.to_excel('test2.xlsx', sheet_name = 'test2', index = False)
df.to_excel('test3.xlsx', sheet_name = 'test3', index = False)
import os
file_names = []
for file in os.listdir('.'):
    if file.endswith('.xlsx'):
        file_names.append(file)
file_names
['test1.xlsx', 'test2.xlsx', 'test3.xlsx']

Read the three Excel files into one DataFrame

def get_file(fname):
    df = pd.read_excel(fname,0)
    df['File'] = fname
    return df.set_index(['File'])

df_list = [get_file(fname) for fname in file_names]
df_list[0]
File Channel Number
test1.xlsx 1 255
big_df = pd.concat(df_list)
big_df
File Channel Number
test1.xlsx 1 255
test2.xlsx 1 255
test3.xlsx 1 255

Stack and Unstack

# Our small data set
d = {'one':[1,1],'two':[2,2]};i = ['a','b']
# Create dataframe
df = pd.DataFrame(data = d, index = i)
df
one two
a 1 2
b 1 2
stack = df.stack()
stack
a  one    1
   two    2
b  one    1
   two    2
dtype: int64
stack.index
MultiIndex(levels=[[u'a', u'b'], [u'one', u'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
unstack = df.unstack()
unstack
one  a    1
     b    1
two  a    2
     b    2
dtype: int64
unstack.index
MultiIndex(levels=[[u'one', u'two'], [u'a', u'b']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
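Since stack pivots the columns into the inner index level and unstack pivots a level back out, the two are inverses; a minimal sanity check on the same small frame:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 1], 'two': [2, 2]}, index=['a', 'b'])

# stack: columns become the inner level of a MultiIndex Series;
# unstack: the inner level goes back out to columns, recovering df.
roundtrip = df.stack().unstack()
```

roundtrip has the same values, labels, and dtypes as the original df.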

Groupby

d = {'one':[1,1,1,1,1],
     'two':[2,2,2,2,2],
     'letter':['a','a','b','b','c']}

# Create dataframe
df = pd.DataFrame(d)
df
letter one two
0 a 1 2
1 a 1 2
2 b 1 2
3 b 1 2
4 c 1 2
one = df.groupby('letter')
one.sum()
letter one two
a 2 4
b 2 4
c 1 2
letterone = df.groupby(['letter','one']).sum()
letterone
letter one two
a 1 4
b 1 4
c 1 2
letterone.index
MultiIndex(levels=[[u'a', u'b', u'c'], [1]],
           labels=[[0, 1, 2], [0, 0, 0]],
           names=[u'letter', u'one'])

Outliers

# Create a dataframe with dates as your index
States = ['NY', 'NY', 'NY', 'NY', 'FL', 'FL', 'GA', 'FL', 'FL'] 
data = [1.0, 1.2, 4, 5, 6, 7, 8, 9, 10]
idx = pd.date_range('1/1/2012', periods=9, freq='MS')
df1 = pd.DataFrame(data, index=idx, columns=['Revenue'])
df1['State'] = States

# Create a second dataframe
data2 = [10.0, 10.0, 9, 9, 8, 8, 7, 6, 11]
idx2 = pd.date_range('1/1/2013', periods=9, freq='MS')
df2 = pd.DataFrame(data2, index=idx2, columns=['Revenue'])
df2['State'] = States
# combine dataframes
df = pd.concat([df1,df2])
pd.concat([df.head(2), df.tail(2)]) # DataFrame.append was removed in pandas 2.0
Revenue State
2012-01-01 1.0 NY
2012-02-01 1.2 NY
2013-08-01 6.0 FL
2013-09-01 11.0 FL

Method 1

# a new copy
newdf = df.copy()
newdf['x-Mean'] = abs(newdf['Revenue']-newdf['Revenue'].mean())
newdf['1.96*std'] = 1.96*newdf['Revenue'].std()
newdf['Outlier'] = abs(newdf['Revenue'] - newdf['Revenue'].mean()) > 1.96*newdf['Revenue'].std()
newdf
Revenue State x-Mean 1.96*std Outlier
2012-01-01 1.0 NY 6.177778 5.666103 True
2012-02-01 1.2 NY 5.977778 5.666103 True
2012-03-01 4.0 NY 3.177778 5.666103 False
2012-04-01 5.0 NY 2.177778 5.666103 False
2012-05-01 6.0 FL 1.177778 5.666103 False
2012-06-01 7.0 FL 0.177778 5.666103 False
2012-07-01 8.0 GA 0.822222 5.666103 False
2012-08-01 9.0 FL 1.822222 5.666103 False
2012-09-01 10.0 FL 2.822222 5.666103 False
2013-01-01 10.0 NY 2.822222 5.666103 False
2013-02-01 10.0 NY 2.822222 5.666103 False
2013-03-01 9.0 NY 1.822222 5.666103 False
2013-04-01 9.0 NY 1.822222 5.666103 False
2013-05-01 8.0 FL 0.822222 5.666103 False
2013-06-01 8.0 FL 0.822222 5.666103 False
2013-07-01 7.0 GA 0.177778 5.666103 False
2013-08-01 6.0 FL 1.177778 5.666103 False
2013-09-01 11.0 FL 3.822222 5.666103 False

Method 2

Group by item: compute the outlier flag per State.

The methods commonly used after a groupby differ as follows.

Agg: aggregate over each group, computing a summary statistic (or statistics) about each group. Some examples:

  • Compute group sums or means
  • Compute group sizes / counts

Transform: compute within each group and return a like-indexed object, i.e. perform some group-specific computations whose result lines up with the original rows. Some examples:

  • Standardizing data (zscore) within group
  • Filling NAs within groups with a value derived from each group

Filter: select rows by group-level information, discarding some groups according to a group-wise computation that evaluates to True or False. Some examples:

  • Discarding data that belongs to groups with only a few members
  • Filtering out data based on the group sum or mean
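A minimal sketch contrasting the three, rebuilding the small letter frame from the Groupby section so it is self-contained:

```python
import pandas as pd

df = pd.DataFrame({'letter': ['a', 'a', 'b', 'b', 'c'],
                   'one': [1, 1, 1, 1, 1],
                   'two': [2, 2, 2, 2, 2]})
g = df.groupby('letter')

# agg: one result per group (a summary statistic).
sums = g['two'].agg('sum')             # a -> 4, b -> 4, c -> 2

# transform: like-indexed result; every row gets its group's statistic.
group_sums = g['two'].transform('sum') # 4, 4, 4, 4, 2

# filter: drop whole groups failing a group-wise test (here, singletons).
kept = g.filter(lambda x: len(x) > 1)  # the lone 'c' row is discarded
```

Method 2 below relies on transform precisely because its result keeps the original index, so it can be assigned straight back as new columns.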
newdf = df.copy()
State = newdf.groupby('State')
newdf['x-Mean'] = State.transform(lambda x: abs(x-x.mean()))
newdf['1.96*std'] = State.transform(lambda x: 1.96*x.std())
newdf['Outlier'] = State.transform(lambda x: abs(x-x.mean()) > 1.96*x.std())
newdf
Revenue State x-Mean 1.96*std Outlier
2012-01-01 1.0 NY 5.150 7.513293 False
2012-02-01 1.2 NY 4.950 7.513293 False
2012-03-01 4.0 NY 2.150 7.513293 False
2012-04-01 5.0 NY 1.150 7.513293 False
2012-05-01 6.0 FL 2.125 3.543134 False
2012-06-01 7.0 FL 1.125 3.543134 False
2012-07-01 8.0 GA 0.500 1.385929 False
2012-08-01 9.0 FL 0.875 3.543134 False
2012-09-01 10.0 FL 1.875 3.543134 False
2013-01-01 10.0 NY 3.850 7.513293 False
2013-02-01 10.0 NY 3.850 7.513293 False
2013-03-01 9.0 NY 2.850 7.513293 False
2013-04-01 9.0 NY 2.850 7.513293 False
2013-05-01 8.0 FL 0.125 3.543134 False
2013-06-01 8.0 FL 0.125 3.543134 False
2013-07-01 7.0 GA 0.500 1.385929 False
2013-08-01 6.0 FL 2.125 3.543134 False
2013-09-01 11.0 FL 2.875 3.543134 False
