pandas provides high-level data structures and manipulation tools that make data analysis faster and easier; it is built on top of NumPy, which makes NumPy-centric applications simpler.
NumPy lets us work with numeric data, but that alone is not enough.
Much of the time our data contains not only numbers but also strings, time series, and so on.
For example: data we obtained with a crawler and stored in a database.
For example: the earlier YouTube data set contained, besides numbers, country information, video category (tag) information, titles, etc.
So NumPy helps us handle numbers, while pandas (built on NumPy) handles numbers and other kinds of data as well.
Series: a one-dimensional, array-like object consisting of a sequence of data plus an associated array of data labels (the index). The simplest Series can be created from a sequence of data alone.
A Series can be converted to a list with the .tolist() method, and values can be retrieved directly by label, e.g. t["A"].
import pandas as pd
import numpy as np
import string
#Create a Series from a list without specifying an index: a default integer index is generated
In [153]: pd.Series([1,2,3,4,5])
Out[153]:
0 1
1 2
2 3
3 4
4 5
dtype: int64
#Create a Series from a list with an explicit index
In [137]: obj = pd.Series([1,2,3,4,5], index=list(string.ascii_lowercase[:5]))
In [138]: obj
Out[138]:
a 1
b 2
c 3
d 4
e 5
dtype: int64
In [100]: [string.ascii_lowercase[:5]]
Out[100]: ['abcde']
In [101]: list(string.ascii_lowercase[:5])
Out[101]: ['a', 'b', 'c', 'd', 'e']
In [139]: obj[0]
Out[139]: 1
In [140]: obj['a']
Out[140]: 1
In [141]: obj.index
Out[141]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
In [142]: obj.values
Out[142]: array([1, 2, 3, 4, 5])
In [150]: obj.tolist()
Out[150]: [1, 2, 3, 4, 5]
#Create a Series from a dict: the keys become the index
In [16]: t = pd.Series({'name':'ethan', 'career':'AI engineer', 'lover':'jacky'})
In [17]: t
Out[17]:
name ethan
career AI engineer
lover jacky
dtype: object
In [21]: t[0]
Out[21]: 'ethan'
In [22]: t['name']
Out[22]: 'ethan'
In [23]: t.index
Out[23]: Index(['name', 'career', 'lover'], dtype='object')
In [24]: t.values
Out[24]: array(['ethan', 'AI engineer', 'jacky'], dtype=object)
In [25]: t.values[0]
Out[25]: 'ethan'
In [28]: t4 = pd.Series(np.arange(10), index=list(string.ascii_lowercase[:10]))
In [29]: t4
Out[29]:
a 0
b 1
c 2
d 3
e 4
f 5
g 6
h 7
i 8
j 9
dtype: int64
In [36]: t4[0]
Out[36]: 0
In [37]: t4.astype(np.float32)
Out[37]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
f 5.0
g 6.0
h 7.0
i 8.0
j 9.0
dtype: float32
>>> t
A 0
B 1
C 2
D 3
E 4
F 5
G 6
H 7
I 8
J 9
dtype: int64
>>> t[2:10:2]
C 2
E 4
G 6
I 8
dtype: int64
>>> t[t>4]
F 5
G 6
H 7
I 8
J 9
dtype: int64
>>> t[[2,5,7]]
C 2
F 5
H 7
dtype: int64
#Note: in modern pandas, a list indexer containing a missing label ('g' here) raises KeyError instead of producing NaN
>>> t[['A','F','g']]
A 0.0
F 5.0
g NaN
dtype: float64
DataFrame: a tabular data structure that contains an ordered collection of columns, where each column can hold a different type of value (numeric, string, boolean, ...), which makes it a higher-level structure than a NumPy array. A DataFrame has both a row index and a column index, and can also be thought of as a dict of Series.
A DataFrame object has both a row index and a column index:
Row index: labels the rows, the horizontal direction; called index, axis 0 (axis=0).
Column index: labels the columns, the vertical direction; called columns, axis 1 (axis=1).
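The axis convention is easiest to remember through an aggregation. A minimal sketch (the small frame below is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=list("abc"), columns=list("WXYZ"))

# axis=0 collapses along the rows: one result per column.
col_sums = df.sum(axis=0)   # Series indexed by W, X, Y, Z
# axis=1 collapses along the columns: one result per row.
row_sums = df.sum(axis=1)   # Series indexed by a, b, c

print(col_sums.tolist())  # [12, 15, 18, 21]
print(row_sums.tolist())  # [6, 22, 38]
```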
>>> pd.DataFrame(np.arange(12).reshape(3,4))
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
>>> pd.DataFrame(np.arange(12).reshape(3,4),index=list("abc"),columns=list("WXYZ"))
W X Y Z
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
>>> d2 = [{"name":"xaoming","age":30,"tel":10086},{"name":"xiaogang","age":32,"tel":10010},{"name":"xiaoqiang","tel":191}]
>>> d2
[{'name': 'xaoming', 'age': 30, 'tel': 10086}, {'name': 'xiaogang', 'age': 32, 'tel': 10010}, {'name': 'xiaoqiang', 'tel': 191}]
>>> t2 = pd.DataFrame(d2)
>>> t2
name age tel
0 xaoming 30.0 10086
1 xiaogang 32.0 10010
2 xiaoqiang NaN 191
>>> t2.to_dict()
{'name': {0: 'xaoming', 1: 'xiaogang', 2: 'xiaoqiang'},
'age': {0: 30.0, 1: 32.0, 2: nan},
'tel': {0: 10086, 1: 10010, 2: 191}}
>>> t2.index
RangeIndex(start=0, stop=3, step=1)
>>> t2.columns
Index(['name', 'age', 'tel'], dtype='object')
>>> t2.values
array([['xaoming', 30.0, 10086],
['xiaogang', 32.0, 10010],
['xiaoqiang', nan, 191]], dtype=object)
>>> t2.shape
(3, 3)
>>> t2.dtypes
name object
age float64
tel int64
dtype: object
>>> t2.info()
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
name 3 non-null object
age 2 non-null float64
tel 3 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
>>> t2.describe()
age tel
count 2.000000 3.000000
mean 31.000000 6762.333333
std 1.414214 5691.068470
min 30.000000 191.000000
25% 30.500000 5100.500000
50% 31.000000 10010.000000
75% 31.500000 10048.000000
max 32.000000 10086.000000
In [235]: t2.head(2)
Out[235]:
name age tel
0 xaoming 30.0 10086
1 xiaogang 32.0 10010
In [236]: t2.tail(2)
Out[236]:
name age tel
1 xiaogang 32.0 10010
2 xiaoqiang NaN 191
In [238]: t2['name'][0]
Out[238]: 'xaoming'
In [239]: t2.sort_values(by="age")
Out[239]:
name age tel
0 xaoming 30.0 10086
1 xiaogang 32.0 10010
2 xiaoqiang NaN 191
#The default is ascending=True, i.e. sort in ascending order
In [240]: t2.sort_values(by="age", ascending = False)
Out[240]:
name age tel
1 xiaogang 32.0 10010
0 xaoming 30.0 10086
2 xiaoqiang NaN 191
In [243]: t1
Out[243]:
A B C D E F G H
a 0 1 2 3 4 5 6 7
b 8 9 10 11 12 13 14 15
c 16 17 18 19 20 21 22 23
d 24 25 26 27 28 29 30 31
e 32 33 34 35 36 37 38 39
f 40 41 42 43 44 45 46 47
In [97]: t1[:5]
Out[97]:
A B C D E F G H
a 0 1 2 3 4 5 6 7
b 8 9 10 11 12 13 14 15
c 16 17 18 19 20 21 22 23
d 24 25 26 27 28 29 30 31
e 32 33 34 35 36 37 38 39
In [98]: t1[:5]['A']
Out[98]:
a 0
b 8
c 16
d 24
e 32
Name: A, dtype: int64
In [99]: t1['A']
Out[99]:
a 0
b 8
c 16
d 24
e 32
f 2
Name: A, dtype: int64
In [100]: t1['A'][:5]
Out[100]:
a 0
b 8
c 16
d 24
e 32
Name: A, dtype: int64
df.loc selects rows by label
df.iloc selects rows by integer position
df.loc slices are closed on both ends: both the left and right endpoints are included
df.iloc slices follow the usual Python convention: the left endpoint is included, the right endpoint is excluded
In [111]: t1.loc["a"]
Out[111]:
A 0
B 1
C 2
D 3
E 4
F 5
G 6
H 7
Name: a, dtype: int64
In [112]: t1.loc["a", "A"]
Out[112]: 0
In [113]: t1.loc["a", ["A", "D"]]
Out[113]:
A 0
D 3
Name: a, dtype: int64
In [114]: t1.loc["a":"c", ["A", "D"]]
Out[114]:
A D
a 0 3
b 8 11
c 16 19
In [116]: t1.loc[["a","c"], ["A", "D"]]
Out[116]:
A D
a 0 3
c 16 19
In [118]: t1.iloc[1]
Out[118]:
A 8
B 9
C 10
D 11
E 12
F 13
G 14
H 15
Name: b, dtype: int64
In [119]: t1.iloc[1, 0]
Out[119]: 8
In [122]: t1.iloc[0:2, [0, 2,4]]
Out[122]:
A C E
a 0 2 4
b 8 10 12
In [123]: t1.dtypes
Out[123]:
A int64
B int64
C int64
D int64
E int64
F int64
G int64
H int64
dtype: object
#No explicit dtype conversion is needed and no error occurs: assigning NaN silently upcasts the affected int columns to float
In [124]: t1.loc['a':'c', ['A','D']] = np.nan
In [125]: t1
Out[125]:
A B C D E F G H
a NaN 1 2 NaN 4 5 6 7
b NaN 9 10 NaN 12 13 14 15
c NaN 17 18 NaN 20 21 22 23
d 24.0 25 26 27.0 28 29 30 31
e 32.0 33 34 35.0 36 37 38 39
f 2.0 41 42 43.0 44 45 46 47
In [126]: t1.dtypes
Out[126]:
A float64
B int64
C int64
D float64
E int64
F int64
G int64
H int64
dtype: object
Boolean indexing
In [133]: t1[t1['B']>20]
Out[133]:
A B C D E F G H
d NaN 25 26 27.0 NaN NaN 30 31
e 32.0 33 34 35.0 36.0 37.0 38 39
f 2.0 41 42 43.0 44.0 45.0 46 47
In [140]: (t1['B']>20) & (t1['B']<40)
Out[140]:
a False
b False
c False
d True
e True
f False
Name: B, dtype: bool
In [141]: t1[(t1['B']>20) & (t1['B']<40)]
Out[141]:
A B C D E F G H
d NaN 25 26 27.0 NaN NaN 30 31
e 32.0 33 34 35.0 36.0 37.0 38 39
In [144]: data
Out[144]:
{'name': ['lilei', 'hanmeimei', 'zhangwei'],
'age': [18, 17, 25],
'gender': ['male', 'female', 'unknown']}
In [145]: t2 = pd.DataFrame(data)
In [147]: t2
Out[147]:
name age gender
0 lilei 18 male
1 hanmeimei 17 female
2 zhangwei 25 unknown
In [148]: t2['name']
Out[148]:
0 lilei
1 hanmeimei
2 zhangwei
Name: name, dtype: object
In [150]: t2['name'].str.len()
Out[150]:
0 5
1 9
2 8
Name: name, dtype: int64
In [152]: t2[t2['name'].str.len()>5]
Out[152]:
name age gender
1 hanmeimei 17 female
2 zhangwei 25 unknown
In [177]: t2['name'].str.cat()
Out[177]: 'lileihanmeimeizhangwei'
In [153]: t2['name'].str.count('m')
Out[153]:
0 0
1 2
2 0
Name: name, dtype: int64
In [154]: t2['name'].str.contains('e')
Out[154]:
0 True
1 True
2 True
Name: name, dtype: bool
In [155]: t2['name'].str.count('e')
Out[155]:
0 1
1 2
2 1
Name: name, dtype: int64
In [156]: t2['name'].str.startswith('l')
Out[156]:
0 True
1 False
2 False
Name: name, dtype: bool
In [157]: t2['name'].str.get(3)
Out[157]:
0 e
1 m
2 n
Name: name, dtype: object
In [158]: t2['name'].str.upper()
Out[158]:
0 LILEI
1 HANMEIMEI
2 ZHANGWEI
Name: name, dtype: object
In [159]: t2['name'].str.repeat(3)
Out[159]:
0 lileilileililei
1 hanmeimeihanmeimeihanmeimei
2 zhangweizhangweizhangwei
Name: name, dtype: object
In [160]: t2['name'].str.replace('e', 'c')
Out[160]:
0 lilci
1 hanmcimci
2 zhangwci
Name: name, dtype: object
In [161]: a = [{'name': 'leon', 'actors': 'Ethan/Jacky/Kobe/Jordan'}, {'name': 'NBA', 'actors': 'Paul/Wall/Beal'}, {'name': 'Dunk', 'actors': 'Kevin/James/Durant'}]
In [162]: a
Out[162]:
[{'name': 'leon', 'actors': 'Ethan/Jacky/Kobe/Jordan'},
{'name': 'NBA', 'actors': 'Paul/Wall/Beal'},
{'name': 'Dunk', 'actors': 'Kevin/James/Durant'}]
In [163]: t3 = pd.DataFrame(a)
In [164]: t3
Out[164]:
name actors
0 leon Ethan/Jacky/Kobe/Jordan
1 NBA Paul/Wall/Beal
2 Dunk Kevin/James/Durant
In [165]: t3['actors'].str.split('/')
Out[165]:
0 [Ethan, Jacky, Kobe, Jordan]
1 [Paul, Wall, Beal]
2 [Kevin, James, Durant]
Name: actors, dtype: object
In [166]: t3['actors'].str.split('/').tolist()
Out[166]:
[['Ethan', 'Jacky', 'Kobe', 'Jordan'],
['Paul', 'Wall', 'Beal'],
['Kevin', 'James', 'Durant']]
Our data set is stored in a CSV file, so we can load it directly with pd.read_csv; a commonly used parameter is header=0, i.e. the first row is the header. The Excel equivalent is:
pd.read_excel(filepath, header=None)
The result differs a bit from what we might expect: we assumed it would be a Series, but it is a DataFrame, so let's get to know that type next.
>>> import pandas as pd
>>> df = pd.read_csv("./data/dogNames2.csv")
>>> print (df)
Row_Labels Count_AnimalName
0 1 1
1 2 2
2 40804 1
3 90201 1
4 90203 1
... ... ...
16215 37916 1
16216 38282 1
16217 38583 1
16218 38948 1
16219 39743 1
[16220 rows x 2 columns]
>>> print (type(df))
<class 'pandas.core.frame.DataFrame'>
In general a value of 0 is itself meaningful data.
How did we handle NaN values back in NumPy? In pandas it is very easy:
Detecting NaN: pd.isnull(df), pd.notnull(df)
Approach 1: drop the rows/columns containing NaN with dropna(axis=0, how='any', inplace=False)
Approach 2: fill in the data, e.g. t.fillna(t.mean()), t.fillna(t.median()), t.fillna(0)
Treating 0 as missing: t[t==0] = np.nan
Of course, not every 0 needs to be treated this way.
When computing means and similar statistics, NaN is excluded from the computation, but 0 is included.
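The difference between 0 and NaN in an aggregation can be seen in a minimal sketch (the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical scores where 0 actually means "missing", not a real value.
s = pd.Series([80.0, 90.0, 0.0, 70.0])

print(s.mean())   # 60.0 -- the 0 participates and drags the average down

# Replace 0 with NaN; NaN is excluded from mean() and other aggregations.
s[s == 0] = np.nan
print(s.mean())   # 80.0 -- computed over the three real values only
```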
>>> import numpy as np
>>> t3=pd.DataFrame(np.arange(12).reshape(3,4), index=list("abc"), columns=list("WXYZ"))
>>> t3
W X Y Z
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
>>> t3.iloc[1:2,2:] = np.nan
>>> t3
W X Y Z
a 0 1 2.0 3.0
b 4 5 NaN NaN
c 8 9 10.0 11.0
>>> pd.isnull(t3)
W X Y Z
a False False False False
b False False True True
c False False False False
>>> pd.notnull(t3)
W X Y Z
a True True True True
b True True False False
c True True True True
>>> pd.notnull(t3["Y"])
a True
b False
c True
>>> t3.loc[pd.notnull(t3["Y"])]
W X Y Z
a 0 1 2.0 3.0
c 8 9 10.0 11.0
>>> t3
W X Y Z
a 0 1 2.0 3.0
b 4 5 NaN NaN
c 8 9 10.0 11.0
>>> t3.dropna(axis=0)
W X Y Z
a 0 1 2.0 3.0
c 8 9 10.0 11.0
>>> t3.dropna(axis=1)
W X
a 0 1
b 4 5
c 8 9
#how defaults to "any": drop the row/column if it contains even one NaN; with how="all", drop it only if every value in that row/column is NaN
>>> t3.dropna(axis=1, how="any")
W X
a 0 1
b 4 5
c 8 9
>>> t3.dropna(axis=1, how="all")
W X Y Z
a 0 1 2.0 3.0
b 4 5 NaN NaN
c 8 9 10.0 11.0
#inplace=True modifies the object in place; the default is False, which returns a new object
>>> t3.dropna(axis=0,how="any",inplace=True)
>>> t3
W X Y Z
a 0 1 2.0 3.0
c 8 9 10.0 11.0
#(t3 has been re-assigned here with different NaN values)
>>> t3
W X Y Z
a 0 1 NaN NaN
b 4 5 NaN NaN
c 8 9 10.0 11.0
>>> t3.fillna(0)
W X Y Z
a 0 1 0.0 0.0
b 4 5 0.0 0.0
c 8 9 10.0 11.0
>>> t3.fillna(100)
W X Y Z
a 0 1 100.0 100.0
b 4 5 100.0 100.0
c 8 9 10.0 11.0
>>> t3
W X Y Z
a 0 1 NaN NaN
b 4 5 NaN NaN
c 8 9 10.0 11.0
>>> t3.mean()
W 4.0
X 5.0
Y 10.0
Z 11.0
dtype: float64
>>> t3.fillna(t3.mean())
W X Y Z
a 0 1 10.0 11.0
b 4 5 10.0 11.0
c 8 9 10.0 11.0
>>> t3["Y"].fillna(t3["Y"].mean())
a 10.0
b 10.0
c 10.0
Name: Y, dtype: float64
>>> t3["Y"] = t3["Y"].fillna(t3["Y"].mean())
>>> t3
W X Y Z
a 0 1 10.0 NaN
b 4 5 10.0 NaN
c 8 9 10.0 11.0
>>> t3
W X Y Z
a 0 1 10.0 NaN
b 4 5 10.0 NaN
c 8 9 10.0 11.0
#mean() is not affected by NaN
>>> t3["Z"].mean()
11.0
Note: after pandas drops missing data, the index is usually no longer contiguous, so it needs to be rebuilt:
reset_index() in pandas
During data cleaning, rows containing missing values get deleted, so the DataFrame or Series no longer has a contiguous index; reset_index() can be used to rebuild it.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(20).reshape(5,4), index=[1,3,4,6,8])
print(df)
    0   1   2   3
1   0   1   2   3
3   4   5   6   7
4   8   9  10  11
6  12  13  14  15
8  16  17  18  19
Rebuilding the index with reset_index():
print(df.reset_index())
   index   0   1   2   3
0      1   0   1   2   3
1      3   4   5   6   7
2      4   8   9  10  11
3      6  12  13  14  15
4      8  16  17  18  19
A new index is generated, and the old index is kept as a regular data column.
If you do not want to keep the old index, pass drop=True (the default is False).
print(df.reset_index(drop=True))
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
In [180]: t2
Out[180]:
name age gender
0 lilei 18 male
1 hanmeimei 17 female
2 zhangwei 25 unknown
In [181]: t2['age'].mean()
Out[181]: 20.0
In [182]: t2['age'].max()
Out[182]: 25
In [183]: t2['age'].argmax()
Out[183]: 2
In [184]: t2['age'].min()
Out[184]: 17
In [185]: t2['age'].argmin()
Out[185]: 1
In [186]: t2['age'].median()
Out[186]: 18.0
In [188]: t3
Out[188]:
name actors
0 leon Ethan/Jacky/Kobe/Jordan
1 NBA Paul/Wall/Beal
2 Dunk Kevin/James/Durant
In [189]: t3['actors'].str.split('/')
Out[189]:
0 [Ethan, Jacky, Kobe, Jordan]
1 [Paul, Wall, Beal]
2 [Kevin, James, Durant]
Name: actors, dtype: object
In [191]: a = t3['actors'].str.split('/').tolist()
In [193]: a
Out[193]:
[['Ethan', 'Jacky', 'Kobe', 'Jordan'],
['Paul', 'Wall', 'Beal'],
['Kevin', 'James', 'Durant']]
In [192]: print(len(set([i for j in a for i in j])))
10
In [195]: t2
Out[195]:
name age gender
0 lilei 18 male
1 hanmeimei 17 female
2 zhangwei 25 unknown
In [196]: t2.loc[2, 'gender']
Out[196]: 'unknown'
In [197]: t2.loc[2, 'gender'] = 'male'
In [198]: t2
Out[198]:
name age gender
0 lilei 18 male
1 hanmeimei 17 female
2 zhangwei 25 male
In [199]: t2['gender'].unique()
Out[199]: array(['male', 'female'], dtype=object)
Task: suppose we have data on the 1000 most popular movies from 2006 to 2016, and we want to know things like the average rating and the number of distinct directors. How do we get that information?
Data source: https://www.kaggle.com/damianpanek/sunday-eda/data
>>> df = pd.read_csv("data/IMDB-Movie-Data.csv")
>>> df
Rank Title Genre ... Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi ... 757074 333.13 76.0
1 2 Prometheus Adventure,Mystery,Sci-Fi ... 485820 126.46 65.0
2 3 Split Horror,Thriller ... 157606 138.12 62.0
3 4 Sing Animation,Comedy,Family ... 60545 270.32 59.0
4 5 Suicide Squad Action,Adventure,Fantasy ... 393727 325.02 40.0
.. ... ... ... ... ... ... ...
995 996 Secret in Their Eyes Crime,Drama,Mystery ... 27585 NaN 45.0
996 997 Hostel: Part II Horror ... 73152 17.54 46.0
997 998 Step Up 2: The Streets Drama,Music,Romance ... 70699 58.01 50.0
998 999 Search Party Adventure,Comedy ... 4881 NaN 22.0
999 1000 Nine Lives Comedy,Family,Fantasy ... 12435 19.64 11.0
[1000 rows x 12 columns]
>>> print(df.info())
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
Rank 1000 non-null int64
Title 1000 non-null object
Genre 1000 non-null object
Description 1000 non-null object
Director 1000 non-null object
Actors 1000 non-null object
Year 1000 non-null int64
Runtime (Minutes) 1000 non-null int64
Rating 1000 non-null float64
Votes 1000 non-null int64
Revenue (Millions) 872 non-null float64
Metascore 936 non-null float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB
None
>>> print(df.head(1))
Rank Title Genre ... Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi ... 757074 333.13 76.0
[1 rows x 12 columns]
#Average rating across all movies
>>> print(df["Rating"].mean())
6.723199999999999
#Number of distinct directors
>>> print(len(set(df["Director"].tolist())))
644
>>> print(len(df["Director"].unique()))
644
#Number of distinct actors
#First inspect the data type: it is a comma-separated string
>>> print(df["Actors"].head())
0 Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
1 Noomi Rapace, Logan Marshall-Green, Michael Fa...
2 James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
3 Matthew McConaughey,Reese Witherspoon, Seth Ma...
4 Will Smith, Jared Leto, Margot Robbie, Viola D...
Name: Actors, dtype: object
>>> tmp_list = df["Actors"].str.split(",").tolist()
#tmp_list is a list of lists
>>> print(tmp_list)
... ['Robert Hoffman', ' Briana Evigan', ' Cassie Ventura', ' Adam G. Sevani'], ['Adam Pally', ' T.J. Miller', ' Thomas Middleditch', 'Shannon Woodward'], ['Kevin Spacey', ' Jennifer Garner', ' Robbie Amell', 'Cheryl Hines']]
#Note this flattening idiom -- it is quite handy
>>> actor_list = [i for j in tmp_list for i in j]
>>> print (len(set(actor_list)))
Task: for this movie data set, if we want to see the distribution of runtimes, how should we present the data?
Code:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("./data/IMDB-Movie-Data.csv")
Runtime_list = df["Runtime (Minutes)"].tolist()
bin_width = 5
bins = (max(Runtime_list) - min(Runtime_list))//bin_width
print( bins)
plt.figure(figsize=(16, 9), dpi=80)
plt.hist(Runtime_list, bins)
plt.xticks(range(min(Runtime_list), max(Runtime_list)+bin_width, bin_width))
plt.savefig("./t1.png")
>>> df1
a b c d
A 1.0 1.0 1.0 1.0
B 1.0 1.0 1.0 1.0
>>> df2
x y z
A 0.0 0.0 0.0
B 0.0 0.0 0.0
C 0.0 0.0 0.0
>>> df1.join(df2)
a b c d x y z
A 1.0 1.0 1.0 1.0 0.0 0.0 0.0
B 1.0 1.0 1.0 1.0 0.0 0.0 0.0
>>> df2.join(df1)
x y z a b c d
A 0.0 0.0 0.0 1.0 1.0 1.0 1.0
B 0.0 0.0 0.0 1.0 1.0 1.0 1.0
C 0.0 0.0 0.0 NaN NaN NaN NaN
>>> df1
a b c d
A 1.0 1.0 1.0 1.0
B 1.0 1.0 1.0 1.0
>>> df3
f a x
X 0 1 2
Y 3 4 5
Z 6 7 8
>>> df1.merge(df3, on="a")
a b c d f x
0 1.0 1.0 1.0 1.0 0 2
1 1.0 1.0 1.0 1.0 0 2
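merge matches on column values rather than on the index, and its default is how="inner". A small sketch re-creating frames shaped like df1 and df3 above (an assumption about how they were built) shows the effect of how="outer":

```python
import numpy as np
import pandas as pd

# Re-created frames matching the example above (an assumption).
df1 = pd.DataFrame(np.ones((2, 4)), index=list("AB"), columns=list("abcd"))
df3 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   index=list("XYZ"), columns=list("fax"))

# Default how="inner": keep only key values present in both frames.
inner = df1.merge(df3, on="a")
print(inner.shape)   # (2, 6) -- only a == 1 matches

# how="outer" keeps every key from both sides, padding misses with NaN.
outer = df1.merge(df3, on="a", how="outer")
print(outer.shape)   # (4, 6) -- a == 1 twice, plus unmatched a == 4 and a == 7
```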
Grouping:
pandas gives us a very simple way to perform this kind of group-by operation:
df.groupby(by="columns_name")
In [230]: t2
Out[230]:
name age gender
a lilei 18 male
b hanmeimei 17 female
c zhangwei 25 male
In [238]: t2.sort_values(by="age")
Out[238]:
name age gender
b hanmeimei 17 female
a lilei 18 male
c zhangwei 25 male
In [233]: t2.groupby(by="gender")
Out[233]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>
In [234]: for i in t2.groupby(by="gender"):
...: print (i)
('female', name age gender
b hanmeimei 17 female)
('male', name age gender
a lilei 18 male
c zhangwei 25 male)
In [240]: t2.groupby(by="gender").count()
Out[240]:
name age
gender
female 1 1
male 2 2
In [241]: t2.groupby(by="gender").count()["name"]
Out[241]:
gender
female 1
male 2
Name: name, dtype: int64
In [242]: t2.groupby(by="gender")["name"].count()
Out[242]:
gender
female 1
male 2
Name: name, dtype: int64
In [245]: t2.groupby(by="gender")["name"].count()["male"]
Out[245]: 2
In [246]: t2.groupby(by="gender").count()["name"]["male"]
Out[246]: 2
In [262]: t2[t2['gender']=="male"].groupby(by="name").count()
Out[262]:
age gender
name
lilei 1 1
zhangwei 1 1
In [267]: t2[t2['gender']=="male"].groupby(by="name")["name"].count().sort_values(ascending= False)[:50]
Out[267]:
name
zhangwei 1
lilei 1
Name: name, dtype: int64
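Besides count(), a grouped column can be aggregated in several ways at once with agg(). A sketch re-creating t2 from the session above (an assumption about how it was built):

```python
import pandas as pd

t2 = pd.DataFrame({"name": ["lilei", "hanmeimei", "zhangwei"],
                   "age": [18, 17, 25],
                   "gender": ["male", "female", "male"]},
                  index=list("abc"))

# agg() applies several aggregation functions in one pass,
# producing one column per function.
stats = t2.groupby("gender")["age"].agg(["count", "mean", "max"])
print(stats.loc["male", "mean"])     # 21.5
print(stats.loc["female", "count"])  # 1
```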
First, a quick review of time formatting in Python:
strftime() formats a given struct_time (defaulting to the current time) according to a format string:
>>> from datetime import datetime
>>> datetime.now()
datetime.datetime(2020, 2, 7, 11, 40, 50, 266586)
>>> datetime.now().strftime('%Y-%m-%d %H:%M:%S %a')
'2020-02-07 11:40:53 Fri'
>>> datetime.now().strftime('%y-%m-%d %H:%M:%S %a')
'20-02-07 11:43:17 Fri'
>>> datetime.now().strftime('%y-%m-%d %H:%M:%S %A')
'20-02-07 11:43:48 Friday'
Date/time format codes in Python:
%y two-digit year (00-99)
%Y four-digit year (0000-9999)
%m month (01-12)
%d day of the month (01-31)
%H hour, 24-hour clock (00-23)
%I hour, 12-hour clock (01-12)
%M minute (00-59)
%S second (00-59)
%a locale's abbreviated weekday name
%A locale's full weekday name
%b locale's abbreviated month name
%B locale's full month name
%c locale's date and time representation
%j day of the year (001-366)
%p locale's equivalent of A.M./P.M.
%U week number of the year (00-53), with Sunday as the first day of the week
%w weekday as a number (0-6), with Sunday as 0
%W week number of the year (00-53), with Monday as the first day of the week
%x locale's date representation
%X locale's time representation
%Z name of the current time zone
%% a literal % character
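strptime() is the inverse operation: it parses a string into a datetime using the same format codes. A quick sketch:

```python
from datetime import datetime

# Parse a string with the same codes strftime() uses for formatting.
dt = datetime.strptime("2020-02-07 11:40:53", "%Y-%m-%d %H:%M:%S")
print(dt.year, dt.month, dt.day)   # 2020 2 7
print(dt.strftime("%A"))           # Friday
```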
>>> pd.date_range(start="20171230", end="20180131", freq="D")
DatetimeIndex(['2017-12-30', '2017-12-31', '2018-01-01', '2018-01-02',
'2018-01-03', '2018-01-04', '2018-01-05', '2018-01-06',
'2018-01-07', '2018-01-08', '2018-01-09', '2018-01-10',
'2018-01-11', '2018-01-12', '2018-01-13', '2018-01-14',
'2018-01-15', '2018-01-16', '2018-01-17', '2018-01-18',
'2018-01-19', '2018-01-20', '2018-01-21', '2018-01-22',
'2018-01-23', '2018-01-24', '2018-01-25', '2018-01-26',
'2018-01-27', '2018-01-28', '2018-01-29', '2018-01-30',
'2018-01-31'],
dtype='datetime64[ns]', freq='D')
>>> pd.date_range(start="20171230", end="20180131", freq="10D")
DatetimeIndex(['2017-12-30', '2018-01-09', '2018-01-19', '2018-01-29'], dtype='datetime64[ns]', freq='10D')
>>> pd.date_range(start="20171230", periods=10, freq="10D")
DatetimeIndex(['2017-12-30', '2018-01-09', '2018-01-19', '2018-01-29',
'2018-02-08', '2018-02-18', '2018-02-28', '2018-03-10',
'2018-03-20', '2018-03-30'],
dtype='datetime64[ns]', freq='10D')
>>> pd.date_range(start="20171230", periods=10, freq="M")
DatetimeIndex(['2017-12-31', '2018-01-31', '2018-02-28', '2018-03-31',
'2018-04-30', '2018-05-31', '2018-06-30', '2018-07-31',
'2018-08-31', '2018-09-30'],
dtype='datetime64[ns]', freq='M')
>>> pd.date_range(start="20171230", periods=10, freq="T")
DatetimeIndex(['2017-12-30 00:00:00', '2017-12-30 00:01:00',
'2017-12-30 00:02:00', '2017-12-30 00:03:00',
'2017-12-30 00:04:00', '2017-12-30 00:05:00',
'2017-12-30 00:06:00', '2017-12-30 00:07:00',
'2017-12-30 00:08:00', '2017-12-30 00:09:00'],
dtype='datetime64[ns]', freq='T')
>>> index=pd.date_range("20170101",periods=10)
>>> df = pd.DataFrame(np.random.rand(10),index=index)
>>> df
0
2017-01-01 0.094399
2017-01-02 0.923081
2017-01-03 0.980860
2017-01-04 0.167984
2017-01-05 0.504205
2017-01-06 0.921958
2017-01-07 0.881825
2017-01-08 0.405544
2017-01-09 0.196156
2017-01-10 0.347028
>>> pd.date_range(start="2017/12/30 10:10:10", periods=10, freq="T")
DatetimeIndex(['2017-12-30 10:10:10', '2017-12-30 10:11:10',
'2017-12-30 10:12:10', '2017-12-30 10:13:10',
'2017-12-30 10:14:10', '2017-12-30 10:15:10',
'2017-12-30 10:16:10', '2017-12-30 10:17:10',
'2017-12-30 10:18:10', '2017-12-30 10:19:10'],
dtype='datetime64[ns]', freq='T')
Resampling: converting a time series from one frequency to another. Converting high-frequency data to a lower frequency is called downsampling; converting low-frequency data to a higher frequency is upsampling.
pandas provides the resample method to perform this frequency conversion for us.
In [273]: t = pd.DataFrame(np.random.uniform(10, 50, (100, 1)), index= pd.date_range("20170101", periods=100))
In [274]: t
Out[274]:
0
2017-01-01 33.534974
2017-01-02 15.102734
2017-01-03 28.826946
2017-01-04 10.063349
2017-01-05 20.740533
... ...
2017-04-06 28.283353
2017-04-07 41.277973
2017-04-08 13.308545
2017-04-09 36.027962
2017-04-10 40.969535
[100 rows x 1 columns]
In [278]: t.resample("10D").mean()
Out[278]:
0
2017-01-01 27.914261
2017-01-11 28.681582
2017-01-21 31.130641
2017-01-31 30.972632
2017-02-10 31.600134
2017-02-20 29.230198
2017-03-02 30.104250
2017-03-12 30.276918
2017-03-22 26.510953
2017-04-01 31.701654
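The example above is downsampling. Upsampling goes the other way: resampling to a higher frequency creates new, empty slots that must be filled, e.g. with ffill(). A minimal sketch on a made-up three-day series:

```python
import pandas as pd

# A tiny daily series (made-up values) to demonstrate upsampling.
t = pd.Series([1.0, 2.0, 3.0],
              index=pd.date_range("20170101", periods=3, freq="D"))

# Resampling daily data to 12-hour frequency inserts new timestamps;
# ffill() fills each new slot with the last known value.
up = t.resample("12H").ffill()
print(up.tolist())   # [1.0, 1.0, 2.0, 2.0, 3.0]
```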
>>> import pandas as pd
>>> df = pd.read_csv("./data/911.csv")
>>> df["timeStamp"]
0 2015-12-10 17:10:52
1 2015-12-10 17:29:21
2 2015-12-10 14:39:21
3 2015-12-10 16:47:36
4 2015-12-10 16:56:52
...
249732 2017-09-20 19:38:35
249733 2017-09-20 19:37:39
249734 2017-09-20 19:42:36
249735 2017-09-20 19:42:05
249736 2017-09-20 19:42:29
Name: timeStamp, Length: 249737, dtype: object
>>> df["timeStamp"] = pd.to_datetime(df["timeStamp"])
>>> df["timeStamp"]
0 2015-12-10 17:10:52
1 2015-12-10 17:29:21
2 2015-12-10 14:39:21
3 2015-12-10 16:47:36
4 2015-12-10 16:56:52
...
249732 2017-09-20 19:38:35
249733 2017-09-20 19:37:39
249734 2017-09-20 19:42:36
249735 2017-09-20 19:42:05
249736 2017-09-20 19:42:29
Name: timeStamp, Length: 249737, dtype: datetime64[ns]
Notice that after the conversion the dtype has changed;
Comprehensive exercise:
We now have 250,000 emergency 911 call records from 2015 to 2017.
Data source: https://www.kaggle.com/mchirico/montcoalert/data
1. Count how the number of 911 calls changes from month to month
Code:
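Once the column is datetime64, the .dt accessor exposes the date components, which is what makes time-based analysis possible. A quick sketch on two made-up timestamps:

```python
import pandas as pd

# A datetime64 Series built from strings (made-up values).
ts = pd.to_datetime(pd.Series(["2015-12-10 17:10:52",
                               "2017-09-20 19:38:35"]))

print(ts.dt.year.tolist())    # [2015, 2017]
print(ts.dt.month.tolist())   # [12, 9]
print(ts.dt.hour.tolist())    # [17, 19]
```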
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("./data/911.csv")
df["timeStamp"] = pd.to_datetime(df["timeStamp"])
df["cate"] = pd.DataFrame([i[0] for i in df["title"].str.split(":")])
df = df.set_index("timeStamp")
resampled = df["lng"].resample("M").count()
_x = resampled.index
_y = resampled.values
plt.figure(figsize=(16,9), dpi=80)
plt.bar(range(len(_y)), _y)
plt.xticks(range(len(_y)), _x.strftime('%Y-%m'), rotation=45)
plt.savefig("t148_1.png")
Series.apply()
Back to the main topic: pandas' apply() function works on a Series or on a whole DataFrame. It automatically walks over every element of the Series or DataFrame and runs the given function on each one.
As an example, suppose we have a set of student exam scores:
Name  Nationality  Score
张    汉           400
李    回           450
王    汉           460
If a student's nationality is not 汉 (Han), 5 bonus points are added to the exam score. We will do this calculation with pandas by adding a column to the DataFrame. (If we only wanted the result, numpy.where() would be simpler; the point here is to demonstrate how Series.apply() works.)
import pandas as pd
df = pd.read_csv("studuent-score.csv")
df['ExtraScore'] = df['Nationality'].apply(lambda x : 5 if x != '汉' else 0)
df['TotalScore'] = df['Score'] + df['ExtraScore']
For the Nationality column, pandas visits each value, runs the anonymous lambda function on it, and returns the results in a new Series. In a jupyter notebook, the code above displays:
   Name Nationality  Score  ExtraScore  TotalScore
0  张    汉            400    0           400
1  李    回            450    5           455
2  王    汉            460    0           460
apply() can of course also run Python built-in functions. For instance, to get the length of each string in the Name column:
df['NameLength'] = df['Name'].apply(len)
DataFrame.apply()
DataFrame.apply() visits every element and runs the given function on it. For example:
import pandas as pd
import numpy as np
matrix = [ [1,2,3], [4,5,6], [7,8,9] ]
df = pd.DataFrame(matrix, columns=list('xyz'), index=list('abc'))
df.apply(np.square)
x y z
a 1 4 9
b 16 25 36
c 49 64 81
To restrict apply() to particular rows or columns, use the name attribute of the row or column it receives. The following example squares only the x column:
df.apply(lambda x : np.square(x) if x.name=='x' else x)
x y z
a 1 2 3
b 16 5 6
c 49 8 9
The next example squares the x and y columns:
df.apply(lambda x : np.square(x) if x.name in ['x', 'y'] else x)
x y z
a 1 4 3
b 16 25 6
c 49 64 9
The next example squares only the first row (the row with label a):
df.apply(lambda x : np.square(x) if x.name == 'a' else x, axis=1)
x y z
a 1 4 9
b 4 5 6
c 7 8 9
By default axis=0, which means column-wise; axis=1 means row-wise.
An apply() example: computing date differences
Date arithmetic comes up all the time, e.g. computing the interval between two dates. Take the following WBS start/end dates:
wbs date_from date_to
job1 2019-04-01 2019-05-01
job2 2019-04-07 2019-05-17
job3 2019-05-16 2019-05-31
job4 2019-05-20 2019-06-11
Suppose we want the number of days between the start and end dates. The simplest approach is to subtract the two columns (as datetime):
import pandas as pd
import datetime as dt
wbs = {
    "wbs": ["job1", "job2", "job3", "job4"],
    "date_from": ["2019-04-01", "2019-04-07", "2019-05-16", "2019-05-20"],
    "date_to": ["2019-05-01", "2019-05-17", "2019-05-31", "2019-06-11"]
}
df = pd.DataFrame(wbs)
df['elapsed'] = df['date_to'].apply(pd.to_datetime) - df['date_from'].apply(pd.to_datetime)
Here apply() converts the date_from and date_to columns to datetime. Printing df gives:
    wbs  date_from    date_to  elapsed
0  job1 2019-04-01 2019-05-01  30 days
1  job2 2019-04-07 2019-05-17  40 days
2  job3 2019-05-16 2019-05-31  15 days
3  job4 2019-05-20 2019-06-11  22 days
The intervals have been computed, but each carries the unit days: subtracting two datetime values yields the timedelta64 type. To get a bare number, we still need to convert via the timedelta's days attribute:
elapsed= df['date_to'].apply(pd.to_datetime) - df['date_from'].apply(pd.to_datetime)
df['elapsed'] = elapsed.apply(lambda x : x.days)
The same result can be achieved with DataFrame.apply(). We first need to define a function get_interval_days(); its first argument is a Series, and when apply() runs it receives each row of the DataFrame in turn.
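A sketch of how that get_interval_days() function might look, re-creating the wbs frame from above:

```python
import pandas as pd

wbs = {
    "wbs": ["job1", "job2", "job3", "job4"],
    "date_from": ["2019-04-01", "2019-04-07", "2019-05-16", "2019-05-20"],
    "date_to": ["2019-05-01", "2019-05-17", "2019-05-31", "2019-06-11"]
}
df = pd.DataFrame(wbs)

# Called once per row by apply(axis=1); `row` is a Series whose
# labels are the column names.
def get_interval_days(row):
    delta = pd.to_datetime(row["date_to"]) - pd.to_datetime(row["date_from"])
    return delta.days

df["elapsed"] = df.apply(get_interval_days, axis=1)
print(df["elapsed"].tolist())   # [30, 40, 15, 22]
```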