Python Data Analysis Lesson 4: Data Processing (Merging, Filtering, Sorting)

I. Data Merging

Merging data mainly involves two operations:

  • Concatenation: pd.concat() joins multiple DataFrame objects along an axis to form a new DataFrame
  • Merging: pd.merge() joins rows of different DataFrames based on one or more keys

1. Concatenation with pd.concat()

pd.concat() can combine objects along either axis. Its common parameters are:

pd.concat(objs, axis=0, join='outer')

  • objs: a sequence (e.g. a list) of Series or DataFrame objects
  • axis: the axis to concatenate along; 0 for rows, 1 for columns; defaults to 0
  • join: how to handle the other axis, 'inner' or 'outer'; defaults to 'outer'
import pandas as pd

dict_1 = {
         'A':['A0','A1','A2','A3'],
         'B':['B0','B1','B2','B3'],
         'C':['C0','C1','C2','C3']}
df1 = pd.DataFrame(dict_1)
df1
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
3 A3 B3 C3
dict_2 = {
         'B':['B0','B1','B2','B3'],
         'C':['C0','C1','C2','C3'],
         'D':['D0','D1','D2','D3']}
df2 = pd.DataFrame(dict_2)
df2
B C D
0 B0 C0 D0
1 B1 C1 D1
2 B2 C2 D2
3 B3 C3 D3

When df1 and df2 are concatenated with pd.concat()'s default parameters:

df3 = pd.concat([df1,df2], axis=0, join='outer')
df3
A B C D
0 A0 B0 C0 NaN
1 A1 B1 C1 NaN
2 A2 B2 C2 NaN
3 A3 B3 C3 NaN
0 NaN B0 C0 D0
1 NaN B1 C1 D1
2 NaN B2 C2 D2
3 NaN B3 C3 D3

The result shows that with join='outer' and axis=0, the tables are stacked vertically, the columns are unioned with missing values filled by NaN, and the original row indexes are preserved.

If neither table's index carries real meaning, set the ignore_index parameter to True to build a fresh index:


df4 = pd.concat([df1,df2],axis=0,join='outer',ignore_index=True)
df4
A B C D
0 A0 B0 C0 NaN
1 A1 B1 C1 NaN
2 A2 B2 C2 NaN
3 A3 B3 C3 NaN
4 NaN B0 C0 D0
5 NaN B1 C1 D1
6 NaN B2 C2 D2
7 NaN B3 C3 D3

When the axis parameter of concat() is set to 1, merging df1 and df2 gives:

df5 = pd.concat([df1,df2],axis=1,join='outer')
df5
A B C B C D
0 A0 B0 C0 B0 C0 D0
1 A1 B1 C1 B1 C1 D1
2 A2 B2 C2 B2 C2 D2
3 A3 B3 C3 B3 C3 D3
When the join parameter of concat() is set to 'inner', merging df1 and df2 gives:
df6 = pd.concat([df1,df2],axis=0,join='inner')
df6
B C
0 B0 C0
1 B1 C1
2 B2 C2
3 B3 C3
0 B0 C0
1 B1 C1
2 B2 C2
3 B3 C3

The results above show that:

  • with join='inner', the result keeps the intersection of the two tables' columns;
  • with join='outer', the result keeps the union of the two tables' columns.
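The same inner/outer distinction applies along axis=1, where alignment happens on the row index instead of the columns. A minimal sketch on toy data (not the df1/df2 above):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'C': ['C0', 'C2']}, index=[0, 2])

# axis=1 places the frames side by side, aligning on the row index;
# join='inner' keeps only the index labels present in both frames
inner = pd.concat([df1, df2], axis=1, join='inner')
print(inner)
```

Here only index label 0 is shared, so the result has a single row with columns A, B, and C.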

2. Merging with pd.merge()

The common parameters of pd.merge() are:

pd.merge(left, right, how='inner', on=None)

Parameters:

  • left and right: the two DataFrames to merge
  • how: the join type, one of inner, left, right, outer; defaults to inner
  • on: the column name(s) to join on, which must exist in both DataFrames; if not specified (and no other key parameter is given), the intersection of the two DataFrames' column names is used as the join key
import pandas as pd
left = pd.DataFrame({'key':['a','b','b','d'],'data1':range(4)})
left
key data1
0 a 0
1 b 1
2 b 2
3 d 3
right = pd.DataFrame({'key':['a','b','c'],'data2':range(3)})
right
key data2
0 a 0
1 b 1
2 c 2
df7 = pd.merge(left,right)
df7
key data1 data2
0 a 0 0
1 b 1 1
2 b 2 1

merge() defaults to an inner join and uses the intersection of the two DataFrames' column names (here, key) as the join key; accordingly, the merged result contains only the intersection of the two DataFrames' key values.

When the two DataFrames are joined with how='outer':

df8= pd.merge(left,right,on=['key'],how='outer')
df8
key data1 data2
0 a 0.0 0.0
1 b 1.0 1.0
2 b 2.0 1.0
3 d 3.0 NaN
4 c NaN 2.0

With an outer join, merge() keeps the union of the two DataFrames' key values, and missing entries are filled with NaN.
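When inspecting an outer join, it can help to know which table each row came from; merge() accepts indicator=True for this. A small sketch on toy data:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'd'], 'data1': range(3)})
right = pd.DataFrame({'key': ['a', 'c'], 'data2': range(2)})

# indicator=True adds a '_merge' column marking each row's origin:
# 'both', 'left_only', or 'right_only'
merged = pd.merge(left, right, on='key', how='outer', indicator=True)
print(merged[['key', '_merge']])
```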

When the two DataFrames are joined with how='left':

df9 = pd.merge(left,right,on=['key'],how='left')
df9
key data1 data2
0 a 0 0.0
1 b 1 1.0
2 b 2 1.0
3 d 3 NaN

With a left join, merge() takes the left DataFrame's join keys as the reference and merges both tables' columns accordingly; missing entries are filled with NaN.

So what happens when the two DataFrames are joined with how='right'?

df10= pd.merge(left,right,on=['key'],how='right')
df10
key data1 data2
0 a 0.0 0
1 b 1.0 1
2 b 2.0 1
3 c NaN 2

Typical scenarios for the two merging approaches:

  • Two tables store the transaction records for September and October respectively; concat() can stack them along axis 0
  • One table holds order information (order ID, amount, customer ID, etc.) and another holds customer information (customer ID, name, phone number, etc.); merge() can join the two into one complete table on the customer ID
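The second scenario can be sketched as follows; the column names (order_id, amount, customer_id, name) are made up for illustration:

```python
import pandas as pd

# Hypothetical orders table: one row per order
orders = pd.DataFrame({'order_id': [1, 2, 3],
                       'amount': [100, 250, 80],
                       'customer_id': ['c1', 'c2', 'c1']})
# Hypothetical customers table: one row per customer
customers = pd.DataFrame({'customer_id': ['c1', 'c2'],
                          'name': ['Alice', 'Bob']})

# A left join keeps one row per order, enriched with customer details
full = pd.merge(orders, customers, on='customer_id', how='left')
print(full)
```

Each order row now carries the matching customer's name; a customer who placed several orders simply appears on several rows.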

II. Data Filtering

Previously we learned how to retrieve a single record or several consecutive records, but in real work we often handle tens of thousands of records or more.
So how can we quickly filter out the rows that meet a given condition?

import pandas as pd

# Load the data
df = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
# Inspect the basic shape of the data
print(df.shape)
df.head()
(72756, 7)
_id 关注的收藏夹 关注 关注者 关注的问题 关注的话题 关注的专栏
0 587598f89f11daf90617fb7a 52 17 1 30 58 2
1 587598f89f11daf90617fb7c 27 73 15 87 26 1
2 587598f89f11daf90617fb7e 72 94 1 112 20 4
3 587598f89f11daf90617fb80 174 84 8 895 30 7
4 587598f89f11daf90617fb82 3 236 64 119 44 17

1. Retrieve the data of users with more than 100 followers (关注者)

bools = df['关注者']>100
bools
0        False
1        False
2        False
3        False
4        False
         ...  
72751    False
72752    False
72753    False
72754    False
72755    False
Name: 关注者, Length: 72756, dtype: bool
df1 = df[bools]
print(df1.shape)
df1
(5501, 7)
_id 关注的收藏夹 关注 关注者 关注的问题 关注的话题 关注的专栏
79 587598f89f11daf90617fc18 8 111 3049 1216 45 17
101 587598f89f11daf90617fc44 61 123 205 670 24 18
112 587598f99f11daf90617fc5a 1 804 197 830 39 26
114 587598f99f11daf90617fc5e 44 62 201 314 8 3
121 587598f99f11daf90617fc6c 44 628 6597 1566 144 69
... ... ... ... ... ... ... ...
72694 5878399e9f11da09c8fe690e 3 172 219 466 68 11
72700 5878399e9f11da09c8fe691a 21 39 281 51 45 2
72713 5878399e9f11da09c8fe6934 49 635 248 1772 151 123
72726 5878399e9f11da09c8fe694e 447 62 728 14 1 21
72741 5878399e9f11da09c8fe696c 3546 390 142 25 4 126

5501 rows × 7 columns

Let's review the logic of the code above:

  1. First, check whether each user's follower count is greater than 100; a row is marked True if it is, and False otherwise. bools records whether each row meets the condition: it is a Series whose values are of bool type.
bools = df['关注者']>100
  2. Then, filter df by the value of bools in each row: rows marked True are kept, the rest are dropped.
df1 = df[bools]
  3. The printed df1 is exactly the data of users with more than 100 followers. This is the basic logic of filtering in pandas by a column's values.
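The steps above are usually written as a single expression. A minimal sketch on toy data (the column name followers is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'followers': [50, 120, 300, 8]})

# Steps 1-3 combined: build the boolean Series and index with it directly
result = df[df['followers'] > 100]
print(result)
```

Only the rows where the condition holds (here, index labels 1 and 2) survive the filter.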

That is the logic of filtering on one column's values. What if several conditions must hold at once?

2. Retrieve the users who have more than 300 followers (关注者) and follow more than 100 accounts (关注)

import pandas as pd
# Load the data
df = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
# Inspect the data
df
_id 关注的收藏夹 关注 关注者 关注的问题 关注的话题 关注的专栏
0 587598f89f11daf90617fb7a 52 17 1 30 58 2
1 587598f89f11daf90617fb7c 27 73 15 87 26 1
2 587598f89f11daf90617fb7e 72 94 1 112 20 4
3 587598f89f11daf90617fb80 174 84 8 895 30 7
4 587598f89f11daf90617fb82 3 236 64 119 44 17
... ... ... ... ... ... ... ...
72751 5878399e9f11da09c8fe6980 112 77 27 39 2 32
72752 5878399e9f11da09c8fe6982 15 90 16 61 116 2
72753 5878399e9f11da09c8fe6984 7 19 1 25 12 2
72754 5878399e9f11da09c8fe6986 14 320 3 68 38 9
72755 5878399e9f11da09c8fe6988 7 1 2 7 15 0

72756 rows × 7 columns

bool_1 = df['关注者']>300
bool_2 = df['关注']>100
df2 = df[bool_1 & bool_2]
df2
_id 关注的收藏夹 关注 关注者 关注的问题 关注的话题 关注的专栏
79 587598f89f11daf90617fc18 8 111 3049 1216 45 17
121 587598f99f11daf90617fc6c 44 628 6597 1566 144 69
155 587598f99f11daf90617fcb0 194 250 1103 10 1 19
228 587598f99f11daf90617fd42 48 224 1450 360 128 20
261 587598f99f11daf90617fd84 1125 842 24848 108 33 27
... ... ... ... ... ... ... ...
72173 5878399b9f11da09c8fe64fc 38 5000 1386 2205 25 6
72416 5878399d9f11da09c8fe66e2 661 299 457 32 4 9
72463 5878399d9f11da09c8fe6740 274 209 1556 2405 188 318
72648 5878399e9f11da09c8fe68b2 10 278 471 763 66 12
72661 5878399e9f11da09c8fe68cc 756 491 414 229 2 58

1921 rows × 7 columns

Reviewing the code above:

  1. The two conditions df['关注者']>300 and df['关注']>100 yield the two boolean Series bool_1 and bool_2
  2. Since both conditions must hold simultaneously, the two Series are combined with the logical operator &
  3. Finally, the rows that satisfy both conditions are retrieved
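One pitfall worth noting when combining conditions inline: & binds more tightly than comparison operators, so each condition needs its own parentheses, and Python's and/or keywords do not work on Series. A sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'followers': [350, 500, 50],
                   'following': [120, 80, 200]})

# Each comparison must be parenthesized: without the parentheses,
# 300 & df['following'] would be evaluated first and raise an error.
df_both = df[(df['followers'] > 300) & (df['following'] > 100)]
print(df_both)
```

Only the first row satisfies both conditions. Use | for logical OR and ~ for negation, with the same parenthesization rule.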

3. A second way to filter data

The same filtering can be achieved with the loc[] accessor.

import pandas as pd

# Load the data
df = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
# Inspect the basic shape of the data
print(df.shape)
df.head()
(72756, 7)
_id 关注的收藏夹 关注 关注者 关注的问题 关注的话题 关注的专栏
0 587598f89f11daf90617fb7a 52 17 1 30 58 2
1 587598f89f11daf90617fb7c 27 73 15 87 26 1
2 587598f89f11daf90617fb7e 72 94 1 112 20 4
3 587598f89f11daf90617fb80 174 84 8 895 30 7
4 587598f89f11daf90617fb82 3 236 64 119 44 17
# Retrieve the users with more than 100 followers
df_new = df.loc[df['关注者']>100]
df_new
_id 关注的收藏夹 关注 关注者 关注的问题 关注的话题 关注的专栏
79 587598f89f11daf90617fc18 8 111 3049 1216 45 17
101 587598f89f11daf90617fc44 61 123 205 670 24 18
112 587598f99f11daf90617fc5a 1 804 197 830 39 26
114 587598f99f11daf90617fc5e 44 62 201 314 8 3
121 587598f99f11daf90617fc6c 44 628 6597 1566 144 69
... ... ... ... ... ... ... ...
72694 5878399e9f11da09c8fe690e 3 172 219 466 68 11
72700 5878399e9f11da09c8fe691a 21 39 281 51 45 2
72713 5878399e9f11da09c8fe6934 49 635 248 1772 151 123
72726 5878399e9f11da09c8fe694e 447 62 728 14 1 21
72741 5878399e9f11da09c8fe696c 3546 390 142 25 4 126

5501 rows × 7 columns

III. Data Sorting

Sorting is another task that comes up constantly when working with data. For example: find the information of the 10 users with the most followers.

[Figure 1]

Data can be sorted with the sort_index() and sort_values() methods, both of which are supported by Series and DataFrame alike.

So how do these two DataFrame methods differ?

  • sort_index() sorts the rows by the index
  • sort_values() sorts by the values of one or more specified columns
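The difference can be seen on a three-row toy frame:

```python
import pandas as pd

s = pd.DataFrame({'score': [3, 1, 2]}, index=['c', 'a', 'b'])

by_index = s.sort_index()          # rows ordered by index label: a, b, c
by_value = s.sort_values('score')  # rows ordered by the 'score' column: 1, 2, 3

print(by_index)
print(by_value)
```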

1. Sorting by country code (the row index)

import pandas as pd

people = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx',index_col='Country Code')
people
CountryName 1990 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 Change 1990-2015 Change 2007-2015
Country Code
AFG Afghanistan 101.094930 103.254202 100.000371 100.215886 100.060480 99.459839 97.667911 95.312707 92.602785 89.773777 86.954464 -14.140466 -13.045907
ALB Albania 61.808311 59.585866 50.862987 49.663787 48.637067 NaN 46.720288 45.835739 45.247477 44.912168 44.806973 -17.001338 -6.056014
DZA Algeria 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
ASM American Samoa NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ADO Andorra NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
VIR Virgin Islands (U.S.) 53.546901 52.647183 49.912779 50.459425 51.257336 52.382523 53.953515 55.687666 57.537152 59.399244 61.199651 7.652751 11.286872
WBG West Bank and Gaza 102.789182 100.410470 88.367099 86.204936 84.173732 82.333762 80.851747 79.426778 78.118403 76.975462 76.001869 -26.787313 -12.365230
YEM Yemen, Rep. 118.779727 105.735754 88.438350 85.899643 83.594489 81.613924 80.193948 78.902603 77.734373 76.644268 75.595147 -43.184581 -12.843203
ZMB Zambia 100.485263 97.124776 98.223058 98.277253 98.260795 98.148325 97.854237 97.385802 96.791310 96.122165 95.402326 -5.082936 -2.820732
ZWE Zimbabwe 96.365244 84.877487 81.272166 81.024020 80.934968 80.985702 80.740494 80.579870 80.499816 80.456439 80.391033 -15.974211 -0.881133

219 rows × 14 columns

people.sort_index(inplace=True,ascending=True)
people.head()
CountryName 1990 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 Change 1990-2015 Change 2007-2015
Country Code
ABW Aruba 47.500831 44.420778 43.475981 43.957839 44.531061 45.157235 44.993173 44.706680 44.355302 44.070252 43.952390 -3.548441 0.476409
ADO Andorra NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
AFG Afghanistan 101.094930 103.254202 100.000371 100.215886 100.060480 99.459839 97.667911 95.312707 92.602785 89.773777 86.954464 -14.140466 -13.045907
AGO Angola 101.394722 100.930475 102.563811 102.609186 102.428788 102.035690 102.106756 101.836900 101.315234 100.637667 99.855751 -1.538971 -2.708060
ALB Albania 61.808311 59.585866 50.862987 49.663787 48.637067 NaN 46.720288 45.835739 45.247477 44.912168 44.806973 -17.001338 -6.056014

Reviewing the code above:

  1. The index_col='Country Code' argument of read_excel() makes the Country Code column the row index when the file is read
  2. inplace=True works as we have seen before: it controls whether the original data is modified in place
  3. ascending controls the sort order: it defaults to True for ascending order, and setting it to False gives descending order

2. Retrieve the data of the 10 users with the most followers

For this we need to sort by the 关注者 (followers) column.

import pandas as pd

user = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
user.sort_values(by='关注者',ascending=False,inplace=True)
user.head(10)
_id 关注的收藏夹 关注 关注者 关注的问题 关注的话题 关注的专栏
23313 5876ee159f11da03c9726155 3181 1053 585787 93 33 26
14461 5876ede89f11da03c9721c33 1041 254 356246 92 3 23
22048 5876ee0c9f11da03c9725773 41 72 351414 336 47 16
52464 5876ece59f11da03afbc71b3 1652 794 308553 23 1 112
42135 5876eb5a9f11da0399d461b7 1182 532 268353 23 1 54
63579 5878396e9f11da09c8fe21e6 10819 1412 198199 90 1 517
21740 5876ee0b9f11da03c972550b 991 479 195175 46 3 18
56721 5876ecfa9f11da03afbc92f1 164 550 164254 0 1 19
22784 5876ee129f11da03c9725d33 2011 579 148457 49 6 44
29577 587838f59f11da09bd427945 8 303 137087 477 19 15

Reviewing the code above:

  1. by: determines which column the data is sorted by; assign the relevant column name to it
  2. ascending=False: sorts the data in descending order
  3. inplace=True: controls whether the original data is modified in place
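For "top N" queries, pandas also offers nlargest(), which is equivalent to sort_values(...).head(n) but leaves the original frame untouched. A sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'followers': [585787, 12, 356246, 90]})

# Equivalent to df.sort_values(by='followers', ascending=False).head(2)
top2 = df.nlargest(2, 'followers')
print(top2)
```

nsmallest() works the same way for the opposite end of the ranking.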

IV. Summary

Data merging

[Figure 2]

Data filtering

[Figure 3]

Data sorting

[Figure 4]

V. Exercises

  1. You currently have two datasets; merge the two CSV files into one, sort by ride duration in descending order, and find the longest ride duration.
import pandas as pd
# Read the first dataset
bike1 = pd.read_csv(r'C:\Users\lin-a\Desktop\data\2017_1_data.csv')
bike1
Duration (ms) Start date End date Start station End station Bike number Member type
0 221834 2017/1/1 0:00 2017/1/1 0:04 3rd & Tingey St SE M St & New Jersey Ave SE W00869 Member
1 1676854 2017/1/1 0:06 2017/1/1 0:34 Lincoln Memorial 8th & D St NW W00894 Casual
2 1356956 2017/1/1 0:07 2017/1/1 0:29 Henry Bacon Dr & Lincoln Memorial Circle NW New York Ave & 15th St NW W21945 Casual
3 1327901 2017/1/1 0:07 2017/1/1 0:29 Henry Bacon Dr & Lincoln Memorial Circle NW New York Ave & 15th St NW W20012 Casual
4 1636768 2017/1/1 0:07 2017/1/1 0:34 Lincoln Memorial 8th & D St NW W22786 Casual
... ... ... ... ... ... ... ...
42400 262484 2017/1/10 23:52 2017/1/10 23:56 Georgia & New Hampshire Ave NW 11th & Kenyon St NW W01445 Member
42401 630296 2017/1/10 23:52 2017/1/11 0:03 17th & K St NW / Farragut Square 21st & M St NW W00097 Casual
42402 114584 2017/1/10 23:53 2017/1/10 23:55 Montgomery & East Ln Bethesda Ave & Arlington Rd W20637 Member
42403 564322 2017/1/10 23:53 2017/1/11 0:03 17th & K St NW / Farragut Square 21st & M St NW W01377 Casual
42404 809980 2017/1/10 23:59 2017/1/11 0:13 4th St & Madison Dr NW North Capitol St & F St NW W22436 Member

42405 rows × 7 columns

# Read the second dataset
bike2 = pd.read_csv(r'C:\Users\lin-a\Desktop\data\2017_2_data.csv')
bike2
Duration (ms) Start date End date Start station End station Bike number Member type
0 636921 2017/2/1 0:00 2017/2/1 0:11 Park Rd & Holmead Pl NW 14th & Upshur St NW W01023 Casual
1 399528 2017/2/1 0:01 2017/2/1 0:08 M St & Pennsylvania Ave NW 21st & M St NW W20687 Casual
2 186311 2017/2/1 0:05 2017/2/1 0:08 Massachusetts Ave & Dupont Circle NW 17th & Corcoran St NW W21637 Member
3 746740 2017/2/1 0:05 2017/2/1 0:17 11th & M St NW 16th & Harvard St NW W22297 Member
4 508111 2017/2/1 0:05 2017/2/1 0:14 14th & R St NW 18th St & Wyoming Ave NW W22024 Member
... ... ... ... ... ... ... ...
72875 667119 2017/2/10 23:57 2017/2/11 0:08 24th & N St NW 15th & L St NW W21336 Member
72876 569178 2017/2/10 23:58 2017/2/11 0:08 7th & T St NW 18th & R St NW W01405 Member
72877 308804 2017/2/10 23:59 2017/2/11 0:04 18th & M St NW New Hampshire Ave & T St NW W22673 Member
72878 1087725 2017/2/10 23:59 2017/2/11 0:17 7th & F St NW / National Portrait Gallery 19th & East Capitol St SE W00254 Member
72879 572705 2017/2/10 23:59 2017/2/11 0:09 11th & H St NE 4th & East Capitol St NE W21512 Member

72880 rows × 7 columns

# Merge the data
# Since every parameter takes its default value, pd.concat([bike1,bike2]) works just as well
bike3 = pd.concat([bike1,bike2],axis=0,join='outer')
bike3
Duration (ms) Start date End date Start station End station Bike number Member type
0 221834 2017/1/1 0:00 2017/1/1 0:04 3rd & Tingey St SE M St & New Jersey Ave SE W00869 Member
1 1676854 2017/1/1 0:06 2017/1/1 0:34 Lincoln Memorial 8th & D St NW W00894 Casual
2 1356956 2017/1/1 0:07 2017/1/1 0:29 Henry Bacon Dr & Lincoln Memorial Circle NW New York Ave & 15th St NW W21945 Casual
3 1327901 2017/1/1 0:07 2017/1/1 0:29 Henry Bacon Dr & Lincoln Memorial Circle NW New York Ave & 15th St NW W20012 Casual
4 1636768 2017/1/1 0:07 2017/1/1 0:34 Lincoln Memorial 8th & D St NW W22786 Casual
... ... ... ... ... ... ... ...
72875 667119 2017/2/10 23:57 2017/2/11 0:08 24th & N St NW 15th & L St NW W21336 Member
72876 569178 2017/2/10 23:58 2017/2/11 0:08 7th & T St NW 18th & R St NW W01405 Member
72877 308804 2017/2/10 23:59 2017/2/11 0:04 18th & M St NW New Hampshire Ave & T St NW W22673 Member
72878 1087725 2017/2/10 23:59 2017/2/11 0:17 7th & F St NW / National Portrait Gallery 19th & East Capitol St SE W00254 Member
72879 572705 2017/2/10 23:59 2017/2/11 0:09 11th & H St NE 4th & East Capitol St NE W21512 Member

115285 rows × 7 columns

# Sort ride durations in descending order
bike3.sort_values(by='Duration (ms)',ascending=False,inplace=True)
bike3
Duration (ms) Start date End date Start station End station Bike number Member type
24905 84876226 2017/1/6 10:59 2017/1/7 10:33 17th & K St NW / Farragut Square 19th & K St NW W23232 Member
44077 84070936 2017/2/7 10:44 2017/2/8 10:05 Thomas Circle 4th & C St SW W21615 Casual
35857 83789379 2017/2/6 12:35 2017/2/7 11:51 Fleet St & Ritchie Pkwy E Montgomery Ave & Maryland Ave W22565 Casual
35874 83662979 2017/2/6 12:38 2017/2/7 11:53 Fleet St & Ritchie Pkwy E Montgomery Ave & Maryland Ave W21846 Casual
29204 82897521 2017/1/7 11:42 2017/1/8 10:44 8th & H St NW 4th St & Madison Dr NW W21095 Casual
... ... ... ... ... ... ... ...
22670 60614 2017/2/4 8:29 2017/2/4 8:30 18th & R St NW 18th & R St NW W22795 Member
48831 60606 2017/2/7 18:31 2017/2/7 18:32 Kennedy Center Kennedy Center W22223 Casual
60148 60504 2017/2/8 19:14 2017/2/8 19:15 Columbus Circle / Union Station Columbus Circle / Union Station W21323 Casual
37585 60459 2017/2/6 16:53 2017/2/6 16:54 Columbus Circle / Union Station Columbus Circle / Union Station W21710 Member
48652 60452 2017/2/7 18:21 2017/2/7 18:22 Towers Crescent Dr & Tysons One Pl Towers Crescent Dr & Tysons One Pl W23040 Member

115285 rows × 7 columns

# Get the longest ride duration
long_time = bike3.iloc[0]['Duration (ms)']
long_time
84876226
  2. Using the resident-population data from the sixth national census, find the 10 most populous cities in the country. The data contains four fields: 省 (province), 地区 (region), 结尾 (suffix), and 常住人口 (resident population).
  • series.str treats each value as a string, exposing vectorized string methods
  • contains() checks whether each string contains the given substring and returns bool values
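The behavior of .str.contains() can be seen on a toy Series (English strings used for illustration):

```python
import pandas as pd

s = pd.Series(['Beijing City', 'Hebei Province', 'Shanghai City'])

# .str gives vectorized string methods; contains() returns a boolean
# Series that can be used directly as a filter mask
mask = s.str.contains('City')
print(mask)
```

The resulting mask is True for the first and third entries, so s[mask] would keep exactly the "City" rows.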
import pandas as pd
# Read the data
people = pd.read_csv(r'C:\Users\lin-a\Desktop\data\liupu.csv')

# Filter the data
bools = people['结尾'].str.contains('市')
people_2 = people[bools]

# Sort by resident population in descending order
people_3 = people_2.sort_values(by='常住人口',ascending=False,inplace=False)
# Take the top 10
people_3.head(10)
地区 结尾 常住人口
567 上海市 上海市 23019196.0
23 北京市 北京市 19612368.0
569 四川省 成都市 14047625.0
601 天津市 天津市 12938693.0
73 广东省 广州市 12701948.0
157 河北省 保定市 11194382.0
232 黑龙江省 哈尔滨市 10635971.0
378 江苏省 苏州市 10459890.0
95 广东省 深圳市 10358381.0
204 河南省 南阳市 10263660.0
