Combining data mainly involves two operations: concatenation with pd.concat() and key-based merging with pd.merge().
pd.concat() concatenates objects along a chosen axis. Its most common parameters are:
pd.concat(objs, axis=0, join='outer')
Here objs is the list of Series/DataFrame objects to concatenate, axis is the axis to concatenate along (0 for rows, 1 for columns), and join controls how the labels on the other axis are combined ('outer' keeps the union, 'inner' keeps the intersection).
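Before moving on to DataFrames, a minimal sketch with two toy Series (hypothetical data) shows the default behavior:

```python
import pandas as pd

# Two toy Series to illustrate the signature (hypothetical data)
s1 = pd.Series(['a0', 'a1'])
s2 = pd.Series(['b0', 'b1'])

# axis=0 stacks the objects vertically; the original indexes are kept
stacked = pd.concat([s1, s2], axis=0)
print(stacked.tolist())        # ['a0', 'a1', 'b0', 'b1']
print(stacked.index.tolist())  # [0, 1, 0, 1]
```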
import pandas as pd
dict_1 = {
'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3'],
'C':['C0','C1','C2','C3']}
df1 = pd.DataFrame(dict_1)
df1
| | A | B | C |
|---|---|---|---|
0 | A0 | B0 | C0 |
1 | A1 | B1 | C1 |
2 | A2 | B2 | C2 |
3 | A3 | B3 | C3 |
dict_2 = {
'B':['B0','B1','B2','B3'],
'C':['C0','C1','C2','C3'],
'D':['D0','D1','D2','D3']}
df2 = pd.DataFrame(dict_2)
df2
| | B | C | D |
|---|---|---|---|
0 | B0 | C0 | D0 |
1 | B1 | C1 | D1 |
2 | B2 | C2 | D2 |
3 | B3 | C3 | D3 |
Concatenating df1 and df2 with pd.concat()'s default parameters:
df3 = pd.concat([df1,df2], axis=0, join='outer')
df3
| | A | B | C | D |
|---|---|---|---|---|
0 | A0 | B0 | C0 | NaN |
1 | A1 | B1 | C1 | NaN |
2 | A2 | B2 | C2 | NaN |
3 | A3 | B3 | C3 | NaN |
0 | NaN | B0 | C0 | D0 |
1 | NaN | B1 | C1 | D1 |
2 | NaN | B2 | C2 | D2 |
3 | NaN | B3 | C3 | D3 |
The result shows that with join='outer' and axis=0, the columns are combined by union and the tables are stacked vertically; missing values are filled with NaN, and the original row indexes are preserved.
If neither table's index carries real meaning, set the ignore_index parameter to True to build a fresh index:
df4 = pd.concat([df1,df2],axis=0,join='outer',ignore_index=True)
df4
| | A | B | C | D |
|---|---|---|---|---|
0 | A0 | B0 | C0 | NaN |
1 | A1 | B1 | C1 | NaN |
2 | A2 | B2 | C2 | NaN |
3 | A3 | B3 | C3 | NaN |
4 | NaN | B0 | C0 | D0 |
5 | NaN | B1 | C1 | D1 |
6 | NaN | B2 | C2 | D2 |
7 | NaN | B3 | C3 | D3 |
When we change concat()'s axis parameter to 1 and combine df1 and df2:
df5 = pd.concat([df1,df2],axis=1,join='outer')
df5
| | A | B | C | B | C | D |
|---|---|---|---|---|---|---|
0 | A0 | B0 | C0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | B3 | C3 | D3 |
With axis=1, the frames are placed side by side, aligned on the row index, so the shared column names B and C simply appear twice. Next, when we change concat()'s join parameter to 'inner' and combine df1 and df2:
df6 = pd.concat([df1,df2],axis=0,join='inner')
df6
| | B | C |
|---|---|---|
0 | B0 | C0 |
1 | B1 | C1 |
2 | B2 | C2 |
3 | B3 | C3 |
0 | B0 | C0 |
1 | B1 | C1 |
2 | B2 | C2 |
3 | B3 | C3 |
The result shows that with join='inner', only the columns shared by both tables (the intersection, B and C) are kept; the rows are still stacked vertically with their original indexes.
pd.merge() performs database-style joins. Its common parameters are:
pd.merge(left, right, how='inner', on=None)
Here left and right are the two DataFrames to join, how selects the join type ('inner', 'outer', 'left', or 'right'), and on names the key column(s) to join on (by default, the columns common to both frames).
import pandas as pd
left = pd.DataFrame({'key':['a','b','b','d'],'data1':range(4)})
left
| | key | data1 |
|---|---|---|
0 | a | 0 |
1 | b | 1 |
2 | b | 2 |
3 | d | 3 |
right = pd.DataFrame({'key':['a','b','c'],'data2':range(3)})
right
| | key | data2 |
|---|---|---|
0 | a | 0 |
1 | b | 1 |
2 | c | 2 |
df7 = pd.merge(left,right)
df7
| | key | data1 | data2 |
|---|---|---|---|
0 | a | 0 | 0 |
1 | b | 1 | 1 |
2 | b | 2 | 1 |
merge() defaults to an inner join and uses the intersection of the two DataFrames' column names (here, key) as the join key. Likewise, the joined result covers only the intersection of the two key columns; note that 'b' appears twice in left, so both of those rows are matched against the single 'b' row in right.
When the two DataFrames are joined with how='outer':
df8= pd.merge(left,right,on=['key'],how='outer')
df8
| | key | data1 | data2 |
|---|---|---|---|
0 | a | 0.0 | 0.0 |
1 | b | 1.0 | 1.0 |
2 | b | 2.0 | 1.0 |
3 | d | 3.0 | NaN |
4 | c | NaN | 2.0 |
With an outer join, merge() keeps the union of the two key columns, and the missing entries are filled with NaN.
When the two DataFrames are joined with how='left':
df9 = pd.merge(left,right,on=['key'],how='left')
df9
| | key | data1 | data2 |
|---|---|---|---|
0 | a | 0 | 0.0 |
1 | b | 1 | 1.0 |
2 | b | 2 | 1.0 |
3 | d | 3 | NaN |
With a left join, merge() takes left's join keys as the base and combines the columns of both frames accordingly; the missing entries are filled with NaN.
So what happens when the two DataFrames are joined with how='right'?
df10= pd.merge(left,right,on=['key'],how='right')
df10
| | key | data1 | data2 |
|---|---|---|---|
0 | a | 0.0 | 0 |
1 | b | 1.0 | 1 |
2 | b | 2.0 | 1 |
3 | c | NaN | 2 |
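As the table shows, a right join keeps every key from right ('a', 'b', 'c') and drops left-only keys such as 'd'. When it is unclear which side each row came from, merge's indicator parameter can help; a minimal sketch on toy frames (hypothetical data):

```python
import pandas as pd

# Toy frames (hypothetical data)
left = pd.DataFrame({'key': ['a', 'b', 'd'], 'data1': [0, 1, 2]})
right = pd.DataFrame({'key': ['a', 'c'], 'data2': [0, 1]})

# indicator=True adds a '_merge' column marking each row as
# 'left_only', 'right_only', or 'both'
out = pd.merge(left, right, on='key', how='outer', indicator=True)
print(sorted(out['_merge'].astype(str)))
```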
In the previous material, we learned how to fetch a single record or a run of consecutive records. In real work, however, we routinely handle tens of thousands of rows, or even more.
So how can we quickly filter out the rows that meet a given condition?
import pandas as pd
# Load the data
df = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
# Get a quick overview of the data
print(df.shape)
df.head()
(72756, 7)
| | _id | 关注的收藏夹 | 关注 | 关注者 | 关注的问题 | 关注的话题 | 关注的专栏 |
|---|---|---|---|---|---|---|---|
0 | 587598f89f11daf90617fb7a | 52 | 17 | 1 | 30 | 58 | 2 |
1 | 587598f89f11daf90617fb7c | 27 | 73 | 15 | 87 | 26 | 1 |
2 | 587598f89f11daf90617fb7e | 72 | 94 | 1 | 112 | 20 | 4 |
3 | 587598f89f11daf90617fb80 | 174 | 84 | 8 | 895 | 30 | 7 |
4 | 587598f89f11daf90617fb82 | 3 | 236 | 64 | 119 | 44 | 17 |
bools = df['关注者']>100
bools
0 False
1 False
2 False
3 False
4 False
...
72751 False
72752 False
72753 False
72754 False
72755 False
Name: 关注者, Length: 72756, dtype: bool
df1 = df[bools]
print(df1.shape)
df1
(5501, 7)
| | _id | 关注的收藏夹 | 关注 | 关注者 | 关注的问题 | 关注的话题 | 关注的专栏 |
|---|---|---|---|---|---|---|---|
79 | 587598f89f11daf90617fc18 | 8 | 111 | 3049 | 1216 | 45 | 17 |
101 | 587598f89f11daf90617fc44 | 61 | 123 | 205 | 670 | 24 | 18 |
112 | 587598f99f11daf90617fc5a | 1 | 804 | 197 | 830 | 39 | 26 |
114 | 587598f99f11daf90617fc5e | 44 | 62 | 201 | 314 | 8 | 3 |
121 | 587598f99f11daf90617fc6c | 44 | 628 | 6597 | 1566 | 144 | 69 |
... | ... | ... | ... | ... | ... | ... | ... |
72694 | 5878399e9f11da09c8fe690e | 3 | 172 | 219 | 466 | 68 | 11 |
72700 | 5878399e9f11da09c8fe691a | 21 | 39 | 281 | 51 | 45 | 2 |
72713 | 5878399e9f11da09c8fe6934 | 49 | 635 | 248 | 1772 | 151 | 123 |
72726 | 5878399e9f11da09c8fe694e | 447 | 62 | 728 | 14 | 1 | 21 |
72741 | 5878399e9f11da09c8fe696c | 3546 | 390 | 142 | 25 | 4 | 126 |
5501 rows × 7 columns
Let's review the logic of the code above:
bools = df['关注者']>100
df1 = df[bools]
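Since the CSV above lives on a local disk, here is a self-contained sketch of the same two-step logic on toy data (hypothetical numbers):

```python
import pandas as pd

# Toy stand-in for the CSV (hypothetical numbers)
df = pd.DataFrame({'关注者': [1, 150, 80, 300]})

# Step 1: comparing a column to a scalar yields a boolean Series
bools = df['关注者'] > 100
# Step 2: indexing the DataFrame with that Series keeps only the True rows
df1 = df[bools]
print(df1['关注者'].tolist())  # [150, 300]
```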
That is the logic for filtering on a single column's values. What if we need to filter on several conditions at once?
import pandas as pd
# Load the data
df = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
# Inspect the data
df
| | _id | 关注的收藏夹 | 关注 | 关注者 | 关注的问题 | 关注的话题 | 关注的专栏 |
|---|---|---|---|---|---|---|---|
0 | 587598f89f11daf90617fb7a | 52 | 17 | 1 | 30 | 58 | 2 |
1 | 587598f89f11daf90617fb7c | 27 | 73 | 15 | 87 | 26 | 1 |
2 | 587598f89f11daf90617fb7e | 72 | 94 | 1 | 112 | 20 | 4 |
3 | 587598f89f11daf90617fb80 | 174 | 84 | 8 | 895 | 30 | 7 |
4 | 587598f89f11daf90617fb82 | 3 | 236 | 64 | 119 | 44 | 17 |
... | ... | ... | ... | ... | ... | ... | ... |
72751 | 5878399e9f11da09c8fe6980 | 112 | 77 | 27 | 39 | 2 | 32 |
72752 | 5878399e9f11da09c8fe6982 | 15 | 90 | 16 | 61 | 116 | 2 |
72753 | 5878399e9f11da09c8fe6984 | 7 | 19 | 1 | 25 | 12 | 2 |
72754 | 5878399e9f11da09c8fe6986 | 14 | 320 | 3 | 68 | 38 | 9 |
72755 | 5878399e9f11da09c8fe6988 | 7 | 1 | 2 | 7 | 15 | 0 |
72756 rows × 7 columns
bool_1 = df['关注者']>300
bool_2 = df['关注']>100
df2 = df[bool_1 & bool_2]
df2
| | _id | 关注的收藏夹 | 关注 | 关注者 | 关注的问题 | 关注的话题 | 关注的专栏 |
|---|---|---|---|---|---|---|---|
79 | 587598f89f11daf90617fc18 | 8 | 111 | 3049 | 1216 | 45 | 17 |
121 | 587598f99f11daf90617fc6c | 44 | 628 | 6597 | 1566 | 144 | 69 |
155 | 587598f99f11daf90617fcb0 | 194 | 250 | 1103 | 10 | 1 | 19 |
228 | 587598f99f11daf90617fd42 | 48 | 224 | 1450 | 360 | 128 | 20 |
261 | 587598f99f11daf90617fd84 | 1125 | 842 | 24848 | 108 | 33 | 27 |
... | ... | ... | ... | ... | ... | ... | ... |
72173 | 5878399b9f11da09c8fe64fc | 38 | 5000 | 1386 | 2205 | 25 | 6 |
72416 | 5878399d9f11da09c8fe66e2 | 661 | 299 | 457 | 32 | 4 | 9 |
72463 | 5878399d9f11da09c8fe6740 | 274 | 209 | 1556 | 2405 | 188 | 318 |
72648 | 5878399e9f11da09c8fe68b2 | 10 | 278 | 471 | 763 | 66 | 12 |
72661 | 5878399e9f11da09c8fe68cc | 756 | 491 | 414 | 229 | 2 | 58 |
1921 rows × 7 columns
Reviewing the code above: bool_1 and bool_2 are both boolean Series, and the & operator combines them element-wise, so df[bool_1 & bool_2] keeps only the rows that satisfy both conditions.
The same filtering can also be done with the loc[] accessor.
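One advantage of loc[] is that it can take a boolean mask for the rows and a column selection at the same time; a minimal sketch on toy data (hypothetical values):

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({'关注者': [50, 200, 400], '关注': [10, 20, 30]})

# loc accepts a boolean mask for rows and, unlike plain df[mask],
# a column selection in the same call
subset = df.loc[df['关注者'] > 100, ['关注']]
print(subset['关注'].tolist())  # [20, 30]
```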
import pandas as pd
# Load the data
df = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
# Get a quick overview of the data
print(df.shape)
df.head()
(72756, 7)
| | _id | 关注的收藏夹 | 关注 | 关注者 | 关注的问题 | 关注的话题 | 关注的专栏 |
|---|---|---|---|---|---|---|---|
0 | 587598f89f11daf90617fb7a | 52 | 17 | 1 | 30 | 58 | 2 |
1 | 587598f89f11daf90617fb7c | 27 | 73 | 15 | 87 | 26 | 1 |
2 | 587598f89f11daf90617fb7e | 72 | 94 | 1 | 112 | 20 | 4 |
3 | 587598f89f11daf90617fb80 | 174 | 84 | 8 | 895 | 30 | 7 |
4 | 587598f89f11daf90617fb82 | 3 | 236 | 64 | 119 | 44 | 17 |
# Get the users with more than 100 followers (关注者 > 100)
df_new = df.loc[df['关注者']>100]
df_new
| | _id | 关注的收藏夹 | 关注 | 关注者 | 关注的问题 | 关注的话题 | 关注的专栏 |
|---|---|---|---|---|---|---|---|
79 | 587598f89f11daf90617fc18 | 8 | 111 | 3049 | 1216 | 45 | 17 |
101 | 587598f89f11daf90617fc44 | 61 | 123 | 205 | 670 | 24 | 18 |
112 | 587598f99f11daf90617fc5a | 1 | 804 | 197 | 830 | 39 | 26 |
114 | 587598f99f11daf90617fc5e | 44 | 62 | 201 | 314 | 8 | 3 |
121 | 587598f99f11daf90617fc6c | 44 | 628 | 6597 | 1566 | 144 | 69 |
... | ... | ... | ... | ... | ... | ... | ... |
72694 | 5878399e9f11da09c8fe690e | 3 | 172 | 219 | 466 | 68 | 11 |
72700 | 5878399e9f11da09c8fe691a | 21 | 39 | 281 | 51 | 45 | 2 |
72713 | 5878399e9f11da09c8fe6934 | 49 | 635 | 248 | 1772 | 151 | 123 |
72726 | 5878399e9f11da09c8fe694e | 447 | 62 | 728 | 14 | 1 | 21 |
72741 | 5878399e9f11da09c8fe696c | 3546 | 390 | 142 | 25 | 4 | 126 |
5501 rows × 7 columns
Sorting is another operation we constantly need while retrieving data. For example, suppose we need the 10 users with the most followers.
pandas provides two sorting methods, sort_index() and sort_values(), and both are available on Series as well as DataFrame.
So how do these two DataFrame methods differ?
import pandas as pd
people = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx',index_col='Country Code')
people
| Country Code | CountryName | 1990 | 2000 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | Change 1990-2015 | Change 2007-2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AFG | Afghanistan | 101.094930 | 103.254202 | 100.000371 | 100.215886 | 100.060480 | 99.459839 | 97.667911 | 95.312707 | 92.602785 | 89.773777 | 86.954464 | -14.140466 | -13.045907 |
ALB | Albania | 61.808311 | 59.585866 | 50.862987 | 49.663787 | 48.637067 | NaN | 46.720288 | 45.835739 | 45.247477 | 44.912168 | 44.806973 | -17.001338 | -6.056014 |
DZA | Algeria | 87.675705 | 62.886169 | 49.487870 | 48.910002 | 48.645026 | 48.681853 | 49.233576 | 49.847713 | 50.600697 | 51.536631 | 52.617579 | -35.058127 | 3.129709 |
ASM | American Samoa | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ADO | Andorra | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
VIR | Virgin Islands (U.S.) | 53.546901 | 52.647183 | 49.912779 | 50.459425 | 51.257336 | 52.382523 | 53.953515 | 55.687666 | 57.537152 | 59.399244 | 61.199651 | 7.652751 | 11.286872 |
WBG | West Bank and Gaza | 102.789182 | 100.410470 | 88.367099 | 86.204936 | 84.173732 | 82.333762 | 80.851747 | 79.426778 | 78.118403 | 76.975462 | 76.001869 | -26.787313 | -12.365230 |
YEM | Yemen, Rep. | 118.779727 | 105.735754 | 88.438350 | 85.899643 | 83.594489 | 81.613924 | 80.193948 | 78.902603 | 77.734373 | 76.644268 | 75.595147 | -43.184581 | -12.843203 |
ZMB | Zambia | 100.485263 | 97.124776 | 98.223058 | 98.277253 | 98.260795 | 98.148325 | 97.854237 | 97.385802 | 96.791310 | 96.122165 | 95.402326 | -5.082936 | -2.820732 |
ZWE | Zimbabwe | 96.365244 | 84.877487 | 81.272166 | 81.024020 | 80.934968 | 80.985702 | 80.740494 | 80.579870 | 80.499816 | 80.456439 | 80.391033 | -15.974211 | -0.881133 |
219 rows × 14 columns
people.sort_index(inplace=True,ascending=True)
people.head()
| Country Code | CountryName | 1990 | 2000 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | Change 1990-2015 | Change 2007-2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ABW | Aruba | 47.500831 | 44.420778 | 43.475981 | 43.957839 | 44.531061 | 45.157235 | 44.993173 | 44.706680 | 44.355302 | 44.070252 | 43.952390 | -3.548441 | 0.476409 |
ADO | Andorra | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
AFG | Afghanistan | 101.094930 | 103.254202 | 100.000371 | 100.215886 | 100.060480 | 99.459839 | 97.667911 | 95.312707 | 92.602785 | 89.773777 | 86.954464 | -14.140466 | -13.045907 |
AGO | Angola | 101.394722 | 100.930475 | 102.563811 | 102.609186 | 102.428788 | 102.035690 | 102.106756 | 101.836900 | 101.315234 | 100.637667 | 99.855751 | -1.538971 | -2.708060 |
ALB | Albania | 61.808311 | 59.585866 | 50.862987 | 49.663787 | 48.637067 | NaN | 46.720288 | 45.835739 | 45.247477 | 44.912168 | 44.806973 | -17.001338 | -6.056014 |
Reviewing the code above: sort_index(ascending=True) reorders the rows by their index labels (here, Country Code), and inplace=True applies the sort to people directly.
Now suppose we instead need to sort by the values of the 关注者 column.
import pandas as pd
user = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
user.sort_values(by='关注者',ascending=False,inplace=True)
user.head(10)
| | _id | 关注的收藏夹 | 关注 | 关注者 | 关注的问题 | 关注的话题 | 关注的专栏 |
|---|---|---|---|---|---|---|---|
23313 | 5876ee159f11da03c9726155 | 3181 | 1053 | 585787 | 93 | 33 | 26 |
14461 | 5876ede89f11da03c9721c33 | 1041 | 254 | 356246 | 92 | 3 | 23 |
22048 | 5876ee0c9f11da03c9725773 | 41 | 72 | 351414 | 336 | 47 | 16 |
52464 | 5876ece59f11da03afbc71b3 | 1652 | 794 | 308553 | 23 | 1 | 112 |
42135 | 5876eb5a9f11da0399d461b7 | 1182 | 532 | 268353 | 23 | 1 | 54 |
63579 | 5878396e9f11da09c8fe21e6 | 10819 | 1412 | 198199 | 90 | 1 | 517 |
21740 | 5876ee0b9f11da03c972550b | 991 | 479 | 195175 | 46 | 3 | 18 |
56721 | 5876ecfa9f11da03afbc92f1 | 164 | 550 | 164254 | 0 | 1 | 19 |
22784 | 5876ee129f11da03c9725d33 | 2011 | 579 | 148457 | 49 | 6 | 44 |
29577 | 587838f59f11da09bd427945 | 8 | 303 | 137087 | 477 | 19 | 15 |
Reviewing the code above: sort_values(by='关注者', ascending=False) sorts the rows by the 关注者 column in descending order, inplace=True modifies user in place, and head(10) then returns the top 10 rows.
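For top-N queries like this, pandas also provides DataFrame.nlargest, which skips the full sort-then-head pattern; a sketch on toy data (hypothetical numbers):

```python
import pandas as pd

# Toy data (hypothetical numbers)
user = pd.DataFrame({'关注者': [5, 300, 40, 900, 120]})

# nlargest(n, column) returns the n rows with the largest values in column
top3 = user.nlargest(3, '关注者')
print(top3['关注者'].tolist())  # [900, 300, 120]
```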
import pandas as pd
# Read the first file
bike1 = pd.read_csv(r'C:\Users\lin-a\Desktop\data\2017_1_data.csv')
bike1
| | Duration (ms) | Start date | End date | Start station | End station | Bike number | Member type |
|---|---|---|---|---|---|---|---|
0 | 221834 | 2017/1/1 0:00 | 2017/1/1 0:04 | 3rd & Tingey St SE | M St & New Jersey Ave SE | W00869 | Member |
1 | 1676854 | 2017/1/1 0:06 | 2017/1/1 0:34 | Lincoln Memorial | 8th & D St NW | W00894 | Casual |
2 | 1356956 | 2017/1/1 0:07 | 2017/1/1 0:29 | Henry Bacon Dr & Lincoln Memorial Circle NW | New York Ave & 15th St NW | W21945 | Casual |
3 | 1327901 | 2017/1/1 0:07 | 2017/1/1 0:29 | Henry Bacon Dr & Lincoln Memorial Circle NW | New York Ave & 15th St NW | W20012 | Casual |
4 | 1636768 | 2017/1/1 0:07 | 2017/1/1 0:34 | Lincoln Memorial | 8th & D St NW | W22786 | Casual |
... | ... | ... | ... | ... | ... | ... | ... |
42400 | 262484 | 2017/1/10 23:52 | 2017/1/10 23:56 | Georgia & New Hampshire Ave NW | 11th & Kenyon St NW | W01445 | Member |
42401 | 630296 | 2017/1/10 23:52 | 2017/1/11 0:03 | 17th & K St NW / Farragut Square | 21st & M St NW | W00097 | Casual |
42402 | 114584 | 2017/1/10 23:53 | 2017/1/10 23:55 | Montgomery & East Ln | Bethesda Ave & Arlington Rd | W20637 | Member |
42403 | 564322 | 2017/1/10 23:53 | 2017/1/11 0:03 | 17th & K St NW / Farragut Square | 21st & M St NW | W01377 | Casual |
42404 | 809980 | 2017/1/10 23:59 | 2017/1/11 0:13 | 4th St & Madison Dr NW | North Capitol St & F St NW | W22436 | Member |
42405 rows × 7 columns
# Read the second file
bike2 = pd.read_csv(r'C:\Users\lin-a\Desktop\data\2017_2_data.csv')
bike2
| | Duration (ms) | Start date | End date | Start station | End station | Bike number | Member type |
|---|---|---|---|---|---|---|---|
0 | 636921 | 2017/2/1 0:00 | 2017/2/1 0:11 | Park Rd & Holmead Pl NW | 14th & Upshur St NW | W01023 | Casual |
1 | 399528 | 2017/2/1 0:01 | 2017/2/1 0:08 | M St & Pennsylvania Ave NW | 21st & M St NW | W20687 | Casual |
2 | 186311 | 2017/2/1 0:05 | 2017/2/1 0:08 | Massachusetts Ave & Dupont Circle NW | 17th & Corcoran St NW | W21637 | Member |
3 | 746740 | 2017/2/1 0:05 | 2017/2/1 0:17 | 11th & M St NW | 16th & Harvard St NW | W22297 | Member |
4 | 508111 | 2017/2/1 0:05 | 2017/2/1 0:14 | 14th & R St NW | 18th St & Wyoming Ave NW | W22024 | Member |
... | ... | ... | ... | ... | ... | ... | ... |
72875 | 667119 | 2017/2/10 23:57 | 2017/2/11 0:08 | 24th & N St NW | 15th & L St NW | W21336 | Member |
72876 | 569178 | 2017/2/10 23:58 | 2017/2/11 0:08 | 7th & T St NW | 18th & R St NW | W01405 | Member |
72877 | 308804 | 2017/2/10 23:59 | 2017/2/11 0:04 | 18th & M St NW | New Hampshire Ave & T St NW | W22673 | Member |
72878 | 1087725 | 2017/2/10 23:59 | 2017/2/11 0:17 | 7th & F St NW / National Portrait Gallery | 19th & East Capitol St SE | W00254 | Member |
72879 | 572705 | 2017/2/10 23:59 | 2017/2/11 0:09 | 11th & H St NE | 4th & East Capitol St NE | W21512 | Member |
72880 rows × 7 columns
# Concatenate the two datasets
# Since every parameter takes its default value, pd.concat([bike1, bike2]) would work too
bike3 = pd.concat([bike1,bike2],axis=0,join='outer')
bike3
| | Duration (ms) | Start date | End date | Start station | End station | Bike number | Member type |
|---|---|---|---|---|---|---|---|
0 | 221834 | 2017/1/1 0:00 | 2017/1/1 0:04 | 3rd & Tingey St SE | M St & New Jersey Ave SE | W00869 | Member |
1 | 1676854 | 2017/1/1 0:06 | 2017/1/1 0:34 | Lincoln Memorial | 8th & D St NW | W00894 | Casual |
2 | 1356956 | 2017/1/1 0:07 | 2017/1/1 0:29 | Henry Bacon Dr & Lincoln Memorial Circle NW | New York Ave & 15th St NW | W21945 | Casual |
3 | 1327901 | 2017/1/1 0:07 | 2017/1/1 0:29 | Henry Bacon Dr & Lincoln Memorial Circle NW | New York Ave & 15th St NW | W20012 | Casual |
4 | 1636768 | 2017/1/1 0:07 | 2017/1/1 0:34 | Lincoln Memorial | 8th & D St NW | W22786 | Casual |
... | ... | ... | ... | ... | ... | ... | ... |
72875 | 667119 | 2017/2/10 23:57 | 2017/2/11 0:08 | 24th & N St NW | 15th & L St NW | W21336 | Member |
72876 | 569178 | 2017/2/10 23:58 | 2017/2/11 0:08 | 7th & T St NW | 18th & R St NW | W01405 | Member |
72877 | 308804 | 2017/2/10 23:59 | 2017/2/11 0:04 | 18th & M St NW | New Hampshire Ave & T St NW | W22673 | Member |
72878 | 1087725 | 2017/2/10 23:59 | 2017/2/11 0:17 | 7th & F St NW / National Portrait Gallery | 19th & East Capitol St SE | W00254 | Member |
72879 | 572705 | 2017/2/10 23:59 | 2017/2/11 0:09 | 11th & H St NE | 4th & East Capitol St NE | W21512 | Member |
115285 rows × 7 columns
# Sort ride durations in descending order
bike3.sort_values(by='Duration (ms)',ascending=False,inplace=True)
bike3
| | Duration (ms) | Start date | End date | Start station | End station | Bike number | Member type |
|---|---|---|---|---|---|---|---|
24905 | 84876226 | 2017/1/6 10:59 | 2017/1/7 10:33 | 17th & K St NW / Farragut Square | 19th & K St NW | W23232 | Member |
44077 | 84070936 | 2017/2/7 10:44 | 2017/2/8 10:05 | Thomas Circle | 4th & C St SW | W21615 | Casual |
35857 | 83789379 | 2017/2/6 12:35 | 2017/2/7 11:51 | Fleet St & Ritchie Pkwy | E Montgomery Ave & Maryland Ave | W22565 | Casual |
35874 | 83662979 | 2017/2/6 12:38 | 2017/2/7 11:53 | Fleet St & Ritchie Pkwy | E Montgomery Ave & Maryland Ave | W21846 | Casual |
29204 | 82897521 | 2017/1/7 11:42 | 2017/1/8 10:44 | 8th & H St NW | 4th St & Madison Dr NW | W21095 | Casual |
... | ... | ... | ... | ... | ... | ... | ... |
22670 | 60614 | 2017/2/4 8:29 | 2017/2/4 8:30 | 18th & R St NW | 18th & R St NW | W22795 | Member |
48831 | 60606 | 2017/2/7 18:31 | 2017/2/7 18:32 | Kennedy Center | Kennedy Center | W22223 | Casual |
60148 | 60504 | 2017/2/8 19:14 | 2017/2/8 19:15 | Columbus Circle / Union Station | Columbus Circle / Union Station | W21323 | Casual |
37585 | 60459 | 2017/2/6 16:53 | 2017/2/6 16:54 | Columbus Circle / Union Station | Columbus Circle / Union Station | W21710 | Member |
48652 | 60452 | 2017/2/7 18:21 | 2017/2/7 18:22 | Towers Crescent Dr & Tysons One Pl | Towers Crescent Dr & Tysons One Pl | W23040 | Member |
115285 rows × 7 columns
# Get the longest ride duration
long_time = bike3.iloc[0]['Duration (ms)']
long_time
84876226
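Because bike3 is sorted in descending order, iloc[0] holds the row with the maximum duration; if only the maximum value is needed, Series.max() reaches it without sorting. A sketch on toy data (hypothetical durations):

```python
import pandas as pd

# Toy durations (hypothetical numbers)
bike = pd.DataFrame({'Duration (ms)': [60452, 84876226, 308804]})

# After a descending sort, iloc[0] is the row with the maximum value...
sorted_bike = bike.sort_values(by='Duration (ms)', ascending=False)
longest = sorted_bike.iloc[0]['Duration (ms)']

# ...which matches what Series.max() returns directly
assert longest == bike['Duration (ms)'].max()
print(longest)  # 84876226
```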
import pandas as pd
# Read the data
people = pd.read_csv(r'C:\Users\lin-a\Desktop\data\liupu.csv')
# Filter the rows
bools = people['结尾'].str.contains('市')
people_2 = people[bools]
# Sort by population in descending order
people_3 = people_2.sort_values(by='常住人口',ascending=False,inplace=False)
# Get the top 10 rows
people_3.head(10)
| | 省 | 地区 | 结尾 | 常住人口 |
|---|---|---|---|---|
567 | 上海市 | 上海市 | 市 | 23019196.0 |
23 | 北京市 | 北京市 | 市 | 19612368.0 |
569 | 四川省 | 成都市 | 市 | 14047625.0 |
601 | 天津市 | 天津市 | 市 | 12938693.0 |
73 | 广东省 | 广州市 | 市 | 12701948.0 |
157 | 河北省 | 保定市 | 市 | 11194382.0 |
232 | 黑龙江省 | 哈尔滨市 | 市 | 10635971.0 |
378 | 江苏省 | 苏州市 | 市 | 10459890.0 |
95 | 广东省 | 深圳市 | 市 | 10358381.0 |
204 | 河南省 | 南阳市 | 市 | 10263660.0 |
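The key step above is str.contains, which turns a substring test into a boolean mask; a self-contained sketch on toy data (hypothetical rows):

```python
import pandas as pd

# Toy frame (hypothetical rows)
people = pd.DataFrame({'地区': ['上海市', '某某县', '广州市'],
                       '常住人口': [23019196, 500000, 12701948]})

# str.contains returns True for rows whose 地区 value contains '市'
mask = people['地区'].str.contains('市')
cities = people[mask].sort_values(by='常住人口', ascending=False)
print(cities['地区'].tolist())  # ['上海市', '广州市']
```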