Pandas Basics (5): append, assign, combine, update, concat, merge, join
DataWhale Team Study, Session 12: Python Pandas
Which auction was the first to have a winning-bid rate below 5%?
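import numpy as np
import pandas as pd

# df1 is the license-auction table; the code that loads it is not shown here
df1_rate = df1.assign(biddingRate=df1['Total number of license issued'] / df1['Total number of applicants'])
# Date of the first auction whose rate is strictly below 5%
df1_rate[df1_rate['biddingRate'] < 0.05]['Date'].iloc[0]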
'15-May'
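The same pattern can be tried on a small table. This is a minimal sketch with made-up numbers; the column names `issued` and `applicants` are hypothetical stand-ins for the real ones.

```python
import pandas as pd

# Toy data (hypothetical values, not the real auction table)
toy = pd.DataFrame({
    "Date": ["2-Jan", "2-Feb", "2-Mar"],
    "issued": [1400, 100, 50],
    "applicants": [3718, 4000, 5000],
})

# assign returns a new DataFrame with the extra column; toy itself is unchanged
rate = toy.assign(bidding_rate=toy["issued"] / toy["applicants"])

# Boolean mask first, then the Date of the first matching row
first_below = rate.loc[rate["bidding_rate"] < 0.05, "Date"].iloc[0]
print(first_below)
```

Selecting the `Date` column by name avoids relying on it being the first column of the table.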
For each year, compute the following statistics of the lowest auction price: maximum, mean, and 0.75 quantile. Show them in a single table.
df1_year = df1.assign(Year=df1['Date'].str.split('-',expand=True).iloc[:,0].astype(int))
mean_values = df1_year.groupby(['Year'])['lowest price '].mean()
max_values = df1_year.groupby(['Year'])['lowest price '].max()
percent_values = df1_year.groupby(['Year'])['lowest price '].describe(percentiles=[.75])['75%']
df1_lowest = pd.DataFrame({
'Mean':mean_values,
'Max':max_values,
'75%':percent_values})
df1_lowest.head()
Year | Mean | Max | 75% |
---|---|---|---|
2 | 20316.666667 | 30800 | 24300.0 |
3 | 31983.333333 | 38500 | 36300.0 |
4 | 29408.333333 | 44200 | 38400.0 |
5 | 31908.333333 | 37900 | 35600.0 |
6 | 37058.333333 | 39900 | 39525.0 |
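The three statistics can also be collected in one `groupby` call with named aggregation (available since pandas 0.25). The sketch below uses toy data; the column name `lowest` and the output labels are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Year": [2, 2, 2, 3, 3],
    "lowest": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# One pass over the groups; each keyword becomes a column of the result
summary = toy.groupby("Year")["lowest"].agg(
    Max="max",
    Mean="mean",
    Q75=lambda s: s.quantile(0.75),
)
print(summary)
```

This avoids running `groupby` three times and building the result table by hand.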
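Split the first (date) column into two columns, one for the year (formatted as 20××) and one for the month (English abbreviation). Add them to the table as the first and second columns, delete the original first column, and shift the remaining columns back accordingly.
def to_year(x):
    # '2' -> '2002', '15' -> '2015'; anything else is flagged with -1
    if len(x) == 1:
        return "200" + x
    elif len(x) == 2:
        return "20" + x
    else:
        return -1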
year = df1['Date'].str.split('-',expand=True).iloc[:,0].apply(lambda x:to_year(x))
month = df1['Date'].str.split('-',expand=True).iloc[:,1]
df1_time = pd.DataFrame({
'Year':year,'Month':month})
df1_time = pd.concat([df1_time, df1.drop(columns='Date')], axis=1)
df1_time.head()
| | Year | Month | Total number of license issued | lowest price | avg price | Total number of applicants |
---|---|---|---|---|---|---|
0 | 2002 | Jan | 1400 | 13600 | 14735 | 3718 |
1 | 2002 | Feb | 1800 | 13100 | 14057 | 4590 |
2 | 2002 | Mar | 2000 | 14300 | 14662 | 5190 |
3 | 2002 | Apr | 2300 | 16000 | 16334 | 4806 |
4 | 2002 | May | 2350 | 17800 | 18357 | 4665 |
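The year padding can also be done without a helper function, using the vectorized string methods. A minimal sketch on toy data, assuming every year part has one or two digits (the column name `issued` is hypothetical):

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"Date": ["2-Jan", "15-May"], "issued": [1400, 7990]})

# Split once and reuse both parts
parts = toy["Date"].str.split("-", expand=True)
year = "20" + parts[0].str.zfill(2)   # "2" -> "2002", "15" -> "2015"
month = parts[1]

# New columns first, then the original table without the Date column
out = pd.concat(
    [pd.DataFrame({"Year": year, "Month": month}), toy.drop(columns="Date")],
    axis=1,
)
print(out)
```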
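Now set the row index of the table to a multi-level index whose outer level is the year and whose inner level is the variable names of columns 2 to 5 of the original table, with the months as the column index.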
df1_time.set_index(['Year','Total number of license issued','lowest price ',
'avg price','Total number of applicants']).head()
Year | Total number of license issued | lowest price | avg price | Total number of applicants | Month |
---|---|---|---|---|---|
2002 | 1400 | 13600 | 14735 | 3718 | Jan |
2002 | 1800 | 13100 | 14057 | 4590 | Feb |
2002 | 2000 | 14300 | 14662 | 5190 | Mar |
2002 | 2300 | 16000 | 16334 | 4806 | Apr |
2002 | 2350 | 17800 | 18357 | 4665 | May |
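The `set_index` call above moves the values of every variable into the index, so the months do not end up as column labels. One possible way to reach the layout the exercise describes (year and variable name as row levels, months as columns) is to melt and then pivot. This is only a sketch on toy data; the names `Variable` and `value` are assumptions introduced here.

```python
import pandas as pd

# Toy data (hypothetical values, two variables instead of four)
toy = pd.DataFrame({
    "Year": ["2002", "2002", "2003", "2003"],
    "Month": ["Jan", "Feb", "Jan", "Feb"],
    "issued": [1400, 1800, 3000, 3000],
    "lowest price": [13600, 13100, 24000, 25000],
})

# Long format: one row per (Year, Month, Variable)
long = toy.melt(id_vars=["Year", "Month"], var_name="Variable", value_name="value")

# Wide format: (Year, Variable) as row index, months as columns
wide = long.pivot_table(
    index=["Year", "Variable"], columns="Month", values="value", aggfunc="first"
)
print(wide)
```

Note that `pivot_table` sorts the month labels alphabetically; reindexing the columns with a calendar-ordered list would restore the usual order.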
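In general, the difference between a month's lowest price and the previous month's lowest price has the same sign as the difference between that month's average price and the previous month's average price. Which auction dates do not have this property?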
con1 = (df1['lowest price '].values - np.insert(df1['lowest price '].values[:-1],0,0))>0
con2 = (df1['avg price'].values - np.insert(df1['avg price'].values[:-1],0,0))>0
con = con1^con2
df1_time[con][['Year',"Month"]].values
array([['2003', 'Oct'],
['2003', 'Nov'],
['2004', 'Jun'],
['2005', 'Jan'],
['2005', 'Feb'],
['2005', 'Sep'],
['2006', 'May'],
['2006', 'Sep'],
['2007', 'Jan'],
['2007', 'Feb'],
['2007', 'Dec'],
['2012', 'Oct']], dtype=object)
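A variant of the same check can be written with `diff()` and `np.sign`. The sketch below uses toy data with hypothetical column names. Unlike the `> 0` test above, it treats a difference of exactly 0 as its own sign, so the two versions can disagree when a price does not change.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr"],
    "lowest": [100, 110, 105, 120],
    "avg": [120, 125, 130, 128],
})

# Month-over-month change, reduced to -1, 0 or 1
low_sign = np.sign(toy["lowest"].diff())
avg_sign = np.sign(toy["avg"].diff())

# The first row has no previous month: diff() gives NaN and NaN != NaN is True,
# so that row is excluded explicitly
mismatch = (low_sign != avg_sign) & low_sign.notna()
print(toy.loc[mismatch, "Month"].tolist())
```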
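Define the issuance gain of a month as the number of licenses issued that month minus the mean number issued over the previous two months, with the first two months filled with 0. Find when the issuance gain reaches its extreme values.
mean_two_rows = [0, 0]
for i in range(len(df1) - 2):
    # mean of rows i and i+1, i.e. the two months preceding row i+2
    mean_two_rows.append(df1[i:i+2]['Total number of license issued'].mean())
df1_mean = pd.Series(mean_two_rows)
df1_submean = df1_time['Total number of license issued'] - df1_mean
df1_submean[0:2] = 0
print('Max:', df1_time.loc[df1_submean.idxmax()].values[0:2])
print('Min:', df1_time.loc[df1_submean.idxmin()].values[0:2])
Max: ['2008' 'Jan']
Min: ['2008' 'Apr']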
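The explicit loop can be replaced by `shift` and `rolling`. A minimal sketch on a toy series (values are made up):

```python
import pandas as pd

# Toy monthly issuance counts (hypothetical values)
issued = pd.Series([10, 20, 30, 10, 50])

# shift(1) moves each value one month later, so a window of 2 on the shifted
# series covers exactly the two months before the current one
prev_two_mean = issued.shift(1).rolling(2).mean()

# The first two months have no full window and are filled with 0
gain = (issued - prev_two_mean).fillna(0)
print(gain.tolist())
print(gain.idxmax(), gain.idxmin())
```

`idxmax` and `idxmin` return index labels, which can then be used to look up the year and month in the original table.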
Compute the total cargo volume of freight flights for each year.
df2.groupby('Year')['Whole year'].sum()
Year
2007 659438.23
2008 664682.46
2009 560809.77
2010 693033.98
2011 818691.71
2012 846388.03
2013 792337.08
2014 729457.12
2015 630208.97
2016 679370.15
2017 773662.28
2018 767095.28
2019 764606.27
Name: Whole year, dtype: float64
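Are the recorded airports the same every year?
temp = list(df2.groupby('Year').get_group(2007)['Airport name'].values)
for n, d in df2.groupby('Year'):
    a = list(d['Airport name'].values)
    if a == temp:
        # 2007 is the first year, so there is no previous year to compare with
        if n != 2007:
            print(n, 'same as the previous year')
    else:
        print(n, 'differs from the previous year')
    temp = a
2008 same as the previous year
2009 same as the previous year
2010 same as the previous year
2011 same as the previous year
2012 same as the previous year
2013 same as the previous year
2014 same as the previous year
2015 same as the previous year
2016 same as the previous year
2017 same as the previous year
2018 differs from the previous year
2019 differs from the previous year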
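Comparing lists also reports a difference when the same airports merely appear in a different order. If only membership matters, sets can be compared instead. A sketch on toy data (airport names are placeholders):

```python
import pandas as pd

# Toy data (hypothetical): 2008 lists the same airports as 2007 in another order
toy = pd.DataFrame({
    "Year": [2007, 2007, 2008, 2008, 2009, 2009],
    "Airport name": ["A", "B", "B", "A", "A", "C"],
})

# One set of airport names per year
sets = {year: set(g["Airport name"]) for year, g in toy.groupby("Year")}

# Years whose set differs from that of the preceding year
years = sorted(sets)
changed = [y for prev, y in zip(years, years[1:]) if sets[y] != sets[prev]]
print(changed)
```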
For each year from 2010 to 2015, compute the proportion of airports whose whole-year cargo volume is recorded as 0.
for year in range(2010, 2016):
    df2_year = df2.groupby('Year').get_group(year)
    print(f'{year}:')
    # rows with a whole-year volume of 0, divided by the number of airports recorded that year
    print(df2_year['Whole year'].value_counts()[0] / df2_year['Airport name'].count())
2010:
0.7671232876712328
2011:
0.7705479452054794
2012:
0.7705479452054794
2013:
0.7705479452054794
2014:
0.7705479452054794
2015:
0.7705479452054794
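Because the mean of a boolean column is the share of `True` values, the per-year proportion can be computed without a loop. A minimal sketch on toy data:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Year": [2010, 2010, 2010, 2010, 2011, 2011],
    "Whole year": [0.0, 0.0, 5.5, 0.0, 0.0, 3.2],
})

# Keep the years of interest, flag zero records, average the flags per year
in_range = toy[toy["Year"].between(2010, 2015)]
zero_share = in_range["Whole year"].eq(0).groupby(in_range["Year"]).mean()
print(zero_share)
```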
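If an airport has at least 5 years in which all monthly volume records are 0, delete its records for all years from the table and return the processed table.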
air_name = pd.crosstab(index=df2['Airport name'],columns=df2['Whole year']).iloc[:,0]>=5
air_name = air_name[air_name.isin([True])]
df2_nozero = df2.set_index('Airport name').drop(air_name.index)
df2_nozero
Airport name | Year | January | February | March | April | May | June | July | August | September | October | November | December | Whole year | Airport coordinates |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Abakan | 2019 | 44.70 | 66.21 | 72.70 | 75.82 | 100.34 | 78.38 | 63.88 | 73.06 | 66.74 | 75.44 | 110.50 | 89.80 | 917.57 | (Decimal('91.399735'), Decimal('53.751351')) |
Anadyr (Carbon) | 2019 | 81.63 | 143.01 | 260.90 | 304.36 | 122.00 | 106.87 | 84.99 | 130.00 | 102.00 | 118.00 | 94.00 | 199.00 | 1746.76 | (Decimal('177.738273'), Decimal('64.713433')) |
Anapa (Vitjazevo) | 2019 | 45.92 | 53.15 | 54.00 | 54.72 | 52.00 | 67.45 | 172.31 | 72.57 | 70.00 | 63.00 | 69.00 | 82.10 | 856.22 | (Decimal('37.341511'), Decimal('45.003748')) |
Arkhangelsk (Talagy) | 2019 | 85.61 | 118.70 | 131.39 | 144.82 | 137.95 | 140.18 | 128.56 | 135.68 | 124.75 | 139.60 | 210.27 | 307.10 | 1804.61 | (Decimal('40.714892'), Decimal('64.596138')) |
Astrakhan (Narimanovo) | 2019 | 51.75 | 61.08 | 65.60 | 71.84 | 71.38 | 63.95 | 164.86 | 79.46 | 85.21 | 87.23 | 79.06 | 99.16 | 980.58 | (Decimal('47.999896'), Decimal('46.287344')) |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
Reads (tub) | 2007 | 55.96 | 80.09 | 85.90 | 154.54 | 162.71 | 107.51 | 80.14 | 138.71 | 133.19 | 188.97 | 228.84 | 184.00 | 1600.56 | (Decimal('113.306492'), Decimal('52.020464')) |
Yuzhno-(Khomutovo) | 2007 | 710.80 | 970.00 | 1330.30 | 1352.30 | 1324.40 | 1613.00 | 1450.70 | 1815.60 | 1902.30 | 1903.20 | 1666.10 | 1632.10 | 17670.80 | (Decimal('142.723677'), Decimal('46.886967')) |
Yakutsk | 2007 | 583.70 | 707.80 | 851.80 | 1018.00 | 950.80 | 900.00 | 1154.90 | 1137.84 | 1485.50 | 1382.50 | 1488.00 | 1916.60 | 13577.44 | (Decimal('129.750225'), Decimal('62.086594')) |
Yamburg | 2007 | 3.55 | 0.16 | 3.37 | 5.32 | 4.31 | 6.30 | 6.88 | 3.60 | 4.13 | 4.93 | 4.17 | 8.87 | 55.59 | (Decimal('75.097783'), Decimal('67.980026')) |
Yaroslavl (Tunoshna) | 2007 | 847.00 | 1482.90 | 1325.40 | 1235.97 | 629.00 | 838.00 | 1211.30 | 915.00 | 1249.60 | 1650.50 | 1822.60 | 2055.60 | 15262.87 | (Decimal('40.170054'), Decimal('57.56231')) |
862 rows × 15 columns
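An alternative to the crosstab is to count zero years per airport with `groupby(...).transform` and filter the rows directly, which keeps the original row index. The sketch assumes, as the solution above does, that a whole-year total of 0 stands for all twelve months being 0; the data is made up.

```python
import pandas as pd

# Toy data (hypothetical): airport A has five zero years, airport B has two
toy = pd.DataFrame({
    "Airport name": ["A"] * 6 + ["B"] * 6,
    "Year": list(range(2007, 2013)) * 2,
    "Whole year": [0, 0, 0, 0, 0, 3.0, 1.0, 0, 2.0, 0, 4.0, 5.0],
})

# Number of zero years of the airport, repeated on each of its rows
zero_years = toy["Whole year"].eq(0).groupby(toy["Airport name"]).transform("sum")

# Keep only airports with fewer than 5 zero years
kept = toy[zero_years < 5]
print(kept["Airport name"].unique())
```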
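Divide all airports into four regions (east, south, west, north) in a reasonable way, and report the region with the largest total cargo volume in 2017-2019.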
import re
df2_5 = df2.set_index(['Year']).sort_index().loc[2017:2019].loc[:,["Whole year","Airport coordinates"]]
df2_found = df2_5[df2_5['Airport coordinates']!='Not found']
df2_air = df2_found['Airport coordinates']
loc_x = []
loc_y = []
for i in range(df2_found.shape[0]):
    loc = re.findall(r"\d+\.?\d*", df2_air.iloc[i])
    loc_x.append(float(loc[0]))
    loc_y.append(float(loc[1]))
df2_loc = df2_found.copy()
df2_loc['locX'] = loc_x
df2_loc['locY'] = loc_y
df2_loc = df2_loc[df2_loc['Whole year']!=0]
binsX = [df2_loc['locX'].min(),df2_loc['locX'].median(),df2_loc['locX'].max()]
binsY = [df2_loc['locY'].min(),df2_loc['locY'].median(),df2_loc['locY'].max()]
cutX = pd.cut(df2_loc['locX'],binsX)
cutY = pd.cut(df2_loc['locY'],binsY)
df2_cut = df2_loc.copy()
df2_cut['cutX'] = cutX
df2_cut['cutY'] = cutY
df2_area = df2_cut.copy()
df2_area.groupby(['Year','cutX','cutY'])['Whole year'].agg(['max'])
Year | cutX | cutY | max |
---|---|---|---|
2017 | (20.586, 75.098] | (42.821, 55.437] | 122862.35 |
2017 | (20.586, 75.098] | (55.437, 71.978] | 293972.50 |
2017 | (75.098, 179.293] | (42.821, 55.437] | 24075.79 |
2017 | (75.098, 179.293] | (55.437, 71.978] | 13852.29 |
2018 | (20.586, 75.098] | (42.821, 55.437] | 120354.17 |
2018 | (20.586, 75.098] | (55.437, 71.978] | 308684.40 |
2018 | (75.098, 179.293] | (42.821, 55.437] | 24640.90 |
2018 | (75.098, 179.293] | (55.437, 71.978] | 15506.82 |
2019 | (20.586, 75.098] | (42.821, 55.437] | 105862.72 |
2019 | (20.586, 75.098] | (55.437, 71.978] | 329817.20 |
2019 | (75.098, 179.293] | (42.821, 55.437] | 26558.90 |
2019 | (75.098, 179.293] | (55.437, 71.978] | 15448.14 |
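The table above shows the largest single record per region and year (`agg(['max'])`), while the exercise asks for the region with the largest total. Also note that `pd.cut` with bins starting at the minimum leaves the minimum value out unless `include_lowest=True` is passed. The sketch below is one possible alternative on toy data: it splits at the median longitude and latitude and sums the volume per quadrant. The column names `lon` and `lat` and the quadrant labels are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical coordinates and volumes)
toy = pd.DataFrame({
    "lon": [30.0, 40.0, 100.0, 130.0],
    "lat": [60.0, 45.0, 62.0, 50.0],
    "Whole year": [300.0, 120.0, 15.0, 25.0],
})

# Label each airport relative to the median longitude and latitude
labelled = toy.assign(
    ew=np.where(toy["lon"] <= toy["lon"].median(), "west", "east"),
    ns=np.where(toy["lat"] <= toy["lat"].median(), "south", "north"),
)

# Total volume per quadrant and the quadrant with the largest total
totals = labelled.groupby(["ns", "ew"])["Whole year"].sum()
print(totals)
print(totals.idxmax())
```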
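In statistics, rank is often used to represent a ranking. Define the rank of an airport in a given month of a given year as the position of that month's cargo volume among all months of that year for that airport (for example, if the January 2019 volume of *** airport is the largest of the 12 months of 2019, its rank is 1). A rank-based measure of how large a month's volume is can then be obtained by adding up the ranks of all airports in that month; call this quantity the composite rank index of the month. Using this definition, compute the composite rank index for each of the 12 months of 2016.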
df2_16 = df2[df2['Year'] ==2016].reset_index().loc[:,['Airport name','January', 'February', 'March',
'April', 'May','June', 'July', 'August',
'September', 'October', 'November','December']]
df2_mon = df2_16.melt(id_vars=['Airport name'],value_vars=['January', 'February', 'March',
'April', 'May','June', 'July', 'August',
'September', 'October', 'November','December'],
var_name='Month',value_name='num')
month_cols = list(df2_16.columns[1:])  # the twelve month columns, in calendar order
# replace each volume by its rank within the airport's own twelve months (ties share the lowest rank)
df2_rank = df2_mon.assign(num=df2_mon.groupby('Airport name')['num'].rank(method='min'))
pd.pivot_table(df2_rank, columns='Month', values='num', aggfunc='sum')[month_cols]
Month | January | February | March | April | May | June | July | August | September | October | November | December |
---|---|---|---|---|---|---|---|---|---|---|---|---|
num | 402.0 | 507.0 | 628.0 | 701.0 | 631.0 | 633.0 | 603.0 | 703.0 | 736.0 | 771.0 | 824.0 | 905.0 |
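With one row per airport and one column per month, the index can also be computed directly with a row-wise `rank`. This is a sketch on toy data and it assumes that rank 1 means the largest volume, as in the example given in the exercise, hence `ascending=False`. The solution above uses the default ascending order, so its numbers follow the opposite convention.

```python
import pandas as pd

# Toy data (hypothetical): two airports, three months
toy = pd.DataFrame(
    {"January": [10.0, 5.0], "February": [20.0, 1.0], "March": [15.0, 9.0]},
    index=["A", "B"],
)

# Rank the months within each airport (row); rank 1 is the largest volume
ranks = toy.rank(axis=1, method="min", ascending=False)

# Composite rank index: sum of the airports' ranks for each month
rank_index = ranks.sum()
print(rank_index)
```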
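Use the corr() function to compute the correlation coefficient between county population (each row is a county) and the number of deaths on the last recorded date in the table.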
df4[['Population', '2020/4/26']].corr()
| | Population | 2020/4/26 |
---|---|---|
Population | 1.000000 | 0.403844 |
2020/4/26 | 0.403844 | 1.000000 |
As of April 1, compute the proportion of counties with zero infections in each state.
df3_41zero = df3[df3['2020/4/1']==0].groupby('Province_State')['UID'].count()
df3_41all = df3.groupby('Province_State')['UID'].fillna(0).count()
(df3_41zero / df3_41all).head()
Province_State
Alabama 0.002546
Alaska 0.007320
Arkansas 0.007002
California 0.002546
Colorado 0.004456
Name: UID, dtype: float64
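In the code above, `groupby('Province_State')['UID'].fillna(0).count()` evaluates to a single number (the count over all counties), so each state's zero-infection count is divided by the national total rather than by the number of counties in that state. A sketch of a per-state ratio on toy data (state names and case counts are made up):

```python
import pandas as pd

# Toy data (hypothetical): state X has 4 counties, state Y has 2
toy = pd.DataFrame({
    "Province_State": ["X", "X", "X", "X", "Y", "Y"],
    "2020/4/1": [0, 0, 3, 1, 5, 7],
})

# Flag counties with zero cases, then average the flags within each state
zero_ratio = toy["2020/4/1"].eq(0).groupby(toy["Province_State"]).mean()
print(zero_ratio)
```

Here the denominator is the number of counties of each state, so the ratios lie between 0 and 1 per state.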