Pandas Basics Practice (Comprehensive Exercises)

Comprehensive Exercises: Pandas

DataWhale Team Learning, 12th Session: Python Pandas

Contents

  • Comprehensive Exercises: Pandas
    • 1. Shanghai Motor Vehicle License Plate Auctions, 2002-2018
      • 1.1
      • 1.2
      • 1.3
      • 1.4
      • 1.5
      • 1.6
    • 2. Cargo Flight Volume at Russian Airports, 2007-2019
      • 2.1
      • 2.2
      • 2.3
      • 2.4
      • 2.5
      • 2.6
    • 3. The Spread of COVID-19 in the United States
      • 3.1
      • 3.2

1. Shanghai Motor Vehicle License Plate Auctions, 2002-2018

1.1

Which auction was the first with a winning rate below 5%?

df1_rate = df1.assign(biddingRate = df1['Total number of license issued']/df1['Total number of applicants'])
df1_rate[df1_rate['biddingRate']<=0.05].iloc[0].values[0]
'15-May'
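
The same idea can be sketched on a small made-up table. The data below is hypothetical, and the comparison is strict (`<`) to match the wording "below 5%"; this is only an illustration, not the solution used above.

```python
import pandas as pd

# Hypothetical toy data in the same layout as the auction table
toy = pd.DataFrame({
    "Date": ["14-Nov", "14-Dec", "15-Jan"],
    "Total number of license issued": [10, 5, 4],
    "Total number of applicants": [100, 100, 100],
})

# Winning rate = licenses issued / applicants
rate = toy["Total number of license issued"] / toy["Total number of applicants"]

# First date whose rate is strictly below 5%
first_below = toy.loc[rate < 0.05, "Date"].iloc[0]
print(first_below)
```

Selecting the `Date` column by name avoids relying on it being the first column of the row.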

1.2

For each year, compute the following statistics of the lowest auction price: maximum, mean, and 0.75 quantile. Show them in a single table.

df1_year = df1.assign(Year=df1['Date'].str.split('-',expand=True).iloc[:,0].astype(int))
mean_values = df1_year.groupby(['Year'])['lowest price '].mean()
max_values = df1_year.groupby(['Year'])['lowest price '].max()
percent_values = df1_year.groupby(['Year'])['lowest price '].describe(percentiles=[.75])['75%']
df1_lowest = pd.DataFrame({'Mean': mean_values,
                           'Max': max_values,
                           '75%': percent_values})
df1_lowest.head()
              Mean    Max      75%
Year
2     20316.666667  30800  24300.0
3     31983.333333  38500  36300.0
4     29408.333333  44200  38400.0
5     31908.333333  37900  35600.0
6     37058.333333  39900  39525.0
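
As an alternative, the three statistics can be computed in one pass with named aggregation. This is a minimal sketch on hypothetical data; the helper `q75` and the output column names are assumptions made for the illustration.

```python
import pandas as pd

# Hypothetical toy data; the trailing space in 'lowest price ' mirrors the source column name
toy = pd.DataFrame({
    "Year": [2, 2, 2, 3, 3, 3],
    "lowest price ": [100, 200, 300, 400, 500, 900],
})

def q75(s):
    """Return the 0.75 quantile of a Series."""
    return s.quantile(0.75)

# One groupby, three named output columns
summary = toy.groupby("Year")["lowest price "].agg(Max="max", Mean="mean", Q75=q75)
print(summary)
```

This avoids building three separate Series and then stitching them into a DataFrame.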

1.3

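Split the first column (the date) into two columns, one for the year (formatted as 20××) and one for the month (English abbreviation). Add them to the table as the first and second columns, delete the original first column, and shift the remaining columns back accordingly.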

def to_year(x):
    if len(x) == 1:
        return "200"+x
    elif len(x) == 2:
        return "20"+x
    else:
        return -1
    
year = df1['Date'].str.split('-', expand=True).iloc[:, 0].apply(to_year)
month = df1['Date'].str.split('-', expand=True).iloc[:, 1]
df1_time = pd.DataFrame({'Year': year, 'Month': month})
df1_time = pd.concat([df1_time, df1.drop(columns='Date')], axis=1)
df1_time.head()
   Year Month  Total number of license issued  lowest price   avg price  Total number of applicants
0  2002   Jan                            1400          13600      14735                        3718
1  2002   Feb                            1800          13100      14057                        4590
2  2002   Mar                            2000          14300      14662                        5190
3  2002   Apr                            2300          16000      16334                        4806
4  2002   May                            2350          17800      18357                        4665
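
The year padding can also be done without a helper function, by zero-padding the year part to two digits and prefixing "20". The sample dates below are hypothetical but follow the `year-month` format of the `Date` column.

```python
import pandas as pd

# Hypothetical sample in the same 'year-month' format as the Date column
dates = pd.Series(["2-Jan", "9-Dec", "15-May"])

parts = dates.str.split("-", expand=True)

# "2" -> "02" -> "2002", "15" -> "2015"
year = "20" + parts[0].str.zfill(2)
month = parts[1]

split_dates = pd.DataFrame({"Year": year, "Month": month})
print(split_dates)
```

This assumes every year fits in two digits and belongs to the 2000s, as in the auction data.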

1.4

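Now set the row index of the table to a MultiIndex whose outer level is the year and whose inner level holds the variable names of columns 2 to 5 of the original table; the column index is the month.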

df1_time.set_index(['Year','Total number of license issued','lowest price ',
                      'avg price','Total number of applicants']).head()
                                                                                          Month
Year Total number of license issued lowest price  avg price Total number of applicants
2002 1400                           13600         14735     3718                          Jan
     1800                           13100         14057     4590                          Feb
     2000                           14300         14662     5190                          Mar
     2300                           16000         16334     4806                          Apr
     2350                           17800         18357     4665                          May
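
Read literally, the task asks for the variable names (not their values) on the inner index level and the months as columns. One way to get that shape is to melt the table to long form and pivot it back. The sketch below uses hypothetical values and only two variables; `Variable` and `Value` are names chosen for the illustration.

```python
import pandas as pd

# Hypothetical toy data with the same Year/Month layout as df1_time
toy = pd.DataFrame({
    "Year": ["2002", "2002", "2003", "2003"],
    "Month": ["Jan", "Feb", "Jan", "Feb"],
    "lowest price ": [100, 110, 200, 210],
    "avg price": [150, 160, 250, 260],
})

# Long form: one row per (Year, Month, Variable)
long_form = toy.melt(id_vars=["Year", "Month"], var_name="Variable", value_name="Value")

# Rows: (Year, Variable); columns: Month
wide = long_form.pivot_table(index=["Year", "Variable"], columns="Month",
                             values="Value", aggfunc="first")

# pivot_table sorts columns alphabetically, so restore calendar order
wide = wide[["Jan", "Feb"]]
print(wide)
```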

1.5

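In general, the difference between a month's lowest price and the previous month's lowest price has the same sign as the difference between that month's average price and the previous month's average price. Which auction dates do not have this property?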

con1 = (df1['lowest price '].values - np.insert(df1['lowest price '].values[:-1],0,0))>0
con2 = (df1['avg price'].values - np.insert(df1['avg price'].values[:-1],0,0))>0
con = con1^con2
df1_time[con][['Year',"Month"]].values
array([['2003', 'Oct'],
       ['2003', 'Nov'],
       ['2004', 'Jun'],
       ['2005', 'Jan'],
       ['2005', 'Feb'],
       ['2005', 'Sep'],
       ['2006', 'May'],
       ['2006', 'Sep'],
       ['2007', 'Jan'],
       ['2007', 'Feb'],
       ['2007', 'Dec'],
       ['2012', 'Oct']], dtype=object)
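
A variant of this check uses `diff()` and `np.sign`, which also separates a zero difference from a negative one. It is a sketch on hypothetical prices and may flag different rows than the code above whenever a difference is exactly zero.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices
toy = pd.DataFrame({
    "lowest price ": [100, 110, 105, 120],
    "avg price": [150, 160, 158, 118],
})

# Sign of the month-over-month change: -1, 0 or 1 (NaN for the first row)
low_sign = np.sign(toy["lowest price "].diff())
avg_sign = np.sign(toy["avg price"].diff())

# The first row has no previous month, so it is excluded explicitly
mismatch = (low_sign != avg_sign) & low_sign.notna()
print(toy[mismatch])
```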

1.6

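Define the issuance gain of a month as the difference between that month's number of licenses issued and the mean number issued over the two preceding months, filling the first two months with 0. Find the dates at which the issuance gain reaches its extreme values.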

mean_two_rows = [0,0]
for i in range(len(df1)-2):
    mean_two_rows.append(df1[i:i+2]['Total number of license issued'].mean())
df1_mean = pd.Series(mean_two_rows)
df1_submean = df1_time['Total number of license issued'] - df1_mean
df1_submean[0:2]=0
print('Max:',df1_time.loc[df1_submean.idxmax()].values[0:2])
print('Min:',df1_time.loc[df1_submean.idxmin()].values[0:2])
Max: ['2008' 'Jan']
Min: ['2008' 'Apr']
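
The mean of the two preceding months can also be written without an explicit loop, using a rolling window shifted by one row. This is a minimal sketch on a hypothetical series.

```python
import pandas as pd

# Hypothetical monthly issuance numbers
issued = pd.Series([10, 20, 40, 5, 100])

# rolling(2).mean() averages the current and previous row;
# shift(1) moves it down so each row sees the two rows before it
prev_two_mean = issued.rolling(2).mean().shift(1)

# The first two months have no full window, so their gain is set to 0
gain = (issued - prev_two_mean).fillna(0)

print(gain.idxmax(), gain.idxmin())
```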

2. Cargo Flight Volume at Russian Airports, 2007-2019

2.1

Compute the total cargo flight volume for each year

df2.groupby('Year')['Whole year'].sum()
Year
2007    659438.23
2008    664682.46
2009    560809.77
2010    693033.98
2011    818691.71
2012    846388.03
2013    792337.08
2014    729457.12
2015    630208.97
2016    679370.15
2017    773662.28
2018    767095.28
2019    764606.27
Name: Whole year, dtype: float64

2.2

Are the same airports recorded every year?

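temp = list(df2.groupby('Year').get_group(2007)['Airport name'].values)
for n,d in df2.groupby('Year'):
    a = list(d['Airport name'].values)
    if a == temp:
        # 2007 is the first year, so it has no previous year to compare with
        if n != 2007:
            print(n,'same as the previous year')
    else:
        print(n,'differs from the previous year')
    temp = a
2008 same as the previous year
2009 same as the previous year
2010 same as the previous year
2011 same as the previous year
2012 same as the previous year
2013 same as the previous year
2014 same as the previous year
2015 same as the previous year
2016 same as the previous year
2017 same as the previous year
2018 differs from the previous year
2019 differs from the previous year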
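
Comparing lists is sensitive to row order. If only the set of airports matters, each year can be compared as a set against a reference year. The sketch below uses hypothetical airport names and takes the earliest year as the reference.

```python
import pandas as pd

# Hypothetical toy data: 2008 lists the same airports in a different order
toy = pd.DataFrame({
    "Year": [2007, 2007, 2008, 2008, 2009, 2009],
    "Airport name": ["A", "B", "B", "A", "A", "C"],
})

# One set of airport names per year
sets = {year: set(group["Airport name"]) for year, group in toy.groupby("Year")}

# Use the earliest year as the reference
reference = sets[min(sets)]
differs = [year for year, names in sets.items() if names != reference]
print(differs)
```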

2.3

For each year from 2010 to 2015, compute the proportion of airports whose whole-year cargo volume is recorded as 0

for year in range(2010,2016):
    print(f"{year}:")
    # number of rows with 'Whole year' equal to 0, divided by the number of airports recorded that year
    print(df2.groupby('Year').get_group(year)['Whole year'].value_counts()[0]
          / df2.groupby('Year').get_group(year)['Airport name'].count())
2010:
0.7671232876712328
2011:
0.7705479452054794
2012:
0.7705479452054794
2013:
0.7705479452054794
2014:
0.7705479452054794
2015:
0.7705479452054794
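
Because the mean of a boolean Series is the share of `True` values, the same proportion can be obtained with a single grouped mean. This is a sketch on hypothetical data.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Year": [2010, 2010, 2010, 2010, 2011, 2011],
    "Whole year": [0.0, 0.0, 5.0, 0.0, 0.0, 7.5],
})

# True where the yearly total is 0; the grouped mean is the share of zeros per year
zero_share = toy["Whole year"].eq(0).groupby(toy["Year"]).mean()
print(zero_share)
```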

2.4

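If an airport has at least 5 years in which every monthly volume is recorded as 0, delete its records for all years from the table and return the processed table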

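# first crosstab column counts, per airport, the years whose 'Whole year' value is the smallest one (0)
air_name = pd.crosstab(index=df2['Airport name'],columns=df2['Whole year']).iloc[:,0]>=5
# keep only the airports flagged True, then drop all of their rows
air_name = air_name[air_name]
df2_nozero = df2.set_index('Airport name').drop(air_name.index)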
df2_nozero
Year January February March April May June July August September October November December Whole year Airport coordinates
Airport name
Abakan 2019 44.70 66.21 72.70 75.82 100.34 78.38 63.88 73.06 66.74 75.44 110.50 89.80 917.57 (Decimal('91.399735'), Decimal('53.751351'))
Anadyr (Carbon) 2019 81.63 143.01 260.90 304.36 122.00 106.87 84.99 130.00 102.00 118.00 94.00 199.00 1746.76 (Decimal('177.738273'), Decimal('64.713433'))
Anapa (Vitjazevo) 2019 45.92 53.15 54.00 54.72 52.00 67.45 172.31 72.57 70.00 63.00 69.00 82.10 856.22 (Decimal('37.341511'), Decimal('45.003748'))
Arkhangelsk (Talagy) 2019 85.61 118.70 131.39 144.82 137.95 140.18 128.56 135.68 124.75 139.60 210.27 307.10 1804.61 (Decimal('40.714892'), Decimal('64.596138'))
Astrakhan (Narimanovo) 2019 51.75 61.08 65.60 71.84 71.38 63.95 164.86 79.46 85.21 87.23 79.06 99.16 980.58 (Decimal('47.999896'), Decimal('46.287344'))
...
Reads (tub) 2007 55.96 80.09 85.90 154.54 162.71 107.51 80.14 138.71 133.19 188.97 228.84 184.00 1600.56 (Decimal('113.306492'), Decimal('52.020464'))
Yuzhno-(Khomutovo) 2007 710.80 970.00 1330.30 1352.30 1324.40 1613.00 1450.70 1815.60 1902.30 1903.20 1666.10 1632.10 17670.80 (Decimal('142.723677'), Decimal('46.886967'))
Yakutsk 2007 583.70 707.80 851.80 1018.00 950.80 900.00 1154.90 1137.84 1485.50 1382.50 1488.00 1916.60 13577.44 (Decimal('129.750225'), Decimal('62.086594'))
Yamburg 2007 3.55 0.16 3.37 5.32 4.31 6.30 6.88 3.60 4.13 4.93 4.17 8.87 55.59 (Decimal('75.097783'), Decimal('67.980026'))
Yaroslavl (Tunoshna) 2007 847.00 1482.90 1325.40 1235.97 629.00 838.00 1211.30 915.00 1249.60 1650.50 1822.60 2055.60 15262.87 (Decimal('40.170054'), Decimal('57.56231'))

862 rows × 15 columns
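
The filtering step can also be expressed by counting zero years per airport and dropping the matching rows with `isin`. The sketch uses hypothetical data and, like the code above, treats a zero `Whole year` value as a proxy for "all months are zero"; the threshold of 2 is chosen only so the toy data has something to drop (the exercise uses 5).

```python
import pandas as pd

# Hypothetical toy data: airport A has two zero years, airport B has one
toy = pd.DataFrame({
    "Airport name": ["A"] * 3 + ["B"] * 3,
    "Year": [2007, 2008, 2009] * 2,
    "Whole year": [0.0, 0.0, 1.0, 2.0, 0.0, 3.0],
})

min_zero_years = 2  # hypothetical threshold for the toy data; the exercise uses 5

# Number of years with a zero yearly total, per airport
zero_years = toy["Whole year"].eq(0).groupby(toy["Airport name"]).sum()

# Airports at or above the threshold are removed for all years
drop_names = zero_years[zero_years >= min_zero_years].index
kept = toy[~toy["Airport name"].isin(drop_names)]
print(kept)
```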

2.5

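Divide all airports into four regions (east, south, west, north) in a reasonable way, and report the region with the largest total cargo volume for 2017-2019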

import re
df2_5 = df2.set_index(['Year']).sort_index().loc[2017:2019].loc[:,["Whole year","Airport coordinates"]]
df2_found = df2_5[df2_5['Airport coordinates']!='Not found']
df2_air = df2_found['Airport coordinates']

loc_x = []
loc_y = []
for i in range(df2_found.shape[0]):
    loc = re.findall(r"\d+\.?\d*", df2_air.iloc[i])  
    loc_x.append(float(loc[0]))
    loc_y.append(float(loc[1]))

df2_loc = df2_found.copy()
df2_loc['locX'] = loc_x
df2_loc['locY'] = loc_y
df2_loc = df2_loc[df2_loc['Whole year']!=0]

binsX = [df2_loc['locX'].min(),df2_loc['locX'].median(),df2_loc['locX'].max()]
binsY = [df2_loc['locY'].min(),df2_loc['locY'].median(),df2_loc['locY'].max()]
cutX = pd.cut(df2_loc['locX'],binsX)
cutY = pd.cut(df2_loc['locY'],binsY)
df2_cut = df2_loc.copy()
df2_cut['cutX'] = cutX
df2_cut['cutY'] = cutY
df2_area = df2_cut.copy()
df2_area.groupby(['Year','cutX','cutY'])['Whole year'].agg(['max'])
                                                max
Year cutX              cutY
2017 (20.586, 75.098]  (42.821, 55.437]   122862.35
                       (55.437, 71.978]   293972.50
     (75.098, 179.293] (42.821, 55.437]    24075.79
                       (55.437, 71.978]    13852.29
2018 (20.586, 75.098]  (42.821, 55.437]   120354.17
                       (55.437, 71.978]   308684.40
     (75.098, 179.293] (42.821, 55.437]    24640.90
                       (55.437, 71.978]    15506.82
2019 (20.586, 75.098]  (42.821, 55.437]   105862.72
                       (55.437, 71.978]   329817.20
     (75.098, 179.293] (42.821, 55.437]    26558.90
                       (55.437, 71.978]    15448.14
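
Since the task asks for the region with the largest total, a variant that sums the volume per quadrant and picks the largest might look like the sketch below. The coordinates and volumes are hypothetical, and splitting at the median longitude and latitude is just one possible way to define the four regions.

```python
import pandas as pd

# Hypothetical toy data: locX is longitude, locY is latitude
toy = pd.DataFrame({
    "locX": [30.0, 40.0, 100.0, 130.0],
    "locY": [60.0, 45.0, 50.0, 65.0],
    "Whole year": [10.0, 20.0, 5.0, 1.0],
})

# Split at the medians: east/west by longitude, north/south by latitude
east = toy["locX"] > toy["locX"].median()
north = toy["locY"] > toy["locY"].median()

toy["region"] = (north.map({True: "North", False: "South"})
                 + "-"
                 + east.map({True: "East", False: "West"}))

# Total volume per region and the region with the largest total
totals = toy.groupby("region")["Whole year"].sum()
print(totals.idxmax(), totals.max())
```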

2.6

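In statistics, rank is often used to denote position in an ordering. Define the rank of an airport in a given month of a given year as the position of that month's cargo volume among all 12 months of that year for that airport (for example, if the January 2019 volume of airport *** is ranked first among the 12 months of 2019, its rank is 1). To judge the relative size of a month's volume, sum the ranks of all airports for that month and call the result that month's composite rank index. Using this definition, compute the composite rank index for each of the 12 months of 2016.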

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August',
          'September', 'October', 'November', 'December']
df2_16 = df2.loc[df2['Year'] == 2016, ['Airport name'] + months]
df2_mon = df2_16.melt(id_vars=['Airport name'], value_vars=months,
                      var_name='Month', value_name='num')

# replace each volume with its rank within the airport (ascending, ties share the lowest rank)
df2_mon['num'] = df2_mon.groupby('Airport name')['num'].rank(method='min')
# sum the ranks of all airports for each month
pd.pivot_table(df2_mon, columns='Month', values='num', aggfunc='sum')[months]
Month  January  February  March  April    May   June   July  August  September  October  November  December
num      402.0     507.0  628.0  701.0  631.0  633.0  603.0   703.0      736.0    771.0     824.0     905.0

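3. The Spread of COVID-19 in the United States

3.1

Use the corr() function to compute the correlation coefficient between county population (each row is a county) and the number of deaths on the last recorded date in the table.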

df4[['Population', '2020/4/26']].corr()
            Population  2020/4/26
Population    1.000000   0.403844
2020/4/26     0.403844   1.000000

3.2

As of April 1, compute the proportion of counties with zero infections in each state.

# counties with zero cases on April 1, per state
df3_41zero = df3[df3['2020/4/1']==0].groupby('Province_State')['UID'].count()
# all counties, per state (the denominator must stay grouped by state)
df3_41all = df3.groupby('Province_State')['UID'].count()
# states without any zero-infection county are missing from the numerator, so fill them with 0
(df3_41zero / df3_41all).fillna(0).head()
