Pandas Data Analysis - Task06: Comprehensive Exercises

Comprehensive pandas exercises

  • I. 2002-2018 Shanghai vehicle license plate auctions
  • II. 2007-2019 Russian airport cargo flight volumes
  • III. The spread of COVID-19 in the United States

I. 2002-2018 Shanghai Vehicle License Plate Auctions

Date    Total number of license issued    lowest price    avg price    Total number of applicants
2-Jan   1400    13600   14735   3718
2-Feb   1800    13100   14057   4590
2-Mar   2000    14300   14662   5190

(the first rows of the data; a Date of '2-Jan' means January 2002)

Questions
(1) Which auction was the first whose winning rate fell below 5%?

# winning rate = licenses issued / applicants; keep the rows where it is below 5%
idf1 = df[df['Total number of license issued'] / df['Total number of applicants'] < 0.05]
print(idf1.index[0])    # index of the first such auction

Result:
159

(2) For each year, compute the following statistics of the auction's lowest
price: maximum, mean, and 0.75 quantile, all displayed in one table. (Do (3) first.)

# note: 'lowest price ' keeps the trailing space from the source file's header
group_year = idf2.drop(columns = ['Month',
     'Total number of license issued', 'avg price',
     'Total number of applicants']).groupby('Year')
idf3 = pd.DataFrame()
idf3['max'] = group_year['lowest price '].max()
idf3['mean'] = group_year['lowest price '].mean()
idf3['quantile'] = group_year['lowest price '].quantile(0.75)
print(idf3)

Result:

max mean quantile
Year
2002 30800 20316.666667 24300.0
2003 38500 31983.333333 36300.0
2004 44200 29408.333333 38400.0
2005 37900 31908.333333 35600.0
2006 39900 37058.333333 39525.0
2007 53800 45691.666667 48950.0
2008 37300 29945.454545 34150.0
2009 36900 31333.333333 34150.0
2010 44900 38008.333333 41825.0
2011 53800 47958.333333 51000.0
2012 68900 61108.333333 65325.0
2013 90800 79125.000000 82550.0
2014 74600 73816.666667 74000.0
2015 85300 80575.000000 83450.0
2016 88600 85733.333333 87475.0
2017 93500 90616.666667 92350.0
2018 89000 87825.000000 88150.0

(3) Split the first (date) column into two columns, one for the year
(formatted 20xx) and one for the month (English abbreviation), insert them as
the first and second columns, delete the original first column, and shift the
remaining columns right in order.

# '2-Jan' -> year 2002, month 'Jan'
df['Year'] = df['Date'].apply(lambda x: 2000 + int(x.split('-')[0]))
df['Month'] = df['Date'].apply(lambda x: x.split('-')[1])
idf2 = df.drop(columns = 'Date')
idf2 = idf2.reindex(columns = ['Year', 'Month',
     'Total number of license issued', 'lowest price ',
     'avg price', 'Total number of applicants'])
print(idf2.head())

Result:

Year Month ... avg price Total number of applicants
0 2002 Jan ... 14735 3718
1 2002 Feb ... 14057 4590
2 2002 Mar ... 14662 5190
3 2002 Apr ... 16334 4806
4 2002 May ... 18357 4665

(the middle columns are elided in the console printout)

(4) Now give the table a two-level row index: the outer level is the year and
the inner level holds the names of the original table's second through fifth
columns, with the months as the column index.
I roughly know what the result should look like, but I could not produce it.
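One way to produce the shape described in (4) is a set_index / stack / unstack chain. The sketch below is not the author's solution; it uses synthetic random data shaped like idf2 from (3) (only three months and two years, as stand-ins), but the column names match the real table:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for idf2: Year, Month, plus the four value columns.
rng = np.random.default_rng(0)
months = ['Jan', 'Feb', 'Mar']               # the real data has all 12
rows = [(y, m) for y in (2002, 2003) for m in months]
df = pd.DataFrame(rows, columns=['Year', 'Month'])
for col in ['Total number of license issued', 'lowest price ',
            'avg price', 'Total number of applicants']:
    df[col] = rng.integers(1000, 50000, size=len(df))

wide = (df.set_index(['Year', 'Month'])      # MultiIndex rows (Year, Month)
          .stack()                           # long form (Year, Month, variable)
          .unstack('Month')                  # move Month out to the columns
          .reindex(columns=months))          # restore calendar order
print(wide)
```

After `unstack('Month')` the row index is (Year, variable name) and the columns are the months, which is exactly the layout the question asks for; `reindex` is only there because `unstack` sorts the columns alphabetically.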

(5) In general, the difference between a month's lowest price and the previous
month's lowest price has the same sign as the corresponding difference in
average price. Which auction dates do not follow this pattern?

# compare consecutive months: if the two differences have opposite signs,
# their product is negative; the data has 203 monthly records
for index in range(202):
    flag = (df.loc[index, 'lowest price '] - df.loc[index+1, 'lowest price ']) * \
           (df.loc[index, 'avg price'] - df.loc[index+1, 'avg price'])
    if flag < 0:
        print(df.loc[index+1, 'Date'])

Result:
3-Oct
3-Nov
4-Jun
5-Jan
5-Feb
5-Sep
6-May
6-Sep
7-Jan
7-Feb
7-Dec
12-Oct

(6) Define a month's issue gain as its license issuance minus the mean
issuance of the previous two months, with the first two months filled with 0.
Find the dates at which the extremes of the issue gain occur.

df['issue gain'] = 0    # the first two months stay 0
for index in range(2, 203):
    df.loc[index, 'issue gain'] = df.loc[index, 'Total number of license issued'] \
        - (df.loc[index-1, 'Total number of license issued']
           + df.loc[index-2, 'Total number of license issued']) / 2
imin = df[df['issue gain'] == df['issue gain'].min()]['Date']
imax = df[df['issue gain'] == df['issue gain'].max()]['Date']
print("min:", imin.values[0], '\nmax:', imax.values[0])

Result:
min: 8-Apr
max: 8-Jan

II. 2007-2019 Russian Airport Cargo Flight Volumes

Airport name  Year  January  February  March  April  May  June  July  August  September  October  November  December  Whole year  Airport coordinates
Abakan 2019 44.7 66.21 72.7 75.82 100.34 78.38 63.88 73.06 66.74 75.44 110.5 89.8 917.57 "(Decimal('91.399735') Decimal('53.751351'))"
Aikhal 2019 0 0 0 0 0 0 0 0 0 0 0 0 0 "(Decimal('111.543324') Decimal('65.957161'))"
Loss 2019 0 0 0 0 0 0 0 0 0 0 0 0 0 "(Decimal('125.398355') Decimal('58.602489'))"
Amderma 2019 0 0 0 0 0 0 0 0 0 0 0 0 0 "(Decimal('61.577429') Decimal('69.759076'))"
Anadyr (Carbon) 2019 81.63 143.01 260.9 304.36 122 106.87 84.99 130 102 118 94 199 1746.76 "(Decimal('177.738273') Decimal('64.713433'))"

Questions
(1) Compute the total cargo flight volume for each year.

group_year = df.groupby('Year')
print(group_year['Whole year'].sum())

Result:

Year
2007 659438.23
2008 664682.46
2009 560809.77
2010 693033.98
2011 818691.71
2012 846388.03
2013 792337.08
2014 729457.12
2015 630208.97
2016 679370.15
2017 773662.28
2018 767095.28
2019 764606.27

(2) Are the same airports recorded every year?

print(group_year['Airport name'].count())

Result:

Year
2007 292
2008 292
2009 292
2010 292
2011 292
2012 292
2013 292
2014 292
2015 292
2016 292
2017 292
2018 248
2019 251

The counts differ across years, so the recorded airports are not identical every year.

(3) For each year from 2010 to 2015, compute the proportion of airports whose
whole-year cargo volume is recorded as 0.

idf3 = group_year['Whole year'].agg(rate = lambda x: str(len(x[x == 0])/len(x)*100)+'%')
print(idf3.loc[2010:2015])

Result:

rate
Year
2010 76.71232876712328%
2011 77.05479452054794%
2012 77.05479452054794%
2013 77.05479452054794%
2014 77.05479452054794%
2015 77.05479452054794%

(4) If an airport has 5 or more years in which every monthly volume record is
0, remove all of its records (for every year) from the table, and return the
processed table.

group_name = df.groupby('Airport name')
# a zero 'Whole year' total implies every monthly record is 0 (volumes are non-negative)
zerocount = group_name['Whole year'].agg(count = lambda x: len(x[x == 0]))
# "5 years or more", so the threshold is inclusive
idf4 = df.set_index('Airport name').drop(zerocount[zerocount['count'] >= 5].index)
print(idf4)

Result:

Year Airport coordinates
Airport name
Abakan 2019 (Decimal('91.399735'), Decimal('53.751351'))
Anadyr(Carbon) 2019 (Decimal('177.738273'), Decimal('64.713433'))
Anapa(Vitjazevo) 2019 (Decimal('37.341511'), Decimal('45.003748'))
Arkhangelsk(Talagy) 2019 (Decimal('40.714892'), Decimal('64.596138'))
Astrakhan(Narimanovo) 2019 (Decimal('47.999896'), Decimal('46.287344'))

[5 rows x 15 columns]

(5) Partition all airports into four regions, east, south, west, and north,
in some reasonable way, and report which region has the largest total cargo
volume over 2017-2019.
I was not sure how to split into literal east/south/west/north regions. If
the four regions are instead the northeast/southeast/northwest/southwest
quadrants, one can split on the mean longitude and latitude and then compare
the grouped totals.
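The quadrant idea above can be sketched as follows. Everything here is a stand-in: the 'lon'/'lat' columns are assumed to have already been parsed out of the 'Airport coordinates' strings, the four-row frame is synthetic, and the split uses medians rather than means (a minor swap that keeps the quadrants balanced):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the airports table with pre-parsed coordinates.
df = pd.DataFrame({
    'Airport name': ['A', 'B', 'C', 'D'],
    'Year': [2017, 2018, 2019, 2019],
    'Whole year': [100.0, 50.0, 75.0, 20.0],
    'lon': [91.4, 125.4, 37.3, 61.6],
    'lat': [53.8, 58.6, 45.0, 69.8],
})

# Split at the median longitude/latitude into NE/NW/SE/SW quadrants.
lon_mid, lat_mid = df['lon'].median(), df['lat'].median()
ns = np.where(df['lat'] >= lat_mid, 'N', 'S')
ew = np.where(df['lon'] >= lon_mid, 'E', 'W')
df['region'] = np.char.add(ns, ew)        # 'NE', 'NW', 'SE', 'SW'

# Total 2017-2019 volume per region; the idxmax is the answer.
totals = (df[df['Year'].between(2017, 2019)]
          .groupby('region')['Whole year'].sum())
print(totals.idxmax(), totals.max())
```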

(6) In statistics, rank often denotes a position in an ordering. Define the
rank of an airport for a given month of a given year as that month's
cargo-volume ranking among the 12 months of that year for that airport (e.g.,
if an airport's January 2019 volume is the largest of the 12 months of 2019,
its rank is 1). To judge the relative size of a month's volume, sum the ranks
of all airports for that month, and call this sum the month's composite rank
index. Compute the composite rank index for each of the 12 months of 2016.

df2016 = df.query('Year == 2016').reset_index()
Month = df2016.columns[3:15]            # the 12 monthly columns
irank = pd.DataFrame(index = Month)
for ix in df2016.index:
    # this airport's months, ordered by volume (largest first)
    rank = df2016.loc[ix, Month].sort_values(ascending = False).index.to_list()
    irank[ix] = [rank.index(mon) + 1 for mon in Month]
print(irank.sum(axis = 1))              # composite rank index per month

Result:

January 3406
February 3076
March 2730
April 2432
May 2276
June 2047
July 1854
August 1527
September 1269
October 1009
November 728
December 422

III. The Spread of COVID-19 in the United States

US confirmed cases

UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ Combined_Key 2020/1/22 ... 2020/4/26
84001001 US USA 840 1001 Autauga Alabama US 32.53952745 -86.64408227 "Autauga Alabama US" 0 ... 36
84001003 US USA 840 1003 Baldwin Alabama US 30.72774991 -87.72207058 "Baldwin Alabama US" 0 ... 147
84001005 US USA 840 1005 Barbour Alabama US 31.868263 -85.3871286 "Barbour Alabama US" 0 ... 32
84001007 US USA 840 1007 Bibb Alabama US 32.99642064 -87.1251146 "Bibb Alabama US" 0 ... 34
84001009 US USA 840 1009 Blount Alabama US 33.98210918 -86.56790593 "Blount Alabama US" 0 ... 31

(one cumulative-count column per day from 2020/1/22 through 2020/4/26; the middle date columns are elided here)

US deaths

UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ Combined_Key Population 2020/1/22 ... 2020/4/26
84001001 US USA 840 1001 Autauga Alabama US 32.53952745 -86.64408227 "Autauga Alabama US" 55869 0 ... 2
84001003 US USA 840 1003 Baldwin Alabama US 30.72774991 -87.72207058 "Baldwin Alabama US" 223234 0 ... 3
84001005 US USA 840 1005 Barbour Alabama US 31.868263 -85.3871286 "Barbour Alabama US" 24686 0 ... 0
84001007 US USA 840 1007 Bibb Alabama US 32.99642064 -87.1251146 "Bibb Alabama US" 22394 0 ... 0
84001009 US USA 840 1009 Blount Alabama US 33.98210918 -86.56790593 "Blount Alabama US" 57826 0 ... 0

(same date columns as the confirmed-cases table, plus a Population column; the middle date columns are elided here)

Questions
(1) Use corr() to compute the correlation between county population (each row
is a county) and the deaths recorded on the last date in the table.

# df2 is the deaths table; it holds both Population and the daily death counts
print(df2[['Population','2020/4/26']].corr())

Result:

Population 2020/4/26
Population 1.000000 0.403844
2020/4/26 0.403844 1.000000

(2) As of April 1, compute the proportion of zero-infection counties in each state.

group_Province = df1[['Province_State', '2020/4/1']].groupby('Province_State')
idf2 = group_Province['2020/4/1'].agg(rate = lambda x: str(len(x[x == 0])/len(x)*100)+'%')
print(idf2)

Result:

rate
Province_State
Alabama 11.940298507462686%
Alaska 79.3103448275862%
Arizona 0.0%
Arkansas 29.333333333333332%
California 13.793103448275861%
Colorado 21.875%
Connecticut 0.0%
Delaware 0.0%
District of Columbia 0.0%
Florida 16.417910447761194%
Georgia 12.578616352201259%
Hawaii 20.0%
Idaho 38.63636363636363%
Illinois 48.03921568627451%
Indiana 10.869565217391305%
Iowa 40.4040404040404%
Kansas 60.952380952380956%
Kentucky 44.166666666666664%
Louisiana 6.25%
Maine 25.0%
Maryland 4.166666666666666%
Massachusetts 14.285714285714285%
Michigan 19.27710843373494%
Minnesota 36.7816091954023%
Mississippi 6.097560975609756%
Missouri 39.130434782608695%
Montana 62.5%
Nebraska 75.26881720430107%
Nevada 47.05882352941176%
New Hampshire 10.0%
New Jersey 0.0%
New Mexico 42.42424242424242%
New York 8.064516129032258%
North Carolina 18.0%
North Dakota 54.71698113207547%
Ohio 18.181818181818183%
Oklahoma 37.66233766233766%
Oregon 27.77777777777778%
Pennsylvania 10.44776119402985%
Rhode Island 0.0%
South Carolina 6.521739130434782%
South Dakota 56.060606060606055%
Tennessee 11.578947368421053%
Texas 45.2755905511811%
Utah 48.275862068965516%
Vermont 14.285714285714285%
Virginia 27.06766917293233%
Washington 12.82051282051282%
West Virginia 47.27272727272727%
Wisconsin 31.944444444444443%
Wyoming 34.78260869565217%

(3) Find the three counties where confirmed cases appeared earliest.

idf3 = df1.copy()
towns = []
# scan the date columns in chronological order
for day in df1.columns[11:]:
    if len(towns) >= 3: break
    town = idf3[idf3[day] > 0]['Admin2']      # counties first reporting on this day
    if town.shape[0] > 0:
        towns.extend(town.values)             # ties on the same day are all kept,
        idf3 = idf3.drop(index = town.index)  # which is why five names are printed
print(towns)

Result:
['King', 'Cook', 'Maricopa', 'Los Angeles', 'Orange']

(4) Compute the single-day death increase per state, and report which state
had the largest single-day increase in confirmed cases and on which day (the
maximum is taken jointly over all states and all days, not separately).
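A sketch for (4), on tiny stand-in frames: sum the cumulative counts by state, difference along the date axis, and take a joint idxmax over the stacked (state, day) pairs. The `dates` list and the three-county frames are stand-ins; in the real data df1/df2 are the confirmed/deaths tables and the date columns run from 2020/1/22 to 2020/4/26:

```python
import pandas as pd

# Stand-in confirmed table: one row per county, cumulative counts per date.
dates = ['2020/4/1', '2020/4/2', '2020/4/3']
df1 = pd.DataFrame({'Province_State': ['Alabama', 'Alabama', 'Alaska'],
                    '2020/4/1': [10, 5, 1],
                    '2020/4/2': [12, 9, 1],
                    '2020/4/3': [20, 9, 4]})
df2 = df1.copy()   # stand-in: the real deaths table has the same layout

# Single-day death increase per state: sum counties, then diff along the dates.
daily_deaths = df2.groupby('Province_State')[dates].sum().diff(axis=1)

# Largest single-day confirmed increase over all states and all days jointly.
daily_conf = df1.groupby('Province_State')[dates].sum().diff(axis=1)
flat = daily_conf.stack()          # MultiIndex (state, day); NaNs dropped
state, day = flat.idxmax()
print(state, day, flat.max())
```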

(5) For each state, build a confirmed-and-deaths table whose first column is
the date, starting from the first day that state recorded a death; the second
and third columns are the confirmed and death counts. Save each state as a
separate csv file named "<state name>.csv".
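A sketch for (5), again on tiny stand-in frames (the real date list spans 2020/1/22 to 2020/4/26, and the date columns start at index 11 of df1): aggregate both tables to state level, find each state's first nonzero death day, and write one csv per state:

```python
import pandas as pd

# Stand-in confirmed (df1) and deaths (df2) tables, one row per county.
dates = ['2020/4/1', '2020/4/2', '2020/4/3']
df1 = pd.DataFrame({'Province_State': ['Alabama', 'Alaska'],
                    '2020/4/1': [10, 0], '2020/4/2': [12, 0], '2020/4/3': [20, 3]})
df2 = pd.DataFrame({'Province_State': ['Alabama', 'Alaska'],
                    '2020/4/1': [0, 0], '2020/4/2': [1, 0], '2020/4/3': [2, 0]})

conf = df1.groupby('Province_State')[dates].sum()
dead = df2.groupby('Province_State')[dates].sum()

for state in conf.index:
    deaths = dead.loc[state]
    nonzero = deaths[deaths > 0]
    if nonzero.empty:                  # no death recorded: skip this state
        continue
    start = nonzero.index[0]           # first day a death appears
    keep = dates[dates.index(start):]
    out = pd.DataFrame({'Date': keep,
                        'Confirmed': conf.loc[state, keep].values,
                        'Deaths': deaths[keep].values})
    out.to_csv(f'{state}.csv', index=False)
```

In this stand-in, Alaska never records a death and therefore gets no file, while Alabama.csv starts at 2020/4/2.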

(6) For April 1 through April 10, build tables of new confirmed cases and new
deaths: the first column is the state name, the second and third columns are
the new confirmed and new death counts. Save them as ten separate csv files
named "<date>.csv".
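A sketch for (6), assuming state-level cumulative frames like those built in (4)/(5); the three-day `dates` list and the small frames are stand-ins for the real 2020/3/31-2020/4/10 range (3/31 is needed so 4/1 has a previous day to diff against):

```python
import pandas as pd

# Stand-in state-level cumulative counts (rows = states, columns = dates).
dates = ['2020/3/31', '2020/4/1', '2020/4/2']
conf = pd.DataFrame([[15, 21, 29], [1, 1, 4]],
                    index=['Alabama', 'Alaska'], columns=dates)
dead = conf // 10                  # stand-in deaths table, same shape

new_conf = conf.diff(axis=1)       # new cases = day-over-day difference
new_dead = dead.diff(axis=1)

for day in dates[1:]:              # 4/1 through 4/10 in the real data
    out = pd.DataFrame({'Province_State': conf.index,
                        'New confirmed': new_conf[day].values,
                        'New deaths': new_dead[day].values})
    # '/' cannot appear in a filename, so substitute '-'
    out.to_csv(day.replace('/', '-') + '.csv', index=False)
```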
