Google Play Store App Data Analysis

Google Play Store app data source (extraction code: 38jk)

1. Load the data

  • Import the libraries used for the analysis
  • Before loading, skim the data in a text editor
  • After loading, first inspect the data with shape, head(), count(), describe() and info()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

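# Load the file
# This analysis only uses 'App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type'
df = pd.read_csv('./googleplaystore.csv', usecols=(0, 1, 2, 3, 4, 5, 6))

# Take a quick look at the data
print(df.head())
# Check the number of rows and columns
print(df.shape)
# Check the non-null count of each column
print(df.count())

# Use describe and info to get a rough picture of the distribution
print(df.describe())
# info() prints its report itself and returns None, hence the trailing "None" in the output
print(df.info())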
                                           App        Category  Rating  \
0     Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1   
1                                Coloring book moana  ART_AND_DESIGN     3.9   
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN     4.7   
3                              Sketch - Draw & Paint  ART_AND_DESIGN     4.5   
4              Pixel Draw - Number Art Coloring Book  ART_AND_DESIGN     4.3   

  Reviews  Size     Installs  Type  
0     159   19M      10,000+  Free  
1     967   14M     500,000+  Free  
2   87510  8.7M   5,000,000+  Free  
3  215644   25M  50,000,000+  Free  
4     967  2.8M     100,000+  Free  
(10841, 7)
App         10841
Category    10841
Rating       9367
Reviews     10841
Size        10841
Installs    10841
Type        10840
dtype: int64
            Rating
count  9367.000000
mean      4.193338
std       0.537431
min       1.000000
25%       4.000000
50%       4.300000
75%       4.500000
max      19.000000

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 7 columns):
App         10841 non-null object
Category    10841 non-null object
Rating      9367 non-null float64
Reviews     10841 non-null object
Size        10841 non-null object
Installs    10841 non-null object
Type        10840 non-null object
dtypes: float64(1), object(6)
memory usage: 592.9+ KB
None
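  • From the output above:
  • The data has 10841 rows in total
  • Rating and Type have missing values
  • Rating has an outlier of 19
  • The 'M' and 'k' suffixes in Size and the '+' and ',' in Installs need to be handled before further calculation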

2. Data cleaning # App

  • Check for duplicate values
print(df['App'].unique().size)
9660
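  • There are duplicates (9660 unique names in 10841 rows). Do not delete them yet: to avoid keeping a copy whose other columns hold anomalous values, clean the columns with anomalous values first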
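Before deciding how to drop them, it can help to look at the duplicated rows themselves. A minimal sketch on a toy frame (the rows are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'App': ['A', 'B', 'A', 'C'],
    'Reviews': [10, 5, 12, 7],
})

# keep=False marks every member of a duplicated group, not just the repeats
dupes = toy[toy['App'].duplicated(keep=False)]
print(dupes)

# Number of rows that drop_duplicates('App') would remove
n_removed = int(toy['App'].duplicated().sum())
print(n_removed)
```

Seeing both copies side by side shows whether the duplicates are identical or differ in columns such as Reviews.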

3. Data cleaning # Category

print(df['Category'].value_counts(dropna=False))
print(df[df['Category'] == '1.9'])
FAMILY                 1972
GAME                   1144
TOOLS                   843
MEDICAL                 463
BUSINESS                460
PRODUCTIVITY            424
PERSONALIZATION         392
COMMUNICATION           387
SPORTS                  384
LIFESTYLE               382
FINANCE                 366
HEALTH_AND_FITNESS      341
PHOTOGRAPHY             335
SOCIAL                  295
NEWS_AND_MAGAZINES      283
SHOPPING                260
TRAVEL_AND_LOCAL        258
DATING                  234
BOOKS_AND_REFERENCE     231
VIDEO_PLAYERS           175
EDUCATION               156
ENTERTAINMENT           149
MAPS_AND_NAVIGATION     137
FOOD_AND_DRINK          127
HOUSE_AND_HOME           88
AUTO_AND_VEHICLES        85
LIBRARIES_AND_DEMO       85
WEATHER                  82
ART_AND_DESIGN           65
EVENTS                   64
COMICS                   60
PARENTING                60
BEAUTY                   53
1.9                       1
Name: Category, dtype: int64
                                           App Category  Rating Reviews  \
10472  Life Made WI-Fi Touchscreen Photo Frame      1.9    19.0    3.0M   

         Size Installs Type  
10472  1,000+     Free    0  
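  • There is one anomalous row. Its Category value appears to be missing, which shifted every later field one column to the left, so this row is deleted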
df.drop(index=10472, inplace=True)
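The bad row could also be found without knowing its value in advance. The sketch below assumes, based on the value_counts output above, that every valid Category label consists only of upper-case letters and underscores; the toy rows are invented.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'App': ['x', 'y', 'z'],
    'Category': ['GAME', '1.9', 'ART_AND_DESIGN'],
})

# Assumption: valid labels are upper-case letters and underscores only
valid = toy['Category'].str.fullmatch(r'[A-Z_]+')
bad_rows = toy[~valid]
print(bad_rows)

# Keep only the rows whose Category looks like a real label
cleaned = toy[valid]
```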

4. Data cleaning # Rating

print(df['Rating'].value_counts(dropna=False))
NaN     1474
4.4     1109
4.3     1076
4.5     1038
4.2      952
4.6      823
4.1      708
4.0      568
4.7      499
3.9      386
3.8      303
5.0      274
3.7      239
4.8      234
3.6      174
3.5      163
3.4      128
3.3      102
4.9       87
3.0       83
3.1       69
3.2       64
2.9       45
2.8       42
2.6       25
2.7       25
2.5       21
2.3       20
2.4       19
1.0       16
2.2       14
1.9       13
2.0       12
1.8        8
1.7        8
2.1        8
1.6        4
1.5        3
1.4        3
1.2        1
Name: Rating, dtype: int64
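  • There are 1474 NaN values in total; fill them with the mean
# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in newer pandas versions
df['Rating'] = df['Rating'].fillna(df['Rating'].mean())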
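Filling with the global mean gives every unrated app the same score. A possible alternative, not used in this post, is to fill with the mean of the app's own category. A sketch on invented data:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Category': ['GAME', 'GAME', 'TOOLS', 'TOOLS'],
    'Rating': [4.0, np.nan, 3.0, np.nan],
})

# Global mean, as in the post: every NaN gets the same value
global_fill = toy['Rating'].fillna(toy['Rating'].mean())

# Per-category mean: each NaN gets the mean of its own category
group_mean = toy.groupby('Category')['Rating'].transform('mean')
group_fill = toy['Rating'].fillna(group_mean)

print(global_fill.tolist())
print(group_fill.tolist())
```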

5. Data cleaning # Reviews

print(df['Rating'].value_counts(dropna=False))
print(df['Reviews'].str.isnumeric().sum())
4.193338     1474
4.400000     1109
4.300000     1076
4.500000     1038
4.200000      952
4.600000      823
4.100000      708
4.000000      568
4.700000      499
3.900000      386
3.800000      303
5.000000      274
3.700000      239
4.800000      234
3.600000      174
3.500000      163
3.400000      128
3.300000      102
4.900000       87
3.000000       83
3.100000       69
3.200000       64
2.900000       45
2.800000       42
2.700000       25
2.600000       25
2.500000       21
2.300000       20
2.400000       19
1.000000       16
2.200000       14
1.900000       13
2.000000       12
2.100000        8
1.800000        8
1.700000        8
1.600000        4
1.400000        3
1.500000        3
1.200000        1
Name: Rating, dtype: int64
10840
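  • value_counts shows a wide spread of values, and the isnumeric count (10840) confirms that every Reviews entry is numeric
  • Convert Reviews to 'i8' (int64) to make the later analysis easier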
df['Reviews'] = df['Reviews'].astype('i8')
print(df.describe())
    Rating       Reviews
count  10840.000000  1.084000e+04
mean       4.191757  4.441529e+05
std        0.478907  2.927761e+06
min        1.000000  0.000000e+00
25%        4.100000  3.800000e+01
50%        4.200000  2.094000e+03
75%        4.500000  5.477550e+04
max        5.000000  7.815831e+07
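astype('i8') raises an error if any entry is not a valid integer string, which is why the isnumeric check comes first. Another way to locate offending entries, sketched here on invented values, is pd.to_numeric with errors='coerce':

```python
import pandas as pd

# Toy values, invented for illustration only
raw = pd.Series(['159', '967', '3.0M', '87510'])

# errors='coerce' turns anything unparseable into NaN instead of raising
parsed = pd.to_numeric(raw, errors='coerce')

# The entries that could not be parsed as numbers
bad = raw[parsed.isna()]
print(bad)
```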

6. Data cleaning # Size

print(df['Size'].value_counts())
Varies with device    1695
11M                    198
12M                    196
14M                    194
13M                    191
15M                    184
17M                    160
19M                    154
26M                    149
16M                    149
25M                    143
20M                    139
21M                    138
10M                    136
24M                    136
18M                    133
23M                    117
22M                    114
29M                    103
27M                     97
28M                     95
30M                     84
33M                     79
3.3M                    77
37M                     76
35M                     72
31M                     70
2.9M                    69
2.3M                    68
2.5M                    68
                      ... 
809k                     1
39k                      1
691k                     1
241k                     1
954k                     1
378k                     1
203k                     1
887k                     1
754k                     1
253k                     1
11k                      1
787k                     1
992k                     1
626k                     1
857k                     1
54k                      1
862k                     1
743k                     1
642k                     1
234k                     1
313k                     1
82k                      1
549k                     1
400k                     1
240k                     1
778k                     1
161k                     1
478k                     1
89k                      1
154k                     1
Name: Size, Length: 461, dtype: int64
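  • The values carry 'M' and 'k' suffixes that need handling, and 1695 rows contain the string 'Varies with device'
  • Replace 'Varies with device' with '0'
  • Convert Size to f8 (float64)
  • Then replace the 0 values with the mean
# Turn the suffixes into exponents so the strings parse as floats (19M -> 19e+6)
df['Size'] = df['Size'].str.replace('M', 'e+6')
df['Size'] = df['Size'].str.replace('k', 'e+3')
# Replace the remaining string
df['Size'] = df['Size'].str.replace('Varies with device', '0')
# Convert the data type
df['Size'] = df['Size'].astype('f8')
# Note: this mean is computed with the 0 placeholders still included
df['Size'] = df['Size'].replace(0, df['Size'].mean())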
df['Size']
0        1.900000e+07
1        1.400000e+07
2        8.700000e+06
3        2.500000e+07
4        2.800000e+06
5        5.600000e+06
6        1.900000e+07
7        2.900000e+07
8        3.300000e+07
9        3.100000e+06
10       2.800000e+07
11       1.200000e+07
12       2.000000e+07
13       2.100000e+07
14       3.700000e+07
15       2.700000e+06
16       5.500000e+06
17       1.700000e+07
18       3.900000e+07
19       3.100000e+07
20       1.400000e+07
21       1.200000e+07
22       4.200000e+06
23       7.000000e+06
24       2.300000e+07
25       6.000000e+06
26       2.500000e+07
27       6.100000e+06
28       4.600000e+06
29       4.200000e+06
             ...     
10811    3.900000e+06
10812    1.300000e+07
10813    2.700000e+06
10814    3.100000e+07
10815    4.900000e+06
10816    6.800000e+06
10817    8.000000e+06
10818    1.500000e+06
10819    3.600000e+06
10820    8.600000e+06
10821    2.500000e+06
10822    3.100000e+06
10823    2.900000e+06
10824    8.200000e+07
10825    7.700000e+06
10826    1.815209e+07
10827    1.300000e+07
10828    1.300000e+07
10829    7.400000e+06
10830    2.300000e+06
10831    9.800000e+06
10832    5.820000e+05
10833    6.190000e+05
10834    2.600000e+06
10835    9.600000e+06
10836    5.300000e+07
10837    3.600000e+06
10838    9.500000e+06
10839    1.815209e+07
10840    1.900000e+07
Name: Size, Length: 10840, dtype: float64
print(df.describe())
             Rating       Reviews          Size
count  10840.000000  1.084000e+04  1.084000e+04
mean       4.191757  4.441529e+05  2.099045e+07
std        0.478907  2.927761e+06  2.078345e+07
min        1.000000  0.000000e+00  8.500000e+03
25%        4.100000  3.800000e+01  5.900000e+06
50%        4.200000  2.094000e+03  1.800000e+07
75%        4.500000  5.477550e+04  2.600000e+07
max        5.000000  7.815831e+07  1.000000e+08
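One caveat with the approach above: the mean used for filling is computed while the 0 placeholders are still in the column, so the fill value (about 1.8e+07) is lower than the mean of the known sizes. A sketch of an alternative that marks unknown sizes as NaN first, so that mean() skips them; parse_size is a hypothetical helper, not part of the original code, and the input values are invented:

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration only
raw = pd.Series(['19M', '500k', 'Varies with device', '1M'])

def parse_size(text):
    """Hypothetical helper: convert '19M' / '500k' to a number, else NaN."""
    units = {'M': 1e6, 'k': 1e3}
    if text and text[-1] in units:
        return float(text[:-1]) * units[text[-1]]
    return np.nan

size = raw.map(parse_size)

# mean() skips NaN, so the unknown rows do not drag the average down
filled = size.fillna(size.mean())
print(filled.tolist())
```

Whether to fill at all is a separate judgment call; leaving the NaN in place would also keep describe() honest about how many sizes are actually known.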

7. Data cleaning # Installs

  • Look at the distribution first
print(df['Installs'].value_counts())
1,000,000+        1579
10,000,000+       1252
100,000+          1169
10,000+           1054
1,000+             907
5,000,000+         752
100+               719
500,000+           539
50,000+            479
5,000+             477
100,000,000+       409
10+                386
500+               330
50,000,000+        289
50+                205
5+                  82
500,000,000+        72
1+                  67
1,000,000,000+      58
0+                  14
0                    1
Name: Installs, dtype: int64
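  • There are only a few distinct values, so strip the extra characters directly
# regex=False treats '+' as a literal character in every pandas version
df['Installs'] = df['Installs'].str.replace('+', '', regex=False)
df['Installs'] = df['Installs'].str.replace(',', '', regex=False)
  • Convert the data type to 'i8' (int64)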
df['Installs'] = df['Installs'].astype('i8')
print(df.describe())
            Rating       Reviews          Size      Installs
count  10840.000000  1.084000e+04  1.084000e+04  1.084000e+04
mean       4.191757  4.441529e+05  2.099045e+07  1.546434e+07
std        0.478907  2.927761e+06  2.078345e+07  8.502936e+07
min        1.000000  0.000000e+00  8.500000e+03  0.000000e+00
25%        4.100000  3.800000e+01  5.900000e+06  1.000000e+03
50%        4.200000  2.094000e+03  1.800000e+07  1.000000e+05
75%        4.500000  5.477550e+04  2.600000e+07  5.000000e+06
max        5.000000  7.815831e+07  1.000000e+08  1.000000e+09
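The two replace calls can also be combined into one regular expression. A small sketch on invented values:

```python
import pandas as pd

# Toy values, invented for illustration only
raw = pd.Series(['10,000+', '500,000+', '0'])

# One character class removes both '+' and ',' in a single pass
installs = raw.str.replace(r'[+,]', '', regex=True).astype('int64')
print(installs.tolist())
```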

8. Data cleaning # Type

  • info showed a missing value in Type, so value_counts needs dropna=False here
print(df['Type'].value_counts(dropna=False))
print(df[df['Type'].isnull()])
Free    10039
Paid      800
NaN         1
Name: Type, dtype: int64
                            App Category    Rating  Reviews          Size  \
9148  Command & Conquer: Rivals   FAMILY  4.191757        0  1.815209e+07   

      Installs Type  
9148         0  NaN  

  • Delete this row
df.drop(index=9148, inplace=True)
  • Finally, drop the rows with duplicate App names
df.drop_duplicates('App', inplace=True)
  • Data cleaning is done; the analysis can begin
  • Overall picture
print(df.describe())
 Rating       Reviews          Size      Installs
count  9658.000000  9.658000e+03  9.658000e+03  9.658000e+03
mean      4.176046  2.166150e+05  2.011053e+07  7.778312e+06
std       0.494383  1.831413e+06  2.040865e+07  5.376100e+07
min       1.000000  0.000000e+00  8.500000e+03  0.000000e+00
25%       4.000000  2.500000e+01  5.300000e+06  1.000000e+03
50%       4.200000  9.670000e+02  1.600000e+07  1.000000e+05
75%       4.500000  2.940800e+04  2.500000e+07  1.000000e+06
max       5.000000  7.815831e+07  1.000000e+08  1.000000e+09
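Note that drop_duplicates keeps the first occurrence of each app, whichever row that happens to be. If the copy with the most reviews should be kept instead (an assumption about what is wanted, not something the post does), sorting first is one option:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'App': ['A', 'B', 'A'],
    'Reviews': [10, 5, 12],
})

# Default behaviour: keep the first occurrence of each App
first = toy.drop_duplicates('App')

# Sort by Reviews first so the most-reviewed copy is the one that survives
most_reviewed = (toy.sort_values('Reviews', ascending=False)
                    .drop_duplicates('App')
                    .sort_index())
print(first)
print(most_reviewed)
```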

9. Data analysis # Category&App

  • Number of categories
print(df.Category.unique().size)
33
  • Count the apps in each category and sort; this shows which categories are most popular with developers
Category_App_count = df.groupby('Category').count().sort_values('App', ascending=False)['App']
print(Category_App_count)
plt.figure(figsize=(20,10),dpi=80)
Category_App_count.plot(kind='barh')
plt.savefig('./Category_App_count.png')
plt.show()
Category
FAMILY                 1831
GAME                    959
TOOLS                   827
BUSINESS                420
MEDICAL                 395
PERSONALIZATION         376
PRODUCTIVITY            374
LIFESTYLE               369
FINANCE                 345
SPORTS                  325
COMMUNICATION           315
HEALTH_AND_FITNESS      288
PHOTOGRAPHY             281
NEWS_AND_MAGAZINES      254
SOCIAL                  239
BOOKS_AND_REFERENCE     222
TRAVEL_AND_LOCAL        219
SHOPPING                202
DATING                  171
VIDEO_PLAYERS           163
MAPS_AND_NAVIGATION     131
EDUCATION               119
FOOD_AND_DRINK          112
ENTERTAINMENT           102
AUTO_AND_VEHICLES        85
LIBRARIES_AND_DEMO       84
WEATHER                  79
HOUSE_AND_HOME           74
EVENTS                   64
ART_AND_DESIGN           64
PARENTING                60
COMICS                   56
BEAUTY                   53
Name: App, dtype: int64
  • Visualization of the app counts for the 33 categories
    [Figure 1: Category_App_count.png]
  • Visualization of the top 10 categories by app count
count_top_10 = df.groupby('Category').count()['App'].sort_values(ascending=False)[:10]
print(count_top_10)
plt.figure(figsize=(20,10),dpi=80)
x = count_top_10.index
y = count_top_10.values
# Add data labels
for a, b in zip(x, y):
    plt.text(a, b, b, ha='center', va='bottom', fontsize=12)
plt.bar(x, y, width=0.5)
plt.savefig('./count_top_10.png')
plt.show()
Category
FAMILY             1831
GAME                959
TOOLS               827
BUSINESS            420
MEDICAL             395
PERSONALIZATION     376
PRODUCTIVITY        374
LIFESTYLE           369
FINANCE             345
SPORTS              325
Name: App, dtype: int64

[Figure 2: count_top_10.png]

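10. Data analysis # Category&Installs

  • Rank the 33 categories by average installs
  • Visualize the top 10 categories by average installs
# Average installs for each of the 33 categories, sorted
# (select the column before mean(); newer pandas versions no longer skip the string columns)
Category_Installs_mean = df.groupby('Category')['Installs'].mean().sort_values(ascending=False)
print(Category_Installs_mean)
# Visualize the top 10 categories by average installs
mean_top_10 = df.groupby('Category')['Installs'].mean().sort_values(ascending=False)[:10]
print(mean_top_10)
plt.figure(figsize=(20,10),dpi=80)
x = mean_top_10.index
y = mean_top_10.values.astype('i8')
# Add data labels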
for a, b in zip(x, y):
    plt.text(a, b, b, ha='center', va='bottom', fontsize=12)
plt.bar(x, y, width=0.5)
plt.savefig('./mean_top_10.png')
plt.show()
Category
COMMUNICATION          3.504215e+07
VIDEO_PLAYERS          2.409143e+07
SOCIAL                 2.296179e+07
ENTERTAINMENT          2.072216e+07
PHOTOGRAPHY            1.654501e+07
PRODUCTIVITY           1.548955e+07
GAME                   1.447229e+07
TRAVEL_AND_LOCAL       1.321866e+07
TOOLS                  9.675661e+06
NEWS_AND_MAGAZINES     9.327629e+06
BOOKS_AND_REFERENCE    7.504367e+06
SHOPPING               6.932420e+06
WEATHER                4.570893e+06
PERSONALIZATION        4.075784e+06
HEALTH_AND_FITNESS     3.972300e+06
MAPS_AND_NAVIGATION    3.841846e+06
SPORTS                 3.373768e+06
EDUCATION              2.965983e+06
FAMILY                 2.418319e+06
FOOD_AND_DRINK         1.891060e+06
ART_AND_DESIGN         1.786533e+06
BUSINESS               1.659916e+06
LIFESTYLE              1.365375e+06
FINANCE                1.319851e+06
HOUSE_AND_HOME         1.313682e+06
DATING                 8.241293e+05
COMICS                 8.032348e+05
LIBRARIES_AND_DEMO     6.309037e+05
AUTO_AND_VEHICLES      6.250613e+05
PARENTING              5.253518e+05
BEAUTY                 5.131519e+05
EVENTS                 2.495806e+05
MEDICAL                9.669159e+04
Name: Installs, dtype: float64

Category
COMMUNICATION         3.504215e+07
VIDEO_PLAYERS         2.409143e+07
SOCIAL                2.296179e+07
ENTERTAINMENT         2.072216e+07
PHOTOGRAPHY           1.654501e+07
PRODUCTIVITY          1.548955e+07
GAME                  1.447229e+07
TRAVEL_AND_LOCAL      1.321866e+07
TOOLS                 9.675661e+06
NEWS_AND_MAGAZINES    9.327629e+06
Name: Installs, dtype: float64

[Figure 3: mean_top_10.png]

  • Conclusion: communication, video player, social and entertainment apps have the highest average installs per app
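Installs is heavily skewed: in the describe output above the median is 1e+05 while the mean is about 7.8e+06, so a category mean can be dominated by a handful of very large apps. Comparing mean and median per category is one way to check this; a sketch on invented numbers:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Category': ['SOCIAL', 'SOCIAL', 'SOCIAL', 'TOOLS', 'TOOLS', 'TOOLS'],
    'Installs': [1_000, 5_000, 1_000_000_000, 100_000, 500_000, 1_000_000],
})

# One huge app lifts the SOCIAL mean far above its median
stats = toy.groupby('Category')['Installs'].agg(['mean', 'median'])
print(stats)
```

In this toy example the two statistics rank the categories in opposite order, which is the kind of disagreement worth looking for in the real data.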

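11. Data analysis # Category&Reviews

  • Rank the 33 categories by average review count
  • Visualize the top 10 categories by average review count
# Average review count for each of the 33 categories, sorted
# (select the column before mean(); newer pandas versions no longer skip the string columns)
Category_Reviews_mean = df.groupby('Category')['Reviews'].mean().sort_values(ascending=False)
print(Category_Reviews_mean)
# Top 10 categories by average review count
top_mean_10 = df.groupby('Category')['Reviews'].mean().sort_values(ascending=False)[:10]
print(top_mean_10)

plt.figure(figsize=(20,10),dpi=80)
x = top_mean_10.index
y = top_mean_10.values.astype('i8')
# Add data labels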
for a, b in zip(x, y):
    plt.text(a, b, b, ha='center', va='bottom', fontsize=12)
plt.bar(x, y, width=0.5)
plt.savefig('./top_mean_10.png')
plt.show()
Category
SOCIAL                 953672.807531
COMMUNICATION          907337.676190
GAME                   648903.763295
VIDEO_PLAYERS          414015.754601
PHOTOGRAPHY            374915.551601
ENTERTAINMENT          340810.294118
TOOLS                  277335.644498
SHOPPING               220553.118812
WEATHER                155634.987342
PRODUCTIVITY           148638.098930
PERSONALIZATION        142401.808511
MAPS_AND_NAVIGATION    135337.007634
TRAVEL_AND_LOCAL       122464.570776
EDUCATION              112303.764706
SPORTS                 108765.578462
NEWS_AND_MAGAZINES      91063.889764
FAMILY                  78550.239214
BOOKS_AND_REFERENCE     75321.234234
HEALTH_AND_FITNESS      74171.371528
FOOD_AND_DRINK          56473.464286
COMICS                  41822.696429
FINANCE                 36701.756522
LIFESTYLE               32066.859079
HOUSE_AND_HOME          26079.013514
BUSINESS                23548.202381
ART_AND_DESIGN          22175.046875
DATING                  21190.315789
PARENTING               15972.183333
AUTO_AND_VEHICLES       13690.188235
LIBRARIES_AND_DEMO      10795.607143
BEAUTY                   7476.226415
MEDICAL                  2994.863291
EVENTS                   2515.906250
Name: Reviews, dtype: float64
Category
SOCIAL           953672.807531
COMMUNICATION    907337.676190
GAME             648903.763295
VIDEO_PLAYERS    414015.754601
PHOTOGRAPHY      374915.551601
ENTERTAINMENT    340810.294118
TOOLS            277335.644498
SHOPPING         220553.118812
WEATHER          155634.987342
PRODUCTIVITY     148638.098930
Name: Reviews, dtype: float64

[Figure 4: top_mean_10.png]

  • Conclusion: social, communication, game and video player apps receive the most reviews per app on average

12. Data analysis # Category&Rating

  • Average rating by category
Category_Rating_mean = df.groupby('Category')['Rating'].mean().sort_values(ascending=False)
print(Category_Rating_mean)
Category
EVENTS                 4.363178
EDUCATION              4.362956
ART_AND_DESIGN         4.349614
BOOKS_AND_REFERENCE    4.308393
PERSONALIZATION        4.303077
PARENTING              4.281960
BEAUTY                 4.260553
GAME                   4.244643
SOCIAL                 4.238926
WEATHER                4.238510
HEALTH_AND_FITNESS     4.235199
SHOPPING               4.225835
SPORTS                 4.211275
AUTO_AND_VEHICLES      4.190601
PRODUCTIVITY           4.185022
COMICS                 4.181848
LIBRARIES_AND_DEMO     4.181371
FAMILY                 4.181137
FOOD_AND_DRINK         4.175461
MEDICAL                4.173252
PHOTOGRAPHY            4.159614
HOUSE_AND_HOME         4.156771
NEWS_AND_MAGAZINES     4.135385
ENTERTAINMENT          4.135294
COMMUNICATION          4.134647
BUSINESS               4.133347
FINANCE                4.125060
LIFESTYLE              4.111489
TRAVEL_AND_LOCAL       4.087380
TOOLS                  4.059615
VIDEO_PLAYERS          4.058137
MAPS_AND_NAVIGATION    4.051854
DATING                 4.018100
Name: Rating, dtype: float64

12. Data analysis # Category&Type

  • Break the data down by Type
print(df.groupby('Type')['App'].count())
print(df.groupby('Type')['Installs'].sum().sort_values(ascending=False))
Type
Free    8902
Paid     756
Name: App, dtype: int64
Type
Free    75065572646
Paid       57364881
Name: Installs, dtype: int64
  • Free apps make up the large majority and paid apps only a small share; free is still the mainstream
  • Analyze Category and Type together
df.groupby(['Type', 'Category'])['Reviews'].sum().sort_values(ascending=False)
Type  Category           
Free  GAME                   620725858
      COMMUNICATION          285727154
      TOOLS                  229184641
      SOCIAL                 227927559
      FAMILY                 140192916
      PHOTOGRAPHY            105236039
      VIDEO_PLAYERS           67471201
      PRODUCTIVITY            55418928
      PERSONALIZATION         53249927
      SHOPPING                44551246
      SPORTS                  35198178
      ENTERTAINMENT           34752641
      TRAVEL_AND_LOCAL        26801668
      NEWS_AND_MAGAZINES      23130027
      HEALTH_AND_FITNESS      21315562
      MAPS_AND_NAVIGATION     17721960
      BOOKS_AND_REFERENCE     16719518
      EDUCATION               13329503
      FINANCE                 12638908
      WEATHER                 12158723
      LIFESTYLE               11785249
      BUSINESS                 9865113
      FOOD_AND_DRINK           6321631
Paid  FAMILY                   3632572
Free  DATING                   3621936
      COMICS                   2342071
      HOUSE_AND_HOME           1929847
Paid  GAME                     1572851
Free  ART_AND_DESIGN           1417037
      MEDICAL                  1162965
                               ...    
      BEAUTY                    396240
Paid  PERSONALIZATION           293153
      TOOLS                     171937
      PRODUCTIVITY              171721
Free  EVENTS                    161018
Paid  SPORTS                    150635
      WEATHER                   136441
      PHOTOGRAPHY               115231
      COMMUNICATION              84214
      LIFESTYLE                  47422
      HEALTH_AND_FITNESS         45793
      EDUCATION                  34645
      BUSINESS                   25132
      FINANCE                    23198
      MEDICAL                    20006
      TRAVEL_AND_LOCAL           18073
      VIDEO_PLAYERS              13367
      ENTERTAINMENT              10009
      PARENTING                   8366
      MAPS_AND_NAVIGATION         7188
      AUTO_AND_VEHICLES           4163
      FOOD_AND_DRINK              3397
      ART_AND_DESIGN              2166
      BOOKS_AND_REFERENCE         1796
      DATING                      1608
      SHOPPING                     484
      SOCIAL                       242
      NEWS_AND_MAGAZINES           201
      LIBRARIES_AND_DEMO             4
      EVENTS                         0
Name: Reviews, Length: 63, dtype: int64
  • Review-to-install ratio
Type_Category = df.groupby(['Type', 'Category'])[['Reviews', 'Installs']].mean()
print((Type_Category['Reviews'] / Type_Category['Installs']).sort_values(ascending=False))
Type  Category           
Paid  VIDEO_PLAYERS          0.188268
      FAMILY                 0.175913
      WEATHER                0.168031
      PARENTING              0.166986
      DATING                 0.141674
      ART_AND_DESIGN         0.135375
      FINANCE                0.124988
      PRODUCTIVITY           0.121611
      SPORTS                 0.121107
      BUSINESS               0.118115
      TOOLS                  0.099533
      TRAVEL_AND_LOCAL       0.098727
      HEALTH_AND_FITNESS     0.096587
      PERSONALIZATION        0.089958
      AUTO_AND_VEHICLES      0.083011
      BOOKS_AND_REFERENCE    0.077029
      GAME                   0.074898
      COMMUNICATION          0.061920
      PHOTOGRAPHY            0.061334
      MAPS_AND_NAVIGATION    0.059356
      EDUCATION              0.057550
      FOOD_AND_DRINK         0.056617
Free  COMICS                 0.052068
Paid  ENTERTAINMENT          0.050045
      SHOPPING               0.047921
Free  GAME                   0.044792
      SOCIAL                 0.041533
Paid  SOCIAL                 0.040333
      LIFESTYLE              0.040218
      LIBRARIES_AND_DEMO     0.040000
                               ...   
Free  MAPS_AND_NAVIGATION    0.035221
      PERSONALIZATION        0.034821
      WEATHER                0.033747
      SPORTS                 0.032138
      SHOPPING               0.031815
      FAMILY                 0.031809
      MEDICAL                0.030903
      PARENTING              0.030185
      FOOD_AND_DRINK         0.029856
      TOOLS                  0.028648
      FINANCE                0.027768
      COMMUNICATION          0.025888
      DATING                 0.025703
      LIFESTYLE              0.023446
      PHOTOGRAPHY            0.022645
      AUTO_AND_VEHICLES      0.021844
      HOUSE_AND_HOME         0.019852
      HEALTH_AND_FITNESS     0.018640
      VIDEO_PLAYERS          0.017182
      LIBRARIES_AND_DEMO     0.017111
      ENTERTAINMENT          0.016443
      BEAUTY                 0.014569
      BUSINESS               0.014155
      ART_AND_DESIGN         0.012395
      EVENTS                 0.010081
      BOOKS_AND_REFERENCE    0.010036
      NEWS_AND_MAGAZINES     0.009763
      PRODUCTIVITY           0.009569
      TRAVEL_AND_LOCAL       0.009259
Paid  EVENTS                 0.000000
Length: 63, dtype: float64
  • Paid apps have a higher review-to-install ratio
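The ratio printed above is a ratio of group means, which is the same as total reviews divided by total installs, so large apps dominate it. Averaging each app's own ratio weights all apps equally and can give a different picture. A sketch on invented numbers (in the real data, apps with 0 installs would have to be excluded before dividing):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Type': ['Free', 'Free', 'Paid', 'Paid'],
    'Reviews': [10, 1_000, 5, 50],
    'Installs': [1_000, 10_000, 10, 1_000],
})

g = toy.groupby('Type')[['Reviews', 'Installs']]

# Ratio of means, as computed in the post (identical to sum / sum)
ratio_of_means = g.mean()['Reviews'] / g.mean()['Installs']
ratio_of_sums = g.sum()['Reviews'] / g.sum()['Installs']

# Mean of per-app ratios: every app counts equally
per_app = toy['Reviews'] / toy['Installs']
mean_of_ratios = per_app.groupby(toy['Type']).mean()

print(ratio_of_means)
print(mean_of_ratios)
```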

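13. Data analysis # Correlation (corr)

# Restrict to the numeric columns; newer pandas versions no longer drop the string columns automatically
print(df[['Rating', 'Reviews', 'Size', 'Installs']].corr())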
Rating   Reviews      Size  Installs
Rating    1.000000  0.054337  0.052751  0.039245
Reviews   0.054337  1.000000  0.080578  0.625164
Size      0.052751  0.080578  1.000000  0.050675
Installs  0.039245  0.625164  0.050675  1.000000
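  • Reviews and Installs are clearly correlated (about 0.63); every other pair is below 0.1 and can be treated as uncorrelated (as a rule of thumb, above 0.5 can be considered correlated and above 0.3 weakly correlated)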
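df.corr() computes the Pearson coefficient by default, which measures linear association. For skewed columns such as Installs a rank-based (Spearman) coefficient may be worth comparing; it can be obtained by applying Pearson to the ranks. A sketch on invented data:

```python
import pandas as pd

# Toy data, invented for illustration only: monotonic but far from linear
toy = pd.DataFrame({
    'Reviews': [1, 2, 3, 4, 5],
    'Installs': [10, 100, 1_000, 10_000, 1_000_000],
})

# Pearson: strength of the linear relationship
pearson = toy['Reviews'].corr(toy['Installs'])

# Spearman: Pearson computed on the ranks of the values
spearman = toy['Reviews'].rank().corr(toy['Installs'].rank())

print(pearson, spearman)
```

Here the rank-based value is 1 because the relationship is perfectly monotonic, while the Pearson value is noticeably lower.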
