数据集:大学毕业生收入

数据集:大学毕业生收入

下载地址,本文以绘制直方图为主。

1. 字段描述

字段名称 字段类型 字段说明
Major_code 整型 专业代码。
Major 字符型 专业名称。
Major_category 字符型 专业所属目录。
Total 整型 总人数。
Employed 整型 就业人数。
Employed_full_time_year_round 整型 全年全职在岗人数。
Unemployed 整型 失业人数。
Unemployment_rate 浮点型 失业率。
Median 整型 收入的中位数。
P25th 整型 收入的25百分位数。
P75th 浮点型 收入的75百分位数。

2. 数据预处理

2.1 导包

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import warnings
warnings.filterwarnings('ignore')

2.2 读取数据

df = pd.read_csv('大学毕业生收入数据集.csv')

3. 数据预览

3.1 预览数据

print(df.head())

结果

Major_code                                  Major  ...  P25th    P75th
0        1100                    GENERAL AGRICULTURE  ...  34000  80000.0
1        1101  AGRICULTURE PRODUCTION AND MANAGEMENT  ...  36000  80000.0
2        1102                 AGRICULTURAL ECONOMICS  ...  40000  98000.0
3        1103                        ANIMAL SCIENCES  ...  30000  72000.0
4        1104                           FOOD SCIENCE  ...  38500  90000.0

3.2 查看基本信息

df.info()

结果

RangeIndex: 173 entries, 0 to 172
Data columns (total 11 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Major_code                     173 non-null    int64  
 1   Major                          173 non-null    object 
 2   Major_category                 173 non-null    object 
 3   Total                          173 non-null    int64  
 4   Employed                       173 non-null    int64  
 5   Employed_full_time_year_round  173 non-null    int64  
 6   Unemployed                     173 non-null    int64  
 7   Unemployment_rate              173 non-null    float64
 8   Median                         173 non-null    int64  
 9   P25th                          173 non-null    int64  
 10  P75th                          173 non-null    float64
dtypes: float64(2), int64(7), object(2)

3.3 查看重复值

print(df.duplicated().sum())

结果

0

3.4 查看缺失值

print(df.isnull().sum())

结果

Major_code                       0
Major                            0
Major_category                   0
Total                            0
Employed                         0
Employed_full_time_year_round    0
Unemployed                       0
Unemployment_rate                0
Median                           0
P25th                            0
P75th                            0
dtype: int64

4. 数据集描述性信息

describe = df.describe()
print(describe)

结果

Major_code         Total  ...         P25th          P75th
count   173.000000  1.730000e+02  ...    173.000000     173.000000
mean   3879.815029  2.302566e+05  ...  38697.109827   82506.358382
std    1687.753140  4.220685e+05  ...   9414.524761   20805.330126
min    1100.000000  2.396000e+03  ...  24900.000000   45800.000000
25%    2403.000000  2.428000e+04  ...  32000.000000   70000.000000
50%    3608.000000  7.579100e+04  ...  36000.000000   80000.000000
75%    5503.000000  2.057630e+05  ...  42000.000000   95000.000000
max    6403.000000  3.123510e+06  ...  78000.000000  210000.000000

[8 rows x 9 columns]

可在变量视图中查看describe

5. 数据分析

5.1 各专业种类(Major_category)的专业分支个数

Major_category_counts=df['Major_category'].value_counts()
print(Major_category_counts)
rects = plt.bar(range(1,17),Major_category_counts);
for rect in rects:  #rects 是三根柱子的集合
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width() / 2, height, str(height), size=12, ha='center', va='bottom')
interval = ['Engineering','Education','Humanities & Liberal Arts','Biology & Life Science','Business','Health','Computers & Mathematics','Agriculture & Natural Resources','Physical Sciences','Social Science','Psychology & Social Work','Arts','Industrial Arts & Consumer Services','Law & Public Policy','Communications & Journalism','Interdisciplinary']
plt.xticks(range(1,17),interval,rotation=90);
plt.title('Number of Branches by Major Category')
plt.ylabel('Counts')
plt.show()

结果

Engineering                            29
Education                              16
Humanities & Liberal Arts              15
Biology & Life Science                 14
Business                               13
Health                                 12
Computers & Mathematics                11
Agriculture & Natural Resources        10
Physical Sciences                      10
Social Science                          9
Psychology & Social Work                9
Arts                                    8
Industrial Arts & Consumer Services     7
Law & Public Policy                     5
Communications & Journalism             4
Interdisciplinary                       1
Name: Major_category, dtype: int64

图示
数据集:大学毕业生收入_第1张图片
结论
由于机械类专业发展历史悠久,故相对来说机械类专业分支数相较其他大类专业要多

5.2 各大类专业收入

averageMoney = []
for i in range(len(interval)):
    sum = 0
    for j in range(173):
        if df['Major_category'][j] == interval[i]:
            sum = sum + df['Median'][j]
    averageMoney.append(sum/Major_category_counts[i])
plt.bar(range(1,17),averageMoney);
plt.xticks(range(1,17),interval,rotation=90);
plt.title('Average Annual salary by Major Category')
plt.ylabel('Moneys')
plt.show()

图示
数据集:大学毕业生收入_第2张图片
结论
由于机械类专业与人工智能、自动化等领域相关,故平均工资比较高;计算机与数学类专业发展前景很好,但是小公司工资普遍不高,大公司工资相对来说较高。

5.3 各大类专业失业率

averageUnemployRate = []
for i in range(len(interval)):
    sum = 0
    for j in range(173):
        if df['Major_category'][j] == interval[i]:
            sum = sum + df['Unemployment_rate'][j]
    averageUnemployRate.append(sum/Major_category_counts[i])
plt.bar(range(1,17),averageUnemployRate);
plt.xticks(range(1,17),interval,rotation=90);
plt.title('Average Unemployment Rate by Major Category')
plt.ylabel('Rate')
plt.show()

图示
数据集:大学毕业生收入_第3张图片
结论
艺术类专业由于可变动性特别大,加上对人才的要求相对来说较为苛刻,故失业率较高。

5.4 各大类专业就业率

averageEmployRate = []
for i in range(len(interval)):
    sum = 0
    for j in range(173):
        if df['Major_category'][j] == interval[i]:
            sum = sum + df['Employed'][j] / df['Total'][j]
    averageEmployRate.append(sum/Major_category_counts[i])
plt.bar(range(1,17),averageEmployRate);
plt.xticks(range(1,17),interval,rotation=90);
plt.title('Average Employment Rate by Major Category')
plt.ylabel('Rate')
plt.show()

图示
数据集:大学毕业生收入_第4张图片
结论
相对来说,由于计算机的发展前景,计算机与数学类的就业率较高。

5.5 各大类专业全年全职在岗率

averageFullTimeRate = []
for i in range(len(interval)):
    sum = 0
    for j in range(173):
        if df['Major_category'][j] == interval[i]:
            sum = sum + df['Employed_full_time_year_round'][j] / df['Employed'][j]
    averageFullTimeRate.append(sum/Major_category_counts[i])
plt.bar(range(1,17),averageFullTimeRate);
plt.xticks(range(1,17),interval,rotation=90);
plt.title('Average Full-Time Rate by Major Category')
plt.ylabel('Rate')
plt.show()

图示
数据集:大学毕业生收入_第5张图片

5.6 各大类专业总人数

averageNum = []
for i in range(len(interval)):
    sum = 0
    for j in range(173):
        if df['Major_category'][j] == interval[i]:
            sum = sum + df['Total'][j]
    averageNum.append(sum/Major_category_counts[i])
plt.bar(range(1,17),averageNum);
plt.xticks(range(1,17),interval,rotation=90);
plt.title('Average Total Numbers by Major Category')
plt.ylabel('Counts')
plt.show()

图示
数据集:大学毕业生收入_第6张图片

5.7 就业失业比

EUratio = []
for i in range(len(interval)):
    EUratio.append(averageEmployRate[i]/averageUnemployRate[i])
plt.bar(range(1,17),EUratio);
plt.xticks(range(1,17),interval,rotation=90);
plt.title('Employment-Unemployment Ratio by Major Category')
plt.ylabel('Ratio')
plt.show()

图示
数据集:大学毕业生收入_第7张图片
结论
相对来说,农业就业的门槛低,就业率高的同时失业率低。

6. 完整代码

# 导包
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import warnings
warnings.filterwarnings('ignore')

# 读取数据
df = pd.read_csv('大学毕业生收入数据集.csv')

# 预览数据
print(df.head())

# 规范字段名称(本数据集已经较为规范)

# 查看基本信息
df.info()

# 查看重复值
print(df.duplicated().sum())

# 查看缺失值
print(df.isnull().sum())

# 查看数据集描述性信息
describe = df.describe()
print(describe)

# 统计表中每个专业种类(Major_category)的个数
Major_category_counts=df['Major_category'].value_counts()
print(Major_category_counts)
rects = plt.bar(range(1,17),Major_category_counts);
for rect in rects:  #rects 是三根柱子的集合
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width() / 2, height, str(height), size=12, ha='center', va='bottom')
interval = ['Engineering','Education','Humanities & Liberal Arts','Biology & Life Science','Business','Health','Computers & Mathematics','Agriculture & Natural Resources','Physical Sciences','Social Science','Psychology & Social Work','Arts','Industrial Arts & Consumer Services','Law & Public Policy','Communications & Journalism','Interdisciplinary']
plt.xticks(range(1,17),interval,rotation=90);
plt.title('Number of Branches by Major Category')
plt.ylabel('Counts')
plt.show()

# 对各大类专业收入作统计并作图
averageMoney = []
for i in range(len(interval)):
    sum = 0
    for j in range(173):
        if df['Major_category'][j] == interval[i]:
            sum = sum + df['Median'][j]
    averageMoney.append(sum/Major_category_counts[i])
plt.bar(range(1,17),averageMoney);
plt.xticks(range(1,17),interval,rotation=90);
plt.title('Average Annual salary by Major Category')
plt.ylabel('Moneys')
plt.show()

# 对各大类专业失业率作统计并作图
averageUnemployRate = []
for i in range(len(interval)):
    sum = 0
    for j in range(173):
        if df['Major_category'][j] == interval[i]:
            sum = sum + df['Unemployment_rate'][j]
    averageUnemployRate.append(sum/Major_category_counts[i])
plt.bar(range(1,17),averageUnemployRate);
plt.xticks(range(1,17),interval,rotation=90);
plt.title('Average Unemployment Rate by Major Category')
plt.ylabel('Rate')
plt.show()

# 对各大类专业就业率作统计并作图
averageEmployRate = []
for i in range(len(interval)):
    sum = 0
    for j in range(173):
        if df['Major_category'][j] == interval[i]:
            sum = sum + df['Employed'][j] / df['Total'][j]
    averageEmployRate.append(sum/Major_category_counts[i])
plt.bar(range(1,17),averageEmployRate);
plt.xticks(range(1,17),interval,rotation=90);
plt.title('Average Employment Rate by Major Category')
plt.ylabel('Rate')
plt.show()

# 对各大类专业全年全职在岗率作统计并作图(没有早退的)
averageFullTimeRate = []
for i in range(len(interval)):
    sum = 0
    for j in range(173):
        if df['Major_category'][j] == interval[i]:
            sum = sum + df['Employed_full_time_year_round'][j] / df['Employed'][j]
    averageFullTimeRate.append(sum/Major_category_counts[i])
plt.bar(range(1,17),averageFullTimeRate);
plt.xticks(range(1,17),interval,rotation=90);
plt.title('Average Full-Time Rate by Major Category')
plt.ylabel('Rate')
plt.show()

# 对各大类专业总人数作统计并作图
averageNum = []
for i in range(len(interval)):
    sum = 0
    for j in range(173):
        if df['Major_category'][j] == interval[i]:
            sum = sum + df['Total'][j]
    averageNum.append(sum/Major_category_counts[i])
plt.bar(range(1,17),averageNum);
plt.xticks(range(1,17),interval,rotation=90);
plt.title('Average Total Numbers by Major Category')
plt.ylabel('Counts')
plt.show()

# 对各大类专业就业失业比作统计并作图
EUratio = []
for i in range(len(interval)):
    EUratio.append(averageEmployRate[i]/averageUnemployRate[i])
plt.bar(range(1,17),EUratio);
plt.xticks(range(1,17),interval,rotation=90);
plt.title('Employment-Unemployment Ratio by Major Category')
plt.ylabel('Ratio')
plt.show()


你可能感兴趣的:(数据集汇总,直方图,数据集,数据分析,可视化)