The dataset is stored in the file recent-grads.csv. It contains information on the earnings of college majors in the US from 2010 to 2012.
It can be downloaded from here: https://github.com/fivethirtyeight/data/tree/master/college-majors
In this project, I will explore the dataset, try to find some patterns in the earnings of majors, and plot them using the matplotlib library.
The code was written in Jupyter:
Read the data:
import pandas as pd
recent_grads=pd.read_csv('./data/recent-grads.csv')
recent_grads.columns
print(recent_grads.info())
print(recent_grads.describe())
print(recent_grads.head(1))
Handle missing values:
raw_data_count=recent_grads.shape[0]
print(raw_data_count)
cleaned_data_count=recent_grads.dropna().shape[0]
print(cleaned_data_count)
==>>173
172
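The two counts differ by one, so one row contains a missing value. Below is a minimal sketch of how such a row could be located and dropped. The `toy` DataFrame and its values are invented for illustration and are not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads; all values are made up.
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C'],
    'Men': [100.0, np.nan, 300.0],
    'Median': [50000, 40000, 30000],
})

# Boolean mask: True for every row with at least one missing value.
has_missing = toy.isnull().any(axis=1)
rows_with_missing = toy[has_missing]
print(rows_with_missing)

# dropna() returns a new DataFrame, so assign the result to keep the cleaned rows.
toy_clean = toy.dropna()
print(toy.shape[0], toy_clean.shape[0])
```

Note that `dropna()` does not modify the DataFrame in place, so the counts above only report how many rows would remain.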
Draw scatter plots to look at the relationships between attributes:
import matplotlib.pyplot as plt
%matplotlib inline
recent_grads.plot(x='Full_time',y='Median',kind='scatter')
recent_grads.plot(x='Unemployed',y='Median',kind='scatter')
recent_grads.plot(x='Men',y='Median',kind='scatter')
recent_grads.plot(x='Women',y='Median',kind='scatter')
This gives four scatter plots, one for each attribute plotted against Median.
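Scatter plots only show a relationship visually. As a rough sketch, a correlation coefficient could put a number on it. The `toy` DataFrame below is invented, and I am assuming Pearson correlation is a reasonable first check.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads; all values are made up.
toy = pd.DataFrame({
    'Full_time': [1000, 2000, 3000, 4000],
    'Unemployed': [50, 40, 70, 20],
    'Median': [60000, 50000, 40000, 30000],
})

# Pearson correlation of every other column with Median.
correlations = toy.corr()['Median'].drop('Median')
print(correlations)
```

A value near 1 or -1 suggests a strong linear relationship, and a value near 0 suggests little linear relationship.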
Next, we draw histograms to look at the distribution of each attribute:
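columns=['Median','Employed','Unemployment_rate','Women','Men']
# Histogram of a single column
recent_grads['Men'].hist()
# One histogram per column, stacked vertically in a single figure
fig=plt.figure(figsize=(6,18))
for i,col in enumerate(columns):
    ax=fig.add_subplot(len(columns),1,i+1)
    recent_grads[col].hist(ax=ax,color='orange')
    ax.set_title(col)
plt.show()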
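To read a histogram numerically, the same kind of binning can be done with `numpy.histogram`. This is only a sketch on invented values. `hist()` uses 10 equal-width bins by default, while I use 4 here to keep the output short.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for one column of recent_grads.
toy = pd.Series([1, 2, 2, 3, 9], name='Median')

# counts[i] is the number of values between edges[i] and edges[i+1].
counts, edges = np.histogram(toy, bins=4)
print(counts)
print(edges)
```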
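To look more conveniently at the relationship between the number of employed graduates and salary, use the scatter_matrix function to build a scatter matrix: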
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Employed','Median']],figsize=(10,10),c='red')
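About this matrix: the diagonal panels show the histogram of each column, and the off-diagonal panels show the scatter plot of one column against the other.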
Next, let's do something interesting and look at the share of women in the 10 highest-paying and the 10 lowest-paying majors:
recent_grads[:10].plot.bar(x='Major',y='ShareWomen')
plt.legend(loc='upper left')
plt.title('The 10 highest paying majors.')
recent_grads[-10:].plot(x='Major',y='ShareWomen',kind='bar')
plt.title('The 10 lowest paying majors.')
Compare the numbers of men and women in the higher-paying majors:
recent_grads[:10].plot.bar(x='Major',y=['Men','Women'])
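The row slices above only pick the highest-paying and lowest-paying majors if the rows are already ordered by median salary. A sketch that does not depend on row order could use `nlargest` and `nsmallest`. The `toy` DataFrame is invented, and computing `ShareWomen` as Women / (Men + Women) is my assumption about how that column is defined.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads; all values are made up.
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C', 'D'],
    'Median': [40000, 70000, 30000, 60000],
    'Men': [300, 900, 100, 600],
    'Women': [700, 100, 900, 400],
})

# Assumed definition of the share of women in each major.
toy['ShareWomen'] = toy['Women'] / (toy['Men'] + toy['Women'])

# Select by salary instead of by row position.
top = toy.nlargest(2, 'Median')
bottom = toy.nsmallest(2, 'Median')
print(top[['Major', 'ShareWomen']])
print(bottom[['Major', 'ShareWomen']])
```

With the real data, `2` would become `10`, and the results could be passed to `.plot.bar(x='Major',y='ShareWomen')` as before.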