Step 1: Import the required modules
import pandas as pd
from pandas import Series,DataFrame
import matplotlib.pyplot as plt
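# Use the SimHei font so that Chinese characters in plot labels render correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
# Keep the minus sign rendering correctly on axes when a CJK font is active
plt.rcParams['axes.unicode_minus'] = False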
Step 2: Load the data from US_Baby_names_right.csv in the dataset directory and inspect its basic information
data = pd.read_csv('dataset/US_Baby_names_right.csv')
data.info()
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1016395 non-null int64
1 Id 1016395 non-null int64
2 Name 1016395 non-null object
3 Year 1016395 non-null int64
4 Gender 1016395 non-null object
5 State 1016395 non-null object
6 Count 1016395 non-null int64
dtypes: int64(4), object(3)
memory usage: 54.3+ MB
Step 3: View the first ten rows
data.head(10)
   Unnamed: 0     Id      Name  Year Gender State  Count
0       11349  11350      Emma  2004      F    AK     62
1       11350  11351   Madison  2004      F    AK     48
2       11351  11352    Hannah  2004      F    AK     46
3       11352  11353     Grace  2004      F    AK     44
4       11353  11354     Emily  2004      F    AK     41
5       11354  11355   Abigail  2004      F    AK     37
6       11355  11356    Olivia  2004      F    AK     33
7       11356  11357  Isabella  2004      F    AK     30
8       11357  11358    Alyssa  2004      F    AK     29
9       11358  11359    Sophia  2004      F    AK     28
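Column notes:
- Name: the baby's name
- Year: the year the baby was born
- Gender: the baby's gender
- State: abbreviation of the state where the baby was born
- Count: the number of times the name was used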
Step 3 (continued): Delete the Unnamed: 0 and Id columns
del data['Unnamed: 0']
del data['Id']
data.info()
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 1016395 non-null object
1 Year 1016395 non-null int64
2 Gender 1016395 non-null object
3 State 1016395 non-null object
4 Count 1016395 non-null int64
dtypes: int64(2), object(3)
memory usage: 38.8+ MB
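The two columns can also be dropped in a single call, or left out when the file is read. The sketch below illustrates both options on a tiny in-memory CSV; the CSV text and its values are made up for illustration and only mimic the column layout shown above.

```python
import io
import pandas as pd

# Toy CSV text standing in for the real file (values are made up).
# The empty first header cell is what pandas names "Unnamed: 0".
csv_text = """,Id,Name,Year,Gender,State,Count
0,1,Emma,2004,F,AK,62
1,2,Madison,2004,F,AK,48
"""

# Option 1: load everything, then drop both columns in one call.
raw = pd.read_csv(io.StringIO(csv_text))
dropped = raw.drop(columns=['Unnamed: 0', 'Id'])

# Option 2: load only the columns that are actually needed.
wanted = ['Name', 'Year', 'Gender', 'State', 'Count']
selected = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# Both routes should give the same five-column frame.
print(dropped.equals(selected))
```

Unlike del, drop returns a new DataFrame and leaves the original untouched, which can be convenient when the raw data is needed again later.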
Step 4: Count how many records in the dataset are for girls' names (F) and how many are for boys' names (M).
data['Gender'].value_counts()
F 558846
M 457549
Name: Gender, dtype: int64
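Note that value_counts counts rows, so a name that appears in several states or years is counted once per row. If the number of distinct names per gender is wanted instead, one possible approach is to group by Gender and apply nunique to Name. The sketch below uses a small made-up DataFrame, not the real dataset.

```python
import pandas as pd

# Made-up rows: the same name can appear in several states.
toy = pd.DataFrame({
    'Name':   ['Emma', 'Emma', 'Grace', 'Jacob', 'Jacob', 'Jacob'],
    'Gender': ['F', 'F', 'F', 'M', 'M', 'M'],
    'State':  ['AK', 'AL', 'AK', 'AK', 'AL', 'AZ'],
})

# Number of rows per gender (what value_counts reports).
rows_per_gender = toy['Gender'].value_counts()

# Number of distinct names per gender.
names_per_gender = toy.groupby('Gender')['Name'].nunique()

print(rows_per_gender)
print(names_per_gender)
```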
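Step 5: Group the dataset by the Name field, sum each group, assign the result to the variable names, and print the first five rows
# Select several columns with a list; tuple-style selection ['Year','Count'] fails in pandas 2.0 and later
names = data.groupby('Name')[['Year', 'Count']].sum()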
names.head()
           Year  Count
Name
Aaban      4027     12
Aadan      8039     23
Aadarsh    2009      5
Aaden    393963   3426
Aadhav     2014      6
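The Year column in this result is the sum of the year values in each group (for example 4027 for Aaban), which has no natural interpretation. A possible alternative, sketched below on made-up data, is to use named aggregation so that each column gets a summary that fits it. The output column names total_count, first_year and last_year are hypothetical.

```python
import pandas as pd

# Made-up rows: one row per name and year.
toy = pd.DataFrame({
    'Name':  ['Ann', 'Ann', 'Bob', 'Bob', 'Bob'],
    'Year':  [2013, 2014, 2004, 2005, 2006],
    'Count': [6, 6, 62, 70, 65],
})

# Sum the counts, but summarise Year by its range instead of its sum.
summary = toy.groupby('Name').agg(
    total_count=('Count', 'sum'),
    first_year=('Year', 'min'),
    last_year=('Year', 'max'),
)

print(summary)
```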
Step 6: Sort the result from Step 5 in descending order by the number of times each name was used (Count) to find the five most popular names
names.sort_values(['Count'], ascending=False).head(5)
             Year   Count
Name
Jacob     1141099  242874
Emma      1137085  214852
Michael   1161152  214405
Ethan     1139091  209277
Isabella  1137090  204798
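When only the top few rows are needed, DataFrame.nlargest is a possible shortcut for sorting and then taking the head. The sketch below checks that both routes agree on a small made-up table with no tied counts.

```python
import pandas as pd

# Made-up totals indexed by name; no two counts are equal.
toy = pd.DataFrame(
    {'Count': [30, 120, 75, 5, 90]},
    index=pd.Index(['Ava', 'Ben', 'Cal', 'Dee', 'Eli'], name='Name'),
)

# Route 1: sort everything, then keep the first three rows.
by_sort = toy.sort_values('Count', ascending=False).head(3)

# Route 2: ask directly for the three largest counts.
by_nlargest = toy.nlargest(3, 'Count')

print(by_nlargest)
print(by_sort.equals(by_nlargest))
```

With tied counts the two routes may order the tied rows differently, so the comparison above is only guaranteed for data without ties.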
Step 7: How many distinct names appear in the dataset? (Duplicates are not counted.)
data['Name'].nunique()
17632
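By default nunique ignores missing values, whereas unique keeps them. The Name column here has no nulls according to data.info(), so the distinction does not change the result above, but it matters for columns that do contain gaps. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Made-up names with one duplicate and one missing value.
s = pd.Series(['Emma', 'Emma', 'Jacob', None])

distinct_non_null = s.nunique()              # missing value ignored
distinct_with_null = s.nunique(dropna=False) # missing value counted once
via_unique = len(s.unique())                 # unique() keeps the missing value

print(distinct_non_null, distinct_with_null, via_unique)
```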
Step 8: Delete the Year column from the names variable, then compute the basic summary statistics shown below
del names['Year']
names.describe()
               Count
count   17632.000000
mean     2008.932169
std     11006.069468
min         5.000000
25%        11.000000
50%        49.000000
75%       337.000000
max    242874.000000
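In the output above the mean (about 2009) is far larger than the median (49), which suggests that a small number of very popular names pull the average up. The sketch below reproduces this pattern on a made-up Series and shows how individual statistics can be read out of the describe result.

```python
import pandas as pd

# Made-up totals: four rare names and one very common one.
counts = pd.Series([5, 5, 10, 20, 960], name='Count')

stats = counts.describe()

# describe() returns a Series, so entries can be looked up by label.
mean_value = stats['mean']
median_value = stats['50%']

# Share of names whose total is below the mean.
share_below_mean = (counts < counts.mean()).mean()

print(mean_value, median_value, share_below_mean)
```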