Dataset:
baby_names.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 5 columns):
Name 1016395 non-null object
Year 1016395 non-null int64
Gender 1016395 non-null object
State 1016395 non-null object
Count 1016395 non-null int64
dtypes: int64(2), object(3)
memory usage: 38.8+ MB
baby_names.head()
Out[30]:
Name Year Gender State Count
0 Emma 2004 F AK 62
1 Madison 2004 F AK 48
2 Hannah 2004 F AK 46
3 Grace 2004 F AK 44
4 Emily 2004 F AK 41
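value_counts(): Series.value_counts() returns how often each distinct value occurs. (Note: in the pandas version used for these notes, neither the DataFrame (baby_names) nor a grouped DataFrame (DataFrame.groupby()) has a value_counts() attribute; more recent pandas releases provide one for both.)
1) How many distinct names does the dataset contain (that is, the number of unique values in 'Name')?
Methods:
1) baby_names['Name'].value_counts().shape
2) baby_names.drop_duplicates('Name').count()
(Note: in method 2, drop_duplicates('Name') keeps only the first row for each distinct Name value. drop_duplicates(['Name', 'Gender']) instead treats the pair (Name, Gender) as a single key, so (Lucy, F) and (Lucy, M) are different keys and count as 2 rows.)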
baby_names['Name'].value_counts().shape
Out[92]: (17632,)
baby_names.drop_duplicates('Name').count()
Out[91]:
Name 17632
Year 17632
Gender 17632
State 17632
Count 17632
dtype: int64
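To make the difference between deduplicating on one column and on a pair of columns concrete, here is a minimal sketch. The `toy` frame and its values are made up for illustration and are not part of baby_names; `nunique()` is shown as one more way to get the distinct count as a plain integer.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "Name":   ["Lucy", "Lucy", "Lucy", "Emma"],
    "Gender": ["F",    "M",    "F",    "F"],
    "Count":  [10,     5,      7,      20],
})

# Deduplicate on Name only: one row per distinct name.
by_name = toy.drop_duplicates("Name")

# Deduplicate on the (Name, Gender) pair: (Lucy, F) and (Lucy, M) both survive.
by_name_gender = toy.drop_duplicates(["Name", "Gender"])

# nunique() returns the number of distinct names directly as an integer.
n_names = toy["Name"].nunique()

print(len(by_name), len(by_name_gender), n_names)
```

Because `drop_duplicates(...).count()` reports one number per column, `len(...)` or `nunique()` is usually easier to read when only a single count is needed.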
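Calling value_counts() directly on the DataFrame or on a groupby object raised the following errors:
AttributeError: 'DataFrame' object has no attribute 'value_counts'
AttributeError: 'DataFrameGroupBy' object has no attribute 'value_counts'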
Note: applying value_counts() to the Name Series shows how many rows each name appears in, which is one or more.
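A small sketch of what value_counts() returns, using a made-up Series (`toy_names` is hypothetical). Note that the frequency is the number of rows containing each name, not the sum of the Count column.

```python
import pandas as pd

# Hypothetical sample of the Name column.
toy_names = pd.Series(
    ["Emma", "Lucy", "Emma", "Grace", "Emma", "Lucy"], name="Name"
)

# Index: the distinct names; values: how many rows each name appears in,
# sorted from most to least frequent.
freq = toy_names.value_counts()

print(freq)
print(freq.shape)  # one entry per distinct name
```

The first element of `freq.shape` is therefore the number of distinct names, which is why method 1 above works.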
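Supplement: descriptive and summary statistics (names here is the per-name aggregate, names = baby_names.groupby('Name').sum(), shown in the groupby part of these notes):
a) idxmax(): returns the index label of the maximum value
Question: which name has the largest Count in names?
Method: names['Count'].idxmax()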
names['Count'].idxmax()
Out[103]: 'Jacob'
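The following sketch shows why idxmax() returns a name here: after grouping by Name, the names become the index, and idxmax() returns the index label of the largest value. The `toy` data is made up for illustration.

```python
import pandas as pd

# Hypothetical rows in the same shape as baby_names (fewer columns).
toy = pd.DataFrame({
    "Name":  ["Emma", "Jacob", "Emma", "Jacob", "Grace"],
    "Count": [62,     80,      48,     75,      44],
})

# Total Count per name; the result is a Series indexed by Name.
totals = toy.groupby("Name")["Count"].sum()

top_name = totals.idxmax()   # index label of the largest total
top_count = totals.max()     # the largest total itself

print(top_name, top_count)
```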
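b) Counting rows: .count() / .shape / len()
Question: how many names in names share the smallest Count value?
Methods (the second and third are recommended, since each returns a single number):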
names[names['Count']==names['Count'].min()].count()
Out[105]:
Year 2578
Count 2578
dtype: int64
names[names['Count']==names['Count'].min()].shape[0]
Out[108]: 2578
len(names[names['Count']==names['Count'].min()])
Out[107]: 2578
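A minimal sketch of how the three counting approaches differ, on made-up data (`toy` and its index labels are hypothetical). One further difference worth knowing: count() reports non-null values per column, so it can disagree with the row count when a column contains missing values.

```python
import pandas as pd

# Hypothetical per-name table with one missing Year value.
toy = pd.DataFrame(
    {"Year": [2004.0, float("nan"), 2006.0, 2007.0], "Count": [5, 5, 9, 5]},
    index=pd.Index(["Aadarsh", "Abby", "Cara", "Dion"], name="Name"),
)

# Rows whose Count equals the minimum Count.
rarest = toy[toy["Count"] == toy["Count"].min()]

per_column = rarest.count()      # Series: non-null values in each column
n_rows_shape = rarest.shape[0]   # number of rows, as an int
n_rows_len = len(rarest)         # same number of rows

print(per_column)
print(n_rows_shape, n_rows_len)
```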
c) .median() / .var() / .std() / .describe() and so on
names['Count'].median() # median of the values
Out[119]: 49.0
names['Count'].var() # sample variance
Out[120]: 121133565.13204491
names['Count'].std() # sample standard deviation
Out[117]: 11006.069467891111
names.describe()
Out[118]:
Year Count
count 1.763200e+04 17632.000000
mean 1.158117e+05 2008.932169
std 2.451618e+05 11006.069468
min 2.004000e+03 5.000000
25% 4.017000e+03 11.000000
50% 1.606100e+04 49.000000
75% 7.846425e+04 337.000000
max 2.233993e+06 242874.000000
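"Sample" variance and standard deviation mean that pandas divides by n - 1 by default (ddof=1). The sketch below checks this on a made-up Series (`counts` is hypothetical) and shows that std() is the square root of var().

```python
import math
import pandas as pd

# Hypothetical values, chosen only for illustration.
counts = pd.Series([5, 11, 49, 337])

sample_var = counts.var()        # default ddof=1: divide by n - 1
pop_var = counts.var(ddof=0)     # population variance: divide by n
sample_std = counts.std()        # default ddof=1 as well

# The standard deviation is the square root of the variance.
print(math.isclose(sample_std, math.sqrt(sample_var)))
# With n = 4 values, the sample variance is 4/3 of the population variance.
print(math.isclose(sample_var, pop_var * 4 / 3))
```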
General pattern: baby_names.groupby('Name').agg_func(), where agg_func is a placeholder for an aggregation method such as sum():
names = baby_names.groupby('Name').sum()
names.head()
Out[56]:
Year Count
Name
Aaban 4027 12
Aadan 8039 23
Aadarsh 2009 5
Aaden 393963 3426
Aadhav 2014 6
len(names) # How many different names exist in the dataset?
Out[74]: 17632
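In the output above, the Year column holds sums of year values, which carry no meaning, and depending on the pandas version the text columns (Gender, State) may be handled differently by sum(). A hedged alternative is to select the column before aggregating, or to use named aggregation. The `toy` frame and the output column names (`total`, `first_year`, `last_year`) below are made up for illustration.

```python
import pandas as pd

# Hypothetical rows in the same layout as baby_names.
toy = pd.DataFrame({
    "Name":   ["Emma", "Emma", "Grace", "Emma", "Grace"],
    "Year":   [2004,   2005,   2004,    2004,   2005],
    "Gender": ["F"] * 5,
    "State":  ["AK",   "AK",   "AK",    "AL",   "AL"],
    "Count":  [62,     60,     44,      100,    40],
})

# Sum only the Count column; the result is a Series indexed by Name.
name_totals = toy.groupby("Name")["Count"].sum()

# Named aggregation: choose a sensible function for each column.
summary = toy.groupby("Name").agg(
    total=("Count", "sum"),
    first_year=("Year", "min"),
    last_year=("Year", "max"),
)

print(name_totals)
print(summary)
```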
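Note how the following two calls differ in both input and output: sorting the Count column returns a Series, while sorting the whole DataFrame by Count returns a DataFrame.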
names['Count'].sort_values(ascending=False).head()
Out[67]:
Name
Jacob 242874
Emma 214852
Michael 214405
Ethan 209277
Isabella 204798
Name: Count, dtype: int64
names.sort_values(by=['Count'],ascending=False).head()
Out[68]:
Year Count
Name
Jacob 1141099 242874
Emma 1137085 214852
Michael 1161152 214405
Ethan 1139091 209277
Isabella 1137090 204798
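The two sorting forms can be compared on a small made-up table (`names_toy` is hypothetical). `Series.nlargest()` is included as a shorter way to get the top entries without sorting everything first.

```python
import pandas as pd

# Hypothetical per-name totals, indexed by Name like `names`.
names_toy = pd.DataFrame(
    {"Year": [4010, 6015, 2004], "Count": [30, 120, 75]},
    index=pd.Index(["Ana", "Ben", "Cal"], name="Name"),
)

# Sorting one column returns a Series (only Count is kept).
as_series = names_toy["Count"].sort_values(ascending=False)

# Sorting the frame by a column returns a DataFrame (all columns are kept).
as_frame = names_toy.sort_values(by="Count", ascending=False)

# Top 2 names by Count, equivalent to sorting descending and taking head(2).
top2 = names_toy["Count"].nlargest(2)

print(type(as_series).__name__, type(as_frame).__name__)
print(top2)
```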