A Random Roundup of pandas Basic Attributes and Methods (3): Descriptive Statistics / Distinct Counts / Grouping and Sorting

  1. Grouping and sorting
  2. Series.value_counts() & drop_duplicates()

Dataset:

baby_names.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 5 columns):
Name      1016395 non-null object
Year      1016395 non-null int64
Gender    1016395 non-null object
State     1016395 non-null object
Count     1016395 non-null int64
dtypes: int64(2), object(3)
memory usage: 38.8+ MB

baby_names.head()
Out[30]: 
      Name  Year Gender State  Count
0     Emma  2004      F    AK     62
1  Madison  2004      F    AK     48
2   Hannah  2004      F    AK     46
3    Grace  2004      F    AK     44
4    Emily  2004      F    AK     41
  • Tips: value_counts() & drop_duplicates()

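value_counts(): Series.value_counts() returns how often each value occurs. (Note: in the pandas version used here, neither the DataFrame (baby_names) nor a DataFrame groupby object (DataFrame.groupby()) has a value_counts() method; pandas 1.1 and later do provide DataFrame.value_counts().)
1) How many distinct names are there in the dataset? (That is, the number of 'Name' values left after deduplication.)
Methods:
1) baby_names['Name'].value_counts().shape
2) baby_names.drop_duplicates('Name').count()
(Note: in method 2, drop_duplicates('Name') keeps only the first row for each distinct Name. drop_duplicates(['Name', 'Gender']) instead treats the tuple (Name, Gender) as a single key and keeps one row per distinct key, so (Lucy, F) and (Lucy, M) are different keys and count as 2 rows.)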

baby_names['Name'].value_counts().shape
Out[92]: (17632,)

baby_names.drop_duplicates('Name').count()
Out[91]: 
Name      17632
Year      17632
Gender    17632
State     17632
Count     17632
dtype: int64
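
Calling value_counts() directly on the DataFrame or on a groupby object fails in this pandas version:
AttributeError: 'DataFrame' object has no attribute 'value_counts'
AttributeError: 'DataFrameGroupBy' object has no attribute 'value_counts'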

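Note: Series.value_counts() gives the number of times each Name value appears in the dataset (once or more), so the length of its result is the number of distinct names.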
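
The two counting methods above can be compared side by side on a small table. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from baby_names, and `nunique()` is added as a third option that the original text does not mention.

```python
import pandas as pd

# Toy stand-in for baby_names (values are invented for illustration).
toy = pd.DataFrame({
    "Name":   ["Lucy", "Lucy", "Emma", "Emma", "Jacob"],
    "Gender": ["F", "M", "F", "F", "M"],
    "Count":  [10, 5, 62, 48, 30],
})

# Method 1: frequency of each name, then the length of that result.
freq = toy["Name"].value_counts()
n_by_value_counts = freq.shape[0]

# Method 2: keep the first row for each distinct Name, then count rows.
n_by_drop_duplicates = len(toy.drop_duplicates("Name"))

# Extra option: nunique() returns the distinct count directly.
n_by_nunique = toy["Name"].nunique()

# A list of columns makes (Name, Gender) the deduplication key.
n_pairs = len(toy.drop_duplicates(["Name", "Gender"]))

print(n_by_value_counts, n_by_drop_duplicates, n_by_nunique, n_pairs)  # 3 3 3 4
```

Here (Lucy, F) and (Lucy, M) are one name but two (Name, Gender) pairs, which is why the last number is larger.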

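Supplement: descriptive and summary statistics:
a) idxmax(): returns the index label of the row holding the maximum value
Question: which name has the largest Count in names? (names is the per-name total built with groupby in the Grouping section below.)
Method: names['Count'].idxmax()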

names['Count'].idxmax()
Out[103]: 'Jacob'
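
As a small illustration of idxmax() returning a label rather than a value, here is a sketch on a three-row Series. The Series `counts` is a hypothetical subset that reuses three Count totals shown elsewhere in this post.

```python
import pandas as pd

# Hypothetical three-name subset; the index plays the role of the Name index.
counts = pd.Series(
    [12, 242874, 214852],
    index=["Aaban", "Jacob", "Emma"],
    name="Count",
)

top_name = counts.idxmax()   # index label of the largest value
top_value = counts.max()     # the largest value itself

print(top_name, top_value)   # Jacob 242874
```

If several rows tie for the maximum, idxmax() returns the label of the first one.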

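b) Counting rows: .count() / .shape / len()
Question: how many names in names have the smallest Count value?
Methods: (methods 2 and 3 are recommended, since each returns a single number, whereas .count() returns one non-null count per column)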

names[names['Count']==names['Count'].min()].count()
Out[105]: 
Year     2578
Count    2578
dtype: int64

names[names['Count']==names['Count'].min()].shape[0]
Out[108]: 2578

len(names[names['Count']==names['Count'].min()])
Out[107]: 2578
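
One further reason to prefer .shape[0] or len() is that .count() skips missing values, so its per-column numbers can differ from the row count. The sketch below shows this on a made-up table; the `toy` frame and its NaN entry are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Invented data: row "B" has a missing Year.
toy = pd.DataFrame(
    {"Year": [2004.0, np.nan, 2010.0, 2012.0], "Count": [5, 5, 7, 5]},
    index=["A", "B", "C", "D"],
)

# Rows whose Count equals the column minimum.
smallest = toy[toy["Count"] == toy["Count"].min()]

per_column = smallest.count()     # non-null entries per column
n_rows_shape = smallest.shape[0]  # number of rows
n_rows_len = len(smallest)        # number of rows

print(per_column["Year"], per_column["Count"], n_rows_shape, n_rows_len)  # 2 3 3 3
```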

c) .median() / .var() / .std() / .describe() ...

names['Count'].median()         # median of the values
Out[119]: 49.0

names['Count'].var()            # sample variance
Out[120]: 121133565.13204491

names['Count'].std()            # sample standard deviation
Out[117]: 11006.069467891111

names.describe()
Out[118]: 
               Year          Count
count  1.763200e+04   17632.000000
mean   1.158117e+05    2008.932169
std    2.451618e+05   11006.069468
min    2.004000e+03       5.000000
25%    4.017000e+03      11.000000
50%    1.606100e+04      49.000000
75%    7.846425e+04     337.000000
max    2.233993e+06  242874.000000
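
The "sample" in the comments above refers to pandas dividing by n - 1 by default (ddof=1). A minimal sketch on four made-up values shows how the sample and population variance relate; the Series `s` is hypothetical.

```python
import pandas as pd

# Four invented values standing in for a Count column.
s = pd.Series([5, 11, 49, 337], name="Count")
n = len(s)

median = s.median()              # middle of the sorted values
sample_var = s.var()             # divides by n - 1 (ddof=1, the default)
population_var = s.var(ddof=0)   # divides by n
sample_std = s.std()             # square root of the sample variance

summary = s.describe()           # count, mean, std, min, quartiles, max

print(median, sample_var, population_var)
```

The "50%" row of describe() is the same number as median(), and the "std" row matches std() with the default ddof=1.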
  • Grouping:

General form: baby_names.groupby('Name').agg_func(), where agg_func is a placeholder for an aggregation method such as sum().

names = baby_names.groupby('Name').sum()
names.head()
Out[56]: 
           Year  Count
Name                  
Aaban      4027     12
Aadan      8039     23
Aadarsh    2009      5
Aaden    393963   3426
Aadhav     2014      6

len(names)          # How many different names exist in the dataset?
Out[74]: 17632
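
The grouping step can be sketched on a small table as follows. The `toy` frame and its values are invented for illustration. Selecting the Count column before calling sum() is a choice made for this sketch, not the original post's code: it avoids summing Year, which has no useful meaning, and does not depend on how a given pandas version treats the non-numeric columns.

```python
import pandas as pd

# Invented rows in the same layout as baby_names.
toy = pd.DataFrame({
    "Name":   ["Emma", "Emma", "Jacob", "Jacob", "Aaban"],
    "Year":   [2004, 2005, 2004, 2005, 2013],
    "Gender": ["F", "F", "M", "M", "M"],
    "State":  ["AK", "AK", "AK", "AK", "AK"],
    "Count":  [62, 70, 80, 75, 12],
})

# One total per name; the group keys become the (sorted) index.
totals = toy.groupby("Name")["Count"].sum()

n_names = len(totals)  # number of distinct names

print(totals)
print(n_names)  # 3
```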
  • Sorting:

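Note that the two calls differ in both input and output format: the first sorts a single column and returns a Series, while the second sorts the whole DataFrame by that column and returns a DataFrame: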

names['Count'].sort_values(ascending=False).head()
Out[67]: 
Name
Jacob       242874
Emma        214852
Michael     214405
Ethan       209277
Isabella    204798
Name: Count, dtype: int64

names.sort_values(by=['Count'],ascending=False).head()
Out[68]: 
             Year   Count
Name                     
Jacob     1141099  242874
Emma      1137085  214852
Michael   1161152  214405
Ethan     1139091  209277
Isabella  1137090  204798
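
The difference between the two sorting calls can be checked on a small frame. This is a sketch: `toy_names` is a hypothetical three-row subset that reuses figures from the output above, and nlargest() is added as a shortcut that the original text does not use.

```python
import pandas as pd

# Hypothetical subset of the grouped table, deliberately out of order.
toy_names = pd.DataFrame(
    {"Year": [1161152, 1141099, 1137085], "Count": [214405, 242874, 214852]},
    index=pd.Index(["Michael", "Jacob", "Emma"], name="Name"),
)

# Sorting one column: Series in, Series out.
as_series = toy_names["Count"].sort_values(ascending=False)

# Sorting the frame by a column: DataFrame in, DataFrame out.
as_frame = toy_names.sort_values(by="Count", ascending=False)

# Shortcut for the top-k values of a Series.
top2 = toy_names["Count"].nlargest(2)

print(type(as_series).__name__, type(as_frame).__name__)  # Series DataFrame
```

Both sorted results share the same index order; only the container type and the columns carried along differ.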
