Study notes: a method for summarizing data by class (count and percentage)

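1. Sample of the data format (car.data)

The file contains 1,728 records; only a few are listed here. The first six columns are attributes and the seventh column is the class label.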

vhigh,vhigh,2,2,small,low,unacc
med,vhigh,3,4,small,high,acc
med,vhigh,3,4,med,low,unacc
med,vhigh,3,4,med,med,unacc
med,low,2,4,small,low,unacc
med,low,2,4,small,med,acc
med,low,2,4,small,high,good
med,low,2,4,med,high,good
med,low,2,4,big,high,vgood
med,low,2,more,small,low,unacc
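
As a minimal sketch of the layout described above, the following splits a few of the sample rows into six attributes and one class label using the standard csv module. The names `sample`, `parsed`, `attributes`, and `label` are illustrative and do not come from the original script.

```python
import csv
import io

# Three rows copied from the sample listed above.
sample = """vhigh,vhigh,2,2,small,low,unacc
med,vhigh,3,4,small,high,acc
med,low,2,4,big,high,vgood
"""

# Each record is a comma-separated line with seven fields.
rows = list(csv.reader(io.StringIO(sample)))

# The first six fields are attributes, the seventh is the class label.
parsed = [(row[:6], row[6]) for row in rows]

for attributes, label in parsed:
    print(attributes, "->", label)
```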

2. Implementation (demo.py)

# -*- coding: utf-8 -*-
# Inspect how the class labels in car.data are distributed.

import numpy as np

# Alternative: download the file first with data_utils, a local helper
# module that is not shown in this post.
# import data_utils
# url = "http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
# raw_data = data_utils.download(url)
# data_set = np.loadtxt(raw_data, delimiter=",", dtype=bytes).astype(str)

# Load the local copy as a 2-D array of strings, one row per record.
# (A file object cannot be sliced, so the file is parsed with np.loadtxt.)
data_set = np.loadtxt('car.data', delimiter=",", dtype=str)
print("File car.data is ready!")
attribute = data_set[:, :6]   # first six columns: attributes
labels = data_set[:, 6]       # seventh column: class label

classes = {
    'unacc': 0,
    'acc': 0,
    'good': 0,
    'vgood': 0,
}

# Count how many records fall into each class.
for label in labels:
    classes[label] += 1

# Named "total" rather than "sum" so the built-in sum() is not shadowed.
total = sum(classes.values())

print("\nClass distribution:\n")
print("Class".ljust(10) + "Count".ljust(10) + "Percent %".ljust(10))
print('-' * 30)
for name in ('unacc', 'acc', 'good', 'vgood'):
    print(name.ljust(10) + "{:<10d}".format(classes[name]) + "{:<10.3f}".format(classes[name] / total * 100))
# The total row is also a percentage, so it is 100 rather than 1.
print("Total".ljust(10) + "{:<10d}".format(total) + "{:<10.3f}".format(100.0))

3. Results

File car.data is ready!

Class distribution:

Class     Count     Percent %
------------------------------
unacc     1210      70.023
acc       384       22.222
good      69        3.993
vgood     65        3.762
Total     1728      100.000
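
The script above hard-codes the four class names. As a hedged sketch of a more general variant, the same class + count + percentage summary can be computed with `collections.Counter` from the standard library, so that any set of labels works. The function name `class_distribution` is hypothetical, and the label list is rebuilt from the counts reported above rather than read from car.data.

```python
from collections import Counter

def class_distribution(labels):
    """Return (class, count, percent) tuples, most frequent class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(name, n, n / total * 100) for name, n in counts.most_common()]

# Rebuild a label list from the counts reported in the results above.
labels = ["unacc"] * 1210 + ["acc"] * 384 + ["good"] * 69 + ["vgood"] * 65

rows = class_distribution(labels)
for name, n, pct in rows:
    print(name.ljust(10) + "{:<10d}".format(n) + "{:<10.3f}".format(pct))
```

Because the classes are discovered from the data, a label that is missing from a hand-written dictionary no longer raises a KeyError.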

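4. I have used pandas groupby before, but only to get the class and its count. This version reports class, count, and percentage, which should come in handy later.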
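
For comparison, here is a rough sketch of how the groupby approach could be extended to report the percentage as well. It assumes pandas is installed, uses only the class labels of the ten sample rows from section 1 (not the full file), and the column names `class`, `count`, and `percent` are illustrative.

```python
import pandas as pd

# Class labels of the ten sample rows listed in section 1.
sample_labels = ["unacc", "acc", "unacc", "unacc", "unacc",
                 "acc", "good", "good", "vgood", "unacc"]
df = pd.DataFrame({"class": sample_labels})

# groupby + size gives the count per class (class + count only).
counts = df.groupby("class").size()

# Dividing by the overall total adds the percentage column.
summary = pd.DataFrame({
    "count": counts,
    "percent": (counts / counts.sum() * 100).round(3),
})
summary = summary.sort_values("count", ascending=False)
print(summary)
```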
