A Method for Tallying and Summarizing Data by Class
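1. Sample of the data format: car.data
The file contains 1728 records; only a few are listed here. The first 6 columns are attributes and the 7th column is the class.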
vhigh,vhigh,2,2,small,low,unacc
med,vhigh,3,4,small,high,acc
med,vhigh,3,4,med,low,unacc
med,vhigh,3,4,med,med,unacc
med,low,2,4,small,low,unacc
med,low,2,4,small,med,acc
med,low,2,4,small,high,good
med,low,2,4,med,high,good
med,low,2,4,big,high,vgood
med,low,2,more,small,low,unacc
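The column layout described above (six attribute columns, then the class) can be split with Python's standard csv module. This is only a minimal sketch, not the author's code: the helper name `split_record` and the inlined sample rows are assumptions made for illustration.

```python
import csv
import io

# A few rows copied from the sample above, inlined so the sketch is self-contained.
SAMPLE = """vhigh,vhigh,2,2,small,low,unacc
med,vhigh,3,4,small,high,acc
med,low,2,4,big,high,vgood
"""

def split_record(row):
    """Split one parsed row into (attributes, label): the first 6 columns, then the 7th."""
    if len(row) != 7:
        raise ValueError("expected 7 columns, got {}".format(len(row)))
    return row[:6], row[6]

# csv.reader yields one list of strings per line; empty lines are skipped.
records = [split_record(row) for row in csv.reader(io.StringIO(SAMPLE)) if row]

for attributes, label in records:
    print(attributes, "->", label)
```

Every value stays a string here, which suits this file because columns such as the 3rd and 4th mix digits with words like "more".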
2. Implementation: demo.py
# -*- coding: utf-8 -*-
# This script inspects how the records are distributed across the classes.
import data_utils  # the author's own helper module (not shown here); only the disabled download step uses it
import numpy as np

# Disabled alternative: download the file instead of reading a local copy.
'''
url="http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
raw_data=data_utils.download(url)
data_set=np.loadtxt(raw_data,delimiter=",",dtype=bytes).astype(str)
attribute=data_set[:,:6]
'''

# Load the local copy as a 2-D array of strings (one row per record, 7 columns).
data_set = np.loadtxt('car.data', delimiter=',', dtype=str)
labels = data_set[:, 6]  # the 7th column holds the class

# One counter per known class.
classes = {
    'unacc': 0,
    'acc': 0,
    'good': 0,
    'vgood': 0,
}
for label in labels:
    classes[label] += 1

# Named "total" rather than "sum" so the built-in sum() is not shadowed.
total = 0
for label in classes:
    total += classes[label]

print("\nClass distribution:\n")
print("Class".ljust(10) + "Count".ljust(10) + "Percent %".ljust(10))
print('-' * 30)
for name in ('unacc', 'acc', 'good', 'vgood'):
    print(name.ljust(10) + "{:<10d}".format(classes[name]) + "{:<10.3f}".format(classes[name] * 100.0 / total))
# Note: the total row prints the fraction 1, not 100, in the percent column.
print("Total".ljust(10) + "{:<10d}".format(total) + "{:<10.3f}".format(1))
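The counting step can also be written without hard-coding the class names. The following is a hedged sketch using `collections.Counter` from the standard library; the function name `class_distribution` is hypothetical, and the labels are the class column of the ten sample rows listed earlier in this post, not the full file.

```python
from collections import Counter

def class_distribution(labels):
    """Return (label, count, percent) tuples, most frequent class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return []  # no records, nothing to report
    return [(label, n, n * 100.0 / total) for label, n in counts.most_common()]

# Class column of the ten sample rows shown in section 1.
sample_labels = ["unacc", "acc", "unacc", "unacc", "unacc",
                 "acc", "good", "good", "vgood", "unacc"]

for label, n, pct in class_distribution(sample_labels):
    print("{:<10}{:<10d}{:<10.3f}".format(label, n, pct))
```

Because the classes are discovered from the data, a label that is missing from a fixed dictionary would be counted here rather than raising a KeyError.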
3. Results
File car.data is ready!

Class distribution:

Class     Count     Percent %
------------------------------
unacc     1210      70.023
acc       384       22.222
good      69        3.993
vgood     65        3.762
Total     1728      1.000
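4. I have used groupby in pandas for this before, but that only gave me class and count. This version gives class, count, and percentage, which should come in handy later.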
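For comparison, the same class, count, and percentage summary can be sketched with pandas groupby. This is an illustration under stated assumptions, not the code the author used earlier: car.data has no header row, so the column names `attr1` to `attr6` and `class` are hypothetical, and the ten sample rows from section 1 are inlined in place of the full file.

```python
import io
import pandas as pd

# The ten sample rows from section 1, inlined so the sketch is self-contained.
SAMPLE = """vhigh,vhigh,2,2,small,low,unacc
med,vhigh,3,4,small,high,acc
med,vhigh,3,4,med,low,unacc
med,vhigh,3,4,med,med,unacc
med,low,2,4,small,low,unacc
med,low,2,4,small,med,acc
med,low,2,4,small,high,good
med,low,2,4,med,high,good
med,low,2,4,big,high,vgood
med,low,2,more,small,low,unacc
"""

# Hypothetical column names; the file itself carries no header.
columns = ["attr1", "attr2", "attr3", "attr4", "attr5", "attr6", "class"]
df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=columns, dtype=str)

# groupby(...).size() gives the count per class; dividing by the total gives the share.
counts = df.groupby("class").size()
summary = pd.DataFrame({"count": counts, "percent": counts / counts.sum() * 100})
summary = summary.sort_values("count", ascending=False)

print(summary)
```

To run this on the real file, replacing `io.StringIO(SAMPLE)` with the path to car.data should be enough.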