A histogram shows how numeric data are distributed: the data range is split into intervals (bins) according to some rule, the number of samples falling into each bin is counted, and a bar chart of bins versus counts is drawn. That chart is the histogram.
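As a minimal illustration of this bin-and-count idea (the sample values below are made up), NumPy's np.histogram does exactly this:

import numpy as np
data = np.array([4.3, 4.9, 5.1, 5.8, 6.1, 6.4, 7.0, 7.9])
counts, edges = np.histogram(data, bins=3)  # split the data range into 3 equal-width bins
print(edges)   # bin edges, roughly [4.3, 5.5, 6.7, 7.9]
print(counts)  # number of samples falling into each bin: [3 3 2]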
1. Preparing the plotting dataset
2. matplotlib.pyplot.hist parameters in detail
2.1 bins='auto'  # one of 'auto', 'fd', 'doane', 'scott', 'stone', 'rice', 'sturges', or 'sqrt'; each option explained below
'auto' (maximum of the 'sturges' and 'fd' estimators)
‘fd’ (Freedman Diaconis Estimator)
‘scott’
‘rice’
‘sturges’
‘doane’
‘sqrt’
3. References
4. My WeChat official account
We use the iris dataset built into sklearn. For a detailed introduction to the dataset, see: Matplotlib-02-iris鸢尾花数据集|scatter散点图
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
from sklearn import datasets
iris = datasets.load_iris()
x, y = iris.data, iris.target
# combine the four features and the class label into one DataFrame (150 samples)
pd_iris = pd.DataFrame(np.hstack((x, y.reshape(150, 1))),
                       columns=['sepal length(cm)', 'sepal width(cm)',
                                'petal length(cm)', 'petal width(cm)', 'class'])
We plot a histogram of pd_iris['sepal length(cm)']; first, a quick look at the data:
pd_iris['sepal length(cm)'].head()
0 5.1
1 4.9
2 4.7
3 4.6
4 5.0
Name: sepal length(cm), dtype: float64
pd_iris['sepal length(cm)'].describe()
count 150.000000
mean 5.843333
std 0.828066
min 4.300000
25% 5.100000
50% 5.800000
75% 6.400000
max 7.900000
Name: sepal length(cm), dtype: float64
Modify any parameter below to see what it does; most parameters can simply be left at their default values.
import palettable
import random

plt.figure(dpi=150)
data = pd_iris['sepal length(cm)']
n, bins, patches = plt.hist(x=data,
         ## bin settings: the three forms below are mutually exclusive (use at most one)
         #bins=20,  # number of bins, default: 10
         #bins=[4, 6, 8],  # two bins, with edges [4, 6) and [6, 8]
         #bins='auto',  # one of 'auto', 'fd', 'doane', 'scott', 'stone', 'rice', 'sturges', or 'sqrt';
                        # picks a bin width intended to best reflect the frequency distribution of the data
         #range=(5, 7),  # leftmost and rightmost bin edges; defaults to (x.min(), x.max()) when not given
         #density=True,  # default False: y axis shows counts; True: y axis shows density,
                         # where density in a bin = count in the bin / (total number of samples * bin width)
                         # (see the quick check after the printed output below)
         #weights=np.random.rand(len(data)),  # a weight for each sample; random weights here
         cumulative=False,  # default False; if True, each bar is the cumulative sum of itself and all bars before it
         bottom=0,  # y-axis baseline of the bars, default 0; bars span from bottom to bottom + hist(x, bins)
         histtype='bar',  # histogram type, default 'bar'; one of {'bar', 'barstacked', 'step', 'stepfilled'}
         align='mid',  # alignment of the bars relative to the bin edges, default 'mid'; {'left', 'mid', 'right'}
         orientation='vertical',  # default 'vertical'; use 'horizontal' to draw the bars sideways
         rwidth=1.0,  # relative width of the bars as a fraction of the bin width; default None (bars fill the whole bin)
         log=False,  # default False; if True, the y axis is log-scaled
         color=palettable.colorbrewer.qualitative.Dark2_7.mpl_colors[3],
         label='sepal length(cm)',  # legend label
         #normed=0,  # same as density and cannot be used together with it (deprecated and removed in newer Matplotlib)
         facecolor='black',  # bar fill color
         edgecolor="black",  # bar edge color
         stacked=False,  # whether multiple datasets are stacked
         alpha=0.5  # bar transparency
         )
plt.xticks(bins)  # use the bin edges as the x-axis tick positions

for patch in patches:  # give each bar a random color
    patch.set_facecolor(random.choice(palettable.colorbrewer.qualitative.Dark2_7.mpl_colors))

# the three values returned by plt.hist
print(n)  # counts per bin
print(bins)  # bin edges
print(patches)  # the bar (patch) objects

# overlay a dashed line tracing the distribution (one point per bin, at the left bin edges)
plt.plot(bins[:-1], n, '--', color='#2ca02c')

plt.hist(x=[i + 0.1 for i in data], label='new sepal length(cm)', alpha=0.3)  # a second, slightly shifted dataset
plt.legend()
[ 9. 23. 14. 27. 16. 26. 18. 6. 5. 6.]
[4.3 4.66 5.02 5.38 5.74 6.1 6.46 6.82 7.18 7.54 7.9 ]
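Since density=True rescales each bar by (total number of samples * bin width), the relationship can be checked directly from the n and bins returned above; a quick sketch:

density = n / (n.sum() * np.diff(bins))  # count per bin / (total samples * bin width)
print(density)
print((density * np.diff(bins)).sum())   # the bars of a density histogram integrate to 1.0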
Each of these options is a different algorithm for choosing the bins. The details below come from the NumPy documentation, and a quick comparison on our data follows them:
Notes

The methods to estimate the optimal number of bins are well founded in the literature, and are inspired by the choices R provides for histogram visualisation. Note that having the number of bins proportional to n^(1/3) is asymptotically optimal, which is why it appears in most estimators. These are simply plug-in methods that give good starting points for the number of bins. In the formulas below, h is the bin width, n_h is the number of bins, and n is the number of data points. All estimators that compute bin counts are recast to bin width using the ptp (peak-to-peak range) of the data. The final bin count is obtained from np.round(np.ceil(range / h)).

'auto' (maximum of the 'sturges' and 'fd' estimators)
A compromise to get a good value. For small datasets the Sturges value will usually be chosen, while larger datasets will usually default to FD. Avoids the overly conservative behaviour of FD and Sturges for small and large datasets respectively. The switchover point is usually around a.size ≈ 1000.

'fd' (Freedman-Diaconis Estimator)
h = 2 * IQR / n^(1/3)
The bin width is proportional to the interquartile range (IQR) and inversely proportional to the cube root of a.size. Can be too conservative for small datasets, but is quite good for large datasets. The IQR is very robust to outliers.

'scott'
h = sigma * (24 * sqrt(pi) / n)^(1/3)
The bin width is proportional to the standard deviation of the data and inversely proportional to the cube root of x.size. Can be too conservative for small datasets, but is quite good for large datasets. The standard deviation is not very robust to outliers. Values are very similar to the Freedman-Diaconis estimator in the absence of outliers.

'rice'
n_h = 2 * n^(1/3)
The number of bins is only proportional to the cube root of a.size. It tends to overestimate the number of bins and it does not take data variability into account.

'sturges'
n_h = log2(n) + 1
The number of bins is the base-2 log of a.size. This estimator assumes normality of the data and is too conservative for larger, non-normal datasets. This is the default method in R's hist function.

'doane'
n_h = 1 + log2(n) + log2(1 + |g1| / sigma_g1), where g1 is the skewness of the data and sigma_g1 = sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
An improved version of Sturges' formula that produces better estimates for non-normal datasets. This estimator attempts to account for the skew of the data.

'sqrt'
n_h = sqrt(n)
The simplest and fastest estimator. Only takes the data size into account.
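To compare what these estimators choose for the sepal-length data used above, np.histogram_bin_edges accepts the same strings as the bins parameter; a small sketch (reusing pd_iris from earlier):

# number of bins each estimator picks for the sepal-length column
for rule in ['auto', 'fd', 'doane', 'scott', 'stone', 'rice', 'sturges', 'sqrt']:
    edges = np.histogram_bin_edges(pd_iris['sepal length(cm)'], bins=rule)
    print(f"{rule:>8}: {len(edges) - 1} bins")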
- https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html?highlight=hist#matplotlib.pyplot.hist
- https://numpy.org/doc/stable/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges
Feel free to follow @pythonic生物人.