quantile(x, ...)
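Given a series $x$, `quantile` returns the quantile that corresponds to a given cumulative probability $p$.
There are 9 methods for computing quantiles $^1$:
For method $i$ ($1 \le i \le 9$), the quantile for probability $p$ is computed as: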
$$Q(p) = (1 - \gamma)\, x_j + \gamma\, x_{j+1},$$
$$\frac{j-m}{n} \le p < \frac{(j+1)-m}{n}, \qquad j = \operatorname{floor}(np + m), \qquad g = (np + m) - j, \qquad \gamma = f(j, g)$$
$x_j$ is the $j$-th order statistic;
$n$ is the length of $x$ (the sample size);
$m$ is a constant that takes a different value for each method $i$;
$j$ is determined by the formula $j = \operatorname{floor}(np + m)$;
$g$ is the gap between $np + m$ and $j$: $g = (np + m) - j$;
$\gamma$ is determined jointly by $j$ and $g$ (see the table below):
Type | $m$ value | $\gamma$ value | Description |
---|---|---|---|
1 | $0$ | $\gamma = \begin{cases} 0, & \text{if } g = 0 \\ 1, & \text{otherwise} \end{cases}$ | Inverse of the empirical distribution function. |
2 | $0$ | $\gamma = \begin{cases} 0.5, & \text{if } g = 0 \\ 1, & \text{otherwise} \end{cases}$ | Similar to type 1 but with averaging at discontinuities. |
3 | $-1/2$ | $\gamma = \begin{cases} 0, & \text{if } g = 0 \text{ and } j \text{ is even} \\ 1, & \text{otherwise} \end{cases}$ | SAS definition: nearest even order statistic. |
4 | $0$ | $\gamma = g$, with $p[k] = \frac{k}{n}$ | Linear interpolation of the empirical cdf. |
5 | $1/2$ | $\gamma = g$, with $p[k] = \frac{k - 0.5}{n}$ | A piecewise linear function whose knots are the values midway through the steps of the empirical cdf. This is popular amongst hydrologists. |
6 | $p$ | $\gamma = g$, with $p[k] = \frac{k}{n + 1}$ | Thus $p[k] = E[F(x[k])]$. This is used by Minitab and by SPSS. |
7 | $1 - p$ | $\gamma = g$, with $p[k] = \frac{k - 1}{n - 1}$ | In this case, $p[k] = \operatorname{mode}[F(x[k])]$. This is used by S. |
8 | $\frac{p + 1}{3}$ | $\gamma = g$, with $p[k] = \frac{k - 1/3}{n + 1/3}$ | Then $p[k] \approx \operatorname{median}[F(x[k])]$. The resulting quantile estimates are approximately median-unbiased regardless of the distribution of $x$. |
9 | $p/4 + 3/8$ | $\gamma = g$, with $p[k] = \frac{k - 3/8}{n + 1/4}$ | The resulting quantile estimates are approximately unbiased for the expected order statistics if $x$ is normally distributed. |
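To make the general formula concrete, here is a minimal sketch that evaluates it by hand for type 7 and compares the result with `quantile(x, p, type = 7)`. The helper `manual_q7` is a hypothetical name introduced only for this illustration, and the sketch assumes, as the R help for `quantile` states, that $\gamma = g$ for the continuous types 4 to 9. The index clamping for $p = 0$ and $p = 1$ is also an addition of this sketch.

```r
# Hypothetical helper (not part of R): type-7 quantile from the general formula.
manual_q7 <- function(x, p) {
  xs <- sort(x)                # order statistics x_1 <= ... <= x_n
  n <- length(xs)
  m <- 1 - p                   # type-7 constant m from the table
  j <- floor(n * p + m)
  g <- n * p + m - j
  gamma <- g                   # assumption: gamma = g for types 4 to 9
  # Clamp the indices so that p = 0 and p = 1 stay inside the sample.
  lo <- xs[pmin(pmax(j, 1), n)]
  hi <- xs[pmin(pmax(j + 1, 1), n)]
  (1 - gamma) * lo + gamma * hi
}

x <- c(2, 4, 4, 5, 7, 9, 10, 12)
p <- c(0.1, 0.25, 0.5, 0.9)

manual  <- manual_q7(x, p)
builtin <- quantile(x, p, type = 7, names = FALSE)
all.equal(manual, builtin)

# The same probability evaluated under all nine definitions.
by_type <- sapply(1:9, function(t) quantile(x, 0.3, type = t, names = FALSE))
by_type
```

The probabilities are chosen so that $np + m$ is never an integer; at exact integers floating-point rounding can push `floor` to the wrong side, which is why a production implementation needs extra care there.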
ecdf(x, ...)
$$F_n(t) = \frac{\#\{x_i \le t\}}{n} = \frac{\sum_{i=1}^{n} \mathrm{Indicator}(x_i \le t)}{n}.$$
Here Indicator is the indicator function:
Indicator(TRUE) = 1; Indicator(FALSE) = 0.
We can see that, moving through the order statistics $x_i$, each additional element raises the cumulative probability by $1/n$.
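As a quick check of the definition, the following sketch compares `ecdf` with a direct evaluation of the indicator sum. The helper name `manual_ecdf` and the sample values are made up for this illustration.

```r
x <- c(3, 1, 4, 1, 5)
Fn <- ecdf(x)                 # ecdf() returns a step function

# Hypothetical helper: proportion of observations that are <= t.
manual_ecdf <- function(x, t) sapply(t, function(ti) mean(x <= ti))

t <- c(0, 1, 3, 4.5, 5)
builtin <- Fn(t)
manual  <- manual_ecdf(x, t)
all.equal(builtin, manual)
```

Because the value 1 appears twice in this sample, the step at $t = 1$ is $2/n$ rather than $1/n$; tied values stack their jumps.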
`ppoints` generates $n$ probability points spread "evenly" over [0, 1]:
> ppoints
function (n, a = if (n <= 10) 3/8 else 1/2)
{
    if (length(n) > 1L)
        n <- length(n)
    if (n > 0)
        (1L:n - a)/(n + 1 - 2 * a)
    else numeric()
}
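The default value of `a` links `ppoints` back to the quantile table: with `a = 3/8` the points are $(k - 3/8)/(n + 1/4)$, the same $p[k]$ as type 9, and with `a = 1/2` they are $(k - 0.5)/n$, the same $p[k]$ as type 5. A small sketch to confirm this (the variable names are illustrative only):

```r
n_small <- 5
n_large <- 20

p_small <- ppoints(n_small)   # default a = 3/8 because n <= 10
p_large <- ppoints(n_large)   # default a = 1/2 because n > 10

all.equal(p_small, (1:n_small - 3/8) / (n_small + 1/4))
all.equal(p_large, (1:n_large - 1/2) / n_large)

# The points lie strictly inside (0, 1) and are symmetric around 0.5.
range(p_small)
p_small + rev(p_small)
```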
From the online help for ppoints $^2$:
ppoints is used in qqnorm to generate the probabilities whose standard normal quantiles form the theoretical axis:
x <- qnorm(ppoints(n))[order(order(y))]
function (y, ylim, main = "Normal Q-Q Plot", xlab = "Theoretical Quantiles",
    ylab = "Sample Quantiles", plot.it = TRUE, datax = FALSE, ...)
{
    if (has.na <- any(ina <- is.na(y))) {
        yN <- y
        y <- y[!ina]
    }
    if (0 == (n <- length(y)))
        stop("y is empty or has only NAs")
    if (plot.it && missing(ylim))
        ylim <- range(y)
    x <- qnorm(ppoints(n))[order(order(y))]
    if (has.na) {
        y <- x
        x <- yN
        x[!ina] <- y
        y <- yN
    }
    if (plot.it)
        if (datax)
            plot(y, x, main = main, xlab = ylab, ylab = xlab,
                xlim = ylim, ...)
        else plot(x, y, main = main, xlab = xlab, ylab = ylab,
            ylim = ylim, ...)
    invisible(if (datax) list(x = y, y = x) else list(x = x, y = y))
}
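From the implementation of qqnorm above, y holds the sample values to be compared, x holds the theoretical standard normal quantiles generated through ppoints, and plot is called to draw the scatter plot.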
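The key expression is `order(order(y))`, which gives the rank of each element of `y`, so every observation is paired with the theoretical quantile of matching rank. A minimal sketch, using simulated data and `plot.it = FALSE` so that no graphics device is needed:

```r
set.seed(1)
y <- rnorm(8)                 # simulated sample, no ties and no NAs
n <- length(y)

r <- order(order(y))          # rank of each y[i] within the sample
x_manual <- qnorm(ppoints(n))[r]

qq <- qqnorm(y, plot.it = FALSE)
all.equal(qq$x, x_manual)     # theoretical quantiles match
all(r == rank(y))             # order(order(y)) equals rank(y) without ties
```

The returned `qq$y` is the original sample in its original order, which is why the theoretical quantiles have to be permuted by rank rather than the data being sorted.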
> qqplot
function (x, y, plot.it = TRUE, xlab = deparse(substitute(x)),
    ylab = deparse(substitute(y)), ...)
{
    sx <- sort(x)
    sy <- sort(y)
    lenx <- length(sx)
    leny <- length(sy)
    if (leny < lenx)
        sx <- approx(1L:lenx, sx, n = leny)$y
    if (leny > lenx)
        sy <- approx(1L:leny, sy, n = lenx)$y
    if (plot.it)
        plot(sx, sy, xlab = xlab, ylab = ylab, ...)
    invisible(list(x = sx, y = sy))
}
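From the implementation, qqplot can draw either a QQ plot or a PP plot, depending on whether the x and y passed in are quantiles or cumulative probabilities.
qqplot is not tied to the standard normal distribution: it can draw a QQ plot of the sample quantiles against the theoretical quantiles of any distribution. It can also be used to check whether two series y and x follow the same distribution, whatever that distribution is.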
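Both uses can be sketched as follows, with simulated data and `plot.it = FALSE` so the returned coordinates can be inspected without a graphics device. The exponential distribution and the sample sizes are arbitrary choices for this illustration.

```r
set.seed(42)

# 1) Sample against the theoretical quantiles of a non-normal distribution.
y <- rexp(50, rate = 2)
theo <- qexp(ppoints(length(y)), rate = 2)
qq1 <- qqplot(theo, y, plot.it = FALSE)

# 2) Two samples of different lengths: the longer one is interpolated
#    down to the length of the shorter one via approx().
a <- rnorm(30)
b <- rnorm(100)
qq2 <- qqplot(a, b, plot.it = FALSE)
c(length(qq2$x), length(qq2$y))
```

Dropping `plot.it = FALSE` draws the plots; points that fall close to a straight line suggest that the two sets of quantiles come from the same family of distributions.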
> qqline
function (y, datax = FALSE, distribution = qnorm, probs = c(0.25,
    0.75), qtype = 7, ...)
{
    stopifnot(length(probs) == 2, is.function(distribution))
    y <- quantile(y, probs, names = FALSE, type = qtype, na.rm = TRUE)
    x <- distribution(probs)
    if (datax) {
        slope <- diff(x)/diff(y)
        int <- x[1L] - slope * y[1L]
    }
    else {
        slope <- diff(y)/diff(x)
        int <- y[1L] - slope * x[1L]
    }
    abline(int, slope, ...)
}
From the implementation, qqline by default takes the two quartiles Q1 and Q3 of y, computes the corresponding quartiles x from the theoretical distribution, and draws a straight line through $(x_1, y_1)$ and $(x_2, y_2)$.
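The slope and intercept that qqline passes to abline can be reproduced by hand. The sketch below follows the default branch (`datax = FALSE`, `distribution = qnorm`, `qtype = 7`) on simulated data; the mean and standard deviation used for the simulation are arbitrary.

```r
set.seed(7)
y <- rnorm(100, mean = 10, sd = 2)

probs <- c(0.25, 0.75)
yq <- quantile(y, probs, names = FALSE, type = 7)  # sample Q1 and Q3
xq <- qnorm(probs)                                 # theoretical Q1 and Q3

slope <- diff(yq) / diff(xq)
int   <- yq[1] - slope * xq[1]
c(intercept = int, slope = slope)

# To draw it: qqnorm(y); abline(int, slope)   # same line as qqline(y)
```

With the default settings the slope equals `IQR(y) / (2 * qnorm(0.75))`, so for roughly normal data the intercept and slope act as robust estimates of the mean and standard deviation.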
Ref: