Quantiles and QQ Plots

QQ Plot Basics

Sample Quantiles

quantile(x, ...)
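Given a series $x$, quantile() returns the quantile corresponding to a given cumulative probability $p$.
R offers 9 methods (types) for computing quantiles $^1$.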

For method $i$ ($1 \le i \le 9$), the quantile for probability $p$ is computed as:

$$Q(p) = (1 - \gamma)\ x_j + \gamma\ x_{j+1},$$

where

$$\frac{j-m}{n} \le p < \frac{(j+1)-m}{n} \\ j = \lfloor np + m \rfloor \\ g = (np + m) - j \\ \gamma = f(j, g)$$

$x_j$ is the $j$-th order statistic;
$n$ is the length of $x$ (the sample size);
$m$ is a constant that takes a different value for each method $i$;
$j$ is determined by $j = \lfloor np + m \rfloor$;
$g$ is the gap between $np + m$ and $j$, that is, the fractional part $g = (np + m) - j$;
$\gamma$ is determined jointly by $j$ and $g$ (see the table below):

Types 1 to 3 are discontinuous in $p$. Types 4 to 9 are continuous, with $\gamma = g$; they are equivalent to linear interpolation between the points $(p[k], x[k])$, where $x[k]$ is the $k$-th order statistic and $p[k]$ is given in the description column.

| type | $m$ | $\gamma$ | description |
|------|-----|----------|-------------|
| 1 | $0$ | $\gamma = \begin{cases} 0, & \text{if } g = 0 \\ 1, & \text{otherwise} \end{cases}$ | Inverse of the empirical distribution function. |
| 2 | $0$ | $\gamma = \begin{cases} 0.5, & \text{if } g = 0 \\ 1, & \text{otherwise} \end{cases}$ | Similar to type 1 but with averaging at discontinuities. |
| 3 | $-1/2$ | $\gamma = \begin{cases} 0, & \text{if } g = 0 \text{ and } j \text{ is even} \\ 1, & \text{otherwise} \end{cases}$ | SAS definition: nearest even order statistic. |
| 4 | $0$ | $g$ | $p[k] = \frac{k}{n}$. Linear interpolation of the empirical cdf. |
| 5 | $1/2$ | $g$ | $p[k] = \frac{k - 0.5}{n}$. A piecewise linear function whose knots are the values midway through the steps of the empirical cdf. Popular amongst hydrologists. |
| 6 | $p$ | $g$ | $p[k] = \frac{k}{n + 1}$, so $p[k] = E[F(x[k])]$. Used by Minitab and by SPSS. |
| 7 | $1 - p$ | $g$ | $p[k] = \frac{k - 1}{n - 1}$, so $p[k] = \mathrm{mode}[F(x[k])]$. Used by S; this is also the default type in R. |
| 8 | $\frac{p + 1}{3}$ | $g$ | $p[k] = \frac{k - 1/3}{n + 1/3}$, so $p[k] \approx \mathrm{median}[F(x[k])]$. The resulting quantile estimates are approximately median-unbiased regardless of the distribution of $x$. |
| 9 | $p/4 + 3/8$ | $g$ | $p[k] = \frac{k - 3/8}{n + 1/4}$. The resulting quantile estimates are approximately unbiased for the expected order statistics if $x$ is normally distributed. |
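To make the $j$, $g$, $\gamma$ recipe concrete, the sketch below re-implements it for the continuous types 4 to 9 and compares the result with quantile(). The helper name manual_quantile and the sample values are made up for illustration; clamping the index at the sample extremes is an assumption about how $x_0$ and $x_{n+1}$ should be handled.

```r
# Hypothetical helper (not part of R): evaluates
#   Q(p) = (1 - gamma) * x_j + gamma * x_{j+1}
# for the continuous types 4 to 9, where gamma = g.
manual_quantile <- function(x, p, type = 7) {
  stopifnot(type %in% 4:9, p >= 0, p <= 1)
  xs <- sort(x)                      # order statistics x_1 <= ... <= x_n
  n  <- length(xs)
  m  <- switch(as.character(type),   # constant m from the table
               "4" = 0,
               "5" = 1 / 2,
               "6" = p,
               "7" = 1 - p,
               "8" = (p + 1) / 3,
               "9" = p / 4 + 3 / 8)
  j <- floor(n * p + m)
  g <- n * p + m - j                 # fractional part, used as gamma
  # Assumption: indices outside 1..n fall back to the sample extremes.
  lo <- xs[min(max(j, 1), n)]
  hi <- xs[min(max(j + 1, 1), n)]
  (1 - g) * lo + g * hi
}

x <- c(2.1, 3.4, 1.9, 5.6, 4.8, 7.2, 3.3)   # made-up data
types   <- 4:9
manual  <- sapply(types, function(tp) manual_quantile(x, 0.3, tp))
builtin <- sapply(types, function(tp) quantile(x, 0.3, type = tp, names = FALSE))
print(rbind(type = types, manual = manual, builtin = builtin))
print(all.equal(manual, builtin))
```

For type 7 with $n = 7$ and $p = 0.3$: $m = 0.7$, $np + m = 2.8$, so $j = 2$ and $g = 0.8$, and the quantile is $0.2\,x_2 + 0.8\,x_3$.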

Empirical Cumulative Distribution Function

ecdf(x, ...)

$$F_n(t) = \frac{\#\{x_i \le t\}}{n} = \frac{\sum_{i=1}^{n} \mathrm{Indicator}(x_i \le t)}{n}.$$
Here Indicator is the indicator function:
Indicator(TRUE) = 1; Indicator(FALSE) = 0

So, moving along the order statistics $x_i$, each additional element raises the cumulative probability by $1/n$ (when there are no ties).
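A small sketch of the definition above, using made-up data: ecdf() returns a step function, and evaluating it at the sorted sample shows the steps of size $1/n$.

```r
x  <- c(3.2, 1.5, 4.8, 2.7, 3.9)   # made-up sample, n = 5, no ties
Fn <- ecdf(x)                      # ecdf() returns a step function

# Evaluated at the order statistics, F_n climbs in steps of 1/n.
steps <- Fn(sort(x))
print(steps)

# The same value from the indicator-function definition:
# the proportion of x_i that are <= t0.
t0 <- 3.5
print(mean(x <= t0))
print(Fn(t0))
```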

ppoints

Generates $n$ "evenly" spaced probability points on [0, 1].

> ppoints
function (n, a = if (n <= 10) 3/8 else 1/2) 
{
    if (length(n) > 1L) 
        n <- length(n)
    if (n > 0) 
        (1L:n - a)/(n + 1 - 2 * a)
    else numeric()
}

From the ppoints help page $^2$:

  1. The probability points generated by ppoints are symmetric on [0, 1]: $p_i + p_{n-i+1} = 1,\ i = 1, \dots, n$.
  2. By default, $a = 3/8$ when $n \le 10$, which gives $p_i = \frac{i - 3/8}{n + 1/4}$; and $a = 1/2$ when $n > 10$, which gives $p_i = \frac{i - 0.5}{n}$. With these defaults, ppoints is typically used to generate cumulative probabilities for the standard normal distribution.
  3. Different values of $a$ correspond to different types of quantile(): $a = 1/2,\ 0,\ 1,\ 1/3,\ 3/8$ reproduce the $p[k]$ of types 5, 6, 7, 8 and 9 respectively.

qqnorm uses ppoints to generate the corresponding standard normal quantiles: x <- qnorm(ppoints(n))[order(order(y))]
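The properties above can be checked directly. This is a minimal sketch; $n = 8$ is an arbitrary choice that keeps the default at $a = 3/8$.

```r
n <- 8
p <- ppoints(n)                      # n <= 10, so a = 3/8 by default

# Symmetry: p_i + p_{n-i+1} = 1
sym_ok <- all.equal(p + rev(p), rep(1, n))

# Default formula for n <= 10: (i - 3/8) / (n + 1/4)
default_ok <- all.equal(p, (1:n - 3/8) / (n + 1/4))

# a = 1/2 gives the type 5 positions (k - 0.5) / n
type5_ok <- all.equal(ppoints(n, a = 1/2), (1:n - 0.5) / n)

# a = 1 gives the type 7 positions (k - 1) / (n - 1)
type7_ok <- all.equal(ppoints(n, a = 1), (1:n - 1) / (n - 1))

print(c(sym_ok, default_ok, type5_ok, type7_ok))
```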

QQ Plots

qqnorm

function (y, ylim, main = "Normal Q-Q Plot", xlab = "Theoretical Quantiles", 
    ylab = "Sample Quantiles", plot.it = TRUE, datax = FALSE, ...) 
{
    if (has.na <- any(ina <- is.na(y))) {
        yN <- y
        y <- y[!ina]
    }
    if (0 == (n <- length(y))) 
        stop("y is empty or has only NAs")
    if (plot.it && missing(ylim)) 
        ylim <- range(y)
    x <- qnorm(ppoints(n))[order(order(y))]
    if (has.na) {
        y <- x
        x <- yN
        x[!ina] <- y
        y <- yN
    }
    if (plot.it) 
        if (datax) 
            plot(y, x, main = main, xlab = ylab, ylab = xlab, 
                xlim = ylim, ...)
        else plot(x, y, main = main, xlab = xlab, ylab = ylab, 
            ylim = ylim, ...)
    invisible(if (datax) list(x = y, y = x) else list(x = x, y = y))
}

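From the implementation we can see that y holds the sample quantiles to be compared and x holds the theoretical quantiles of the standard normal distribution generated via ppoints; plot is then called to draw the scatter plot. The index order(order(y)) is the rank of each element of y, so every observation is paired with the theoretical quantile of the same rank.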
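The pairing step can be reproduced by hand and compared with what qqnorm returns. The sample below is simulated and the seed is arbitrary; plot.it = FALSE is used so that no graphics device is needed.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
y <- rnorm(20, mean = 5, sd = 2)   # made-up sample

n <- length(y)
theo_sorted <- qnorm(ppoints(n))   # theoretical quantiles in ascending order
ranks <- order(order(y))           # rank of each y[i] (no ties in this sample)
x_manual <- theo_sorted[ranks]     # theoretical quantile matched to each y[i]

qq <- qqnorm(y, plot.it = FALSE)   # returns list(x, y) without plotting
print(all.equal(qq$x, x_manual))
print(all(ranks == rank(y)))
```

The smallest observation is paired with the smallest theoretical quantile, the second smallest with the second smallest, and so on, while x keeps the original ordering of y.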

qqplot

> qqplot
function (x, y, plot.it = TRUE, xlab = deparse(substitute(x)), 
    ylab = deparse(substitute(y)), ...) 
{
    sx <- sort(x)
    sy <- sort(y)
    lenx <- length(sx)
    leny <- length(sy)
    if (leny < lenx) 
        sx <- approx(1L:lenx, sx, n = leny)$y
    if (leny > lenx) 
        sy <- approx(1L:leny, sy, n = lenx)$y
    if (plot.it) 
        plot(sx, sy, xlab = xlab, ylab = ylab, ...)
    invisible(list(x = sx, y = sy))
}

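From the implementation we can see that qqplot sorts both inputs and plots one against the other; when the lengths differ, the longer series is linearly interpolated (approx) down to the length of the shorter one. It can therefore draw a QQ plot or a PP plot, depending on whether the x and y passed in are quantile points or cumulative probability points.
qqplot is not restricted to the standard normal distribution: it can draw a QQ plot of the sample quantiles against the theoretical quantiles of any distribution, and it can also be used to compare whether two series y and x follow the same (arbitrary) distribution.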
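Two hedged usage sketches follow: a sample against theoretical exponential quantiles, and two samples of different lengths. The data are simulated, the seed and the rate are arbitrary, and plot.it = FALSE only inspects the returned points.

```r
set.seed(2)                          # arbitrary seed
y <- rexp(50, rate = 2)              # made-up sample

# Sample quantiles vs. theoretical exponential quantiles
# evaluated at the ppoints probabilities.
theo <- qexp(ppoints(length(y)), rate = 2)
qq1 <- qqplot(theo, y, plot.it = FALSE)

# Two samples of different length: the longer one is interpolated
# down to the length of the shorter one.
a <- rnorm(30)
b <- rnorm(100)
qq2 <- qqplot(a, b, plot.it = FALSE)
print(length(qq2$x))
print(length(qq2$y))
```

Dropping plot.it = FALSE draws the plots; points lying close to a straight line suggest that the two sets of quantiles agree up to location and scale.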

qqline

> qqline
function (y, datax = FALSE, distribution = qnorm, probs = c(0.25, 
    0.75), qtype = 7, ...) 
{
    stopifnot(length(probs) == 2, is.function(distribution))
    y <- quantile(y, probs, names = FALSE, type = qtype, na.rm = TRUE)
    x <- distribution(probs)
    if (datax) {
        slope <- diff(x)/diff(y)
        int <- x[1L] - slope * y[1L]
    }
    else {
        slope <- diff(y)/diff(x)
        int <- y[1L] - slope * x[1L]
    }
    abline(int, slope, ...)
}

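From the implementation we can see that, by default, qqline takes the first and third quartiles (Q1 and Q3) of y, computes the corresponding quartiles x from the theoretical distribution, and draws a straight line through $(x_1, y_1)$ and $(x_2, y_2)$.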
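The slope and intercept of that line can be computed by hand, following the datax = FALSE branch of the code. The sample is simulated and the seed is arbitrary.

```r
set.seed(3)                          # arbitrary seed
y <- rnorm(40, mean = 10, sd = 3)    # made-up sample

probs <- c(0.25, 0.75)
yq <- quantile(y, probs, names = FALSE, type = 7)   # sample Q1 and Q3
xq <- qnorm(probs)                                   # theoretical Q1 and Q3

slope     <- diff(yq) / diff(xq)
intercept <- yq[1] - slope * xq[1]
print(c(intercept = intercept, slope = slope))

# To draw it (needs a graphics device):
# qqnorm(y); abline(intercept, slope, col = "red"); qqline(y, lty = 2)
```

For a roughly normal sample, the slope is an estimate of the standard deviation and the intercept an estimate of the mean, because the line passes through the quartiles rather than being fitted by least squares.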

Ref:

  1. R: help(quantile)
  2. R: help(ppoints)
