How to Draw Efficient, Good-Looking Correlation Plots in R

Reposted from: http://site.douban.com/182577/widget/notes/12866356/note/267793990/

As open-source software, R is friendly in that it usually offers an alternative when you get tired of one particular function :) For example, if you no longer feel like typing p-l-o-t when you need pair plots, you are welcome to type s-p-l-o-m instead, provided you have installed the lattice package and loaded it with library() first.

# Draw pair plots of the returns with splom() from the lattice package
library(lattice)
splom(Capm/100, pch=19, col=rgb(0,0,100,50,maxColorValue=255))
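
The col argument in the call above is a semi-transparent dark blue, which helps keep overlapping points readable in a dense pair plot. As a small illustration in base R only (the variable names point.col and alpha.fraction are mine, not from the original post), this is what that rgb() call evaluates to:

```r
# rgb() takes red, green, blue and an optional alpha (opacity) value.
# With maxColorValue = 255, each channel is given on a 0-255 scale.
point.col <- rgb(0, 0, 100, 50, maxColorValue = 255)

# The result is a hex string "#RRGGBBAA"; the last two hex digits are alpha.
print(point.col)

# Alpha as a fraction of full opacity (50 out of 255, roughly 20%)
alpha.fraction <- col2rgb(point.col, alpha = TRUE)["alpha", 1] / 255
print(round(alpha.fraction, 2))
```

Lowering the alpha value makes the points more transparent, so dense regions show up as darker patches.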

 
 

 


To be honest, I do not think splom() is a better choice than plot(). plot() is a function I use every day in R; I could say its name out loud in my sleep, whereas nine times out of ten I cannot remember whether the other one is spelled s-p-o-x-x or s-p-l-x-x. On top of that, splom() brings no improvement, unless making the plot messier and harder to read counts as one. But keep your chin up, my friend: better alternatives do exist. For example, you can simply use another base plotting function, pairs(), which is called in almost the same way as plot().

pairs(Capm/100,labels=c("Food","Durables","Construction","Market","Risk-free"),
      pch=19,col=rgb(0,0,100,50,maxColorValue=255))

 

 

 
 

 


Note that in this example I use a new argument, labels. It relabels the panels with more intuitive names without changing the column names of the raw data.
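
A minimal sketch of that behavior, using the built-in mtcars data as a stand-in because the post does not say where Capm comes from (the column choice and the label strings are my own assumptions):

```r
# Built-in data as a stand-in for the Capm returns (an assumption for this sketch)
dat <- mtcars[, c("mpg", "disp", "hp", "wt")]

# labels only changes the text drawn in the diagonal panels
pairs(dat,
      labels = c("Miles/gallon", "Displacement", "Horsepower", "Weight"),
      pch = 19, col = rgb(0, 0, 100, 50, maxColorValue = 255))

# The data frame itself is untouched: the original column names are still there
print(colnames(dat))
```

Because the labels live only in the plot call, the same data frame can be reused later with its original column names.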

We can also turn to the plotcov() function in the pcaPP package and the plotcorr() function in the ellipse package.

plotcov() shows both the values and the shapes of the correlations in the dataset at the same time, which is informative, concise and space-saving.

# Draw the correlations of the returns with plotcov() from the pcaPP package
library(pcaPP)
plotcov(cor(Capm/100), method1="correlation")

 

 

 
 

 


Sometimes people push minimalism a little too far, and the result is a plot like the one plotcorr() draws.

# Draw the correlations of the returns with plotcorr() from the ellipse package
library(ellipse)
plotcorr(cor(Capm/100), numbers=TRUE, diag=FALSE, type="upper")

 

 

 
 

 


I am a Muji-style addict, so I will happily declare my love for the plotcorr() style whenever I do not have to care about the “big” difference between -0.02691534 and 0; between -0.09965299, -0.07210179, -0.07680539 and -1 (meaning -0.1); between 0.66885253, 0.72093253 and 7 (meaning 0.7); or between 0.77307668, 0.78501333 and 8 (meaning 0.8). Most of the time we use pair plots only to spot trends in linear relationships, so 0.77307668 is not meaningfully different from 0.78501333. plotcorr() also lets you draw the upper triangle, the lower triangle or the full correlation matrix by setting the type argument to “upper”, “lower” or “full”.
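
To see where the single digits come from, here is a minimal sketch in base R. It assumes that plotcorr(), when asked for numbers, prints each correlation multiplied by ten and rounded to the nearest integer; that assumption matches the digits quoted above, but the helper name to.digit is hypothetical and not part of the ellipse package.

```r
# Hypothetical helper mimicking the single-digit labels (an assumption, see above)
to.digit <- function(r) round(10 * r)

# Correlations quoted in the text
r <- c(-0.02691534, -0.09965299, -0.07210179, -0.07680539,
       0.66885253, 0.72093253, 0.77307668, 0.78501333)

# Side-by-side view of each correlation and the digit that would be printed
print(data.frame(correlation = r, digit = to.digit(r)))
```

Read this way, a printed 8 stands for a correlation of roughly 0.8 and a printed -1 for roughly -0.1.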

 

 

 

 

 

 

At other times, geeks (not me) feel they have to do something advanced to set themselves apart from people of average IQ (like me), so their plotcorr() plot can also flag the significance of each sample correlation coefficient, using a threshold derived from the t test statistic, as computed in the code below.
 
 
 


That is, if a correlation lies inside the interval between -r.threshold and r.threshold, its ellipse is filled in blue; it is filled in red if the value is above the upper bound and in yellow if it is below the lower bound. In the Capm example every sample correlation falls inside this interval, so all the ellipses are blue.

# Draw the correlation plot, coloring each ellipse by a significance threshold
# sig.r() returns the critical correlation for a t test with n - 2 degrees of freedom
sig.r <- function(p, n)
{
  df <- n - 2
  t.stat <- qt(p, df)
  t.stat / sqrt(t.stat^2 + df)
}
r.threshold <- sig.r(0.975, 4)  # n is the number of observations; 4 is the value passed in the original
corr <- cor(Capm/100)
col <- ifelse(corr > r.threshold, "red", ifelse(corr < -r.threshold, "yellow", "blue"))
plotcorr(corr, col=col, diag=FALSE, cex.lab=0.75, type="upper", numbers=FALSE)
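
The threshold in the code above can be traced back to a formula. A minimal sketch, assuming the usual t test for a zero correlation: the statistic is t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom, and solving for r gives r = t / sqrt(t^2 + n - 2), which is the expression sig.r() returns. The function names t.from.r and r.from.t are mine, introduced only for illustration.

```r
# t statistic for testing whether a correlation r from n observations is zero
t.from.r <- function(r, n) r * sqrt(n - 2) / sqrt(1 - r^2)

# Inverse: the correlation that corresponds to a given t statistic
r.from.t <- function(t, n) t / sqrt(t^2 + (n - 2))

# Threshold for a two-sided 5% test with n = 4, as in the call sig.r(0.975, 4)
n <- 4
t.crit <- qt(0.975, df = n - 2)
r.crit <- r.from.t(t.crit, n)
print(r.crit)  # about 0.95

# Round trip: converting the threshold back gives the critical t value again
print(all.equal(t.from.r(r.crit, n), t.crit))
```

With n = 4 the threshold is very high (about 0.95), so correlations such as the 0.79 quoted earlier stay inside the bounds; a larger n lowers the threshold.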
 
 


Finally, since plotcov() was originally designed to allow a direct comparison of two estimates of a covariance matrix in one plot, we use it to compare the sample correlations and the robust correlations of the Capm dataset.

# Compare the sample correlation matrix with a robust correlation matrix
library(robust)
cor.sample <- cor(Capm/100)
cor.robust <- covRob(Capm/100, cor=TRUE)
plotcov(cov1=cor.sample, cov2=cor.robust, method1="sample", method2="robust")
 
 


In most cases the sample correlations are close to the robust correlations in the Capm example. The almost-overlapping ellipses in the upper triangle also suggest that we can be confident in our correlation values.
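
Why compare with a robust estimate at all? Here is a minimal sketch with simulated data, using base R only. Note the swap: instead of covRob() it uses the rank-based Spearman correlation as a simple outlier-resistant stand-in, and the data and variable names are invented for illustration.

```r
set.seed(1)

# Two strongly related simulated series (not the Capm data)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.5)

# The same data plus one gross outlier
x.out <- c(x, 10)
y.out <- c(y, -10)

# Change in each estimate caused by the single outlier
pearson.shift  <- abs(cor(x.out, y.out) - cor(x, y))
spearman.shift <- abs(cor(x.out, y.out, method = "spearman") -
                      cor(x, y, method = "spearman"))

print(c(pearson = pearson.shift, spearman = spearman.shift))
```

The ordinary sample correlation moves much more than the rank-based one here. When the two kinds of estimate agree, as they do for Capm, it is less likely that a few extreme observations are driving the sample correlations.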

Last but not least, Captain recommended the corrgram() function in the corrgram package to me. A specialist in correlation plots, corrgram() shows the relationships between variables in a variety of forms, selected through its panel arguments.

library(corrgram)
corrgram(Capm/100,labels=c("Food","Durables","Construction","Market","Risk-free"),
         lower.panel=panel.shade,upper.panel=panel.pie,text.panel=panel.txt)
 
 


In the plot above, the direction of the diagonal hatching in the lower panels separates the relationships into positive and negative ones. In addition, blue denotes a positive correlation and pink a negative one. The darker the color and the larger the filled area, the stronger the correlation.

corrgram(Capm/100,labels=c("Food","Durables","Construction","Market","Risk-free"),
         lower.panel=panel.pts, upper.panel=panel.conf, diag.panel=panel.density)
 
 


corrgram(Capm/100,labels=c("Food","Durables","Construction","Market","Risk-free"),
         panel=panel.ellipse, text.panel=panel.txt, diag.panel=panel.minmax)
 
 




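PS: I wrote the first draft in one go on the 10th, but finals dragged on, and only today, after finishing off my last exam, did I have time to make a few small edits and post it = = After years of hanging around on Douban, this is the first time I have written an article of this kind. Mathematical finance really is fascinating! I will try to write another short piece on hypothesis testing before spring break ends. Comments and corrections are very welcome!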
 

 

 

 
