Reference information
Book Name: ggplot2 - Elegant Graphics for Data Analysis
Author: Hadley Wickham
Publisher: Springer
ISBN: 978-0-387-98140-6
e-ISBN: 978-0-387-98141-3
Intro
- Create new graphics that are precisely tailored for your problem
Resources:
- Online documentation, the same content as the R help but more visible: http://had.co.nz/ggplot2
- Mailing list (if you use ggplot2 regularly, it's a good idea to sign up): http://groups.google.com/group/ggplot2
- The book website, which provides updates to this book: http://had.co.nz/ggplot2/book
Grammar of graphics
A statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars).
The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system.
Faceting can be used to generate the same plot for different subsets of the dataset.
- data
- aes: describing how variables in the data are mapped to aesthetic attributes
- geom: what you actually see on the plot: points, lines, polygons, etc.
- stat: statistical transformations
- scale: map values in the data space to values in an aesthetic space, colour, size, shape
- coord: data coordinates mapped to the plane of the graphic, axes and gridlines, Cartesian, polar and map projection
- facet: break up the data into subsets, display those subsets
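As a rough sketch of how these components fit together, the layered ggplot() interface lets each one be stated explicitly. The specific choices below (log scales, a linear smoother, faceting by cut) are illustrative assumptions and do not come from these notes.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price)) +   # data + aes
  geom_point(alpha = 1/10) +                         # geom
  stat_smooth(method = "lm", formula = y ~ x) +      # stat
  scale_x_log10() + scale_y_log10() +                # scale
  coord_cartesian() +                                # coord
  facet_wrap(~ cut)                                  # facet

# print(p) draws the plot
```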
Relevant resources
- Which plot to produce: Chambers et al. (1983); Cleveland (1993a); Robbins (2004); Tukey (1977).
- Create an attractive plot: Tufte (1990, 1997, 2001, 2006).
- Dynamic and interactive graphics: Cook and Swayne (2007), rggobi package.
qplot
short for quick plot
Basic use
The first two arguments to qplot() are x and y.
qplot(carat,price, data=diamonds)
qplot(log(carat), log(price), data=diamonds)
qplot(carat, x*y*z, data=diamonds)
Colour, size, shape and other aesthetic attributes
- With base plot(), it's your responsibility to convert a categorical variable in your data into something that plot() knows how to use (e.g. colour names or point symbols).
- qplot can do this for you automatically, and it will automatically provide a legend that maps the displayed attributes to the data values.
Augment the plot of carat and price with information about diamond colour and cut.
qplot(carat, price, data=dsmall, colour=color)
qplot(carat, price, data=dsmall, shape=cut)
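The dsmall data frame used in these examples is not defined in the notes. A minimal sketch, assuming it is a random sample of 100 rows from diamonds (the seed is an assumption, used only for reproducibility):

```r
library(ggplot2)

set.seed(1410)  # assumed seed; any value works
dsmall <- diamonds[sample(nrow(diamonds), 100), ]
```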
You can also manually set the aesthetics using I().
For large datasets, semitransparent points are often useful to alleviate some of the overplotting.
It's often useful to specify the transparency as a fraction, e.g., 1/10 or 2/10, as the denominator specifies the number of points that must overplot to get a completely opaque colour.
qplot(carat, price, data=diamonds, alpha=I(1/10))
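To make the difference between mapping and setting concrete, the sketch below contrasts the two. The value "red" is an arbitrary illustration, and dsmall is assumed to be a 100-row random sample of diamonds.

```r
library(ggplot2)

set.seed(1410)  # assumed seed
dsmall <- diamonds[sample(nrow(diamonds), 100), ]

# Mapped: "red" is treated as a data value, so a colour scale chooses the
# actual colour and a legend is added
p_mapped <- qplot(carat, price, data = dsmall, colour = "red")

# Set: I() passes the value through unchanged, so the points are drawn in red
p_set <- qplot(carat, price, data = dsmall, colour = I("red"))
```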
Plot geoms
- geom='point'
default when both x and y are supplied
- geom='smooth'
fits a smoother to the data and displays the smooth and its standard error
- geom='boxplot'
- geom='path' and geom='line'
A line plot is constrained to produce lines that travel from left to right, while paths can go in any direction.
- 1d distributions, continuous variables
geom='histogram' draws a histogram (the default when only x is supplied), geom='freqpoly' a frequency polygon, and geom='density' creates a density plot
- 1d distributions, discrete variables
geom='bar' makes a bar chart
Adding a smoother to a plot
qplot(carat, price, data=diamonds, geom=c('point','smooth'))
If you want to turn the confidence interval off, use se=FALSE.
There are many different smoothers you can choose between by using the method argument.
- method='loess'
default for small n, uses a smooth local regression.
The wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly).
qplot(carat, price, data=dsmall, geom=c('point','smooth'), span=0.2)
Loess does not work well for large datasets.
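A small sketch of the effect of span and se, written in the layered geom_smooth() form so that both arguments apply only to the smoother. The two span values are arbitrary, and dsmall is assumed to be a 100-row random sample of diamonds.

```r
library(ggplot2)

set.seed(1410)  # assumed seed
dsmall <- diamonds[sample(nrow(diamonds), 100), ]

base <- ggplot(dsmall, aes(carat, price)) + geom_point()

# Small span: the fit follows local fluctuations closely
p_wiggly <- base + geom_smooth(method = "loess", formula = y ~ x, span = 0.2)

# span = 1: a much smoother curve; se = FALSE hides the confidence band
p_smooth <- base + geom_smooth(method = "loess", formula = y ~ x,
                               span = 1, se = FALSE)
```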
- method='gam'
Requires the mgcv package to be loaded.
Use formula=y~s(x) to fit a generalised additive model.
Similar to using a spline with lm, but the degree of smoothness is estimated from the data.
For large datasets, use the formula y~s(x,bs='cs'). This is the default when there are more than 1000 points.
library(mgcv)
qplot(carat, price, data = dsmall, geom=c('point', 'smooth'), method='gam', formula=y~s(x))
qplot(carat, price, data = diamonds, geom=c('point','smooth'),
method='gam', formula=y~s(x,bs='cs'))
- method='lm'
- default: a straight line.
- formula=y~poly(x,2) : specify a degree 2 polynomial
- formula=y~ns(x,2) : load the splines package and use a natural spline. (The second parameter is the degrees of freedom; a higher number creates a wigglier curve.)
library(splines)
qplot(carat, price, data=dsmall, geom=c('point','smooth'),method='lm')
qplot(carat, price, data=dsmall, geom=c('point','smooth'),method='lm',formula=y~ns(x,5))
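The degree 2 polynomial mentioned in the list has no example here, so a sketch follows next to the natural spline for comparison. It uses the layered geom_smooth() form, and dsmall is assumed to be a 100-row random sample of diamonds.

```r
library(ggplot2)
library(splines)

set.seed(1410)  # assumed seed
dsmall <- diamonds[sample(nrow(diamonds), 100), ]

base <- ggplot(dsmall, aes(carat, price)) + geom_point()

# Quadratic fit
p_quad <- base + geom_smooth(method = "lm", formula = y ~ poly(x, 2))

# Natural spline with 5 degrees of freedom (wigglier than the quadratic)
p_ns <- base + geom_smooth(method = "lm", formula = y ~ ns(x, 5))
```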
Boxplots and jittered points
How the values of the continuous variables vary with the levels of the categorical variable.
- geom='jitter'
- geom='boxplot'
Boxplots summarise the bulk of the distribution with only five numbers, while jittered plots show every point but can suffer from overplotting.
Boxplots show the median and the adjacent quartiles of each group.
The overplotting seen in the plot of jittered values can be alleviated somewhat by using semi-transparent points using the alpha argument.
qplot(color, price/carat, data=diamonds, geom='jitter', alpha=I(1/50))
**aesthetics:** size, colour, shape, and fill (boxplot only)
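The notes show only the jittered version. As a sketch, the boxplot counterpart is below, together with one way to inspect the five numbers drawn for each colour. layer_data() is used here purely for inspection; the column selection is this example's choice.

```r
library(ggplot2)

p_box <- qplot(color, price / carat, data = diamonds, geom = "boxplot")

# Per colour: lower whisker end, lower quartile, median, upper quartile,
# upper whisker end (whiskers stop at the most extreme point within
# 1.5 * IQR of the box, so they are not necessarily the min and max)
five <- layer_data(p_box)[, c("ymin", "lower", "middle", "upper", "ymax")]
```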
Histogram and density plots
qplot(carat, data = diamonds, geom='histogram')
qplot(carat, data= diamonds, geom='density')
For the density plot, the adjust argument controls the degree of smoothness (high values of adjust produce smoother plots).
For the histogram, the binwidth argument controls the amount of smoothing by setting the bin size. (Break points can also be specified explicitly, using the breaks argument.)
- Gross features of the data show up well at a large bin width, while finer features require a very narrow width.
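A sketch of how binwidth and adjust change the picture. The particular values are illustrative choices.

```r
library(ggplot2)

# Wide bins show only the gross shape; narrow bins reveal fine structure
p_coarse <- qplot(carat, data = diamonds, geom = "histogram",
                  binwidth = 1, xlim = c(0, 3))
p_medium <- qplot(carat, data = diamonds, geom = "histogram",
                  binwidth = 0.1, xlim = c(0, 3))
p_fine   <- qplot(carat, data = diamonds, geom = "histogram",
                  binwidth = 0.01, xlim = c(0, 3))

# For densities, adjust < 1 gives a rougher curve, adjust > 1 a smoother one
p_rough  <- qplot(carat, data = diamonds, geom = "density", adjust = 0.5)
p_smooth <- qplot(carat, data = diamonds, geom = "density", adjust = 2)
```

Values outside xlim are dropped with a warning, which is expected here because a few diamonds weigh more than 3 carats.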
To compare the distributions of different subgroups, just add an aesthetic mapping, as in the following code.
qplot(carat, data=diamonds, geom='density', colour = color)
qplot(carat, data=diamonds, geom='histogram', fill = color)
The density plot is more appealing at first because it seems easy to read and compare the various curves. However, it is more difficult to understand exactly what a density plot is showing.
In addition, the density plot makes some assumptions that may not be true for our data, i.e. that it is unbounded, continuous and smooth.
Bar charts
The discrete analogue of the histogram is the bar chart.
geom='bar'
The bar geom counts the number of instances of each class so that you don't need to tabulate your values beforehand.
If you'd like to tabulate class members in some other way, such as by summing up a continuous variable, you can use the weight aesthetic.
qplot(color, data=diamonds, geom='bar',weight=carat)+scale_y_continuous('carat')
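One way to check what the weight aesthetic does is to compare the bar heights with the per-colour carat totals computed by hand. This is a sketch; layer_data() is used only to read back the computed heights.

```r
library(ggplot2)

p_weighted <- qplot(color, data = diamonds, geom = "bar", weight = carat) +
  scale_y_continuous("carat")

# Heights of the bars as computed by the plot
bar_heights <- layer_data(p_weighted)$count

# Total carat per colour, computed directly from the data
carat_totals <- tapply(diamonds$carat, diamonds$color, sum)

heights_match <- isTRUE(all.equal(bar_heights, as.numeric(carat_totals)))
```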
Time series with line and path plots
Line and path plots are typically used for time series data.
- Line plots join the points from left to right.
- Path plots join them in the order that they appear in the dataset.
qplot(date, unemploy/pop, data = economics, geom='line')
We could draw a scatterplot of unemployment rate vs. length of unemployment, but then we could no longer see the evolution over time. The solution is to join points adjacent in time with line segments, forming a path plot.
Apply the colour aesthetic to the line to make it easier to see the direction of time.
year <- function(x) as.POSIXlt(x)$year + 1900  # helper: calendar year of a Date
qplot(unemploy/pop, uempmed, data = economics, geom='path', colour = year(date))
Faceting
We have already discussed using aesthetics (colour and shape) to compare subgroups, drawing all groups on the same plot. Faceting takes an alternative approach.
qplot(carat, data=diamonds, facets=color~., geom='histogram',binwidth=0.1, xlim=c(0,3))
qplot(carat, ..density.., data=diamonds, facets=color~., geom='histogram', binwidth=0.1, xlim=c(0,3))
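The facets formula reads as rows ~ columns, with . as a placeholder for "no variable on this side". A sketch of the same histogram faceted both ways:

```r
library(ggplot2)

# color ~ . : one row per colour, panels stacked vertically
p_rows <- qplot(carat, data = diamonds, facets = color ~ .,
                geom = "histogram", binwidth = 0.1, xlim = c(0, 3))

# . ~ color : one column per colour, panels side by side
p_cols <- qplot(carat, data = diamonds, facets = . ~ color,
                geom = "histogram", binwidth = 0.1, xlim = c(0, 3))
```

Stacking the panels in rows keeps the carat axis aligned, which makes it easier to compare where each distribution sits.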
Other options
xlim , ylim
log : e.g. log='x' will log the x-axis, log='xy' will log both.
main : main title of the plot, can be a string or an expression
xlab, ylab
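A short sketch combining these options. The label text and limits are illustrative, and dsmall is assumed to be a 100-row random sample of diamonds.

```r
library(ggplot2)

set.seed(1410)  # assumed seed
dsmall <- diamonds[sample(nrow(diamonds), 100), ]

# Axis labels (string or expression), a title, and x-axis limits
p_labelled <- qplot(carat, price / carat, data = dsmall,
                    xlab = "Weight (carats)",
                    ylab = expression(frac(price, carat)),
                    main = "Small diamonds",
                    xlim = c(0.2, 1))

# Log-transform both axes
p_logged <- qplot(carat, price, data = dsmall, log = "xy")
```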