2021-02-18 Data Visualization with ggplot2 ch1-4

Preparation: use library() to load the package and str() to explore the structure of the data.

# Load the ggplot2 package
library(ggplot2)

# Explore the mtcars data frame with str()
str(mtcars)

# Execute the following command
ggplot(mtcars, aes(cyl, mpg)) +
  geom_point()

Although cyl (the number of cylinders) is categorical, it is stored as numeric in mtcars. This is misleading because the representation in the plot doesn't match the actual data type, so you have to tell ggplot2 explicitly that cyl is a categorical variable, for example by mapping factor(cyl).

diamonds dataset:
The diamonds dataset used in the course contains details of 1,000 diamonds (the full version that ships with ggplot2 has 53,940 rows). Among the variables included are carat (a measurement of the diamond's size) and price.

You'll use two common geom layer functions:

  • geom_point() adds points (as in a scatter plot).
  • geom_smooth() adds a smooth trend curve.

As you saw previously, these are added using the + operator.

ggplot(data, aes(x, y)) +
  geom_*()

Where * is the specific geometry needed.
Use geom_smooth() to add a smooth trend curve through those points.

# Add geom_smooth() with +
ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  geom_smooth()

geom_point() has an alpha argument that controls the opacity of the points. A value of 1 (the default) means that the points are totally opaque; a value of 0 means the points are totally transparent (and therefore invisible). Values in between specify transparency.
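
As a minimal sketch of the alpha argument described above (the value 0.4 and the name plt_alpha are illustrative assumptions, not taken from the notes), semi-transparent points make dense regions appear darker:

```r
library(ggplot2)

# Assumption: alpha = 0.4 is an arbitrary illustrative value
plt_alpha <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.4) +  # semi-transparent points
  geom_smooth()              # smooth trend curve on top
```

Typing plt_alpha at the console draws the plot.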


Plots can be saved as variables:

# From previous step
plt_price_vs_carat <- ggplot(diamonds, aes(carat, price))

# Edit this to map color to clarity,
# Assign the updated plot to a new object
plt_price_vs_carat_by_clarity <- plt_price_vs_carat + geom_point(aes(color=clarity))

# See the plot
plt_price_vs_carat_by_clarity

change the shape and size of the points:

ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
  # Set the shape and size of the points
  geom_point(shape=1,size=4)

shape = 1 is a hollow circle.

Typically, the color aesthetic changes the outline of a geom and the fill aesthetic changes the inside. geom_point() is an exception: you use color (not fill) for the point color. However, some shapes have special behavior.

The default geom_point() uses shape = 19: a solid circle. An alternative is shape = 21: a circle that allows you to use both fill for the inside and color for the outline. This lets you map two aesthetics to each point.

All shape values are described on the points() help page.

fcyl and fam are the cyl and am columns converted to factors, respectively.
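
The notes do not show how fcyl and fam are created. A plausible sketch is below; the level labels "automatic" and "manual" are an assumption based on the coding in ?mtcars (0 = automatic, 1 = manual) and on the palette names used later in these notes.

```r
# Works on a local copy of mtcars; the packaged dataset is not modified
mtcars$fcyl <- factor(mtcars$cyl)

# Assumption: label 0 as "automatic" and 1 as "manual", following ?mtcars
mtcars$fam <- factor(mtcars$am, levels = c(0, 1),
                     labels = c("automatic", "manual"))

str(mtcars[, c("fcyl", "fam")])
```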

# Map color to fam
ggplot(mtcars, aes(wt, mpg, fill = fcyl,color=fam)) +
  geom_point(shape = 21, size = 4, alpha = 0.6)

The default shape for points has only a color attribute, not a fill attribute. Use fill when you have another geom (such as a bar), or when using a point shape that has both a fill and a color attribute, such as shape = 21, which is a circle with an outline. Any time you use a solid color, use alpha blending to account for overplotting.

Save the base plot as a variable, then add layers to it; aesthetics can also be mapped inside a layer, e.g. geom_point(aes(x = ..., y = ...)).

geom_text() requires a label aesthetic: aes(label = ...).

# Base layer
plt_mpg_vs_wt <- ggplot(mtcars, aes(wt, mpg))

# Use text layer and map fcyl to label
plt_mpg_vs_wt +
  geom_text(aes(label = fcyl))

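The shape aesthetic accepts only categorical variables (ggplot2 refuses to map a continuous variable to shape); label is also most often used with categorical or text values.

In geom_point(), fixed values for color, size, etc. are attributes and should not be wrapped in aes().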

# A hexadecimal color
my_blue <- "#4ABEFF"

# Change the color mapping to a fill mapping
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
  # Set point size to 10; shape to 1
  geom_point(color = my_blue, size = 10, shape = 1)

geom_text(): draws text at the position of each point.
label: the text to draw (geom_label() draws it inside a text box).
In this exercise label is set as an attribute, so it does not need to be wrapped in aes():

ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
  geom_text(label = rownames(mtcars), color = "red")

  • labs() to set the x- and y-axis labels. It takes strings for each argument.
  • scale_color_manual() defines properties of the color scale (i.e. axis). The first argument sets the legend title. values is a named vector of colors to use.

Use position to arrange the bars:

palette <- c(automatic = "#377EB8", manual = "#E41A1C")

# Set the position
ggplot(mtcars, aes(fcyl, fill = fam)) +
  geom_bar(position = 'dodge') +
  labs(x = "Number of Cylinders", y = "Count") +
  scale_fill_manual("Transmission", values = palette)

Position adjustments for overlapping elements:

  • identity: leave the positions unchanged.
  • dodge: place elements side by side to avoid overlap.
  • stack: stack the elements on top of each other.
  • fill: stack the elements and rescale each stack to a height of 1.
  • jitter: add a small amount of random noise to avoid overlap.
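
To compare the adjustments listed above, the following sketch builds the same bar chart with three positions. The data frame name cars and the plot names are hypothetical; the factor columns are recreated here so the block runs on its own.

```r
library(ggplot2)

# Self-contained copy of mtcars with factor columns (names follow the notes)
cars <- transform(
  mtcars,
  fcyl = factor(cyl),
  fam  = factor(am, labels = c("automatic", "manual"))
)

base <- ggplot(cars, aes(fcyl, fill = fam))

p_stack <- base + geom_bar(position = "stack")  # bars stacked on top of each other
p_fill  <- base + geom_bar(position = "fill")   # stacks rescaled to a height of 1
p_dodge <- base + geom_bar(position = "dodge")  # bars placed side by side
```

Printing each object in turn shows how only the arrangement of the bars changes while the underlying counts stay the same.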

Univariate plots:

ggplot(mtcars, aes(mpg, 0)) +
  geom_jitter() +
  # Set the y-axis limits
  ylim(-2, 2)

Use aes(x, 0) to fix y at 0, then set limits on the y-axis.

Typically, alpha blending (i.e. adding transparency) is recommended when using solid shapes. Alternatively, you can use opaque, hollow shapes.
Small points are suitable for large datasets with regions of high density (lots of overlapping).
Let's use the diamonds dataset to practice dealing with the large dataset case.
shape should be set in the geom_point(), rather than in ggplot()
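
For the large-dataset case, a possible sketch is shown below. The choices shape = "." (a one-pixel point) and alpha = 0.5 are illustrative assumptions, and plt_small_points is a hypothetical name.

```r
library(ggplot2)

# Shape and alpha are set as attributes in the geom layer, not in ggplot()
plt_small_points <- ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.5, shape = ".")
```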

# Plot base
plt_mpg_vs_fcyl_by_fam <- ggplot(mtcars, aes(fcyl, mpg, color = fam))

# Default points are shown for comparison
plt_mpg_vs_fcyl_by_fam + geom_point()

# Now jitter and dodge the point positions
plt_mpg_vs_fcyl_by_fam +
  geom_point(position = position_jitterdodge(jitter.width = 0.3, dodge.width = 0.3))

Alternative ways:
  1. geom_point(alpha = 0.5, position = "jitter")
  2. geom_point(alpha = 0.5, position = position_jitter(width = 0.1))

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Use a jitter position function with width 0.1
  geom_point(alpha = 0.5,position=position_jitter(width=0.1))

Notice that jitter can be a geom itself (i.e. geom_jitter()), an argument in geom_point() (i.e. position = "jitter"), or a position function, (i.e. position_jitter()).

geom_point() with a jitter position can also be replaced by geom_jitter().
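
A short sketch of that substitution, reusing the iris example and the width of 0.1 from above (the names p_point and p_jitter are hypothetical):

```r
library(ggplot2)

# geom_point() with an explicit jitter position function
p_point <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(alpha = 0.5, position = position_jitter(width = 0.1))

# geom_jitter() as a shortcut for the same kind of layer
p_jitter <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_jitter(alpha = 0.5, width = 0.1)
```

Because jittering is random, the two plots will not match point for point unless a seed is fixed.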


Integer data

This can be type integer (i.e. 1, 2, 3…) or categorical (i.e. class factor) variables. A factor is just a special class of type integer.
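
The relationship between factors and integers can be checked directly in base R; the vector below is made up for illustration.

```r
# A factor stores integer codes plus a character vector of levels
f <- factor(c("low", "high", "low"))

typeof(f)      # "integer"
class(f)       # "factor"
levels(f)      # "high" "low" (sorted alphabetically)
as.integer(f)  # underlying codes: 2 1 2
```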

You'll typically have a small, defined number of intersections between two variables, which is similar to case 3, but you may miss it if you don't realize that integer and factor data are the same as low precision data.

The Vocab dataset provided contains the years of education and vocabulary test scores from respondents to US General Social Surveys from 1972-2004.

ggplot(Vocab, aes(education, vocabulary)) +
  # Set the shape to 1
  geom_jitter(alpha = 0.2, shape=1)

Drawing histograms
By default, geom_histogram() maps the internally calculated count variable (the number of observations in each bin) onto the y aesthetic. An internal variable called density can be accessed by using the .. notation, i.e. ..density.. . Plotting this variable shows the relative frequency, which is the height times the width of each bin.

# Map y to ..density..
ggplot(mtcars, aes(mpg, ..density..)) +
  geom_histogram(binwidth = 1)
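
In ggplot2 3.3.0 and later the same variable can be written as after_stat(density). The sketch below, with the hypothetical names plt_density and bars, also checks the relative-frequency interpretation: the bar areas (height times width) should add up to 1.

```r
library(ggplot2)

# after_stat(density) is the newer spelling of ..density..
plt_density <- ggplot(mtcars, aes(mpg, after_stat(density))) +
  geom_histogram(binwidth = 1)

# Extract the computed bars and sum their areas
bars <- layer_data(plt_density)
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
total_area
```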

add color:

datacamp_light_blue <- "#51A8C9"
ggplot(mtcars, aes(mpg, ..density..)) +
  # Set the fill color to datacamp_light_blue
  geom_histogram(binwidth = 1, fill=datacamp_light_blue)

With position_dodge() you can specify the dodge width:

ggplot(mtcars, aes(cyl, fill = fam)) +
  # Set the transparency to 0.6
  geom_bar(position = position_dodge(width = 0.2),alpha=0.6)

Use a scale_*() function to set the palette:

# Add a bar layer with position "fill"
geom_bar(position = "fill") +
  # Add a brewer fill scale with default palette
  scale_fill_brewer()

Warning message: n too large, allowed maximum for palette Blues is 9
Returning the palette you asked for with that many colors

The warning appears because the default Blues palette provides at most 9 colors and the fill variable has more levels than that.

represents the proportion of the population that is unemployed.

Use line graph:

# Plot the Rainbow Salmon time series
ggplot(fish.species, aes(x = Year, y = Rainbow)) +
  geom_line()

# Plot the Pink Salmon time series
ggplot(fish.species, aes(x = Year, y = Pink)) +
  geom_line()

# Plot multiple time-series by grouping by species
ggplot(fish.tidy, aes(Year, Capture)) +
  geom_line(aes(group = Species))

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y =Capture, color = Species)) +
  geom_line(aes(group = Species))

To change stylistic elements of a plot, call theme() and set plot properties to a new value. For example, the following changes the legend position.

p + theme(legend.position = new_value)

Here, the new value can be

  • "top", "bottom", "left", or "right": place it at that side of the plot.
  • "none": don't draw it.
  • c(x, y): c(0, 0) means the bottom-left and c(1, 1) means the top-right.

# Position the legend inside the plot at (0.6, 0.1)
plt_prop_unemployed_over_time +
  theme(legend.position = c(0.6, 0.1))
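
The string options can be tried on any plot that has a legend. The sketch below uses the built-in iris data rather than the unemployment plot from the notes, and the object names are hypothetical.

```r
library(ggplot2)

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point()

p_bottom <- p + theme(legend.position = "bottom")  # legend below the panel
p_none   <- p + theme(legend.position = "none")    # legend not drawn
```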

Many plot elements have multiple properties that can be set. For example, line elements in the plot such as axes and gridlines have a color, a thickness (size), and a line type (solid line, dashed, or dotted). To set the style of a line, you use element_line(). For example, to make the axis lines into red, dashed lines, you would use the following.

p + theme(axis.line = element_line(color = "red", linetype = "dashed"))

Similarly, element_rect() changes rectangles and element_text() changes text. You can remove a plot element using element_blank().

Give all rectangles in the plot (the rect element) a fill color of "grey92" (very pale grey).
Remove the legend.key's outline by setting its color to be missing.

plt_prop_unemployed_over_time +
  theme(
    # For all rectangles, set the fill color to grey92
    rect = element_rect(fill = "grey92"),
    # For the legend key, turn off the outline
    legend.key = element_rect(color = NA)
  )

Remove the axis ticks (axis.ticks) by making them a blank element.
Remove the panel gridlines (panel.grid) in the same way.

plt_prop_unemployed_over_time +
  theme(
    rect = element_rect(fill = "grey92"),
    legend.key = element_rect(color = NA),
    # Turn off axis ticks
    axis.ticks = element_blank(),
    # Turn off the panel grid
    panel.grid = element_blank()
  )

plt_prop_unemployed_over_time +
  theme(
    rect = element_rect(fill = "grey92"),
    legend.key = element_rect(color = NA),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    panel.grid.major.y = element_line(
      color = "white",
      # In ggplot2 3.4.0 and later, linewidth replaces size for lines
      size = 0.5,
      linetype = "dotted"
    ),
    # Set the axis text color to grey25
    axis.text = element_text(color = "grey25"),
    # Set the plot title font face to italic and font size to 16
    plot.title = element_text(size = 16, face = "italic")
  )

Modifying whitespace

Whitespace means all the non-visible margins and spacing in the plot.

To set a single whitespace value, use unit(x, unit), where x is the amount and unit is the unit of measure.

Borders require you to set 4 positions, so use margin(top, right, bottom, left, unit). To remember the margin order, think TRouBLe.

The default unit is "pt" (points), which scales well with text. Other options include "cm", "in" (inches) and "lines" (of text).

plt_mpg_vs_wt_by_cyl is available. The panel and legend are wrapped in blue boxes so you can see how they change.

plt_mpg_vs_wt_by_cyl +
  theme(
    # Set the axis tick length to 2 lines
    axis.ticks.length = unit(2, "lines")
  )

plt_mpg_vs_wt_by_cyl +
  theme(
    # Set the legend margin to (20, 30, 40, 50) points
    legend.margin = margin(t = 20, r = 30, b = 40, l = 50, unit = "pt")
  )

plt_mpg_vs_wt_by_cyl +
  theme(
    # Set the plot margin to (10, 30, 50, 70) millimeters
    plot.margin = margin(t = 10, r = 30, b = 50, l = 70, unit = "mm")
  )

theme settings

# Theme layer saved as an object, theme_recession
theme_recession <- theme(
  rect = element_rect(fill = "grey92"),
  legend.key = element_rect(color = NA),
  axis.ticks = element_blank(),
  panel.grid = element_blank(),
  panel.grid.major.y = element_line(color = "white", size = 0.5, linetype = "dotted"),
  axis.text = element_text(color = "grey25"),
  plot.title = element_text(face = "italic", size = 16),
  legend.position = c(0.6, 0.1)
)

# Combine the Tufte theme with theme_recession
# (theme_tufte() comes from the ggthemes package)
library(ggthemes)
theme_tufte_recession <- theme_tufte() + theme_recession

# Add the recession theme to the plot
plt_prop_unemployed_over_time + theme_tufte_recession

To remove the legend, set legend.position = "none" in theme().

Segment plot

Add a geom_segment() layer:

ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), size = 2)

Label the plot appropriately using labs():

Make the title "Highest and lowest life expectancies, 2007".
Add a reference by setting caption to "Source: gapminder".

# Set the color scale (brewer.pal() comes from the RColorBrewer package)
library(RColorBrewer)
palette <- brewer.pal(5, "RdYlBu")[-(2:4)]

# Add a title and caption
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), size = 2) +
  geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
  scale_x_continuous("", expand = c(0,0), limits = c(30,90), position = "top") +
  scale_color_gradientn(colors = palette) +
  labs(title="Highest and lowest life expectancies, 2007",caption="Source: gapminder")
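
The palette line in the code above uses negative indexing to drop the three middle colors of a five-color diverging palette, leaving only the two extremes for scale_color_gradientn() to interpolate between. A small sketch of that step on its own (full_palette is a hypothetical name):

```r
library(RColorBrewer)

# Five-class diverging palette
full_palette <- brewer.pal(5, "RdYlBu")

# Drop entries 2 to 4, keeping only the first and last colors
palette <- full_palette[-(2:4)]
palette
```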

Add a vertical line:

# Add a vertical line
plt_country_vs_lifeExp +
  step_1_themes +
  geom_vline(xintercept=global_mean, color="grey40", linetype=3)

Add an arrow to the plot:

# Add a curve
plt_country_vs_lifeExp +  
  step_1_themes +
  geom_vline(xintercept = global_mean, color = "grey40", linetype = 3) +
  step_3_annotation +
  annotate(
    "curve",
    x = x_start, y = y_start,
    xend = x_end, yend = y_end,
    arrow = arrow(length = unit(0.2, "cm"), type = "closed"),
    color = "grey40"
  )

Use stat_sum() to handle integer data:

ggplot(Vocab, aes(x = education, y = vocabulary)) +
  stat_sum() 

Two refinements are possible. Inside stat_sum(), map size to ..prop.. so that circle size represents the proportion of the whole dataset. Separately, add a scale_size() function to set the range of point sizes from 1 to 10.

# Amend the stat to use proportion sizes
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  stat_sum(aes(size = ..prop..))

ggplot(Vocab, aes(x = education, y = vocabulary)) +
  stat_sum() +
  # Add a size scale, from 1 to 10
  scale_size(range = c(1, 10))

If a few data points overlap, jittering is great. When you have lots of overlaps (particularly where continuous data has been rounded), using stat_sum() to count the overlaps is more useful.
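
To make the counting idea concrete, the sketch below reproduces by hand the kind of summary stat_sum() draws: the number of observations at each (x, y) location and its share of the whole dataset. It uses base R and two integer-valued columns of mtcars instead of Vocab, and it is an illustration, not ggplot2's internal code.

```r
# Count how many observations share each (cyl, gear) pair
counts <- as.data.frame(table(cyl = mtcars$cyl, gear = mtcars$gear))

# Keep only the locations that actually occur in the data
counts <- counts[counts$Freq > 0, ]

# Share of the whole dataset at each location
counts$prop <- counts$Freq / sum(counts$Freq)
counts
```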

Position functions can be stored in variables to shorten the code, e.g. position = posn_jd:

# Add jittering and dodging
p_wt_vs_fcyl_by_fam +
  geom_point(position=posn_jd)
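
The notes do not show how the position objects are defined. One plausible definition is sketched below; the widths are illustrative assumptions, not values confirmed by the notes.

```r
library(ggplot2)

# Assumption: all widths are arbitrary illustrative values
posn_j  <- position_jitter(width = 0.2)
posn_d  <- position_dodge(width = 0.1)
posn_jd <- position_jitterdodge(jitter.width = 0.2, dodge.width = 0.1)
```

Defining the positions once keeps the jitter and dodge widths consistent across every layer that reuses them.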

Add error bars representing the standard deviation.
Set the data function to mean_sdl (without parentheses). To draw 1 standard deviation on each side of the mean, pass arguments to mean_sdl() by assigning them to fun.args as a list.
Use posn_d to set the position.

p_wt_vs_fcyl_by_fam_jit +
  # Add a summary stat of std deviation limits
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), position = posn_d)
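
ggplot2's mean_sdl() wraps a function from the Hmisc package, so Hmisc needs to be installed for the call above to work. The values it returns with mult = 1 can be reproduced by hand; mean_sd_limits below is a hypothetical base R helper written for illustration, not part of ggplot2.

```r
# Mean plus and minus `mult` standard deviations, in the y/ymin/ymax
# layout that stat_summary(fun.data = ...) expects
mean_sd_limits <- function(x, mult = 1) {
  m <- mean(x)
  s <- sd(x)
  data.frame(y = m, ymin = m - mult * s, ymax = m + mult * s)
}

# Summary of car weight for each cylinder count
wt_summary <- do.call(rbind, lapply(split(mtcars$wt, mtcars$cyl), mean_sd_limits))
wt_summary
```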
