Select the following four columns from the counties variable: state, county, population, and poverty.
You don’t need to save the result to a variable.
Instruction:
Select these four columns from the counties variable.
# Select the columns
counties %>%
  select(state, county, population, poverty)
Here you see the counties_selected dataset with a few interesting variables selected. These variables (private_work, public_work, and self_employed) describe whether people work for private companies, for the government, or for themselves.
In these exercises, you’ll sort these observations to find the most interesting cases.
Instruction:
Add a verb to sort the observations by the public_work variable in descending order.
counties_selected <- counties %>%
  select(state, county, population, private_work, public_work, self_employed)
# Add a verb to sort in descending order of public_work
counties_selected %>%
  arrange(desc(public_work))
You use the filter() verb to get only the observations that match a particular condition, or that match multiple conditions.
Instruction 1:
Filter for counties with a population above one million (1000000).
counties_selected <- counties %>%
  select(state, county, population)
# Filter for counties with a population above 1000000
counties_selected %>%
  filter(population > 1000000)
Instruction 2:
Filter for counties in the state of California that also have a population above one million (1000000).
counties_selected <- counties %>%
  select(state, county, population)
# Filter for counties in the state of California that have a population above 1000000
counties_selected %>%
  filter(state == "California" & population > 1000000)
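As a small illustration of matching multiple conditions, the sketch below uses a hypothetical three-row table (toy, not the real counties data) to show that joining conditions with & and separating them with commas inside filter() keep the same rows.

```r
library(dplyr)

# Hypothetical toy data standing in for counties_selected
toy <- tibble(
  state = c("California", "California", "Texas"),
  county = c("A", "B", "C"),
  population = c(2000000, 500000, 3000000)
)

# Conditions joined with & ...
with_and <- toy %>%
  filter(state == "California" & population > 1000000)

# ... keep the same rows as conditions separated by commas
with_comma <- toy %>%
  filter(state == "California", population > 1000000)

identical(with_and, with_comma)
```

Only county A satisfies both conditions, so both versions return that single row.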
We’re often interested in both filtering and sorting a dataset, to focus on the observations of particular interest. Here, you’ll find counties that are extreme examples of what fraction of the population works in the private sector.
Instruction:
counties_selected <- counties %>%
  select(state, county, population, private_work, public_work, self_employed)
# Filter for Texas and more than 10000 people; sort in descending order of private_work
counties_selected %>%
  filter(state == "Texas", population > 10000) %>%
  arrange(desc(private_work))
In the video, you used the unemployment variable, which is a percentage, to calculate the number of unemployed people in each county. In this exercise, you’ll do the same with another percentage variable: public_work.
The code provided already selects the state, county, population, and public_work columns.
Instruction 1:
Use mutate() to add a column called public_workers to the dataset, with the number of people employed in public (government) work.
counties_selected <- counties %>%
  select(state, county, population, public_work)
# Add a new column public_workers with the number of people employed in public work
counties_selected %>%
  mutate(public_workers = population * public_work / 100)
Instruction 2:
counties_selected <- counties %>%
  select(state, county, population, public_work)
# Sort in descending order of the public_workers column
counties_selected %>%
  mutate(public_workers = public_work * population / 100) %>%
  arrange(desc(public_workers))
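To make the percentage arithmetic concrete, here is a minimal sketch on a hypothetical two-row table (the numbers are invented for illustration). Because public_work is a percentage, the head count is the population times the percentage divided by 100.

```r
library(dplyr)

# Hypothetical toy data; public_work is a percentage (0 to 100)
toy <- tibble(
  county = c("A", "B"),
  population = c(1000, 200),
  public_work = c(20, 50)
)

toy_workers <- toy %>%
  mutate(public_workers = population * public_work / 100)

# County A: 1000 * 20 / 100 = 200 people
# County B:  200 * 50 / 100 = 100 people
toy_workers
```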
The dataset includes columns for the total number (not percentage) of men and women in each county. You could use these, along with the population variable, to compute the fraction of men (or women) within each county.
In this exercise, you’ll select the relevant columns yourself.
Instruction:
Select the columns state, county, population, men, and women.
Add proportion_women with the fraction of the county’s population made up of women.
# Select the columns state, county, population, men, and women
counties_selected <- counties %>%
  select(state, county, population, men, women)
# Calculate proportion_women as the fraction of the population made up of women
counties_selected %>%
  mutate(proportion_women = women / population)
In this exercise, you’ll put together everything you’ve learned in this chapter (select(), mutate(), filter(), and arrange()) to find the counties with the highest proportion of men.
Instruction:
Select the columns state, county, population, men, and women.
Add proportion_men with the fraction of the county’s population made up of men.
counties %>%
  # Select the five columns
  select(state, county, population, men, women) %>%
  # Add the proportion_men variable
  mutate(proportion_men = men / population) %>%
  # Filter for population of at least 10,000
  filter(population >= 10000) %>%
  # Arrange proportion of men in descending order
  arrange(desc(proportion_men))
The counties dataset contains columns for region, state, population, and the number of citizens, which we selected and saved as the counties_selected table. In this exercise, you’ll focus on the region column.
counties_selected <- counties %>%
  select(region, state, population, citizens)
Instruction:
Use count() to find the number of counties in each region, using a second argument to sort in descending order.
# Use count to find the number of counties in each region
counties_selected %>%
  count(region, sort = TRUE)
You can weight your count by a particular variable rather than finding the number of counties. In this case, you’ll find the number of citizens in each state.
counties_selected <- counties %>%
  select(region, state, population, citizens)
Instruction:
Count the counties in each state, weighted by the citizens column, and sorted in descending order.
# Find number of counties per state, weighted by citizens
counties_selected %>%
  count(state, wt = citizens, sort = TRUE)
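One way to read a weighted count is as a grouped sum. The sketch below, on a hypothetical three-row table, compares count() with a wt argument against an explicit group_by(), summarize(), and arrange() chain; the two should agree.

```r
library(dplyr)

# Hypothetical toy data: two counties in state X, one in state Y
toy <- tibble(
  state = c("X", "X", "Y"),
  citizens = c(100, 50, 300)
)

# Weighted count: sums citizens within each state
weighted <- toy %>%
  count(state, wt = citizens, sort = TRUE)

# The same result written out by hand
manual <- toy %>%
  group_by(state) %>%
  summarize(n = sum(citizens)) %>%
  arrange(desc(n))

weighted
```

State Y comes first with 300 citizens, followed by state X with 150.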
You can combine multiple verbs to answer increasingly complicated questions about your data. For example: “What are the US states where the most people walk to work?”
You’ll use the walk column, which gives the percentage of people in each county who walk to work, to add a new column and count based on it.
counties_selected <- counties %>%
  select(region, state, population, walk)
Instruction:
Use mutate() to calculate and add a column called population_walk, containing the total number of people who walk to work in a county.
Use count() to find the total number of people who walk to work in each state.
counties_selected %>%
  # Add population_walk containing the total number of people who walk to work
  mutate(population_walk = walk * population / 100) %>%
  # Count weighted by the new column
  count(state, wt = population_walk, sort = TRUE)
The summarize() verb is very useful for collapsing a large dataset into a single observation.
counties_selected <- counties %>%
  select(county, population, income, unemployment)
Instruction:
Summarize the dataset to find min_population (with the smallest population), max_unemployment (with the maximum unemployment), and average_income (with the mean of the income variable).
# Summarize to find minimum population, maximum unemployment, and average income
counties_selected %>%
  summarize(min_population = min(population),
            max_unemployment = max(unemployment),
            average_income = mean(income))
Another interesting column is land_area, which shows the land area in square miles. Here, you’ll summarize both population and land area by state, with the purpose of finding the density (in people per square mile).
counties_selected <- counties %>%
  select(state, county, population, land_area)
Instruction 1:
Group the data by state, and summarize to create the columns total_area (with total area in square miles) and total_population (with total population).
# Group by state and find the total area and population
counties_selected %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population))
Instruction 2:
Add a density column with the people per square mile, then arrange in descending order.
# Add a density column, then sort in descending order
counties_selected %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population)) %>%
  mutate(density = total_population / total_area) %>%
  arrange(desc(density))
You can group by multiple columns instead of grouping by one. Here, you’ll practice aggregating by state and region, and notice how useful it is for performing multiple aggregations in a row.
counties_selected <- counties %>%
  select(region, state, county, population)
Instruction 1:
Summarize to find the population, as a column called total_pop, in each combination of region and state.
# Summarize to find the total population
counties_selected %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population))
Instruction 2:
Use another summarize() to calculate the average state population in each region (average_pop) and the median state population in each region (median_pop).
# Calculate the average_pop and median_pop columns
counties_selected %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population)) %>%
  summarize(average_pop = mean(total_pop),
            median_pop = median(total_pop))
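The reason two summarize() calls in a row work is that the first one drops only the last grouping variable. The sketch below checks this on a hypothetical four-row table using group_vars(); recent dplyr versions also print a message about the remaining grouping, which can be ignored here.

```r
library(dplyr)

# Hypothetical toy data: two states in the West, one in the South
toy <- tibble(
  region = c("West", "West", "West", "South"),
  state = c("CA", "CA", "OR", "TX"),
  population = c(10, 20, 6, 40)
)

# First summarize: one row per state, still grouped by region
by_state <- toy %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population))

group_vars(by_state)  # "region"

# Second summarize: one row per region
by_region <- by_state %>%
  summarize(average_pop = mean(total_pop),
            median_pop = median(total_pop))

by_region
```

In this toy table the West has state totals of 30 and 6, so its average and median are both 18, while the South has a single state with 40.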
Previously, you used the walk column, which gives the percentage of people in each county who walk to work, to add a new column and count the total number of people who walk to work in each state.
Now, you’re interested in finding the county within each region with the highest percentage of citizens who walk to work.
counties_selected <- counties %>%
  select(region, state, county, metro, population, walk)
Instruction:
# Group by region and find the county with the highest percentage of citizens who walk to work
counties_selected %>%
  group_by(region) %>%
  top_n(1, walk)
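As a quick check of what a grouped top_n() returns, the sketch below runs it on a hypothetical four-row table. It also shows slice_max(), which newer dplyr releases (1.0.0 and later, an assumption about your installed version) offer for the same job; the exercises here keep using top_n().

```r
library(dplyr)

# Hypothetical toy data: two counties per region
toy <- tibble(
  region = c("West", "West", "South", "South"),
  county = c("A", "B", "C", "D"),
  walk = c(3.5, 12.0, 7.1, 2.2)
)

# One row per region: the county with the highest walk percentage
top_walk <- toy %>%
  group_by(region) %>%
  top_n(1, walk)

# The same selection with slice_max() (assumes dplyr >= 1.0.0)
top_walk_new <- toy %>%
  group_by(region) %>%
  slice_max(walk, n = 1)

top_walk
```

Both versions keep county B for the West and county C for the South, although the row order of the two results may differ.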
You’ve been learning to combine multiple dplyr verbs together. Here, you’ll combine group_by(), summarize(), and top_n() to find the state in each region with the highest income.
When you group by multiple columns and then summarize, it’s important to remember that summarize() “peels off” the last of the groups but leaves the rest in place. For example, if you group_by(X, Y) then summarize, the result will still be grouped by X.
counties_selected <- counties %>%
  select(region, state, county, population, income)
Instruction:
Calculate the average income (as average_income) of counties within each region and state (notice the group_by() has already been done for you).
counties_selected %>%
  group_by(region, state) %>%
  # Calculate average income
  summarize(average_income = mean(income)) %>%
  # Find the highest income state in each region
  top_n(1, average_income)
In this chapter, you’ve learned to use five dplyr verbs related to aggregation: count(), group_by(), summarize(), ungroup(), and top_n(). In this exercise, you’ll use all of them to answer a question: In how many states do more people live in metro areas than non-metro areas?
Recall that the metro column has one of the two values “Metro” (for high-density city areas) or “Nonmetro” (for suburban and country areas).
counties_selected <- counties %>%
  select(state, metro, population)
Instruction 1:
For each combination of state and metro, find the total population as total_pop.
# Find the total population for each combination of state and metro
counties_selected %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population))
Instruction 2:
Extract the most populated row for each state, which will be either Metro or Nonmetro.
# Extract the most populated row for each state
counties_selected %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  top_n(1, total_pop)
Instruction 3:
# Count the states with more people in Metro or Nonmetro areas
counties_selected %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  top_n(1, total_pop) %>%
  ungroup() %>%
  count(metro)
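The ungroup() step matters because the table is still grouped by state after summarize() and top_n(). The sketch below, on a hypothetical three-state table, contrasts count(metro) with and without ungroup().

```r
library(dplyr)

# Hypothetical toy data: three states, each with a Metro and Nonmetro total
toy <- tibble(
  state = c("X", "X", "Y", "Y", "Z", "Z"),
  metro = c("Metro", "Nonmetro", "Metro", "Nonmetro", "Metro", "Nonmetro"),
  population = c(90, 10, 20, 80, 70, 30)
)

largest <- toy %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  top_n(1, total_pop)

# Still grouped by state: count() tallies within each state (three rows)
still_grouped <- largest %>%
  count(metro)

# After ungroup(): one row per metro value across all states
overall <- largest %>%
  ungroup() %>%
  count(metro)

overall
```

Here two states (X and Z) have more people in Metro areas and one (Y) has more in Nonmetro areas, which only the ungrouped count reports directly.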
Using the select verb, we can answer interesting questions about our dataset by focusing on related groups of variables. The colon (:) is useful for getting many columns at a time.
Instruction:
Use glimpse() to examine all the variables in the counties table.
Select the columns state, county, and population, along with the industry-related columns professional, service, office, construction, and production.
Arrange the table in descending order of service to find which counties have the highest rates of working in the service industry.
# Glimpse the counties table
glimpse(counties)
counties %>%
  # Select state, county, population, and industry-related columns
  select(state, county, population, professional, service, office, construction, production) %>%
  # Arrange service in descending order
  arrange(desc(service))
In the video you learned about the select helper starts_with(). Another select helper is ends_with(), which finds the columns that end with a particular string.
Instruction:
Select the columns state, county, population, and all those that end with work.
counties %>%
  # Select the state, county, population, and those ending with "work"
  select(state, county, population, ends_with("work")) %>%
  # Filter for counties that have at least 50% of people engaged in public work
  filter(public_work >= 50)
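To see how the select helpers differ, the sketch below applies ends_with(), starts_with(), and contains() to a hypothetical one-row table. The column names are chosen for illustration and are only assumed to resemble those in counties.

```r
library(dplyr)

# Hypothetical one-row table with a few work- and poverty-related columns
toy <- tibble(
  state = "X", county = "A",
  private_work = 70, public_work = 20, self_employed = 10,
  poverty = 12, child_poverty = 15
)

# Columns whose names end with "work"
work_cols <- names(select(toy, ends_with("work")))

# Columns whose names start with "p"
p_cols <- names(select(toy, starts_with("p")))

# Columns whose names contain "poverty" anywhere
poverty_cols <- names(select(toy, contains("poverty")))

work_cols     # "private_work" "public_work"
p_cols        # "private_work" "public_work" "poverty"
poverty_cols  # "poverty" "child_poverty"
```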
The rename() verb is often useful for changing the name of a column that comes out of another verb, such as count(). In this exercise, you’ll rename the n column from count() (which you learned about in Chapter 2) to something more descriptive.
Instruction 1:
Use count() to determine how many counties are in each state.
# Count the number of counties in each state
counties %>%
  count(state)
Instruction 2:
Notice the n column in the output; use rename() to rename that to num_counties.
# Rename the n column to num_counties
counties %>%
  count(state) %>%
  rename(num_counties = n)
rename() isn’t the only way you can choose a new name for a column: you can also choose a name as part of a select().
Instruction:
Select the columns state, county, and poverty from the counties dataset; in the same step, rename the poverty column to poverty_rate.
# Select state, county, and poverty as poverty_rate
counties %>%
  select(state, county, poverty_rate = poverty)
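The practical difference between the two renaming approaches is which columns survive. In the sketch below, on a hypothetical one-row table, rename() keeps every column while select() keeps only the ones it names.

```r
library(dplyr)

# Hypothetical one-row table
toy <- tibble(state = "X", county = "A", population = 100, poverty = 12)

# rename() changes the name and keeps all four columns
renamed <- toy %>%
  rename(poverty_rate = poverty)

# select() changes the name and drops everything not listed (population)
selected <- toy %>%
  select(state, county, poverty_rate = poverty)

names(renamed)   # "state" "county" "population" "poverty_rate"
names(selected)  # "state" "county" "poverty_rate"
```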
As you learned in the video, the transmute verb allows you to control which variables you keep, which variables you calculate, and which variables you drop.
Instruction:
Keep only the state, county, and population columns, and add a new column, density, that contains the population per land_area.
counties %>%
  # Keep the state, county, and population columns, and add a density column
  transmute(state, county, population, density = population / land_area) %>%
  # Filter for counties with a population greater than one million
  filter(population > 1000000) %>%
  # Sort density in ascending order
  arrange(density)
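One way to think about transmute() is as a mutate() followed by a select() of the columns you named. The sketch below checks that equivalence on a hypothetical two-row table; the income column is included only to show that unlisted columns are dropped.

```r
library(dplyr)

# Hypothetical toy data with one column (income) that should be dropped
toy <- tibble(
  state = c("X", "Y"),
  county = c("A", "B"),
  population = c(1000, 500),
  land_area = c(10, 50),
  income = c(40000, 50000)
)

# transmute(): keep the listed columns and compute density in one step
via_transmute <- toy %>%
  transmute(state, county, population, density = population / land_area)

# The same result as mutate() plus select()
via_mutate <- toy %>%
  mutate(density = population / land_area) %>%
  select(state, county, population, density)

via_transmute
```

County A has a density of 1000 / 10 = 100 and county B has 500 / 50 = 10; land_area and income appear in neither result.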
In this chapter you’ve learned about the four verbs: select, mutate, transmute, and rename. Here, you’ll choose the appropriate verb for each situation. You won’t need to change anything inside the parentheses.
Instruction:
Change the name of the unemployment column to unemployment_rate.
Keep only the columns state, county, and the ones containing poverty.
Calculate a new column called fraction_women with the fraction of the population made up of women, without dropping any columns.
Keep only three columns: state, county, and employed / population, which you’ll call employment_rate.
# Change the name of the unemployment column
counties %>%
  rename(unemployment_rate = unemployment)
# Keep the state and county columns, and the columns containing poverty
counties %>%
  select(state, county, contains("poverty"))
# Calculate the fraction_women column without dropping the other columns
counties %>%
  mutate(fraction_women = women / population)
# Keep only the state, county, and employment_rate columns
counties %>%
  transmute(state, county, employment_rate = employed / population)
The dplyr verbs you’ve learned are useful for exploring data. For instance, you could find out the most common names in a particular year.
Instruction:
babynames %>%
  # Filter for the year 1990
  filter(year == 1990) %>%
  # Sort the number column in descending order
  arrange(desc(number))
You saw that you could use filter() and arrange() to find the most common names in one year. However, you could also use group_by() and top_n() to find the most common name in every year.
Instruction:
Use group_by() and top_n() to find the most common name for US babies in each year.
# Find the most common name in each year
babynames %>%
  group_by(year) %>%
  top_n(1, number)
The dplyr package is very useful for exploring data, but it’s especially useful when combined with other tidyverse packages like ggplot2.
Instruction 1:
Filter for the names Steven, Thomas, and Matthew, and save the result as selected_names.
# Filter for the names Steven, Thomas, and Matthew
selected_names <- babynames %>%
  filter(name %in% c("Steven", "Thomas", "Matthew"))
Instruction 2:
# Plot the names using a different color for each name
ggplot(selected_names, aes(x = year, y = number, color = name)) +
  geom_line()
In an earlier video, you learned how to filter for a particular name to determine the frequency of that name over time. Now, you’re going to explore in which year each name was the most common.
To do this, you’ll be combining the grouped mutate approach with a top_n().
Instruction:
# Find the year each name is most common
babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total) %>%
  group_by(name) %>%
  top_n(1, fraction)
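To trace what the grouped mutate does at each step, the sketch below runs the same pipeline on a hypothetical four-row table with two names and two years. The counts are invented so the fractions are easy to verify by hand.

```r
library(dplyr)

# Hypothetical toy data: each year has 100 babies in total
toy <- tibble(
  year = c(1990, 1990, 2000, 2000),
  name = c("Ann", "Bob", "Ann", "Bob"),
  number = c(30, 70, 10, 90)
)

peak <- toy %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%      # 100 for every row
  ungroup() %>%
  mutate(fraction = number / year_total) %>% # 0.3, 0.7, 0.1, 0.9
  group_by(name) %>%
  top_n(1, fraction)                         # best year per name

peak
```

Ann peaks in 1990 (fraction 0.3 versus 0.1) and Bob peaks in 2000 (0.9 versus 0.7).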
In the video, you learned how you could group by the year and use mutate() to add a total for that year.
In these exercises, you’ll learn to normalize by a different, but also interesting, metric: you’ll divide each name by the maximum for that name. This means that every name will peak at 1.
Once you add new columns, the result will still be grouped by name. This splits it into 48,000 groups, which makes later steps like mutate() slower.
Instruction 1:
Use a grouped mutate to add two columns:
name_total, with the total number of babies born with that name in the entire dataset.
name_max, with the highest number of babies born in any year.
# Add columns name_total and name_max for each name
babynames %>%
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number))
Instruction 2:
babynames %>%
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number)) %>%
  # Ungroup the table
  ungroup() %>%
  # Add the fraction_max column containing the number by the name maximum
  mutate(fraction_max = number / name_max)
You picked a few names and calculated each of them as a fraction of their peak. This is a type of “normalizing” a name, where you’re focused on the relative change within each name rather than the overall popularity of the name.
In this exercise, you’ll visualize the normalized popularity of each name. Your work from the previous exercise, names_normalized, has been provided for you.
names_normalized <- babynames %>%
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number)) %>%
  ungroup() %>%
  mutate(fraction_max = number / name_max)
Instruction:
Filter the names_normalized table to limit it to the three names Steven, Thomas, and Matthew.
Visualize fraction_max for those names over time.
# Filter for the names Steven, Thomas, and Matthew
names_filtered <- names_normalized %>%
  filter(name %in% c("Steven", "Thomas", "Matthew"))
# Visualize these names over time
ggplot(names_filtered, aes(x = year, y = fraction_max, color = name)) +
  geom_line()
In the video, you learned how to find the difference in the frequency of a baby name between consecutive years. What if instead of finding the difference, you wanted to find the ratio?
You’ll start with the babynames_fraction data, so that you can consider the popularity of each name within each year.
Instruction:
babynames_fraction %>%
  # Arrange the data in order of name, then year
  arrange(name, year) %>%
  # Group the data by name
  group_by(name) %>%
  # Add a ratio column that contains the ratio between each year
  mutate(ratio = fraction / lag(fraction))
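The sketch below shows what lag() does inside a grouped mutate, using a hypothetical five-row table. Within each name, lag(fraction) is the previous row's fraction, so the first row of every name has no predecessor and its ratio is NA.

```r
library(dplyr)

# Hypothetical toy data: fractions for two names across a few years
toy <- tibble(
  name = c("Ann", "Ann", "Ann", "Bob", "Bob"),
  year = c(1990, 2000, 2010, 1990, 2000),
  fraction = c(0.01, 0.02, 0.01, 0.04, 0.01)
)

ratios <- toy %>%
  arrange(name, year) %>%
  group_by(name) %>%
  # Ratio of each year's fraction to the previous year's, within a name
  mutate(ratio = fraction / lag(fraction))

# Ann: NA, 0.02 / 0.01 = 2, 0.01 / 0.02 = 0.5
# Bob: NA, 0.01 / 0.04 = 0.25
ratios
```

Grouping by name before the mutate is what stops Bob's first row from being compared with Ann's last row.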
Previously, you added a ratio column to describe how the frequency of a baby name changes between consecutive years. Now, you’ll look at a subset of that data, called babynames_ratios_filtered, to look further into the names that experienced the biggest jumps in popularity in consecutive years.
babynames_ratios_filtered <- babynames_fraction %>%
  arrange(name, year) %>%
  group_by(name) %>%
  mutate(ratio = fraction / lag(fraction)) %>%
  filter(fraction >= 0.00001)
Instruction:
Extract the largest ratio from each name; note the data is already grouped by name.
Sort the ratio column in descending order.
Filter the babynames_ratios_filtered data further by filtering the fraction column to only display results greater than or equal to 0.001.
babynames_ratios_filtered %>%
  # Extract the largest ratio from each name
  top_n(1, ratio) %>%
  # Sort the ratio column in descending order
  arrange(desc(ratio)) %>%
  # Filter for fractions greater than or equal to 0.001
  filter(fraction >= 0.001)