Introduction
Exploring ggplot2, part 1
To get a first feel for ggplot2, let’s try to run some basic ggplot2 commands. Together, they build a plot of the mtcars dataset, which contains information about 32 cars from the 1974 Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables.
Instructions
- Load the ggplot2 package using library().
- Use str() to explore the structure of the mtcars dataset.
- Execute the example code below. See if you can understand what ggplot does with the data.
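The example code itself is not reproduced in these notes. A minimal sketch of what such a first plot could look like, assuming cyl on the x-axis and mpg on the y-axis (the object name first_plot is hypothetical):

```r
library(ggplot2)

# Inspect the structure of the built-in mtcars data frame
str(mtcars)

# Assumed example: cylinders on the x-axis, miles per gallon on the y-axis
first_plot <- ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point()
first_plot
```

Because cyl is stored as a number, the x-axis is drawn as a continuous scale, including values such as 5 and 7 that never occur in the data.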
Exploring ggplot2, part 2
The plot from the previous exercise wasn’t really satisfying. Although cyl (the number of cylinders) is categorical, it is classified as numeric in mtcars. You’ll have to explicitly tell ggplot2 that cyl is a categorical variable.
Instructions
- Change the ggplot() command by wrapping factor() around cyl.
- Execute the code and see if the resulting plot is better this time.
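One way this change might look, as a sketch that assumes the same cyl-versus-mpg plot as before (factor_plot is a hypothetical name):

```r
library(ggplot2)

# Wrapping cyl in factor() makes ggplot2 treat it as categorical,
# so the x-axis only shows the levels 4, 6 and 8
factor_plot <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point()
factor_plot
```

The conversion happens only inside the plot call; the cyl column in mtcars itself stays numeric.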
Exploring ggplot2, part 3
We’ll use several datasets throughout these courses to showcase the concepts discussed in the videos. In the previous exercises, you already got to know mtcars. Let’s dive a little deeper to explore the three main topics in this course: the data, aesthetics, and geom layers.
The mtcars dataset contains information about 32 cars from the 1974 Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables.
You’re encouraged to think about how the examples and concepts we discuss throughout these data viz courses apply to your own datasets!
Instructions
ggplot2 has already been loaded for you. Take a look at the first command. It plots mpg (miles per gallon) against wt (weight in thousands of pounds). You don’t have to change anything about this command.
- In the second call of ggplot(), change the color argument in aes() (which stands for aesthetics). The color should depend on the displacement of the car engine, found in disp.
- In the third call of ggplot(), change the size argument in aes(). The size should depend on the displacement of the car engine, found in disp.
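The three commands are not shown in these notes. A sketch of how they might look once completed (the object names are hypothetical):

```r
library(ggplot2)

# First command: mpg against wt, nothing to change
p_basic <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Second command: colour depends on engine displacement
p_color <- ggplot(mtcars, aes(x = wt, y = mpg, color = disp)) +
  geom_point()

# Third command: point size depends on engine displacement
p_size <- ggplot(mtcars, aes(x = wt, y = mpg, size = disp)) +
  geom_point()

p_color
p_size
```

Since disp is continuous, ggplot2 picks a colour gradient for the second plot and a continuous size scale for the third.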
Understanding Variables
In the previous exercise you saw that disp can be mapped onto a color gradient or onto a continuous size scale.
Another argument of aes() is the shape of the points. There are a finite number of shapes which ggplot() can automatically assign to the points. However, if you try this command in the console to the right:
It gives an error. What does this mean?
Instructions
Possible Answers
- shape is not a defined argument.
- shape only makes sense with categorical data, and disp is continuous.
- shape only makes sense with continuous data, and disp is categorical.
- shape is not a variable in your dataset.
- shape has to be defined as a function.
Exploring ggplot2, part 4
The diamonds data frame contains information on the prices and various metrics of nearly 54,000 diamonds. Among the variables included are carat (a measurement of the weight of the diamond) and price. For the next exercises, you’ll be using a subset of 1,000 diamonds.
Here you’ll use two common geom layer functions: geom_point() and geom_smooth(). We already saw in the earlier exercises how these are added using the + operator.
Instructions
- Explore the diamonds data frame with the str() function.
- Use the + operator to add geom_point() to the first ggplot() command. This will tell ggplot2 to draw points on the plot.
- Use the + operator to add geom_point() and geom_smooth(). These just stack on each other! geom_smooth() will draw a smoothed line over the points.
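A rough sketch of these steps. The 1,000-row subset used in the exercise is not available here, so this draws its own random sample; the seed and the name diamonds_sub are assumptions made for illustration:

```r
library(ggplot2)

# Assumption: draw our own 1,000-row sample from the full diamonds data
set.seed(1)
diamonds_sub <- diamonds[sample(nrow(diamonds), 1000), ]
str(diamonds_sub)

# Points only
p_points <- ggplot(diamonds_sub, aes(x = carat, y = price)) +
  geom_point()

# Points with a smoothed line stacked on top
p_both <- ggplot(diamonds_sub, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth()

p_points
p_both
```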
Exploring ggplot2, part 5
The code for the last plot of the previous exercise is available in the script on the right. It builds a scatter plot of the diamonds dataset, with carat on the x-axis and price on the y-axis. geom_smooth() is used to add a smooth line.
With this plot as a starting point, let’s explore some more possibilities of combining geoms.
Instructions
- Plot 2 - Copy and paste plot 1, but show only the smooth line, no points.
- Plot 3 - Show only the smooth line, but color according to clarity by placing the argument color = clarity in the aes() function of your ggplot() call.
- Plot 4 - Draw translucent colored points.
- Copy the ggplot() command from plot 3 (with clarity mapped to color).
- Remove the smooth layer.
- Add the points layer back in.
- Set alpha = 0.4 inside geom_point(). This will make the points 40% opaque (that is, 60% transparent).
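The four plots could be sketched as follows. This version uses the full diamonds data rather than the exercise’s subset, and the names plot1 to plot4 are hypothetical:

```r
library(ggplot2)

# Plot 1: points plus smooth line (the starting point)
plot1 <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth()

# Plot 2: smooth line only
plot2 <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_smooth()

# Plot 3: one smooth line per clarity level, coloured by clarity
plot3 <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_smooth()

# Plot 4: translucent points coloured by clarity (alpha = 0.4 is 40% opaque)
plot4 <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.4)

plot4
```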
Understanding the grammar, part 1
Here you’ll explore some of the different grammatical elements. Throughout this course, you’ll discover how they can be combined in all sorts of ways to develop unique plots.
In the following instructions, you’ll start by creating a ggplot object from the diamonds dataset. Next, you’ll add layers onto this object to build beautiful & informative plots.
Instructions
- Define the data (diamonds) and aesthetics layers. Map carat on the x axis and price on the y axis. Assign it to an object: dia_plot.
- Using +, add a geom_point() layer (with no arguments) to the dia_plot object. This can be in a single or multiple lines.
- Note that you can also call aes() within the geom_point() function. Map clarity to the color argument in this way.
Understanding the grammar, part 2
Continuing with the previous exercise, here you’ll explore mixing arguments and aesthetics in a single geometry.
You’re still working on the diamonds dataset.
Instructions
- The dia_plot object has been created for you.
- Update dia_plot so that it contains all the functions to make a scatter plot by using geom_point() for the geom layer. Set alpha = 0.2.
- Using +, plot the dia_plot object with a geom_smooth() layer on top. You don’t want any error shading, which can be achieved by setting se = FALSE in geom_smooth().
- Modify the geom_smooth() function from the previous instruction so that it contains aes() and map clarity to the col argument.
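A sketch of how these steps might fit together. It rebuilds dia_plot from the full diamonds data so that it runs on its own; smooth_all and smooth_by_clarity are hypothetical names:

```r
library(ggplot2)

# Data and aesthetics layers, as in the previous exercise
dia_plot <- ggplot(diamonds, aes(x = carat, y = price))

# Update the object with a translucent points layer
dia_plot <- dia_plot + geom_point(alpha = 0.2)

# A single smooth line without the error shading
smooth_all <- dia_plot + geom_smooth(se = FALSE)

# One smooth line per clarity level: the mapping lives inside the geom
smooth_by_clarity <- dia_plot + geom_smooth(aes(col = clarity), se = FALSE)

smooth_by_clarity
```

Here alpha and se are set as plain arguments, while col is mapped to a variable through aes(), all within the same geometry.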
Data
base package and ggplot2, part 1 - plot
These courses are about understanding data visualization in the context of the grammar of graphics. To gain a better appreciation of ggplot2 and to understand how it operates differently from the base package, it’s useful to make some comparisons.
First, let’s focus on the base package. You want to make a plot of mpg (miles per gallon) against wt (weight in thousands of pounds) in the mtcars data frame, but this time you want the dots colored according to the number of cylinders, cyl. How would you do that in the base package? You can use a little trick to color the dots by specifying a factor variable as a color. This works because factors are just a special class of the integer type.
Instructions
- Using the base package plot(), make a scatter plot with mtcars$wt on the x-axis and mtcars$mpg on the y-axis, colored according to mtcars$cyl (use the col argument). You could use the formula interface with data = instead, but you’ll just do it the long way here.
- Add a new column, fcyl, to the mtcars data frame. This should be cyl converted to a factor.
- Create a similar plot to instruction 1, but this time, use fcyl (which is cyl as a factor) to set the col.
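These three steps might look like this. The sketch works on a copy named cars (a hypothetical name) so that the built-in mtcars stays untouched:

```r
# Work on a copy so the built-in data frame is not modified
cars <- mtcars

# Numeric cyl: the values 4, 6 and 8 are used directly as palette indices
plot(cars$wt, cars$mpg, col = cars$cyl)

# Add cyl as a factor
cars$fcyl <- as.factor(cars$cyl)

# Factor cyl: the underlying integer codes 1, 2 and 3 pick the colours
plot(cars$wt, cars$mpg, col = cars$fcyl)
```

The two plots use different colours because the numeric version indexes the palette with 4, 6 and 8, whereas the factor version uses its integer codes.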
base package and ggplot2, part 2 - lm
If you want to add a linear model to your plot, shown right, you can define it with lm() and then plot the resulting linear model with abline(). However, if you want a model for each subgroup, according to cylinders, then you have a couple of options.
You can subset your data, and then calculate the lm() and plot each subset separately. Alternatively, you can vectorize over the cyl variable using lapply() and combine this all in one step. This option is already prepared for you.
The code below contains a call to the function lapply(), which you might not have seen before. This function takes as input a vector and a function. Then lapply() applies the function it was given to each element of the vector and returns the results in a list. In this case, lapply() takes each element of mtcars$cyl and calls the function defined in the second argument. This function takes a value of mtcars$cyl and then subsets the data so that only rows with cyl == x are used. Then it fits a linear model to the filtered dataset and uses that model to add a line to the plot with the abline() function.
Now that you have an interesting plot, there is a very important aspect missing - the legend!
In the base package you have to take care of this using the legend() function. This has been done for you in the predefined code.
Instructions
- Fill in the lm() function to calculate a linear model of mpg described by wt and save it as an object called carModel.
- Draw the linear model on the scatterplot.
- Write code that calls abline() with carModel as the first argument. Set the line type by passing the argument lty = 2.
- Run the code that generates the basic plot and the call to abline() all at once by highlighting both parts of the script and hitting control/command + enter on your keyboard. These lines must all be run together in the console so that R will be able to find the plot you want to add a line to.
- Run the code already given to generate the plot with a different model for each group. You don’t need to modify any of this.
# Use lm() to calculate a linear model and save it as carModel
___ <- lm(___ ~ ___, data = mtcars)
# Basic plot
mtcars$cyl <- as.factor(mtcars$cyl)
plot(mtcars$wt, mtcars$mpg, col = mtcars$cyl)
# Call abline() with carModel as first argument and set lty to 2
___(___, lty = ___)
# Plot each subset efficiently with lapply
# You don't have to edit this code
plot(mtcars$wt, mtcars$mpg, col = mtcars$cyl)
lapply(mtcars$cyl, function(x) {
  abline(lm(mpg ~ wt, mtcars, subset = (cyl == x)), col = x)
})
# This code will draw the legend of the plot
# You don't have to edit this code
legend(x = 5, y = 33, legend = levels(mtcars$cyl),
       col = 1:3, pch = 1, bty = "n")
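The lapply() call in the template visits all 32 elements of mtcars$cyl, so each of the three lines is fitted and drawn many times. As a hedged alternative, not the exercise’s own solution, you could loop over the three factor levels instead. This sketch works on a copy named cars (hypothetical):

```r
# Work on a copy and convert cyl to a factor
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# Linear model of mpg described by wt, for all cars
carModel <- lm(mpg ~ wt, data = cars)

# Basic plot plus the overall model as a dashed line
plot(cars$wt, cars$mpg, col = cars$cyl)
abline(carModel, lty = 2)

# One model per cylinder level; the colour index i matches the
# integer code of the factor level, so lines and points agree
for (i in seq_along(levels(cars$cyl))) {
  lvl <- levels(cars$cyl)[i]
  abline(lm(mpg ~ wt, data = cars, subset = (cyl == lvl)), col = i)
}

legend(x = 5, y = 33, legend = levels(cars$cyl),
       col = 1:3, pch = 1, bty = "n")
```

The resulting picture should match the template’s, with only three model fits instead of 32.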
base package and ggplot2, part 3
In this exercise you’ll recreate the base package plot in ggplot2.
The code for base R plotting is given at the top. The first line of code already converts the cyl variable of mtcars to a factor.
Instructions
- Plot 1: add geom_point() in order to make a scatter plot.
- Plot 2: copy and paste Plot 1.
- Add a linear model for each subset according to cyl by adding a geom_smooth() layer.
- Inside this geom_smooth(), set method to "lm" and se to FALSE. Note: geom_smooth() will automatically draw a line per cyl subset. It recognizes the groups you want to identify by color in the aes() call within the ggplot() command.
- Plot 3: copy and paste Plot 2.
- Plot a linear model for the entire dataset by adding another geom_smooth() layer.
- Set the group aesthetic inside this geom_smooth() layer to 1. This has to be set within the aes() function.
- Set method to "lm", se to FALSE and linetype to 2. These have to be set outside aes() of the geom_smooth().
Note: the group aesthetic will tell ggplot() to draw a single linear model through all the points.
# Convert cyl to factor (don't need to change)
mtcars$cyl <- as.factor(mtcars$cyl)
# Example from base R (don't need to change)
plot(mtcars$wt, mtcars$mpg, col = mtcars$cyl)
abline(lm(mpg ~ wt, data = mtcars), lty = 2)
lapply(mtcars$cyl, function(x) {
  abline(lm(mpg ~ wt, mtcars, subset = (cyl == x)), col = x)
})
legend(x = 5, y = 33, legend = levels(mtcars$cyl),
       col = 1:3, pch = 1, bty = "n")
# Plot 1: add geom_point() to this command to create a scatter plot
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  ___ # Fill in using instructions Plot 1
# Plot 2: include the lines of the linear models, per cyl
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  ___ + # Copy from Plot 1
  ___ # Fill in using instructions Plot 2
# Plot 3: include an lm for the entire dataset as a whole
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  ___ + # Copy from Plot 2
  ___ + # Copy from Plot 2
  ___ # Fill in using instructions Plot 3
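One possible way to fill in the three blanks, sketched on a copy of the data (cars and the plot names are hypothetical):

```r
library(ggplot2)

# Work on a copy with cyl as a factor
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# Plot 1: scatter plot coloured by cyl
plot1 <- ggplot(cars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point()

# Plot 2: one linear model per cyl group (grouping follows the col mapping)
plot2 <- ggplot(cars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Plot 3: add a single dashed model through all points via group = 1
plot3 <- ggplot(cars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(aes(group = 1), method = "lm", se = FALSE, linetype = 2)

plot3
```

Note that ggplot2 produces the legend on its own, whereas the base version needed an explicit legend() call.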
ggplot2 compared to base package
ggplot2 has become very popular and for many people it’s the go-to plotting package in R. What does ggplot2 do that the base package doesn’t?
Answer the question
Possible Answers
- ggplot2 creates plotting objects, which can be manipulated.
- ggplot2 takes care of a lot of the leg work for you, such as choosing nice color palettes and making legends.
- ggplot2 is built upon the grammar of graphics plotting philosophy, making it more flexible and intuitive for understanding the relationship between your visuals and your data.
- Options 1, 2, and 3.
- ggplot2 is effectively a replacement for all base-package plotting functions.
Plotting the ggplot2 way
There are different ggplot2 calls to plot two groups of data onto the same plot:
# Option 1
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  geom_point(aes(x = Petal.Length, y = Petal.Width), col = "red")
# Data for Option 2: one row per flower part, with Length and Width columns
iris.wide <- rbind(
  data.frame(Species = iris$Species, Part = "Sepal", Length = iris$Sepal.Length, Width = iris$Sepal.Width),
  data.frame(Species = iris$Species, Part = "Petal", Length = iris$Petal.Length, Width = iris$Petal.Width))
# Option 2
ggplot(iris.wide, aes(x = Length, y = Width, col = Part)) +
  geom_point()
Which one is preferable? Both iris and iris.wide are available in the workspace, so you can experiment in the R Console straight away!
Instructions
Possible Answers
- Option 1.
- Option 2.
- Both are equally preferable.
Variables to visuals, part 1
iris.wide2 <- rbind(
data.frame(Measure = "Length", Part = "Petal", Setosa = iris[iris$Species == "setosa","Petal.Length"], Versicolor = iris[iris$Species == "versicolor","Petal.Length"], Virginica = iris[iris$Species == "virginica","Petal.Length"]),
data.frame(Measure = "Width", Part = "Petal", Setosa = iris[iris$Species == "setosa","Petal.Width"], Versicolor = iris[iris$Species == "versicolor","Petal.Width"], Virginica = iris[iris$Species == "virginica","Petal.Width"]),
data.frame(Measure = "Length", Part = "Sepal", Setosa = iris[iris$Species == "setosa","Sepal.Length"], Versicolor = iris[iris$Species == "versicolor","Sepal.Length"], Virginica = iris[iris$Species == "virginica","Sepal.Length"]),
data.frame(Measure = "Width", Part = "Sepal", Setosa = iris[iris$Species == "setosa","Sepal.Width"], Versicolor = iris[iris$Species == "versicolor","Sepal.Width"], Virginica = iris[iris$Species == "virginica","Sepal.Width"]))
iris.tidy <- rbind(
  data.frame(Species = iris$Species, Part = "Sepal", Measure = "Length", Value = iris$Sepal.Length),
  data.frame(Species = iris$Species, Part = "Sepal", Measure = "Width", Value = iris$Sepal.Width),
  data.frame(Species = iris$Species, Part = "Petal", Measure = "Length", Value = iris$Petal.Length),
  data.frame(Species = iris$Species, Part = "Petal", Measure = "Width", Value = iris$Petal.Width))
So far you’ve seen four different forms of the iris dataset: iris, iris.wide, iris.wide2 and iris.tidy. Don’t let all these different forms confuse you! It’s exactly the same data, just rearranged so that your plotting functions become easier.
To see this in action, consider the plot in the graphics device at right. Which form of the dataset would be the most appropriate to use here?
Instructions
- Look at the structures of iris, iris.wide and iris.tidy using str().
- Fill in the ggplot function with the appropriate data frame and variable names. The variable names of the aesthetics of the plot will match the ones you found using the str() command in the previous step.
Variables to visuals, part 1b
In the last exercise you saw how iris.tidy was used to make a specific plot. It’s important to know how to rearrange your data in this way so that your plotting functions become easier. In this exercise you’ll use functions from the tidyr package to convert iris to iris.tidy.
The resulting iris.tidy data should look as follows:
Species Part Measure Value
1 setosa Sepal Length 5.1
2 setosa Sepal Length 4.9
3 setosa Sepal Length 4.7
4 setosa Sepal Length 4.6
5 setosa Sepal Length 5.0
6 setosa Sepal Length 5.4
...
You can have a look at the iris dataset by typing head(iris) in the console.
Note: If you’re not familiar with %>%, gather() and separate(), you may want to take the Cleaning Data in R course. In a nutshell, a dataset is called tidy when every row is an observation and every column is a variable. The gather() function moves information from the columns to the rows. It takes multiple columns and gathers them into a single column by adding rows. The separate() function splits one column into two or more columns according to a pattern you define. Lastly, the %>% (or “pipe”) operator passes the result of the left-hand side as the first argument of the function on the right-hand side.
Instructions
You’ll use two functions from the tidyr package:
- gather() rearranges the data frame by specifying the columns that are categorical variables with a - notation. Complete the command. Notice that only one variable is categorical in iris.
- separate() splits up the new key column, which contains the former headers, at the period (.) in each name. The new column names “Part” and “Measure” are given in a character vector. Don’t forget the quotes.
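A sketch of the completed pipeline. The intermediate column name key is an assumption, and the regular expression "\\." is used here so that the period is matched literally:

```r
library(tidyr)

iris.tidy <- iris %>%
  # Gather every column except Species into key/Value pairs
  gather(key, Value, -Species) %>%
  # Split names such as "Sepal.Length" into Part and Measure
  separate(key, c("Part", "Measure"), "\\.")

head(iris.tidy)
```

In current tidyr releases, gather() is superseded by pivot_longer(); it still works, which is why this sketch keeps the functions named in the exercise.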
Variables to visuals, part 2
Here you’ll take a look at another plot variant. Which of your data frames would be used to produce this plot?

Instructions
- Look at the heads of iris, iris.wide and iris.tidy using head().
- Fill in the ggplot function with the appropriate data frame and variable names. The names of the aesthetics of the plot will match with variable names in your dataset. The previous instruction will help you match variable names in datasets with the ones in the plot.
Variables to visuals, part 2b
You saw previously how you can derive iris.tidy from iris. Now you’ll move on to produce iris.wide.
The head of iris.wide should look like this in the end:
Species Part Length Width
1 setosa Petal 1.4 0.2
2 setosa Petal 1.4 0.2
3 setosa Petal 1.3 0.2
4 setosa Petal 1.5 0.2
5 setosa Petal 1.4 0.2
6 setosa Petal 1.7 0.4
...
You can have a look at the iris dataset by typing head(iris) in the console.
Instructions
- Before you begin, you need to add a new column called Flower that contains a unique identifier for each row in the data frame. This is because you’ll rearrange the data frame afterwards and you need to keep track of which row, or which specific flower, each value came from. It’s done for you, no need to add anything yourself.
- gather() rearranges the data frame by specifying the columns that are categorical variables with a - notation. In this case, Species and Flower are categorical. Complete the command.
- separate() splits up the new key column, which contains the former headers, at the period (.) in each name. The new column names "Part" and "Measure" are given in a character vector.
- The last step is to use spread() to distribute the new Measure column and associated value column into two columns.
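These steps might be chained as follows. The sketch works on a copy named iris_id (hypothetical) and assumes the intermediate column names key and value:

```r
library(tidyr)

# Copy iris and add a unique identifier for each flower
iris_id <- iris
iris_id$Flower <- 1:nrow(iris_id)

iris.wide <- iris_id %>%
  # Gather the four measurement columns, keeping Species and Flower
  gather(key, value, -Species, -Flower) %>%
  # Split names such as "Petal.Width" into Part and Measure
  separate(key, c("Part", "Measure"), "\\.") %>%
  # Spread Measure back out into Length and Width columns
  spread(Measure, value)

head(iris.wide)
```

Without the Flower column, spread() could not tell which Length belongs to which Width, because many rows would share the same Species and Part.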
Aesthetics
All about aesthetics, part 1
There are 9 different aesthetics that can be mapped:
- x: the X-axis position
- y: the Y-axis position
- color: the color of dots and the outline of other shapes
- fill: fill color
- size: diameter of points, thickness of lines
- alpha: the transparency, 0-transparent, 1-opaque
- linetype: line dash pattern
- label: text on a plot or axes
- shape: the shape
Let’s apply them to a categorical variable - the cylinders in mtcars, cyl.
(You’ll consider line type when you encounter line plots in the next chapter).
These are the aesthetics you can consider within aes() in this chapter: x, y, color, fill, size, alpha, label and shape.
In the following exercise you can assume that the cyl column is categorical. It has already been transformed into a factor for you.
Instructions
The mtcars data frame is available in your workspace. For each of the following four plots, use geom_point():
- Map mpg onto the x aesthetic, and cyl onto the y.
- Reverse the mappings of the first plot.
- Map wt onto x, mpg onto y, and cyl onto color.
- Modify the previous plot by changing the shape argument of the geom to 1 and increase the size to 4. These are attributes that you should specify inside geom_point().
library(datasets)
library(ggplot2)
mtcars$cyl <- as.factor(mtcars$cyl)
# 1 - Map mpg to x and cyl to y
ggplot(___, aes(___, ___)) +
  geom_point()
# 2 - Reverse: Map cyl to x and mpg to y
ggplot(___, aes(___, ___)) +
  geom_point()
# 3 - Map wt to x, mpg to y and cyl to col
ggplot(___, aes(___, ___, ___)) +
  geom_point()
# 4 - Change shape and size of the points in the above plot
ggplot(mtcars, aes(___, ___, ___)) +
  geom_point(___, ___)
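A sketch of the template with the blanks filled in, using a copy of the data (cars and the names aes1 to aes4 are hypothetical):

```r
library(ggplot2)

# Work on a copy with cyl as a factor
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# 1 - mpg on x, cyl on y
aes1 <- ggplot(cars, aes(x = mpg, y = cyl)) +
  geom_point()

# 2 - reversed: cyl on x, mpg on y
aes2 <- ggplot(cars, aes(x = cyl, y = mpg)) +
  geom_point()

# 3 - wt on x, mpg on y, cyl on colour
aes3 <- ggplot(cars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point()

# 4 - same mappings, with shape and size set as attributes of the geom
aes4 <- ggplot(cars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point(shape = 1, size = 4)

aes4
```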
All about aesthetics, part 2
The color aesthetic typically changes the outside outline of an object and the fill aesthetic is typically the inside shading. However, as you saw in the last exercise, geom_point() is an exception. Here you use color, instead of fill, for the inside of the point. But it’s a bit subtler than that.
Which shape to use? The default geom_point() uses shape = 19 (a solid circle with an outline the same colour as the inside). Good alternatives are shape = 1 (hollow) and shape = 16 (solid, no outline). These all use the col aesthetic (don’t forget to set alpha for solid points).
A really nice alternative is shape = 21, which allows you to use both fill for the inside and col for the outline! This is a great little trick for when you want to map two aesthetics to a dot.
What happens when you use the wrong aesthetic mapping? This is a very common mistake! The code from the previous exercise is in the editor. Using this as your starting point complete the instructions.
Instructions
Note: In the mtcars dataset, cyl and am have been converted to factor for you.
- Copy & paste the first plot’s code. Change the aesthetics so that cyl maps to fill rather than col.
- Copy & paste the second plot’s code. In geom_point() change the shape argument to 21 and add an alpha argument set to 0.6.
- Copy & paste the third plot’s code. In the ggplot() aesthetics, map am to col.
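The editor code is not shown in these notes. A sketch that assumes the starting point is the hollow-circle plot from the previous exercise (shape = 1, size = 4); cars and the plot names are hypothetical:

```r
library(ggplot2)

# Work on a copy with cyl and am as factors
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)
cars$am <- as.factor(cars$am)

# Plot 2: cyl mapped to fill; hollow shape 1 has no fill, so the colours vanish
fill_hollow <- ggplot(cars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_point(shape = 1, size = 4)

# Plot 3: shape 21 has both an outline and a fill, so cyl becomes visible
fill_21 <- ggplot(cars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_point(shape = 21, size = 4, alpha = 0.6)

# Plot 4: fill shows cyl, outline colour shows am
fill_and_col <- ggplot(cars, aes(x = wt, y = mpg, fill = cyl, col = am)) +
  geom_point(shape = 21, size = 4, alpha = 0.6)

fill_and_col
```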
All about aesthetics, part 3
Now that you’ve got some practice with incrementally building up plots, you can try to do it from scratch! The mtcars dataset is pre-loaded in the workspace.
Instructions
Use ggplot() to create a basic scatter plot. Inside aes(), map wt onto x and mpg onto y. Typically, you would say “mpg described by wt” or “mpg vs wt”, but in aes(), it’s x first, y second. Use geom_point() to make three scatter plots:
- cyl on size
- cyl on alpha
- cyl on shape
Try this last variant:
- cyl on label. In order to correctly show the text (i.e. label), use geom_text().
All about attributes, part 1
You can use all the aesthetics as attributes. Let’s see how this works with the aesthetics you used in the previous exercises: x, y, color, fill, size, alpha, label and shape.
This time you’ll use these arguments to set attributes of the plot, not aesthetics. However, there are some pitfalls you’ll have to watch out for: these attributes can overwrite the aesthetics of your plot!
A word about shapes: In the exercise “All about aesthetics, part 2”, you saw that shape = 21 results in a point that has a fill and an outline. Shapes in R can have a value from 0-25. Shapes 0-20 can only accept a color aesthetic, but shapes 21-25 have both a color and a fill aesthetic. See the description of pch in ?points for further discussion.
A word about hexadecimal colours: Hexadecimal, literally “related to 16”, is a base-16 alphanumeric counting system. Individual values come from the ranges 0-9 and A-F. This means there are 256 possible two-digit values (i.e. 00 - FF). Hexadecimal colours use this system to specify a six-digit code for Red, Green and Blue values (“#RRGGBB”) of a colour (e.g. pure blue: “#0000FF”, black: “#000000”, white: “#FFFFFF”). R can accept hex codes as valid colours.
Instructions
- You will continue to work with mtcars. Use ggplot() to create a basic scatter plot: map wt onto x, mpg onto y and cyl onto color.
- Overwrite the color of the points inside geom_point() to my_color. Notice how this cancels out the colors given to the points by the number of cylinders!
- Starting with plot 2, map cyl to fill instead of col and set the attributes size to 10, shape to 23 and color to my_color inside geom_point().
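A sketch of the three plots. The value of my_color is predefined in the exercise and not given in these notes, so the hex code below is an arbitrary assumption, as are the names cars and attr1 to attr3:

```r
library(ggplot2)

# Work on a copy with cyl as a factor
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# Assumption: any valid hex colour will do for illustration
my_color <- "#4ABEFF"

# 1 - colour mapped to cyl
attr1 <- ggplot(cars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point()

# 2 - the attribute inside the geom overrides the mapped colour
attr2 <- ggplot(cars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point(col = my_color)

# 3 - cyl on fill; size, shape and outline colour set as attributes
attr3 <- ggplot(cars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_point(size = 10, shape = 23, col = my_color)

attr3
```

In the second plot every point has the same colour: the attribute set in geom_point() wins over the aesthetic mapping.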
All about attributes, part 2
You can use all the aesthetics as attributes. Let’s see how this works with the aesthetics you used in the previous exercises: x, y, color, fill, size, alpha, label and shape.
In this exercise you will set all kinds of attributes of the points!
You will continue to work with mtcars.
Instructions
- Add to the first command: draw points with alpha set to 0.5.
- Add to the second command: draw points of shape 24 in the color yellow.
- Add to the third command: draw text with label rownames(mtcars) in the color red. Don’t use geom_point() here! You should get a scatter plot with the names of the cars instead of points.
Note: Remember to specify characters with quotation marks ("yellow", not yellow).
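The three starting commands are not shown here. This sketch assumes each one begins with a plain wt-versus-mpg base layer (base_layer and the plot names are hypothetical):

```r
library(ggplot2)

# Assumed starting point shared by all three commands
base_layer <- ggplot(mtcars, aes(x = wt, y = mpg))

# 1 - half-transparent points
pts_alpha <- base_layer + geom_point(alpha = 0.5)

# 2 - yellow triangles (shape 24)
pts_yellow <- base_layer + geom_point(shape = 24, color = "yellow")

# 3 - car names in red instead of points
txt_red <- base_layer + geom_text(label = rownames(mtcars), color = "red")

txt_red
```

All of these are attributes, so they sit outside aes() and colour names are quoted strings.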
Going all out
In this exercise, you will gradually add more aesthetics layers to the plot. You’re still working with the mtcars dataset, but this time you’re using more features of the cars. For completeness, here is a list of all the features of the observations in mtcars:
- mpg – Miles/(US) gallon
- cyl – Number of cylinders
- disp – Displacement (cu.in.)
- hp – Gross horsepower
- drat – Rear axle ratio
- wt – Weight (lb/1000)
- qsec – 1/4 mile time
- vs – V/S engine (0 = V-shaped, 1 = straight)
- am – Transmission (0 = automatic, 1 = manual)
- gear – Number of forward gears
- carb – Number of carburetors
Notice that adding more aesthetics to your plot is not always a good idea. Adding aesthetic mappings to a plot will increase its complexity, and thus decrease its readability.
Instructions
Note: In this chapter you saw aesthetics and attributes. Variables in a data frame are mapped to aesthetics in aes() (e.g. aes(col = cyl)) within ggplot(). Visual elements are set by attributes in specific geom layers (e.g. geom_point(col = "red")). Don’t confuse these two things - here you’re focusing on aesthetic mappings.
- Draw a scatter plot of mtcars with mpg on the x-axis, qsec on the y-axis and factor(cyl) as colors.
- Expand the previous plot to include factor(am) as the shape of the points.
- Expand the previous plot to include the ratio of horsepower to weight (i.e. hp/wt) as the size of the points.
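The three steps could be sketched like this (the names all1 to all3 are hypothetical):

```r
library(ggplot2)

# 1 - mpg on x, qsec on y, cylinders as colour
all1 <- ggplot(mtcars, aes(x = mpg, y = qsec, col = factor(cyl))) +
  geom_point()

# 2 - add transmission type as shape
all2 <- ggplot(mtcars, aes(x = mpg, y = qsec, col = factor(cyl),
                           shape = factor(am))) +
  geom_point()

# 3 - add the horsepower-to-weight ratio as size
all3 <- ggplot(mtcars, aes(x = mpg, y = qsec, col = factor(cyl),
                           shape = factor(am), size = hp / wt)) +
  geom_point()

all3
```

The last plot encodes five variables at once, which illustrates how quickly readability drops as mappings are added.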
Aesthetics for categorical and continuous variables
Many of the aesthetics can be mapped onto continuous or categorical variables, but some are restricted to categorical data. Which aesthetics are they?
Instructions
Possible Answers
- color & fill
- alpha & size
- label & shape
- alpha & label
- x & y
Position
Bar plots suffer from their own issues of overplotting, as you’ll see here. Use the "stack", "fill" and "dodge" positions to reproduce the plot in the viewer.
The ggplot2 base layers (data and aesthetics) have already been coded; they’re stored in a variable cyl.am. It looks like this:
Instructions
- Add a geom_bar() call to cyl.am. By default, the position will be set to "stack".
- Fill in the second ggplot command. Explicitly set position to "fill" inside geom_bar().
- Fill in the third ggplot command. Set position to "dodge".
- The position = "dodge" version seems most appropriate. Finish off the fourth ggplot command by completing the three scale_ functions:
- scale_x_discrete() takes as its only argument the x-axis label: "Cylinders".
- scale_y_continuous() takes as its only argument the y-axis label: "Number".
- scale_fill_manual() fixes the legend. The first argument is the title of the legend: "Transmission". Next, values and labels are set to predefined values for you. These are the colors and the labels in the legend.
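Neither the definition of cyl.am nor the predefined values and labels appear in these notes. The sketch below assumes cyl on the x-axis and am on fill, and uses made-up vectors val and lab for the colours and legend labels:

```r
library(ggplot2)

# Assumed base layers: cylinders on x, transmission type as fill
cyl.am <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))

# Default position is "stack"
bar_stack <- cyl.am + geom_bar()

# Proportions within each cylinder group
bar_fill <- cyl.am + geom_bar(position = "fill")

# Bars side by side
bar_dodge <- cyl.am + geom_bar(position = "dodge")

# Hypothetical colours and labels; am is coded 0 = automatic, 1 = manual
val <- c("#E41A1C", "#377EB8")
lab <- c("Automatic", "Manual")

bar_final <- cyl.am +
  geom_bar(position = "dodge") +
  scale_x_discrete("Cylinders") +
  scale_y_continuous("Number") +
  scale_fill_manual("Transmission", values = val, labels = lab)

bar_final
```

The order of lab follows the factor levels of am, so the first label belongs to 0 and the second to 1.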
Setting a dummy aesthetic
In the last chapter you saw that all the visible aesthetics can serve as attributes and aesthetics, but I very conveniently left out x and y. That’s because although you can make univariate plots (such as histograms, which you’ll get to in the next chapter), a y-axis will always be provided, even if you didn’t ask for it.
In the base package you can make univariate plots with stripchart() directly and it will take care of a fake y axis for us. Since this is univariate data, there is no real y axis.
You can get the same thing in ggplot2, but it’s a bit more cumbersome. The only reason you’d really want to do this is if you were making many plots and you wanted them to be in the same style, or you wanted to take advantage of an aesthetic mapping (e.g. colour).
Instructions
- Try to run ggplot(mtcars, aes(x = mpg)) + geom_point() in the console. x is only one of the two essential aesthetics for geom_point(), which is why you get an error message.
- 1 - To fix this, map a value, e.g. 0, instead of a variable, onto y. Use geom_jitter() to avoid having all the points on a horizontal line.
- 2 - To make everything look nicer, copy & paste the code for plot 1 and change the limits of the y axis using the appropriate scale_y_...() function. Set the limits argument to c(-2, 2).
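A sketch of the two plots, assuming scale_y_continuous() is the scale function the instructions have in mind (the names strip1 and strip2 are hypothetical):

```r
library(ggplot2)

# 1 - map the constant 0 onto y and jitter the points vertically
strip1 <- ggplot(mtcars, aes(x = mpg, y = 0)) +
  geom_jitter()

# 2 - widen the y limits so the jittered band sits in the middle
strip2 <- ggplot(mtcars, aes(x = mpg, y = 0)) +
  geom_jitter() +
  scale_y_continuous(limits = c(-2, 2))

strip2
```

The y values carry no information here; they only spread the points out so that individual observations remain visible.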
Overplotting 1 - Point shape and transparency
In the previous section you saw that there are lots of ways to use aesthetics. Perhaps too many, because although they are possible, they are not all recommended. Let’s take a look at what works and what doesn’t.
So far you’ve focused on scatter plots since they are intuitive, easily understood and very common. A major consideration in any scatter plot is dealing with overplotting. You’ll encounter this topic again in the geometries layer, but you can already make some adjustments here.
You’ll have to deal with overplotting when you have:
- Large datasets,
- Imprecise data and so points are not clearly separated on your plot (you saw this in the video with the iris dataset),
- Interval data (i.e. data appears at fixed values), or
- Aligned data values on a single axis.
One very common technique that I’d recommend you always use when you have solid shapes is alpha blending (i.e. adding transparency). An alternative is to use hollow shapes. These are adjustments to make before even worrying about positioning. This addresses the first point above, which you’ll see again in the next exercise.
Instructions
- Begin by making a basic scatter plot of mpg (y) vs. wt (x), map cyl to color and make the size = 4. cyl has already been converted to a factor variable for you.
- Modify the above plot to set shape to 1. This allows for hollow circles.
- Modify the first plot to set alpha to 0.6.
Overplotting 2 - alpha with large datasets
In a previous exercise we defined four situations in which you’d have to adjust for overplotting. You’ll consider the first and the last of them here with the diamonds dataset:
- Large datasets.
- Aligned data values on a single axis.
Instructions
- The diamonds data frame is available in the ggplot2 package. Begin by making a basic scatter plot of price (y) vs. carat (x) and map clarity onto color.
- Copy the above functions and set the alpha to 0.5. This is a good start to dealing with the large dataset.
- Align all the diamonds within a clarity class, by plotting carat (y) vs. clarity (x). Map price onto color. alpha should still be 0.5.
- In the previous plot, all the individual values line up on a single axis within each clarity category, so you have not overcome overplotting. Modify the above plot to use position = "jitter" inside geom_point().
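The four plots might be written as follows (the names over1 to over4 are hypothetical):

```r
library(ggplot2)

# 1 - price vs. carat, coloured by clarity
over1 <- ggplot(diamonds, aes(x = carat, y = price, col = clarity)) +
  geom_point()

# 2 - the same with alpha blending
over2 <- ggplot(diamonds, aes(x = carat, y = price, col = clarity)) +
  geom_point(alpha = 0.5)

# 3 - carat vs. clarity, coloured by price; points line up per category
over3 <- ggplot(diamonds, aes(x = clarity, y = carat, col = price)) +
  geom_point(alpha = 0.5)

# 4 - jitter the points within each clarity category
over4 <- ggplot(diamonds, aes(x = clarity, y = carat, col = price)) +
  geom_point(alpha = 0.5, position = "jitter")

over4
```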
Geometries
Scatter plots and jittering (1)
You already saw a few examples using geom_point() where the result was not a scatter plot. For example, in the plot given below, a continuous variable, wt, is mapped to the y aesthetic, and a categorical variable, cyl, is mapped to the x aesthetic. This also leads to over-plotting, since the points are arranged on a single x position. You previously dealt with overplotting by setting position = "jitter" inside geom_point(). Let’s look at some other solutions here.
Instructions
Beginning with the code for the plot in the viewer (given), make these modifications:
- Use a shortcut geom, geom_jitter(), instead of geom_point().
- Unfortunately, the width of the jitter is a bit too wide to be useful. Adjust this by setting the argument width = 0.1 inside geom_jitter().
- Finally, return to geom_point() and set the position argument here to position_jitter(0.1), which will set the jittering width directly inside a points layer.
Note: For convenience, you could have saved the data and aesthetic layers as a ggplot2 object and re-used it in all solutions. We’ve made each plot explicit so that you can see all plotting instructions.
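As a sketch of the modifications above, and following the note about re-using the data and aesthetic layers, the plots could be built like this. The given starting plot is not reproduced in these notes, so the base plot here (`wt` against numeric `cyl` in `mtcars`) is an assumption, as are the object names.

```r
library(ggplot2)

# Assumed starting point: wt (y) against cyl (x), stored for re-use
base <- ggplot(mtcars, aes(x = cyl, y = wt))

# Given plot: all points for one cyl value share a single x position
p_point <- base + geom_point()

# 1 - Shortcut geom with the default jitter width
p_jitter <- base + geom_jitter()

# 2 - Narrower jitter, set directly in geom_jitter()
p_narrow <- base + geom_jitter(width = 0.1)

# 3 - Same width, set through a position function in a points layer
p_position <- base + geom_point(position = position_jitter(0.1))
```

Plots 2 and 3 use the same jitter width; the difference is only in where the setting lives. Note that both leave the vertical jitter at its default, so `wt` values are also nudged slightly.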
Scatter plots and jittering (2)
In the chapter on aesthetics you saw different ways in which you will have to compensate for overplotting.
Another issue arises when you have interval data. This can be continuous data measured on an interval scale (e.g. 1, 2, 3, …), as opposed to a numeric scale (e.g. 1.1, 1.4, 1.5, …), or two categorical (e.g. factor) variables, which are just type integer under the hood.
In such a case you’ll have a small, defined number of intersections between the two variables.
You will be using the Vocab
dataset. The Vocab
dataset contains information about the years of education and integer score on a vocabulary test for over 21,000 individuals based on US General Social Surveys from 1972-2004.
Instructions
- The
Vocab
data frame has been loaded for you. Both the education and vocabulary variables are classified as integers. You can imagine these as factor variables, but here, integers are more convenient to work with. First, get familiar with the dataset by looking at its structure with str()
.
- Make a basic scatter plot of
vocabulary
(y) vs. education
(x). Here it becomes apparent that you have issues with overplotting because of the integer scales.
- Use
geom_jitter()
instead of geom_point()
.
- Using the jittered plot, set
alpha
to 0.2
(very low).
- Using the jittered plot, set
shape
to 1
.
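The sequence of plots could look like the sketch below. Since the `Vocab` data frame is not part of ggplot2, this sketch uses a randomly generated stand-in called `vocab_like` (a hypothetical name) with the same two integer columns; it is not the real survey data, only an illustration of the overplotting problem.

```r
library(ggplot2)

# Hypothetical stand-in for Vocab: two integer-valued columns
set.seed(1)
vocab_like <- data.frame(
  education  = sample(0:20, 2000, replace = TRUE),
  vocabulary = sample(0:10, 2000, replace = TRUE)
)
str(vocab_like)

# 1 - Basic scatter plot: many rows collapse onto few grid positions
p_point <- ggplot(vocab_like, aes(x = education, y = vocabulary)) +
  geom_point()

# 2 - Jitter the points
p_jitter <- ggplot(vocab_like, aes(x = education, y = vocabulary)) +
  geom_jitter()

# 3 - Jitter plus a very low alpha
p_alpha <- ggplot(vocab_like, aes(x = education, y = vocabulary)) +
  geom_jitter(alpha = 0.2)

# 4 - Jitter, low alpha and hollow circles
p_hollow <- ggplot(vocab_like, aes(x = education, y = vocabulary)) +
  geom_jitter(alpha = 0.2, shape = 1)
```

With integer scales, 2000 rows can occupy at most 21 x 11 distinct positions in the first plot, which is why the jittered versions reveal much more of the data.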
Histograms
Histograms are one of the most common and intuitive ways of showing distributions. In this exercise you’ll use the mtcars
data frame to explore typical variations of simple histograms. But first, some background:
The x axis/aesthetic: The documentation for geom_histogram()
states the argument stat = "bin"
as a default. Recall that histograms cut up a continuous variable into discrete bins; that’s what the stat “bin” is doing. You get 30 evenly sized bins by default (current versions of ggplot2 express this as bins = 30), which amounts to binwidth = range/30
. This is a pretty good starting point if you don’t know anything about the variable being plotted and want to start exploring.
The y axis/aesthetic: geom_histogram()
only requires one aesthetic: x
. But there is clearly a y
axis on your plot, so where does it come from? Actually, there is a variable mapped to the y aesthetic; it’s called ..count..
. When geom_histogram()
executed the binning statistic (see above), it not only cut up the data into discrete bins, but it also counted how many values are in each bin. So there is an internal data frame where this information is stored. The .. calls the variable count from this internal data frame. This is what appears on the y
aesthetic. But it gets better! The density has also been calculated. This is the proportional frequency of this bin in relation to the whole data set, scaled by the bin width so that the bar areas sum to 1. You use ..density..
to access this information.
Instructions
- Use the
mtcars
data frame and make a univariate histogram by mapping mpg
onto the x
aesthetic. Use geom_histogram()
for the geom layer.
- Take plot 1 and manually create 1-unit wide bins with the
binwidth = 1
argument in geom_histogram()
.
- Take plot 2, and map
..density..
onto the y
aesthetic (i.e. inside an aes()
) inside geom_histogram()
. You’ll have two aes()
functions: one inside ggplot()
and another inside geom_histogram()
. (See the intro text for a discussion of ..density..
).
- Take plot 3 and set the attribute
fill
, the inside of the bars, to the value "#377EB8"
in geom_histogram()
. This should not appear in aes()
, since it’s an attribute, not an aesthetic mapping.
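The four histogram variations could be sketched as follows. One deliberate difference from the text: instead of the `..density..` notation, this sketch uses `after_stat(density)`, which is the newer spelling for the same computed variable and assumes ggplot2 3.3.0 or later. Object names are illustrative.

```r
library(ggplot2)

# 1 - Univariate histogram with the default 30 bins
p1 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram()

# 2 - Bins that are 1 unit wide
p2 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 1)

# 3 - Density instead of count on the y axis (second aes() inside the geom)
p3 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1)

# 4 - Fixed fill colour, set as an attribute outside aes()
p4 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "#377EB8")

# The "internal data frame" mentioned above can be inspected directly
head(layer_data(p2)[, c("x", "count", "density")])
```

`layer_data()` returns the data frame computed by the binning statistic, so it is a convenient way to see where `count` and `density` come from.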
Position
In the previous chapter you saw that there are lots of ways to position scatter plots. Likewise, the geom_bar()
and geom_histogram()
geoms also have a position
argument, which you can use to specify how to draw the bars of the plot.
Three position
arguments will be introduced here:
stack
: place the bars on top of each other. Counts are used. This is the default position.
fill
: place the bars on top of each other, but this time use proportions.
dodge
: place the bars next to each other. Counts are used.
In this exercise you’ll draw the total count of cars having a given number of cylinders
(cyl), according to manual or automatic transmission type (am
).
Since, in the built-in mtcars
data set, cyl
and am
are integers, you have to convert them into factor variables.
Instructions
- Using
mtcars
, map cyl
onto the x
aesthetic and am
onto fill
. Use geom_bar()
to make a bar plot.
- Take plot 1 and explicitly set
position = "stack"
in geom_bar()
. This doesn’t change anything, does it? It was mentioned above that "stack"
is the default.
- Take plot 2 and set
position = "fill"
in geom_bar()
.
- Take plot 3 and set
position = "dodge"
in geom_bar()
.
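A minimal sketch of the four bar plots follows. The names `cars` and `base` are made up for this example; the factor conversion mirrors the requirement stated above.

```r
library(ggplot2)

# cyl and am are numeric in mtcars, so convert both to factors
cars <- transform(mtcars, cyl = factor(cyl), am = factor(am))

# Shared data and aesthetic layers
base <- ggplot(cars, aes(x = cyl, fill = am))

# 1 - Default bar plot
p_default <- base + geom_bar()

# 2 - Explicit stacking, identical to the default
p_stack <- base + geom_bar(position = "stack")

# 3 - Stacked proportions: every bar reaches 1
p_fill <- base + geom_bar(position = "fill")

# 4 - Bars side by side, counts on the y axis
p_dodge <- base + geom_bar(position = "dodge")
```

Comparing `p_default` and `p_stack` confirms that `"stack"` is the default, while `p_fill` rescales each `cyl` group to a total height of 1.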
Overlapping bar plots
So far you’ve seen three different positions for bar plots: stack
(the default), dodge
(preferred), and fill
(to show proportions).
However, you can go one step further by adjusting the dodging, so that your bars partially overlap each other. For this example you’ll again use the mtcars
dataset. Like last time you have to convert cyl
and am
into factors inside mtcars
.
Instead of using position = "dodge"
you’re going to use position_dodge()
, like you did with position_jitter()
in the Scatter plots and jittering (1) exercise. Here, you’ll save this as an object, posn_d
, so that you can easily reuse it.
Remember, the reason you want to use position_dodge()
(and position_jitter()
) is to specify how much dodging (or jittering) you want.
Instructions
- The last plot from the last exercise has been provided for you.
- Define a new object called
posn_d
by calling position_dodge()
with the argument width = 0.2
.
- Take plot 1 and make slightly overlapping bars by using the
position = posn_d
argument.
- Take plot 3 and set
alpha = 0.6
to see the overlap in bars.
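These steps could be sketched as below. Apart from `posn_d`, which the exercise names, the object names are illustrative, and the factor conversion of `cyl` and `am` is repeated so the snippet runs on its own.

```r
library(ggplot2)

cars <- transform(mtcars, cyl = factor(cyl), am = factor(am))
base <- ggplot(cars, aes(x = cyl, fill = am))

# 1 - Last plot from the previous exercise: fully dodged bars
p_dodge <- base + geom_bar(position = "dodge")

# 2 - A reusable position object with a small dodge width
posn_d <- position_dodge(width = 0.2)

# 3 - Bars shifted only slightly, so they partially overlap
p_overlap <- base + geom_bar(position = posn_d)

# 4 - Transparency makes the overlapping parts visible
p_alpha <- base + geom_bar(position = posn_d, alpha = 0.6)
```

Storing the position in `posn_d` means the same amount of dodging can be applied to several layers or plots without repeating the number.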
Overlapping histograms
Overlapping histograms pose similar problems to overlapping bar plots, but there is a unique solution here: a frequency polygon.
This is a geom specific to binned data that draws a line connecting the value of each bin. Like geom_histogram()
, it takes a binwidth
argument and by default stat = "bin"
and position = "identity"
.
Instructions
- The code for a basic histogram of
mpg
, which you’ve already seen, is provided. Extend the code to map cyl
onto fill
inside aes()
.
- The default position for histograms is
"stack"
. Copy your solution to the first exercise and set the position
for the histogram bars to "identity"
.
- Using the same data and base layers as in the previous two plots, create a plot with a
geom_freqpoly()
. Because you’re no longer working with bars, change the aes()
function: cyl
should be mapped onto col
, not onto fill
. This will correctly color the geom.
Bar plots with color ramp, part 1
In this example of a bar plot, you’ll fill each segment according to an ordinal variable. The best way to do that is with a sequential color series.
You’ll be using the Vocab
dataset from earlier. Since this is a much larger dataset with more categories, you’ll also compare it to a simpler dataset, mtcars
. Both datasets contain ordinal variables.
Instructions
- The bar plot from the previous exercise is provided -
cyl
is on the x-axis and filled according to transmission type, am
. Notice how you can set the color palette used to fill the bars with scale_fill_brewer()
. For a full list of possible color sets, have a look at ?brewer.pal
.
- Explore
Vocab
with str()
. Notice that the education
and vocabulary
variables have already been converted to factor variables for you.
- Make a filled bar chart with the
Vocab
dataset.
- Map
education
to x
and vocabulary
to fill
.
- Inside
geom_bar()
, make sure to set position = "fill"
.
- Allow color brewer to choose a default color palette by using the appropriate scale function, without arguments. Notice how this generates a warning message and an incomplete plot.
Bar plots with color ramp, part 2
In the previous exercise, you ended up with an incomplete bar plot. This was because the default sequential RColorBrewer
palette that scale_fill_brewer()
calls is "Blues"
. There are only 9 colours in the palette, and since you have 11 categories, your plot looked strange.
In this exercise, you’ll manually create a color palette that can generate all the colours you need. To do this you’ll use a function called colorRampPalette()
.
The input is a character vector of 2 or more colour values, e.g. "#FFFFFF"
(white) and "#0000FF"
(pure blue). (See All about attributes, part 1 for a discussion on hexadecimal codes).
The output is itself a function! So when you assign it to an object, that object should be used as a function. To see what we mean, execute the following three lines:
new_col()
is a function that takes one argument: the number of colours you want to extrapolate. You want to use nicer colours, so we’ve assigned the entire "Blues"
colour palette from the RColorBrewer
package to the character vector blues
.
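As a minimal sketch of the behaviour described above (not the original example lines, which are not reproduced in these notes), the following shows that `colorRampPalette()` returns a function, and how the `blues` vector could be created. It assumes the RColorBrewer package is installed.

```r
library(RColorBrewer)

# colorRampPalette() lives in grDevices, which is loaded with base R
new_col <- colorRampPalette(c("#FFFFFF", "#0000FF"))

# The result is a function, not a vector of colours
is.function(new_col)

# Calling it with a number returns that many colours from white to blue
new_col(4)

# The full 9-colour "Blues" palette as a character vector
blues <- brewer.pal(9, "Blues")
blues
```

The first and last colours returned by `new_col()` are the two input colours; the ones in between are filled in along the ramp.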
Instructions
- Like in the example code above, create a new function called
blue_range
that uses colorRampPalette()
to extrapolate over all 9 values of the blues
character vector.
- Take the plot code from the last exercise (provided), and change
scale_fill_brewer()
to be scale_fill_manual()
. Set the argument values = blue_range(11)
inside scale_fill_manual()
.
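The two instructions could be sketched as follows. Because `Vocab` does not ship with ggplot2, the sketch uses a hypothetical stand-in, `vocab_like`, that simply contains every combination of 21 education levels and 11 vocabulary levels once; it illustrates the colour scale, not the real survey results. It also assumes RColorBrewer is installed.

```r
library(ggplot2)
library(RColorBrewer)

# 1 - A palette function that stretches the 9 "Blues" colours
blues <- brewer.pal(9, "Blues")
blue_range <- colorRampPalette(blues)

# Hypothetical stand-in for Vocab with factor columns (11 fill categories)
vocab_like <- expand.grid(
  education  = factor(0:20),
  vocabulary = factor(0:10)
)

# 2 - Filled bar chart with a manual scale of 11 interpolated colours
p_ramp <- ggplot(vocab_like, aes(x = education, fill = vocabulary)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = blue_range(11))
```

`scale_fill_manual()` needs at least as many values as there are categories, which is why `blue_range(11)` works where the 9-colour palette did not.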
Overlapping histograms (2)
As a last example of bar plots, you’ll return to histograms (which you now see are just a special type of bar plot). You saw a nice trick in a previous exercise of how to slightly overlap bars, but now you’ll see how to overlap them completely. This would be nice for multiple histograms, as long as there are not too many different overlaps!
You’ll make a histogram using the mpg
variable in the mtcars
data frame.
Instructions
- A basic histogram plot is provided.
- Take plot 1 and map
am
onto fill
within the aes()
function. The default position is "stack"
.
- Take plot 2 and add the
position
argument within geom_histogram()
. Set it to "dodge"
.
- Take plot 3 and change the
position
argument to "fill"
. In this case, none of these positions really work well, because it’s difficult to compare the distributions directly.
- Take plot 4 and change the
position
argument to "identity"
and set alpha = 0.4
. This produces overlapping bars.
- Take plot 5 and change the aesthetic mapping. Map
cyl onto fill
.
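The progression through the position options could be sketched like this. The bin width of 1, the factor conversion and the object names are assumptions for illustration.

```r
library(ggplot2)

cars <- transform(mtcars, am = factor(am), cyl = factor(cyl))

# 1 - Basic histogram
p1 <- ggplot(cars, aes(x = mpg)) +
  geom_histogram(binwidth = 1)

# 2 - Fill by am; bars are stacked by default
p2 <- ggplot(cars, aes(x = mpg, fill = am)) +
  geom_histogram(binwidth = 1)

# 3 - Bars side by side
p3 <- ggplot(cars, aes(x = mpg, fill = am)) +
  geom_histogram(binwidth = 1, position = "dodge")

# 4 - Proportions within each bin
p4 <- ggplot(cars, aes(x = mpg, fill = am)) +
  geom_histogram(binwidth = 1, position = "fill")

# 5 - Fully overlapping, semi-transparent bars
p5 <- ggplot(cars, aes(x = mpg, fill = am)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)

# 6 - Same as plot 5, but filled by cyl
p6 <- ggplot(cars, aes(x = mpg, fill = cyl)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)
```

Plot 4 may print warnings about removed rows, because bins with no cars at all have no proportion to show.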
Line plots
In the video you saw how to make line plots using time series data. To explore this topic, you’ll use the economics
data frame, which contains time series for unemployment and population statistics from the Federal Reserve Bank of St. Louis in the US. The data is contained in the ggplot2
package.
To begin with, you can look at how the number of unemployed people and the unemployment rate (the number of unemployed people as a proportion of the population) change over time.
In the next exercises, you’ll explore how to add embellishments to the line plots, such as recession periods.
Instructions
- Print out the
head()
of the economics
data frame.
- Use the
economics
data frame to plot date
on the x
axis and unemploy
on the y
axis. Use geom_line()
.
- Copy, paste and adjust the code for the previous instruction: instead of
unemploy
, plot unemploy/pop
to represent the fraction of the total population that is unemployed.
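A short sketch of these three steps is given below; the object names are illustrative.

```r
library(ggplot2)

# 1 - First rows of the economics data frame
head(economics)

# 2 - Number of unemployed people over time
p_count <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

# 3 - Unemployed people as a fraction of the total population
p_rate <- ggplot(economics, aes(x = date, y = unemploy / pop)) +
  geom_line()
```

The calculation `unemploy / pop` is done inside `aes()`, so no extra column has to be added to the data frame.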
Periods of recession
By themselves, time series often contain enough valuable information, but you always want to maximize the number of variables you can show in a plot. This allows you (and your viewers) to begin making comparisons between those variables that would otherwise be difficult or impossible.
Here, you’ll add shaded regions to the background to indicate recession periods. How do unemployment rate and recession period interact with each other?
In addition to the economics
dataset from before, you’ll also use the recess
dataset for the periods of recession. The recess
data frame contains 2 variables: the begin
period of the recession and the end
. It’s already available in your workspace.
Instructions
Expand the command from the previous exercise with geom_rect()
. You will use this geom layer to draw rectangles across the recession periods. There are a few pitfalls here:
- geom_rect()
uses the recess
dataset, so pass this directly as data = recess
inside geom_rect()
.
- The
geom_rect()
command shouldn’t inherit aesthetics from the base ggplot()
command it belongs to. It would result in an error, since you’re using a different dataset and it doesn’t contain unemploy
or pop
. That’s why you should specify inherit.aes = FALSE
in geom_rect()
.
- geom_rect()
needs four aesthetics: xmin
, xmax
, ymin
and ymax
. These should be set to begin
, end
and -Inf
, +Inf
, respectively. Define them within aes()
.
- The rectangles you add will be black and opaque by default. Set
fill
to "red"
and alpha
to 0.2
to improve this. Define them outside aes()
.
library(ggplot2)
library(lubridate)
# Begin and end dates of the recession periods
recess <- data.frame(
  begin = c("1969-12-01","1973-11-01","1980-01-01","1981-07-01","1990-07-01","2001-03-01"),
  end = c("1970-11-01","1975-03-01","1980-07-01","1982-11-01","1991-03-01","2001-11-01"),
  stringsAsFactors = FALSE
)
# Convert the character columns to Date objects
recess$begin <- ymd(recess$begin)
recess$end <- ymd(recess$end)
# Basic line plot
ggplot(economics, aes(x = date, y = unemploy/pop)) +
  geom_line()
# Expand the following command with geom_rect() to draw the recess periods
ggplot(economics, aes(x = date, y = unemploy/pop)) +
  ___(___ = ___,
      aes(___ = ___, ___ = ___, ___ = ___, ___ = ___),
      ___ = FALSE, ___ = "red", ___ = 0.2) +
  geom_line()
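One way the blanks could be filled in, following the four pitfalls listed in the instructions, is sketched below. It is a sketch rather than the official solution; it uses base R's `as.Date()` in place of `lubridate::ymd()` so that it only depends on ggplot2, and the object name `p_recess` is made up.

```r
library(ggplot2)

# Recession periods as Date columns (same dates as in the exercise)
recess <- data.frame(
  begin = as.Date(c("1969-12-01", "1973-11-01", "1980-01-01",
                    "1981-07-01", "1990-07-01", "2001-03-01")),
  end   = as.Date(c("1970-11-01", "1975-03-01", "1980-07-01",
                    "1982-11-01", "1991-03-01", "2001-11-01"))
)

p_recess <- ggplot(economics, aes(x = date, y = unemploy / pop)) +
  # Rectangles use their own data and must not inherit x and y from ggplot()
  geom_rect(data = recess,
            aes(xmin = begin, xmax = end, ymin = -Inf, ymax = +Inf),
            inherit.aes = FALSE, fill = "red", alpha = 0.2) +
  # The line is added last so it is drawn on top of the rectangles
  geom_line()
```

Using `-Inf` and `+Inf` for `ymin` and `ymax` makes each rectangle span the full height of the panel, whatever the range of the y axis turns out to be.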
