1 Introduction

1.1 Exploring ggplot2, part 1

To get a first feel for ggplot2, let’s try to run some basic ggplot2 commands. Together, they build a plot of the mtcars dataset, which contains information about 32 cars from the 1974 Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables.

1.1.1 Instructions

  • Load the ggplot2 package using library().
  • Use str() to explore the structure of the mtcars dataset.
  • Execute the example code below. See if you can understand what ggplot does with the data.
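The example code itself is not reproduced in these notes. A minimal sketch of the three steps, assuming the plot puts cyl on the x-axis and mpg on the y-axis (this mapping is an assumption, suggested by the next exercise):

```r
# Load ggplot2 (assumed to be installed)
library(ggplot2)

# Inspect the structure of the built-in mtcars data frame
str(mtcars)

# Assumed example: mpg plotted against the number of cylinders
p <- ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point()
print(p)
```

Because cyl is stored as a number, ggplot2 treats the x-axis as continuous here.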

1.2 Exploring ggplot2, part 2

The plot from the previous exercise wasn’t really satisfying. Although cyl (the number of cylinders) is categorical, it is classified as numeric in mtcars. You’ll have to explicitly tell ggplot2 that cyl is a categorical variable.

1.2.1 Instructions

  • Change the ggplot() command by wrapping factor() around cyl.
  • Execute the code and see if the resulting plot is better this time.
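A sketch of the change, again assuming the plot maps cyl to x and mpg to y (the mapping is an assumption, as the exercise code is not shown here):

```r
library(ggplot2)

# Wrapping cyl in factor() tells ggplot2 to treat it as categorical,
# so the x-axis only shows the values that actually occur
p_factor <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point()
print(p_factor)
```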

1.3 Exploring ggplot2, part 3

We’ll use several datasets throughout the courses to showcase the concepts discussed in the videos. In the previous exercises, you already got to know mtcars. Let’s dive a little deeper to explore the three main topics in this course: the data, aesthetics, and geom layers.

The mtcars dataset contains information about 32 cars from the 1974 Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables.

You’re encouraged to think about how the examples and concepts we discuss throughout these data viz courses apply to your own datasets!

1.3.1 Instructions

  • ggplot2 has already been loaded for you. Take a look at the first command. It plots the mpg (miles per gallon) against the weight (in thousands of pounds). You don’t have to change anything about this command.
  • In the second call of ggplot() change the color argument in aes() (which stands for aesthetics). The color should be dependent on the displacement of the car engine, found in disp.
  • In the third call of ggplot() change the size argument in aes() (which stands for aesthetics). The size should be dependent on the displacement of the car engine, found in disp.
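The three calls could look roughly like this. The object names are hypothetical and only used so that each plot can be printed separately:

```r
library(ggplot2)

# 1 - mpg against weight, no extra aesthetics
p_base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# 2 - color depends on engine displacement (continuous color gradient)
p_color <- ggplot(mtcars, aes(x = wt, y = mpg, color = disp)) +
  geom_point()

# 3 - point size depends on engine displacement
p_size <- ggplot(mtcars, aes(x = wt, y = mpg, size = disp)) +
  geom_point()

print(p_color)
print(p_size)
```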

1.4 Understanding Variables

In the previous exercise you saw that disp can be mapped onto a color gradient or onto a continuous size scale.

Another argument of aes() is the shape of the points. There are a finite number of shapes which ggplot() can automatically assign to the points. However, if you try this command in the console to the right:

It gives an error. What does this mean?
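The command is not reproduced in these notes. Judging from the question, it presumably maps disp onto shape; the call below is an assumption along those lines. Creating the plot object works, and the error only appears once the plot is built or printed:

```r
library(ggplot2)

# Assumed command: disp (a continuous variable) mapped onto shape
p_shape <- ggplot(mtcars, aes(x = wt, y = mpg, shape = disp)) +
  geom_point()

# Building the plot triggers the error; capture the message instead of stopping
result <- tryCatch(
  {
    ggplot_build(p_shape)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(result)
```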

1.4.1 Instructions

Possible Answers

  • shape is not a defined argument.
  • shape only makes sense with categorical data, and disp is continuous.
  • shape only makes sense with continuous data, and disp is categorical.
  • shape is not a variable in your dataset.
  • shape has to be defined as a function.

1.5 Exploring ggplot2, part 4

The diamonds data frame contains information on the prices and various metrics of almost 54,000 diamonds. Among the variables included are carat (a measurement of the size of the diamond) and price. For the next exercises, you’ll be using a subset of 1,000 diamonds.

Here you’ll use two common geom layer functions: geom_point() and geom_smooth(). We already saw in the earlier exercises how these are added using the + operator.

1.5.1 Instructions

  • Explore the diamonds data frame with the str() function.
  • Use the + operator to add geom_point() to the first ggplot() command. This will tell ggplot2 to draw points on the plot.
  • Use the + operator to add geom_point() and geom_smooth(). These just stack on each other! geom_smooth() will draw a smoothed line over the points.
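A sketch of these steps. How the 1,000-diamond subset is drawn is not stated, so the random sample and the name diamonds_small below are assumptions:

```r
library(ggplot2)

# Explore the full data frame
str(diamonds)

# Hypothetical subset of 1,000 rows; the seed is arbitrary
set.seed(1)
diamonds_small <- diamonds[sample(nrow(diamonds), 1000), ]

# Points only
p_points <- ggplot(diamonds_small, aes(x = carat, y = price)) +
  geom_point()

# Points with a smoothed line stacked on top
p_smooth <- ggplot(diamonds_small, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth()

print(p_points)
```

Typing p_smooth at the console (or calling print(p_smooth)) draws the second plot.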

1.6 Exploring ggplot2, part 5

The code for the last plot of the previous exercise is available in the script on the right. It builds a scatter plot of the diamonds dataset, with carat on the x-axis and price on the y-axis. geom_smooth() is used to add a smooth line.

With this plot as a starting point, let’s explore some more possibilities of combining geoms.

1.6.1 Instructions

  • Plot 2 - Copy and paste plot 1, but show only the smooth line, no points.
  • Plot 3 - Show only the smooth line, but color according to clarity by placing the argument color = clarity in the aes() function of your ggplot() call.
  • Plot 4 - Draw translucent colored points.
    • Copy the ggplot() command from plot 3 (with clarity mapped to color).
    • Remove the smooth layer.
    • Add the points layer back in.
    • Set alpha = 0.4 inside geom_point(). This will make the points 40% opaque (that is, 60% transparent).
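Plots 2 to 4 might be written as follows. The object names are hypothetical, and the full diamonds data frame is used here for simplicity:

```r
library(ggplot2)

# Plot 2: smooth line only, no points
p2 <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_smooth()

# Plot 3: one smooth line per clarity level, distinguished by color
p3 <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_smooth()

# Plot 4: translucent points colored by clarity, no smooth layer
p4 <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.4)
```

Print any of these objects to draw it. Note that alpha is set outside aes() because it is a fixed value, not a mapping to a variable.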

1.7 Understanding the grammar, part 1

Here you’ll explore some of the different grammatical elements. Throughout this course, you’ll discover how they can be combined in all sorts of ways to develop unique plots.

In the following instructions, you’ll start by creating a ggplot object from the diamonds dataset. Next, you’ll add layers onto this object to build beautiful & informative plots.

1.7.1 Instructions

  • Define the data (diamonds) and aesthetics layers. Map carat on the x axis and price on the y axis. Assign it to an object: dia_plot.
  • Using +, add a geom_point() layer (with no arguments) to the dia_plot object. This can be on a single line or across multiple lines.
  • Note that you can also call aes() within the geom_point() function. Map clarity to the color argument in this way.

1.8 Understanding the grammar, part 2

Continuing with the previous exercise, here you’ll explore mixing arguments and aesthetics in a single geometry.

You’re still working on the diamonds dataset.

1.8.1 Instructions

  • The dia_plot object has been created for you.
  • Update dia_plot so that it contains all the functions to make a scatter plot by using geom_point() for the geom layer. Set alpha = 0.2.
  • Using +, plot the dia_plot object with a geom_smooth() layer on top. You don’t want any error shading, which can be achieved by setting se = FALSE in geom_smooth().
  • Modify the geom_smooth() function from the previous instruction so that it contains aes() and map clarity to the col argument.
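One way these steps could be written, assuming dia_plot was created from the data and aesthetics layers described in the previous exercise. The names dia_smooth and dia_smooth_col are hypothetical:

```r
library(ggplot2)

# Data and aesthetics layers only (assumed definition of dia_plot)
dia_plot <- ggplot(diamonds, aes(x = carat, y = price))

# Update dia_plot so it carries a translucent points layer
dia_plot <- dia_plot + geom_point(alpha = 0.2)

# Smooth line on top, without the shaded error band
dia_smooth <- dia_plot + geom_smooth(se = FALSE)

# Same, but with one smooth line per clarity level:
# col is mapped inside aes(), se stays outside as a plain argument
dia_smooth_col <- dia_plot + geom_smooth(aes(col = clarity), se = FALSE)
```

Mixing aes() and ordinary arguments in one geom is what lets the points stay black while only the smooth lines are colored.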

2 Data

2.1 base package and ggplot2, part 1 - plot

These courses are about understanding data visualization in the context of the grammar of graphics. To gain a better appreciation of ggplot2 and to understand how it operates differently from the base package, it’s useful to make some comparisons.

First, let’s focus on the base package. You want to make a plot of mpg (miles per gallon) against wt (weight in thousands of pounds) in the mtcars data frame, but this time you want the dots colored according to the number of cylinders, cyl. How would you do that in the base package? You can use a little trick to color the dots by specifying a factor variable as a color. This works because factors are just a special class of the integer type.

2.1.1 Instructions

  • Using the base package plot(), make a scatter plot with mtcars$wt on the x-axis and mtcars$mpg on the y-axis, colored according to mtcars$cyl (use the col argument). You can specify data = but you’ll just do it the long way here.
  • Add a new column, fcyl, to the mtcars data frame. This should be cyl converted to a factor.
  • Create a similar plot to instruction 1, but this time, use fcyl (which is cyl as a factor) to set the col.
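A short sketch of the three steps in base R:

```r
# 1 - color by the numeric cyl column: the values 4, 6 and 8 are used as color numbers
plot(mtcars$wt, mtcars$mpg, col = mtcars$cyl)

# 2 - add cyl as a factor
mtcars$fcyl <- as.factor(mtcars$cyl)

# 3 - color by the factor: its integer codes 1, 2 and 3 are used as color numbers
plot(mtcars$wt, mtcars$mpg, col = mtcars$fcyl)
```

The two plots differ only in which colors are picked, because the factor is converted to its underlying integer codes rather than to the original cylinder counts.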

2.2 base package and ggplot2, part 2 - lm

If you want to add a linear model to your plot, shown right, you can define it with lm() and then plot the resulting linear model with abline(). However, if you want a model for each subgroup, according to cylinders, then you have a couple of options.

You can subset your data, and then calculate the lm() and plot each subset separately. Alternatively, you can vectorize over the cyl variable using lapply() and combine this all in one step. This option is already prepared for you.

The code below contains a call to the function lapply(), which you might not have seen before. This function takes as input a vector and a function. Then lapply() applies the function it was given to each element of the vector and returns the results in a list. In this case, lapply() takes each element of mtcars$cyl and calls the function defined in the second argument. This function takes a value of mtcars$cyl and then subsets the data so that only rows with cyl == x are used. Then it fits a linear model to the filtered dataset and uses that model to add a line to the plot with the abline() function.

Now that you have an interesting plot, there is a very important aspect missing - the legend!

In the base package you have to take care of this using the legend() function. This has been done for you in the predefined code.

2.2.1 Instructions

  • Fill in the lm() function to calculate a linear model of mpg described by wt and save it as an object called carModel.
  • Draw the linear model on the scatterplot.
    • Write code that calls abline() with carModel as the first argument. Set the line type by passing the argument lty = 2.
    • Run the code that generates the basic plot and the call to abline() all at once by highlighting both parts of the script and hitting control/command + enter on your keyboard. These lines must all be run together in the console so that R will be able to find the plot you want to add a line to.
  • Run the code already given to generate the plot with a different model for each group. You don’t need to modify any of this.
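The predefined code is not shown in these notes, so the following is only a sketch of the approach described above. It deviates in one respect: it loops over the unique cylinder values rather than every element of mtcars$cyl, which draws each line once. The legend position is an assumption.

```r
mtcars$fcyl <- as.factor(mtcars$cyl)

# Linear model of mpg described by wt, for the whole dataset
carModel <- lm(mpg ~ wt, data = mtcars)

# Basic plot plus the overall model as a dashed line
plot(mtcars$wt, mtcars$mpg, col = mtcars$fcyl)
abline(carModel, lty = 2)

# New plot with one model per cylinder group
plot(mtcars$wt, mtcars$mpg, col = mtcars$fcyl)
cyl_values <- sort(unique(mtcars$cyl))
invisible(lapply(cyl_values, function(x) {
  # Fit only on rows where cyl equals the current value
  group_model <- lm(mpg ~ wt, data = mtcars, subset = (cyl == x))
  # Use the same color number as the matching factor level
  abline(group_model, col = match(x, cyl_values))
}))

# The legend has to be added by hand in base graphics
legend(x = 5, y = 33, legend = levels(mtcars$fcyl),
       col = 1:3, pch = 1, bty = "n")
```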

2.3 base package and ggplot2, part 3

In this exercise you’ll recreate the base package plot in ggplot2.

The code for base R plotting is given at the top. The first line of code already converts the cyl variable of mtcars to a factor.

2.3.1 Instructions

  • Plot 1: add geom_point() in order to make a scatter plot.
  • Plot 2: copy and paste Plot 1.
    • Add a linear model for each subset according to cyl by adding a geom_smooth() layer.
    • Inside this geom_smooth(), set method to "lm" and se to FALSE. Note: geom_smooth() will automatically draw a line per cyl subset. It recognizes the groups you want to identify by color in the aes() call within the ggplot() command.
  • Plot 3: copy and paste Plot 2.
    • Plot a linear model for the entire dataset by adding another geom_smooth() layer.
    • Set the group aesthetic inside this geom_smooth() layer to 1. This has to be set within the aes() function.
    • Set method to "lm", se to FALSE and linetype to 2. These have to be set outside aes() of the geom_smooth().

Note: the group aesthetic will tell ggplot() to draw a single linear model through all the points.
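The three plots can be sketched as follows. The object names are hypothetical; each plot builds on the previous one:

```r
library(ggplot2)

# Convert cyl to a factor so color is treated as categorical
mtcars$cyl <- as.factor(mtcars$cyl)

# Plot 1: scatter plot colored by cyl
p1 <- ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point()

# Plot 2: one linear model per cyl group (groups follow the col mapping)
p2 <- p1 +
  geom_smooth(method = "lm", se = FALSE)

# Plot 3: an additional dashed model through all points (group = 1)
p3 <- p2 +
  geom_smooth(aes(group = 1), method = "lm", se = FALSE, linetype = 2)

print(p3)
```

Unlike the base version, the legend is generated automatically from the col mapping.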

2.4 ggplot2 compared to base package

ggplot2 has become very popular and for many people it’s the go-to plotting package in R. What does ggplot2 do that base package doesn’t?

2.4.1 Answer the question

Possible Answers

  1. ggplot2 creates plotting objects, which can be manipulated.
  2. ggplot2 takes care of a lot of the leg work for you, such as choosing nice color palettes and making legends.
  3. ggplot2 is built upon the grammar of graphics plotting philosophy, making it more flexible and intuitive for understanding the relationship between your visuals and your data.
  4. Options 1, 2, and 3.
  5. ggplot2 is effectively a replacement for all base-package plotting functions.

2.5 Plotting the ggplot2 way

There are different ggplot2 calls to plot two groups of data onto the same plot:

Which one is preferable? Both iris and iris.wide are available in the workspace, so you can experiment in the R Console straight away!

2.5.1 Instructions

Possible Answers

  • Option 1.
  • Option 2.
  • Both are equally preferable.

2.6 Variables to visuals, part 1

So far you’ve seen four different forms of the iris dataset: iris, iris.wide, iris.wide2 and iris.tidy. Don’t let all these different forms confuse you! It’s exactly the same data, just rearranged so that your plotting functions become easier.

To see this in action, consider the plot in the graphics device at right. Which form of the dataset would be the most appropriate to use here?

2.6.1 Instructions

  • Look at the structures of iris, iris.wide and iris.tidy using str().
  • Fill in the ggplot function with the appropriate data frame and variable names. The variable names of the aesthetics of the plot will match the ones you found using the str() command in the previous step.

2.7 Variables to visuals, part 1b

In the last exercise you saw how iris.tidy was used to make a specific plot. It’s important to know how to rearrange your data in this way so that your plotting functions become easier. In this exercise you’ll use functions from the tidyr package to convert iris to iris.tidy.

The resulting iris.tidy data should look as follows:

   Species  Part Measure Value
 1  setosa Sepal  Length   5.1
 2  setosa Sepal  Length   4.9
 3  setosa Sepal  Length   4.7
 4  setosa Sepal  Length   4.6
 5  setosa Sepal  Length   5.0
 6  setosa Sepal  Length   5.4
    ...

You can have a look at the iris dataset by typing head(iris) in the console.

Note: If you’re not familiar with %>%, gather() and separate(), you may want to take the Cleaning Data in R course. In a nutshell, a dataset is called tidy when every row is an observation and every column is a variable. The gather() function moves information from the columns to the rows. It takes multiple columns and gathers them into a single column by adding rows. The separate() function splits one column into two or more columns according to a pattern you define. Lastly, the %>% (or “pipe”) operator passes the result of the left-hand side as the first argument of the function on the right-hand side.

2.7.1 Instructions

You’ll use two functions from the tidyr package:

  • gather() rearranges the data frame by specifying the columns that are categorical variables with a - notation. Complete the command. Notice that only one variable is categorical in iris.
  • separate() splits up the new key column, which contains the former headers, at the period (.). The new column names "Part" and "Measure" are given in a character vector. Don’t forget the quotes.
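A sketch of the completed pipeline, assuming tidyr is installed. The intermediate column name key is taken from the instruction above. Both functions still exist in current tidyr, although its documentation marks them as superseded by newer alternatives:

```r
library(tidyr)

iris.tidy <- iris %>%
  # Gather every column except the categorical one (Species) into key/Value pairs
  gather(key, Value, -Species) %>%
  # Split e.g. "Sepal.Length" at the period into Part = "Sepal", Measure = "Length"
  separate(key, c("Part", "Measure"), "\\.")

head(iris.tidy)
```

The period is escaped as "\\." because the separator is interpreted as a regular expression, where an unescaped period matches any character.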

2.8 Variables to visuals, part 2

Here you’ll take a look at another plot variant. Which of your data frames would be used to produce this plot?

2.8.1 Instructions

  • Look at the heads of iris, iris.wide and iris.tidy using head().
  • Fill in the ggplot function with the appropriate data frame and variable names. The names of the aesthetics of the plot will match with variable names in your dataset. The previous instruction will help you match variable names in datasets with the ones in the plot.

2.9 Variables to visuals, part 2b

You saw previously how you can derive iris.tidy from iris. Now you’ll move on to produce iris.wide.

The head of the iris.wide should look like this in the end:

  Species  Part Length Width
1  setosa Petal    1.4   0.2
2  setosa Petal    1.4   0.2
3  setosa Petal    1.3   0.2
4  setosa Petal    1.5   0.2
5  setosa Petal    1.4   0.2
6  setosa Petal    1.7   0.4
...

You can have a look at the iris dataset by typing head(iris) in the console.

2.9.1 Instructions

  • Before you begin, you need to add a new column called Flower that contains a unique identifier for each row in the data frame. This is because you’ll rearrange the data frame afterwards and you need to keep track of which row, or which specific flower, each value came from. It’s done for you, no need to add anything yourself.
  • gather() rearranges the data frame by specifying the columns that are categorical variables with a - notation. In this case, Species and Flower are categorical. Complete the command.
  • separate() splits up the new key column, which contains the former headers, at the period (.). The new column names "Part" and "Measure" are given in a character vector.
  • The last step is to use spread() to distribute the new Measure column and associated value column into two columns.
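A sketch of the full sequence, assuming tidyr is installed. Note that this version keeps the Flower column and orders rows by flower, so its head() will not match the printed excerpt above line for line:

```r
library(tidyr)

# Unique identifier per row, so values from the same flower can be re-paired later
iris$Flower <- 1:nrow(iris)

iris.wide <- iris %>%
  # Gather everything except the two categorical columns
  gather(key, value, -Species, -Flower) %>%
  # Split e.g. "Petal.Width" into Part = "Petal", Measure = "Width"
  separate(key, c("Part", "Measure"), "\\.") %>%
  # Turn the Measure values (Length, Width) into their own columns
  spread(Measure, value)

head(iris.wide)
```

Without the Flower identifier, spread() could not tell which Length belongs with which Width, since several rows would share the same Species and Part.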

3 Aesthetics

3.1 All about aesthetics, part 1

There are 9 different aesthetics that can be mapped:

  • x: the X-axis position
  • y: the Y-axis position
  • color: the color of dots and the outline of other shapes
  • fill: fill color
  • size: diameter of points, thickness of lines
  • alpha: the transparency, 0-transparent, 1-opaque
  • linetype: line dash pattern
  • label: text on a plot or axes
  • shape: the shape

Let’s apply them to a categorical variable - the cylinders in mtcars, cyl.

(You’ll consider line type when you encounter line plots in the next chapter.)

These are the aesthetics you can consider within aes() in this chapter: x, y, color, fill, size, alpha, label and shape.

In the following exercise you can assume that the cyl column is categorical. It has already been transformed into a factor for you.

3.1.1 Instructions

The mtcars data frame is available in your workspace. For each of the following four plots, use geom_point():

  • Map mpg onto the x aesthetic, and cyl onto the y.
  • Reverse the mappings of the first plot.
  • Map wt onto x, mpg onto y, and cyl onto color.
  • Modify the previous plot by changing the shape argument of the geom to 1 and increase the size to 4. These are attributes that you should specify inside geom_point().

3.2 All about aesthetics, part 2

The color aesthetic typically changes the outside outline of an object and the fill aesthetic is typically the inside shading. However, as you saw in the last exercise, geom_point() is an exception. Here you use color, instead of fill for the inside of the point. But it’s a bit subtler than that.

Which shape to use? The default geom_point() uses shape = 19 (a solid circle with an outline the same colour as the inside). Good alternatives are shape = 1 (hollow) and shape = 16 (solid, no outline). These all use the col aesthetic (don’t forget to set alpha for solid points).

A really nice alternative is shape = 21 which allows you to use both fill for the inside and col for the outline! This is a great little trick for when you want to map two aesthetics to a dot.

What happens when you use the wrong aesthetic mapping? This is a very common mistake! The code from the previous exercise is in the editor. Using this as your starting point, complete the instructions.

3.2.1 Instructions

Note: In the mtcars dataset, cyl and am have been converted to factor for you.

  • Copy & paste the first plot’s code. Change the aesthetics so that cyl maps to fill rather than col.
  • Copy & paste the second plot’s code. In geom_point() change the shape argument to 21 and add an alpha argument set to 0.6.
  • Copy & paste the third plot’s code. In the ggplot() aesthetics, map am to col.

3.3 All about aesthetics, part 3

Now that you’ve got some practice with incrementally building up plots, you can try to do it from scratch! The mtcars dataset is pre-loaded in the workspace.

3.3.1 Instructions

Use ggplot() to create a basic scatter plot. Inside aes(), map wt onto x and mpg onto y. Typically, you would say “mpg described by wt” or “mpg vs wt”, but in aes(), it’s x first, y second. Use geom_point() to make three scatter plots:

  • cyl on size
  • cyl on alpha
  • cyl on shape

Try this last variant:

  • cyl on label. In order to correctly show the text (i.e. label), use geom_text().

3.4 All about attributes, part 1

You can use all the aesthetics as attributes. Let’s see how this works with the aesthetics you used in the previous exercises: x, y, color, fill, size, alpha, label and shape.

This time you’ll use these arguments to set attributes of the plot, not aesthetics. However, there are some pitfalls you’ll have to watch out for: these attributes can overwrite the aesthetics of your plot!

A word about shapes: In the exercise “All about aesthetics, part 2”, you saw that shape = 21 results in a point that has a fill and an outline. Shapes in R can have a value from 0-25. Shapes 0-20 can only accept a color aesthetic, but shapes 21-25 have both a color and a fill aesthetic. See the pch argument in points() for further discussion.

A word about hexadecimal colours: Hexadecimal, literally “related to 16”, is a base-16 alphanumeric counting system. Individual digits come from the ranges 0-9 and A-F. This means there are 256 possible two-digit values (i.e. 00 - FF). Hexadecimal colours use this system to specify a six-digit code for the Red, Green and Blue values (“#RRGGBB”) of a colour (e.g. pure blue: “#0000FF”, black: “#000000”, white: “#FFFFFF”). R can accept hex codes as valid colours.

3.4.1 Instructions

  • You will continue to work with mtcars. Use ggplot() to create a basic scatter plot: map wt onto x, mpg onto y and cyl onto color.
  • Overwrite the color of the points inside geom_point() to my_color. Notice how this cancels out the colors given to the points by the number of cylinders!
  • Starting with plot 2, map cyl to fill instead of col and set the attributes size to 10, shape to 23 and color to my_color inside geom_point().
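A sketch of the three plots. The value of my_color is not given in these notes, so the hex code below is a placeholder, and cyl is converted to a factor here on the assumption that the exercise does the same:

```r
library(ggplot2)

mtcars$cyl <- as.factor(mtcars$cyl)

# Hypothetical value; any valid hex colour works
my_color <- "#4ABEFF"

# Plot 1: cyl mapped to color (an aesthetic)
p1 <- ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point()

# Plot 2: the color attribute in the geom overrides the mapping
p2 <- ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(color = my_color)

# Plot 3: cyl mapped to fill, outline set as an attribute
p3 <- ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_point(size = 10, shape = 23, color = my_color)

print(p3)
```

In plot 2 every point ends up with the same colour, which shows how an attribute set in the geom wins over the aesthetic mapping inherited from ggplot().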

3.5 All about attributes, part 2

You can use all the aesthetics as attributes. Let’s see how this works with the aesthetics you used in the previous exercises: x, y, color, fill, size, alpha, label and shape.

In this exercise you will set all kinds of attributes of the points!

You will continue to work with mtcars.

3.5.1 Instructions

  • Add to the first command: draw points with alpha set to 0.5.
  • Add to the second command: draw points of shape 24 in the color yellow.
  • Add to the third command: draw text with label rownames(mtcars) in the color red. Don’t use geom_point() here! You should get a scatter plot with the names of the cars instead of points.

Note: Remember to specify characters with quotation marks ("yellow", not yellow).

3.6 Going all out

In this exercise, you will gradually add more aesthetics layers to the plot. You’re still working with the mtcars dataset, but this time you’re using more features of the cars. For completeness, here is a list of all the features of the observations in mtcars:

  • mpg – Miles/(US) gallon
  • cyl – Number of cylinders
  • disp – Displacement (cu.in.)
  • hp – Gross horsepower
  • drat – Rear axle ratio
  • wt – Weight (lb/1000)
  • qsec – 1/4 mile time
  • vs – Engine shape (0 = V-shaped, 1 = straight)
  • am – Transmission (0 = automatic, 1 = manual)
  • gear – Number of forward gears
  • carb – Number of carburetors

Notice that adding more aesthetics to your plot is not always a good idea. Adding aesthetic mappings to a plot will increase its complexity, and thus decrease its readability.

3.6.1 Instructions

Note: In this chapter you saw aesthetics and attributes. Variables in a data frame are mapped to aesthetics in aes() (e.g. aes(col = cyl)) within ggplot(). Visual elements are set by attributes in specific geom layers (e.g. geom_point(col = "red")). Don’t confuse these two things - here you’re focusing on aesthetic mappings.

  • Draw a scatter plot of mtcars with mpg on the x-axis, qsec on the y-axis and factor(cyl) as colors.
  • Expand the previous plot to include factor(am) as the shape of the points.
  • Expand the previous plot to include the ratio of horsepower to weight (i.e. (hp/wt)) as the size of the points.
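The three plots, each adding one more mapping. The object names are hypothetical:

```r
library(ggplot2)

# 1 - mpg vs qsec, colored by number of cylinders
p_a <- ggplot(mtcars, aes(x = mpg, y = qsec, col = factor(cyl))) +
  geom_point()

# 2 - add transmission type as point shape
p_b <- ggplot(mtcars, aes(x = mpg, y = qsec, col = factor(cyl),
                          shape = factor(am))) +
  geom_point()

# 3 - add the horsepower-to-weight ratio as point size
p_c <- ggplot(mtcars, aes(x = mpg, y = qsec, col = factor(cyl),
                          shape = factor(am), size = (hp / wt))) +
  geom_point()

print(p_c)
```

The last plot encodes five variables at once, which illustrates the readability cost mentioned above.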

3.7 Aesthetics for categorical and continuous variables

Many of the aesthetics can be mapped onto continuous or categorical variables, but some are restricted to categorical data. Which aesthetics are they?

3.7.1 Instructions

Possible Answers

  • color & fill
  • alpha & size
  • label & shape
  • alpha & label
  • x & y

3.8 Position

Bar plots suffer from their own issues of overplotting, as you’ll see here. Use the "stack", "fill" and "dodge" positions to reproduce the plot in the viewer.

The ggplot2 base layers (data and aesthetics) have already been coded; they’re stored in a variable cyl.am. It looks like this:

3.8.1 Instructions

  • Add a geom_bar() call to cyl.am. By default, the position will be set to "stack".
  • Fill in the second ggplot command. Explicitly set position to "fill" inside geom_bar().
  • Fill in the third ggplot command. Set position to "dodge".
  • The position = "dodge" version seems most appropriate. Finish off the fourth ggplot command by completing the three scale_ functions:
    • scale_x_discrete() takes as its only argument the x-axis label: "Cylinders".
    • scale_y_continuous() takes as its only argument the y-axis label: "Number".
    • scale_fill_manual() fixes the legend. The first argument is the title of the legend: "Transmission". Next, values and labels are set to predefined values for you. These are the colors and the labels in the legend.
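A sketch of the four plots. The definition of cyl.am and the values of val and lab are not shown in these notes, so all three are assumptions here. The scale titles are passed with name = for clarity:

```r
library(ggplot2)

# Assumed base layers: data and aesthetics, no geom yet
cyl.am <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))

p_stack <- cyl.am + geom_bar()                    # default position is "stack"
p_fill  <- cyl.am + geom_bar(position = "fill")   # proportions within each bar
p_dodge <- cyl.am + geom_bar(position = "dodge")  # bars side by side

# Hypothetical legend colors and labels (am: 0 = automatic, 1 = manual)
val <- c("#E41A1C", "#377EB8")
lab <- c("Automatic", "Manual")

p_final <- p_dodge +
  scale_x_discrete(name = "Cylinders") +
  scale_y_continuous(name = "Number") +
  scale_fill_manual(name = "Transmission", values = val, labels = lab)

print(p_final)
```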

3.9 Setting a dummy aesthetic

In the last chapter you saw that all the visible aesthetics can serve as attributes and aesthetics, but I very conveniently left out x and y. That’s because although you can make univariate plots (such as histograms, which you’ll get to in the next chapter), a y-axis will always be provided, even if you didn’t ask for it.

In the base package you can make univariate plots with stripchart() directly and it will take care of a fake y axis for us. Since this is univariate data, there is no real y axis.

You can get the same thing in ggplot2, but it’s a bit more cumbersome. The only reason you’d really want to do this is if you were making many plots and you wanted them to be in the same style, or you wanted to take advantage of an aesthetic mapping (e.g. colour).

3.9.1 Instructions

  • Try to run ggplot(mtcars, aes(x = mpg)) + geom_point() in the console. x is only one of the two essential aesthetics for geom_point(), which is why you get an error message.
    • 1 - To fix this, map a value, e.g. 0, instead of a variable, onto y. Use geom_jitter() to avoid having all the points on a horizontal line.
    • 2 - To make everything look nicer, copy & paste the code for plot 1 and change the limits of the y axis using the appropriate scale_y_...() function. Set the limits argument to c(-2, 2).
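The two plots might look like this. The object names are hypothetical:

```r
library(ggplot2)

# 1 - map the constant 0 onto y and jitter so points do not sit on one line
p_dummy <- ggplot(mtcars, aes(x = mpg, y = 0)) +
  geom_jitter()

# 2 - widen the y axis so the jittered band looks narrower
p_limits <- ggplot(mtcars, aes(x = mpg, y = 0)) +
  geom_jitter() +
  scale_y_continuous(limits = c(-2, 2))

print(p_limits)
```

The y values carry no information here; they only spread the points out vertically.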

3.10 Overplotting 1 - Point shape and transparency

In the previous section you saw that there are lots of ways to use aesthetics. Perhaps too many, because although they are possible, they are not all recommended. Let’s take a look at what works and what doesn’t.

So far you’ve focused on scatter plots since they are intuitive, easily understood and very common. A major consideration in any scatter plot is dealing with overplotting. You’ll encounter this topic again in the geometries layer, but you can already make some adjustments here.

You’ll have to deal with overplotting when you have:

  • Large datasets,
  • Imprecise data, so that points are not clearly separated on your plot (you saw this in the video with the iris dataset),
  • Interval data (i.e. data appears at fixed values), or
  • Aligned data values on a single axis.

One very common technique that I’d recommend you always use when you have solid shapes is alpha blending (i.e. adding transparency). An alternative is to use hollow shapes. These are adjustments to make before even worrying about positioning. This addresses the first point above, which you’ll see again in the next exercise.

3.10.1 Instructions

  • Begin by making a basic scatter plot of mpg (y) vs. wt (x), map cyl to color and make the size = 4. cyl has already been converted to a factor variable for you.
  • Modify the above plot to set shape to 1. This allows for hollow circles.
  • Modify the first plot to set alpha to 0.6.

3.11 Overplotting 2 - alpha with large datasets

In a previous exercise we defined four situations in which you’d have to adjust for overplotting. You’ll consider the first and the last of them here with the diamonds dataset:

  • Large datasets.
  • Aligned data values on a single axis

3.11.1 Instructions

  • The diamonds data frame is available in the ggplot2 package. Begin by making a basic scatter plot of price (y) vs. carat (x) and map clarity onto color.
  • Copy the above functions and set the alpha to 0.5. This is a good start to dealing with the large dataset.
  • Align all the diamonds within a clarity class, by plotting carat (y) vs. clarity (x). Map price onto color. alpha should still be 0.5.
  • In the previous plot, all the individual values line up on a single axis within each clarity category, so you have not overcome overplotting. Modify the above plot to use the position = "jitter" inside geom_point().
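The four plots can be sketched as follows. The object names are hypothetical:

```r
library(ggplot2)

# 1 - basic scatter plot, clarity on color
p_basic <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point()

# 2 - add transparency to cope with the large number of points
p_alpha <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.5)

# 3 - carat within each clarity class, price on color: points line up per class
p_aligned <- ggplot(diamonds, aes(x = clarity, y = carat, color = price)) +
  geom_point(alpha = 0.5)

# 4 - jitter spreads the points around each clarity position
p_jitter <- ggplot(diamonds, aes(x = clarity, y = carat, color = price)) +
  geom_point(alpha = 0.5, position = "jitter")

print(p_jitter)
```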

4 Geometries

4.1 Scatter plots and jittering (1)

You already saw a few examples using geom_point() where the result was not a scatter plot. For example, in the plot given below, a continuous variable, wt, is mapped to the y aesthetic, and a categorical variable, cyl, is mapped to the x aesthetic. This also leads to over-plotting, since the points are arranged on a single x position. You previously dealt with overplotting by setting position = "jitter" inside geom_point(). Let’s look at some other solutions here.

4.1.1 Instructions

Beginning with the code for the plot in the viewer (given), make these modifications

  • Use a shortcut geom, geom_jitter(), instead of geom_point().
  • Unfortunately, the width of the jitter is a bit too wide to be useful. Adjust this by setting the argument width = 0.1 inside geom_jitter().
  • Finally, return to geom_point() and set the position argument here to position_jitter(0.1), which will set the jittering width directly inside a points layer.

Note: For convenience, you could have saved the data and aesthetic layers as a ggplot2 object and re-used it in all solutions. We’ve made each plot explicit so that you can see all plotting instructions.
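The starting plot is not reproduced in these notes. The sketch below assumes it maps a factor version of cyl to x and wt to y, as described in the introduction, and then applies the three modifications:

```r
library(ggplot2)

mtcars$cyl <- as.factor(mtcars$cyl)

# Assumed starting point: points stacked on three x positions
p_start <- ggplot(mtcars, aes(x = cyl, y = wt)) +
  geom_point()

# 1 - shortcut geom with default jitter width
p_jit_default <- ggplot(mtcars, aes(x = cyl, y = wt)) +
  geom_jitter()

# 2 - narrower jitter
p_jit_narrow <- ggplot(mtcars, aes(x = cyl, y = wt)) +
  geom_jitter(width = 0.1)

# 3 - same width, set through the position argument of geom_point()
p_jit_position <- ggplot(mtcars, aes(x = cyl, y = wt)) +
  geom_point(position = position_jitter(0.1))

print(p_jit_position)
```

The first argument of position_jitter() is the width, so plots 2 and 3 use the same horizontal spread. Jittering is random, so the exact point positions change on every run.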

4.2 Scatter plots and jittering (2)

In the chapter on aesthetics you saw different ways in which you will have to compensate for overplotting.

Another issue is when you have interval data. This can be continuous data measured on an interval scale (e.g. 1, 2, 3, …), as opposed to a numeric scale (e.g. 1.1, 1.4, 1.5, …), or two categorical (e.g. factor) variables, which are just type integer under-the-hood.

In such a case you’ll have a small, defined number of intersections between the two variables.

You will be using the Vocab dataset. The Vocab dataset contains information about the years of education and integer score on a vocabulary test for over 21,000 individuals based on US General Social Surveys from 1972-2004.

4.2.1 Instructions

  • The Vocab data frame has been loaded for you. Both the education and vocabulary variables are classified as integers. You can imagine these as factor variables, but here, integers are more convenient to work with. First, get familiar with the dataset by looking at its structure with str().
  • Make a basic scatter plot of vocabulary (y) vs. education (x). Here it becomes apparent that you have issues with overplotting because of the integer scales.
  • Use geom_jitter() instead of geom_point().
  • Using the jittered plot, set alpha to 0.2 (very low).
  • Using the jittered plot, set shape to 1.

4.3 Histograms

Histograms are one of the most common and intuitive ways of showing distributions. In this exercise you’ll use the mtcars data frame to explore typical variations of simple histograms. But first, some background:

The x axis/aesthetic: The documentation for geom_histogram() states the argument stat = "bin" as a default. Recall that histograms cut up a continuous variable into discrete bins - that’s what the stat “bin” is doing. You always get 30 evenly-sized bins by default, which is specified with the default argument binwidth = range/30. This is a pretty good starting point if you don’t know anything about the variable being plotted and want to start exploring.

The y axis/aesthetic: geom_histogram() only requires one aesthetic: x. But there is clearly a y axis on your plot, so where does it come from? Actually, there is a variable mapped to the y aesthetic; it’s called ..count.. (the dots are part of the name). When geom_histogram() executed the binning statistic (see above), it not only cut up the data into discrete bins, but it also counted how many values are in each bin. So there is an internal data frame where this information is stored. The .. notation calls the variable count from this internal data frame. This is what appears on the y aesthetic. But it gets better! The density has also been calculated. This is the proportional frequency of this bin in relation to the whole data set. You use ..density.. to access this information.

4.3.1 Instructions

  • Use the mtcars data frame and make a univariate histogram by mapping mpg onto the x aesthetic. Use geom_histogram() for the geom layer.
  • Take plot 1 and manually create 1-unit wide bins with the binwidth = 1 argument in geom_histogram().
  • Take plot 2, and map ..density.. onto the y aesthetic (i.e. inside an aes()) inside geom_histogram(). You’ll have two aes() functions: one inside ggplot() and another inside geom_histogram(). (See the intro text for a discussion of ..density..).
  • Take plot 3 and set the attribute fill, the inside of the bars, to the value "#377EB8" in geom_histogram(). This should not appear in aes(), since it’s an attribute, not an aesthetic mapping.
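
One possible way to write the four plots is sketched below. The names p1 to p4 are illustrative, and after_stat(density) is used as the newer spelling of ..density.. (available in ggplot2 3.3.0 and later); with an older ggplot2, ..density.. would be used instead.

```r
library(ggplot2)

# 1 - Univariate histogram with the default binning
p1 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram()

# 2 - Bins that are 1 unit wide
p2 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 1)

# 3 - Density instead of count on the y aesthetic
p3 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1)

# 4 - fill is set as an attribute, outside aes()
p4 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "#377EB8")

p4
```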

4.4 Position

In the previous chapter you saw that there are lots of ways to position scatter plots. Likewise, the geom_bar() and geom_histogram() geoms also have a position argument, which you can use to specify how to draw the bars of the plot.

Three position arguments will be introduced here:

  • stack: place the bars on top of each other. Counts are used. This is the default position.
  • fill: place the bars on top of each other, but this time use proportions.
  • dodge: place the bars next to each other. Counts are used.

In this exercise you’ll draw the total count of cars having a given number of cylinders (cyl), according to manual or automatic transmission type (am).

Since, in the built-in mtcars data set, cyl and am are stored as numeric values, you have to convert them into factor variables.

4.4.1 Instructions

  • Using mtcars, map cyl onto the x aesthetic and am onto fill. Use geom_bar() to make a bar plot.
  • Take plot 1 and explicitly set position = "stack" in geom_bar(). This doesn’t change anything, does it? It was mentioned above that "stack" is the default.
  • Take plot 2 and set position = "fill" in geom_bar().
  • Take plot 3 and set position = "dodge" in geom_bar().
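
A minimal sketch of the three positions follows. It works on a copy of mtcars called mtcars_f (an illustrative name) so that the built-in data set is left untouched; the exercise itself converts the columns inside mtcars.

```r
library(ggplot2)

# Convert cyl and am to factors on a copy of mtcars
mtcars_f <- mtcars
mtcars_f$cyl <- factor(mtcars_f$cyl)
mtcars_f$am <- factor(mtcars_f$am)

base <- ggplot(mtcars_f, aes(x = cyl, fill = am))

# Bars on top of each other, counts on the y axis (the default)
p_stack <- base + geom_bar(position = "stack")

# Bars on top of each other, proportions on the y axis
p_fill <- base + geom_bar(position = "fill")

# Bars next to each other, counts on the y axis
p_dodge <- base + geom_bar(position = "dodge")

p_dodge
```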

4.5 Overlapping bar plots

So far you’ve seen three different positions for bar plots: stack (the default), dodge (preferred), and fill (to show proportions).

However, you can go one step further by adjusting the dodging, so that your bars partially overlap each other. For this example you’ll again use the mtcars dataset. Like last time you have to convert cyl and am into factors inside mtcars.

Instead of using position = "dodge" you’re going to use position_dodge(), like you did with position_jitter() in the Scatter plots and jittering (1) exercise. Here, you’ll save this as an object, posn_d, so that you can easily reuse it.

Remember, the reason you want to use position_dodge() (and position_jitter()) is to specify how much dodging (or jittering) you want.

4.5.1 Instructions

  • The last plot from the last exercise has been provided for you.
  • Define a new object called posn_d by calling position_dodge() with the argument width = 0.2.
  • Take plot 1 and make slightly overlapping bars by using the position = posn_d argument.
  • Take plot 3 and set alpha = 0.6 to see the overlap in bars.
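
The instructions above could look like this in code. It is a sketch that again uses an illustrative copy, mtcars_f, with cyl and am converted to factors.

```r
library(ggplot2)

mtcars_f <- mtcars
mtcars_f$cyl <- factor(mtcars_f$cyl)
mtcars_f$am <- factor(mtcars_f$am)

# A reusable position object: dodge by only 0.2, so the bars overlap
posn_d <- position_dodge(width = 0.2)

# alpha = 0.6 makes the overlapping parts of the bars visible
p_overlap <- ggplot(mtcars_f, aes(x = cyl, fill = am)) +
  geom_bar(position = posn_d, alpha = 0.6)

p_overlap
```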

4.6 Overlapping histograms

Overlapping histograms pose similar problems to overlapping bar plots, but there is a unique solution here: a frequency polygon.

This is a geom specific to binned data that draws a line connecting the value of each bin. Like geom_histogram(), it takes a binwidth argument and uses stat = "bin" by default; its default position is "identity".

4.6.1 Instructions

  • The code for a basic histogram of mpg, which you’ve already seen, is provided. Extend the code to map cyl onto fill inside aes().
  • The default position for histograms is "stack". Copy your solution to the first exercise and set the position for the histogram bars to "identity".
  • Using the same data and base layers as in the previous two plots, create a plot with a geom_freqpoly(). Because you’re no longer working with bars, change the aes() function: cyl should be mapped onto col, not onto fill. This will correctly color the geom.
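
The three plots can be sketched as below. The provided histogram code is not shown in this text, so binwidth = 1 is an assumption, as is the factor-converted copy mtcars_f.

```r
library(ggplot2)

mtcars_f <- mtcars
mtcars_f$cyl <- factor(mtcars_f$cyl)

# 1 - Histogram with cyl mapped onto fill (bars are stacked by default)
p_hist_stack <- ggplot(mtcars_f, aes(x = mpg, fill = cyl)) +
  geom_histogram(binwidth = 1)

# 2 - Same plot, but bars are drawn from the baseline and overlap
p_hist_identity <- ggplot(mtcars_f, aes(x = mpg, fill = cyl)) +
  geom_histogram(binwidth = 1, position = "identity")

# 3 - Frequency polygon: lines have no fill, so cyl is mapped onto col
p_freq <- ggplot(mtcars_f, aes(x = mpg, col = cyl)) +
  geom_freqpoly(binwidth = 1)

p_freq
```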

4.7 Bar plots with color ramp, part 1

In this example of a bar plot, you’ll fill each segment according to an ordinal variable. The best way to do that is with a sequential color series.

You’ll be using the Vocab dataset from earlier. Since this is a much larger dataset with more categories, you’ll also compare it to a simpler dataset, mtcars. Both datasets contain ordinal variables.

4.7.1 Instructions

  • The bar plot from the previous exercise is provided - cyl is on the x-axis and filled according to transmission type, am. Notice how you can set the color palette used to fill the bars with scale_fill_brewer(). For a full list of possible color sets, have a look at ?brewer.pal.
  • Explore Vocab with str(). Notice that the education and vocabulary variables have already been converted to factor variables for you.
  • Make a filled bar chart with the Vocab dataset.
    • Map education to x and vocabulary to fill.
    • Inside geom_bar(), make sure to set position = "fill".
    • Allow color brewer to choose a default color palette by using the appropriate scale function, without arguments. Notice how this generates a warning message and an incomplete plot.

4.8 Bar plots with color ramp, part 2

In the previous exercise, you ended up with an incomplete bar plot. This was because scale_fill_brewer() defaults to the sequential RColorBrewer palette "Blues", which contains only 9 colours. Since you have 11 categories, your plot looked strange.

In this exercise, you’ll manually create a color palette that can generate all the colours you need. To do this you’ll use a function called colorRampPalette().

The input is a character vector of 2 or more colour values, e.g. "#FFFFFF" (white) and "#0000FF" (pure blue). (See All about attributes, part 1 for a discussion on hexadecimal codes).

The output is itself a function! So when you assign it to an object, that object should be used as a function. To see what we mean, execute the following three lines:
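
The lines themselves are not reproduced in this text. A minimal sketch that is consistent with the description (an assumption, not necessarily the original code) is:

```r
# colorRampPalette() lives in grDevices, which is attached by default
new_col <- colorRampPalette(c("#FFFFFF", "#0000FF"))

# new_col is itself a function
class(new_col)

# Ask for 4 colours running from white to pure blue
new_col(4)
```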

new_col() is a function that takes one argument: the number of colours you want to interpolate. You want to use nicer colours, so we’ve assigned the entire "Blues" colour palette from the RColorBrewer package to the character vector blues.

4.8.1 Instructions

  • Like in the example code above, create a new function called blue_range that uses colorRampPalette() to interpolate across all 9 values of the blues character vector.
  • Take the plot code from the last exercise (provided), and change scale_fill_brewer() to be scale_fill_manual(). Set the argument values = blue_range(11) inside scale_fill_manual().
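
A sketch of both steps follows. It assumes that blues was created with RColorBrewer::brewer.pal(9, "Blues") and that Vocab comes from the carData package; the text states neither explicitly. The factor conversion is repeated here so that the example stands on its own.

```r
library(ggplot2)

# Assumption: this is how the blues vector was built
blues <- RColorBrewer::brewer.pal(9, "Blues")

# A palette function that interpolates across all 9 blues
blue_range <- colorRampPalette(blues)

# Assumption: Vocab comes from the carData package
data("Vocab", package = "carData")
Vocab$education <- factor(Vocab$education)
Vocab$vocabulary <- factor(Vocab$vocabulary)

# Filled bar chart with 11 manually generated colours
p_vocab <- ggplot(Vocab, aes(x = education, fill = vocabulary)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = blue_range(11))

p_vocab
```

Swapping scale_fill_manual(values = blue_range(11)) for scale_fill_brewer() gives the version from the previous exercise, in which the 9-colour palette runs out of colours.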

4.9 Overlapping histograms (2)

As a last example of bar plots, you’ll return to histograms (which you now see are just a special type of bar plot). You saw a nice trick in a previous exercise of how to slightly overlap bars, but now you’ll see how to overlap them completely. This would be nice for multiple histograms, as long as there are not too many different overlaps!

You’ll make a histogram using the mpg variable in the mtcars data frame.

4.9.1 Instructions

  • A basic histogram plot is provided.
  • Take plot 1 and map am onto fill within the aes() function. The default position is "stack".
  • Take plot 2 and add the position argument within geom_histogram(). Set it to "dodge".
  • Take plot 3 and change the position argument to "fill". In this case, none of these positions really work well, because it’s difficult to compare the distributions directly.
  • Take plot 4 and change the position argument to "identity" and set alpha = 0.4. This produces overlapping bars.
  • Take plot 5 and change the aesthetic mapping. Map cyl onto fill.
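
The last two steps might look like the sketch below. The provided histogram is not shown in this text, so binwidth = 1 and the factor-converted copy mtcars_f are assumptions.

```r
library(ggplot2)

mtcars_f <- mtcars
mtcars_f$cyl <- factor(mtcars_f$cyl)
mtcars_f$am <- factor(mtcars_f$am)

# Completely overlapping, semi-transparent histograms by transmission type
p_am <- ggplot(mtcars_f, aes(x = mpg, fill = am)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)

# The same idea with three groups: cyl mapped onto fill
p_cyl <- ggplot(mtcars_f, aes(x = mpg, fill = cyl)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)

p_cyl
```

Replacing "identity" with "stack", "dodge" or "fill" in the same call gives the earlier variants for comparison.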

4.10 Line plots

In the video you saw how to make line plots using time series data. To explore this topic, you’ll use the economics data frame, which contains time series for unemployment and population statistics from the Federal Reserve Bank of St. Louis in the US. The data is contained in the ggplot2 package.

To begin with, you can look at how the number of unemployed people and the unemployment rate (the number of unemployed people as a proportion of the population) change over time.

In the next exercises, you’ll explore how to add embellishments to the line plots, such as recession periods.

4.10.1 Instructions

  • Print out the head() of the economics data frame.
  • Use the economics data frame to plot date on the x axis and unemploy on the y axis. Use geom_line().
  • Copy, paste and adjust the code for the previous instruction: instead of unemploy, plot unemploy/pop to represent the fraction of the total population that is unemployed.
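
A minimal sketch of these steps, using the economics data frame that ships with ggplot2 (the object names p_count and p_rate are illustrative):

```r
library(ggplot2)

# First rows of the data set
head(economics)

# Number of unemployed people over time
p_count <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

# Unemployed people as a fraction of the total population
p_rate <- ggplot(economics, aes(x = date, y = unemploy / pop)) +
  geom_line()

p_rate
```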

4.11 Periods of recession

By themselves, time series often contain enough valuable information, but you always want to maximize the number of variables you can show in a plot. This allows you (and your viewers) to begin making comparisons between those variables that would otherwise be difficult or impossible.

Here, you’ll add shaded regions to the background to indicate recession periods. How do unemployment rate and recession period interact with each other?

In addition to the economics dataset from before, you’ll also use the recess dataset for the periods of recession. The recess data frame contains 2 variables: the begin period of the recession and the end. It’s already available in your workspace.

4.11.1 Instructions

Expand the command from the previous exercise with geom_rect(). You will use this geom layer to draw rectangles across the recession periods. There are a few pitfalls here:

  • geom_rect() uses the recess dataset, so pass this directly as data = recess inside geom_rect().
  • The geom_rect() command shouldn’t inherit aesthetics from the base ggplot() command it belongs to. It would result in an error, since you’re using a different dataset and it doesn’t contain unemploy or pop. That’s why you should specify inherit.aes = FALSE in geom_rect().
  • geom_rect() needs four aesthetics: xmin, xmax, ymin and ymax. These should be set to begin, end and -Inf, +Inf, respectively. Define them within aes().
  • The rectangles you add will be dark grey and opaque by default. Set fill to "red" and alpha to 0.2 to improve this. Define them outside aes().
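
The four points above can be combined as in the following sketch. The recess data frame from the exercise is not included in this text, so a small hypothetical stand-in with two periods is built here; its dates follow commonly cited NBER peak and trough months and are for illustration only.

```r
library(ggplot2)

# Hypothetical stand-in for the course's recess data frame
recess <- data.frame(
  begin = as.Date(c("1973-11-01", "1981-07-01")),
  end   = as.Date(c("1975-03-01", "1982-11-01"))
)

p_recess <- ggplot(economics, aes(x = date, y = unemploy / pop)) +
  # The rectangles use their own data and must not inherit x and y
  geom_rect(
    data = recess,
    aes(xmin = begin, xmax = end, ymin = -Inf, ymax = Inf),
    inherit.aes = FALSE,
    fill = "red", alpha = 0.2
  ) +
  # The line is added afterwards, so it is drawn on top of the rectangles
  geom_line()

p_recess
```

Using -Inf and Inf for ymin and ymax lets the rectangles span the full height of the panel regardless of the y scale.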
