---
title: "Data Visualization in R"
output:
  html_notebook:
    highlight: haddock
    number_sections: yes
    theme: cerulean
    toc: yes
    toc_depth: 2
  html_document:
    toc: yes
    toc_depth: 2
---

```{r setup, echo = FALSE}
knitr::opts_chunk$set(eval = FALSE)
```

# A Quick Introduction to Base R Graphics

## Creating an exploratory plot array

In this exercise, you'll construct a simple exploratory plot from a data frame that gives values for three variables, recorded over two winter heating seasons. The variables are:

- `Temp`: a measure of the outside temperature during one week
- `Gas`: the amount of heating gas consumed during that week
- `Insul`: a categorical variable with two values, indicating whether the measurements were made before or after an insulation upgrade was made to the house

### Instructions

- Load the `MASS` package to make the `whiteside` data frame available.
- Apply the `plot()` function to the `whiteside` data frame.

```{r}
# Load MASS package

# Plot whiteside data

```

## Creating an explanatory scatterplot

In contrast to the exploratory analysis plot you created in the previous exercise, this exercise asks you to create a simple explanatory scatterplot, suitable for sharing with others.

Here, it is important to make editorial choices in constructing this plot, rather than depending entirely on default options. In particular, the important editorial aspects of this plot are: first, the variables to be plotted, and second, the axis labels, which are specified as strings to the `xlab` and `ylab` arguments to the `plot()` function.

### Instructions

- Use the `plot()` function to construct a scatterplot of the heating gas consumption, `Gas`, versus the outside temperature, `Temp`, from the `whiteside` data frame. Label the x- and y-axes to indicate the variables in the plot (i.e., `"Outside temperature"` and `"Heating gas consumption"`, respectively).

```{r}
# Plot Gas vs. Temp

```

## The plot() function is generic

One of the key features of the `plot()` function is that it is generic, meaning that the results of applying the function depend on the nature of the object to which it is applied.

In the first exercise of this chapter, applying the `plot()` function to the `whiteside` data frame resulted in a plot array. Here, we obtain a fundamentally different kind of result when we apply the same function to `Insul`, a factor variable from the same data frame.

### Instructions

- Apply the `plot()` function to the `Insul` variable from the `whiteside` data frame.

```{r}
# Apply the plot() function to Insul

```

## Adding details to a plot using point shapes, color, and reference lines

Adding details to your explanatory plots can help emphasize certain aspects of your data. For example, by specifying the `pch` and `col` arguments to the `plot()` function, you can add different point shapes and colors to show how different variables or subsets of your data relate to each other. In addition, you can add a new set of points to your existing scatterplot with the `points()` function, and add reference lines with the `abline()` function.

This exercise asks you to use these methods to create an enhanced scatterplot that effectively shows how three variables in the `Cars93` data frame relate to each other. These variables are:

- `Price`: the average sale price for a car
- `Max.Price`: the highest recorded price for that car
- `Min.Price`: the lowest recorded price for that car

### Instructions

- Load the `MASS` package to make the `Cars93` data frame available.
- Use the `plot()` function to create a scatterplot of the `Max.Price` variable versus the `Price` variable, specifying the `pch` and `col` parameters so the data points are represented as red solid triangles. The `pch` value to plot solid triangles is 17.
- Use the `points()` function to add a second set of points to your scatterplot, representing `Min.Price` versus `Price`, where the new data points are represented as blue solid circles. The `pch` value for solid circles is 16.
- Use the `abline()` function to add a dashed equality reference line (i.e., a line with y-intercept 0 and slope 1). See `?abline` to learn what the arguments `a` and `b` refer to.

```{r}
# Load the MASS package

# Plot Max.Price vs. Price as red triangles

# Add Min.Price vs. Price as blue circles

# Add an equality reference line with abline()
abline(a = ___, b = ___, lty = 2)
```

## Creating multiple plot arrays

You can plot multiple graphs on a single pane using the `par()` function with its `mfrow` parameter. For example, `par(mfrow = c(1, 2))` creates a plot array with 1 row and 2 columns, allowing you to view two graphs side-by-side. This way, you can compare and contrast different datasets or different views of the same dataset. This exercise asks you to compare two views of the `Animals2` dataset from the `robustbase` package, differing in how its variables are represented.

The objective of this exercise is to emphasize that the original representation of the variables that we have in a dataset is not always the best one for visualization or analysis. By representing the original variables in log scale, for example, we can better see and understand the data.

### Instructions

- Load the `robustbase` package to make the `Animals2` data frame available.
- Use the `par()` function and set the `mfrow` parameter to create a side-by-side plot array with 1 row and 2 columns.
- Use the `plot()` function to create a scatterplot of the variables `brain` versus `body` from this data frame, without specifying additional arguments.
- See what happens when you run `title("Original representation")` after your first call to `plot()`.
- Use the `plot()` function again, but this time with `log = "xy"`, to generate a plot of both the `x` and `y` variables in log scale.
- Use the `title()` function to add `"Log-log plot"` as the title to the second plot.

```{r}
# Load the robustbase package

# Set up the side-by-side plot array

# First plot: brain vs. body in its original form

# Add the first title

# Second plot: log-log plot of brain vs. body

# Add the second title

```

## Avoid pie charts

The same dataset can be displayed or summarized in many different ways, but some are much more suitable than others.

Despite their general popularity, pie charts are often a poor choice. Though R allows pie charts with the `pie()` function, even the help file for this function argues against their use. Specifically, the help file includes a "Note" that begins with the words: "Pie charts are a very bad way of displaying information."

Bar charts are a recommended alternative and, in this exercise, you'll see why.

### Instructions

- Load the `insuranceData` package and use the `data()` function to load the `dataCar` data frame from this package.
- Use the `par()` function and set the `mfrow` parameter to create a side-by-side plot array with 1 row and 2 columns.
- Use the `table()` function and the `sort()` function to create a table of counts of the distinct levels of the `veh_body` variable in the `dataCar` data frame, in decreasing order. Call this table `tbl`.
- Pass `tbl` to the `pie()` function to create a pie chart representation of this data as the left-hand plot. Use `title()` to title this plot `"Pie chart"`.
- Similarly, use the `barplot()` and `title()` functions to create a barplot representation of the same data, specifying `las = 2` to make both sets of labels perpendicular to the axes, and using `cex.names = 0.5` to make the name labels half the default size. Title this plot `"Bar chart"`.
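
Before working with `dataCar`, it may help to see the same pie-versus-bar comparison on data that ships with R. The following is a minimal sketch, not the exercise solution: it assumes the built-in `mtcars` data frame, and the names `old_par` and `cyl_counts` are illustrative.

```{r}
# Illustrative sketch only: uses the built-in mtcars data, not dataCar

# Set up a side-by-side plot array, remembering the previous settings
old_par <- par(mfrow = c(1, 2))

# Count cars by number of cylinders, largest group first
cyl_counts <- sort(table(mtcars$cyl), decreasing = TRUE)

# Left-hand plot: pie chart of the counts
pie(cyl_counts)
title("Pie chart")

# Right-hand plot: bar chart of the same counts
barplot(cyl_counts, las = 2, cex.names = 0.5)
title("Bar chart")

# Restore the previous graphics settings
par(old_par)
```

With only three categories the difference is modest; it tends to become more pronounced as the number of levels grows, which is the situation the exercise explores.
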
```{r}
# Load the insuranceData package

# Use the data() function to get the dataCar data frame

# Set up a side-by-side plot array

# Create a table of veh_body record counts and sort
tbl <- sort(table(___), decreasing = ___)

# Create the pie chart and give it a title

# Create the barplot with perpendicular, half-sized labels
barplot(___, las = ___, cex.names = ___)

# Add a title

```

# Different Plot Types

## The hist() and truehist() functions

Histograms are probably the best-known way of looking at how the values of a numerical variable are distributed over their range, and R provides several different histogram implementations.

The purpose of this exercise is to introduce two of these:

- `hist()` is part of base R and its default option yields a histogram based on the number of times a record falls into each of the bins on which the histogram is based.
- `truehist()` is from the `MASS` package and scales these counts to give an estimate of the probability density.

### Instructions

- Use the `par()` function to set the `mfrow` parameter for a side-by-side array of two plots.
- Use the `hist()` function to generate a histogram of the `Horsepower` variable from the `Cars93` data frame. Set its `main` argument equal to the title of the plot, `"hist() plot"`.
- Use the `truehist()` function to generate an alternative histogram of the same variable. Title this plot `"truehist() plot"` by specifying its `main` argument accordingly.

```{r}
# Set up a side-by-side plot array

# Create a histogram of counts with hist()

# Create a normalized histogram with truehist()

```

## Density plots as smoothed histograms

While they are probably not as well known as the histogram, density estimates may be regarded as smoothed histograms, designed to give a better estimate of the density function for a random variable.

In this exercise, you'll use the `ChickWeight` dataset, which contains a collection of chicks' weights. You will first select the chicks that are 16 days old (the `Time` variable in `ChickWeight` records the number of days since birth).
Then, you'll create a histogram using the `truehist()` function, and add its density plot on top, using the `lines()` and `density()` functions with their default options. The density plot of this type of variable is often expected to conform approximately to the bell-shaped curve, otherwise known as the Gaussian distribution. Let's find out whether that's the case for this dataset.

### Instructions

- Create the variable `index16` using the `which()` function that selects records from the `ChickWeight` data frame with `Time` equal to 16.
- Create the variable `weights` that gives the weights of the 16-day-old chicks.
- Use the `truehist()` function to generate a histogram from `weights`.
- Use the `lines()` and `density()` functions to overlay a density plot of the `weights` values on the histogram.

```{r}
# Create index16, pointing to 16-day chicks

# Get the 16-day chick weights

# Plot the normalized histogram

# Add the density curve to the histogram

```

## Using the qqPlot() function to see many details in data

A practical limitation of both histograms and density estimates is that, if we want to know whether the Gaussian distribution assumption is reasonable for our data, it is difficult to tell.

The quantile-quantile plot, or QQ-plot, is a useful alternative: we sort our data, plot it against a specially-designed x-axis based on our reference distribution (e.g., the Gaussian "bell curve"), and look to see whether the points lie approximately on a straight line. In R, several QQ-plot implementations are available, but the most convenient one is the `qqPlot()` function in the `car` package.

The first part of this exercise applies this function to the 16-day chick weight data considered in the last exercise, to show that the Gaussian distribution appears to be reasonable here.
The second part of the exercise applies this function to another variable where the Gaussian distribution is obviously a poor fit, but the results also show the presence of repeated values (flat stretches in the plot) and portions of the data range where there are no observations (vertical "jumps" in the plot).

### Instructions

- Load the `car` package to make the `qqPlot()` function available for use.
- Create the variable `index16` using the `which()` function that selects records from the `ChickWeight` data frame with `Time` equal to 16.
- Create the variable `weights` that gives the weights of the 16-day-old chicks.
- Apply the `qqPlot()` function to the `weights` data, noting that almost all of the points fall within the confidence intervals around the reference line, indicating a reasonable conformance with the Gaussian distribution for this data sequence.
- Apply the `qqPlot()` function to the `tax` variable from the `Boston` data frame in the `MASS` package.

```{r}
# Load the car package to make qqPlot() available

# Create index16, pointing to 16-day chicks

# Get the 16-day chick weights

# Show the normal QQ-plot of the chick weights

# Show the normal QQ-plot of the Boston$tax data

```

## The sunflowerplot() function for repeated numerical data

A scatterplot represents each (x, y) pair in a dataset by a single point. If some of these pairs are repeated (i.e., if the same combination of x and y values appears more than once, so that the points lie on top of each other), we can't see this in a scatterplot. Several approaches have been developed to deal with this problem, including jittering, which adds small random values to each x and y value, so repeated points will appear as clusters of nearby points.

A useful alternative that is equally effective in representing repeated data points is the sunflowerplot, which represents each repeated point by a "sunflower," with one "petal" for each repetition of a data point.
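
To make the difference between these approaches concrete, here is a minimal sketch on a tiny made-up dataset with deliberate repeats. The vectors `x` and `y` and the names `old_par` and `repeats` are illustrative assumptions, not part of the exercise data.

```{r}
# Illustrative sketch only: small made-up data with deliberate repeats
x <- c(1, 1, 1, 2, 2, 3)
y <- c(1, 1, 1, 2, 2, 3)

# Three views of the same six observations
old_par <- par(mfrow = c(1, 3))

plot(x, y)                      # repeated pairs overplot: only 3 points visible
title("Scatterplot")

plot(jitter(x), jitter(y))      # small random noise separates the repeats
title("Jittered")

sunflowerplot(x, y)             # one petal per repetition
title("Sunflower plot")

par(old_par)

# xyTable() reports how many times each distinct (x, y) pair occurs
repeats <- xyTable(x, y, digits = 6)
repeats$number
```

The jittered plot changes a little on every run because the noise is random, whereas the sunflower plot is deterministic.
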
This exercise asks you to construct both a scatterplot and a sunflowerplot from the same dataset, one that contains repeated data points. Comparing these plots allows you to see how much information can be lost in a standard scatterplot when some data points appear many times.

### Instructions

- Use the `par()` function to set the `mfrow` parameter for a side-by-side plot array.
- For the left-hand plot, use the `plot()` function to construct a scatterplot of the `rad` variable versus the `zn` variable, both from the `Boston` data frame in the `MASS` package.
- Use the `title()` function to add the title `"Standard scatterplot"` to this plot.
- For the right-hand plot, apply the `sunflowerplot()` function to the same data to see the presence of repeated data points, not evident from the scatterplot on the left.
- Use the `title()` function to add the title `"Sunflower plot"`.

```{r}
# Set up a side-by-side plot array

# Create the standard scatterplot

# Add the title

# Create the sunflowerplot

# Add the title

```

## Useful options for the `boxplot()` function

The `boxplot()` function shows how the distribution of a numerical variable `y` differs across the unique levels of a second variable, `x`. To be effective, this second variable should not have too many unique levels (e.g., 10 or fewer is good; many more than this makes the plot difficult to interpret).

The `boxplot()` function also has a number of optional parameters and this exercise asks you to use three of them to obtain a more informative plot:

- `varwidth` allows for variable-width boxplots that show the different sizes of the data subsets.
- `log` allows for log-transformed y-values.
- `las` allows for more readable axis labels.

This exercise also illustrates the use of the formula interface: `y ~ x` indicates that we want a boxplot of the `y` variable across the different levels of the `x` variable. See `?boxplot` for more details.
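
As a minimal sketch of how these options combine with the formula interface, the example below uses the built-in `mtcars` data rather than the exercise's `Boston` data. The name `bp` is illustrative; `boxplot()` invisibly returns a list of the statistics it plots, which is captured here only to inspect the group sizes.

```{r}
# Illustrative sketch only: built-in mtcars data, not Boston

# Horsepower across cylinder counts: hp is y, cyl is x
bp <- boxplot(hp ~ cyl, data = mtcars,
              varwidth = TRUE,   # box widths reflect the group sizes
              log = "y",         # log-scaled y-axis
              las = 1)           # horizontal axis labels
title("Horsepower vs. number of cylinders")

# Group labels and the number of observations behind each box
bp$names
bp$n
```

Because `log = "y"` requires strictly positive values, a variable containing zeros would need a different treatment.
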
### Instructions

- Using the formula interface, create a boxplot showing the distribution of numerical `crim` values over the different distinct `rad` values from the `Boston` data frame. Use the `varwidth` parameter to obtain variable-width boxplots, specify a log-transformed y-axis, and set the `las` parameter equal to 1 to obtain horizontal labels for both the x- and y-axes.
- Use the `title()` function to add the title `"Crime rate vs. radial highway index"`.

```{r}
# Create a variable-width boxplot with log y-axis & horizontal labels

# Add a title

```

## Using the mosaicplot() function

A mosaic plot may be viewed as a scatterplot between categorical variables and it is supported in R with the `mosaicplot()` function.

As this example shows, in addition to categorical variables, this plot can also be useful in understanding the relationship between numerical variables, either integer- or real-valued, that take only a few distinct values.

More specifically, this exercise asks you to construct a mosaic plot showing the relationship between the numerical `carb` and `cyl` variables from the `mtcars` data frame, variables that exhibit 6 and 3 unique values, respectively.

### Instructions

- Apply the `mosaicplot()` function with the formula interface to see how the levels of the `carb` variable vary with the levels of the `cyl` variable from the `mtcars` data frame.

```{r}
# Create a mosaic plot using the formula interface

```

## Using the bagplot() function

A single box plot gives a graphical representation of the range of variation in a numerical variable, based on five numbers:

- The minimum and maximum values
- The median (or "middle") value
- Two intermediate values called the lower and upper quartiles

In addition, the standard box plot computes a nominal data range from three of these numbers and flags points falling outside this range as outliers, representing them as distinct points.
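
The five numbers and the outlier rule can be checked directly. The sketch below assumes the built-in `mtcars` data and R's default settings, in which the nominal range extends 1.5 times the box height beyond each end of the box; note that R's box plots use "hinges" from `fivenum()`, which approximate the quartiles. All variable names are illustrative.

```{r}
# Illustrative sketch only: built-in mtcars data
hp <- mtcars$hp

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(hp)
names(five) <- c("min", "lower", "median", "upper", "max")
five

# Nominal range: 1.5 times the box height beyond each end of the box
box_height  <- five[["upper"]] - five[["lower"]]
lower_limit <- five[["lower"]] - 1.5 * box_height
upper_limit <- five[["upper"]] + 1.5 * box_height

# Points outside the nominal range, computed by hand ...
flagged <- hp[hp < lower_limit | hp > upper_limit]
flagged

# ... and as reported by boxplot.stats(), which boxplot() relies on
boxplot.stats(hp)$out
```
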
The bag plot extends this representation to two numerical variables, showing their relationship, both within two-dimensional "bags" corresponding to the "box" in the standard boxplot, and indicating outlying points outside these limits.

This exercise asks you to construct, first, side-by-side box plots of the `Min.Price` and `Max.Price` variables from the `Cars93` data frame, and then to use the `bagplot()` function from the `aplpack` package to construct the corresponding bag plot.

### Instructions

- Use the `boxplot()` function to construct side-by-side box plots for `Min.Price` and `Max.Price` from the `Cars93` data frame.
- Load the `aplpack` package to make the `bagplot()` function available.
- Construct the bag plot showing the relationship between the `Min.Price` and `Max.Price` variables from the `Cars93` data frame. Use the `cex` parameter to make the point sizes in this plot 20 percent larger than the default size.
- Use the `abline()` function to add a dashed equality reference line with intercept 0 and slope 1.

```{r}
# Create a side-by-side boxplot summary

# Load aplpack to make the bagplot() function available

# Create a bagplot for the same two variables

# Add an equality reference line

```

## Plotting correlation matrices with the corrplot() function

Correlation matrices were introduced in the video as a useful tool for obtaining a preliminary view of the relationships between multiple numerical variables.

This exercise asks you to use the `corrplot()` function from the `corrplot` package to visualize this correlation matrix for the numerical variables from the `UScereal` data frame in the `MASS` package. Recall that in this version of these plots, ellipses that are thin and elongated indicate a large correlation value between the indicated variables, while ellipses that are nearly circular indicate correlations near zero.

### Instructions

- Create a subset of the `UScereal` data frame that contains only the 9 numerical (i.e., non-factor) variables.
Save the result as `numericalVars`.
- Apply the `cor()` function to this subset to compute the correlation matrix containing the correlation coefficients between all variable pairs. Save the result as `corrMat`.
- Apply the `corrplot()` function to display this correlation matrix, using the `"ellipse"` method.
- Which two variables are most highly correlated with potassium?

```{r}
# Load the corrplot library for the corrplot() function

# Extract the numerical variables from UScereal

# Compute the correlation matrix for these variables

# Generate the correlation ellipse plot

```

## Building and plotting rpart() models

It was noted in the video that decision trees represent a popular form of predictive model because they are easy to visualize and explain. It was also noted that the `rpart` package is probably the most popular of several R packages that can be used to build and visualize these models.

This exercise asks you to, first, build a decision tree model using the `rpart()` function from this package, and then display the resulting model structure using the generic functions `plot()` and `text()`.

### Instructions

- Load the `rpart` package to make the `rpart()` modeling function and the associated methods for generic functions like `plot()` available.
- Use the `rpart()` function to fit a decision tree model `tree_model` that predicts `medv` in the `Boston` data frame from all of the other variables in this data frame.
- Apply the `plot()` function to `tree_model` to obtain an unlabelled plot of the structure of this decision tree model.
- Apply the `text()` function to `tree_model` to label this plot. To make the labels easier to read, use the `cex` parameter to reduce the text to 70% of the default size.
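
The fit, plot, and label sequence can be sketched on a smaller built-in dataset first. This example is an illustration under stated assumptions, not the exercise solution: it predicts `mpg` from the other `mtcars` variables, and the name `tree_sketch` is hypothetical. The formula `mpg ~ .` means "predict `mpg` from every other column".

```{r}
# Illustrative sketch only: built-in mtcars data, not Boston
library(rpart)

# Fit a regression tree predicting mpg from all other mtcars variables
tree_sketch <- rpart(mpg ~ ., data = mtcars)

# Draw the tree structure without labels (dispatches to the rpart plot method)
plot(tree_sketch)

# Add split and leaf labels at 70% of the default text size
text(tree_sketch, cex = 0.7)
```

Calling `text()` only makes sense after `plot()`, because it adds labels to the tree that is already drawn.
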
```{r}
# Load the rpart library

# Fit an rpart model to predict medv from all other Boston variables

# Plot the structure of this decision tree model

# Add labels to this plot

```

# Adding Details to Plots

## Introduction to the par() function

You already saw how the `mfrow` parameter to the `par()` function could be used to plot multiple graphs in one pane. The `par()` function also allows you to set many other graphics parameters, whose values will remain in effect until they are reset by a subsequent `par()` call.

Specifically, a call to the `par()` function with no parameters specified will return a list whose element names each specify a graphics parameter and whose element values specify the corresponding default values of these parameters. These parameters may be set by a call in the form `par(name = value)` where `name` is the name of the parameter to be set and `value` is the value to be assigned to this parameter.

The purpose of this exercise is to give an idea of what these graphics parameters are. In the subsequent exercises we'll show how some of these parameters can be used to enhance plot results.

### Instructions

- Capture the return value from the `par()` function as the list `plot_pars`.
- Show the names of these graphics parameters by calling `names()` on `plot_pars`.
- Show the number of parameters in this list by calling `length()`.

```{r}
# Assign the return value from the par() function to plot_pars

# Display the names of the par() function's list elements

# Display the number of par() function list elements

```

## Exploring the type option

One of the important graphics parameters that can be set with the `par()` function is `mfrow`, which specifies the numbers of rows and columns in an array of plots. Valid values for this parameter are two-element numerical vectors, whose first element specifies the number of rows in the plot array and whose second element specifies the number of columns.
A more detailed discussion of using the `mfrow` parameter is given in Chapter 4 of this course. For now, note that to specify a plot array with three rows and one column, the command would be `par(mfrow = c(3, 1))`.

The following exercise also introduces the `type` parameter for the `plot()` command, which specifies how the plot is drawn. The specific `type` values used here are:

- `"p"` for "points"
- `"l"` for "lines"
- `"o"` for "overlaid" (i.e., lines overlaid with points)
- `"s"` for "steps"

### Instructions

- Use the `par()` function to set the `mfrow` parameter for a two-by-two plot array.
- Generate a plot of brain weights from the `Animals2` data frame, with observations plotted as points and the title `"points"` by calling the `title()` function.
- Repeat, with observations plotted as lines and the title `"lines"`.
- Repeat, with observations plotted as overlaid points and lines and the title `"overlaid"`.
- Repeat, with observations plotted as steps and the title `"steps"`.

```{r}
# Set up a 2-by-2 plot array

# Plot the Animals2 brain weight data as points

# Add the title

# Plot the brain weights with lines

# Add the title

# Plot the brain weights as lines overlaid with points

# Add the title

# Plot the brain weights as steps

# Add the title

```

## The surprising utility of the type "n" option

The `type = "n"` option was discussed in the video and this exercise provides a simple example. This option is especially useful when we are plotting data from multiple sources on a common set of axes. In such cases, we can compute ranges for the x- and y-axes so that all data points will appear on the plot, and then add the data with subsequent calls to `points()` or `lines()` as appropriate.

This exercise asks you to generate a plot that compares mileage vs. horsepower data from two different sources: the `mtcars` data frame in the `datasets` package and the `Cars93` data frame in the `MASS` package.
To distinguish the different results from these data sources, the graphics parameter `pch` is used to specify point shapes. See `?points` for a list of `pch` values and their corresponding point shapes.

### Instructions

- Compute `max_hp` as the maximum of `Horsepower` from the `Cars93` data frame and `hp` from the `mtcars` data frame.
- Compute `max_mpg` as the maximum of `MPG.city` from the `Cars93` data frame, `MPG.highway` from this data frame, and `mpg` from the `mtcars` data frame.
- Using the `type = "n"` option, set up a plot with an x-axis that runs from `0` to `max_hp` and a y-axis that runs from `0` to `max_mpg`, with labels `"Horsepower"` and `"Miles per gallon"`.
- Using the `points()` function, add `mpg` vs. `hp` from the `mtcars` data frame to the plot as open circles (`pch = 1`).
- Using the `points()` function, add `MPG.city` vs. `Horsepower` to the plot as solid squares (refer to the plot of `pch` values).
- Using the `points()` function, add `MPG.highway` vs. `Horsepower` to the plot as open triangles pointing upwards (refer to the plot of `pch` values).

```{r}
# Compute max_hp
max_hp <- max(___, ___)

# Compute max_mpg
max_mpg <- max(___, ___, ___)

# Create plot with type = "n"
plot(___, ___, type = ___, xlim = c(0, max_hp),
     ylim = ___, xlab = ___, ylab = ___)

# Add open circles to plot
points(___, ___, pch = ___)

# Add solid squares to plot

# Add open triangles to plot

```

## The lines() function and line types

As noted in Chapter 2, numerical data is often assumed to conform approximately to a Gaussian probability distribution, characterized by the bell curve.

One point of this exercise is to show what this bell curve looks like for exactly Gaussian data and the other is to show how the `lines()` function can be used to add lines to an existing plot.
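
The pattern of setting up empty axes with `type = "n"` and then adding curves with `lines()` can be sketched as follows. To avoid duplicating the exercise, this illustration assumes two exponential densities rather than Gaussian ones; the names `dens_a`, `dens_b`, and `y_max` are illustrative.

```{r}
# Illustrative sketch only: two exponential densities, not the exercise's Gaussians
x <- seq(0, 5, length.out = 100)
dens_a <- dexp(x, rate = 1)
dens_b <- dexp(x, rate = 2)

# Compute a y-range that is large enough for both curves
y_max <- max(dens_a, dens_b)

# Set up the axes without drawing any data
plot(x, dens_a, type = "n", ylim = c(0, y_max),
     xlab = "x", ylab = "Exponential probability density")

# Add each curve with its own line type and width
lines(x, dens_a, lty = 1)            # solid, default width
lines(x, dens_b, lty = 2, lwd = 3)   # dashed, three times the default width
```

Without the explicit `ylim`, the axes would be scaled to `dens_a` alone and the top of the second curve would be cut off.
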
The curves you are asked to draw here have the same basic shape but differ in their details (specifically, the means and standard deviations of these Gaussian distributions are different). For this reason, it is useful to draw these curves with different line types to help us distinguish them.

Note that line types are set by the `lty` argument, with the default value `lty = 1` specifying solid lines, `lty = 2` specifying dashed lines, and `lty = 3` specifying dotted lines. Also note that the `lwd` argument specifies the relative line width.

### Instructions

- Create a numerical variable `x` with `200` evenly-spaced values from `0` to `10`.
- Using the `dnorm()` function, generate a vector `gauss1` of Gaussian probability densities for this range of `x` values, with mean `2` and standard deviation `0.2`.
- Using the `dnorm()` function, generate a vector `gauss2` of Gaussian probability densities for this range of `x` values, with mean `4` and standard deviation `0.5`.
- Generate a plot of `gauss1` vs. `x` with lines and a y-axis label `"Gaussian probability density"`.
- Using the `lines()` function, add a second dashed line for `gauss2` vs. `x` with relative width 3 (refer to the line type plot to select the `lty` parameter).

```{r}
# Create the numerical vector x
x <- seq(___, ___, length = ___)

# Compute the Gaussian density for x with mean 2 and standard deviation 0.2
gauss1 <- dnorm(___, mean = ___, sd = ___)

# Compute the Gaussian density with mean 4 and standard deviation 0.5

# Plot the first Gaussian density

# Add lines for the second Gaussian density

```

## The points() function and point types

One advantage of specifying the `pch` argument locally, in a call to functions like `plot()` or `points()`, is that it allows us to make `pch` depend on a variable in our dataset. This provides a simple way of indicating different data subsets with different point shapes or symbols.

This exercise asks you to generate two plots of `mpg` vs.
`hp` from the `mtcars` data frame in the `datasets` package. The first plot specifies the point shapes using numerical values of the `pch` argument defined by the `cyl` variable in the `mtcars` data frame. The second plot illustrates the fact that `pch` can also be specified as a vector of single characters, causing each point to be drawn as the corresponding character.

### Instructions

- Create an empty plot of `mpg` vs. `hp` using the `type = "n"` option from the `mtcars` data frame, with axis labels `"Horsepower"` and `"Gas mileage"`.
- Using the `points()` function, add the `mpg` vs. `hp` data, with `pch` specified by the numeric values of `cyl`.
- Repeat both of the previous steps, except with `pch` specified by the character values of `cyl`.

```{r}
# Create an empty plot using type = "n"

# Add points with shapes determined by cylinder number
points(___, ___, pch = ___)

# Create a second empty plot

# Add points with shapes as cylinder characters

```

## Adding trend lines from linear regression models

The low-level plot function `abline()` adds a straight line to an existing plot. This line is specified by an intercept parameter `a` and a slope parameter `b`, and the simplest way to set these parameters is directly. For example, the command `abline(a = 0, b = 1)` adds an equality reference line with zero intercept and unit (i.e., 1) slope: points for which `y = x` fall on this reference line, while points with `y > x` fall above it, and points with `y < x` fall below it.

An alternative way of specifying these parameters is through a linear regression model that determines them from data. One common application is to generate a scatterplot of `y` versus `x`, then fit a linear model that predicts `y` from `x`, and finally call `abline()` to add this best fit line to the plot.

This exercise asks you to do this for the `Gas` versus `Temp` data from the `whiteside` data frame in the `MASS` package.
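
The two ways of specifying a line can be compared side by side. This is a minimal sketch on the built-in `cars` data rather than `whiteside`; it uses `lm()`, R's standard linear model function, and the names `cars_model` and `coefs` are illustrative.

```{r}
# Illustrative sketch only: built-in cars data, not whiteside
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# Fit a straight line predicting dist from speed
cars_model <- lm(dist ~ speed, data = cars)

# abline() accepts the fitted model and draws its line
abline(cars_model, lty = 2)

# The same line, given explicitly as intercept a and slope b
coefs <- coef(cars_model)
abline(a = coefs[1], b = coefs[2], col = "red", lty = 3)
```

The two calls to `abline()` draw the same line, so the red dotted line lies on top of the black dashed one.
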
The standard R function that fits linear regression models is `lm()`, which supports the formula interface. Thus, to fit a linear model that predicts `y` from `x` in the data frame `df`, the call would be `lm(y ~ x, data = df)`. This call returns a linear model object, which can then be passed as an argument to the `abline()` function to draw the desired line on our plot.

### Instructions

- Use the `lm()` function to create `linear_model`, a linear regression model that predicts `Gas` from `Temp` from the `whiteside` data frame.
- Generate a scatterplot of `Gas` vs. `Temp`.
- Using the `abline()` function, add a dashed reference line that shows the predictions of `linear_model`.

```{r}
# Build a linear regression model for the whiteside data

# Create a Gas vs. Temp scatterplot from the whiteside data

# Use abline() to add the linear regression line

```

## Using the `text()` function to label plot features

One of the main uses of the `text()` function is to add informative labels to a data plot. The `text()` function takes three arguments:

- `x`, which specifies the value for the `x` variable,
- `y`, which specifies the value for the `y` variable, and
- `labels`, which specifies the label for the `x`-`y` value pair.

This exercise asks you to first create a scatterplot of city gas mileage versus horsepower from the `Cars93` data, then identify an interesting subset of the data (i.e., the 3-cylinder cars) and label these points. You will find that assigning a vector to the `x`, `y`, and `labels` arguments to `text()` will result in labeling multiple points at once.

### Instructions

- Create a scatterplot of `MPG.city` vs. `Horsepower` from the `Cars93` data frame, with points represented as solid squares. Recall that the `pch` value for solid squares is `15`.
- Create the variable `index3` using the `which()` function that identifies all 3-cylinder cars.
- Label the `Make` of each 3-cylinder car in the `Cars93` data frame using the `text()` function.
Use the `adj` argument to specify left-justified text in your labels.
- Use the zoom feature in the Plots window to see this plot more clearly.

```{r}
# Create MPG.city vs. Horsepower plot with solid squares


# Create index3, pointing to 3-cylinder cars


# Add text giving names of cars next to data points
text(x = ___, y = ___, labels = ___, adj = ___)
```

## Adjusting text position, size, and font

The previous exercise added explanatory text to a scatterplot. The purpose of this exercise is to improve this plot by modifying the text placement, increasing the text size, and displaying the text in boldface.

It was noted that the `adj` argument to the `text()` function determines the horizontal placement of the text and it can take any value between 0 and 1. In fact, this argument can take values outside this range. That is, making this value negative causes the text to start to the right of the specified `x` position. Similarly, making `adj` greater than 1 causes the text to end to the left of the `x` position.

Another useful optional argument for the `text()` function is `cex`, which scales the default text size. As a specific example, setting `cex = 1.5` increases the text size by 50 percent, relative to the default value. Similarly, specifying `cex = 0.8` reduces the text size by 20 percent.

Finally, the third optional parameter used here is `font`, which can be used to specify one of four text fonts: `font = 1` is the default text font (neither italic nor bold), `font = 2` specifies bold face text, `font = 3` specifies italic text, and `font = 4` specifies both bold and italic text.

### Instructions

- Create a plot of `MPG.city` vs. `Horsepower` from the `Cars93` data frame, with data represented as open circles.
- Construct the variable `index3` using the `which()` function that identifies the row numbers containing all 3-cylinder cars.
- Use the `points()` function to overlay solid circles on top of all points in the plot that represent 3-cylinder cars.
- Use the `text()` function with the `Make` variable as before to add labels to the right of the 3-cylinder cars, but now use `adj = -0.2` to move the labels further to the right, use the `cex` argument to increase the label size by 20 percent, and use the `font` argument to make the labels bold italic.

```{r}
# Plot MPG.city vs. Horsepower as open circles


# Create index3, pointing to 3-cylinder cars


# Highlight 3-cylinder cars as solid circles


# Add car names, offset from points, with larger bold italic text

```

## Rotating text with the `srt` argument

In addition to the optional arguments used in the previous exercises, the `text()` function also supports a number of other optional arguments that can be used to enhance the text. This exercise uses the `cex` argument to reduce the text size and introduces two new arguments. The first is the `col` argument that specifies the color used to display the text, and the second is the `srt` argument that allows us to rotate the text.

Color has been used in several of the previous exercises to specify point colors, and the effective use of color is discussed further in Chapter 5. One of the points of this exercise is to show that the specification of text color with the `text()` function is essentially the same as the specification of point color with the `plot()` function. As a specific example, setting `col = "green"` in the `text()` function causes the text to appear in green. If `col` is not specified, the text appears in the default color set by the `par()` function, which is typically black.

The `srt` parameter allows us to rotate the text through an angle specified in degrees. The typical default value (set by the `par()` function) is `0`, causing the text to appear horizontally, reading from left to right. Specifying `srt = 90` causes the text to be rotated `90` degrees counter-clockwise so that it reads from bottom to top instead of left to right.

### Instructions

- Create a scatterplot of `Gas` vs.
`Temp` from the `whiteside` data frame, as solid triangles.
- Use the `which()` function to create a vector `indexB` that points to all data observations with `Insul` having the value `"Before"`.
- Use the `which()` function to create a vector `indexA` that points to all data observations with `Insul` having the value `"After"`.
- Use the `text()` function to overlay the text `"Before"` on the appropriate points, in blue, rotated `30` degrees, reducing the text size to `80` percent of the default.
- Use the `text()` function to overlay the text `"After"` on the appropriate points, in red, rotated `-20` degrees, reducing the text size to `80` percent of the default.

```{r}
# Plot Gas vs. Temp as solid triangles


# Create indexB, pointing to "Before" data


# Create indexA, pointing to "After" data


# Add "Before" text in blue, rotated 30 degrees, 80% size
text(x = ___, y = ___, labels = ___,
     col = ___, srt = ___, cex = ___)

# Add "After" text in red, rotated -20 degrees, 80% size

```

## Using the `legend()` function

The video described and illustrated the use of the `legend()` function to add explanatory text to a plot. This exercise asks you to first create a scatterplot and then use this function to add explanatory text for the point shapes that identify two different data subsets.

### Instructions

- Set up a scatterplot of `Gas` vs. `Temp` from the `whiteside` data frame using the `type = "n"` option in the `plot()` call. Label the x-axis `"Outside temperature"` by specifying the `xlab` argument and the y-axis `"Heating gas consumption"` by specifying the `ylab` argument.
- Use the `which()` function to create a vector `indexB` that points to all data observations with `Insul` having the value `"Before"`.
- Use the `which()` function to create a vector `indexA` that points to all data observations with `Insul` having the value `"After"`.
- Using the `points()` function, add the `"Before"` points to the plot, represented as solid triangles.
- Using the `points()` function, add the `"After"` points to the plot, represented as open circles.
- Using the `legend()` function, add a legend in the upper right corner of the plot with the names `"Before"` and `"After"` and the appropriate point shapes indicated.

```{r}
# Set up and label empty plot of Gas vs. Temp
plot(___, ___, type = ___,
     xlab = ___, ylab = ___)

# Create indexB, pointing to "Before" data
indexB <- ___

# Create indexA, pointing to "After" data
indexA <- ___

# Add "Before" data as solid triangles


# Add "After" data as open circles


# Add legend that identifies points as "Before" and "After"
legend("topright", pch = ___, legend = c(___, ___))
```

## Adding custom axes with the `axis()` function

Typical base graphics functions like `boxplot()` provide x- and y-axes by default, with a label for the x-axis below the plot and one for the y-axis to the left of the plot. These labels are generated automatically from the variable names used to generate the plot. Sometimes, we want to provide our own axis labels, and R makes this possible in two steps: first, we suppress the default axes when we create the plot by specifying `axes = FALSE`; then, we call the low-level graphics function `axis()` to create the axes we want.

In this exercise, you're asked to create your own labels using the `axis()` function with the `side`, `at`, and `labels` arguments. The `side` argument tells the function which axis to create: a value of 1 adds an axis below the plot; 2 adds an axis on the left; 3 puts it across the top; and 4 puts it on the right side. The second argument, `at`, is a vector that defines points where tick-marks will be drawn on the axis. The third argument, `labels`, is a vector that defines labels at each of these tick-marks.

One example of a boxplot with custom axes was presented in the video.
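The two-step approach can be sketched as follows. This is only an illustration on the built-in `ToothGrowth` data, not the data used in this exercise, and the tick labels `"low"`, `"medium"`, and `"high"` are made up for the example.

```{r}
# Illustrative data only: tooth length by dose from the built-in ToothGrowth data
# Step 1: draw the boxplot with the default axes suppressed
bp <- boxplot(len ~ dose, data = ToothGrowth, axes = FALSE)

# Step 2: add the axes we want
# side = 2: default y-axis on the left
axis(side = 2)

# side = 1: x-axis below the plot, ticks at the three box positions,
# with custom labels instead of the raw dose values
axis(side = 1, at = 1:3, labels = c("low", "medium", "high"))
```

The boxes of a boxplot sit at positions 1, 2, 3, and so on along the x-axis, which is why `at = 1:3` lines the tick-marks up with the three dose groups.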
This exercise asks you to create another example showing the relationship between the `sugars` variable and the `shelf` variable from the `UScereal` data frame in the `MASS` package.

### Instructions

- Use the `boxplot()` function to create a boxplot of `sugars` vs. `shelf` from the `UScereal` data frame in the `MASS` package, with axes suppressed.
- Use the `axis()` function with the `side` parameter specified to add a y-axis label to the left of the box plot showing the range of `sugars` values.
- In your second call to `axis()`, add an x-axis label on the bottom side and specify the `at` parameter to add tick-marks at the numerical shelf values labelled 1, 2, and 3.
- In your third call to `axis()`, add another x-axis label at the top and specify the `labels` parameter to show the physical `shelf` locations (`"floor"`, `"middle"`, `"top"`).

```{r}
# Create a boxplot of sugars by shelf value, without axes


# Add a default y-axis to the left of the boxplot
axis(side = ___)

# Add an x-axis below the plot, labelled 1, 2, and 3
axis(side = ___, at = ___)

# Add a second x-axis above the plot
axis(side = ___, at = ___,
     labels = ___)
```

## Using the `supsmu()` function to add smooth trend curves

Some scatterplots exhibit fairly obvious trends that are not linear. In such cases, we may want to add a curved trend line that highlights this behavior of the data and the `supsmu()` function represents one way of doing this. To use this function, we need to specify values for the required arguments `x` and `y`, but it also has a number of optional arguments. Here, we consider the optional `bass` argument, which controls the degree of smoothness in the resulting trend curve. The default value is 0, but specifying larger values (up to a maximum of 10) results in a smoother curve.

This exercise asks you to use the `supsmu()` function to add two trend lines to a scatterplot, one using the default parameters and the other with increased smoothness.
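As a hedged sketch of the idea, the chunk below adds two `supsmu()` curves to a scatterplot of the built-in `cars` data, which is used here purely for illustration. The names `trend_default` and `trend_smooth` are assumptions, not part of the exercise.

```{r}
# Illustrative data only: stopping distance vs. speed from the built-in cars data
plot(cars$speed, cars$dist)

# supsmu() lives in the stats package; bass defaults to 0 (least smoothing)
trend_default <- supsmu(cars$speed, cars$dist)
lines(trend_default)

# Larger bass values (up to 10) give a smoother curve
trend_smooth <- supsmu(cars$speed, cars$dist, bass = 10)
lines(trend_smooth, lty = 2)
```

`supsmu()` returns a list with components `x` and `y`, which is why the result can be passed straight to `lines()`.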
### Instructions

- Create a scatterplot of `MPG.city` vs. `Horsepower` from the `Cars93` data frame.
- Create a `supsmu()` object named `trend1` with the `bass` parameter at its minimum (default) value, 0.
- Use the `lines()` function to add the `trend1` curve to the plot as a solid line. There is no need to provide additional arguments.
- Create a `supsmu()` object named `trend2` with the `bass` parameter at its maximum value, 10.
- Use the `lines()` function to add the `trend2` curve to the plot as a dotted line of twice standard width.

```{r}
# Create a scatterplot of MPG.city vs. Horsepower


# Call supsmu() to generate a smooth trend curve, with default bass


# Add this trend curve to the plot


# Call supsmu() for a second trend curve, with bass = 10


# Add this trend curve as a heavy, dotted line

```

# How Much is Too Much?

## Too much is too much

The first example presented in Chapter 1 applied the `plot()` function to a data frame, yielding an array of scatterplots with one for each pair of columns in the data frame. Thus, the number of plots in this array is equal to the square of the number of columns in the data frame. This means that if we apply the `plot()` function to a data frame with many columns, we will generate an enormous array of scatterplots, each of which will be too small to be useful. The purpose of this exercise is to provide a memorable example.

### Instructions

Applying the `plot()` function to a data frame with `k` columns will create a `k` by `k` array of scatterplots.

- Use the `ncol()` function to compute the number of plots in this array for the `Cars93` data frame. Just print the result to the console.
- Call `plot()` on `Cars93` to generate the scatterplot array. Can you verify that your calculation was correct?
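To see the `k` by `k` rule on a much smaller case first, here is a minimal sketch using the built-in `iris` data frame, chosen only for illustration. It has five columns, so the array has 25 panels and is still readable.

```{r}
# Illustrative data only: the built-in iris data frame has 5 columns
k <- ncol(iris)

# Number of panels in the k-by-k array produced by plot(iris)
k^2

# The array itself; the diagonal panels carry the variable names
plot(iris)
```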
```{r}
# Compute the number of plots to be displayed
ncol(___)^___

# Plot the array of scatterplots

```

## Deciding how many scatterplots is too many

The `matplot()` function can be used to easily generate a plot with several scatterplots on the same set of axes. By default, the points in these scatterplots are represented by the numbers 1 through n, where n is the total number of scatterplots included, but most of the options available with the `plot()` function are also possible by specifying the appropriate arguments.

This exercise asks you to set up a plot array with four of these multiple scatterplot displays, each including one more scatterplot than the previous one. The point of this exercise is to encourage you to judge for yourself how many scatterplots is too many.

### Instructions

- Run the first line of sample code provided to construct a character vector called `keep_vars` that names six variables from the `UScereal` data frame.
- Using this vector of variable names, extract the data frame `df` containing only these variables from the `UScereal` data frame.
- Use the `par()` function to set up the `mfrow` parameter to generate a two-by-two plot array.
- Use the `matplot()` function to construct four multiple-scatterplot displays of:
    - `protein` and `fat` versus `calories`. Give this plot the title `"Two scatterplots"` using the `title()` function. Label the x-axis `"calories"` and the y-axis `""` to get rid of the default y-axis label.
    - `protein`, `fat` and `fibre` versus `calories`. Give this plot the title `"Three scatterplots"`. Similarly, label the x-axis `"calories"` and the y-axis `""`.
    - `protein`, `fat`, `fibre` and `carbo` versus `calories`. Give this plot the title `"Four scatterplots"`. Again, label the x-axis `"calories"` and the y-axis `""`.
    - `protein`, `fat`, `fibre`, `carbo`, and `sugars` versus `calories`. Give this plot the title `"Five scatterplots"`. Be sure to label the x-axis `"calories"` and the y-axis `""` again.
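The general pattern of a `matplot()` call followed by `title()` can be sketched as below. The built-in `iris` data is used only as a stand-in for `UScereal`, and the names `x` and `y` are assumptions made for the example.

```{r}
# Illustrative data only: numeric columns of the built-in iris data
x <- iris$Sepal.Length
y <- as.matrix(iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")])

# One call draws all three scatterplots on the same axes;
# by default the points are drawn as the digits 1, 2, and 3
matplot(x, y, xlab = "Sepal.Length", ylab = "")

# title() adds the main title to the plot that was just drawn
title("Three scatterplots")
```

Each column of `y` becomes one scatterplot against the shared `x` values, so adding a column adds one more overlaid scatterplot.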
```{r}
# Construct the vector keep_vars
keep_vars <- c("calories", "protein", "fat",
               "fibre", "carbo", "sugars")

# Use keep_vars to extract the desired subset of UScereal
df <- ___[, ___]

# Set up a two-by-two plot array


# Use matplot() to generate an array of two scatterplots


# Add a title


# Use matplot() to generate an array of three scatterplots


# Add a title


# Use matplot() to generate an array of four scatterplots


# Add a title


# Use matplot() to generate an array of five scatterplots


# Add a title

```

## How many words is too many?

The main point of the previous two exercises has been to show that scatterplot arrays lose their utility if they are allowed to become too complex, either including too many plots, or attempting to include too many scatterplots on one set of axes. More generally, *any* data visualization loses its utility if it becomes too complex.

This exercise asks you to consider this problem with *wordclouds*, displays that present words in varying sizes depending on their frequency. That is, more frequent words appear larger in the display, while rarer words appear in a smaller font. In R, wordclouds are easy to generate with the `wordcloud()` function in the `wordcloud` package. This function is called with a character vector of words, and a second numerical vector giving the number of times each word appears in the collection used to generate the wordcloud.

Two other useful arguments for this function are `scale` and `min.freq`. The `scale` argument is a two-component numerical vector giving the relative size of the largest word in the display and that of the smallest word. The wordcloud only includes those words that occur at least `min.freq` times in the collection and the default value for this argument is 3.

### Instructions

Load the `wordcloud` package in your workspace.
- Use the `table()` function to create the variable `mfr_table` that tabulates the number of times each level of the `Manufacturer` variable appears in the `Cars93` data frame.
- Use the `wordcloud()` function to display this manufacturer data, setting the `scale` argument to `c(2, 0.25)` to make the wordcloud fit in the display window.
- By default, the `wordcloud()` function shows only those words that appear 3 or more times. Use the `min.freq` argument to obtain a display with all `Manufacturer` values in the wordcloud.
- Use the `table()` function to create the variable `model_table` that tabulates the number of times each level of the `Model` variable appears in the `Cars93` data frame.
- Use the `wordcloud()` function with the `scale` argument set to `c(0.75, 0.25)` and the `min.freq` argument set to display all `Model` values. Use the zoom feature in the 'Plots' window to see this wordcloud more clearly. Does it convey useful information?

```{r}
# Create mfr_table of manufacturer frequencies
mfr_table <- ___

# Create the default wordcloud from this table
wordcloud(words = names(___),
          freq = as.numeric(___),
          scale = ___)

# Change the minimum word frequency
wordcloud(words = names(___),
          freq = as.numeric(___),
          scale = ___,
          min.freq = ___)

# Create model_table of model frequencies
model_table <- ___

# Create the wordcloud of all model names with smaller scaling
wordcloud(words = names(___),
          freq = as.numeric(___),
          scale = ___,
          min.freq = ___)
```

## The Anscombe quartet

This exercise and the next one are based on the Anscombe quartet, a collection of four datasets that appear to be essentially identical on the basis of simple summary statistics like means and standard deviations. For example, the mean x-values for these datasets are identical to three digits, while the mean y-values differ only in the third digit.
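These summary statistics are easy to check. The chunk below is a minimal sketch using only the `anscombe` data frame from the `datasets` package that ships with R; the names `means` and `sds` are ours.

```{r}
# anscombe has columns x1..x4 and y1..y4, one pair per dataset
means <- colMeans(anscombe)
sds <- sapply(anscombe, sd)

# Column means and standard deviations, rounded for easy comparison
round(means, 3)
round(sds, 3)
```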
In spite of these apparent similarities, the behavior of the four datasets is quite different and this becomes immediately apparent when we plot them.

### Instructions

- Use the `par()` function to set up a two-by-two plot array.
- Using `plot()`, create 4 separate plots from the `anscombe` data frame:
    - `y1` vs. `x1`
    - `y2` vs. `x2`
    - `y3` vs. `x3`
    - `y4` vs. `x4`

```{r}
# Set up a two-by-two plot array


# Plot y1 vs. x1
plot(anscombe$___, anscombe$___)

# Plot y2 vs. x2


# Plot y3 vs. x3


# Plot y4 vs. x4

```

## The utility of common scaling and individual titles

The plots you generated in the previous exercise showed that the four Anscombe quartet datasets have very different appearances, but a careful examination of these plots reveals that they exhibit different ranges of x and y values.

The point of this exercise is to illustrate how much more clearly we can see the differences in these datasets if we plot all of them with the same x and y ranges. This exercise also illustrates the utility of improving the x- and y-axis labels and of adding informative plot titles.

### Instructions

- Examine the range of x and y values from the previous four plots to determine a common range that covers the ranges of all four of these plots. Use integer values to keep things simple.
- Set up a two-by-two plot array using the `par()` function.
- Plot the `y1` variable against the `x1` variable using these common ranges, with x-axis label `"x value"` and y-axis label `"y value"`, and use the `main` argument to add the title `"First dataset"`.
- Repeat for the other three Anscombe data pairs, adding the appropriate titles.

```{r}
# Define common x and y limits for the four plots
xmin <- ___
xmax <- ___
ymin <- ___
ymax <- ___

# Set up a two-by-two plot array


# Plot y1 vs. x1 with common x and y limits, labels & title
plot(anscombe$___, anscombe$___,
     xlim = c(xmin, xmax),
     ylim = c(ymin, ymax),
     xlab = "x value", ylab = "y value",
     main = ___)

# Do the same for the y2 vs. x2 plot
plot(anscombe$___, anscombe$___,
     xlim = c(xmin, xmax),
     ylim = c(ymin, ymax),
     xlab = "x value", ylab = "y value",
     main = ___)

# Do the same for the y3 vs. x3 plot
plot(anscombe$___, anscombe$___,
     xlim = c(xmin, xmax),
     ylim = c(ymin, ymax),
     xlab = "x value", ylab = "y value",
     main = ___)

# Do the same for the y4 vs. x4 plot
plot(anscombe$___, anscombe$___,
     xlim = c(xmin, xmax),
     ylim = c(ymin, ymax),
     xlab = "x value", ylab = "y value",
     main = ___)
```

## Using multiple plots to give multiple views of a dataset

Another useful application of multiple plot arrays besides comparison is presenting multiple related views of the same dataset. This exercise illustrates this idea, giving four views of the same dataset: a plot of the raw data values themselves, a histogram of these data values, a density plot, and a normal QQ-plot.

### Instructions

- Note that the `MASS` and `car` packages have been pre-loaded, making the `geyser` data and the `truehist()` and `qqPlot()` functions available for your use.
- Use the `par()` function to set the `mfrow` parameter for a two-by-two plot array.
- In the upper left, use the `plot()` function to show the values of the `duration` variable from the `geyser` dataset, using the `main` argument to specify the plot title as `"Raw data"`.
- In the upper right, use the `truehist()` function to generate a histogram of the `duration` data, using the `main` argument to give this plot the title `"Histogram"`.
- In the lower left, use the `plot()` and `density()` functions to display the density of the `duration` values, using the `main` argument to give this plot the title `"Density"`.
- In the lower right, use the `qqPlot()` function from the `car` package to display a normal QQ-plot of the `duration` data, using the `main` argument to give this plot the title `"QQ-plot"`.
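The four-view idea can also be sketched with functions from base R alone. The chunk below is an illustration, not the exercise solution: it uses the built-in `faithful` data, and it swaps in `hist()` for `truehist()` and `qqnorm()` with `qqline()` for `qqPlot()`, so the details of the histogram and QQ-plot differ from what the exercise produces.

```{r}
# Illustrative data only: eruption times from the built-in faithful data
x <- faithful$eruptions

# par() returns the previous settings, so they can be restored afterwards
old_par <- par(mfrow = c(2, 2))

plot(x, main = "Raw data")                   # upper left
hist(x, freq = FALSE, main = "Histogram")    # upper right
plot(density(x), main = "Density")           # lower left
qqnorm(x, main = "QQ-plot")                  # lower right
qqline(x)

# Restore the single-plot layout
par(old_par)
```

With `mfrow`, the array is filled row by row, which is why the call order determines which view lands in which corner.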
```{r}
# Set up a two-by-two plot array


# Plot the raw duration data


# Plot the normalized histogram of the duration data


# Plot the density of the duration data


# Construct the normal QQ-plot of the duration data

```

## Constructing and displaying layout matrices

You can think of the layout matrix as the plot pane, where a 0 represents empty space and other numbers represent the plot number, which is determined by the sequence of visualization function calls. For example, a 1 in the layout matrix refers to the visualization that was first called, a 2 refers to the plot of the second visualization call, etc.

This exercise asks you to create your own 3 x 2 layout matrix, using the `c()` function to concatenate numbers into vectors that will form the rows of the matrix. You will then use the `matrix()` function to convert these rows into a matrix and apply the `layout()` function to set up the desired plot array. The convenience function `layout.show()` can then be used to verify that the plot array has the shape you want.

### Instructions

- Using the `matrix()` function, create a matrix called `layoutMatrix` with three rows and two columns:
    - the first row designates an empty plot to the left and plot 1 to the right.
    - the second row designates plot 2 to the left and an empty plot to the right.
    - the third row designates an empty plot to the left and plot 3 to the right.
- Use the `layout()` function to set up the desired plot array.
- Use the `layout.show()` function to show the arrangement of all three plots.

```{r}
# Use the matrix function to create a matrix with three rows and two columns
layoutMatrix <- matrix(
  c(
    ___, ___,
    ___, ___,
    ___, ___
  ),
  byrow = ___,
  nrow = ___
)

# Call the layout() function to set up the plot array


# Show where the three plots will go

```

## Creating a triangular array of plots

The previous exercise asked you to create a plot array using the `layout()` function.
Recall the layout matrix from the previous exercise:

```
> layoutMatrix
     [,1] [,2]
[1,]    0    1
[2,]    2    0
[3,]    0    3
```

This exercise asks you to use this array to give three different views of the `whiteside` data frame. The first plot, on the upper right of the plot array, shows the relationship of `Gas` and `Temp` using data where `Insul` is equal to `"Before"`. The second plot, in the center left of the plot array, shows the relationship of the two variables using all data from `whiteside`. Finally, the third plot, on the lower right of the plot array, shows the relationship using data where `Insul` is equal to `"After"`.

The primary motivation for this exercise is that it is not possible to construct a plot array in this format using the `mfrow` parameter, since the array is not rectangular.

### Instructions

The layout matrix you set up in the previous exercise is available in your workspace as `layoutMatrix`. The `whiteside` data frame is already available in your workspace as well.

- Call the `layout()` function on `layoutMatrix` to set up the plot array.
- Construct a vector `indexB` that points to only those records with the `Insul` value `"Before"` and a vector `indexA` that points to only those records with the `Insul` value `"After"`.
- Plot the `Gas` versus `Temp` values from the `indexB` data in the upper right plot in your array, using the y-axis limits `c(0, 8)`. Give this plot the title `"Before data only"`.
- Plot the `Gas` versus `Temp` values from the complete dataset in the center left plot, using the same y-axis limits as the first plot. Give this plot the title `"Complete dataset"`.
- Plot the `Gas` versus `Temp` values from the `indexA` data in the lower right plot, again using the same y-axis limits. Give this plot the title `"After data only"`.
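How plots fill the numbered cells of a layout can be sketched as follows. The chunk rebuilds the layout matrix shown above and fills it with plots of the built-in `cars` data, which stands in for `whiteside` purely for illustration; the names `n_plots`, `slow`, and `fast` are assumptions.

```{r}
# The layout matrix shown above: 0 marks an empty cell
layoutMatrix <- matrix(c(0, 1,
                         2, 0,
                         0, 3), byrow = TRUE, nrow = 3)

# layout() returns the number of plots in the array
n_plots <- layout(layoutMatrix)

# Illustrative data only: two subsets of the built-in cars data
slow <- which(cars$speed < 15)
fast <- which(cars$speed >= 15)

# Each high-level plot call fills the next numbered cell
plot(cars$speed[slow], cars$dist[slow], ylim = c(0, 120))
title("Plot 1: upper right")
plot(cars$speed, cars$dist, ylim = c(0, 120))
title("Plot 2: center left")
plot(cars$speed[fast], cars$dist[fast], ylim = c(0, 120))
title("Plot 3: lower right")
```

Using the same `ylim` in all three calls keeps the vertical scales comparable across the panels.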
```{r}
# Set up the plot array


# Construct vectors indexB and indexA
indexB <- which(___)
indexA <- which(___)

# Create plot 1 and add title
plot(whiteside$___[___], whiteside$___[___],
     ylim = ___)
title(___)

# Create plot 2 and add title


# Create plot 3 and add title

```

## Creating arrays with different sized plots

Besides creating non-rectangular arrays, the `layout()` function can be used to create plot arrays with different sized component plots -- something else that is not possible by setting the `par()` function's `mfrow` parameter. This exercise illustrates the point, asking you to create a standard scatterplot of the `zn` versus `rad` variables from the `Boston` data frame as a smaller plot in the upper left, with a larger sunflower plot of the same data in the lower right.

### Instructions

The `Boston` data frame in the `MASS` package is already available in your workspace.

- Using the `c()` function:
    - Create the three-element vector `row1` with values `(1, 0, 0)`.
    - Create the three-element vector `row2` with values `(0, 2, 2)`.
    - Combine the first vector with two copies of the second vector into `layoutVector`, a vector of length 9.
- Using the `matrix()` function, convert `layoutVector` into the 3-by-3 matrix `layoutMatrix`, whose first row is `row1` and whose second and third rows are `row2`.
- Use the `layout()` function with `layoutMatrix` to configure a two-element plot array.
- In the first (smaller) plot, create a standard scatterplot of the `zn` variable versus the `rad` variable from the `Boston` data frame.
- In the second (larger) plot, create a sunflower plot using the `sunflowerplot()` function and the same variables.
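The mechanism behind different sized plots is that a plot number repeated in the layout matrix makes that plot span all of the cells carrying its number. A minimal sketch, using a different and purely hypothetical layout named `sizeMatrix`:

```{r}
# Plot 1 spans a 2-by-2 block of cells; plots 2 and 3 get one cell each
sizeMatrix <- matrix(c(1, 1, 2,
                       1, 1, 3), byrow = TRUE, nrow = 2)

# Set up the array and outline where each plot will go
n_plots <- layout(sizeMatrix)
layout.show(n_plots)
```

Calling `layout.show()` before drawing any data is a cheap way to confirm that the larger region really belongs to the plot you intended.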
```{r}
# Create row1, row2, and layoutVector
row1 <- ___
row2 <- ___
layoutVector <- ___

# Convert layoutVector into layoutMatrix
layoutMatrix <- matrix(___, byrow = ___, nrow = ___)

# Set up the plot array


# Plot scatterplot


# Plot sunflower plot

```

# Advanced Plot Customization and Beyond

## Some plot functions also return useful information

Calling the `barplot()` function has the side effect of creating the plot we want, but it also returns a numerical vector with the center positions of each bar in the plot. This value is returned invisibly so we don't normally see it, but we can capture it with an assignment statement.

These return values can be especially useful when we want to overlay text on the bars of a horizontal barplot. Then, we capture the return values and use them as the `y` parameter in a subsequent call to the `text()` function, allowing us to place the text at whatever `x` position we want but overlaid in the middle of each horizontal bar. This exercise asks you to construct a horizontal barplot that exploits these possibilities.

### Instructions

The `Cars93` data frame from the `MASS` package is already available in your workspace.

- Use the `table()` function to generate the object `tbl` summarizing the number of records listing each value of the `Cylinders` variable from the `Cars93` data frame.
- Use the `barplot()` function with the `horiz` argument set to `TRUE` to construct a horizontal barplot of this record summary, specifying the color of the bars to be `"transparent"` and returning the vector `mids` giving the vertical positions of the centers of each bar in the plot. Specify the `names.arg` argument as `""` to suppress the y-axis legend on this plot.
- Use the `text()` function to label the `Cylinders` value for each bar in the barplot at the horizontal position `20`. The `names()` function may prove useful here.
- Use the `text()` function to list the counts for each cylinders value at the horizontal position `35`.
Here, the `as.numeric()` function may prove useful.

```{r}
# Create a table of Cylinders frequencies
tbl <- ___

# Generate a horizontal barplot of these frequencies
mids <- barplot(___, horiz = ___,
                col = ___,
                names.arg = ___)

# Add names labels with text()
text(___, mids, ___)

# Add count labels with text()
text(___, mids, ___)
```

## Using the `symbols()` function to display relations between more than two variables

The scatterplot allows us to see how one numerical variable changes with the values of a second numerical variable. The `symbols()` function allows us to extend scatterplots to show the influence of other variables.

This function is called with the variables `x` and `y` that define a scatterplot, along with another argument that specifies one of several possible shapes. Here, you are asked to use the `circles` argument to create a *bubbleplot* where each data point is represented by a circle whose radius depends on the third variable specified by the value of this argument.

### Instructions

The `Cars93` data frame from the `MASS` package is already available in your workspace.

- Use the `symbols()` function with its default settings and the appropriate arguments to create a bubble plot of `MPG.city` versus `Horsepower` from the `Cars93` data frame, with the bubble area determined by the `Price` variable. Note that this means the bubble radius should be proportional to the square root of `Price`.
- Re-create the first plot, but with the optional argument `inches` set to `0.1`.

```{r}
# Call symbols() to create the default bubbleplot
symbols(___, ___,
        circles = ___)

# Repeat, with the inches argument specified
symbols(___, ___,
        circles = ___,
        inches = ___)
```

## Saving plot results as files

In an interactive R session, we typically generate a collection of different plots, often using the results to help us decide how to proceed with our analysis.
This is particularly true in the early phases of an exploratory data analysis, but once we have generated a plot we want to share with others, it is important to save it in an external file.

R provides support for saving graphical results in several different external file formats, including jpeg, png, tiff, or pdf files. In addition, we can incorporate graphical results into external documents, using tools like the `Sweave()` function or the `knitr` package. One particularly convenient way of doing this is to create an R Markdown document, an approach that forms the basis for another course.

Because png files can be easily shared and viewed as e-mail attachments and incorporated into many slide preparation packages (e.g. Microsoft PowerPoint), this exercise asks you to create a plot and save it as a png file. The basis for this process is the `png()` function, which specifies the name of the png file to be generated and sets up a special environment that captures all graphical output until we exit this environment with the `dev.off()` command.

### Instructions

- Use the `png()` function to direct all subsequent plot results to the external file `bubbleplot.png`.
- Run the code to re-create the second bubble plot from the previous exercise.
- Exit the `png()` environment to return graphics control to your session by calling `dev.off()`.
- Use the `list.files()` code provided to verify that `bubbleplot.png` was created. To do this, you need to specify `"png"` as the value of the `pattern` argument to `list.files()`.
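The open, draw, close sequence can be sketched as below. This is an illustration only: the file name `example-plot.png` is hypothetical, the file is written to a temporary directory so nothing is left in your working directory, and it assumes your R installation has png support.

```{r}
# Hypothetical file name in the session's temporary directory
out_file <- file.path(tempdir(), "example-plot.png")

# Open the png device: graphical output now goes to the file
png(out_file)

# Illustrative plot only, not the one from this exercise
plot(cars$speed, cars$dist)

# Close the device, which writes the file and returns control to the session
dev.off()

# Confirm that the file now exists
file.exists(out_file)
```

Forgetting `dev.off()` is a common slip: until the device is closed, later plots keep going to the file rather than to the screen.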
```{r}
# Call png() with the name of the file we want to create


# Re-create the plot from the last exercise
symbols(Cars93$Horsepower, Cars93$MPG.city,
        circles = sqrt(Cars93$Price),
        inches = 0.1)

# Save our file and return to our interactive session


# Verify that we have created the file
list.files(pattern = "png")
```

## Iliinsky and Steele's 12 recommended colors

This exercise asks you to create a horizontal barplot that shows Iliinsky and Steele's set of recommended 12 colors, in descending order of desirability from the top of the plot to the bottom. Also, the first six "more preferred" colors are displayed with longer bars to visually emphasize their preferred status over the other six.

Code is provided to create the character vector `IScolors` with the names of the 12 colors recommended by Iliinsky and Steele, in their recommended order. We've also created the numeric vector `barWidths` containing the value 2 for the first six Iliinsky and Steele colors and the value 1 for the next six.

In this exercise, you'll use the `barplot()` function to create the horizontal barplot shown in the figure below, with the color names to the left of the bars and each bar drawn in the indicated color.

![Recommended plot colors by Iliinsky and Steele.](images/plotcolors.png)

### Instructions

- Notice how the `barWidths` vector of length 12 contains the length of each bar shown.
    - The 6 longer bars have a value of 2.
    - The 6 shorter ones have a value of 1.
- Recreate the horizontal barplot you see above, using the `horiz`, `col`, `axes`, `names.arg`, and `las` arguments to `barplot()`. Note that since the horizontal barplot is constructed from the bottom up, it is necessary to reverse the `IScolors` and `barWidths` vectors in the `barplot()` function call, using the `rev()` function.
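The bottom-up construction is easiest to see on a tiny case. The sketch below uses three made-up bars (the vectors `bar_names`, `bar_widths`, and `bar_cols` are assumptions for illustration) listed in the order we want to read them from top to bottom.

```{r}
# Illustrative values only, listed in top-to-bottom reading order
bar_names <- c("first", "second", "third")
bar_widths <- c(3, 2, 1)
bar_cols <- c("red", "green", "blue")

# Horizontal bars are drawn from the bottom up, so reverse every vector
mids <- barplot(rev(bar_widths), horiz = TRUE,
                col = rev(bar_cols), axes = FALSE,
                names.arg = rev(bar_names), las = 1)
```

Without the `rev()` calls, `"first"` would end up at the bottom of the plot instead of the top.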
```{r}
# Iliinsky and Steele color name vector
IScolors <- c("red", "green", "yellow", "blue", "black", "white",
              "pink", "cyan", "gray", "orange", "brown", "purple")

# Create the data for the barplot
barWidths <- c(rep(2, 6), rep(1, 6))

# Recreate the horizontal barplot with colored bars
barplot(___,
        horiz = ___,
        col = ___,
        axes = ___,
        names.arg = ___,
        las = 1)
```

## Using color to enhance a bubbleplot

In a previous exercise, you saw how the `symbols()` function could be used to generate a bubbleplot showing the relationship between three variables (specifically, `Horsepower`, `MPG.city`, and `Price` from the `Cars93` data frame). There, the basic format was a scatterplot of `MPG.city` versus `Horsepower`, with points represented as bubbles, sized by `Price`. Here, you are asked to create a variation on this plot, using the factor variable `Cylinders` to determine both the size and the color of the bubbles.

To do this, note that `as.numeric(Cars93$Cylinders)` generates a sequence of numerical values from 1 to 6 that specify the six unique levels of the `Cylinders` factor. The point of this exercise is to show that, in cases where color is an option, the clarity of this bubbleplot can be improved substantially through the use of color in addition to symbol size. Since the `Cylinders` variable exhibits six unique values, six colors are required to make this enhancement, so the top six colors recommended by Iliinsky and Steele are used here.

### Instructions

![Exercise objective is to create this plot.](images/bubbleplot.png)

The `Cars93` data frame in the `MASS` package is available in your workspace, and the vector `IScolors` of color names is created in the code included at the beginning of the exercise.

- Create the vector `cylinderLevels` giving numerical labels to the unique values of the `Cylinders` variable.
- Using the `symbols()` function, recreate the bubbleplot of `MPG.city` vs. `Horsepower` shown above with the bubbles sized by the `cylinderLevels` variable.
- Specify a maximum circle radius of 0.2 inches.
- Use the `bg` parameter to color each bubble according to the numerical level of the `Cylinders` factor variable (i.e., the `cylinderLevels` values), using the first six recommended colors from Iliinsky and Steele (e.g. the first `Cylinders` level should be colored "red", the second level colored "green", etc.)

```{r}
# Iliinsky and Steele color name vector
IScolors <- c("red", "green", "yellow", "blue", "black", "white",
              "pink", "cyan", "gray", "orange", "brown", "purple")

# Create the `cylinderLevels` variable


# Create the colored bubbleplot
symbols(___, ___,
        circles = ___,
        inches = ___,
        bg = ___[___])
```

## Using color to enhance stacked barplots

The most common barplot consists of a collection of vertical bars, each representing some characteristic of one of a finite number of datasets or data subsets. Several previous exercises have illustrated the utility of the somewhat less common horizontal barplot, useful in part because it allows text to be displayed horizontally across the bars, as illustrated in Exercise 1.

Another useful variant of the standard bar plot is the stacked bar plot, where each bar in the plot is partitioned into segments representing portions of the data summarized by that bar. Stacked bar plots can also be generated using the `barplot()` function. Here, the bars are specified by a matrix in which each column gives the segment heights for one bar.

By default, the `barplot()` function generates stacked bar plots using different shades of gray for the different segments of each bar in the plot. The point of this exercise is to show that, if we can use it, color can be a more effective alternative.

### Instructions

![Your colored plot should look like this.](images/barplot.png)

The character vector `IScolors` from the previous exercises is still available in your workspace.

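
As a rough sketch of how a matrix maps to stacked bars, the following uses the built-in `mtcars` data rather than `Cars93`. The names `tblDemo` and `demoCols` and the choice of three colors are assumptions made for this illustration only.

```{r}
# Cross-tabulate cylinder count (rows) by transmission type (columns)
tblDemo <- table(mtcars$cyl, mtcars$am)

# Default: one stacked bar per column, segments in shades of gray
barplot(tblDemo)

# Same plot with one color per row, i.e. per cylinder count
demoCols <- c("red", "green", "yellow")
barplot(tblDemo, col = demoCols, names.arg = c("automatic", "manual"))
```

Each column of `tblDemo` becomes one bar and each row becomes one segment, so `col` needs one entry per row.
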
- Create the table `tbl` giving the record counts for each `Cylinders` value at each `Origin` value (this table should have 6 rows and 2 columns).
- Using the `barplot()` function, create a stacked barplot that summarizes this information using shades of gray.
- Recreate this first plot, but using the first six Iliinsky and Steele colors for the six `Cylinders` levels.

```{r}
# Create a table of Cylinders by Origin


# Create the default stacked barplot


# Enhance this plot with color

```

## The tabplot package and grid graphics

The `tabplot` package allows you to tap into the power of grid graphics without explicit knowledge of how the system works under the hood. The main function in `tabplot` is `tableplot()`, developed to visualize data distributions and relationships between variables in large datasets.

Specifically, the `tableplot()` function constructs a set of side-by-side horizontal barplots, one for each variable. This function works best when viewing up to about 10 variables at a time, for datasets with arbitrarily many records. In this exercise, you are asked to apply this function to a dataset with just under 68,000 records and 11 variables.

The `tableplot()` function is called with a data frame and, if no optional arguments are specified, it selects the first data column as the reference variable. This variable can be of any type, but the display is easiest to explain when it's numeric, as in the example considered here.

For further details, refer to the vignette *Visualization of large datasets with tabplot* that accompanies the help files for the `tabplot` package.

### Instructions

- Load the `insuranceData` package.
- Use the `data()` function to load the `dataCar` data frame.
- Load the `tabplot` package the normal way, but surround your call to `library()` with the `suppressPackageStartupMessages()` function to keep unnecessary startup output from printing to the console.
- Apply the `tableplot()` function to the `dataCar` data frame.

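
The wrapping pattern in the third instruction looks like this in isolation. The sketch uses `MASS` as a stand-in for `tabplot`, on the assumption that `MASS` is present because it is distributed with standard R installations.

```{r}
# MASS stands in for tabplot here; the call shape is the same
suppressPackageStartupMessages(library(MASS))

# The package is attached as usual; only its startup messages are hidden
"package:MASS" %in% search()
```

Only startup messages are silenced. Warnings and errors raised while loading a package are still reported.
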
```{r}
# Load the insuranceData package


# Use the data() function to load the dataCar data frame


# Load the tabplot package
suppressPackageStartupMessages(___)

# Generate the default tableplot() display

```

## A lattice graphics example

Another package built on grid graphics is `lattice`, but unlike `tabplot`, `lattice` is a general-purpose graphics package that provides alternative implementations of many of the plotting functions available in base graphics. Specific examples include scatterplots with the `xyplot()` function, bar charts with the `barchart()` function, and boxplots with the `bwplot()` function.

One important difference between lattice graphics and base graphics is that similar functions available in both graphics systems often produce very different results when applied to the same data. As a specific example, the `bwplot()` function creates horizontal boxplots, while the default result of the `boxplot()` function is a vertical boxplot display.

Another, more important difference between lattice and base graphics is that lattice graphics supports conditional plots that show the separate relationships between variables within different groups. This capability is illustrated in this exercise, where you are asked to construct a plot showing the relationship between the variables `calories` and `sugars` from the `UScereal` data frame, conditional on the value of the `shelf` variable.

### Instructions

- Load the `lattice` package to make the function `xyplot()` available.
- Using the conditional formula format `"y ~ x | z"` (scatterplot of y vs. x, conditioned on z), construct a conditional scatterplot of `calories` vs. `sugars` conditional on `shelf` from the `UScereal` data frame in the `MASS` package. Make sure to convert `shelf` to a factor.

```{r}
# Load the lattice package


# Use xyplot() to construct the conditional scatterplot

```

## A ggplot2 graphics example

This final exercise provides an introduction to the `ggplot2` graphics package. Like the `lattice` package, the `ggplot2` package is also based on grid graphics and it also represents a general-purpose graphics package. The unique feature of the `ggplot2` package is that it is based on the *grammar of graphics*, a systematic approach to building and modifying graphical displays, starting with a basic graphics element and refining it through the addition of successive modifiers.

This exercise provides a simple illustration of the approach. Specifically, you are asked to use `ggplot2` to, first, create a simple scatterplot of `MPG.city` versus `Horsepower` from the `Cars93` data frame. Next, you are asked to add a simple modifier that adds color based on the value of the `Cylinders` variable. Finally, you are asked to convert this result into a colored bubble plot, with both bubble sizes and colors determined by the `Cylinders` variable.

The primary purpose of this exercise is to give you a flavor of the `ggplot2` package.

### Instructions

The character vector `IScolors` from the previous exercises is still available in your workspace.

- Load the `ggplot2` package.
- Create `basePlot` as the `ggplot` object based on the `Cars93` data frame from the `MASS` package and the scatterplot aesthetic with `x` as the `Horsepower` variable and `y` as the `MPG.city` variable. When passing the variable names to `x` and `y`, they should be unquoted (e.g. `x = var_name`). (Note that this definition does not render the scatterplot.)
- Make a simple rendering of `basePlot` with the `geom_point()` function.
- Make a second rendering using the `color` parameter of `geom_point()` to specify the first six Iliinsky-Steele colors for the six `Cylinders` levels.
- Make a third rendering with the same coloring as the second one, but with the point sizes corresponding to the levels of the `Cylinders` factor variable.

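
The layering idea behind these steps can be sketched on the built-in `mtcars` data. This is an illustration under assumptions, not the exercise solution: the names `demoBase`, `cylLevel`, and `demoCols` and the three-color palette are invented for this example, and it assumes `ggplot2` is installed.

```{r}
library(ggplot2)

# Base object: data plus x/y aesthetics; nothing is drawn yet
demoBase <- ggplot(mtcars, aes(x = hp, y = mpg))

# Numeric level (1, 2, 3) of each car's cylinder count
cylLevel <- as.numeric(factor(mtcars$cyl))
demoCols <- c("red", "green", "blue")

# Adding a geom layer is what renders the points
p1 <- demoBase + geom_point()

# Per-point colors, then per-point sizes, looked up by level
p2 <- demoBase + geom_point(color = demoCols[cylLevel])
p3 <- demoBase + geom_point(color = demoCols[cylLevel], size = cylLevel)

print(p3)
```

Because the colors and sizes are supplied outside `aes()`, `ggplot2` treats them as fixed per-point values and draws no legend for them.
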