3 Data Visualization

In Chapter 2 we summarized data numerically, which allowed us to gather basic information about the data. In this chapter, we will summarize data visually, which is another critical step in getting to know the data we are working with.

Choose appropriate plots for different types of variables
Construct visualizations using ggplot()
Interpret visualizations

Visualizations can be useful for exploring and understanding data, looking at the overall patterns and trends, describing the distribution of data, helping raise new questions or hint that you’re asking the wrong question, and delivering a message about the data to the viewer.

The exploratory stage of data visualization is often done for ourselves, our research team, and colleagues. We will do exploratory data visualization in this chapter, and learn to write code that will give us basic visualizations. When our aim is to deliver a message, we must go beyond the basics to communicate it effectively. We will cover how to improve our basic plots in more detail in Chapter 4.

3.1 Data Context

Throughout this chapter, we’ll be working with the American Time Use Survey (ATUS) data. The United States Bureau of Labor Statistics conducts this survey annually to measure how Americans spend their time on various activities, including work, household activities, caregiving, and leisure pursuits.

The atus_college data frame contains a subset of the ATUS data: only the responses from those who are enrolled in a college or university. This dataset provides insights into how college students spend their time, and includes variables such as employment status, enrollment status, weekly earnings, and time spent alone.

We can go ahead and load the packages we will use in this chapter. The atus_college data is provided in the hellodatascience package. The data visualization functions that we will use mostly come from the ggplot2 package. The ggplot2 package loads along with other tidyverse packages when library(tidyverse) is called.

library(hellodatascience)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

We can go ahead and load the data and glimpse at it.

data(atus_college)
glimpse(atus_college)

Rows: 312
Columns: 13
$ employment        <fct> Part Time, Full Time, Part Time, Part Time, NA, Part…
$ age               <dbl> 19, 23, 22, 21, 26, 25, 27, 36, 30, 20, 18, 20, 25, …
$ enrollment        <fct> Part Time, Full Time, Full Time, Full Time, Full Tim…
$ weekly_earnings   <dbl> 400.00, 1476.92, 561.25, 100.00, NA, 300.00, 1076.92…
$ household_size    <dbl> 6, 2, 2, 4, 2, 3, 3, 1, 3, 2, 4, 4, 2, 2, 4, 5, 4, 2…
$ time_alone        <dbl> 326, 150, 357, 22, 0, 455, 90, 340, 326, 120, 285, 5…
$ sleep_time        <dbl> 680, 180, 470, 660, 875, 765, 630, 300, 445, 630, 66…
$ work_time         <dbl> 315, 0, 0, 0, 0, 0, 0, 645, 555, 0, 0, 0, 520, 615, …
$ degree_class_time <dbl> 0, 0, 238, 0, 0, 0, 0, 0, 0, 0, 0, 228, 0, 0, 0, 0, …
$ shopping_time     <dbl> 14, 0, 0, 0, 0, 0, 0, 5, 20, 345, 0, 0, 0, 0, 0, 0, …
$ lunch_break_time  <dbl> 66, 60, 20, 115, 35, 50, 25, 30, 75, 60, 15, 15, 60,…
$ sports_time       <dbl> 0, 60, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0…
$ religious_time    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0…

Recall that the following code needs to be run in the Console to get the documentation for the dataset.

?atus_college

The atus_college dataset has 312 rows and 13 columns. In other words, this dataset has information on 13 variables from 312 college students. Note that there are far more than 312 college students in the United States, hence the visualizations we will make in this chapter will be about these specific individuals. Once we get to Chapter 11 and beyond, we will discuss whether we can extend our findings to the overall student body in the United States. The variables represented in each column of the data are shown in Table 3.1.

Table 3.1: Documentation of Variables in the atus_college data

Variable Name	Description
employment	full time or part time employment status of respondent
age	age
enrollment	are you enrolled as a full-time or part-time student?
weekly_earnings	weekly earnings at main job
household_size	number of people living in respondent’s household
time_alone	total nonwork-related time respondent spent alone (in minutes)
sleep_time	time spent sleeping
work_time	time spent working at main job
degree_class_time	time spent taking class for degree, certification, or licensure
shopping_time	time spent shopping (store, telephone, internet)
lunch_break_time	time spent taking a lunch break
sports_time	time spent participating in sports, exercise, or recreation
religious_time	time spent attending or participating in religious services

3.2 Visualizing a Single Variable

Before we dive into advanced analyses, it is essential to understand the variables in our dataset one at a time. Single-variable visualizations help us explore the distribution, central tendencies (e.g., mean and median), and patterns within each variable. This preliminary step informs our understanding of the data and guides subsequent analyses. There are a variety of plots that we can create, but the choice of plot is often determined by the type of variable we have.

3.2.1 Visualizing a Categorical Variable

A bar plot (also called a bar chart) can be used to visualize a categorical variable. Recall that in Section 2.4.1 we said that we can summarize categorical variables with counts and proportions. Bar plots display counts or proportions using rectangular bars, where the height of each bar corresponds to the number or the proportion of observations in that category.

A bar plot of employment status. The x-axis is labeled 'employment status' with categories Full Time, Part Time, and NA. The y-axis is labeled 'count' and ranges from 0 to 140. Full Time has the tallest bar at about 130, Part Time is around 80, and NA is close to 100. — Figure 3.1: A bar plot of employment status

The bar plot in Figure 3.1 displays the distribution of employment status among college students, helping us compare the number of students who work full-time or part-time. As we can see, more students work full-time than part-time. You might also notice that there are some missing values, denoted by NA. In fact, there are more observations that are NA than part-time. You might be wondering what these NA values represent. Could they be the ones who are unemployed? Or could they be those who just did not want to respond to this particular survey question? Well, we don’t know. Understanding missing data may require additional analyses.

We can use a bar plot when

we have categorical data
we want to compare frequencies across different categories
we want to identify the most and least common categories
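
The counts and proportions that a bar plot displays can also be computed directly. The sketch below uses a small made-up vector rather than the atus_college data to show how missing values form their own category. The object name toy_employment and its values are assumptions for illustration only.

```r
# Hypothetical toy data, made up for illustration (not from atus_college)
toy_employment <- factor(
  c("Full Time", "Part Time", "Full Time", NA,
    "Part Time", "Full Time", NA, "Full Time"),
  levels = c("Full Time", "Part Time")
)

# Counts per category; useNA = "ifany" keeps NA as its own category,
# much like the bar plot draws a separate bar for NA
toy_counts <- table(toy_employment, useNA = "ifany")
toy_counts

# Proportions: each count divided by the total number of observations
toy_props <- prop.table(toy_counts)
toy_props
```

The heights of the bars in a bar plot of toy_employment would match the values in toy_counts.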

Let’s go ahead and replicate Figure 3.1 using R. To create all of our plots, we will be using the ggplot() function, which creates a coordinate system that you can add layers to, such as type of graph, labels, color, and much more. Most of the basic plots we will make in this chapter will consist of three steps:

In the first step, using the ggplot() function, we specify the data we will use. This code will create a blank canvas with a gray background.

ggplot(
  data = atus_college
)

A blank rectangular space — Figure 3.2: Blank coordinate system

The next step is the mapping. Within the ggplot() function we use the aes() function to map variables to certain aesthetics. In this case of creating a barplot for employment, we are only mapping the employment variable to the x-axis. After this code, you will notice that the categories of the employment variable are now shown on the x-axis.

ggplot(
  data = atus_college, 
  aes(x = employment)
)

A blank rectangular space with only the x-axis labeled employment with three categories Full Time, Part Time, and NA — Figure 3.3: Mapping `employment` variable to the x-axis

The final step is to add a geometric object (geom_bar()) that tells ggplot how to display the data (with bars). Note that a geometric object adds a layer to our plot, hence it is added by placing a + sign at the end of the previous line.

ggplot(
  data = atus_college, 
  aes(x = employment)
) +
  geom_bar()

To summarize, the three main steps to make a plot are:

Call the ggplot() function on the data we want to plot.
Map variables to plot aesthetics using the aes() function.
Add a layer to specify the plot type.

The tidyverse style guide has the following convention for writing ggplot2 code.

The plus sign for adding layers + always has a space before it and is followed by a new line.
The new line is indented by two spaces. RStudio does this automatically for you.

We could also write the above code as follows:

ggplot(
  data = atus_college, 
  aes(x = employment)
) +
  geom_bar()

It is clear that aes() is an argument within the ggplot() function since data and aes() are vertically aligned. When our aesthetic mappings get long we will prefer this method of specifying arguments in a vertical manner.
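
For comparison, the same plot can also be written more compactly with the arguments on a single line. The sketch below uses a small made-up data frame (toy_students is a hypothetical name, not part of hellodatascience) and uses layer_data() to check that both styles compute the same bar heights.

```r
library(ggplot2)

# Hypothetical toy data for illustration only
toy_students <- data.frame(
  employment = c("Full Time", "Part Time", "Full Time", "Full Time", "Part Time")
)

# Compact style: all arguments on one line
p_compact <- ggplot(data = toy_students, aes(x = employment)) +
  geom_bar()

# Vertical style: one argument per line
p_vertical <- ggplot(
  data = toy_students,
  aes(x = employment)
) +
  geom_bar()

# layer_data() returns the values computed for a layer,
# including the count behind each bar
bar_heights_compact <- layer_data(p_compact)$count
bar_heights_vertical <- layer_data(p_vertical)$count
identical(bar_heights_compact, bar_heights_vertical)
```

Both styles produce the same plot; the difference is only in how easy the code is to read as the list of arguments grows.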

3.2.2 Visualizing a Numeric Variable

There are many different plots that can be utilized to visualize a numeric variable. In this section, we will cover histograms and boxplots. You should be aware that there are many other options.

A histogram is one of the most fundamental tools for exploring numeric data. It displays the distribution of a numeric variable by dividing the range of the variable into bins (intervals or classes) and showing the frequency of observations in each bin.

We will visualize the weekly_earnings variable using a histogram. To create a histogram, we will use the three-step process that we learned while making a bar plot.

The full code is as follows:

ggplot(
  data = atus_college,         # 1
  aes(x = weekly_earnings)     # 2
) +
  geom_histogram()             # 3

1: We specify the data within the ggplot() function. i.e., ggplot(data = atus_college)
2: We map our variables to aesthetics. In this case, we would like to display the weekly_earnings on the x-axis. i.e., aes(x = weekly_earnings)
3: We state the plot type by using the appropriate geom object. i.e., geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

A histogram of weekly earnings. The x-axis is labeled 'Weekly Earnings' and ranges from 0 to 3000 in increments of 1000. The y-axis is labeled 'Count' and ranges from 0 to 20 in increments of 5. There are about 30 bars with a width of about 100. The height of each bar represents the frequency of weekly earnings in that range, with most counts concentrated at lower earnings and gradually decreasing toward higher earnings. — Figure 3.5: A histogram of weekly earnings

The weekly earnings data range from about $0 to $3,000 and there are 30 bins, so each bin has a width of about $100. Reading the plot, we can see that 4 people make between $0 and $100, 10 people make between $100 and $200, and so on. We can also observe overall trends, such as that most of the people in the data make $1,000 a week or less, with very few making above $2,000 a week.

You might also notice that, in addition to displaying the histogram, R is showing us a message right below the code. The message says, “stat_bin() using bins = 30. Pick better value binwidth.” This informs us that in order to make a histogram, R had to make certain decisions. By default, R made a histogram with 30 bins (the upright rectangles), but this may not be the best choice. R is telling us to use the binwidth argument to choose a value appropriate for our data.

Let’s go ahead and set the width of each rectangle to $20 using the binwidth argument to see if that’s a better choice.

ggplot(
  data = atus_college,
  aes(x = weekly_earnings)
) +
  geom_histogram(binwidth = 20)    # 1

1: Since binwidth is related to the histogram layer, this argument goes inside the geom_histogram() function.

You can see that the bins got very thin when they were set to $20. There are many bins that are empty or have a height of 1, in other words, many bins represent only a single observation. Perhaps setting the binwidth to $20 was not the best decision.

A figure showing three histograms comparing different bin widths for weekly earnings distribution. Each histogram has the x-axis labeled Weekly Earnings ranging from 0 to 3000, and the y-axis labeled Count. The first histogram uses a bin width of 20, with many narrow bars and counts mostly below 6. The second histogram uses a bin width of 100, with fewer bars and counts up to about 20. The third histogram uses a bin width of 500, with very few wide bars and counts reaching about 75. The figure illustrates how increasing bin width reduces the number of bars and changes the distribution's visual detail. All three histograms show higher counts for lower earnings and counts decreasing towards higher earnings. — Figure 3.7: Comparing different binwidths

So, how does one determine the best binwidth or number of bins? This has been a question that many statisticians have tried to answer. Some have come up with their own rules. However, these rules are beyond the scope of this book. At this early stage we will decide on the binwidth by trying different binwidths based on the range of our data to see what reveals patterns best, as shown in Figure 3.7. We don’t want too many bins, as this may leave many bins empty and make it hard to see the overall pattern (e.g., binwidth = 20). We also don’t want too few bins, as there will be a loss of detail (e.g., binwidth = 500). We want something that is summarized but that may also display important details based on the context of our data.

In Figure 3.7, it may seem at first that the height of all the graphs is the same, but a closer look shows that the range of the y-axis differs across the three graphs. As you explore various binwidths to determine what best describes your data, R adjusts the y-axis so that the tallest bin fits the plotting window. The larger the binwidth, the fewer the bins; each bin then holds more observations and is therefore taller. On the other hand, a smaller binwidth results in a larger number of bins with smaller frequencies, many of which are often empty.
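
The trade-off between binwidth and the number of bins can be checked with a little arithmetic. The sketch below assumes a range of roughly $0 to $3,000, as read off the histogram; the exact number of bins that ggplot2 draws may differ slightly because of how it places the bin boundaries.

```r
# Approximate range of weekly earnings, read off the histogram (an assumption)
earnings_range <- 3000 - 0

# Candidate binwidths compared in the text
binwidths <- c(20, 100, 500)

# Approximate number of bins needed to cover the range at each binwidth
approx_bins <- earnings_range / binwidths
data.frame(binwidth = binwidths, approx_bins = approx_bins)
```

With at most 312 observations, spreading them over roughly 150 bins leaves many bins nearly empty, while about 6 bins hides most of the detail.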

One last concept we need to understand while using histograms is skewness, which is a measure of asymmetry of a distribution. It tells us whether the data tends to have longer “tails” on one side compared to the other (i.e. unusual extreme observations relative to the rest of the data), and where the bulk of the data is concentrated.

A figure showing four histograms illustrating different distribution shapes. Each histogram has the x-axis labeled x and the y-axis labeled count. The first histogram, labeled Left-skewed, has a longer tail on the left side: most high-count bars are concentrated on the right side, with low counts towards the left. The second histogram, labeled Bell-shaped, shows a symmetric distribution with most bars centered in the middle and equal length tails on both sides. The third histogram, labeled Uniform, has bars of roughly equal height across the range. The fourth histogram, labeled Right-skewed, has a longer tail on the right side: most high-count bars are concentrated on the left side, with low counts towards the right. — Figure 3.8: Understanding skewness of a histogram

In Figure 3.8, we can see the four most common types of distributions. A left-skewed distribution has a long tail extending to the left, and has most of the data clustered on the right side. The tail is also in the direction of the negative side of the x-axis, and hence sometimes other data scientists refer to this as a negatively skewed distribution.

In a bell-shaped distribution, the tails on the right and left are approximately equal in length, and the data is concentrated in the middle. Data is evenly distributed around the center and hence the mean and median of the data are close to one another.

In a uniform distribution all bins have the same height, and hence this distribution is associated with the shape of a rectangle. Both bell-shaped and uniform distributions are considered symmetric distributions because if you were to cut the graph in half, the left side is a mirror image of the right side. In a perfectly symmetric distribution, the mean and the median are equal.

In a right-skewed distribution, the majority of observations are concentrated toward the lower values on the left side, while a lengthy tail stretches toward the higher values on the right. The tail extends in the positive direction along the x-axis, which is why other data scientists might also call this a positively skewed distribution.

The distribution of weekly_earnings is

left-skewed
symmetric
right-skewed

Looking at the distribution of the weekly_earnings which of the following can be concluded?

mean > median
mean = median
mean < median

Check footnote for answers¹

In general, for left-skewed distributions, the mean is smaller than the median, while the opposite happens for right-skewed distributions, where the mean is greater than the median.
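
To see this relationship between skewness, the mean, and the median in numbers, here is a minimal sketch with made-up values. The two vectors are hypothetical and are not taken from atus_college.

```r
# Hypothetical right-skewed values: most are small, one is very large
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)

# Mirror the values to get a left-skewed version of the same shape
left_skewed <- 21 - right_skewed

# Right-skewed: the mean is pulled toward the long right tail
mean(right_skewed)
median(right_skewed)

# Left-skewed: the mean is pulled toward the long left tail
mean(left_skewed)
median(left_skewed)
```

The single extreme value moves the mean noticeably but leaves the median almost untouched, which is why the two summaries separate in skewed data.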

Let’s wrap up our discussion of histograms by considering when to use one. We can use a histogram when we

have numeric data
want to identify patterns like skewness or multiple peaks
want to see the central tendency and spread of the data
want to check if data follows a particular distribution (like a bell curve)

Since bar plots and histograms both have upright rectangles, it might be easy to confuse them. One way to distinguish them is by remembering that bar plots are for categorical variables and histograms are for numeric variables. That’s why the bars in a bar plot have spaces between them, while the bins in a histogram are adjacent to one another: numbers can be continuous, but categories cannot.

Another plot type that can help us visualize a numeric variable is a boxplot (also called a box-and-whisker plot). It provides a visual representation of the central tendency, spread, and skewness of numeric data by displaying the minimum, first quartile, median, third quartile, and maximum, also known as the five-number summary.

Let us create a boxplot of weekly_earnings. Following the three-step pattern we used for the bar plot and the histogram, we will write similar code with a slight twist.

ggplot(
  data = atus_college,         # 1
  aes(y = weekly_earnings)     # 2
) +
  geom_boxplot()               # 3

1: We specify the data within the ggplot() function. i.e., ggplot(data = atus_college)
2: We map our variables to aesthetics. In this case we would like to display the weekly_earnings on the y-axis. i.e., aes(y = weekly_earnings). We actually do not want any variable on the x axis, so we do not specify any for it.
3: Lastly, we specify the plot type by using the appropriate geom object. i.e., geom_boxplot().

A vertical boxplot showing weekly earnings distribution. The x-axis has no label and its range from -0.4 to 0.4 has no real meaning. The y-axis is labeled weekly_earnings, ranging from 0 to 3000. The box spans approximately from 500 to 1250, with a median line near 750. The lower whisker extends from the lower end of the box down to 0, and the upper whisker extends from the upper end of the box up to about 2500. Several outliers appear above the upper whisker, ranging from about 2500 to 3000. — Figure 3.9: Boxplot of weekly earnings

Figure 3.9 looks like an interesting plot! What does it represent though? We show an annotated version of this boxplot in Figure 3.10. Each component of the boxplot represents a specific statistical measure of the data distribution of weekly earnings. Looking at the annotated figure let’s try to understand each part, one by one.

The box itself represents the Interquartile Range (IQR), which represents the range of the middle 50% of all data points and spans from the first quartile (Q1 = $394) to the third quartile (Q3 = $1250), and hence IQR = Q3 - Q1 = $856. The horizontal line within the box marks the median value of $706, which is the point where half of the observations fall above and half fall below.

The whiskers extend from the box to show the range of typical values in the dataset. How far they can extend is bounded by the whisker limits: the lower whisker limit is -$890 and the upper whisker limit is $2534. These limits are used as bounds to determine if the data have any potential outliers, which are values that do not follow the overall pattern of the data and are too large or too small compared to the rest of the data. Each limit lies 1.5 times the IQR beyond an edge of the box. In this instance, the lower whisker limit is $Q1 - 1.5 \times IQR = 394 - 1.5 \times 856 = -890$. On the other hand, the upper whisker limit is $Q3 + 1.5 \times IQR = 1250 + 1.5 \times 856 = 2534$.

The minimum value in the dataset is $12, so the lower whisker actually stops there rather than extending all the way to its limit of -$890. The maximum value is $3040, which is above the upper whisker limit of $2534, so the upper whisker stops at the largest observation that does not exceed that limit. Note that, in addition to the maximum value, there are many other points that are observed above the upper whisker limit of $2534. Any data points beyond the whisker limits are considered potential outliers and are plotted as individual points.
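
The whisker limits described above can be reproduced with a few lines of arithmetic. This sketch takes the quartiles, minimum, and maximum reported in the text as given rather than recomputing them from the data.

```r
# Values reported in the text for weekly_earnings
q1 <- 394
q3 <- 1250
min_value <- 12
max_value <- 3040

# Interquartile range: the height of the box
iqr <- q3 - q1

# Whisker limits: 1.5 times the IQR beyond each edge of the box
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr
c(iqr = iqr, lower_limit = lower_limit, upper_limit = upper_limit)

# The minimum lies within the lower limit, so the lower whisker stops at the minimum
min_value >= lower_limit

# The maximum lies beyond the upper limit, so it is a potential outlier
max_value > upper_limit
```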

A detailed boxplot illustrating weekly earnings distribution with annotations. The y-axis is labeled 'Weekly Earnings ($)' and ranges from -1000 to 3000. The box spans from Q1 = 394 to Q3 = 1250, with a median at 706. The minimum value is 12, and the maximum is 3040. The upper whisker limit is 2534, and the lower whisker limit is -890. Several green points above the upper whisker represent potential outliers between 2534 and 3040. The interquartile range (IQR) is highlighted between Q1 and Q3. Text annotations indicate key values: Min = 12, Median = 706, Q1 = 394, Q3 = 1250, Upper whisker limit = 2534, Lower whisker limit = -890, and Max = 3040. — Figure 3.10: Annotated boxplot

The visualization in Figure 3.10 effectively summarizes the central tendency, spread, and extreme values of weekly earnings, making it easy to identify both the typical range of values and any unusual observations that might warrant further investigation. It does not, however, show each individual observation other than the ones that are beyond the upper whisker limit. In Figure 3.11, we have overlaid each observation (i.e., each surveyed student’s weekly earnings) onto the boxplot. Again, the x-axis is not meaningful; the points are only spread out horizontally so that they are visible and not on a single vertical line. This should make it clear that about a quarter of the points are below the box, a half are in the box, and a quarter are above the box.

A vertical boxplot combined with jittered points showing weekly earnings distribution. The x-axis is labeled 'x' and the y-axis is labeled weekly_earnings, ranging from 0 to 3000. The box spans approximately from 500 to 1250, with a median line near 750. The lower whisker extends from the lower end of the box down to 0, and the upper whisker extends from the upper end of the box up to about 2500. Several outliers appear above the upper whisker, ranging from about 2500 to 3000. The boxplot is also overlaid with pink jittered points representing individual data values, scattered around the boxplot, with most points concentrated between 0 and 1500 and fewer points at higher earnings. — Figure 3.11: Boxplot overlaid with individual observation points
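
A plot in the spirit of Figure 3.11 could be sketched by adding a second layer of points on top of the boxplot. The code below is not necessarily how the figure was produced. It uses a small made-up data frame (toy_earnings is a hypothetical name) and geom_jitter() to spread the points horizontally so that they do not overlap.

```r
library(ggplot2)

# Hypothetical toy data for illustration only (not from atus_college)
toy_earnings <- data.frame(
  weekly_earnings = c(100, 250, 300, 400, 550, 700, 800, 950, 1200, 1300, 2900)
)

p_overlay <- ggplot(
  data = toy_earnings,
  # A constant x gives both layers the same horizontal position
  aes(x = "", y = weekly_earnings)
) +
  # Hide the boxplot's own outlier points so they are not drawn twice
  geom_boxplot(outlier.shape = NA) +
  # Jitter sideways only, so the earnings values themselves are unchanged
  geom_jitter(width = 0.2, height = 0)

p_overlay
```

Because each added geom is a new layer, the same pattern of adding layers with + works for combining other plot types as well.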

We can use a boxplot when we

have numeric data
want to identify the central tendency and spread of the data
want to check if the data has any potential outliers

So far we have explored one variable at a time, but in real life, we often want to explore the relationship between two variables rather than examining them in isolation.

3.3 Visualizing Two Variables

Understanding how variables relate to each other can reveal important patterns and insights that aren’t apparent when looking at single variables alone. In this section, we’ll explore different plots for visualizing relationships between pairs of variables.

3.3.1 Visualizing Two Categorical Variables

When we have two categorical variables, we want to understand how the categories of one variable are distributed within the categories of another. A stacked bar chart is particularly useful for this purpose because it shows both the overall distribution of one variable and how it breaks down based on the second variable.

Let’s make a visualization to examine the relationship between employment status and enrollment status, which are both categorical variables. In Section 3.2.1 we already made a bar plot to look at the distribution of employment. We will keep the code and the plot very similar.

ggplot(
  data = atus_college, 
  aes(
    x = employment,
    fill = enrollment          # 1
  )
) +
  geom_bar()

1: We add a mapping of the enrollment variable to the fill aesthetic. This colors the segments of each bar according to enrollment status.

A stacked bar chart showing counts of enrollment status within employment categories. The x-axis is labeled employment with three categories: Full Time, Part Time, and NA. The y-axis is labeled 'count' and ranges from 0 to 150. Each bar is divided into two segments: red for Full Time enrollment and teal for Part Time enrollment. For Full Time employment, the total height is about 140, with roughly 60 teal and 80 red. For Part Time employment, the total is about 85, mostly red with a small teal segment near 10. For NA, the total is about 100, with a larger red segment around 80 and teal around 20. A legend on the right identifies colors for enrollment status: red for Full Time enrollment and teal for Part Time enrollment. — Figure 3.12: Stacked barplot of employment and enrollment status

This stacked bar chart shows the relationship between employment status and enrollment status among those enrolled in college. The x aesthetic maps employment categories to the x-axis, while the fill aesthetic uses different colors to represent enrollment categories within each employment group. This allows us to see not only how many students fall into each employment category, but also how enrollment status varies within each employment group.

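There are different ways we can position the bars in a bar plot. Stacking is the default, as we just saw. Another option is to standardize the stacked bars so that every bar has the same height and the segments show proportions rather than counts.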

ggplot(
  data = atus_college, 
  aes(
    x = employment,
    fill = enrollment
  )
) +
  geom_bar(position = "fill")    # 1

1: We can get a standardized stacked bar plot by setting position = "fill" within the geom_bar() function.

A stacked bar chart showing the proportion of enrollment status within employment categories. The x-axis is labeled 'employment' with three categories: Full Time, Part Time, and NA. The y-axis is labeled 'count' and ranges from 0 to 1, representing proportions. Each bar is divided into two segments: red for Full Time enrollment and teal for Part Time enrollment. For Full Time employment, the bar is about 45% teal and 55% red. For Part Time employment, the bar is mostly red (around 90%) with a small teal segment (about 10%). For NA, the bar is about 80% red and 20% teal. A legend on the right identifies colors for enrollment status: red for Full Time enrollment and teal for Part Time enrollment. — Figure 3.13: Standardized stacked barplot of employment and enrollment

Each bar is scaled to the same height (representing 100% of observations in that employment category), and the colored segments show the relative proportions of each enrollment status within each employment group. This visualization is particularly useful for comparing the composition or distribution patterns across categories, as it eliminates the effect of different group sizes and focuses purely on proportional relationships. Note that ggplot still indicates count on the y axis despite displaying a proportion. We will learn how to change this label in Chapter 4.
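
The proportions shown by position = "fill" are the counts within each employment group divided by that group's total. A minimal sketch with a made-up table of counts shows the calculation. The numbers and the name toy_table are hypothetical and are not the atus_college counts.

```r
# Hypothetical counts for illustration: rows are employment, columns are enrollment
toy_table <- matrix(
  c(60, 40,
    45,  5),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(
    employment = c("Full Time", "Part Time"),
    enrollment = c("Full Time", "Part Time")
  )
)

# Divide each row by its row total, so each employment group sums to 1
toy_props <- prop.table(toy_table, margin = 1)
toy_props

# Every bar in a "fill" plot therefore has the same total height of 1
rowSums(toy_props)
```

Although the two hypothetical groups have different sizes (100 and 50), their bars would be equally tall, which is what makes the compositions directly comparable.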

We can also position the bars next to each other in a dodged fashion rather than stacking them on top of each other.

ggplot(
  data = atus_college, 
  aes(
    x = employment,
    fill = enrollment
  )
) +
  geom_bar(position = "dodge")    # 1

1: For a dodged bar plot we set position = "dodge" within geom_bar().

A grouped side-by-side bar plot showing counts of employment categories broken into enrollment status. The x-axis is labeled 'employment' with three categories: Full Time, Part Time, and NA. The y-axis is labeled 'count' and ranges from 0 to 90. Each employment category has two bars: red for Full Time enrollment and teal for Part Time enrollment. For Full Time employment, the red bar is about 70 and the teal bar about 60. For Part Time employment, the red bar is about 75 and the teal bar about 10. For NA, the red bar is about 85 and the teal bar about 15. A legend on the right identifies colors for enrollment status: red for Full Time enrollment and teal for Part Time enrollment. — Figure 3.14: Side-by-side (dodge) barplot of employment and enrollment status

In a dodged bar plot, bars for different enrollment categories are placed side-by-side rather than stacked. This approach makes it easier to compare the actual counts or frequencies of each enrollment status across employment categories, as each bar’s height directly represents the number of observations. The grouped layout allows for direct visual comparison of bar heights both within and across employment groups.

3.3.2 Visualizing Two Numeric Variables

When we look at two numeric variables, we often want to explore their relationship, and the most effective visualization is a scatterplot. In a scatterplot, one variable is mapped to the x-axis and the other to the y-axis. We often think of y as a function of x, meaning that y is the one that could possibly react or change based on changes in x. Hence x is called the explanatory variable, whereas y is called the response variable. Through the use of a scatterplot, we want to determine if and how the explanatory variable ($x$) is associated with the response variable ($y$). Each point represents an observation, and its location shows the combination of values for both variables for that particular observation.

To classify the variables, we need context or information on how one would influence the other. For example, the number of hours a student studies for an exam would likely influence their exam score. In this case, $x$ is the number of hours studying, and $y$ is the exam score. If the relationship is not clear or logical, then the assignment is up to the data scientist’s discretion.

Let’s create a scatterplot to examine the relationship between time spent alone (in minutes) and weekly earnings (in dollars) among college students. In this case, we have chosen time spent alone to be plotted on the x-axis and weekly earnings on the y-axis because we often think of money as being a function of time, regardless of where that time is spent.

ggplot(
  data = atus_college,
  aes(
    x = time_alone,            # 1
    y = weekly_earnings
  )
) +
  geom_point()                 # 2

1: To understand the relationship between time_alone and weekly_earnings, time_alone is mapped to x-axis and weekly_earnings mapped to y-axis.
2: To add a layer of a scatterplot we utilize geom_point().

A scatterplot showing the relationship between time alone and weekly earnings. The x-axis is labeled time_alone and ranges from 0 to 1000. The y-axis is labeled weekly_earnings and ranges from 0 to 3000. Each point represents an observation. Most points are concentrated in the lower left quadrant, with time alone between 0 and 500 and weekly earnings below 1500. A few points extend toward higher earnings up to 3000 and higher time alone values up to 1000, but they are sparse. The overall pattern suggests no strong linear relationship, with data widely scattered. — Figure 3.15: Scatterplot of time spent alone and weekly earnings

Each point represents one student, positioned according to how much time they spend alone (x-axis) and their weekly earnings (y-axis). By plotting all observations this way, we can look for patterns such as positive or negative relationships, clusters of data points, or potential outliers. For instance, there are few points in the upper right corner, indicating that few students both spend a long time alone and earn a high weekly income.
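
A visual impression such as "the upper right corner is sparse" can be checked by counting. The sketch below is only an illustration: toy_students is a small made-up data frame (not the ATUS data), and the cutoffs of 500 minutes and 1500 dollars are assumptions chosen for the example. With the real data, toy_students would be replaced by atus_college.

```r
# Toy data invented for illustration; these are NOT values from atus_college.
toy_students <- data.frame(
  time_alone      = c(30, 120, 250, 400, 610, 720, 90, 840),
  weekly_earnings = c(400, 900, 1600, 300, 250, 1800, NA, 500)
)

# Illustrative cutoffs (assumptions, not taken from the chapter).
long_alone   <- toy_students$time_alone > 500
high_earning <- toy_students$weekly_earnings > 1500

# Count students in the "upper right" region. na.rm = TRUE skips any
# comparison that is NA because a value is missing.
n_upper_right <- sum(long_alone & high_earning, na.rm = TRUE)
n_upper_right
```

Counting this way complements the plot: the scatterplot suggests where to look, and the count confirms how many observations are actually there.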

A relationship between two variables does not imply causation because their association simply tells us that the two variables move together in some way, but it does not prove that one variable causes the other to change. We need rigorously designed studies to establish causal relationships, which you will learn more about in a future chapter.

We can use a scatterplot when we want to

examine if a relationship between two continuous variables exists or not.
conduct a preliminary analysis before a more formal statistical test looking for particular types of relationships (e.g., we might look to see if weekly earnings decrease when time spent alone increases or vice versa).
identify potential outliers.

3.3.3 Visualizing a Numerical and a Categorical Variable

When we want to explore the relationship between a categorical variable and a numerical variable, we need visualization techniques that can show how the distribution of the numerical variable differs across the categories. There are many ways to accomplish this, but one of the most effective and informative approaches is using side-by-side boxplots.

A side-by-side boxplot creates separate boxplots for each category of the categorical variable, allowing us to compare the distribution of the numerical variable across different groups. Each boxplot displays the five-number summary (minimum, first quartile, median, third quartile, and maximum) along with any potential outliers, giving us a comprehensive view of how the numerical variable behaves within each category.
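
To make the pieces of a boxplot concrete, here is a minimal sketch using base R's boxplot.stats() on a made-up vector (not the ATUS data). It relies on the common convention that whiskers stop at the most extreme values within 1.5 times the interquartile range of the box, and that anything beyond is flagged as a potential outlier. As a result, the whisker ends equal the minimum and maximum only when there are no potential outliers. Note that geom_boxplot() computes the quartiles slightly differently from boxplot.stats(), so the numbers can differ a little on real data.

```r
# Toy earnings invented for illustration; NOT values from atus_college.
earnings <- c(100, 200, 250, 300, 400, 450, 500, 2000)

# boxplot.stats() returns, in $stats:
#   lower whisker end, lower hinge, median, upper hinge, upper whisker end
# and, in $out, any points that fall beyond the whiskers.
bs <- boxplot.stats(earnings)

bs$stats  # the upper whisker ends at 500, not at the maximum of 2000
bs$out    # 2000 is flagged as a potential outlier
```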

We want to compare the distribution of weekly_earnings across students’ employment statuses.

ggplot(
  data = atus_college,
  aes(
    x = employment,        # 1
    y = weekly_earnings
  )
) +
  geom_boxplot()

1: This time we map our grouping variable to the x-axis, i.e., x = employment with y still being weekly earnings as in our previous example.

A side-by-side boxplot comparing weekly earnings across employment categories. The x-axis is labeled employment with three categories: Full Time, Part Time, and NA. The y-axis is labeled weekly_earnings and ranges from 0 to 3000. The Full Time boxplot shows a median near 1000, an interquartile range roughly from 750 to 1500, whiskers extending from the box down to about 25 and up to about 2500, and some outliers above 2500 up to 3000. The Part Time boxplot has a median around 400, an interquartile range from about 250 to 500, whiskers extending from the box down to 0 and up to about 1000, and a few outliers above 1000. The NA category has no visible box or data points — Figure 3.16: Side-by-side boxplots based on employment status

This visualization shows the distribution of weekly earnings across different employment statuses. We see one boxplot for those employed full time and one for those employed part time. However, for those whose employment status is recorded as NA, there is no box. This is not an error on our end: for those whose employment status is not reported, weekly earnings are not reported either (i.e., they are also NA). It is possible that NA represents students who are not employed.
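
One way to check whether missing employment status goes together with missing earnings is to cross-tabulate the two missingness indicators. The sketch below uses a small made-up data frame (toy_students, not the ATUS data); with the real data, the same two is.na() calls would be applied to atus_college.

```r
# Toy data invented for illustration; NOT values from atus_college.
toy_students <- data.frame(
  employment      = c("Full Time", "Part Time", NA, NA, "Full Time"),
  weekly_earnings = c(900, 350, NA, NA, 1200)
)

# Cross-tabulate: is employment missing? is weekly_earnings missing?
na_check <- table(
  employment_missing = is.na(toy_students$employment),
  earnings_missing   = is.na(toy_students$weekly_earnings)
)
na_check
```

If every count falls on the diagonal of this table (both present or both missing), then no student has earnings reported without an employment status, which is why the NA category has nothing to draw.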

Each box represents the weekly earnings distribution for one employment category, making the categories easy to compare.

Which of the following can be concluded for certain based on the side-by-side boxplot of employment and weekly earnings? Select ALL that are true.

Among those who had weekly earnings, the highest weekly earning belonged to someone who was employed full time.
The first quartile of the full time employed students was greater than the third quartile of the part time employed students.
The full time employed category has fewer potential outliers than the part time employed category.
In the full time employed category, there were more people who had weekly earnings between the third quartile and the median than between the first quartile and the median.

Check footnote for answers²

We can use a side-by-side boxplot when we want to

examine the relationship between one categorical variable and one numeric variable
compare variability between different groups
compare distributions across different groups
compare the spread and central tendency of multiple groups side by side
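
The comparisons listed above can also be made numerically. The following sketch computes the median (central tendency) and interquartile range (spread) for each group with base R's tapply(). The data frame toy_students is made up for illustration and is not the ATUS data.

```r
# Toy data invented for illustration; NOT values from atus_college.
toy_students <- data.frame(
  employment = c("Full Time", "Full Time", "Full Time",
                 "Part Time", "Part Time", "Part Time"),
  weekly_earnings = c(800, 1000, 1500, 250, 400, 500)
)

# Median weekly earnings within each employment group.
group_median <- tapply(
  toy_students$weekly_earnings, toy_students$employment, median
)

# Interquartile range (third quartile minus first quartile) within each group.
group_iqr <- tapply(
  toy_students$weekly_earnings, toy_students$employment, IQR
)

group_median
group_iqr
```

The median corresponds to the line inside each box and the interquartile range to the height of the box, so these numbers are a compact companion to what the side-by-side boxplot shows.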

3.4 Visualizing More Than Two Variables

Real-world data analysis often requires us to examine relationships among multiple variables simultaneously. While two-variable visualizations provide valuable insights, adding a third (or even fourth) variable can reveal more complex patterns and interactions that might otherwise remain hidden. Let’s consider one applied case that builds naturally on the scatterplot we’ve already learned.

Building onto the previous scatterplot code we have written, we will identify each student (i.e., point) by a color that differentiates their employment status.

ggplot(
  data = atus_college,
  aes(
    x = time_alone,
    y = weekly_earnings,
    color = employment     # 1
  )
) +
  geom_point()

1: We do this by mapping the employment variable onto the color aesthetic.

A scatterplot showing the relationship between time alone and weekly earnings, with points colored by employment category. The x-axis is labeled time_alone and ranges from 0 to 1000. The y-axis is labeled weekly_earnings and ranges from 0 to 3000. Each point represents an observation, with red for Full Time, teal for Part Time, and gray for NA. Most points are concentrated in the lower left quadrant, where time alone is below 500 and weekly earnings are below 1500. Full Time points (red) dominate across the range, including higher earnings up to 3000. Part Time points (teal) cluster mostly at lower earnings below 1000. NA points (gray) are sparse. A legend on the right identifies colors for employment categories: red for Full Time, teal for Part Time, and gray for NA. — Figure 3.17: Grouping points by color based on employment status

Mapping the employment variable to color reveals that weekly earnings seem to follow a pattern based on whether a student works full time or part time. We could alternatively differentiate the points by mapping employment status to the shape of each point. Different point shapes (circles, triangles, squares, etc.) then distinguish between employment categories. This approach can be particularly useful when printing in black and white or when color distinctions might not be clear for all viewers.

ggplot(
  data = atus_college,
  aes(
    x = time_alone,
    y = weekly_earnings,
    shape = employment     # 1
  )
) +
  geom_point()

1: The employment variable is mapped to the shape aesthetic.

A scatterplot showing the relationship between time alone and weekly earnings, with point shapes indicating employment category. The x-axis is labeled time_alone and ranges from 0 to 1000. The y-axis is labeled weekly_earnings and ranges from 0 to 3000. Each point represents an observation: circles for Full Time, triangles for Part Time, and no shape for NA because there are no points with income and NA status for employment. Most points cluster in the lower left quadrant, where time alone is below 500 and weekly earnings are below 1500. Full Time points (circles) are spread across the range, including higher earnings up to 3000. Part Time points (triangles) are concentrated at lower earnings below 1000. There are no NA points. A legend on the right identifies shapes for employment categories: circles for Full Time, triangles for Part Time, and no shape for NA. — Figure 3.18: Grouping points by shape based on employment status

For maximum clarity and accessibility, we can combine both approaches by mapping employment to both color and shape. This redundant encoding helps ensure that the employment categories are distinguishable regardless of whether someone can perceive color differences, and it makes the patterns even more apparent to all viewers.

ggplot(
  data = atus_college,
  aes(
    x = time_alone,
    y = weekly_earnings,
    color = employment,
    shape = employment
  )
) +
  geom_point()

Warning: Removed 102 rows containing missing values or values outside the scale range
(`geom_point()`).
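
The warning above is ggplot2 reporting that some rows could not be drawn because a mapped variable was missing. As a hedged sketch, the code below counts such rows ahead of time with complete.cases() and then uses the na.rm = TRUE argument of geom_point() to drop them without a warning. The data frame toy_students is made up for illustration (not the ATUS data), and the count from complete.cases() only matches the plot when every column in the data frame is mapped, as it is here.

```r
library(ggplot2)

# Toy data invented for illustration; NOT values from atus_college.
toy_students <- data.frame(
  time_alone      = c(30, 120, 250, 400, 610),
  weekly_earnings = c(400, NA, 1600, NA, 250),
  employment      = c("Part Time", NA, "Full Time", NA, "Part Time")
)

# Number of rows with at least one missing value.
n_incomplete <- sum(!complete.cases(toy_students))
n_incomplete

# na.rm = TRUE removes rows with missing values silently
# instead of issuing a warning.
p <- ggplot(
  data = toy_students,
  aes(
    x = time_alone,
    y = weekly_earnings,
    color = employment,
    shape = employment
  )
) +
  geom_point(na.rm = TRUE)
p
```

Silencing the warning is a deliberate choice: it is usually worth knowing how many observations are missing before deciding to hide the message.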

Visualizing time_alone, weekly_earnings, and employment allows us to explore questions like: Do students with different employment statuses show different patterns in the relationship between time spent alone and weekly earnings? Are there employment groups that cluster in particular regions of the plot? These insights would be impossible to detect when examining the variables separately.

One last aesthetic we want to consider in this chapter is size. While the color and shape aesthetics allowed us to add a categorical variable (e.g., employment), size lets us visualize an additional numeric variable. Mapping a fourth variable (e.g., work_time) to the size of the points creates an even more complex multivariable visualization.

ggplot(
  data = atus_college,
  aes(
    x = time_alone,
    y = weekly_earnings,
    color = employment,
    size = work_time       # 1
  )
) +
  geom_point()

1: The work_time variable, which is numeric, is mapped to the size aesthetic.

A scatterplot showing the relationship between time alone and weekly earnings, with points varying in color and size to represent employment category and work time. The x-axis is labeled time_alone and ranges from 0 to 1000. The y-axis is labeled weekly_earnings and ranges from 0 to 3000. Each point represents an observation: red circles for Full Time, teal circles for Part Time, and gray circles for NA. Point size indicates work time, with larger circles representing more work time (up to 800) and smaller circles representing less. Most points cluster in the lower left quadrant, where time alone is below 500 and weekly earnings are below 1500. Full Time points (red) dominate across the range, including higher earnings up to 3000, while Part Time points (teal) are concentrated at lower earnings below 1000. A legend on the right shows color for employment and size scale for work time. — Figure 3.20: Adding a numerical variable to differentiate points based on size

The size aesthetic maps work_time to the size of each point (by default, ggplot2 scales the point's area rather than its diameter): students who spend more time working are represented by larger points, while those who spend less time working appear as smaller points. This creates a bubble chart effect where we can possibly identify patterns such as whether students who work more (larger bubbles) tend to have higher earnings, or whether work time varies systematically across employment statuses.
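
How values are translated into point sizes is controlled by a scale. As a minimal sketch, the code below adds scale_size_area() from ggplot2, which makes the area of each point proportional to the value so that a value of 0 is drawn with zero area. The data frame toy_students is made up for illustration (not the ATUS data), and max_size = 6 is an arbitrary choice for the example.

```r
library(ggplot2)

# Toy data invented for illustration; NOT values from atus_college.
toy_students <- data.frame(
  time_alone      = c(30, 120, 250, 400, 610),
  weekly_earnings = c(400, 900, 1600, 300, 250),
  work_time       = c(0, 120, 480, 60, 240)
)

p <- ggplot(
  data = toy_students,
  aes(
    x = time_alone,
    y = weekly_earnings,
    size = work_time
  )
) +
  geom_point() +
  # Point area is proportional to work_time; max_size sets the size
  # of the largest point (an arbitrary value for this example).
  scale_size_area(max_size = 6)
p
```

Scaling by area rather than diameter is generally preferred because viewers tend to judge the amount of ink, and doubling a diameter quadruples the area.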

It is worth noting that many of the features we have shown in this chapter were presented from a technical point of view: what you can and cannot do in a basic visualization. Just because you can get R to visualize four variables at the same time does not necessarily mean that you should. Whether you should is the topic of Chapter 4. Let’s finish this chapter by summarizing the key points of ggplot code. In Figure 3.21, you can see a summary of the different aesthetic arguments and geom objects that we covered in this chapter.

A reference chart summarizing ggplot2 aesthetics and geoms. The chart is divided into two columns. The left column lists aesthetic mappings under 'aes = (...........)' with corresponding arguments: x and y, represented by horizontal and vertical arrows, respectively. Color groups points using color, such as black, pink, and green. Fill breaks bars based on groups, assigning a color to each group, such as a barplot with three rectangles in pink and green. Shape groups points using shapes, such as circle, triangle, and square. Size maps a numerical variable to the size of the point, such as pink circles of increasing size. The right column lists geoms under 'geom_.........()' with corresponding icons: bar with three vertical bars with space between them, histogram with a series of small bars touching each other forming a distribution, boxplot with a horizontal box and whiskers, point with a scatterplot of small green dots. The chart visually connects aesthetic options on the left to common geom types on the right. — Figure 3.21: A summary of aesthetics and geom objects covered in this chapter

Since the histogram of weekly_earnings displays most of the data on the left with fewer observations on the right (in other words, with the tail of the distribution on the right), this is a right-skewed distribution. In right-skewed distributions, mean > median. The very large observations on the right pull the mean to a higher value (imagine including Bill Gates’ weekly earnings in this data!); however, the median is often minimally, if at all, impacted by extremely large (or small) values since it represents the middle observation in the data.
Only a and b can be concluded for certain. At first glance, c might seem correct, as we see 3 points and 5 points for the full time and part time categories, respectively. However, we don’t know whether any points lie on top of each other. For instance, in Figure 3.10 there were 10 outliers, but not all of these points were visible until we looked at Figure 3.11. So we cannot know choice c for certain. Choice d is incorrect. About 25% of the people are between the third quartile and the median, and about 25% of the people are between the first quartile and the median.