Week 7 Data Visualization

This week, we focus on data visualization. A more in-depth and highly recommendable resource for data visualization in R is Wickham (2016). A more general introduction to data visualization - which is still highly recommendable is Healy (2018).

The aim of this week is to showcase how to visualize data using R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods for data visualizations.

Preparation and session set up

For this week’s tutorial, we need to install certain packages from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead and ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries so you do not need to worry if it takes some time).

# install packages
install.packages("lattice")
install.packages("dplyr")
install.packages("stringr")
install.packages("ggplot2")
install.packages("likert")
install.packages("scales")
install.packages("vcd")
install.packages("tm")
install.packages("ggridges")
install.packages("tidyr")
install.packages("knitr")
install.packages("DT")
install.packages("wordcloud")
install.packages("flextable")
install.packages("hexbin")
install.packages("ggstatsplot")
install.packages("PMCMRplus")
install.packages("rstantools")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")

Now that we have installed the packages, we activate them as shown below.

# set options
options(stringsAsFactors = F)          # no automatic data transformation
options("scipen" = 100, "digits" = 12) # suppress math annotation
# activate packages
library(lattice)
library(dplyr)
library(stringr)
library(ggplot2)
library(likert)
library(vcd)
library(tm)
library(ggridges)
library(tidyr)
library(knitr)
library(DT)
library(wordcloud)
library(flextable)
library(hexbin)
library(rstantools)
library(PMCMRplus)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go.

7.1 Getting started

Before turning to the graphs, we load the data that we will display. The data set is called lmmdata but we will change the name to pdat for this tutorial. The data set is based on the Penn Parsed Corpora of Historical English (PPC) and it contains the date when a text was written (Date), the genre of the text (Genre), the name of the text (Text), the relative frequency of prepositions in the text (Prepositions), and the region in which the text was written (Region). We also add two more variables to the data called GenreRedux and DateRedux. GenreRedux collapses the existing genres into five main categories (Conversational, Religious, Legal, Fiction, and NonFiction) while DateRedux collapses the dates when the texts were composed into five main periods (1150-1499, 1500-1599, 1600-1699, 1700-1799, and 1800-1913). We also factorize non-numeric variables.

# load data
pdat  <- base::readRDS(url("https://slcladal.github.io/data/pvd.rda", "rb"))

Let’s briefly inspect the data.

Table 7.1: First 15 rows of the pdat data.
Date	Genre	Text	Prepositions	Region	GenreRedux	DateRedux
1,736	Science	albin	166.01	North	NonFiction	1700-1799
1,711	Education	anon	139.86	North	NonFiction	1700-1799
1,808	PrivateLetter	austen	130.78	North	Conversational	1800-1913
1,878	Education	bain	151.29	North	NonFiction	1800-1913
1,743	Education	barclay	145.72	North	NonFiction	1700-1799
1,908	Education	benson	120.77	North	NonFiction	1800-1913
1,906	Diary	benson	119.17	North	Conversational	1800-1913
1,897	Philosophy	boethja	132.96	North	NonFiction	1800-1913
1,785	Philosophy	boethri	130.49	North	NonFiction	1700-1799
1,776	Diary	boswell	135.94	North	Conversational	1700-1799
1,905	Travel	bradley	154.20	North	NonFiction	1800-1913
1,711	Education	brightland	149.14	North	NonFiction	1700-1799
1,762	Sermon	burton	159.71	North	Religious	1700-1799
1,726	Sermon	butler	157.49	North	Religious	1700-1799
1,835	PrivateLetter	carlyle	124.16	North	Conversational	1800-1913

In addition, we will create a vector with colors that we will be using throughout this tutorial. This is not really necessary but it shares us from having to specify colors every time when we do not want to use the default colors that R provides. In this case, we will specify five colors but this palette could be extended. You can also check out the colors that are available in R here and the palettes or sets of colors here.

clrs5 <- c("indianred4", "gray30", "darkblue", "orange", "gray80")
clrs3 <- c("indianred4", "gray30", "darkblue")
clrs2 <- c("orange", "gray80")

We will now turn to creating the graphs.

7.2 Dot and Scatter Plots

The first, and simplest graph, is a so-called scatter or dot plot. Scatter plots are used when the graph is set up to display the relationship between two numeric variables. We will start off with creating a scatter plot in base, then in lattice and finally in the ggplot environment.

7.2.1 Scatter Plots in base

The most fundamental function to create plots in the base environment is to use the general “plot” function. Here, we use that function to create a simple scatter plot.

# create simple scatter plot
plot(Prepositions ~ Date,                 # plot Prepositions by Date
     type = "p",                          # plot type p (points) 
     data = pdat,                     # data from data set pdat  
     ylab = "Prepositions (Frequency)",   # add y-axis label 
     xlab = "Date (year of composition)", # add x-axis label 
     main = "plot type 'p' (points)"      # add title 
     )                                    # end drawing plot

Let us go over the command. The first part of the call is plot which is the function for plotting data in base R. In the round brackets are the arguments in which we specify what the plot should look like. The Prepositions ~ Date part tells R which variables should be displayed and the type = "p" part tells R which type of plot we want (p stands for points, l for lines, b for both lines and points). The part data = pdat tells R which data set to take the data from, and ylab = "Prepositions (Frequency)" and xlab = "Date (year of composition)" informs R about the axes’ labels. The part main = "plot type 'p' (points)" informs R about what we want as the main title of the plot.

7.2.2 Scatter Plots in ggplot2

We now turn to data visualization using ggplot. As the ggplot2 package is already loaded, we create a very basic scatterplot in ggplot2 using the geom_point function to show the advantages of creating visualizations in this environment.

# create simple scatter plot
# use data set "pdat"
ggplot(pdat,  
       # define axes
       aes(x= Date,        
           y= Prepositions)) + 
  # define plot type
  geom_point()

Let’s go over the code above. The function call for plotting in “ggplot2” is simply “ggplot”. This function takes the data set as its first argument and then requires aesthetics. The aesthetics are defined within the “ggplot” function as the arguments of “aes”. The “aes” function takes the axes as the arguments (in the current case). Then, we need to define the type of plot that we want. As we want a scatter plot with points, we add the “geom_point()” function without any arguments (as we do not want to specify the size, colour, and shape of the points just yet).

The advantage of “ggplot2” is that is really easy to modify the plot by adding new layers and to change the basic outlook by modifying the theme which is what we will do in the code below.

ggplot(pdat,    
       # define axes
       aes(x=Date,             
           y= Prepositions, 
           # define to color by Species
           color = GenreRedux)) + 
  # define plot type
  geom_point() +   
  # define theme  as black and white (bw)
  theme_bw()

The white background is created by specifying the theme as a black and white theme (theme_bw()) while the color of the dots is changed by specifying that the color should be applied by Species (color = GenreRedux). Then, the colors to be used are defined in the function scale_color_manual.

We can now specify the symbols in the scatter plot.

# create scatter plot colored by genre
ggplot(pdat, aes(Date, Prepositions, color = GenreRedux, shape = GenreRedux)) +
  geom_point() +
  guides(shape=guide_legend(override.aes=list(fill=NA))) +
  scale_shape_manual(name = "Genre", 
                     breaks = names(table(pdat$GenreRedux)), 
                     values = 1:5) +
  scale_color_manual(name = "Genre", 
                     breaks = names(table(pdat$GenreRedux)), 
                     values = clrs5) +
  theme_bw() +
  theme(legend.position="top")

7.3 Line Graphs

Line graphs are used when we have numeric values that are linked (in one way or another) because they come from the same speaker or genre as in our case).

pdat %>%
  dplyr::group_by(DateRedux, GenreRedux) %>%
  dplyr::summarise(Frequency = mean(Prepositions)) %>%
  ggplot(aes(x=DateRedux, y= Frequency, group= GenreRedux, color = GenreRedux)) +
  # add geom layer with lines
  geom_line()

7.4 Histograms

Histograms summarize numeric variables by showing their distribution across bins.

Using ggplot, we specify the variable we want to summarize in the aesthetics and use the geom_histogram function to generate a histogram.

ggplot(pdat, aes(Prepositions)) +
  geom_histogram()

We can simply add information about a second variable by specifying this variable as the basis for the coloring of the bars (which we do by specify the fill argument).

ggplot(pdat, aes(Prepositions, fill = Region)) +
  geom_histogram()

7.5 Bar plots

# create bar plot data
bdat <- pdat %>%
  dplyr::mutate(DateRedux = factor(DateRedux)) %>%
  group_by(DateRedux) %>%
  dplyr::summarise(Frequency = n()) %>%
  dplyr::mutate(Percent = round(Frequency/sum(Frequency)*100, 1))

The creation of barplots in ggplot works just like other types of visualizations in this framework. We first define the data and the aesthetics and then use the geom_bar to create a barplot.

# bar plot
ggplot(bdat, aes(DateRedux, Percent, fill = DateRedux)) +
  geom_bar(stat="identity") +          # determine type of plot
  theme_bw() +                         # use black & white theme
  # add and define text
  geom_text(aes(y = Percent-5, label = Percent), color = "white", size=3) + 
  # add colors
  scale_fill_manual(values = clrs5) +
  # suppress legend
  theme(legend.position="none")

Compared with the pie chart, it is much easier to grasp the relative size and order of the percentage values which shows that pie charts are unfit to show relationships between elements in a graph and, as a general rule of thumb, should be avoided.

Bar plot can be grouped to add another layer of information which is particularly useful when dealing with frequency counts across multiple categorical variables. To create grouped bar plots, we plot Region while including DateRedux as the fill argument. Also, we use the command position=position_dodge().

# bar plot
ggplot(pdat, aes(Region, fill = DateRedux)) + 
  geom_bar(position = position_dodge(), stat = "count") +  
  theme_bw() +
  scale_fill_manual(values = clrs5)

If we leave out the position=position_dodge() argument, we get a stacked bar plot as shown below.

# bar plot
ggplot(pdat, aes(DateRedux, fill = GenreRedux)) + 
  geom_bar(stat="count") +  
  theme_bw() +
  scale_fill_manual(values = clrs5)

One issue to consider when using stacked bar plots is the number of variable levels: when dealing with many variable levels, stacked bar plots tend to become rather confusing. This can be solved by either collapsing infrequent variable levels or choose a colour palette that reflects some other inherent piece of information such as formality (e.g. blue) versus informality (e.g. red).

Stacked bar plots can also be normalized so that changes in percentages become visible. This is done by exchanging position=position_dodge() with position="fill".

# bar plot
ggplot(pdat, aes(DateRedux, fill = GenreRedux)) + 
  geom_bar(stat="count", position="fill") +  
  theme_bw() +
  scale_fill_manual(values = clrs5) +
  labs(y = "Probability")

7.6 Boxplots

So far, we have plotted values but we have not plotted the underlying distributions. For instance, we have plotted mean values but not the variance within the distribution. One handy way to combine plotting general trends and their underlying distributions are boxplots.

Boxplots, or Box-and-Whisker Plots, are exploratory graphics first created by John W. Tukey and they show the relationships between categorical and numeric variables. They are very useful because they not only provide measures of central tendency (the median which is the line in the middle of the box) but they also offer information about the distribution of the data. To elaborate, fifty percent of data points fall within the box while seventy-five percent of data points fall within the whiskers (the lines which look like extended error bars): the box thus encompasses the interquartile range between the first and third quartile. The whiskers show the minimum and maximum values in the data and only outliers (data points that lie 1.5 times the interquartile range or more above the third quartile or 1.5 times the interquartile range or more below the first quartile. If the whiskers differ in length, then this means that the data is asymmetrically distributed.

# create boxplot
ggplot(pdat, aes(DateRedux, Prepositions, color = GenreRedux)) +                 
  geom_boxplot(fill=clrs5, 
               color="black")

Healy, Kieran. 2018. Data Visualization: A Practical Introduction. Princeton University Press.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. springer.

References

Healy, Kieran. 2018. Data Visualization: A Practical Introduction. Princeton University Press.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. springer.