We will continue using the data set from the previous section to
introduce some basics of data visualization. Some libraries include
data sets that you can use once the library is installed. In this case
we will use data from the STNet library. If you don't have it installed
yet, install it with:
# Install STNet from github
devtools::install_github('spablotemporal/STNet')
# Load the libraries
library(ggplot2) # for graphics
library(dplyr) # For data manipulation
library(STNet)
# loading the data from the package
data('captures') # we load the data
head(captures) # let's have a look at the data
##    municipality          location             Loc     date year captures
## 1 Temascaltepec San Pedro Tenayac    Cueva el Uno 11/06/14 2014        6
## 2      Tlatlaya  Nuevo Copaltepec La alcantarilla 12/05/05 2005        3
## 3      Tlatlaya  Nuevo Copaltepec La alcantarilla 12/05/07 2007       30
## 4      Tlatlaya  Nuevo Copaltepec La alcantarilla 12/03/09 2009        0
## 5      Tlatlaya  Nuevo Copaltepec La alcantarilla 10/08/10 2010        4
## 6      Tlatlaya  Nuevo Copaltepec La alcantarilla 16/05/11 2011        4
##   treated      lat       lon trap_type
## 1       6 18.03546 -100.2095         1
## 2       2 18.40417 -100.2688         1
## 3      29 18.40417 -100.2688         4
## 4       0 18.40417 -100.2688         3
## 5       3 18.40417 -100.2688         1
## 6       3 18.40417 -100.2688         2
R already includes a set of base functions for creating a variety of
figures, but the code can get quite complex and difficult to read as we
produce more detailed figures. ggplot2 is a library that provides a
consistent set of functions for building figures layer by layer.
The function ggplot() has to be called at the beginning of the plot
definition; it sets a blank canvas for our plot. If we call the function
with no arguments we will just see the empty canvas, for example:
ggplot()
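Then we can add layers to our canvas based on the data we want to
visualize. Similarly to the pipes, we connect the different layers of
our plot with the operator +.
The basic components that we need to define for a plot are the following:
- Data: the data set we will use for the plot.
- Geometry: the type of plot, defined with a geom_ function.
- Aesthetics: the variables mapped to the axes and other visual properties, defined inside aes().
An example: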
ggplot(data = captures) + # This is the data we will use
  geom_histogram( # This is the geometry
    aes(x = treated) # The aesthetic includes only one variable, mapped to the x axis
  )
Other components of the plots can be defined to further customize our
figures, and we will cover those in more detail in future sections.
As you noticed in the previous example, we can print the figures
directly from the R console, but a way I like to organize our figures is
to put them all inside a single object in R. This object can be a list,
which is just a container for other objects.
# To create an empty list we can use the function list()
figures <- list()
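As a minimal sketch of how a named list behaves (the object `toy` and its contents are placeholders invented for illustration, not part of the lab data), elements can be added and retrieved by name:

```r
# A toy list standing in for the `figures` container
toy <- list()

toy$first <- "a plot would go here" # add an element by name with $
toy[["second"]] <- 1:3              # equivalent form using [[ ]]

names(toy)  # names of the stored elements
length(toy) # number of elements stored
toy$first   # retrieve an element by its name
```

The same pattern is what lets us write figures$histogram later on: the name after the $ is simply the label under which the plot was stored.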
The simplest way to visualize the distribution of a continuous variable is a histogram. Histograms are a special kind of bar plot in which the variable is grouped into bins and the height of each bar shows the count for that bin. Now that we have our container list for the plots, we can save the plot there under any name we want.
Notice that we combine the pipes with the ggplot syntax. You can either pass the data as an argument to ggplot() or put it before the function and connect it with a pipe.
figures$histogram <- captures %>% # This is the data we use.
  ggplot() + # we set the empty canvas
  geom_histogram(aes(x = treated)) # add a layer to visualize a histogram
# we can see our plot by calling the name in our container list
figures$histogram
Box plots are great for showing the distribution of a continuous variable. We can use them to show a single variable or to compare several. It is important to be very descriptive when making plots: the idea is that a figure can explain itself. We will slowly introduce functions to do this and customize our figures.
# Only one variable
figures$box <- captures %>%
  ggplot() +
  geom_boxplot(aes(y = treated))
figures$box
Bar plots are great for representing frequencies of categories. In the following example we will first sum the number of treated by year, and then show the result in a bar plot.
figures$bars <- captures %>%
  count(year, wt = treated) %>%
  ggplot() +
  geom_bar(aes(
    x = year, # X axis
    y = n # Y axis
  ), stat = 'identity') # type of barplot
figures$bars
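To make the role of the wt argument in count() concrete, here is a small sketch on made-up numbers (the data frame `toy` is hypothetical and only mimics two columns of the captures data):

```r
library(dplyr)

# Hypothetical toy data, not the real `captures` table
toy <- data.frame(
  year    = c(2005, 2005, 2007, 2007, 2007),
  treated = c(2, 3, 29, 0, 1)
)

# Without wt, n is the number of rows for each year
rows_by_year <- toy %>% count(year)
rows_by_year

# With wt = treated, n is the sum of `treated` within each year
treated_by_year <- toy %>% count(year, wt = treated)
treated_by_year
```

Because the counts are already computed before plotting, the bar layer needs stat = 'identity' so that ggplot2 uses n as the bar height instead of counting rows again. ggplot2 also provides geom_col(), which behaves like geom_bar() with stat = 'identity'.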
Scatter plots are one of the most popular kinds of plot; they are useful for representing the relationship between two continuous variables.
figures$scatter <- captures %>% # first we start with the name of our data.frame
  ggplot() + # then we set the canvas
  geom_point(aes(x = treated, y = captures)) # and we add a layer for points
figures$scatter
To create a time series we need to reformat the data a little so R can
do what we want. We will introduce a new kind of variable: date. A date
variable is pretty much what it sounds like: a variable that stores the
year, month and day. There are other ways to represent dates in R, but
this is the most common and straightforward.
tCaptures <- captures %>%
  count(date, wt = treated) %>%
  mutate(
    date = as.Date(date, "%d/%m/%y"), # First we parse the date (the raw text is day/month/year, e.g. 16/05/11)
    week = lubridate::floor_date(date, 'week') # Then we create a variable with the first day of the week each date falls in
  )
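The format string passed to as.Date() has to match the order of the pieces in the text. The dates printed by head(captures) appear to be day/month/year (for example 16/05/11, where 16 cannot be a month), so this sketch assumes that order; the vector `x` is copied from that output for illustration:

```r
# Two date strings as they appear in the printed data (day/month/two-digit year)
x <- c("11/06/14", "16/05/11")

# %d = day, %m = month, %y = two-digit year
parsed <- as.Date(x, format = "%d/%m/%y")
parsed

# Swapping day and month silently misreads the first value
# and fails on the second one, because there is no month 16
swapped <- as.Date(x, format = "%m/%d/%y")
swapped
is.na(swapped)
```

Checking for NA values after parsing is a quick way to catch a format string that does not match the data.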
Now that we have our variables in the correct format, we can use them like any other variable.
figures$timeseries <- tCaptures %>%
  ggplot() +
  geom_line(aes(x = week, y = n))
figures$timeseries
Now that we have all the figures in a list, we can arrange them in a
single panel. For this we use the function ggarrange() from the ggpubr
library.
library(ggpubr) # load the library
ggarrange(plotlist = figures)
We usually avoid spaces in column names, but raw column names are not always the clearest way to communicate our analysis. We can set specific labels to make our plots more readable and self-explanatory. Let's improve the bar plot to make it clearer:
figures$bars <- captures %>%
  count(year, wt = treated) %>%
  ggplot() +
  geom_bar(aes(x = year, y = n), stat = 'identity') + # type of barplot
  labs( # We will use the function labs to generate our labels
    title = 'Number of treated by year',
    x = 'Year',
    y = 'Number of treated'
  )
figures$bars
ggplot2 includes the function theme() to define most aspects of the
figure, such as the background color, the grid, the axes and the legend,
among many others. There are also several predefined themes (all start
with theme_ followed by the name of the theme) that you can use if you
don't want to deal with all the arguments of the function theme(). For
example:
# all the predefined themes start with theme_
figures$timeseries <- tCaptures %>%
  ggplot() +
  geom_line(aes(x = week, y = n)) +
  labs(title = 'Treated by week', x = 'Week', y = 'Number of treated') +
  theme_minimal() # We will use the theme minimal
figures$timeseries
There are other aesthetics we can define, such as color, point shape and size, among many others. Let's try changing the point shape in one of the plots we made previously:
figures$scatter <- captures %>% # the data we are using
  ggplot() + # we set the canvas
  geom_point(aes(
    x = captures, # X axis
    y = treated, # Y axis
    shape = factor(trap_type) # point shape
  )) +
  labs(title = 'Captures and treatments by trap type', x = 'Captures', y = 'Treated', shape = 'Trap type') +
  theme_classic() # now let's try the theme 'classic'
figures$scatter
In the next example we will use the trap type to color our boxplot:
captures %>%
  ggplot() + # set the canvas
  geom_boxplot(aes(y = treated, fill = factor(trap_type))) # we add a boxplot layer
We can also use the variable to color other parts of the plot, such as the border:
captures %>%
  ggplot() + # set the canvas
  geom_boxplot(aes(y = treated, col = factor(trap_type))) # we add a boxplot layer
So far we have added variables inside our aes() function, but we can
also add arguments outside the aes() function when we want them to be
applied to all observations. For example, we can make the outline of the
boxplot the same for all groups while keeping a different fill color
per group:
captures %>%
  ggplot() + # set the canvas
  geom_boxplot(
    aes(y = treated, fill = factor(trap_type)), # These are the normal aesthetics we define
    col = 'red' # aesthetics defined here, outside aes(), are applied to all the observations
  ) # we add a boxplot layer
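The difference between mapping an aesthetic inside aes() and setting it outside can be sketched with the built-in mtcars data, which is used here only as a stand-in and is not part of this lab:

```r
library(ggplot2)

# Mapped: colour is inside aes(), so it varies with a variable
# and ggplot2 adds a legend for it
mapped <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, colour = factor(cyl)))

# Set: colour is outside aes(), so every point is drawn in red
# and no legend is created for it
fixed <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg), colour = 'red')

mapped
fixed
```

A common slip is writing aes(colour = 'red'). In that case ggplot2 treats 'red' as a data value with a single category and picks a color from its default palette instead of drawing in red.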
To define specific colors for our figure, we can use the family of
functions scale_*_manual(), where the * stands for the aesthetic we want
to control. If we want to set the colors used for the fill, we would
use:
figures$box <- captures %>%
  ggplot() +
  geom_boxplot(aes(y = treated, fill = factor(trap_type))) +
  scale_fill_manual(values = c('red', 'green', 'blue', 'yellow'))
figures$box
R manages colors in three different ways: by name (e.g. 'red'), by
RGB value using the function rgb() (e.g. rgb(1, 0, 0)), or by
hexadecimal code (e.g. "#FF0000"). You can get a full list of the named
colors in R with the function colors(), but you will only see the names,
not the colors themselves. Luckily, someone made a tool that helps us
pick exactly the colors we want: the Colour Picker addin. Addins are
tools available in RStudio to facilitate tasks. Let's try the Colour
Picker (it should already be in your Addins toolbar).
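As a quick check that the three notations can describe the same color, this sketch uses the base R function col2rgb() to convert each one to its red, green and blue components (the object names are arbitrary):

```r
# The same pure red written in three ways
by_name <- col2rgb("red")         # a named color
by_rgb  <- col2rgb(rgb(1, 0, 0))  # rgb() takes values from 0 to 1 by default
by_hex  <- col2rgb("#FF0000")     # hexadecimal code

by_name

# rgb() itself returns the hexadecimal code as text
hex_code <- rgb(1, 0, 0)
hex_code

# All three describe the same color
all(by_name == by_rgb)
all(by_name == by_hex)
```

Any of these forms can be passed to arguments such as fill or col, or to the values argument of scale_fill_manual().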
Exercise: Pick 4 colors you like and use them to create a boxplot of the distribution of number of treated by the trap type.
Facets are a way of stratifying the data based on variables in the
data set; you can think about them in a similar way to how we have been
using groups. To create a stratified plot we can use the function
facet_grid(), which takes one variable for the rows and another for the
columns:
figures$histogram <- captures %>% # The data we will use
  ggplot() + # set the canvas
  geom_histogram(aes(treated), fill = 'red4') + # We will create a histogram of the number of treated
  facet_grid(
    rows = vars(trap_type), # We will use the trap type variable for rows
    cols = vars(year) # We use the year variable for columns
  ) +
  theme_pubclean() # let's try another theme (this one comes from ggpubr)
figures$histogram
Now that we have a good mix of different figures, let's export them.
We can use the function ggsave() to export a high-resolution figure.
ggsave(figures$histogram, filename = 'myHistogram.png')
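The size and resolution of the exported file can be controlled with the width, height, units and dpi arguments of ggsave(). The sketch below uses the built-in mtcars data and a temporary file name, both chosen only for illustration:

```r
library(ggplot2)

# A stand-in plot; in the lab this would be one of the elements of `figures`
p <- ggplot(mtcars) +
  geom_histogram(aes(x = mpg), bins = 10)

# Write to a temporary folder so the example does not clutter the project
out <- file.path(tempdir(), "toy_histogram.png")

ggsave(
  filename = out,
  plot = p,      # the plot to save (defaults to the last plot displayed)
  width = 6,     # width of the figure
  height = 4,    # height of the figure
  units = "in",  # units for width and height
  dpi = 300      # resolution in dots per inch
)

file.exists(out)
```

The file type is taken from the extension of the file name, so changing it to .pdf or .tiff exports in that format instead.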
You can create more detailed arrangements of figures, as I will show in the following example:
ggarrange( # Function to arrange the figures
  figures$box, figures$scatter, figures$bars, figures$timeseries, # These are the figures I will arrange
  heights = c(5, 3), # We can define the heights for each of the rows
  labels = c("A", "B", "C", "D") # add a label for each figure
) %>%
  annotate_figure( # Function to annotate the figure
    top = text_grob( # Top title
      label = 'Distribution of the data', # Label for title
      face = 'bold', # Bold letters
      size = 18 # Size of the title
    )
  )
Static figures are the most common application of graphics in R, but R
can also make interactive figures that can be used in dashboards and
other platforms (e.g. shiny or quarto). Several libraries allow you to
create interactive figures; one of the most popular is plotly. The best
part of plotly is that if you learn how to use ggplot, you can convert
your figures to interactive plotly figures pretty much seamlessly. Let's
try that.
We use the function ggplotly() from the plotly library to do that:
library(plotly) # load the plotly library
# Use the ggplotly function on one of the figures we previously created:
ggplotly(figures$histogram)
This lab has been developed with contributions from: Jose Pablo
Gomez-Vazquez.
Feel free to use these training materials for your own research and
teaching. When using the materials we would appreciate using the proper
credits. If you would be interested in a training session, please
contact: jpgo@ucdavis.edu