R and Data visualization

Pablo Gomez

Introduction

What is this workshop?


  • This workshop is an introduction
  • DO NOT EXPECT TO MASTER R AFTER THE WORKSHOP
  • If you already have experience with R, it is always nice to see how other people do things, so feel free to share!

Tentative Schedule


Workshop format

Homepage: https://cadms-ucd.github.io/PROCINORTE_TT/

Posit cloud




https://posit.cloud


It’s Lab time!

Introduction to RStudio

Some programming concepts

Comments

COMMENT AS MUCH AS POSSIBLE!

# This is a comment in R; R ignores it, it is only for the reader
This is not a comment and will cause an error

What is the difference between lines 1 and 2?

YES! The # character turns everything after it on that line of code into a comment.

10 + 10 # Everything after will be a comment
7 + 4

Operators

Operators are characters with a specific function in R, for example:

3 + 3 # this is the addition operator
[1] 6
3 - 2 # this is the subtraction operator
[1] 1
4 * 4 # this is the multiplication operator
[1] 16
[1] 16

Later we will see other kinds of operators, but… DON’T STRESS about learning everything.
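As a small preview, here is a sketch of a few more arithmetic operators, using only base R (the numbers are arbitrary):

```r
# Other arithmetic operators available in base R
10 / 4   # division: 2.5
2 ^ 3    # exponentiation: 8
10 %% 3  # modulo, the remainder of the division: 1
10 %/% 3 # integer division: 3
```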

Objects

Objects in R are containers for information. We can give objects any name we want, as long as it starts with a letter and contains no spaces.

myNumber <- 4
myResult <- 4 * 5

Storing multiple elements

Using the c() function

x <- c(1, 3, 5) # using the c() function
x
[1] 1 3 5

Using the list() function

y <- list(1, 3, 5) # using the list() function
y
[[1]]
[1] 1

[[2]]
[1] 3

[[3]]
[1] 5
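One practical difference between the two, sketched with made-up values: a vector created with c() holds a single type, while a list keeps the type of each element.

```r
# Mixing types inside c() converts everything to one common type
v <- c(1, "a", TRUE)
class(v) # "character"

# A list keeps each element as it is
l <- list(1, "a", TRUE)
sapply(l, class) # "numeric" "character" "logical"
```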

Boolean logic

1 == 1 # is it equal?
[1] TRUE
1 != 1 # is it NOT equal?
[1] FALSE
1 %in% c(1, 2, 3) # is the number contained in the sequence?
[1] TRUE

Notice that we are using operators to make the comparisons
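Comparisons also work element by element on a vector and can be combined with & (and), | (or) and ! (not). A minimal sketch, where ages is a hypothetical vector:

```r
ages <- c(18, 60, 36) # hypothetical ages

ages > 30             # FALSE TRUE TRUE
ages > 30 & ages < 50 # both conditions must hold: FALSE FALSE TRUE
ages < 20 | ages > 50 # at least one condition holds: TRUE TRUE FALSE
!(ages > 30)          # negation: TRUE FALSE FALSE
```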

Functions

Functions are a special kind of object. Most functions take arguments, which need to be written inside parentheses.

# create a sequence of numbers
seq(
  from = 0, # Starting number
  to = 80, # Ending number
  by = 20 # number increment of the sequence
) 
[1]  0 20 40 60 80

Notice that the arguments are named in the function call: the arguments of seq() used here are from, to and by.

We can create our own functions, which we will talk more about in the labs.
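Arguments can be matched by name or by position. A short sketch with the same seq() call (the object names a, b and d are only for illustration):

```r
a <- seq(from = 0, to = 80, by = 20) # arguments matched by name
b <- seq(0, 80, 20)                  # same arguments matched by position
identical(a, b) # TRUE

# Named arguments can be written in any order
d <- seq(by = 20, to = 80, from = 0)
identical(a, d) # TRUE
```

Naming the arguments is longer to type, but it makes the code easier to read later.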

Variables in R


  • numeric: continuous numeric variables WITH decimal values. For example: kg of product imported, probability of an event happening.
  • integer: whole numbers WITHOUT decimal values. For example: number of animals, number of shipments, etc.
  • character: alphanumeric variables. For example: name of a region, name of a disease, farm ID.
  • factor: alphanumeric variables with specific categories or levels. For example: type of product imported, type of farm, etc.
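We can ask R for the type of any value with class(). A minimal sketch with made-up values:

```r
class(2.5)        # "numeric"
class(3L)         # "integer" (the L suffix marks a whole number)
class("sow farm") # "character"

f <- factor(c("sow farm", "nursery", "sow farm"))
class(f)  # "factor"
levels(f) # the categories, sorted alphabetically: "nursery" "sow farm"
```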

Test time!

x <- seq(from = 5, to = 23, length.out = 10) # create a sequence of numbers
y <- seq(from = 0.1, to = 0.78, length.out = 10) # Create another sequence
mean(x*y) # Get the mean of the multiplication
[1] 7.406667

Objects:
- x
- y

Operators:
- *
- <-
- =

Functions:
- seq()
- mean()

Arguments:
- from
- to
- length.out

R syntax

R Syntax

R is like a calculator: we can perform mathematical operations, for example:


x = 2 # create a new object with the = operator
y <- 2 # create a new object with the <- operator
x + y # perform an operation with the objects
[1] 4

Vectors in R

You can store more than one value using vectors. To create a vector of numbers we use c().


x <- c(5, 6, 7, 8, 9, 10) # create a sequence from 5 to 10
y = 5:10 # create the same sequence but with a different approach
x == y # ask R if the objects have the same information
[1] TRUE TRUE TRUE TRUE TRUE TRUE


In RStudio, pressing “Alt” + “-” automatically inserts the <- operator.
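The == comparison returns one result per element. To get a single answer we can wrap it in all(). A small sketch that also shows the two vectors hold the same values but are stored as different types:

```r
x <- c(5, 6, 7, 8, 9, 10)
y <- 5:10

all(x == y)     # TRUE: every pair of elements is equal
class(x)        # "numeric"
class(y)        # "integer"
identical(x, y) # FALSE: same values, different storage type
```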

Vector operations

When we have a vector, we can ask R for specific values inside it by using the operator [ ] and specifying which positions we want.


x
[1]  5  6  7  8  9 10


# Here we ask for the 3rd value of our sequence
x[3]
[1] 7


x
[1]  5  6  7  8  9 10
y
[1]  5  6  7  8  9 10


# Now we multiply the 3rd value of x by the 5th value of y
x[3] * y[5]
[1] 63
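The [ ] operator also accepts several positions, negative positions and conditions. A minimal sketch using the same x vector:

```r
x <- c(5, 6, 7, 8, 9, 10)

x[c(1, 3)] # 1st and 3rd values: 5 7
x[2:4]     # 2nd to 4th values: 6 7 8
x[-1]      # everything except the 1st value: 6 7 8 9 10
x[x > 7]   # only the values greater than 7: 8 9 10
```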

Functions


# To get the sum of a vector of numbers inside an object we use sum()
sum(x)
[1] 45

Functions

We can put functions inside other functions. For example, to get \(\sqrt{\sum_1^n x}\), the square root of the sum of the numbers in x, we can use:


sqrt(sum(x))
[1] 6.708204

Making functions

The following function has only one argument, a name (string), and just pastes some text before and after it:

F1 <- function(name){
  paste("Hola", name, "! welcome to the R world (: !") # paste the name with some text
}

# Testing the function (Put your name)
F1(name = "Pablo")
[1] "Hola Pablo ! welcome to the R world (: !"
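Functions can have more than one argument, and arguments can have default values. A sketch of this idea, where greet() and its arguments are hypothetical names used only for illustration:

```r
# greeting has a default value, so it can be left out when calling the function
greet <- function(name, greeting = "Hola") {
  paste0(greeting, " ", name, "!") # paste0() joins text without adding spaces
}

greet("Pablo")                     # "Hola Pablo!"
greet("Pablo", greeting = "Hello") # "Hello Pablo!"
```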

Remember this?

\[\sqrt{\sum_1^n x}\]

sqrt(sum(x))

Introducing the pipes %>%

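Pipes (%>%) connect several functions to an object, passing the result of each step on to the next function. The %>% operator comes from the magrittr package and is also available after loading dplyr.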

For example, if we want to execute a function F1() followed by another function F2() for the object x:

F2(F1(x))

is equivalent to:

x %>% F1() %>% F2()

For example

\[\sqrt{\sum_1^n x}\]

Instead of this:

sqrt(sum(x))

We can write it like this:

x %>% sum() %>% sqrt()
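A runnable sketch of the same comparison, assuming the magrittr package is installed. An extra rounding step is added here only to show that more functions can be chained:

```r
library(magrittr) # provides %>%; loading dplyr also makes it available

x <- c(5, 6, 7, 8, 9, 10)

nested <- round(sqrt(sum(x)), 2)             # read from the inside out
piped <- x %>% sum() %>% sqrt() %>% round(2) # read from left to right

nested                   # 6.71
identical(nested, piped) # TRUE
```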

For example

Instead of this:

# Get the number of outgoing shipments per origin
Out <- rename(summarise(group_by(mov, id_orig), Outgoing = n()), id = id_orig)

We can write this:

# Get the number of outgoing shipments per origin
Out <- mov %>% 
  group_by(id_orig) %>%
  summarise(Outgoing = n()) %>%
  rename(id = id_orig)

The same pipeline, with a comment on each step:

# Get the number of outgoing shipments per origin
Out <- mov %>% # This is the movement data set
  group_by(id_orig) %>% # Group by origin
  summarise(Outgoing = n()) %>% # Count the number of observations
  rename(id = id_orig) # Rename the variable


This way the code is much easier to break down and read!
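To try this pipeline without the workshop data, here is a sketch with a tiny made-up movement table. The values in mov and the id_dest column are assumptions for illustration, and the dplyr package is assumed to be installed:

```r
library(dplyr)

# Hypothetical movement data: one row per shipment
mov <- data.frame(
  id_orig = c(1, 1, 3, 3, 3), # farm that sends the shipment
  id_dest = c(2, 3, 1, 2, 4)  # farm that receives the shipment
)

Out <- mov %>%
  group_by(id_orig) %>%         # Group by origin
  summarise(Outgoing = n()) %>% # Count the shipments in each group
  rename(id = id_orig)          # Rename the variable

Out # farm 1 has 2 outgoing shipments, farm 3 has 3
```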

It’s Lab time!

R syntax (Section 1)

Let’s use some Data!

Importing data

Download the Excel file from this link. It’s not necessary to have a Box account.

# Import the excel file
PRRS <- readxl::read_xlsx("Data/PRRS.xlsx")
PRRS
Result  Sex  Age  OtherSpecies  id  name                     farm_type  County
No      H    18   0             23  Armstrong Research Farm  sow farm   Pottawattamie
No      H    60   0             23  Armstrong Research Farm  sow farm   Pottawattamie
No      H    60   0             23  Armstrong Research Farm  sow farm   Pottawattamie
Yes     H    36   0             23  Armstrong Research Farm  sow farm   Pottawattamie

Reducing the data


Sometimes we want to keep only specific columns and rows of our data to reduce its dimensionality. For this we can use the following functions:

  • select() to select specific columns
  • slice() to select specific rows based on position
  • filter() to select specific rows based on a condition
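A quick sketch of the three functions side by side. The toy data frame is made up (it only borrows some column names from PRRS), and the dplyr package is assumed to be installed:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  Result = c("No", "Yes", "No", "Yes"),
  Age = c(18, 60, 36, 15),
  farm_type = c("sow farm", "sow farm", "boar stud", "nursery")
)

cols <- toy %>% select(Result, Age) # keep only two columns
first2 <- toy %>% slice(1:2)        # keep rows 1 and 2, by position
older <- toy %>% filter(Age > 20)   # keep rows where Age is greater than 20

older
```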

Selecting specific columns

PRRS %>%  # the name of our data
  select(Result, farm_type) # we want to select only the columns Result and farm_type
Result  farm_type
No      sow farm
No      sow farm
No      sow farm
Yes     sow farm
Yes     sow farm

Selecting specific columns

We can also specify which columns we DON’T want to show in our data:

PRRS %>% 
  select(-Age, -id) # with a '-' before the name we will exclude the column from the data
Result  Sex  OtherSpecies  name                     farm_type  County
No      H    0             Armstrong Research Farm  sow farm   Pottawattamie
No      H    0             Armstrong Research Farm  sow farm   Pottawattamie
No      H    0             Armstrong Research Farm  sow farm   Pottawattamie
Yes     H    0             Armstrong Research Farm  sow farm   Pottawattamie
Yes     H    0             Armstrong Research Farm  sow farm   Pottawattamie

Filtering specific rows

Filtering only the observations from boar studs:

PRRS %>% 
  filter(farm_type == 'boar stud') # we use the "equal to" operator (==) for this
Result  Sex  Age  OtherSpecies  id  name                                     farm_type  County
No      H    48   0             32  Farm Sweet Farm at Rosmann Family Farms  boar stud  Shelby
No      H    60   0             32  Farm Sweet Farm at Rosmann Family Farms  boar stud  Shelby
Yes     H    60   0             32  Farm Sweet Farm at Rosmann Family Farms  boar stud  Shelby
Yes     H    15   0             32  Farm Sweet Farm at Rosmann Family Farms  boar stud  Shelby
No      H    68   0             32  Farm Sweet Farm at Rosmann Family Farms  boar stud  Shelby

Creating variables

# Creating a new variable
PRRS %>% # name of the data set
  mutate( # mutate is the function we use to create new variables
    SowFarm = ifelse(farm_type == 'sow farm', 1, 0) # we will create a binary variable where 1 = sow farm, 0 = Any other farm type
  ) 
Result  Sex  Age  OtherSpecies  id  name                     farm_type  County         SowFarm
No      H    18   0             23  Armstrong Research Farm  sow farm   Pottawattamie  1
No      H    60   0             23  Armstrong Research Farm  sow farm   Pottawattamie  1
No      H    60   0             23  Armstrong Research Farm  sow farm   Pottawattamie  1
Yes     H    36   0             23  Armstrong Research Farm  sow farm   Pottawattamie  1
Yes     H    50   0             23  Armstrong Research Farm  sow farm   Pottawattamie  1

Grouping the data

We can calculate different statistics by group. For example, let’s calculate the mean and standard deviation of the age by Result and Sex:

PRRS %>%
  group_by(Result, Sex) %>% # The groups used for the data
  summarise( # the function summarise calculates statistics by the defined groups
    meanAge = mean(Age), # Calculate the mean age
    sdAge = sd(Age) # Calculate the standard deviation
  )
Result Sex meanAge sdAge
No H 39.67300 24.82636
No M 22.39357 16.67979
Yes H 23.61135 19.79150
Yes M 15.10870 10.30037
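A small sketch of the same idea on made-up data, adding a count per group with n() and showing how na.rm = TRUE skips missing values. The toy values are assumptions, and the dplyr package is assumed to be installed:

```r
library(dplyr)

# Hypothetical toy data with one missing age
toy <- data.frame(
  Result = c("No", "No", "Yes", "Yes"),
  Age = c(10, 20, 30, NA)
)

res <- toy %>%
  group_by(Result) %>%
  summarise(
    n = n(),                          # number of rows in each group
    meanAge = mean(Age, na.rm = TRUE) # mean age, ignoring missing values
  )

res # No: n = 2, meanAge = 15; Yes: n = 2, meanAge = 30
```

Without na.rm = TRUE, the mean of a group that contains an NA is also NA.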

Joining data sets

Sometimes we have different data sets that have variables in common and we want to integrate them into a single data set for further analysis.

Farms:

nodes
id  name                             lat       long       farm_type
1   Iowa Select Farms Inc            42.50489  -93.26323  sow farm
2   Stanley Martins Fleckvieh Farms  43.08261  -91.56682  sow farm
3   Centrum Valley Farms             42.66331  -93.63630  nursery
4   Hilltop Farms fresh produce      41.71651  -93.90491  sow farm
5   Hog Slat Inc.                    42.25929  -91.15566  GDU

Joining data sets

Movements:

Out
id Outgoing
1 30
3 13
4 15
5 33
6 11

Joining data sets

# Joining the two datasets
nodes <- nodes %>% 
  left_join(Out, by = "id")
id  name                             lat       long       farm_type  Outgoing
1   Iowa Select Farms Inc            42.50489  -93.26323  sow farm   30
2   Stanley Martins Fleckvieh Farms  43.08261  -91.56682  sow farm   NA
3   Centrum Valley Farms             42.66331  -93.63630  nursery    13
4   Hilltop Farms fresh produce      41.71651  -93.90491  sow farm   15
5   Hog Slat Inc.                    42.25929  -91.15566  GDU        33
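Notice the NA for farm 2: left_join() keeps every row of the first data set and fills in NA where the second one has no match. A sketch with made-up tables (nodes_toy and Out_toy are hypothetical names, and dplyr is assumed to be installed) that also replaces the NA with 0, which is only reasonable if a missing count really means "no shipments":

```r
library(dplyr)

# Hypothetical toy tables
nodes_toy <- data.frame(
  id = c(1, 2, 3),
  farm_type = c("sow farm", "sow farm", "nursery")
)
Out_toy <- data.frame(
  id = c(1, 3),
  Outgoing = c(30L, 13L)
)

joined <- nodes_toy %>%
  left_join(Out_toy, by = "id") %>%         # farm 2 has no match, so it gets NA
  mutate(Outgoing = coalesce(Outgoing, 0L)) # replace NA with 0 (an assumption)

joined
```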

Back to the exercise!

R syntax (Sections 2 and 3)

Data visualization

ggplot2


  • We build our figures based on layers
  • Syntax similar to dplyr
  • We can combine data wrangling and visualization into a single code chunk

Lectures

Instead of %>%, in ggplot2 we connect pieces of code with +.

ggplot2


The basic components that we need to define for a plot are the following:

  • data, the data set we will use to generate the figure
  • geometry, or type of graphic we will generate (e.g. histogram, bar, scatter, etc.)
  • aesthetic, variables or arguments that will be used for the figure, for example: location, color, size, etc.

Example

captures # Data
municipality   location           Loc              date      year  captures  treated  lat       lon        trap_type
Temascaltepec  San Pedro Tenayac  Cueva el Uno     11/06/14  2014  6         6        18.03546  -100.2095  1
Tlatlaya       Nuevo Copaltepec   La alcantarilla  12/05/05  2005  3         2        18.40417  -100.2688  1
Tlatlaya       Nuevo Copaltepec   La alcantarilla  12/05/07  2007  30        29       18.40417  -100.2688  4
Tlatlaya       Nuevo Copaltepec   La alcantarilla  12/03/09  2009  0         0        18.40417  -100.2688  3
Tlatlaya       Nuevo Copaltepec   La alcantarilla  10/08/10  2010  4         3        18.40417  -100.2688  1

Example

captures %>% # Data used
  count(year, wt = treated)  # Some data transformation
year n
2005 167
2006 103
2007 249
2008 143
2009 125

Example

captures %>% # Data used
  count(year, wt = treated) %>%   # Some data wrangling
  ggplot() # Add an empty canvas

Example

captures %>% # Data used
  count(year, wt = treated) %>%   # Some data wrangling
  ggplot() + # Add an empty canvas
  geom_bar() # This is the geometry type

Example

captures %>% # Data used
  count(year, wt = treated) %>%   # Some data wrangling
  ggplot() + # Add an empty canvas
  geom_bar( # This is the geometry type
    aes( # Aesthetics or mapping
      x = year, # X axis
      y = n # Y axis
      ), 
    stat = 'identity' # extra arguments
  ) 

Example

captures %>% # Data used
  count(year, wt = treated) %>%   # Some data wrangling
  ggplot() + # Add an empty canvas
  geom_bar(aes(x = year, y = n), stat = 'identity') +
  labs( # Title and axis labels
    title = 'Bar plot',
    x = 'Year', y = 'Frequency'
  )
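A self-contained sketch of the same plot that runs without the captures data. The yearly totals are copied from the count() output shown earlier, the object names yearly and p are only for illustration, and geom_col() is swapped in as a shorter way of writing geom_bar(stat = 'identity'). The dplyr and ggplot2 packages are assumed to be installed:

```r
library(dplyr)   # for the %>% pipe
library(ggplot2)

# Yearly totals of treated animals, copied from the count() output
yearly <- data.frame(
  year = 2005:2009,
  n = c(167, 103, 249, 143, 125)
)

p <- yearly %>%
  ggplot(aes(x = year, y = n)) + # Canvas with the aesthetics shared by all layers
  geom_col() +                   # Bars with height taken from y
  labs(title = 'Bar plot', x = 'Year', y = 'Frequency')

p # Print the object to draw the figure
```

Storing the plot in an object is optional, but it lets us add more layers to it later with +.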

ggplot2

Link to the cheatsheet

Other cheatsheets

It’s Lab time!