1 R syntax

R can be used like a calculator to perform mathematical operations, for example:

2 + 2

## [1] 4

R is an object-oriented programming language, which means that we create objects that contain information. In R you can often achieve the same result using different approaches. For example, to create an object we type a name for the object and assign it a value using the operator = or <-. We can then perform operations with objects of the same type, for example:

x = 2 # create a new object with the = operator
y <- 2 # create a new object with the <- operator
x + y # perform an operation with the objects

## [1] 4

1.1 Vectors in R

You can store more than one value in a vector. To create a vector of numbers we use c(). For example, we will store the sequence of numbers from 5 to 10 using two different approaches and then ask R if the objects are the same.
Tip: in RStudio, pressing the keys “Alt” + “-” automatically inserts the operator <-. Choosing which assignment operator to use is a matter of preference. I personally find code easier to read with the operator <-, but a lot of people use =.

x <- c(5, 6, 7, 8, 9, 10) # create a sequence from 5 to 10
y = 5:10 # create the same sequence but with a different approach
x == y # ask R, element by element, if the objects hold the same values

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

Notice that in the previous example we compared two objects using ==, this is the way we tell R that we want to COMPARE and not to assign (remember that to assign you use only one = symbol).
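Because == compares the vectors element by element, it returns one TRUE or FALSE per position. As a small sketch (all() and identical() are additions here, not part of the original lab), the comparison can be collapsed into a single answer:

```r
x <- c(5, 6, 7, 8, 9, 10)  # stored as "numeric" (double)
y <- 5:10                  # stored as "integer"

same_values <- all(x == y)      # TRUE only if every element-wise comparison is TRUE
same_object <- identical(x, y)  # stricter: also compares how the values are stored

same_values
same_object
class(y)  # the : operator creates integers
```

Here all(x == y) is TRUE, while identical(x, y) is FALSE because x holds doubles and y holds integers, even though the values match.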

When we have a vector, we can ask R for specific values inside the object by using the operator [ ] and specifying the positions we want.

# Here we ask the 3rd value from our sequence
x[3]

## [1] 7

# Now we multiply the 3rd value of the x sequence times the 5th value of the y sequence
x[3] * y[5]

## [1] 63
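The [ ] operator also accepts more than one position, negative positions, and conditions. A short sketch of these variations, using the same x as above:

```r
x <- c(5, 6, 7, 8, 9, 10)

first_third <- x[c(1, 3)]  # the 1st and 3rd values
middle      <- x[2:4]      # the 2nd to the 4th values
no_first    <- x[-1]       # everything except the 1st value
large       <- x[x > 7]    # only the values greater than 7

first_third
middle
no_first
large
```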

1.2 Functions

R has a lot of base functions, but we can also define new ones. When using RStudio, the Tab key helps us autocomplete, which is very useful when we don’t remember the exact name of a function. The best part of programming with R is that it has a very active community. Since it’s open source, anyone can create functions and bundle them in a package (or library). We can download these packages to access new functions.
Functions in R take arguments, which we can see in the function documentation or by pressing the Tab key when the cursor is inside the function’s parentheses.

# To get the sum of a vector of numbers inside an object we use sum()
sum(x)

## [1] 45

We can nest functions inside other functions. For example, to get \(\sqrt{\sum_{i=1}^{n} x_i}\), the square root of the sum of the numbers in x, we can use:

sqrt(sum(x))

## [1] 6.708204

Making functions in R is not as complicated as it sounds and can be very useful when we need to do repetitive work. To define a function we need to state the arguments that we want for the function and what we do with those arguments. For example, the following function has only one argument, which is a name (string), and just pastes some text before and after it:

F1 <- function(name){
  paste("Hola", name, "! welcome to the R world (: !") # paste the name with some text
}

# Testing the function (Put your name)
F1(name = "Pablo")

## [1] "Hola Pablo ! welcome to the R world (: !"
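Arguments can also be given a default value, which is used whenever the caller leaves that argument out. A minimal sketch, where F3 and its greeting argument are hypothetical names used only for this illustration:

```r
# F3 is a made-up function for illustration; greeting has a default value
F3 <- function(name, greeting = "Hola") {
  paste(greeting, name)  # paste() joins the pieces with a space by default
}

default_greeting <- F3("Pablo")                      # uses the default greeting
english_greeting <- F3("Pablo", greeting = "Hello")  # overrides the default

default_greeting
english_greeting
```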

Besides storing numbers in the objects in R, we can store text, matrices, tables, spatial objects, images, and other types of objects. Next we will import our own data and do some manipulation in R.

Exercise: Create a function that performs the \(\sqrt{\sum_{i=1}^{n} x_i}\) operation you did previously with the code sqrt(sum(x)).

1.3 Types of variables

There are different types of variables in R. So far we have used the numeric and string types. If you want to know what kind of variable a given object is, you can use the function class(). Let’s try it.

class(x)

## [1] "numeric"

Exercise: What kind of variable is the output from the first function you defined F1()?

The most commonly used types of variables include:

numeric: Continuous numeric variables that can take decimal values. For example: kg of product imported, probability of an event happening.
integer: Whole numbers WITHOUT decimal values. For example: number of animals, number of shipments, etc.
character: Alphanumeric variables. For example: name of a region, name of a disease, farm ID.
factor: Alphanumeric variables with specific categories or levels. For example: type of product imported, type of farm, etc.
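As a quick sketch of the four types listed above, we can create one object of each kind and check it with class(). The object names and values below are made up for illustration:

```r
weight_kg <- 12.5    # numeric: can hold decimal values
n_animals <- 3L      # integer: the L suffix asks R to store a whole number
region    <- "Iowa"  # character: text
farm_type <- factor(c("sow farm", "nursery", "sow farm"))  # factor: text with levels

class(weight_kg)
class(n_animals)
class(region)
class(farm_type)
levels(farm_type)  # the distinct categories of the factor
```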

1.4 Introducing the pipes

The library dplyr has several functions that can help to clean, create new variables, and modify our data in other ways.

# if we don't have the library installed, we will need to install it using:
# install.packages("dplyr")
# we load the library:
library(dplyr)

dplyr provides an operator called the pipe (%>%), which comes from the magrittr package and connects several functions to an object. It is a more organized alternative to nesting several functions in a single “line of code”. For example, if we want to execute a function F1() followed by another function F2() on the object x:

F2(F1(x)) is equivalent to x %>% F1() %>% F2()

As you can see, to read the code F2(F1(x)) we have to go from the inside to the outside to follow the order of execution of the functions, but when we read x %>% F1() %>% F2() we read from left to right, which is the same way we would normally read any text in Western languages.

Suggestion: in RStudio we can use the keys Ctrl + Shift + M to insert the %>% operator.

# We previously used this code to calculate the square root of the sum of x
sqrt(sum(x))

## [1] 6.708204

Using the pipes we can do the same in a more organized way, by writing the order of execution from left to right.

x %>% # First we call the data
  sum() %>% # Sum all the values
  sqrt() # Compute the square root

## [1] 6.708204

You will notice that the outputs are exactly the same. Feel free to use whichever syntax you prefer, but in this course we will mostly use the pipes and write the code from left to right.
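The pipe passes the object on its left as the first argument of the function on its right, and any further arguments can still be written inside the parentheses. A minimal sketch (the extra round() step is added here only for illustration):

```r
library(dplyr)

x <- c(5, 6, 7, 8, 9, 10)

# Piped version: read from left to right
piped <- x %>%
  sum() %>%
  sqrt() %>%
  round(digits = 2)  # x arrives as the first argument, digits is given as usual

# Nested version: read from the inside out
nested <- round(sqrt(sum(x)), digits = 2)

piped
nested
```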

2 Data manipulation

R can import data in different formats. The most common are Excel files (.csv, .xls and .xlsx), text files (.txt) and spatial data (.shp), which we will talk about in more detail later.
To read .xls, .xlsx and .shp files we need to install some libraries. To install a new library you need to be connected to the internet and use the function install.packages(). Once it has been installed, you can load the library using the function library().

Download the excel file from this link. It’s not necessary to have a Box account.

Suggestion: Sometimes when we want to use only one function from a library, we can just write the name of the library followed by the operator :: and the name of the function, for example: package::function(). This way we won’t have to load the whole library.

# If we don't have the library installed, we use:
# install.packages("readxl")

# Import the excel file
PRRS <- readxl::read_xlsx("Data/PRRS.xlsx")

The most popular format for tables in R is the data.frame, which is what we get when we import data from a .csv or .xlsx file. We can examine what kind of object it is using the function class():

# Ask the class of a given object
class(PRRS)

## [1] "tbl_df"     "tbl"        "data.frame"

Objects can have more than one class. In the previous example, you can see that the PRRS object has three different classes.

In the following section we will cover some basics of data manipulation, which include creating, modifying and summarising variables in our data.

2.1 Reducing our data

Sometimes we want to select specific columns and rows of our data to reduce its dimensionality. For this we can use the functions:

select() to select specific columns
slice() to select specific rows based on position
filter() to select specific rows based on a condition
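The three functions can be compared side by side on a small table. This is only a sketch: toy is a made-up data set, not the PRRS data used in the rest of the lab.

```r
library(dplyr)

# toy is a hypothetical data set created only for this illustration
toy <- data.frame(
  id        = 1:5,
  farm_type = c("sow farm", "nursery", "sow farm", "boar stud", "GDU"),
  Age       = c(18, 60, 36, 48, 12)
)

two_columns <- toy %>% select(id, Age)    # keep only the columns id and Age
first_rows  <- toy %>% slice(1:3)         # keep rows 1 to 3, by position
older_rows  <- toy %>% filter(Age > 30)   # keep rows where Age is greater than 30

two_columns
first_rows
older_rows
```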

2.1.1 Selecting specific columns

PRRS %>%  # the name of our data
  select(Result, farm_type) # we want to select only the columns Result and farm_type

## # A tibble: 1,353 × 2
##    Result farm_type
##    <chr>  <chr>    
##  1 No     sow farm 
##  2 No     sow farm 
##  3 No     sow farm 
##  4 Yes    sow farm 
##  5 Yes    sow farm 
##  6 Yes    sow farm 
##  7 Yes    sow farm 
##  8 Yes    sow farm 
##  9 Yes    sow farm 
## 10 Yes    sow farm 
## # ℹ 1,343 more rows

We can also specify which columns we DON’T want to show in our data:

PRRS %>% 
  select(-Age, -id) # with a '-' before the name we will exclude the column from the data

## # A tibble: 1,353 × 6
##    Result Sex   OtherSpecies name                    farm_type County       
##    <chr>  <chr>        <dbl> <chr>                   <chr>     <chr>        
##  1 No     H                0 Armstrong Research Farm sow farm  Pottawattamie
##  2 No     H                0 Armstrong Research Farm sow farm  Pottawattamie
##  3 No     H                0 Armstrong Research Farm sow farm  Pottawattamie
##  4 Yes    H                0 Armstrong Research Farm sow farm  Pottawattamie
##  5 Yes    H                0 Armstrong Research Farm sow farm  Pottawattamie
##  6 Yes    H                0 Armstrong Research Farm sow farm  Pottawattamie
##  7 Yes    M                0 Armstrong Research Farm sow farm  Pottawattamie
##  8 Yes    M                0 Armstrong Research Farm sow farm  Pottawattamie
##  9 Yes    M                0 Armstrong Research Farm sow farm  Pottawattamie
## 10 Yes    H                0 Armstrong Research Farm sow farm  Pottawattamie
## # ℹ 1,343 more rows

2.1.2 Filtering specific rows

Filtering rows works based on boolean logic: we specify a condition and R shows only the rows that satisfy it. For example, let’s filter only the observations from boar studs:

PRRS %>% 
  filter(farm_type == 'boar stud') # we will use the equals to operator for this

## # A tibble: 19 × 8
##    Result Sex     Age OtherSpecies    id name                   farm_type County
##    <chr>  <chr> <dbl>        <dbl> <dbl> <chr>                  <chr>     <chr> 
##  1 No     H        48            0    32 Farm Sweet Farm at Ro… boar stud Shelby
##  2 No     H        60            0    32 Farm Sweet Farm at Ro… boar stud Shelby
##  3 Yes    H        60            0    32 Farm Sweet Farm at Ro… boar stud Shelby
##  4 Yes    H        15            0    32 Farm Sweet Farm at Ro… boar stud Shelby
##  5 No     H        68            0    32 Farm Sweet Farm at Ro… boar stud Shelby
##  6 No     M         6            0    32 Farm Sweet Farm at Ro… boar stud Shelby
##  7 No     H        75            0    32 Farm Sweet Farm at Ro… boar stud Shelby
##  8 No     H         6            0    32 Farm Sweet Farm at Ro… boar stud Shelby
##  9 No     H       110            0    32 Farm Sweet Farm at Ro… boar stud Shelby
## 10 No     H        36            0    32 Farm Sweet Farm at Ro… boar stud Shelby
## 11 No     M        12            0    32 Farm Sweet Farm at Ro… boar stud Shelby
## 12 No     H        62            0    32 Farm Sweet Farm at Ro… boar stud Shelby
## 13 No     H        48            0    32 Farm Sweet Farm at Ro… boar stud Shelby
## 14 No     H        62            0    32 Farm Sweet Farm at Ro… boar stud Shelby
## 15 No     H        95            0    32 Farm Sweet Farm at Ro… boar stud Shelby
## 16 Yes    M        24            0    32 Farm Sweet Farm at Ro… boar stud Shelby
## 17 Yes    H        38            0    32 Farm Sweet Farm at Ro… boar stud Shelby
## 18 No     H        38            0    32 Farm Sweet Farm at Ro… boar stud Shelby
## 19 No     M        24            0    32 Farm Sweet Farm at Ro… boar stud Shelby
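Conditions can be combined with & (and), | (or), != (not equal) and %in% (matches any value in a set). A small sketch on a made-up table (toy is hypothetical and only borrows column names from the PRRS data):

```r
library(dplyr)

# toy is a hypothetical data set created only for this illustration
toy <- data.frame(
  Result    = c("No", "Yes", "Yes", "No", "Yes"),
  farm_type = c("sow farm", "boar stud", "nursery", "boar stud", "boar stud"),
  Age       = c(18, 60, 36, 48, 12)
)

# Both conditions must be TRUE
positive_boars <- toy %>% filter(farm_type == "boar stud" & Result == "Yes")

# At least one condition must be TRUE
nursery_or_old <- toy %>% filter(farm_type == "nursery" | Age > 50)

# farm_type must be one of the listed values
sow_or_nursery <- toy %>% filter(farm_type %in% c("sow farm", "nursery"))

# Everything except boar studs
not_boars <- toy %>% filter(farm_type != "boar stud")
```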

2.2 Creating or modifying variables

To create a new variable we can use the function mutate(). For example, let’s create a new variable that will tell us if the farm type is a sow farm or not. For this we use the variable farm_type, which already contains information for different farm types.

# Creating a new variable
PRRS %>% # name of the data set
  mutate( # mutate is the function we use to create new variables
    SowFarm = ifelse(farm_type == 'sow farm', 1, 0) # we will create a binary variable where 1 = sow farm, 0 = Any other farm type
  )

## # A tibble: 1,353 × 9
##    Result Sex     Age OtherSpecies    id name           farm_type County SowFarm
##    <chr>  <chr> <dbl>        <dbl> <dbl> <chr>          <chr>     <chr>    <dbl>
##  1 No     H        18            0    23 Armstrong Res… sow farm  Potta…       1
##  2 No     H        60            0    23 Armstrong Res… sow farm  Potta…       1
##  3 No     H        60            0    23 Armstrong Res… sow farm  Potta…       1
##  4 Yes    H        36            0    23 Armstrong Res… sow farm  Potta…       1
##  5 Yes    H        50            0    23 Armstrong Res… sow farm  Potta…       1
##  6 Yes    H        16            0    23 Armstrong Res… sow farm  Potta…       1
##  7 Yes    M        15            0    23 Armstrong Res… sow farm  Potta…       1
##  8 Yes    M        22            0    23 Armstrong Res… sow farm  Potta…       1
##  9 Yes    M        30            0    23 Armstrong Res… sow farm  Potta…       1
## 10 Yes    H        14            0    23 Armstrong Res… sow farm  Potta…       1
## # ℹ 1,343 more rows

Notice that in the previous code chunk we did not assign the result to an object, which means that R just prints it and doesn’t save it. To save it, remember that we use the assignment operator <- or =.
To modify an existing variable we can use the same function mutate() and specify what we want to replace the existing variable with. For example, in the next chunk of code, we will modify the variable Sex. The original variable has H for Female and M for Male, so let’s change it to something that makes more sense:

# Now we will assign the new object to itself overwriting it.
PRRS <- PRRS %>%  # this is the data set we will use
  mutate( # we use the mutate function to create new variables
    Sex = recode( # The function recode, can be used similar to replace in excel
      Sex, # we will overwrite the variable Sex using the same name
      H = 'Female',
      M = 'Male'
    )
  )

Be careful when overwriting objects in R, because there is no undo for this. It is important that your code is ordered and reproducible, so that if you make a mistake you can easily get back to where you were before it.
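One way to lower that risk, shown here only as a sketch, is to save the modified table under a new name so the original object stays untouched. The toy data and the name toy_clean are made up for illustration, and ifelse() is used in place of recode() to keep the example short:

```r
library(dplyr)

# toy is a hypothetical data set created only for this illustration
toy <- data.frame(
  Sex = c("H", "M", "H"),
  Age = c(18, 60, 36)
)

# Save the result under a new name instead of overwriting toy
toy_clean <- toy %>%
  mutate(
    Sex = ifelse(Sex == "H", "Female", "Male")  # replaces the values of Sex
  )

toy        # the original still has H and M
toy_clean  # the new object has Female and Male
```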

2.3 Grouping and summarizing tables

Often we want to calculate summary statistics from our data. We can group by a specific variable (or even multiple variables) to examine if there are differences between groups.

The most basic way of doing this is with the function count() which will only tell us the number of rows for each group:

# We can count how many observations there are for each value of a variable
PRRS %>% 
  count(Result)

## # A tibble: 2 × 2
##   Result     n
##   <chr>  <int>
## 1 No       986
## 2 Yes      367

The previous table shows us that there were 986 observations with a negative test result and 367 with a positive test result. We can add multiple variables to count(). Now we will count by Result and Sex:

# We can count based on multiple variables
PRRS %>% 
  count(Result, Sex)

## # A tibble: 4 × 3
##   Result Sex        n
##   <chr>  <chr>  <int>
## 1 No     Female   737
## 2 No     Male     249
## 3 Yes    Female   229
## 4 Yes    Male     138

We can also calculate other statistics by group. For example, let’s calculate the mean and standard deviation of the age by Result and Sex:

PRRS %>%
  group_by(Result, Sex) %>% # The groups used for the data
  summarise( # the function summarise calculates statistics by the defined groups
    meanAge = mean(Age), # Calculate the mean age
    sdAge = sd(Age) # Calculate the standard deviation
  )

## # A tibble: 4 × 4
## # Groups:   Result [2]
##   Result Sex    meanAge sdAge
##   <chr>  <chr>    <dbl> <dbl>
## 1 No     Female    39.7  24.8
## 2 No     Male      22.4  16.7
## 3 Yes    Female    23.6  19.8
## 4 Yes    Male      15.1  10.3
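The line "# Groups: Result [2]" in the output means the result is still grouped by Result. As a sketch on made-up data (toy is hypothetical), summarise() can also report the group size with n() and return an ungrouped table with the argument .groups = "drop":

```r
library(dplyr)

# toy is a hypothetical data set created only for this illustration
toy <- data.frame(
  Result = c("No", "No", "Yes", "Yes", "Yes"),
  Sex    = c("Female", "Male", "Female", "Female", "Male"),
  Age    = c(20, 30, 10, 14, 40)
)

toy_summary <- toy %>%
  group_by(Result, Sex) %>%
  summarise(
    n       = n(),        # number of rows in each group
    meanAge = mean(Age),  # mean age in each group
    .groups = "drop"      # return a table that is no longer grouped
  )

toy_summary
```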


2.4 Joining data sets

Sometimes we have different data sets that have variables in common and we want to integrate them into a single data set for further analysis. In this example we have the data sets node_attrib.csv and network.csv.

# First we import the data -----------
# Importing the farm dataset 
nodes <- read.csv("Data/node_attrib.csv")
# Importing the movement dataset
mov <- read.csv("Data/network.csv")

The data mov has information on the place of origin (id_orig) and destination (id_dest) of animal shipments. First we will calculate the total number of pigs moved out of and into each farm. To do this we will use the function count().

# Get the number of outgoing pigs for each farm
Out <- mov %>% # First we call the data
  count(id_orig, wt = pigs.moved) %>% # then we add up the pigs moved from each origin
  rename(id = id_orig, outgoing = n) # Rename the variables

You will notice that we added the argument wt = pigs.moved. pigs.moved is a variable that tells us the number of pigs shipped, and the argument wt in the function count() allows us to add a weight to the counts. We also used the function rename(), which does what it sounds like: it renames variables in the data.
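To see what the weight changes, here is a sketch on a made-up movement table (toy_mov is hypothetical and only borrows the column names of mov). Without wt, count() returns the number of rows per group; with wt, it returns the sum of the weight variable per group:

```r
library(dplyr)

# toy_mov is a hypothetical data set created only for this illustration
toy_mov <- data.frame(
  id_orig    = c(1, 1, 2),
  id_dest    = c(2, 3, 3),
  pigs.moved = c(10, 20, 5)
)

shipments <- toy_mov %>% count(id_orig)                   # rows per origin
pigs      <- toy_mov %>% count(id_orig, wt = pigs.moved)  # pigs per origin

shipments
pigs
```

Farm 1 has 2 shipments but 30 pigs moved, so the two tables answer different questions.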

Now we will join the data with the farm information contained in the nodes object. To do this we use the left_join() function:

# Joining the two tables
# First we join with the outgoing
farms <- nodes %>% # This is the main object we will join with
  left_join(Out, by = "id") # we need to specify the variable name we are joining by
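left_join() keeps every row of the table on the left and fills in NA where the table on the right has no match. A minimal sketch with made-up tables (toy_nodes and toy_out are hypothetical):

```r
library(dplyr)

# Hypothetical tables created only for this illustration
toy_nodes <- data.frame(
  id   = c(1, 2, 3),
  name = c("Farm A", "Farm B", "Farm C")
)
toy_out <- data.frame(
  id       = c(1, 2),  # farm 3 has no outgoing shipments
  outgoing = c(30, 5)
)

toy_farms <- toy_nodes %>%
  left_join(toy_out, by = "id")  # all rows of toy_nodes are kept

toy_farms  # farm 3 is kept, with NA in the outgoing column
```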

Exercise: Now create an object that is the same as the Out object we just created, but that counts the incoming animals instead of the outgoing animals. For this you can use the column id_dest. Then join the new object with the farms data. Name the variable that holds the number of pigs moved incoming.

# Get the number of incoming pigs for each farm
In <- mov %>% # First we call the data
  count(id_dest, wt = pigs.moved) %>% # then we add up the pigs moved to each destination
  rename(id = id_dest, incoming = n) # Rename the variables

## Then we join with the incoming
farms <- farms %>%
  left_join(In, by = "id")

The first rows of your data should look like these:

head(farms)

##   id                            name      lat      long farm_type outgoing
## 1  1           Iowa Select Farms Inc 42.50489 -93.26323  sow farm     3528
## 2  2 Stanley Martins Fleckvieh Farms 43.08261 -91.56682  sow farm       NA
## 3  3            Centrum Valley Farms 42.66331 -93.63630   nursery     1087
## 4  4     Hilltop Farms fresh produce 41.71651 -93.90491  sow farm     1606
## 5  5                   Hog Slat Inc. 42.25929 -91.15566       GDU     3440
## 6  6       Safari Iowa Hunting Farms 41.56854 -92.02317       GDU     1073
##   incoming
## 1     1466
## 2     3382
## 3       NA
## 4     3684
## 5     1467
## 6     4491

Notice that some rows have NA values in the outgoing and incoming columns. NAs can interfere with some functions such as sum(). In the next example we will summarise the data and get the farms with the highest number of incoming pigs:

farms %>% # This is our joined data
  group_by(id) %>% # We will group it by id
  summarise( # we will perform some summary statistics
    incoming = sum(incoming, na.rm = TRUE) # na.rm = TRUE tells sum() to ignore the NA values
  ) %>%
  arrange(desc(incoming)) %>% # now we will order by incoming
  head(5) # we use the head() function to get the first 5 observations

## # A tibble: 5 × 2
##      id incoming
##   <int>    <int>
## 1    17    79948
## 2     8    19184
## 3    14    12334
## 4     7    11034
## 5    15     4648
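The effect of na.rm can be seen on a plain vector. In this sketch the values are the first four incoming values printed by head(farms) above:

```r
incoming <- c(1466, 3382, NA, 3684)

with_na    <- sum(incoming)                # a single NA makes the whole sum NA
without_na <- sum(incoming, na.rm = TRUE)  # NA values are dropped before adding
missing    <- is.na(incoming)              # TRUE where a value is missing

with_na
without_na
missing
```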

3 Exporting the data

Once you have processed the data, you will often want to export it so that next time you don’t have to run all the code again. You can export to multiple formats, but the most common is a comma-delimited file (CSV), which can be read in Excel.

write.csv(farms, 'Data/farms.csv', row.names = FALSE)
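To check an export, we can read the file back in. The sketch below uses a made-up table (toy_farms) and a temporary file created with tempfile(), so it does not touch the Data folder:

```r
# toy_farms is a hypothetical data set created only for this illustration
toy_farms <- data.frame(
  id       = 1:3,
  incoming = c(1466, NA, 3684)
)

out_file <- tempfile(fileext = ".csv")              # temporary file path
write.csv(toy_farms, out_file, row.names = FALSE)  # export without row names
farms_back <- read.csv(out_file)                   # import the file again

farms_back  # same columns and rows; the NA is preserved
```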

This lab has been developed with contributions from: Jose Pablo Gomez-Vazquez.
Feel free to use these training materials for your own research and teaching. When using the materials, we would appreciate proper credit. If you are interested in a training session, please contact: jpgo@ucdavis.edu

R basics

Pablo Gomez