R works like a calculator: we can perform mathematical operations, for example:
2 + 2
## [1] 4
R is an object-oriented programming language, which means that we
create objects that contain information. In R you can often achieve the
same result using different approaches. For example, to create an object
we type a name for the object and assign it a value using either the =
or the <- operator. We can then perform operations with objects of the
same type, for example:
x = 2 # create a new object with the = operator
y <- 2 # create a new object with the <- operator
x + y # perform an operation with the objects
## [1] 4
You can store more than one value using vectors. To create a vector
of numbers we use c(). For example, we will store a sequence of numbers
from 5 to 10 using two different approaches and then ask R whether the
objects are the same.
Tip: in RStudio, pressing the keys “Alt” + “-” will automatically insert
the <- operator. Choosing which assignment operator to use is a matter
of preference. I personally find code easier to read with the <-
operator, but many people use =.
x <- c(5, 6, 7, 8, 9, 10) # create a sequence from 5 to 10
y = 5:10 # create the same sequence but with a different approach
x == y # ask R if the objects have the same information
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
Notice that in the previous example we compared two objects using ==.
This is how we tell R that we want to COMPARE and not to assign
(remember that to assign you use a single = symbol). The comparison is
made element by element, which is why R returns one TRUE for each of
the six values.
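As a small sketch of what the element-by-element comparison implies, the code below contrasts == with two base R helpers, all() and identical(). The vectors are recreated here so the example runs on its own. Note that c(5, 6, ...) stores decimal numbers while 5:10 stores integers, which is why identical() reports a difference even though every value matches.

```r
x <- c(5, 6, 7, 8, 9, 10)  # stored as decimal (double) numbers
y <- 5:10                  # stored as integers

all(x == y)      # TRUE: every pair of elements is equal
identical(x, y)  # FALSE: same values, but different storage types
class(y)         # "integer"
```

In practice, all(x == y) is usually what we want when asking whether two numeric vectors hold the same values.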
When we have a vector, we can ask R for specific values inside it by
using the [ ] operator and specifying the positions we want.
# Here we ask the 3rd value from our sequence
x[3]
## [1] 7
# Now we multiply the 3rd value of the x sequence by the 5th value of the y sequence
x[3] * y[5]
## [1] 63
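The [ ] operator also accepts more than one position at a time. The following is a minimal sketch using the same sequence (recreated so the example is self-contained) that shows a few common patterns: several positions, a range, a logical condition, and a negative index to drop a value.

```r
x <- c(5, 6, 7, 8, 9, 10)

x[c(1, 3)]  # 1st and 3rd values: 5 7
x[2:4]      # 2nd to 4th values: 6 7 8
x[x > 7]    # values greater than 7: 8 9 10
x[-1]       # everything except the 1st value: 6 7 8 9 10
```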
R has many built-in (base) functions, and we can also define new ones.
When using RStudio, the Tab key triggers autocompletion, which helps a
lot when we don’t remember the exact name of an available function. One
of the best parts of programming with R is its very active community.
Since it’s open source, anyone can create functions and bundle them in a
package (or library). We can download these packages to access new
functions.
Most functions in R take arguments, which we can look up in the function
documentation or by pressing the Tab key while the cursor is inside the
function’s parentheses.
# To get the sum of a vector of numbers inside an object we use sum()
sum(x)
## [1] 45
We can nest functions inside other functions. For example, to get \(\sqrt{\sum_{i=1}^{n} x_i}\), the square root of the sum of the numbers in x, we can use:
sqrt(sum(x))
## [1] 6.708204
Making functions in R is not as complicated as it sounds and can be very useful when we need to do repetitive work. To define a function we need to specify the arguments that the function takes and what it does with those arguments. For example, the following function has only one argument, a name (string), and just pastes some text before and after it:
F1 <- function(name){
paste("Hola", name, "! welcome to the R world (: !") # paste the name with some text
}
# Testing the function (Put your name)
F1(name = "Pablo")
## [1] "Hola Pablo ! welcome to the R world (: !"
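Functions can take more than one argument, and an argument can have a default value that is used when we don’t supply it. The sketch below illustrates this; greet() and its greeting argument are hypothetical names made up for this example, and paste0() is used instead of paste() so that no space is added before the exclamation mark.

```r
# greet() is a hypothetical function used only for illustration
greet <- function(name, greeting = "Hola") {
  # paste0() joins the pieces without adding spaces between them
  paste0(greeting, " ", name, "! Welcome to the R world!")
}

greet("Pablo")                      # uses the default greeting
greet("Pablo", greeting = "Hello")  # overrides the default
```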
Besides storing numbers in the objects in R, we can store text, matrices, tables, spatial objects, images, and other types of objects. Next we will import our own data and do some manipulation in R.
Exercise: Create a function that performs the \(\sqrt{\sum_{i=1}^{n} x_i}\) operation you did
previously with the code sqrt(sum(x)).
There are different types of variables in R; so far we have used the
numeric and string (character) types. If you want to know what kind of
variable a given object is, you can use the function class(). Let’s try
it.
class(x)
## [1] "numeric"
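To get a feeling for how class() responds to different kinds of values, here is a short sketch with a few values made up for illustration. All of the calls are base R.

```r
class(2)                       # "numeric"
class(2L)                      # "integer" (the L suffix asks for a whole number)
class("sow farm")              # "character"
class(TRUE)                    # "logical"
class(factor(c("No", "Yes")))  # "factor"
```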
Exercise: What kind of variable is the output from
the first function you defined, F1()?
The most commonly used types of variables include:
The library dplyr has several functions that can help us clean our
data, create new variables, and modify the data in other ways.
# if we don't have the library installed, we will need to install it using:
# install.packages("dplyr")
# we load the library:
library(dplyr)
dplyr provides an operator called the pipe (%>%, which originally comes
from the magrittr package) that chains several functions onto an object.
It is an alternative to nesting several functions in a single “line of
code”, and it keeps the code more organized. For example, if we want to
execute a function F1() followed by another function F2() on the
object x:
F2(F1(x))
is equivalent to
x %>% F1() %>% F2()
As you can see, to read the code F2(F1(x)) we have to go from the
inside to the outside to follow the order of execution of the
functions, but we read x %>% F1() %>% F2() from left to right, the same
way we would normally read any text in a Western language.
Suggestion: in RStudio we can use the keys Ctrl +
Shift + M to insert the %>% operator.
# We previously used this code to calculate the square root of the sum of x
sqrt(sum(x))
## [1] 6.708204
Using pipes we can do the same in a more organized way, writing the order of execution from left to right.
x %>% # First we call the data
sum() %>% # Sum all the values
sqrt() # Compute the square root
## [1] 6.708204
You will notice that the outputs are exactly the same. Feel free to use whichever syntax you prefer, but in this course we will mostly use pipes and write the code from left to right.
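The pipe places the object on its left as the first argument of the function on its right, so any extra arguments can still be written inside the parentheses. As a minimal sketch (assuming dplyr is installed; x is recreated so the example runs on its own), we can add a rounding step to the previous chain:

```r
library(dplyr)

x <- c(5, 6, 7, 8, 9, 10)

x %>%                 # First we call the data
  sum() %>%           # Sum all the values
  sqrt() %>%          # Compute the square root
  round(digits = 2)   # Round to 2 decimals: 6.71

# The nested version gives the same result
round(sqrt(sum(x)), digits = 2)
```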
R can import data in different formats. The most common are
spreadsheet files that open in Excel (.csv, .xls and .xlsx), text files
(.txt) and spatial data (.shp), which we will talk about
in more detail later.
To read .xls, .xlsx and .shp files we will
need to install some libraries. To install a new library you need to be
connected to the internet and use the function install.packages().
Once it has been installed, you can load the library using the function
library().
Download the excel file from this link. It’s not necessary to have a Box account.
Suggestion: When we want to use only one function
from a library, we can write the name of the library followed by
the operator :: and the name of the function, for example:
package::function(). This way we don’t have to load the
whole library.
# If we don't have the library installed, we use:
# install.packages("readxl")
# Import the excel file
PRRS <- readxl::read_xlsx("Data/PRRS.xlsx")
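The same :: syntax works with any installed package. The short sketch below uses stats and utils, two packages that ship with R, so it runs without installing anything. The numbers are the sequence from earlier in this lab.

```r
# median() lives in the stats package
stats::median(c(5, 6, 7, 8, 9, 10))  # 7.5

# head() lives in the utils package; letters is a built-in vector
utils::head(letters, 3)              # "a" "b" "c"
```

Writing the package name explicitly also makes it clear where a function comes from when two packages define functions with the same name.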
The most popular format for tables in R is the data.frame, which is
the kind of object we get when we import data from a .csv or
.xlsx file. We can examine what kind of object it is using the
function class():
# Ask the class of a given object
class(PRRS)
## [1] "tbl_df" "tbl" "data.frame"
Objects can have more than one class. In the
previous example, you can see that the PRRS object has 3
different classes: it is a tibble (tbl_df and tbl), which is a special
kind of data.frame.
In the following section we will cover some basics of data manipulation, including creating, modifying and summarising variables in our data.
Sometimes we want to select specific columns and rows of our data to reduce its dimensionality. For this we can use the functions:
select() to select specific columns
slice() to select specific rows based on position
filter() to select specific rows based on a condition
PRRS %>% # the name of our data
select(Result, farm_type) # we want to select only the columns Result and farm_type
## # A tibble: 1,353 × 2
## Result farm_type
## <chr> <chr>
## 1 No sow farm
## 2 No sow farm
## 3 No sow farm
## 4 Yes sow farm
## 5 Yes sow farm
## 6 Yes sow farm
## 7 Yes sow farm
## 8 Yes sow farm
## 9 Yes sow farm
## 10 Yes sow farm
## # ℹ 1,343 more rows
We can also specify which columns we DON’T want to show in our data:
PRRS %>%
select(-Age, -id) # with a '-' before the name we will exclude the column from the data
## # A tibble: 1,353 × 6
## Result Sex OtherSpecies name farm_type County
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 No H 0 Armstrong Research Farm sow farm Pottawattamie
## 2 No H 0 Armstrong Research Farm sow farm Pottawattamie
## 3 No H 0 Armstrong Research Farm sow farm Pottawattamie
## 4 Yes H 0 Armstrong Research Farm sow farm Pottawattamie
## 5 Yes H 0 Armstrong Research Farm sow farm Pottawattamie
## 6 Yes H 0 Armstrong Research Farm sow farm Pottawattamie
## 7 Yes M 0 Armstrong Research Farm sow farm Pottawattamie
## 8 Yes M 0 Armstrong Research Farm sow farm Pottawattamie
## 9 Yes M 0 Armstrong Research Farm sow farm Pottawattamie
## 10 Yes H 0 Armstrong Research Farm sow farm Pottawattamie
## # ℹ 1,343 more rows
Filtering rows is based on Boolean logic: we specify a condition and R shows only the rows that satisfy it. For example, let’s keep only the observations from boar studs:
PRRS %>%
filter(farm_type == 'boar stud') # we will use the equals to operator for this
## # A tibble: 19 × 8
## Result Sex Age OtherSpecies id name farm_type County
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 No H 48 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 2 No H 60 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 3 Yes H 60 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 4 Yes H 15 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 5 No H 68 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 6 No M 6 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 7 No H 75 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 8 No H 6 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 9 No H 110 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 10 No H 36 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 11 No M 12 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 12 No H 62 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 13 No H 48 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 14 No H 62 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 15 No H 95 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 16 Yes M 24 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 17 Yes H 38 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 18 No H 38 0 32 Farm Sweet Farm at Ro… boar stud Shelby
## 19 No M 24 0 32 Farm Sweet Farm at Ro… boar stud Shelby
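Conditions can be combined, and slice() selects rows by position instead of by condition. The sketch below assumes dplyr is installed and uses a tiny data set invented for illustration (toy is not the PRRS data), so the results are easy to check by eye.

```r
library(dplyr)

# Toy data invented for this illustration; not the PRRS data set
toy <- data.frame(
  Result = c("No", "Yes", "Yes", "No"),
  Age = c(48, 15, 60, 6),
  farm_type = c("boar stud", "boar stud", "sow farm", "nursery")
)

# Rows that satisfy BOTH conditions (& means "and")
positives_boar <- toy %>%
  filter(farm_type == "boar stud" & Result == "Yes")

# Rows whose farm type is any of the listed values
boar_or_nursery <- toy %>%
  filter(farm_type %in% c("boar stud", "nursery"))

# Rows 1 and 2, selected by position
first_two <- toy %>%
  slice(1:2)
```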
To create a new variable we can use the function mutate(). For
example, let’s create a new variable that tells us whether the farm
type is a sow farm or not. For this we use the variable farm_type,
which already contains information for the different farm types:
# Creating a new variable
PRRS %>% # name of the data set
mutate( # mutate is the function we use to create new variables
SowFarm = ifelse(farm_type == 'sow farm', 1, 0) # we will create a binary variable where 1 = sow farm, 0 = Any other farm type
)
## # A tibble: 1,353 × 9
## Result Sex Age OtherSpecies id name farm_type County SowFarm
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 No H 18 0 23 Armstrong Res… sow farm Potta… 1
## 2 No H 60 0 23 Armstrong Res… sow farm Potta… 1
## 3 No H 60 0 23 Armstrong Res… sow farm Potta… 1
## 4 Yes H 36 0 23 Armstrong Res… sow farm Potta… 1
## 5 Yes H 50 0 23 Armstrong Res… sow farm Potta… 1
## 6 Yes H 16 0 23 Armstrong Res… sow farm Potta… 1
## 7 Yes M 15 0 23 Armstrong Res… sow farm Potta… 1
## 8 Yes M 22 0 23 Armstrong Res… sow farm Potta… 1
## 9 Yes M 30 0 23 Armstrong Res… sow farm Potta… 1
## 10 Yes H 14 0 23 Armstrong Res… sow farm Potta… 1
## # ℹ 1,343 more rows
Notice that in the previous code chunk we did not assign the result to
an object, which means that R just prints it and doesn’t save it.
To save it, remember that we use the assignment operator <- or =.
To modify an existing variable we can use the same function,
mutate(), and specify what we want to replace the existing
variable with. For example, in the next chunk of code we will modify
the variable Sex. The original variable has H for Female and M
for Male, so let’s change it to something that makes more sense:
# Now we will assign the new object to itself overwriting it.
PRRS <- PRRS %>% # this is the data set we will use
mutate( # we use the mutate function to create new variables
Sex = recode( # The function recode, can be used similar to replace in excel
Sex, # we will overwrite the variable Sex using the same name
H = 'Female',
M = 'Male'
)
)
Be careful when overwriting objects in R: there is no undo. It is important that your code is ordered and reproducible, so that if you make a mistake you can easily get back to where you were before it.
Often we want to calculate summary statistics from our data. We can group by a specific variable (or even multiple variables) to examine whether there are differences between groups.
The most basic way of doing this is with the function
count(), which tells us only the number of rows in each
group:
# We can count how many observations there are for each value of a variable
PRRS %>%
count(Result)
## # A tibble: 2 × 2
## Result n
## <chr> <int>
## 1 No 986
## 2 Yes 367
The previous table shows us that there were 986 observations with a negative test result and 367 with a positive test result. We can pass multiple variables to count(). Now we will count by Result and Sex:
# We can count based on multiple variables
PRRS %>%
count(Result, Sex)
## # A tibble: 4 × 3
## Result Sex n
## <chr> <chr> <int>
## 1 No Female 737
## 2 No Male 249
## 3 Yes Female 229
## 4 Yes Male 138
We can also calculate other statistics by group. For example, let’s calculate the mean and standard deviation of the age by Result and Sex:
PRRS %>%
group_by(Result, Sex) %>% # The groups used for the data
summarise( # the function summarise calculates statistics by the defined groups
meanAge = mean(Age), # Calculate the mean age
sdAge = sd(Age) # Calculate the standard deviation
)
## # A tibble: 4 × 4
## # Groups: Result [2]
## Result Sex meanAge sdAge
## <chr> <chr> <dbl> <dbl>
## 1 No Female 39.7 24.8
## 2 No Male 22.4 16.7
## 3 Yes Female 23.6 19.8
## 4 Yes Male 15.1 10.3
# PRRS %>%
# count(Sex, wt = Result)
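In the output above, the line "Groups: Result [2]" tells us that the summary table is still grouped by Result. The sketch below, which assumes dplyr version 1.0 or later and uses a small data set invented for illustration, shows how n() counts the rows in each group and how the .groups = "drop" argument returns an ungrouped table.

```r
library(dplyr)

# Toy data invented for this illustration; not the PRRS data set
toy <- data.frame(
  Result = c("No", "No", "No", "Yes", "Yes"),
  Sex = c("Female", "Female", "Male", "Female", "Male"),
  Age = c(10, 20, 30, 40, 50)
)

toy_summary <- toy %>%
  group_by(Result, Sex) %>%   # The groups used for the data
  summarise(
    n = n(),                  # Number of rows in each group
    meanAge = mean(Age),      # Mean age in each group
    .groups = "drop"          # Return a table without groups
  )

toy_summary
is_grouped_df(toy_summary)    # FALSE: no grouping is left
```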
Sometimes we have different data sets that have variables in common and we want to integrate them into a single data set for further analysis. In this example we have the data sets node_attrib.csv and network.csv.
# First we import the data -----------
# Importing the farm dataset
nodes <- read.csv("Data/node_attrib.csv")
# Importing the movement dataset
mov <- read.csv("Data/network.csv")
The mov data has information on the place of origin
(id_orig) and destination (id_dest) of animal
shipments. First we will calculate the total number of pigs moved for
each farm, separately for outgoing and incoming movements. To do this
we will use the function count():
# Get the number of outgoing pigs for each farm
Out <- mov %>% # First we call the data
count(id_orig, wt = pigs.moved) %>% # then we add up the pigs moved from each origin
rename(id = id_orig, outgoing = n) # Rename the variables
You will notice that we added the argument wt = pigs.moved;
pigs.moved is a variable that tells us the number of pigs shipped.
The argument wt in the function count() lets us add a weight to the
counts, so instead of counting rows (shipments) it adds up the pigs
moved in each group. We also used the function rename(), which does
what it sounds like: it renames variables in the data.
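To make the effect of wt concrete, here is a minimal sketch with a tiny movement table invented for illustration (toy_mov is not the course data; dplyr is assumed to be installed). It compares a plain count, a weighted count, and the equivalent group_by() plus summarise() version.

```r
library(dplyr)

# Toy movements invented for this illustration
toy_mov <- data.frame(
  id_orig = c(1, 1, 2),
  pigs.moved = c(10, 20, 5)
)

# Without wt: number of shipments (rows) per origin
shipments <- toy_mov %>% count(id_orig)

# With wt: total pigs moved per origin
pigs <- toy_mov %>% count(id_orig, wt = pigs.moved)

# The weighted count is the same as summing within groups
pigs_alt <- toy_mov %>%
  group_by(id_orig) %>%
  summarise(n = sum(pigs.moved))
```

Origin 1 has 2 shipments but 30 pigs moved, which is the difference between the two counts.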
Now we will join the data with the farm information contained in the
nodes object. To do this we use the left_join() function:
# Joining the two tables
# First we join with the outgoing
farms <- nodes %>% # This is the main object we will join with
left_join(Out, by ="id") # we need to specify the variable name we are joining by
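A left join keeps every row of the first table and fills in NA where the second table has no match. The sketch below uses two tiny tables invented for illustration (toy_nodes and toy_out are hypothetical; dplyr is assumed to be installed) to show this behavior, and compares it with inner_join(), which keeps only the rows that match.

```r
library(dplyr)

# Toy tables invented for this illustration
toy_nodes <- data.frame(
  id = c(1, 2, 3),
  farm_type = c("sow farm", "nursery", "GDU")
)
toy_out <- data.frame(
  id = c(1, 3),            # farm 2 has no outgoing shipments
  outgoing = c(30, 5)
)

# All 3 farms are kept; farm 2 gets NA in outgoing
toy_left <- toy_nodes %>%
  left_join(toy_out, by = "id")

# Only the 2 farms present in both tables are kept
toy_inner <- toy_nodes %>%
  inner_join(toy_out, by = "id")
```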
Exercise: Now create an object that is the same as
the Out object we just created but, instead of counting the
outgoing animals, counts the incoming animals. For this you can use the
column id_dest. Then join the new object you created with the
farms data. Name the variable that has the number of pigs moved
incoming.
# Get the number of incoming pigs for each farm
In <- mov %>% # First we call the data
count(id_dest, wt = pigs.moved) %>% # then we add up the pigs moved to each destination
rename(id = id_dest, incoming = n) # Rename the variables
# Then we join with the incoming
farms <- farms %>%
left_join(In, by = "id")
The first rows of your data should look like this:
head(farms)
## id name lat long farm_type outgoing
## 1 1 Iowa Select Farms Inc 42.50489 -93.26323 sow farm 3528
## 2 2 Stanley Martins Fleckvieh Farms 43.08261 -91.56682 sow farm NA
## 3 3 Centrum Valley Farms 42.66331 -93.63630 nursery 1087
## 4 4 Hilltop Farms fresh produce 41.71651 -93.90491 sow farm 1606
## 5 5 Hog Slat Inc. 42.25929 -91.15566 GDU 3440
## 6 6 Safari Iowa Hunting Farms 41.56854 -92.02317 GDU 1073
## incoming
## 1 1466
## 2 3382
## 3 NA
## 4 3684
## 5 1467
## 6 4491
Notice that some rows have NA values in the outgoing and incoming columns. NAs can interfere with functions such as sum(), which returns NA if any value is missing. In the next example we will summarise the data and get the farms with the highest number of incoming pigs:
farms %>% # This is our joined data
group_by(id) %>% # We will group it by id
summarise( # we will perform some summary statistics
incoming = sum(incoming, na.rm = TRUE) # notice that we add the argument na.rm = TRUE to ignore the NAs
) %>%
arrange(desc(incoming)) %>% # now we will order by incoming
head(5) # we use the head() function to get the first 5 observations
## # A tibble: 5 × 2
## id incoming
## <int> <int>
## 1 17 79948
## 2 8 19184
## 3 14 12334
## 4 7 11034
## 5 15 4648
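To see why na.rm matters, the short sketch below sums the first four incoming values shown in the head(farms) output above, with and without removing the NA.

```r
# First four incoming values from the head(farms) output
incoming <- c(1466, 3382, NA, 3684)

sum(incoming)                # NA: a single missing value makes the sum missing
sum(incoming, na.rm = TRUE)  # 8532: the NA is dropped before adding
```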
Once you have processed the data, you will often want to export it so that next time you don’t have to run all the code again. You can export to multiple formats, but the most common is a comma-delimited file (CSV), which can be read in Excel.
write.csv(farms, 'Data/farms.csv', row.names = FALSE)
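A quick way to check an export is to read the file back in. The sketch below uses a small table invented for illustration and writes it to a temporary file created with tempfile(), so it doesn’t touch the Data folder. In your own work you would use the farms object and the path shown above.

```r
# Toy table invented for this illustration
toy_farms <- data.frame(
  id = c(1, 2),
  incoming = c(1466, 3382)
)

# Write to a temporary file instead of the Data folder
out_file <- tempfile(fileext = ".csv")
write.csv(toy_farms, out_file, row.names = FALSE)

# Read the file back and confirm it has the same shape and column names
farms_again <- read.csv(out_file)
dim(farms_again)    # 2 rows, 2 columns
names(farms_again)  # "id" "incoming"
```

Setting row.names = FALSE keeps R from writing an extra first column with the row numbers.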
This lab has been developed with contributions from: Jose Pablo
Gomez-Vazquez.
Feel free to use these training materials for your own research and
teaching. When using the materials we would appreciate using the proper
credits. If you would be interested in a training session, please
contact: jpgo@ucdavis.edu