R is a full-blown programming language for statistical analysis and high-level graphics. The programming language aspect of R has allowed developers (often with advanced degrees in statistics) to create a tremendous number of packages that make it easy to apply powerful analytical techniques. This lesson introduces you to the programming language side of R with a review of basic syntax and common data types and structures. At the end of this lesson, you should be able to identify and manipulate the most common data structures in R.

Typography:

R commands within a paragraph will appear in this font.

# Blocks of code like this one identify R commands you should copy or type into the R console.
print('Output will appear in a block like this.')
## [1] "Output will appear in a block like this."

Functions

The first thing you need to know about R is that it is a functional language, which means that there aren’t really statements per se, only functions, almost all of which return some sort of result or object. One such function is getwd(), which returns the current working directory as a character string.

There are three ways to use the name of a function:

  1. ?getwd will display documentation for the function
  2. getwd will print out the contents of the function – either R script or compiled code
  3. getwd() will invoke the function, returning a result
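
To make the "only functions" idea concrete, here is a minimal sketch using base R only. The variable names total and size are illustrative and not part of the lesson.

```r
# Even operators are functions: backticks let you call `+` by name
total <- `+`(2, 3)
total
## [1] 5
# An if/else construct returns a value that can be assigned
size <- if (total > 4) "big" else "small"
size
## [1] "big"
# 'getwd' is the function object, 'getwd()' is the result of calling it
is.function(getwd)
## [1] TRUE
is.character(getwd())
## [1] TRUE
```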

Often, you will want to assign the result of a function to a variable. Typing the name of the variable will print out the contents of that variable:

thisDir <- getwd()
thisDir
## [1] "/Users/jonathancallahan/Projects/IRIS/MUSTANG/metrics/Notebooks/IRISClass2015"

You could perform both of these actions in one step as a function’s return value is typically printed to the console by default:

getwd()
## [1] "/Users/jonathancallahan/Projects/IRIS/MUSTANG/metrics/Notebooks/IRISClass2015"

Task 0: Investigate the following functions:

system   data creation   math   stats      plotting
getwd    c               min    rnorm      plot
setwd    seq             mean   quantile   hist
dir      rep             log    cor        mtext

Examining the documentation for ?dir we see the following:

list.files(path = ".", pattern = NULL, all.files = FALSE,
           full.names = FALSE, recursive = FALSE,
           ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)

dir(path = ".", pattern = NULL, all.files = FALSE,
    full.names = FALSE, recursive = FALSE,
    ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
...

This documentation tells us that there is an equivalent function named list.files() and that these functions take arguments. As seen above, default values are often specified. In this case the default value for the ‘path’ argument is ‘.’ – the current directory.

Arguments to functions may be specified by position or by name or some mix. So you can ask for the absolute path of all files in the current directory with: dir(getwd(),full.names=TRUE). The call to getwd() returns the absolute path of the current directory and this is used as the first argument. We accept default values for all other arguments except ‘full.names’ where we override the default value by setting it to ‘TRUE’.
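
The same rules can be tried out on a function with simpler output. The sketch below uses seq() (one of the Task 0 functions); the variable names are illustrative only.

```r
# seq() takes arguments named 'from', 'to' and 'by' (see ?seq)
byPosition <- seq(1, 9, 2)            # all arguments by position
byName <- seq(from=1, to=9, by=2)     # all arguments by name
mixed <- seq(1, 9, by=2)              # a mix of both
byPosition
## [1] 1 3 5 7 9
identical(byPosition, byName)
## [1] TRUE
identical(byName, mixed)
## [1] TRUE
# Default values can be inspected programmatically with formals()
formals(dir)$path
## [1] "."
```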

R has other special values besides just ‘TRUE’ and ‘FALSE’. Here are the most common ones:
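
A short sketch of the special values you are most likely to meet, assuming the usual base R set of NA, NULL, NaN and Inf. The variable names are illustrative only.

```r
missingValue <- NA      # a missing value (see ?NA)
nothing <- NULL         # the absence of any object (see ?NULL)
notANumber <- 0 / 0     # NaN, an undefined numeric result
infinite <- 1 / 0       # Inf, positive infinity (-1/0 gives -Inf)

is.na(missingValue)
## [1] TRUE
is.null(nothing)
## [1] TRUE
is.nan(notANumber)
## [1] TRUE
is.infinite(infinite)
## [1] TRUE
# NA propagates through calculations unless you ask for it to be removed
mean(c(1, NA, 3))
## [1] NA
mean(c(1, NA, 3), na.rm=TRUE)
## [1] 2
```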

Operators

R has the basic set of mathematical operators: +, -, *, /, ^, %%. (See ?base::Arithmetic for more details.) Comparison operators include >, <, >=, <=, ==, !=. (See ?base::Comparison)

Note: R does not have ‘++’ style increment operators.

Note: Although R supports = as the assignment operator, the vast majority of R users prefer <- (left assignment).
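
A few of these operators in action, as a quick sketch. The variable counter is illustrative and shows the usual substitute for a '++' operator.

```r
7 / 2       # division
## [1] 3.5
7 %% 2      # modulo (remainder)
## [1] 1
2 ^ 10      # exponentiation
## [1] 1024
7 != 2      # comparisons return a logical value
## [1] TRUE
# Without '++', increment by reassigning
counter <- 0
counter <- counter + 1
counter
## [1] 1
```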

# Randomly sample the normal distribution 100 times and assign the result to 'a'
a <- rnorm(100) # see ?rnorm
# Print out the length of 'a'
length(a)
## [1] 100
# Print out the values associated with the default quantiles
quantile(a)
##          0%         25%         50%         75%        100% 
## -2.10908616 -0.44394881  0.03165677  0.51687021  2.50936675

Task 1: Explain the following lines and plots:

b1 <- rnorm(1e6, mean=2.0)
b2 <- rnorm(1e6, mean=-2.0)
hist(c(b1,b2), n=100) # see ?c
subtitle <- paste('Bimodal Std. Dev. =', sd(c(b1,b2)))
mtext(subtitle)

# Why is this next plot unimodal?
hist(b1+b2, n=100)
subtitle <- paste('Unimodal Std. Dev. =', round(sd(b1+b2),3)) # see ?round
mtext(subtitle)


Vectorized Data

The first thing you need to know about any programming language is the set of fundamental data types. Here are the most basic types you will encounter in R:
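
A minimal sketch of the basic types, assuming the usual base R set of logical, numeric, integer, character and complex. It uses the class() function described in the next paragraph.

```r
class(TRUE)         # logical
## [1] "logical"
class(3.14)         # numeric (double precision)
## [1] "numeric"
class(42L)          # integer; the 'L' suffix requests an integer
## [1] "integer"
class('seismic')    # character
## [1] "character"
class(1+2i)         # complex
## [1] "complex"
```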

You can find out what data type a variable contains with the typeof() function. However, we recommend that you instead get in the habit of using the class() function to find out about the class of an object. The class() function is more general in that it will return information about data variables, functions and any of the S3 or S4 objects from R’s increasingly huge bestiary.

For instance, typeof(b1) returns ‘double’ while class(b1) returns the slightly more generic ‘numeric’. But typeof(hist) returns ‘closure’ whereas class(hist) returns ‘function’ which is much more appropriate for non-computer science people.
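
The difference is easy to see side by side. This sketch uses a small stand-in vector x (illustrative, in place of b1) plus two built-in objects.

```r
x <- rnorm(5)
typeof(x)       # low-level storage type
## [1] "double"
class(x)        # more generic class
## [1] "numeric"
typeof(hist)
## [1] "closure"
class(hist)
## [1] "function"
# For richer objects, class() is usually the more informative of the two
typeof(iris)
## [1] "list"
class(iris)
## [1] "data.frame"
```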

Data variables in R are vectorized meaning that you can write code that looks a lot like a mathematical expression without having to do explicit looping. We did this in the example above with ‘b1+b2’, which created a new vector by adding b1 and b2 element-by-element.
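
For comparison, here is a sketch of the explicit loop that vectorization saves you from writing. The short vectors x and y are illustrative stand-ins for b1 and b2.

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30)

# Explicit loop, as you might write it in C or Java
loopSum <- numeric(length(x))
for (i in seq_along(x)) {
  loopSum[i] <- x[i] + y[i]
}

# Vectorized version: one expression, no loop
vectorSum <- x + y
vectorSum
## [1] 11 22 33
identical(loopSum, vectorSum)
## [1] TRUE
```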

If you want to access a particular element in a vector you use square brackets to specify which indices you are interested in. (Note: R indices begin at 1, not 0 like C/Java/Python.) Another way to pick elements out of a vector is with a mask – a logical vector of the same length as the original vector.

# Look at the first five and last three elements of the built-in 'letters' character vector
letters[1:5]
## [1] "a" "b" "c" "d" "e"
letters[24:26]
## [1] "x" "y" "z"
# Create a logical vector that is TRUE for all even indices
evenMask <- (1:26 %% 2) == 0
evenMask
##  [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
## [12]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
## [23] FALSE  TRUE FALSE  TRUE
letters[evenMask]
##  [1] "b" "d" "f" "h" "j" "l" "n" "p" "r" "t" "v" "x" "z"
# Use the `%in%` operator (see ?match) to create a logical vector identifying elements 
# from one vector that match elements from another vector
vowelMask <- letters %in% c('a','e','i','o','u','y')
vowelMask
##  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
## [12] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [23] FALSE FALSE  TRUE FALSE
letters[vowelMask]
## [1] "a" "e" "i" "o" "u" "y"
# The `which()` function identifies indices where the logical vector is TRUE
which(vowelMask)
## [1]  1  5  9 15 21 25
# 'letters[vowelMask]' returns a vector of six elements.
# We can use the '[ ]' syntax again to extract the first 5 elements of the return vector
letters[vowelMask][1:5]
## [1] "a" "e" "i" "o" "u"
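
Because masks are ordinary logical vectors, they can be negated, combined and counted. The following is a small sketch along those lines; lateMask is an illustrative name not used elsewhere in the lesson.

```r
vowelMask <- letters %in% c('a','e','i','o','u','y')
lateMask <- 1:26 > 13             # TRUE for the second half of the alphabet

# TRUE counts as 1, so sum() counts the matches
sum(vowelMask)
## [1] 6
# '!' negates a mask
letters[!vowelMask][1:5]
## [1] "b" "c" "d" "f" "g"
# '&' keeps elements where both masks are TRUE
letters[vowelMask & lateMask]
## [1] "o" "u" "y"
```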

Task 2: Explain the following lines and the plot:

everyone <- rnorm(1e6)
percentiles <- quantile(everyone, probs=seq(0,1,.01))
moocherMask <- (everyone <= percentiles[48]) # vectorized comparison; element 48 is the 47% quantile because element 1 is 0%
moochers <- everyone[moocherMask]
hist(everyone, n=100, main="Moochers?", col.main='red')
hist(moochers, n=47, col='red', add=TRUE)


Data Structures

vector

We have already discussed the fundamental data structure in R – vector. Anything numeric is a vector even if it only has a single value, even if you are tempted to think it is a constant. For example,

length(pi)
## [1] 1
pi[1:10]
##  [1] 3.141593       NA       NA       NA       NA       NA       NA
##  [8]       NA       NA       NA

Here, R is reporting that the numeric vector named pi has length one, with \(\pi\) as its first element (see ?print to modify rounding). But we can still reference elements 2:10 and find that they hold NA (missing value).
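
The same behavior holds for any single number. A brief sketch, with x as an illustrative variable:

```r
length(42)
## [1] 1
is.vector(42)
## [1] TRUE
is.na(pi[2])      # indexing past the end yields NA rather than an error
## [1] TRUE
# Assigning past the end extends the vector, filling the gap with NA
x <- 42
x[3] <- 7
x
## [1] 42 NA  7
```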

matrix

A matrix is a 2-dimensional data structure where all elements are of the same type – numeric, character, logical, etc. Matrices are created with the matrix() function and elements can be referenced with [row,column] index notation. You can select an entire row or column by leaving the other index blank. (Higher dimensional arrays are created with array().)

m <- matrix(c(11,12,13,21,22,23), nrow=2, byrow=TRUE)
dim(m) # dimensions 
## [1] 2 3
m[2,2:3] # row 2, columns 2 & 3
## [1] 22 23
m[c(1,2),c(1,3)] # rows 1 & 2, columns 1 & 3
##      [,1] [,2]
## [1,]   11   13
## [2,]   21   23
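
Selecting whole rows or columns, and building a higher dimensional array, can be sketched as follows. The name a3 is illustrative; m is re-created so the block runs on its own.

```r
m <- matrix(c(11,12,13,21,22,23), nrow=2, byrow=TRUE)
m[1,] # entire first row
## [1] 11 12 13
m[,3] # entire third column
## [1] 13 23
# A 3-dimensional array with 2 rows, 3 columns and 4 layers
a3 <- array(1:24, dim=c(2,3,4))
dim(a3)
## [1] 2 3 4
a3[2,3,4] # the last element
## [1] 24
```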

list

A list is the R version of an associative array, such as a dictionary in Python, with keys and values. Unlike a typical dictionary, a list also keeps its elements in order, so they can be referenced by position as well as by key. Each element of a list contains some other object that may be of any type. So you can have one list element that contains a vector while another contains a matrix or a dataframe (see below).

List elements are referenced with double bracket notation [[x]] where x can be either an integer index or a character string key.

myList <- list('digits'=0:9,
               'xyz'=letters[24:26],
               'matrix'=m)

names(myList)
## [1] "digits" "xyz"    "matrix"
myList[[2]]
## [1] "x" "y" "z"
myList[['matrix']]
##      [,1] [,2] [,3]
## [1,]   11   12   13
## [2,]   21   22   23
myList[['matrix']][,3] # column 3 of the 'matrix' element
## [1] 13 23
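
A common stumbling block is the difference between single and double brackets. The sketch below re-creates a smaller myList so it runs on its own; the 'units' element is hypothetical and added only for illustration.

```r
myList <- list('digits'=0:9,
               'xyz'=letters[24:26])

class(myList['xyz'])      # single brackets return a sub-list
## [1] "list"
class(myList[['xyz']])    # double brackets return the element itself
## [1] "character"
myList$xyz                # '$' is shorthand for [['xyz']]
## [1] "x" "y" "z"
# Assigning to a new key adds an element
myList[['units']] <- 'cm'
names(myList)
## [1] "digits" "xyz"    "units"
```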

dataframe

Last, but definitely not least, R has a dataframe type which is a cross between a list and a matrix. Whereas matrices require that all elements be of the same basic type, dataframes are like spreadsheets in that each column can have a different type. Dataframe rows are understood as individual records.

Dataframe elements can be referenced with either matrix or list syntax. Individual columns (variables) can also be referenced with df$var syntax as seen below.

R comes with lots of built-in datasets. (Use data() to see them all.) Many of these are dataframes. We will explore the iris dataset to learn more about dataframes.

# First, let's learn about this dataset
class(iris)
## [1] "data.frame"
dim(iris)
## [1] 150   5
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
# The first four columns are numeric variables. The 'Species' column is a *factor* (see ?factor)

# Two calls to hist() showing two ways to access dataframe columns
hist(iris$Sepal.Width, breaks=seq(0,10,.2), xlab='', main='', las=1) # (see ?par to learn about 'las')
hist(iris[['Sepal.Length']], breaks=seq(0,10,.2), border='red', add=TRUE)
legend('topright',
       legend=c('Sepal Width','Sepal Length'),
       col=c('black','red'),
       pch=0)
title('Botanical Statistics')
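
The matrix and list styles of access can be compared directly on iris. This is a small sketch using only base R:

```r
# Matrix syntax: rows 1 to 3 of the 'Sepal.Width' column
iris[1:3, 'Sepal.Width']
## [1] 3.5 3.0 3.2
# List syntax and '$' syntax return the same column
identical(iris$Sepal.Width, iris[['Sepal.Width']])
## [1] TRUE
# So does matrix syntax with the column number
identical(iris$Sepal.Width, iris[,2])
## [1] TRUE
# A single row is one record, returned as a one-row dataframe
class(iris[1,])
## [1] "data.frame"
```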


Task 3: See if you can figure out what is happening in the code below:

cols <- c('firebrick','goldenrod3','slateblue') # see ?colors
cols <- adjustcolor(cols,0.6) # see ?adjustcolor
pchs <- 15:17 # see ?points
cex <- 1.5 # see ?par

plot(iris$Sepal.Length, iris$Sepal.Width, las=1,
     xlab='Sepal Length (cm)', ylab='Sepal Width (cm)', # see ?iris for the units
     main='Iris Dataset',
     cex=cex,
     col=cols[iris$Species], # hint: what does 'as.numeric(iris$Species)' produce?
     pch=pchs[iris$Species])

legend('topright',
       legend=levels(iris$Species),
       pt.cex=cex, col=cols, pch=pchs)


That ends the whirlwind introduction to R. You have been exposed to a lot of new material that bears further exploration. A person could spend hours just reading up on ?par and playing with different graphical parameters.

With the different datasets available through data() you should have plenty of examples to explore.

