In Lesson 01 we scratched the surface of R’s data structures, poking at the dataframe that is the ‘iris’ dataset. In this lesson we will read in data from an external CSV file and delve more deeply into the kinds of manipulations that R can do with dataframes. We will also explore some of R’s basic plot types.
The read.table() function and its offspring have enough arguments that they can read in just about any CSV file out there. It’s usually just a question of specifying the correct values for those arguments. We will begin by reading in text output from the MUSTANG web service (http://service.iris.edu/mustang/). The read.csv() function will parse the contents of the URL and return a dataframe.
Note 1: The text output coming from MUSTANG begins with a single header line containing a human-readable name for the requested metric. We already know what the metric is because we specified it with ‘metric=sample_rms’ in the URL. Since this extra line is not part of the CSV table, we skip it by specifying the ‘skip=1’ parameter when parsing the data.
Note 2: In versions of R before 4.0.0, columns with character strings are converted to factors by default, which is OK for statisticians but can lead to confusion for those used to other programming environments. We recommend disabling this conversion with ‘stringsAsFactors=FALSE’. From R 4.0.0 onward this is already the default, and setting it explicitly is harmless.
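To see the effect of these two arguments without touching the network, here is a minimal sketch that parses a small inline string. The text in `csv_text` is a hypothetical stand-in that only imitates the shape described above (one title line followed by a regular CSV table); it is not actual MUSTANG output.

```r
# Hypothetical MUSTANG-style text: one title line, then a regular CSV table
csv_text <- '"Sample RMS Metric"
"value","target","start"
309.020,"IU.ANMO.10.BH1.M","2015/03/18 00:00:00"
468.103,"IU.ANMO.10.BHZ.M","2015/03/18 00:00:00"'

# skip=1 drops the title line so the second line supplies the column names
demo <- read.csv(text = csv_text, skip = 1, stringsAsFactors = FALSE)

names(demo)          # column names taken from the second line
class(demo$target)   # character, not factor
```

Without ‘skip=1’ the title line would be taken as the header and the table would no longer parse into the expected columns.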
# (Read about URL parameters at the web service URL)
url <- "http://service.iris.edu/mustang/measurements/1/query?net=IU&sta=ANMO&cha=BH.|HH.&timewindow=2015-03-18,2015-04-01&output=text&metric=sample_rms&orderby=start"
rms <- read.csv(url, stringsAsFactors=FALSE, skip=1)
# Always begin with some minimal investigation of the number of rows and columns
dim(rms)
## [1] 126 5
# What names do the columns have
names(rms)
## [1] "value" "target" "start" "end" "lddate"
# Quick look at the first few lines
head(rms)
## value target start end
## 1 309.020 IU.ANMO.10.BH1.M 2015/03/18 00:00:00 2015/03/19 00:00:00
## 2 468.103 IU.ANMO.10.BHZ.M 2015/03/18 00:00:00 2015/03/19 00:00:00
## 3 358.488 IU.ANMO.10.HH1.M 2015/03/18 00:00:00 2015/03/19 00:00:00
## 4 623.670 IU.ANMO.10.HHZ.M 2015/03/18 00:00:00 2015/03/19 00:00:00
## 5 518.861 IU.ANMO.10.HH2.M 2015/03/18 00:00:00 2015/03/19 00:00:00
## 6 378.415 IU.ANMO.10.BH2.M 2015/03/18 00:00:00 2015/03/19 00:00:00
## lddate
## 1 2015/03/20 08:17:39.795571
## 2 2015/03/20 08:18:43.574060
## 3 2015/03/20 08:19:15.252155
## 4 2015/03/20 08:20:14.445695
## 5 2015/03/20 08:19:45.815140
## 6 2015/03/20 08:18:10.995803
# Write a for loop to check the class of the first 5 elements
for (i in 1:5) {
print( class(rms[[i]]) )
}
## [1] "numeric"
## [1] "character"
## [1] "character"
## [1] "character"
## [1] "character"
# Or, nicer and on a single line:
for (name in names(rms)) print( paste(name,':',class(rms[[name]])) )
## [1] "value : numeric"
## [1] "target : character"
## [1] "start : character"
## [1] "end : character"
## [1] "lddate : character"
# Check the summary
summary(rms)
## value target start end
## Min. : 155.5 Length:126 Length:126 Length:126
## 1st Qu.: 247.3 Class :character Class :character Class :character
## Median : 334.5 Mode :character Mode :character Mode :character
## Mean : 765.2
## 3rd Qu.: 778.7
## Max. :8479.9
## lddate
## Length:126
## Class :character
## Mode :character
##
##
##
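The class checks done with for loops above can also be written with sapply(), which applies a function to every column of a dataframe. The sketch below uses a small stand-in dataframe with hypothetical values rather than the real ‘rms’ data.

```r
# Small stand-in dataframe (hypothetical values)
demo <- data.frame(value = c(309.02, 468.10),
                   target = c("IU.ANMO.10.BH1.M", "IU.ANMO.10.BHZ.M"),
                   stringsAsFactors = FALSE)

# sapply() applies class() to every column and returns a named vector
col_classes <- sapply(demo, class)
print(col_classes)

# str() gives a compact overview of dimensions, classes and first values
str(demo)
```

One caveat: a column with more than one class, such as a ‘POSIXct’ date-time, makes sapply() return a list instead of a simple character vector.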
Often we will want to modify a dataframe before performing detailed analysis. Common manipulations include deleting columns, converting columns to a different class and adding new columns.
The following example does each of these.
# Delete columns we are uninterested in
rms$lddate <- NULL
# Convert character strings to class 'POSIXct'. (More on date classes in the next lesson.)
rms$start <- as.POSIXct(rms$start, format="%Y/%m/%d %H:%M:%OS", tz="GMT")
rms$end <- as.POSIXct(rms$end, format="%Y/%m/%d %H:%M:%OS", tz="GMT")
# Add more columns to the dataframe
# First, use the 'stringr' package `str_split_fixed()` function to split up the 'target' column
# (More on stringr in the next lesson.)
char_matrix <- stringr::str_split_fixed(rms$target,'\\.',5)
# Next, assign the columns of this character matrix to named columns in the dataframe
rms$net <- char_matrix[,1]
rms$sta <- char_matrix[,2]
rms$loc <- char_matrix[,3]
rms$cha <- char_matrix[,4]
rms[['qual']] <- char_matrix[,5] # demonstration of alternative column syntax
# Our converted and newly defined columns are available in the modified dataframe
head(rms)
## value target start end net sta loc cha qual
## 1 309.020 IU.ANMO.10.BH1.M 2015-03-18 2015-03-19 IU ANMO 10 BH1 M
## 2 468.103 IU.ANMO.10.BHZ.M 2015-03-18 2015-03-19 IU ANMO 10 BHZ M
## 3 358.488 IU.ANMO.10.HH1.M 2015-03-18 2015-03-19 IU ANMO 10 HH1 M
## 4 623.670 IU.ANMO.10.HHZ.M 2015-03-18 2015-03-19 IU ANMO 10 HHZ M
## 5 518.861 IU.ANMO.10.HH2.M 2015-03-18 2015-03-19 IU ANMO 10 HH2 M
## 6 378.415 IU.ANMO.10.BH2.M 2015-03-18 2015-03-19 IU ANMO 10 BH2 M
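If the ‘stringr’ package is not installed, a similar character matrix can be built with base R. This is only a sketch of an alternative, not the approach used in this lesson, and it assumes every target string has exactly five dot-separated fields. The `targets` vector holds hypothetical values.

```r
# Hypothetical target strings in net.sta.loc.cha.qual form
targets <- c("IU.ANMO.10.BH1.M", "IU.ANMO.00.BHZ.M")

# fixed=TRUE treats '.' as a literal dot rather than a regex wildcard
parts <- strsplit(targets, ".", fixed = TRUE)

# Bind the per-string pieces into a character matrix, one row per target
char_matrix <- do.call(rbind, parts)
colnames(char_matrix) <- c("net", "sta", "loc", "cha", "qual")
char_matrix
```

Note that strsplit() with fixed=TRUE avoids the '\\.' escaping needed when the separator is interpreted as a regular expression.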
R uses the term factor to mean category or enumerated type. Factors are a way of classifying things. Sometimes you will want to create a factor from a numeric variable, binning values into ‘lo’, ‘med’, ‘hi’ for example. Other times, a character string will represent a category that will be repeated many times in the dataframe and should be converted into a factor.
Factors are stored in R as a vector of integers associated with a special lookup table with the category labels. When a character representation is needed the labels are used. Otherwise an integer representation is used.
Note: This multi-faceted behavior can lead to confusion when you write code that expects a character string but gets the integer representation. That’s why many avoid automatic conversion to factor.
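The integer-plus-labels storage described above, and the confusion it can cause, can be seen in a short sketch. All values and bin boundaries here are hypothetical and chosen only for illustration.

```r
# Hypothetical location codes stored as a factor
loc <- factor(c("10", "00", "10", "10"))

levels(loc)        # lookup table of labels: "00" "10"
as.integer(loc)    # underlying integer codes: 2 1 2 2
as.character(loc)  # labels: "10" "00" "10" "10"

# Pitfall: converting directly to numeric returns the codes, not the labels
as.numeric(loc)                 # 2 1 2 2
as.numeric(as.character(loc))   # 10 0 10 10

# Binning a numeric variable into categories with cut()
value <- c(155.5, 334.5, 778.7, 8479.9)
bins <- cut(value, breaks = c(0, 300, 1000, Inf), labels = c("lo", "med", "hi"))
table(bins)
```

Going through as.character() first is the usual way to recover the original values from a factor whose labels look like numbers.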
In our case, the ‘target’, ‘net’, ‘sta’, ‘loc’, ‘cha’ and ‘qual’ columns all represent categories, so let’s define them as factors.
rms$target <- as.factor(rms$target)
rms$net <- as.factor(rms$net)
rms$sta <- as.factor(rms$sta)
rms$loc <- as.factor(rms$loc)
rms$cha <- as.factor(rms$cha)
rms$qual <- as.factor(rms$qual)
# Summary now identifies how many records in each 'level' of a factor
summary(rms)
## value target start
## Min. : 155.5 IU.ANMO.00.BH1.M:14 Min. :2015-03-18 00:00:00
## 1st Qu.: 247.3 IU.ANMO.00.BH2.M:14 1st Qu.:2015-03-21 00:00:00
## Median : 334.5 IU.ANMO.00.BHZ.M:14 Median :2015-03-24 12:00:00
## Mean : 765.2 IU.ANMO.10.BH1.M:14 Mean :2015-03-24 12:00:00
## 3rd Qu.: 778.7 IU.ANMO.10.BH2.M:14 3rd Qu.:2015-03-28 00:00:00
## Max. :8479.9 IU.ANMO.10.BHZ.M:14 Max. :2015-03-31 00:00:00
## (Other) :42
## end net sta loc cha
## Min. :2015-03-19 00:00:00 IU:126 ANMO:126 00:42 BH1:28
## 1st Qu.:2015-03-22 00:00:00 10:84 BH2:28
## Median :2015-03-25 12:00:00 BHZ:28
## Mean :2015-03-25 12:00:00 HH1:14
## 3rd Qu.:2015-03-29 00:00:00 HH2:14
## Max. :2015-04-01 00:00:00 HHZ:14
##
## qual
## M:126
##
##
##
##
##
##
# For factors, the table() function will count up the occurrences of each 'level'
table(rms$cha)
##
## BH1 BH2 BHZ HH1 HH2 HHZ
## 28 28 28 14 14 14
# At this point, we will save our work in a binary .RData file for future reference
save(rms, file='rms_example.RData')
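To pick up saved work in a later session, load() restores the objects stored in an .RData file under their original names. The sketch below uses a stand-in dataframe and a temporary file, both hypothetical, so that it does not overwrite anything in your working directory.

```r
# Stand-in object and a temporary file (both hypothetical)
demo <- data.frame(value = c(309.02, 468.10))
rdata_file <- tempfile(fileext = ".RData")
save(demo, file = rdata_file)

# Remove the object, then restore it; load() returns the names it restored
rm(demo)
restored <- load(rdata_file)
restored
exists("demo")
```

Because load() silently overwrites existing objects of the same name, it is worth checking the returned names when loading a file you did not create yourself.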
One of the things R does well is make it very easy to create simple plots. The following commands show off some of the basic plot types. Readers are encouraged to examine plot function documentation with, for example, ?barplot. But don’t sweat the details. We will cover more detailed plot options in Lesson 04.
We will use bar and pie plots to show the number and proportion of measurements as grouped by location and channel.
# Save default graphical parameters (omitting read-only ones so they can be restored without warnings)
oldPar <- par(no.readonly=TRUE)
# Setup a layout for multiple plots
layout(matrix(1:2, nrow=1, byrow=TRUE)) # see ?layout and ?matrix
# By Channel
barplot(table(rms$cha), las=2, col=rainbow(6), main="Observation Counts") # see ?rainbow
pie(table(rms$cha), clockwise=TRUE, col=rainbow(6), main="Proportion by Channel")
# By Location
barplot(table(rms$loc), las=2, col=c('red','blue'), main="Observation Counts") # see ?barplot
pie(table(rms$loc), clockwise=TRUE, col=c('red','blue'), main="Proportion by Location")
# Restore default graphical parameter settings
par(oldPar)
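The proportions that a pie chart displays can also be computed directly as numbers. A minimal sketch, using a hypothetical channel factor rather than the real ‘rms$cha’ column:

```r
# Hypothetical channel codes
cha <- factor(c("BH1", "BH1", "BHZ", "HH1"))

# table() counts occurrences of each level
counts <- table(cha)

# prop.table() divides each count by the total
props <- prop.table(counts)
props
```

The result is a table whose entries sum to 1, which is what pie() works out internally from the counts it is given.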
Histograms give an immediate sense of the distribution of values and are indispensable when getting a first gut feel for the nature of your data.
# Setup a layout for multiple plots
layout(matrix(1:2, nrow=1, byrow=TRUE)) # see ?layout and ?matrix
# Basic Histogram
hist(rms$value, n=100, las=1, main='Distribution of RMS Values')
# Overplotting allows us to show two separate sub-populations
hist(rms$value, breaks=seq(0,10000,100), col='red', border='red', xlab='RMS value', main='')
hist(rms$value[rms$loc=='10'], breaks=seq(0,10000,100), col='blue', border='blue', xlab='RMS value', add=TRUE)
title('RMS Values by Location')
legend('topright', legend=c('00','10'), fill=c('red','blue'), title='Location')
# Restore default graphical parameter settings
par(oldPar)
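The hist() function also returns the bin boundaries and counts it computed, which is handy when you want the numbers rather than the picture. The sketch below uses a handful of hypothetical values and suppresses the plot with ‘plot=FALSE’.

```r
# Hypothetical RMS values
value <- c(155.5, 247.3, 334.5, 778.7, 8479.9)

# Compute the histogram without drawing it
h <- hist(value, breaks = seq(0, 10000, 1000), plot = FALSE)

h$breaks   # bin boundaries
h$counts   # number of values falling in each bin
```

Here four of the five values land in the first bin and the single large value lands in a bin of its own, the same long-tailed shape seen in the plots above.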
Task 1: Try to explain what is happening in each line below:
# Working with logical masks
location10Mask <- rms$loc == '10'
class(location10Mask)
summary(location10Mask)
length(rms$loc)
length(rms$loc[location10Mask])
dim(rms)
dim(rms[location10Mask,])
The standard X-Y plot is perfect for plotting time-dependent variables. The arguments to plot() can be given in one of two forms:
x, y
y ~ x
The second form uses an R formula where the ‘~’ can be interpreted as: “as a function of”.
We can use either notation to create a scatter plot
# Setup a layout for multiple plots
layout(matrix(1:2, nrow=1, byrow=TRUE)) # see ?layout and ?matrix
plot(rms$start, rms$value, xlab='Date', ylab='Sample RMS', main="(x,y notation)")
plot(rms$value ~ rms$start, xlab='Date', ylab='Sample RMS', main="(y~x notation)")
# Restore default graphical parameter settings
par(oldPar)
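The formula form also accepts a ‘data’ argument, so that column names can be written without repeating the dataframe name. A minimal sketch, using a hypothetical stand-in dataframe called `demo`:

```r
# Hypothetical stand-in for the 'rms' dataframe
demo <- data.frame(
  start = as.POSIXct(c("2015-03-18", "2015-03-19", "2015-03-20"), tz = "GMT"),
  value = c(309.0, 468.1, 358.5)
)

# With data=, 'value' and 'start' are looked up as columns of 'demo'
plot(value ~ start, data = demo, xlab = 'Date', ylab = 'Sample RMS')
```

This tends to read more cleanly than the `demo$value ~ demo$start` spelling once several columns are involved.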
Task 2: Try to explain what is happening in each line below:
colors <- c('red','blue')
pchs <- 1:6 # see ?points
plot(rms$value ~ rms$start, xlab='Date', ylab='RMS',
col=colors[rms$loc], pch=pchs[rms$cha])
loc_cha <- sort(unique(paste(rms$loc,rms$cha,sep='_')))
legend('topleft',legend=loc_cha, col=c(rep('red',3),rep('blue',6)), pch=c(1:3,1:6))
An interesting thing happens if you use a formula to plot a variable (a dataframe column) of class ‘numeric’ as a function of another variable of class ‘factor’:
class(rms$value)
## [1] "numeric"
class(rms$cha)
## [1] "factor"
plot(rms$value ~ rms$cha)
Here we see R’s ‘polymorphic’ nature where functions respond differently depending on the type of arguments they receive. In this case, the plot() function actually ‘dispatches’ to the plot.factor() function (see ?plot.factor).
Note: You can list all possible argument-specific plot functions with methods(plot).
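The dispatch mechanism itself is simple enough to try out. The sketch below defines a hypothetical generic named `describe` with class-specific versions, mirroring the way plot() selects among its methods. None of these functions are part of R; they exist only for this illustration.

```r
# A hypothetical generic: UseMethod() picks a method based on class(x)
describe <- function(x, ...) UseMethod("describe")

# Class-specific methods are named <generic>.<class>
describe.numeric <- function(x, ...) paste("numeric with mean", mean(x))
describe.factor  <- function(x, ...) paste("factor with", nlevels(x), "levels")

# The default method handles any class without a specific method
describe.default <- function(x, ...) paste("object of class", class(x)[1])

describe(c(1, 2, 3))
describe(factor(c("BH1", "BHZ", "BH1")))
describe("text")
```

The same call, describe(x), produces a different result for each kind of argument, which is exactly what happens when plot() is handed a factor instead of a numeric vector.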
This polymorphic behavior can be a source of great confusion and is only presented here to make you aware of its existence. A much more intuitive way to generate a boxplot is with:
boxplot(rms$value ~ rms$cha, outline=FALSE, las=1, main='Distribution of Values by Channel')
Another whirlwind lesson, but one that should leave you prepared to do some basic exploratory analysis with data obtained from MUSTANG web services.
Task 3: Explore boxplots