Reading in CSV data
Modifying Dataframes
- removing, changing and adding columns
- creating factors
Simple Plots

In Lesson 01 we scratched the surface of R’s data structures, poking at the dataframe that is the ‘iris’ dataset. In this lesson we will read in data from an external CSV file and delve more deeply into the kinds of manipulations that R can do with dataframes. We will also explore some of R’s basic plot types.

Reading in CSV data

The read.table() function and its offspring have enough arguments that they can read in just about any CSV file out there. It’s usually just a question of specifying the correct values for those arguments. We will begin by reading in text output from the MUSTANG web service (http://service.iris.edu/mustang/). The read.csv() function will parse the contents of the URL and return a dataframe.

Note 1: The text output coming from MUSTANG begins with a single header line containing a human-readable name for the requested metric. However, we already know what the metric is: we specified it with ‘metric=sample_rms’ in the URL. We will ignore this CSV-awkward header line by specifying the ‘skip=1’ parameter when parsing the data.

Note 2: In versions of R before 4.0.0, columns with character strings are interpreted as factors by default, which is OK for statisticians but can lead to confusion for those used to other programming environments. We recommend disabling this feature with ‘stringsAsFactors=FALSE’. (Since R 4.0.0 this is already the default, but setting it explicitly does no harm.)
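The effect of both arguments can be checked offline. The following is a minimal sketch that parses a small inline string instead of the live URL. The title line and the two data rows in sample_text are made up to mimic the layout described in Note 1; they are not actual MUSTANG output.

```r
# Hypothetical sample: one title line, then a CSV header, then data rows
sample_text <- 'Sample RMS Metric
value,target,start
309.020,IU.ANMO.10.BH1.M,2015/03/18 00:00:00
468.103,IU.ANMO.10.BHZ.M,2015/03/18 00:00:00'

# skip=1 drops the title line so that the second line is used as the header
demo <- read.csv(text=sample_text, skip=1, stringsAsFactors=FALSE)

names(demo)         # column names come from the second line of the text
class(demo$value)   # numbers are parsed as numeric
class(demo$target)  # strings stay character instead of becoming factors
```

Without skip=1 the title line would be taken as the header and the column names would be wrong.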

# (Read about URL parameters at the web service URL)
url <- "http://service.iris.edu/mustang/measurements/1/query?net=IU&sta=ANMO&cha=BH.|HH.&timewindow=2015-03-18,2015-04-01&output=text&metric=sample_rms&orderby=start"
rms <- read.csv(url, stringsAsFactors=FALSE, skip=1)

# Always begin with some minimal investigation of the number of rows and columns
dim(rms)

## [1] 126   5

# What names do the columns have
names(rms)

## [1] "value"  "target" "start"  "end"    "lddate"

# Quick look at the first few lines
head(rms)

##     value           target               start                 end
## 1 309.020 IU.ANMO.10.BH1.M 2015/03/18 00:00:00 2015/03/19 00:00:00
## 2 468.103 IU.ANMO.10.BHZ.M 2015/03/18 00:00:00 2015/03/19 00:00:00
## 3 358.488 IU.ANMO.10.HH1.M 2015/03/18 00:00:00 2015/03/19 00:00:00
## 4 623.670 IU.ANMO.10.HHZ.M 2015/03/18 00:00:00 2015/03/19 00:00:00
## 5 518.861 IU.ANMO.10.HH2.M 2015/03/18 00:00:00 2015/03/19 00:00:00
## 6 378.415 IU.ANMO.10.BH2.M 2015/03/18 00:00:00 2015/03/19 00:00:00
##                       lddate
## 1 2015/03/20 08:17:39.795571
## 2 2015/03/20 08:18:43.574060
## 3 2015/03/20 08:19:15.252155
## 4 2015/03/20 08:20:14.445695
## 5 2015/03/20 08:19:45.815140
## 6 2015/03/20 08:18:10.995803

# Write a for loop to check the class of each of the 5 columns
for (i in 1:5) {
  print( class(rms[[i]]) )
}

## [1] "numeric"
## [1] "character"
## [1] "character"
## [1] "character"
## [1] "character"

# Or, nicer and on a single line:
for (name in names(rms)) print( paste(name,':',class(rms[[name]])) )

## [1] "value : numeric"
## [1] "target : character"
## [1] "start : character"
## [1] "end : character"
## [1] "lddate : character"
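The loop can also be replaced with sapply(), which applies a function to every column and collects the results. A minimal sketch on a made-up two-column dataframe (demo is a hypothetical stand-in for rms):

```r
# Small made-up dataframe standing in for the downloaded data
demo <- data.frame(value=c(1.5, 2.5),
                   target=c('IU.ANMO.10.BH1.M', 'IU.ANMO.10.BHZ.M'),
                   stringsAsFactors=FALSE)

# Apply class() to each column; the result is a named character vector
col_classes <- sapply(demo, class)
col_classes
```

Columns that carry more than one class, such as dates of class 'POSIXct', make sapply() return a list rather than a character vector.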

# Check the summary
summary(rms)

##      value           target             start               end           
##  Min.   : 155.5   Length:126         Length:126         Length:126        
##  1st Qu.: 247.3   Class :character   Class :character   Class :character  
##  Median : 334.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 765.2                                                           
##  3rd Qu.: 778.7                                                           
##  Max.   :8479.9                                                           
##     lddate         
##  Length:126        
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Modifying Dataframes

removing, changing and adding columns

Often we will want to modify a dataframe before performing detailed analysis. Here we describe some of the more common manipulations.

- You can simplify a dataframe by deleting columns you don’t want. To delete a column, just set it to NULL and it disappears. (This also works with individual elements of a list.)
- We may also wish to convert columns to a more appropriate class with the ‘as.~’ functions.
- If we are going to use individual columns from the dataframe to generate additional identifiers or metrics, it is sometimes useful to add these identifiers to the dataframe itself. You can do this by simply assigning to a new column name.

The following example does each of these.

# Delete columns we are uninterested in
rms$lddate <- NULL

# Convert character strings to class 'POSIXct'.  (More on date classes in the next lesson.)
# The format is passed by name because the second positional argument of as.POSIXct() is 'tz'.
rms$start <- as.POSIXct(rms$start, format="%Y/%m/%d %H:%M:%OS", tz="GMT")
rms$end <- as.POSIXct(rms$end, format="%Y/%m/%d %H:%M:%OS", tz="GMT")

# Add more columns to the dataframe
# First, use the 'stringr' package `str_split_fixed()` function to split up the 'target' column
# (More on stringr in the next lesson.)
char_matrix <- stringr::str_split_fixed(rms$target,'\\.',5)
# Next, assign the columns of this character matrix to named columns in the dataframe
rms$net <- char_matrix[,1]
rms$sta <- char_matrix[,2]
rms$loc <- char_matrix[,3]
rms$cha <- char_matrix[,4]
rms[['qual']] <- char_matrix[,5] # demonstration of alternative column syntax

# Our converted and newly defined columns are available in the modified dataframe
head(rms)

##     value           target      start        end net  sta loc cha qual
## 1 309.020 IU.ANMO.10.BH1.M 2015-03-18 2015-03-19  IU ANMO  10 BH1    M
## 2 468.103 IU.ANMO.10.BHZ.M 2015-03-18 2015-03-19  IU ANMO  10 BHZ    M
## 3 358.488 IU.ANMO.10.HH1.M 2015-03-18 2015-03-19  IU ANMO  10 HH1    M
## 4 623.670 IU.ANMO.10.HHZ.M 2015-03-18 2015-03-19  IU ANMO  10 HHZ    M
## 5 518.861 IU.ANMO.10.HH2.M 2015-03-18 2015-03-19  IU ANMO  10 HH2    M
## 6 378.415 IU.ANMO.10.BH2.M 2015-03-18 2015-03-19  IU ANMO  10 BH2    M
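The same three manipulations can be tried without a download or the stringr package. The following is a minimal sketch on a made-up dataframe; toy, its values, and the 'junk' column are hypothetical, and base R's strsplit() is swapped in for str_split_fixed().

```r
# Made-up dataframe with one column we do not need
toy <- data.frame(value=c(1.5, 2.5),
                  target=c('IU.ANMO.10.BH1.M', 'IU.ANMO.00.BHZ.M'),
                  start=c('2015/03/18 00:00:00', '2015/03/19 00:00:00'),
                  junk=c('a', 'b'),
                  stringsAsFactors=FALSE)

# Remove a column by setting it to NULL
toy$junk <- NULL

# Change a column by converting it to a more appropriate class
toy$start <- as.POSIXct(toy$start, format="%Y/%m/%d %H:%M:%OS", tz="GMT")

# Add columns: split 'target' on literal dots and bind the pieces into a matrix
parts <- do.call(rbind, strsplit(toy$target, '.', fixed=TRUE))
toy$sta <- parts[,2]
toy$cha <- parts[,4]

names(toy)
```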

creating factors

R uses the term factor to mean category or enumerated type. Factors are a way of classifying things. Sometimes you will want to create a factor from a numeric variable, binning values into ‘lo’, ‘med’, ‘hi’ for example. Other times, a character string will represent a category that will be repeated many times in the dataframe and should be converted into a factor.
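Binning a numeric variable is typically done with cut(). A minimal sketch, assuming made-up values and arbitrarily chosen bin edges (neither comes from the MUSTANG data):

```r
# Made-up numeric values and hypothetical bin edges
values <- c(160, 250, 340, 780, 8480)
bins <- cut(values, breaks=c(0, 300, 1000, Inf), labels=c('lo', 'med', 'hi'))

class(bins)   # "factor"
table(bins)   # how many values fall into each bin
```

By default each bin includes its upper edge, so a value of exactly 300 would land in 'lo'.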

Factors are stored in R as a vector of integers associated with a special lookup table with the category labels. When a character representation is needed the labels are used. Otherwise an integer representation is used.

Note: This multi-faceted behavior can lead to confusion when you write code that expects a character string but gets the integer representation. That’s why many avoid automatic conversion to factor.
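The two representations are easy to see side by side. A minimal sketch using a made-up factor of location codes:

```r
# A factor whose labels happen to look like numbers
loc <- factor(c('10', '00', '10'))

levels(loc)        # the lookup table of labels: "00" "10"
as.integer(loc)    # the internal codes: 2 1 2
as.character(loc)  # the labels: "10" "00" "10"

# To recover the numbers the labels represent, go through character first
loc_numbers <- as.integer(as.character(loc))
loc_numbers        # 10 0 10
```

Calling as.integer() directly on the factor returns positions in the lookup table, not the numbers shown in the labels, which is the confusion described in the note.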

In our case, the ‘target’, ‘net’, ‘sta’, ‘loc’, ‘cha’ and ‘qual’ columns all represent categories, so let’s define them as factors.

rms$target <- as.factor(rms$target)
rms$net <- as.factor(rms$net)
rms$sta <- as.factor(rms$sta)
rms$loc <- as.factor(rms$loc)
rms$cha <- as.factor(rms$cha)
rms$qual <- as.factor(rms$qual)

# Summary now identifies how many records in each 'level' of a factor
summary(rms)

##      value                     target       start                    
##  Min.   : 155.5   IU.ANMO.00.BH1.M:14   Min.   :2015-03-18 00:00:00  
##  1st Qu.: 247.3   IU.ANMO.00.BH2.M:14   1st Qu.:2015-03-21 00:00:00  
##  Median : 334.5   IU.ANMO.00.BHZ.M:14   Median :2015-03-24 12:00:00  
##  Mean   : 765.2   IU.ANMO.10.BH1.M:14   Mean   :2015-03-24 12:00:00  
##  3rd Qu.: 778.7   IU.ANMO.10.BH2.M:14   3rd Qu.:2015-03-28 00:00:00  
##  Max.   :8479.9   IU.ANMO.10.BHZ.M:14   Max.   :2015-03-31 00:00:00  
##                   (Other)         :42                                
##       end                      net        sta      loc      cha    
##  Min.   :2015-03-19 00:00:00   IU:126   ANMO:126   00:42   BH1:28  
##  1st Qu.:2015-03-22 00:00:00                       10:84   BH2:28  
##  Median :2015-03-25 12:00:00                               BHZ:28  
##  Mean   :2015-03-25 12:00:00                               HH1:14  
##  3rd Qu.:2015-03-29 00:00:00                               HH2:14  
##  Max.   :2015-04-01 00:00:00                               HHZ:14  
##                                                                    
##  qual   
##  M:126  
##         
##         
##         
##         
##         
##

# For factors, the table() function will count up the occurrences of each 'level'
table(rms$cha)

## 
## BH1 BH2 BHZ HH1 HH2 HHZ 
##  28  28  28  14  14  14

# At this point, we will save our work in a binary .RData file for future reference
save(rms, file='rms_example.RData')
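A saved .RData file is restored with load(), which recreates the objects under their original names. A minimal sketch of the round trip, assuming a made-up dataframe and a temporary file rather than 'rms_example.RData':

```r
# Made-up dataframe with a factor column
demo <- data.frame(value=c(1, 2), cha=factor(c('BH1', 'BHZ')))

# Save to a temporary file, then remove the object from the workspace
tmp <- tempfile(fileext='.RData')
save(demo, file=tmp)
rm(demo)

# load() restores the object and returns the names of what it loaded
loaded_names <- load(tmp)
loaded_names
class(demo$cha)   # the factor class survives the round trip
```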

Simple Plots

One of the things R does well is make it very easy to create simple plots. The following commands show off some of the basic plot types. Readers are encouraged to examine plot function documentation with, for example, ?barplot. But don’t sweat the details. We will cover more detailed plot options in Lesson 04.

bar and pie plots

We will use bar and pie plots to show the number and proportion of measurements as grouped by location and channel.

# Save default graphical parameters (read-only ones cannot be restored and would trigger warnings)
oldPar <- par(no.readonly=TRUE)

# Setup a layout for multiple plots
layout(matrix(1:2, nrow=1, byrow=TRUE)) # see ?layout and ?matrix

# By Channel
barplot(table(rms$cha), las=2, col=rainbow(6), main="Observation Counts") # see ?rainbow
pie(table(rms$cha), clockwise=TRUE, col=rainbow(6), main="Proportion by Channel")

# By Location
barplot(table(rms$loc), las=2, col=c('red','blue'), main="Observation Counts") # see ?barplot
pie(table(rms$loc), clockwise=TRUE, col=c('red','blue'), main="Proportion by Location")

# Restore default graphical parameter settings
par(oldPar)
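The proportions that pie() draws can also be computed directly with prop.table(). A minimal sketch, assuming a channel vector rebuilt from the counts reported by table(rms$cha) earlier rather than from the downloaded data:

```r
# Rebuild a channel factor with the same counts as reported above
cha <- factor(c(rep(c('BH1', 'BH2', 'BHZ'), each=28),
                rep(c('HH1', 'HH2', 'HHZ'), each=14)))

counts <- table(cha)          # number of observations per channel
props <- prop.table(counts)   # the same table expressed as fractions of the total

round(props, 3)
```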

histograms

Histograms give an immediate sense of the distribution of values and are indispensable when getting a first gut feel for the nature of your data.

# Setup a layout for multiple plots
layout(matrix(1:2, nrow=1, byrow=TRUE)) # see ?layout and ?matrix

# Basic Histogram
hist(rms$value, breaks=100, las=1, main='Distribution of RMS Values')

# Overplotting allows us to show two separate sub-populations
hist(rms$value, breaks=seq(0,10000,100), col='red', border='red', xlab='RMS value', main='')
hist(rms$value[rms$loc=='10'], breaks=seq(0,10000,100), col='blue', border='blue', xlab='RMS value', add=TRUE)
title('RMS Values by Location')
legend('topright', legend=c('00','10'), fill=c('red','blue'), title='Location')

# Restore default graphical parameter settings
par(oldPar)
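The bin counts behind a histogram can be inspected without drawing anything by passing plot=FALSE. A minimal sketch with made-up values and the same breaks as above:

```r
# Made-up values, binned into 100-unit wide bins from 0 to 10000
values <- c(160, 170, 250, 340, 780, 8480)
h <- hist(values, breaks=seq(0, 10000, 100), plot=FALSE)

length(h$counts)   # 100 bins
h$counts[1:4]      # counts for the bins ending at 100, 200, 300 and 400: 0 2 1 1
sum(h$counts)      # every value is counted exactly once
```

Using identical breaks for every call is what makes overplotted histograms line up bin for bin.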

Task 1: Try to explain what is happening in each line below:

# Working with logical masks
location10Mask <- rms$loc == '10'
class(location10Mask)
summary(location10Mask)
length(rms$loc)
length(rms$loc[location10Mask])
dim(rms)
dim(rms[location10Mask,])

scatter plots

The standard X-Y plot is perfect for plotting time-dependent variables. It can be invoked in one of two ways:

x, y
y ~ x

The second form uses an R formula where the ‘~’ can be interpreted as: “as a function of”.
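A formula is an ordinary R object that can be stored and inspected. A minimal sketch, assuming a made-up two-row dataframe; it also uses the data= argument of the formula interface so that column names can be written without the 'df$' prefix:

```r
# A formula can be assigned to a variable like any other object
f <- value ~ start
class(f)      # "formula"
all.vars(f)   # the variable names it refers to: "value" "start"

# Made-up data to plot
df <- data.frame(start=as.POSIXct(c('2015-03-18', '2015-03-19'), tz='GMT'),
                 value=c(309.02, 468.103))

pdf(NULL)     # null device so the sketch runs without opening a window or file
plot(f, data=df, xlab='Date', ylab='Sample RMS')
invisible(dev.off())
```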

We can use either notation to create a scatter plot:

# Setup a layout for multiple plots
layout(matrix(1:2, nrow=1, byrow=TRUE)) # see ?layout and ?matrix

plot(rms$start, rms$value, xlab='Date', ylab='Sample RMS', main="(x,y notation)")
plot(rms$value ~ rms$start, xlab='Date', ylab='Sample RMS', main="(y~x notation)")

# Restore default graphical parameter settings
par(oldPar)

Task 2: Try to explain what is happening in each line below:

colors <- c('red','blue')
pchs <- 1:6 # see ?points
plot(rms$value ~ rms$start, xlab='Date', ylab='RMS',
     col=colors[rms$loc], pch=pchs[rms$cha])
loc_cha <- sort(unique(paste(rms$loc,rms$cha,sep='_')))
legend('topleft',legend=loc_cha, col=c(rep('red',3),rep('blue',6)), pch=c(1:3,1:6))

boxplots

An interesting thing happens if you use a formula to plot a variable (a dataframe column) of class ‘numeric’ as a function of another variable of class ‘factor’:

class(rms$value)

## [1] "numeric"

class(rms$cha)

## [1] "factor"

plot(rms$value ~ rms$cha)

Here we see R’s ‘polymorphic’ nature where functions respond differently depending on the type of arguments they receive. In this case, plot() first ‘dispatches’ on the formula to plot.formula(), which then hands the factor to the plot.factor() function (see ?plot.factor), and that function draws a boxplot.

Note: You can list all possible argument-specific plot functions with methods(plot).
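The dispatch mechanism itself is small enough to reproduce. A minimal sketch, assuming a hypothetical generic named describe() that is not part of R:

```r
# A generic function hands off to a method chosen by the class of its argument
describe <- function(x, ...) UseMethod("describe")

# Methods are ordinary functions named <generic>.<class>
describe.numeric <- function(x, ...) "a numeric vector"
describe.factor  <- function(x, ...) "a factor"
describe.default <- function(x, ...) "something else"

describe(c(1.5, 2))             # "a numeric vector"
describe(factor(c('a', 'b')))   # "a factor"
describe('text')                # "something else"
```

The plot() generic follows the same naming scheme, which is why its methods have names like plot.factor().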

This polymorphic behavior can be a source of great confusion and is only presented here to make you aware of its existence. A much more intuitive way to generate a boxplot is with:

boxplot(rms$value ~ rms$cha, outline=FALSE, las=1, main='Distribution of Values by Channel')
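The numbers behind each box can be retrieved by passing plot=FALSE, in which case boxplot() returns its summary statistics instead of drawing. A minimal sketch with made-up values for two channels:

```r
# Made-up values: five per channel
value <- c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
cha <- factor(rep(c('BH1', 'HH1'), each=5))

b <- boxplot(value ~ cha, plot=FALSE)

b$names        # group labels: "BH1" "HH1"
b$n            # observations per group: 5 5
b$stats[3,]    # row 3 holds the medians: 3 30
```

Each column of b$stats holds the lower whisker, lower hinge, median, upper hinge and upper whisker for one group.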

Another whirlwind lesson, but one that should leave you prepared to do some basic exploratory analysis with data obtained from MUSTANG web services.

Task 3: Explore boxplots

- Create a new R Script in RStudio with the name ‘boxplots.R’.
- Cut and paste the data download and conversion code as well as the boxplot example above as a starting point.
- Create a MUSTANG URL that grabs a lot more data.
- Read up on boxplot options with ?boxplot.
- Create additional boxplots that tell a story.


Lesson 02 – Dataframes and Simple Plots

Mazama Science

Thu Aug 20 10:51:57 2015