In Lesson 01 we scratched the surface of R’s data strucutres, poking at the dataframe that is the ‘iris’ dataset. In this lesson we will read in data from an external CSV file and delve more deeply into the kinds of manipulations that R can do with dataframes.

Reading in CSV data

The read.table() function and it’s offspring have enough arguments that they can read in just about any CSV file out there. It’s usually just a question of specifying the correct values to arguments. We will begin by reading in a fire locations file from the BlueSky daily runs. The read.csv() function will parse the contents of the URL and return a dataframe.

Note: By default, columns with character strings are interpreted as factors which is great for statisticians but can lead to confusion for those used to other programming environments. We recommend disabling this feature with ‘stringsAsFactors=FALSE’.

url <- "http://smoke.airfire.org/bluesky-daily/output/hysplit-pp/NAM-4km/2014080100/data/fire_locations.csv"
fires <- read.csv(url, stringsAsFactors=FALSE)

# Always begin with some minimal investigation
dim(fires) # 68 columns!

## [1] 1656   68

names(fires)

##  [1] "id"                     "event_id"              
##  [3] "latitude"               "longitude"             
##  [5] "type"                   "area"                  
##  [7] "date_time"              "elevation"             
##  [9] "slope"                  "state"                 
## [11] "county"                 "country"               
## [13] "fips"                   "scc"                   
## [15] "fuel_1hr"               "fuel_10hr"             
## [17] "fuel_100hr"             "fuel_1khr"             
## [19] "fuel_10khr"             "fuel_gt10khr"          
## [21] "shrub"                  "grass"                 
## [23] "rot"                    "duff"                  
## [25] "litter"                 "moisture_1hr"          
## [27] "moisture_10hr"          "moisture_100hr"        
## [29] "moisture_1khr"          "moisture_live"         
## [31] "moisture_duff"          "consumption_flaming"   
## [33] "consumption_smoldering" "consumption_residual"  
## [35] "consumption_duff"       "min_wind"              
## [37] "max_wind"               "min_wind_aloft"        
## [39] "max_wind_aloft"         "min_humid"             
## [41] "max_humid"              "min_temp"              
## [43] "max_temp"               "min_temp_hour"         
## [45] "max_temp_hour"          "sunrise_hour"          
## [47] "sunset_hour"            "snow_month"            
## [49] "rain_days"              "heat"                  
## [51] "pm25"                   "pm10"                  
## [53] "co"                     "co2"                   
## [55] "ch4"                    "nox"                   
## [57] "nh3"                    "so2"                   
## [59] "voc"                    "canopy"                
## [61] "event_url"              "fccs_number"           
## [63] "owner"                  "sf_event_guid"         
## [65] "sf_server"              "sf_stream_name"        
## [67] "timezone"               "veg"

# Write a for loop to check the class of the first 5 elements
for (i in 1:5) {
  print( class(fires[[i]]) )
}

## [1] "character"
## [1] "character"
## [1] "numeric"
## [1] "numeric"
## [1] "character"

# Or, nicer and on a single line:
#for (name in names(fires)) print( paste(name,':',class(fires[[name]])) )

# Check the summary
#summary(fires) # not run as it produces voluminous output

Modifying Dataframes

Often we will want to modify the dataframe before performing detailed analysis. Let’s describe some of the more common manipulations.

If we are going to use individual columns from the dataframe to generate higher order identifiers or metrics sometimes it is useful to just add them to the dataframe. You can do this by simply referencing a new column name

You can also simplify a dataframe by deleting columns you don’t want. To delete columns you just set them to NULL and they disappear. (This also works with individual elements of a vector.)

Another common task is to assign missing values (NA) that didn’t get converted when we read the data in. (See the ‘na.strings’ argument to read.csv().)

We may also wish to convert columns to a more approriate class with ‘as.~’ functions

The following example does each of these.

# Delete columns we will never use
fires$sf_server <- NULL
fires$sf_stream_name <- NULL

# Assign missing values whenever 'fips == -9999'
fires$fips[ fires$fips == -9999 ] <- NA
# Go ahead and convert this to class 'character' as FIPS codes aren't really numbers
fires$fips <- as.character(fires$fips)
  
# Add a column we think might be interesting
fires$stateType <- paste0(fires$state, '_', fires$type)
fires$newAQIMetric <- 0.6 * fires$pm25 + 0.3 * fires$pm10 + 0.08 * fires$so2 + 0.02 * fires$co

# Our new column is available in the modified dataframe
summary(fires$newAQIMetric)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     5.3     6.6    36.5    23.5   974.0    1290

Creating Factors

R uses the term factor to mean category or enumerated type. Factors are a way of classifying things. Often you will want to create a factor from a numeric variable, binning values into ‘lo’, ‘med’, ‘hi’ for example. Factors are stored in R as a vector of integers associated with a special lookup table with the category labels. When a character representation is needed the labels are used. Otherwise an integer representation is used.

Note: This multi-facted behavior can lead to confusion when you write code that expects a character string but gets the integer representation. That’s why many avoid automatic conversion to factor.

The two main functions for binning numeric variables are cut() which returns a factor and the more simplistic .bincode() which returns an integer vector. The table() function can be used to sum up the number of occurrences of each factor.

# We will use AQI breaks to bin PM2.5 values into different categories
# Latest aqiBreaks from http://www.arb.ca.gov/carpa/toolkit/data-to-mes/wildfire-smoke-guide.pdf
# Latest aqiColors from http://aqicn.org/faq/2013-09-09/revised-pm25-aqi-breakpoints/

# NOTE:  The low end of each break category is used as the breakpoint.
aqiBreaks <- c(0, 39, 89, 139, 352, 526, 10000)
aqiLabels <- c('good','moderate','USG','unhealthy','very unhealthy','extreme')
aqiColors <- c("#009966","#FFDE33","#FF9933","#CC0033","#660099","#730023")
aqiColors <- adjustcolor(aqiColors, 0.5) # 50% opacity

# Create a new 'factor' column
fires$pm25_AQI <- cut(fires$pm25, breaks=aqiBreaks, labels=aqiLabels)

# Summarize
table(fires$pm25_AQI)

## 
##           good       moderate            USG      unhealthy very unhealthy 
##            315             30              9              6              0 
##        extreme 
##              6

# Now we will create a plot using this newly created factor

# Create a new column of color strings chosen by this factor (integer representation)
fires$pm25_AQIColor <- aqiColors[ fires$pm25_AQI ]

# Save current graphical parameters
oldPar <- par()

# Modify the right margin to create room for labels
par(mar=c(5,4,4,8)+.1)

# Plot pm25 against fuel_1khr, colored by pm25
plot(fires$fuel_1khr, fires$pm25, las=1,
     col=fires$pm25_AQIColor, pch=16, cex=2)

# Add six guide lines
abline(h=aqiBreaks[1:6], col=aqiColors, lwd=4)

# Add AQI level labels
xpos <- 1.05 * max(fires$fuel_1khr, na.rm=TRUE)
ypos <- aqiBreaks[1:6]
text(x=xpos, y=ypos, labels=aqiLabels, pos=4, xpd=NA) # (see ?par about 'xpd')

plot of chunk unnamed-chunk-1

# Restore original graphical parameters
par(oldPar)

Task 1: Improve the graphic above

create a new R Script in RStudio entitled ‘AQIPlot’
cut and paste the R code above as a starting point
add a title with the title() function
use the ‘xlab’ and ‘ylab’ arguments to plot()
remove the horizontal guidelines
color the AQI level labels appropriately
add a legend with legend()
use ‘pch’ and ‘cex’ to improve plot appearance (see ?points and ?par)

Extra Credit

use a different symbol for each pm25_AQI level

Selecting and Filtering

selecting

Sometimes we only want to work with with some of the columns in our dataframe either to reduce the volume of data we pass around or just to simplify things for ourselves. The most readable way to do this in code is to create a character vector with the names of the columns we want to keep.

columns <- c('latitude','longitude','type','area','date_time','state','fuel_1khr','pm25','veg','pm25_AQI','pm25_AQIColor')
subDF <- fires[,columns] # all rows, specified columns
summary(subDF)

##     latitude      longitude          type                area     
##  Min.   :12.8   Min.   :-156.4   Length:1656        Min.   :  46  
##  1st Qu.:18.1   1st Qu.:-115.8   Class :character   1st Qu.: 100  
##  Median :26.2   Median : -99.0   Mode  :character   Median : 100  
##  Mean   :33.5   Mean   :-102.2                      Mean   : 128  
##  3rd Qu.:50.3   3rd Qu.: -90.1                      3rd Qu.: 100  
##  Max.   :63.4   Max.   : -66.4                      Max.   :1500  
##                                                                   
##   date_time            state             fuel_1khr         pm25      
##  Length:1656        Length:1656        Min.   :0.0    Min.   :  0.0  
##  Class :character   Class :character   1st Qu.:0.0    1st Qu.:  4.5  
##  Mode  :character   Mode  :character   Median :0.4    Median :  5.6  
##                                        Mean   :0.8    Mean   : 30.4  
##                                        3rd Qu.:0.8    3rd Qu.: 20.0  
##                                        Max.   :8.3    Max.   :807.6  
##                                        NA's   :1290   NA's   :1290   
##      veg                      pm25_AQI    pm25_AQIColor     
##  Length:1656        good          : 315   Length:1656       
##  Class :character   moderate      :  30   Class :character  
##  Mode  :character   USG           :   9   Mode  :character  
##                     unhealthy     :   6                     
##                     very unhealthy:   0                     
##                     extreme       :   6                     
##                     NA's          :1290

# Sometimes having fewer variables to think about leads to new ideas for explanatory plots
oldPar <- par()
par(mar=c(5,8,4,2)+.1) # more room on the left margin
boxplot(subDF$area ~ subDF$pm25_AQI, horizontal=TRUE, las=1,
        xlab='Fire Area', ylab='',
        main='Dangerous PM2.5 levels comes from big fires.')

plot of chunk unnamed-chunk-2

par(oldPar)

filtering

The next level of subsetting is to pick out rows based on values in one of the columns. The most readable way to do this is to create a logical expression or mask to specify which rows to keep. Creating logical masks has the advantage that you can create mulitple masks and logically AND / OR them together.

# Create row masks
goodMask <- subDF$pm25_AQI == "good"
CAMask <- subDF$state == 'CA'
CAGoodMask <- CAMask & goodMask # logical AND
CABadMask <- CAMask & !goodMask
 
# Apply row and column selection at the same time to create much smaller dataframes
CAGoodDF <- fires[CAGoodMask,columns]
CABadDF <- fires[CABadMask,columns]

# Question:  What vegetation types are associated with good/bad PM2.5 levels in CA?
sort(table(CAGoodDF$veg))

## 
##                                         Ponderosa pine savanna 
##                                                              3 
##                                             Black oak woodland 
##                                                              6 
##                                   Sagebrush Shrubland - Sparse 
##                                                              6 
##                                                          Urban 
##                                                              6 
##                       Douglas-fir - Sugar pine - Tanoak forest 
##                                                             15 
## Jeffrey pine - Ponderosa pine - Douglas-fir - Black oak forest 
##                                                             15

sort(table(CABadDF$veg))

## 
## Jeffrey pine - Red fir - White fir / Greenleaf manzanita - Snowbrush forest 
##                                                                           3 
##                                                     Redwood - Tanoak forest 
##                                                                           3 
##              Jeffrey pine - Ponderosa pine - Douglas-fir - Black oak forest 
##                                                                           6 
##                                                              Red fir forest 
##                                                                           6 
##                                    Douglas-fir - Sugar pine - Tanoak forest 
##                                                                           9

Well, we probably shouldn’t draw any conclusions about the correlation between vegetation type and fire generated PM2.5 levels from three days worth of data. The point of this example was to demonstrate how to go about asking questions of a dataframe by filtering the rows based on logical expressions.

R has a subset() function which encapsulates both selecting and filtering but there is a newer and better way to perform dataframe manipulations with the ‘dplyr’ package described in Lesson 03.

Task 2: Explore boxplots

create a new R Script in RStudio entitled ‘fireBoxplots’
cut and paste the boxplot example above as a starting point
read up on boxplot() with ?boxplot
create additional boxplots that “tell a story”

Lesson 02 – Working with Dataframes

Mazama Science