Lesson 03 – dplyr for summary statistics

filter()
select()
group_by()
summarize()
arrange()

In Lesson 02 we learned about dataframes and explored selecting columns of interest and filtering rows based on values found in one of the columns. There is much more that we might do in terms of extracting and manipulating data from a dataframe object and, as in any programming language, there are many ways to do it.

The recently released ‘dplyr’ package provides a robust, full-featured “grammar of data manipulation” with functions like filter(), select(), group_by(), summarize() and arrange(). The rest of this lesson demonstrates how to use these functions to calculate summary statistics from a dataframe of individual records.

We’ll be working with the same fire locations dataset we experimented with in Lesson 02. This time, however, we need to also load the ‘dplyr’ package with library(dplyr):

library(dplyr)
url <- "http://smoke.airfire.org/bluesky-daily/output/hysplit-pp/NAM-4km/2014080100/data/fire_locations.csv"
fires <- read.csv(url, stringsAsFactors=FALSE)

filter()

The filter() function returns a subset of the rows based on a condition (aka logical mask) that is passed in. If multiple conditions are supplied they are combined with ‘&’ (logical AND).

To start things off, we’ll create a new WF dataframe containing only those fires classified as “wildfires”.

WF <- filter(fires, type == "WF")
dim(WF)

## [1] 195  68
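
As a minimal sketch of how multiple conditions combine, the example below uses a small, hypothetical `toy` dataframe (not the real fires data; the column values are invented for illustration). It suggests that separating conditions with a comma gives the same rows as joining them with ‘&’:

```r
library(dplyr)

# Hypothetical toy data standing in for the fires dataset
toy <- data.frame(
  type  = c("WF", "WF", "RX", "WF"),
  state = c("CA", "OR", "CA", "CA"),
  area  = c(500, 120, 40, 80),
  stringsAsFactors = FALSE
)

# Two conditions separated by a comma are combined with logical AND
bigWF_comma <- filter(toy, type == "WF", area > 100)

# The same subset written with an explicit '&'
bigWF_and <- filter(toy, type == "WF" & area > 100)

# Compare the two results
all.equal(bigWF_comma, bigWF_and)
```

Either form should work; the comma version tends to read more cleanly when there are several conditions.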

select()

The select() function returns a subset of the columns. When working with very large dataframes it can speed things up if you work with only those columns you are interested in.

Our newly created WF dataset contains 195 observations of wildfires. Let’s view just the state in which each wildfire occurred and its area. The select() function followed by head() will display the first few rows of our new dataframe:

WFsub <- select(WF, state, area)
head(WFsub)

##   state area
## 1    NM  150
## 2    NM  150
## 3    NM  150
## 4    MT  400
## 5    MT  400
## 6    MT  400
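
Besides naming the columns to keep, select() also accepts a few shorthand forms. The sketch below uses a hypothetical `toy` dataframe (the `id` column and all values are invented for illustration) to show keeping columns by name, dropping a column with a minus sign, and selecting a contiguous range of columns:

```r
library(dplyr)

# Hypothetical toy data; column names mirror the lesson, values are made up
toy <- data.frame(
  id    = 1:3,
  state = c("NM", "MT", "CA"),
  area  = c(150, 400, 90),
  type  = c("WF", "WF", "RX"),
  stringsAsFactors = FALSE
)

# Keep only the named columns
kept <- select(toy, state, area)

# Drop a column by prefixing it with a minus sign
dropped <- select(toy, -id)

# Keep a contiguous range of columns, from 'state' through 'type'
ranged <- select(toy, state:type)

names(kept)
names(dropped)
names(ranged)
```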

We now have a dataframe of wildfires that contains only two columns. To ‘tell a story’ with this dataset we will perform two more very common operations: first organize the data into groups, then create summaries by group.

group_by()

Many data operations are most useful when performed on groups defined by variables in the dataset. The group_by() function takes an existing dataframe and converts it into a grouped dataframe so that operations can be performed “by group”.

Let’s group our wildfires by state:

WFbyState <- group_by(WFsub, state)
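
Grouping does not change the rows or columns themselves; it only records which rows belong together. A minimal sketch with a hypothetical `toy` dataframe (values invented for illustration) shows how one might inspect a grouped dataframe:

```r
library(dplyr)

# Hypothetical toy data: three fires in two states
toy <- data.frame(
  state = c("NM", "NM", "MT"),
  area  = c(150, 150, 400),
  stringsAsFactors = FALSE
)

grouped <- group_by(toy, state)

# The data keep the same dimensions after grouping
dim(grouped)

# Number of groups and the variable(s) that define them
n_groups(grouped)
group_vars(grouped)

# ungroup() removes the grouping when it is no longer wanted
ungrouped <- ungroup(grouped)
```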

summarize()

The summarize() function reduces all records in each group to a single row by applying a summary function and assigning the result to a name. A logical question to ask of the current dataset would be: How much acreage was burned by wildfires in each state?

summaryDF <- summarize(WFbyState, total=sum(area, na.rm=TRUE)) # specify 'na.rm=TRUE' to remove missing values
summaryDF

## Source: local data frame [6 x 2]
## 
##   state total
## 1    AZ  1320
## 2    CA 12648
## 3    MT  1950
## 4    NM   450
## 5    OR  6990
## 6    WA  8490
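
A single summarize() call can compute several named summaries at once. The sketch below uses a hypothetical `toy` dataframe with one missing area (values invented for illustration) and combines a row count from n() with a total and a maximum:

```r
library(dplyr)

# Hypothetical toy data; one area value is missing on purpose
toy <- data.frame(
  state = c("NM", "NM", "MT", "MT", "MT"),
  area  = c(150, NA, 400, 400, 100),
  stringsAsFactors = FALSE
)

stats <- summarize(
  group_by(toy, state),
  count   = n(),                     # rows per group, including rows with NA area
  total   = sum(area, na.rm = TRUE), # total area, ignoring missing values
  largest = max(area, na.rm = TRUE)  # largest single fire, ignoring missing values
)
stats
```

Note that n() counts every row in the group, so a record with a missing area still contributes to `count` even though it is excluded from `total` and `largest`.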

arrange()

Finally, let’s sort this table in descending order of total area with arrange():

arrange(summaryDF, desc(total))

## Source: local data frame [6 x 2]
## 
##   state total
## 1    CA 12648
## 2    WA  8490
## 3    OR  6990
## 4    MT  1950
## 5    AZ  1320
## 6    NM   450

That’s actually a pretty nice result that tells a story about the acreage burned by wildfires in each state.

In the process above, we created a new dataframe at each stage of the processing and then used the new dataframe as input for the next function. Experienced R programmers might string all this functionality together without creating the interim dataframes by using a function call (that returns a value) as the argument to another function. The result might look like this:

arrange(summarize(group_by(select(filter(fires,type=="WF"),state,area),state),total=sum(area,na.rm=TRUE)),desc(total))

## Source: local data frame [6 x 2]
## 
##   state total
## 1    CA 12648
## 2    WA  8490
## 3    OR  6990
## 4    MT  1950
## 5    AZ  1320
## 6    NM   450

To interpret this overly long line you first have to find the innermost function call, filter(fires,type=="WF"), and then work your way outwards. Written like this, it’s not very readable. Fortunately, the dplyr package provides some syntactic sugar (the %>% operator, which comes from the ‘magrittr’ package) to make this code much more readable. The %>% operator can be read as “then” and is analogous to a Unix ‘pipe’: it passes the result on its left as the first argument to the function on its right. Rewriting our functionality using this new syntax we end up with:

fires %>%
  filter(type == "WF") %>%
  select(state, area) %>%
  group_by(state) %>%
  summarize(total=sum(area, na.rm=TRUE)) %>%
  arrange(desc(total))

## Source: local data frame [6 x 2]
## 
##   state total
## 1    CA 12648
## 2    WA  8490
## 3    OR  6990
## 4    MT  1950
## 5    AZ  1320
## 6    NM   450
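
To make the relationship between the two styles concrete, here is a minimal sketch with a hypothetical `toy` dataframe (values invented for illustration). It builds the same result once with nested calls and once with %>%, then compares them:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  state = c("CA", "OR", "CA"),
  area  = c(500, 120, 80),
  stringsAsFactors = FALSE
)

# Nested form: read from the inside out
nested <- arrange(filter(toy, area > 100), desc(area))

# Piped form: read from top to bottom; each step receives the
# previous result as its first argument
piped <- toy %>%
  filter(area > 100) %>%
  arrange(desc(area))

# Compare the two results
all.equal(nested, piped)
```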

Finally, for super-readable code, we can add some comment lines, omit the ‘select’ line as unnecessary and use the ‘right assignment’ operator to put the result in a new, descriptively named table:

# Take the "fires" dataset
#   then filter for type == "WF"
#   then group by state
#   then calculate total area by state
#   then arrange in descending order by total
#   finally, put the result in wildfireAreaByState
fires %>%
  filter(type == "WF") %>%
  group_by(state) %>%
  summarize(total=sum(area, na.rm=TRUE)) %>%
  arrange(desc(total)) ->
  wildfireAreaByState
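
The ‘right assignment’ operator `->` is part of base R and should store the same object that the more familiar `<-` would. A small sketch with a hypothetical `toy` dataframe (names and values invented for illustration):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  state = c("CA", "OR", "CA"),
  area  = c(500, 120, 80),
  stringsAsFactors = FALSE
)

# Right assignment: the name comes at the end of the pipeline
toy %>%
  group_by(state) %>%
  summarize(total = sum(area)) ->
  rightAssigned

# Left assignment: the name comes first
leftAssigned <- toy %>%
  group_by(state) %>%
  summarize(total = sum(area))

# Compare the two results
all.equal(rightAssigned, leftAssigned)
```

Which form to use is a matter of taste; right assignment keeps the code reading in the same order as the comments above.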

The ‘dplyr’ package has a few more functions, such as ‘mutate()’, that are worth reading up on, but we have covered enough of the basics to do some useful work. Please try your hand at the tasks below. In Lesson 04 we’ll create some data visualizations from the resulting tables.

Task 1: Prescribed Burns by state

# Take the "fires" dataset
#   then filter for type == "RX"
#   then group by state
#   then calculate maximum area by state
#   then arrange in descending order by maximum area
#   (let it print, don't assign it to anything)

Task 2: Add columns for median and max pm25 and total area to this table and sort by total area.

Task 3: Display a table of wildfire statistics by vegetation type.

Task 4: Create two more tables of your own choosing that “tell a story”.

Mazama Science

Last updated: September 29, 2014
