In this lesson we will learn more about the ggplot2 package by Hadley Wickham. In the 2010 article “A layered grammar of graphics” in the Journal of Computational and Graphical Statistics, Wickham describes the grammar:

A grammar of graphics is a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics. This paper builds on Wilkinson (2006), describing extensions and refinements developed while building an open source implementation of the grammar of graphics for R, ggplot2.

Excellent sources of documentation for ggplot2 include:

We will begin by ingesting the following data using our getMetricDF() function from Lesson 04:

# getMetricDF accepts a MUSTANG url and returns a metric dataframe
getMetricDF <- function(url, skip=1) {
  
  # Read in the data
  DF <- read.csv(url, stringsAsFactors=FALSE, skip=skip)
  
  # Convert character strings to class 'POSIXct'.
  DF$start <- as.POSIXct(DF$start, format="%Y/%m/%d %H:%M:%OS", tz="GMT")
  DF$end <- as.POSIXct(DF$end, format="%Y/%m/%d %H:%M:%OS", tz="GMT")
  
  # Add columns to the dataframe
  char_matrix <- stringr::str_split_fixed(DF$target,'\\.',5)
  DF$net <- as.factor(char_matrix[,1])
  DF$sta <- as.factor(char_matrix[,2])
  DF$loc <- as.factor(char_matrix[,3])
  DF$cha <- as.factor(char_matrix[,4])
  DF$qual <- as.factor(char_matrix[,5])
  
  return(DF)
}

# Two months of UW network BHZ daily *percent_availability* metrics from all locations and stations
url <- "http://service.iris.edu/mustang/measurements/1/query?net=UW&cha=BHZ&timewindow=2015-03-01T00:00:00,2015-04-25T00:00:00&output=text&metric=percent_availability&orderby=start"
df <- getMetricDF(url)

# Quick examination of what we got back
dim(df)
## [1] 1100   10
summary(df)
##      value           target              start           
##  Min.   :  0.00   Length:1100        Min.   :2015-03-01  
##  1st Qu.:100.00   Class :character   1st Qu.:2015-03-14  
##  Median :100.00   Mode  :character   Median :2015-03-28  
##  Mean   : 92.14                      Mean   :2015-03-28  
##  3rd Qu.:100.00                      3rd Qu.:2015-04-11  
##  Max.   :100.00                      Max.   :2015-04-24  
##                                                          
##       end                lddate          net            sta      loc    
##  Min.   :2015-03-02   Length:1100        UW:1100   BABR   : 55   :1100  
##  1st Qu.:2015-03-15   Class :character             BRAN   : 55          
##  Median :2015-03-29   Mode  :character             DAVN   : 55          
##  Mean   :2015-03-29                                DOSE   : 55          
##  3rd Qu.:2015-04-12                                FORK   : 55          
##  Max.   :2015-04-25                                GNW    : 55          
##                                                    (Other):770          
##   cha       qual    
##  BHZ:1100   M:1100  
##                     
##                     
##                     
##                     
##                     
## 

Our dataframe contains measurements from a variety of stations, and the station is the variable we will focus our attention on.
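
To make the parsing steps inside getMetricDF() concrete, here is a minimal sketch that applies the same ideas to two made-up records rather than a live MUSTANG response. The target strings, values and timestamps are invented for illustration, and base R's strsplit() stands in for stringr::str_split_fixed() so that the snippet has no package dependencies.

```r
# Made-up records in the same shape as the MUSTANG text output (illustrative only)
demo <- data.frame(
  value  = c(100, 87.5),
  target = c("UW.BABR..BHZ.M", "UW.DOSE..BHZ.M"),
  start  = c("2015/03/01 00:00:00", "2015/03/02 00:00:00"),
  stringsAsFactors = FALSE
)

# Convert character timestamps to POSIXct, naming the 'format' argument explicitly
demo$start <- as.POSIXct(demo$start, format = "%Y/%m/%d %H:%M:%OS", tz = "GMT")

# Split "net.sta.loc.cha.qual" on literal dots; the empty location code is kept
parts <- do.call(rbind, strsplit(demo$target, ".", fixed = TRUE))
demo$net  <- as.factor(parts[, 1])
demo$sta  <- as.factor(parts[, 2])
demo$loc  <- as.factor(parts[, 3])
demo$cha  <- as.factor(parts[, 4])
demo$qual <- as.factor(parts[, 5])

# Inspect the resulting column types
str(demo)
```

The empty string between the two consecutive dots becomes the location code, which is why the loc column in the summary above shows a blank level.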

Introduction to ‘ggplot2’

To begin working with ggplot2 you must first jettison everything you know about generating graphics in R. This grammar of graphics approach requires an entirely different mindset. The following snippet demonstrates the necessary minimum pieces of code to generate a plot with ggplot2:

library(ggplot2)

p <- ggplot(df, aes(x=start,y=value)) +
  geom_point() +
  labs(y="Percent Availability")

print(p)

Let’s examine the new concepts in the code above:

While this may seem overly abstract, it opens up lots of possibilities for exploring different visualization styles and makes for very concise and readable code once you’ve mastered things.
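
One way to see the pieces of the grammar is to build a plot one component at a time and inspect the result. The following is a minimal sketch using a small made-up dataframe (the demo object and its values are not part of the lesson data). It assumes that a ggplot object exposes its layers as a list through p$layers, which holds one entry per geometry or statistic that has been added.

```r
library(ggplot2)

# Made-up data standing in for the MUSTANG dataframe (illustrative only)
demo <- data.frame(
  start = as.POSIXct("2015-03-01", tz = "GMT") + (0:4) * 24 * 3600,
  value = c(100, 100, 75, 0, 100)
)

# Data plus aesthetic mapping: no geometry has been added yet, so there are no layers
p0 <- ggplot(demo, aes(x = start, y = value))
length(p0$layers)

# Adding a geometry creates the first layer
p1 <- p0 + geom_point()
length(p1$layers)

# Labels are annotations rather than layers, so the layer count does not change
p2 <- p1 + labs(y = "Percent Availability")
length(p2$layers)

print(p2)
```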

We will demonstrate the power of ggplot2 by changing to a jitterplot examination of stations and flipping the coordinates:

p <- ggplot(df, aes(x=sta,y=value)) +
  geom_jitter() +
  coord_flip() +
  labs(title="Percent Availability at each station",x="Station",y="Percent Availability")

print(p)

What have we done differently:

Let’s try one more plot to further explore geometries and coordinate systems:

p <- ggplot(df, aes(x=sta,y=value)) +
  coord_polar() +
  stat_summary(fun=mean, geom="bar", color='lightblue', fill='lightblue') + # 'fun' replaced 'fun.y' in ggplot2 3.3.0
  geom_jitter(shape=1,alpha=0.5) +
  labs(title="'Percent Availability' of UW Stations with a BHZ channel (Mar-Apr 2015)",x="",y="")

print(p)

New features in this plot:

Hopefully, this incremental approach has demonstrated how to build up plots a line at a time to explore the capabilities of ggplot2. One of the advantages of this style of adding plot elements on separate lines is that you can also remove them one at a time by simply commenting them out.
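
The commenting-out workflow can be sketched as follows, again with made-up data. The trailing NULL is an optional convenience rather than something used elsewhere in this lesson: adding NULL to a plot leaves it unchanged, so every real component can end in + and any line can be commented out without leaving a dangling operator.

```r
library(ggplot2)

# Made-up data standing in for the MUSTANG dataframe (illustrative only)
demo <- data.frame(
  sta   = factor(rep(c("AAA", "BBB", "CCC"), each = 4)),
  value = c(100, 100, 95, 100, 80, 60, 100, 90, 0, 20, 100, 100)
)

# Full version: jittered points, flipped coordinates and a title
p_full <- ggplot(demo, aes(x = sta, y = value)) +
  geom_jitter() +
  coord_flip() +
  labs(title = "Percent Availability at each station") +
  NULL

# Same plot with two components switched off by commenting out their lines
p_reduced <- ggplot(demo, aes(x = sta, y = value)) +
  geom_jitter() +
  # coord_flip() +
  # labs(title = "Percent Availability at each station") +
  NULL

print(p_full)
print(p_reduced)
```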

At this point it is worth spending some time with the ggplot2 documentation to become more familiar with the different geometries, coordinate systems and annotations.


Task 1: Explore geom_histogram



Task 2: Explore geom_line


Themes

The ggplot2 package has the concept of a stylistic theme, which is independent of other aspects of the plot. A few standard themes come with the package, and users and organizations can create their own. Built-in themes include theme_classic, theme_minimal, theme_bw and theme_gray. Modifying the appearance of your plot with a theme is as simple as adding it:

p <- ggplot(df, aes(x=sta,y=value)) +
  coord_polar() +
  stat_summary(fun=mean, geom="bar", color='lightblue', fill='lightblue') +
  geom_jitter(shape=1,alpha=0.5) +
  labs(title="'Percent Availability' of UW Stations with a BHZ channel (Mar-Apr 2015)",x="",y="") +
  theme_bw()

print(p)

Modifications to an existing theme are made with the theme() function, whose documentation describes the full list of graphical elements that can be modified.
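
As a minimal sketch of how a user or organization might package its own look, the example below starts from a built-in theme, overrides two elements with theme(), and stores the result for reuse. The name theme_demo, the data and the chosen element settings are all hypothetical.

```r
library(ggplot2)

# Made-up data standing in for the MUSTANG dataframe (illustrative only)
demo <- data.frame(sta = c("AAA", "BBB", "CCC"), value = c(100, 85, 40))

# A hypothetical house style: start from a built-in theme, then override elements
theme_demo <- theme_minimal() +
  theme(axis.text.x = element_text(size = 12, face = "bold"),
        panel.grid.minor = element_blank())

# The stored theme is added to a plot like any other component
p <- ggplot(demo, aes(x = sta, y = value)) +
  geom_point() +
  theme_demo

print(p)
```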

Note: In order to take full advantage of theme options we will first have to load the grid package, which comes with R.

We will use some of these elements to modify our plot:

library(grid)

# BONUS: Create station label colors and font face based on average value
meanByStation <- aggregate(value ~ sta, data=df, FUN=mean)
colorIndices <- .bincode(meanByStation$value, breaks=c(0,60,80,90,100), include.lowest=TRUE)
labelColors <- c('red','darkorange','gold','black')[colorIndices]
fontFace <- rep(1,length=length(colorIndices))
fontFace[labelColors != 'black'] <- 4

p <- ggplot(df, aes(x=sta,y=value)) +
  coord_polar() +
  stat_summary(fun=mean, geom="bar", color='lightblue', fill='lightblue') +
  geom_jitter(shape=1,alpha=0.5) +
  labs(title="'Percent Availability' of UW Stations with a BHZ channel (Mar-Apr 2015)",x="",y="") +
  theme_minimal() +
  theme(axis.text.x=element_text(family='serif',size=12,color=labelColors,face=fontFace),
        # axis.ticks.margin=unit(1,"lines") was deprecated in ggplot2 2.0.0; set the 'margin' of axis.text instead
        axis.ticks.y=element_blank(),
        axis.text.y=element_blank(),
        plot.margin = unit(c(1,1,1,1), "lines"))

print(p)
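
The BONUS lines in the example above turn each station's mean value into a label color and font face. The following sketch runs the same binning on four made-up station means so that the mapping can be inspected directly; the station names and values are invented, while the breaks and colors are the ones used above.

```r
# Made-up station means (illustrative only)
meanByStation <- data.frame(
  sta   = c("AAA", "BBB", "CCC", "DDD"),
  value = c(45, 72.5, 88, 99.9)
)

# .bincode() returns the index of the interval each value falls into:
# [0,60], (60,80], (80,90], (90,100]
colorIndices <- .bincode(meanByStation$value,
                         breaks = c(0, 60, 80, 90, 100),
                         include.lowest = TRUE)

# One color per interval, from worst to best availability
labelColors <- c('red', 'darkorange', 'gold', 'black')[colorIndices]

# Font face codes: 1 = plain, 4 = bold italic (used to flag problem stations)
fontFace <- ifelse(labelColors == 'black', 1, 4)

data.frame(meanByStation, colorIndices, labelColors, fontFace)
```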


Task 3: Explore themes


Faceting

The ggplot2 concept of faceting is all about splitting up your data by a factor to create small multiples of a desired graphic for a quick visual overview. The different facet_*() functions let you use R’s ~ formula notation to specify that you want to break up the plots by some other variable in the dataframe. Up to now, we have been investigating a single factor: sta. In the next example we use the facet_wrap() function to create a grid of timeseries split up by station:

p <- ggplot(df, aes(x=start, y=value)) +
  geom_line() +
  geom_point(aes(color=value)) +
  scale_color_gradient(low="red",high="black") +
  labs(x="Date",y="Percent Availability") +
  facet_wrap( ~ sta, ncol=4)

print(p)

We can explore the facet_grid() function if we download some data with more than one factor.

url <- "http://service.iris.edu/mustang/measurements/1/query?net=UW&sta=SHUK|SP2|TOLT|WISH&timewindow=2015-03-01T00:00:00,2015-04-25T00:00:00&output=text&metric=percent_availability&orderby=start"
df <- getMetricDF(url)

p <- ggplot(df, aes(x=start, y=value)) +
  geom_line() +
  geom_point(aes(color=value)) +
  scale_color_gradient(low="red",high="black") +
  labs(x="Date",y="Percent Availability") +
  facet_grid(cha ~ sta)

print(p)
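
The difference between the two faceting functions can be sketched with a small made-up dataframe that has two factors. With facet_wrap() the panels for one factor are laid out in reading order, while facet_grid() produces one panel per combination of the row and column factors. The station and channel names below are invented, and the panel counts are computed from the factor levels rather than read from the plot objects.

```r
library(ggplot2)

# Made-up data with two factors: station and channel (illustrative only)
demo <- expand.grid(
  day = 1:5,
  sta = c("AAA", "BBB", "CCC"),
  cha = c("BHZ", "HHZ")
)
demo$value <- rep(c(100, 90, 100, 0, 100), times = 6)

# One panel per station, wrapped into two columns
p_wrap <- ggplot(demo, aes(x = day, y = value)) +
  geom_point() +
  facet_wrap(~ sta, ncol = 2)

# One panel per channel/station pair: channels in rows, stations in columns
p_grid <- ggplot(demo, aes(x = day, y = value)) +
  geom_line() +
  geom_point() +
  facet_grid(cha ~ sta)

# Expected number of panels, derived from the factor levels
panels_wrap <- nlevels(demo$sta)
panels_grid <- nlevels(demo$cha) * nlevels(demo$sta)

print(p_wrap)
print(p_grid)
```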

‘tidy’ dataframes

The ggplot2 package likes what it calls ‘tidy’ dataframes, where all values appear in a single column and another column identifies which variable or metric each value belongs to. These are also called ‘long’ dataframes, as opposed to ‘wide’ ones.
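
As a minimal sketch of the difference, the example below converts a made-up ‘wide’ dataframe, with one column per metric, into the ‘long’ layout using base R's stack(). The values are invented; the column names starttime, metricName and value are chosen to match the ones used in the plotting code later in this section.

```r
# Made-up 'wide' dataframe: one row per day, one column per metric (illustrative only)
wide <- data.frame(
  starttime  = as.POSIXct(c("2013-06-01", "2013-06-02"), tz = "GMT"),
  sample_max = c(10, 12),
  sample_min = c(-9, -11)
)

# stack() puts all metric values in one column and records the source column name
stacked <- stack(wide[, c("sample_max", "sample_min")])

# Assemble the 'long' dataframe: one row per (time, metric) pair
long <- data.frame(
  starttime  = rep(wide$starttime, times = 2),
  metricName = stacked$ind,
  value      = stacked$values
)

long
```

A dataframe in this shape can be passed straight to ggplot() with aes(x=starttime, y=value) and split into panels with facet_grid(metricName ~ .).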

The getSingleValueMetrics() function returns ‘tidy’ dataframes that are ready for use with ggplot():

library(IRISSeismic)
library(IRISMustangMetrics)

# Open a connection to IRIS DMC webservices (including the BSS)
iris <- new("IrisClient")

starttime <- as.POSIXct("2013-06-01", tz="GMT")
endtime <- starttime + 30*24*3600
metricName <- "sample_max,sample_min,sample_mean,sample_rms"

# Get the measurement dataframe
juneStats <- getSingleValueMetrics(iris,"IU","ANMO","00","BHZ",
                                   starttime,endtime,metricName)

p <- ggplot(juneStats, aes(x=starttime,y=value)) +
  geom_point() +
  facet_grid(metricName ~ .)

print(p)


Whew! Yet another whirlwind tour, this time focused on the power and flexibility of ggplot2. Armed with the available documentation and various examples on the web you should be ready to try your hand at creating data visualizations that answer your most pressing questions.

You should finish this lesson by exploring your own favorite data.


Task 4: Explore your favorite data

