In this lesson we will learn more about the ggplot2 package by Hadley Wickham. In his 2010 article “A layered grammar of graphics” in the Journal of Computational and Graphical Statistics, Wickham describes the grammar:

A grammar of graphics is a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics. This paper builds on Wilkinson (2006), describing extensions and refinements developed while building an open source implementation of the grammar of graphics for R, ggplot2.

Excellent sources of documentation for ggplot2 include:

Graphs section from Winston Chang’s Cookbook for R
ggplot2 documentation

We will begin by ingesting the following data using our getMetricDF() function from Lesson 04:

# getMetricDF accepts a MUSTANG url and returns a metric dataframe
getMetricDF <- function(url, skip=1) {
  
  # Read in the data
  DF <- read.csv(url, stringsAsFactors=FALSE, skip=skip)
  
  # Convert character strings to class 'POSIXct'.
  # The format is passed by name so it cannot be mistaken for the 'tz' argument.
  DF$start <- as.POSIXct(DF$start, format="%Y/%m/%d %H:%M:%OS", tz="GMT")
  DF$end <- as.POSIXct(DF$end, format="%Y/%m/%d %H:%M:%OS", tz="GMT")
  
  # Add columns to the dataframe
  char_matrix <- stringr::str_split_fixed(DF$target,'\\.',5)
  DF$net <- as.factor(char_matrix[,1])
  DF$sta <- as.factor(char_matrix[,2])
  DF$loc <- as.factor(char_matrix[,3])
  DF$cha <- as.factor(char_matrix[,4])
  DF$qual <- as.factor(char_matrix[,5])
  
  return(DF)
}

# Two months of UW network BHZ daily *percent_availability* metrics from all locations and stations
url <- "http://service.iris.edu/mustang/measurements/1/query?net=UW&cha=BHZ&timewindow=2015-03-01T00:00:00,2015-04-25T00:00:00&output=text&metric=percent_availability&orderby=start"
df <- getMetricDF(url)

# Quick examination of what we got back
dim(df)

## [1] 1100   10

summary(df)

##      value           target              start           
##  Min.   :  0.00   Length:1100        Min.   :2015-03-01  
##  1st Qu.:100.00   Class :character   1st Qu.:2015-03-14  
##  Median :100.00   Mode  :character   Median :2015-03-28  
##  Mean   : 92.14                      Mean   :2015-03-28  
##  3rd Qu.:100.00                      3rd Qu.:2015-04-11  
##  Max.   :100.00                      Max.   :2015-04-24  
##                                                          
##       end                lddate          net            sta      loc    
##  Min.   :2015-03-02   Length:1100        UW:1100   BABR   : 55   :1100  
##  1st Qu.:2015-03-15   Class :character             BRAN   : 55          
##  Median :2015-03-29   Mode  :character             DAVN   : 55          
##  Mean   :2015-03-29                                DOSE   : 55          
##  3rd Qu.:2015-04-12                                FORK   : 55          
##  Max.   :2015-04-25                                GNW    : 55          
##                                                    (Other):770          
##   cha       qual    
##  BHZ:1100   M:1100  
##                     
##                     
##                     
##                     
##                     
##

Our dataframe contains measurements from a variety of stations, and those stations are what we will focus our attention on.
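The parsing steps inside getMetricDF() can be tried offline without contacting MUSTANG. The following is a minimal sketch, assuming two made-up target strings and timestamps in the format the function expects. It uses base R's strsplit() in place of stringr::str_split_fixed() so that it runs without extra packages:

```r
# Hypothetical sample values, shaped like the 'target' and 'start' columns
target <- c("UW.BABR..BHZ.M", "UW.TOLT..BHZ.M")
start  <- c("2015/03/01 00:00:00", "2015/03/02 00:00:00")

# Split 'net.sta.loc.cha.qual' on literal dots; the empty 'loc' field is kept
parts <- do.call(rbind, strsplit(target, ".", fixed=TRUE))
colnames(parts) <- c("net", "sta", "loc", "cha", "qual")

# Convert the character timestamps to class 'POSIXct'
startTime <- as.POSIXct(start, format="%Y/%m/%d %H:%M:%OS", tz="GMT")

print(parts)
print(class(startTime))
```

The empty string between the two consecutive dots is why the loc column in the summary above shows a single blank level.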

Introduction to ‘ggplot2’

To begin working with ggplot2 you must first jettison everything you know about generating graphics in R. This grammar of graphics approach requires an entirely different mindset. The following snippet demonstrates the necessary minimum pieces of code to generate a plot with ggplot2:

library(ggplot2)

p <- ggplot(df, aes(x=start,y=value)) +
  geom_point() +
  labs(y="Percent Availability")

print(p)

Let’s examine the new concepts in the code above:

The ggplot() function doesn’t actually plot anything. Instead, it returns an object which has its own print.ggplot() method that causes plotting. (Try methods(print) to see how many print methods exist for different classes of objects.)
The basic arguments to ggplot() are a dataframe and an aesthetic which maps columns of the dataframe onto X and Y.
The + sign adds more information to the plot which will be used only when the plot is finally rendered.
The geometry, in this case geom_point(), defines the overall appearance of the plot.
Additional information needed for the plot is added with additional function calls like labs().
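To make the points above concrete, here is a small sketch that builds a plot object one piece at a time and inspects it before anything is drawn. The toy dataframe and its column names are made up for illustration:

```r
library(ggplot2)

# Hypothetical toy data standing in for the metric dataframe
toy <- data.frame(day=1:5, value=c(100, 100, 80, 95, 100))

# Nothing is drawn here; 'p' is just a description of the plot
p <- ggplot(toy, aes(x=day, y=value))
p <- p + geom_point()
p <- p + labs(y="Percent Availability")

# Inspect the object before rendering it
print(class(p))
print(length(p$layers))   # number of layers added so far
print(p$labels$y)

# Rendering happens only when the object is printed
print(p)
```

At the console, typing p on its own has the same effect as print(p) because R prints the value of any top-level expression.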

While this may seem overly abstract, it opens up lots of possibilities for exploring different visualization styles and makes for very concise and readable code once you’ve mastered things.

We will demonstrate the power of ggplot2 by changing to a jitterplot examination of stations and flipping the coordinates:

p <- ggplot(df, aes(x=sta,y=value)) +
  geom_jitter() +
  coord_flip() +
  labs(title="Percent Availability at each station",x="Station",y="Percent Availability")

print(p)

What have we done differently?

We changed the aesthetic so that x=sta.
We changed the geometry to geom_jitter().
We modified the coordinate system with coord_flip().
We used additional arguments to labs().
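One thing to keep in mind is that geom_jitter() by default adds a small random offset in both directions, so a value of exactly 100 can be drawn slightly above or below 100. The sketch below, using a made-up two-station dataframe, restricts the jitter to the station axis with the width and height arguments:

```r
library(ggplot2)

# Hypothetical toy data: two stations with mostly 100% availability
toy <- data.frame(
  sta   = rep(c("AAA", "BBB"), each=10),
  value = c(rep(100, 9), 40, rep(100, 7), 0, 60, 90)
)

# Jitter only along the station axis so the plotted values stay exact
p <- ggplot(toy, aes(x=sta, y=value)) +
  geom_jitter(width=0.2, height=0) +
  coord_flip() +
  labs(x="Station", y="Percent Availability")

print(p)
```

Note that width and height refer to the x and y aesthetics as mapped, before coord_flip() swaps the axes on screen.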

Let’s try one more plot to further explore geometries and coordinate systems:

p <- ggplot(df, aes(x=sta,y=value)) +
  coord_polar() +
  stat_summary(fun=mean, geom="bar", color='lightblue', fill='lightblue') + # 'fun' replaces 'fun.y', deprecated in ggplot2 3.3.0
  geom_jitter(shape=1,alpha=0.5) +
  labs(title="'Percent Availability' of UW Stations with a BHZ channel (Mar-Apr 2015)",x="",y="")

print(p)

New features in this plot:

We modified the coordinate system with coord_polar().
We defined a statistic to display with stat_summary() and added it with its own geometry geom="bar".
We called geom_jitter() after the line with geom="bar" so that points would appear on top.
We modified the appearance of jittered points with shape=1,alpha=0.5
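It can help to see what the summary statistic actually computes. The sketch below is an illustration on made-up data, not the lesson's dataset. It calculates per-station means with aggregate() and draws them explicitly with geom_col(), which should give the same bar heights that stat_summary() with geom="bar" produces:

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(
  sta   = rep(c("AAA", "BBB"), each=4),
  value = c(100, 100, 100, 60, 100, 0, 50, 50)
)

# Per-station means, computed by hand
meanByStation <- aggregate(value ~ sta, data=toy, FUN=mean)
print(meanByStation)

# Bars first, points second, so the points are drawn on top
p <- ggplot(toy, aes(x=sta, y=value)) +
  geom_col(data=meanByStation, fill='lightblue') +
  geom_jitter(shape=1, alpha=0.5, width=0.2, height=0)

print(p)
```

Layers are drawn in the order they are added, which is why swapping the two geom lines would hide some points behind the bars.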

Hopefully, this incremental approach has demonstrated how to build up plots a line at a time to explore the capabilities of ggplot2. One of the advantages of this style of adding plot elements on separate lines is that you can also remove them one at a time by simply commenting them out.

At this point it is worth spending some time with the ggplot2 documentation to become more familiar with the different geometries, coordinate systems and annotations.

Task 1: Explore geom_histogram

create a new R Script in RStudio called ‘percentAvailability.R’
cut and paste the data download and conversion code as well as the first example above as a starting point
use df <- subset(df,sta %in% c('TOLT','OMAK','LON','GRCC')) to create a 4-station subset of the dataframe
start with p <- ggplot(df, aes(x=value)) + geom_histogram(); print(p)
explore geom_histogram() capabilities following the examples in the geom_histogram documentation
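As one possible starting point for this task, the sketch below sets the bin width explicitly instead of relying on the default number of bins. The data are made up; binwidth is in data units (percent here) and boundary aligns the bin edges:

```r
library(ggplot2)

# Hypothetical toy data in place of the 4-station subset
toy <- data.frame(value=c(rep(100, 40), 95, 90, 72, 60, 35, 0, 0))

# 10-percent-wide bins with edges on multiples of 10
p <- ggplot(toy, aes(x=value)) +
  geom_histogram(binwidth=10, boundary=0, fill='lightblue', color='black') +
  labs(x="Percent Availability", y="Number of daily measurements")

print(p)
```

Because most values sit at exactly 100, it is worth trying several bin widths to see how much the picture changes.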

Task 2: Explore geom_line

with the same dataframe, start with p <- ggplot(df, aes(x=start,y=value,group=sta)) + geom_line(); print(p)
explore geom_line() capabilities following the examples in the geom_line documentation

Themes

The ggplot2 package has the concept of a stylistic theme which is independent of other aspects of the plot. A few standard themes come with the package, and users and organizations can create their own. Built-in themes include theme_classic, theme_minimal, theme_bw and theme_gray. Modifying the appearance of your plot with a theme is as simple as adding it:

p <- ggplot(df, aes(x=sta,y=value)) +
  coord_polar() +
  stat_summary(fun=mean, geom="bar", color='lightblue', fill='lightblue') +
  geom_jitter(shape=1,alpha=0.5) +
  labs(title="'Percent Availability' of UW Stations with a BHZ channel (Mar-Apr 2015)",x="",y="") +
  theme_bw()

print(p)

Modifications to an existing theme are made with the theme() function whose documentation describes the full list of graphical elements that can be modified.
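A minimal sketch of this pattern, on a made-up three-row dataframe, applies a complete theme first and then overrides a few individual elements:

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(sta=c("AAA", "BBB", "CCC"), value=c(100, 85, 40))

p <- ggplot(toy, aes(x=sta, y=value)) +
  geom_point() +
  theme_bw() +                                    # complete theme first
  theme(panel.grid.minor=element_blank(),         # then individual overrides
        axis.text.x=element_text(angle=45, hjust=1),
        plot.title=element_text(face="bold")) +
  labs(title="Theme overrides", x="Station", y="Percent Availability")

print(p)
```

The order matters: adding theme_bw() after the theme() call would replace the overrides with the complete theme's own settings.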

Note: In order to take full advantage of theme options we will first have to load the grid package, which comes with R.

We will use some of these elements to modify our plot:

library(grid)

# BONUS: Create station label colors and font face based on average value
meanByStation <- aggregate(value ~ sta, data=df, FUN=mean)
colorIndices <- .bincode(meanByStation$value, breaks=c(0,60,80,90,100), include.lowest=TRUE)
labelColors <- c('red','darkorange','gold','black')[colorIndices]
fontFace <- rep(1,length=length(colorIndices))
fontFace[labelColors != 'black'] <- 4

p <- ggplot(df, aes(x=sta,y=value)) +
  coord_polar() +
  stat_summary(fun=mean, geom="bar", color='lightblue', fill='lightblue') +
  geom_jitter(shape=1,alpha=0.5) +
  labs(title="'Percent Availability' of UW Stations with a BHZ channel (Mar-Apr 2015)",x="",y="") +
  theme_minimal() +
  theme(axis.text.x=element_text(family='serif',size=12,color=labelColors,face=fontFace),
        axis.ticks.margin=unit(1,"lines"), # deprecated since ggplot2 2.0.0; also doesn't seem to work with 'coord_polar()'
        axis.ticks.y=element_blank(),
        axis.text.y=element_blank(),
        plot.margin = unit(c(1,1,1,1), "lines"))

print(p)
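The BONUS lines above rely on .bincode() to turn each station's mean into an index into the color vector. The sketch below applies the same breaks to a few made-up means, chosen to hit every bin and both end points:

```r
# Hypothetical station means
means <- c(0, 55, 60, 75, 85, 95, 100)

# Bins are (0,60], (60,80], (80,90], (90,100]; include.lowest=TRUE also admits 0
colorIndices <- .bincode(means, breaks=c(0,60,80,90,100), include.lowest=TRUE)
labelColors  <- c('red','darkorange','gold','black')[colorIndices]

print(data.frame(means, colorIndices, labelColors))
```

Because the intervals are closed on the right, a mean of exactly 60 falls in the first ('red') bin rather than the second.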

Task 3: Explore themes

comment out the coord_polar() line and theme() lines above to return to a basic jitter and barplot
explore geom_bar() options and how they can be set in the stat_summary() line
use theme elements to modify the appearance of the underlying grid
try to recreate a few elements of the MATLAB-style plot from Lesson 04
type theme_bw to see the code in the theme_bw() function
install the ggthemes package and try out theme_wsj()
BONUS: create your own theme

Faceting

The ggplot2 concept of faceting is all about splitting up your data by factor to create small multiples of a desired graphic for a quick visual overview. The different facet_*() functions allow you to use R’s ~ in formula fashion to specify that you want to break up the plots by some other variable in the dataframe. Up to now, we have been investigating a single factor: sta. In the next example we use the facet_wrap() function to create a grid of timeseries split up by station:

p <- ggplot(df, aes(x=start, y=value)) +
  geom_line() +
  geom_point(aes(color=value)) +
  scale_color_gradient(low="red",high="black") +
  labs(x="Date",y="Percent Availability") +
  facet_wrap( ~ sta, ncol=4)

print(p)

We can explore the facet_grid() function if we download some data with more than one factor.

url <- "http://service.iris.edu/mustang/measurements/1/query?net=UW&sta=SHUK|SP2|TOLT|WISH&timewindow=2015-03-01T00:00:00,2015-04-25T00:00:00&output=text&metric=percent_availability&orderby=start"
df <- getMetricDF(url)

p <- ggplot(df, aes(x=start, y=value)) +
  geom_line() +
  geom_point(aes(color=value)) +
  scale_color_gradient(low="red",high="black") +
  labs(x="Date",y="Percent Availability") +
  facet_grid(cha ~ sta)

print(p)
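The difference between the two faceting functions can also be seen without downloading anything. The sketch below uses a made-up dataframe with two factors and applies both functions to the same base plot:

```r
library(ggplot2)

# Hypothetical toy data with two factors: station and channel
toy <- expand.grid(day=1:5, sta=c("AAA", "BBB"), cha=c("BHZ", "HHZ"),
                   stringsAsFactors=FALSE)
toy$value <- 100 - (toy$day * 3) %% 20

base <- ggplot(toy, aes(x=day, y=value)) + geom_line()

# One factor: panels wrap into as many rows as needed
pWrap <- base + facet_wrap(~ sta, ncol=2)

# Two factors: rows are channels, columns are stations
pGrid <- base + facet_grid(cha ~ sta)

print(pWrap)
print(pGrid)
```

In the formula, the variable on the left of ~ defines the rows and the variable on the right defines the columns.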

‘tidy’ dataframes

The ggplot2 package likes what it calls ‘tidy’ dataframes where all data appear in the same column and there is another column identifying which variable or metric each datum is associated with. These are also called ‘long’ as opposed to ‘wide’ dataframes.
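As a small illustration of the difference, the sketch below reshapes a made-up ‘wide’ dataframe into ‘long’ form using only base R. The column names mirror those used later in this section, but the values are invented:

```r
# Hypothetical 'wide' dataframe: one column per metric
wide <- data.frame(
  starttime  = as.POSIXct(c("2013-06-01", "2013-06-02"), tz="GMT"),
  sample_max = c(1500, 1720),
  sample_min = c(-1480, -1690)
)

# The same data in 'long' form: one row per (time, metric) pair
long <- data.frame(
  starttime  = rep(wide$starttime, times=2),
  metricName = rep(c("sample_max", "sample_min"), each=nrow(wide)),
  value      = c(wide$sample_max, wide$sample_min)
)

print(long)
```

A dataframe shaped like long can be handed straight to ggplot() with aes(x=starttime, y=value) and split by metricName using a facet function.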

The getSingleValueMetrics() function returns ‘tidy’ dataframes that are ready for use with ggplot():

library(IRISSeismic)
library(IRISMustangMetrics)

# Open a connection to IRIS DMC webservices (including the BSS)
iris <- new("IrisClient")

starttime <- as.POSIXct("2013-06-01", tz="GMT")
endtime <- starttime + 30*24*3600
metricName <- "sample_max,sample_min,sample_mean,sample_rms"

# Get the measurement dataframe
juneStats <- getSingleValueMetrics(iris,"IU","ANMO","00","BHZ",
                                   starttime,endtime,metricName)

p <- ggplot(juneStats, aes(x=starttime,y=value)) +
  geom_point() +
  facet_grid(metricName ~ .)

print(p)

Whew! Yet another whirlwind tour, this time focused on the power and flexibility of ggplot2. Armed with the available documentation and various examples on the web you should be ready to try your hand at creating data visualizations that answer your most pressing questions.

You should finish this lesson by exploring your own favorite data.

Task 4: Explore your favorite data

design a question about a particular set of seismic events or monitors that might be answerable with MUSTANG data
download and convert a dataframe with enough data to do some exploring
explore different ways of displaying the data to find one that best tells the story
design how you want your final visualization to look
use ggplot2 to get as close as you can to the picture in your mind’s eye

Lesson 08 – ggplot2

Mazama Science

Thu Aug 20 11:05:33 2015
