In this lesson we will learn more about the ggplot2 package by Hadley Wickham. In a 2010 article, “A layered grammar of graphics”, in the Journal of Computational and Graphical Statistics, Wickham describes the grammar:
A grammar of graphics is a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics. This paper builds on Wilkinson (2006), describing extensions and refinements developed while building an open source implementation of the grammar of graphics for R, ggplot2.
Excellent sources of documentation for ggplot2 include:
We will begin by ingesting the following data using our getMetricDF() function from Lesson 04:
# getMetricDF accepts a MUSTANG url and returns a metric dataframe
getMetricDF <- function(url, skip=1) {
  # Read in the data
  DF <- read.csv(url, stringsAsFactors=FALSE, skip=skip)
  # Convert character strings to class 'POSIXct'.
  DF$start <- as.POSIXct(DF$start, format="%Y/%m/%d %H:%M:%OS", tz="GMT")
  DF$end <- as.POSIXct(DF$end, format="%Y/%m/%d %H:%M:%OS", tz="GMT")
  # Add columns to the dataframe by splitting 'target' on literal dots
  char_matrix <- stringr::str_split_fixed(DF$target,'\\.',5)
  DF$net <- as.factor(char_matrix[,1])
  DF$sta <- as.factor(char_matrix[,2])
  DF$loc <- as.factor(char_matrix[,3])
  DF$cha <- as.factor(char_matrix[,4])
  DF$qual <- as.factor(char_matrix[,5])
  return(DF)
}
# Two months of UW network BHZ daily *percent_availability* metrics from all locations and stations
url <- "http://service.iris.edu/mustang/measurements/1/query?net=UW&cha=BHZ&timewindow=2015-03-01T00:00:00,2015-04-25T00:00:00&output=text&metric=percent_availability&orderby=start"
df <- getMetricDF(url)
# Quick examination of what we got back
dim(df)
## [1] 1100 10
summary(df)
##      value           target              start
##  Min.   :  0.00   Length:1100        Min.   :2015-03-01
##  1st Qu.:100.00   Class :character   1st Qu.:2015-03-14
##  Median :100.00   Mode  :character   Median :2015-03-28
##  Mean   : 92.14                      Mean   :2015-03-28
##  3rd Qu.:100.00                      3rd Qu.:2015-04-11
##  Max.   :100.00                      Max.   :2015-04-24
##
##       end                lddate           net          sta          loc
##  Min.   :2015-03-02   Length:1100        UW:1100   BABR   : 55   :1100
##  1st Qu.:2015-03-15   Class :character             BRAN   : 55
##  Median :2015-03-29   Mode  :character             DAVN   : 55
##  Mean   :2015-03-29                                DOSE   : 55
##  3rd Qu.:2015-04-12                                FORK   : 55
##  Max.   :2015-04-25                                GNW    : 55
##                                                    (Other):770
##    cha         qual
##  BHZ:1100   M:1100
##
##
##
##
##
##
Our dataframe contains measurements from a variety of stations, and these stations will be the focus of our attention.
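To make the parsing inside getMetricDF() concrete, here is a minimal sketch that applies the same two steps to a pair of hand-written rows. The target strings and timestamps are made up for illustration (they follow the net.sta.loc.cha.qual pattern visible in the summary above) and are not actual MUSTANG output.

```r
library(stringr)

# Hypothetical rows standing in for MUSTANG text output (illustration only)
toy <- data.frame(
  target = c("UW.BABR..BHZ.M", "UW.TOLT..BHZ.M"),
  start  = c("2015/03/01 00:00:00", "2015/03/02 00:00:00"),
  stringsAsFactors = FALSE
)

# Split each target on literal dots into exactly 5 pieces
parts <- str_split_fixed(toy$target, "\\.", 5)
toy$sta <- as.factor(parts[, 2])
toy$loc <- as.factor(parts[, 3])   # empty string when there is no location code

# Convert the character timestamps to POSIXct in GMT
toy$start <- as.POSIXct(toy$start, format = "%Y/%m/%d %H:%M:%OS", tz = "GMT")

str(toy)
```

Storing the pieces as factors is what later lets ggplot2 treat a column such as sta as a discrete variable.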
To begin working with ggplot2 you must first jettison everything you know about generating graphics in R. This grammar of graphics approach requires an entirely different mindset. The following snippet demonstrates the necessary minimum pieces of code to generate a plot with ggplot2:
library(ggplot2)
p <- ggplot(df, aes(x=start,y=value)) +
  geom_point() +
  labs(y="Percent Availability")
print(p)
Let’s examine the new concepts in the code above:
- The ggplot() function doesn’t actually plot anything. Instead, it returns an object which has its own print.ggplot() method that causes plotting. (Try methods(print) to see how many print methods exist for different classes of objects.)
- The arguments to ggplot() are a dataframe and an aesthetic which maps columns of the dataframe onto X and Y.
- The + sign adds more information to the plot which will be used only when the plot is finally rendered.
- The geometry, geom_point(), defines the overall appearance of the plot.
- Labels are set with labs().
While this may seem overly abstract, it opens up lots of possibilities for exploring different visualization styles and makes for very concise and readable code once you’ve mastered things.
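The deferred rendering described above can be checked directly in the console. The following is a minimal sketch that uses a small made-up dataframe (toy, not part of the lesson data) to show that building the plot object draws nothing and that layers accumulate on the object until it is printed.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(x = 1:5, y = c(100, 98, 75, 100, 60))

# Nothing is drawn here; we only get an object back
p <- ggplot(toy, aes(x = x, y = y))
class(p)            # includes "ggplot"
length(p$layers)    # 0 layers so far

# Each '+' returns a new object with more information attached
p <- p + geom_point() + labs(y = "Percent Availability")
length(p$layers)    # 1 layer: the point geometry

# Rendering happens only when the object is printed
print(p)
```

Because p is an ordinary R object, it can be stored, passed to functions, and extended later with further + calls.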
We will demonstrate the power of ggplot2 by changing to a jitterplot examination of stations and flipping the coordinates:
p <- ggplot(df, aes(x=sta,y=value)) +
  geom_jitter() +
  coord_flip() +
  labs(title="Percent Availability at each station",x="Station",y="Percent Availability")
print(p)
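What have we done differently:
- We mapped the station factor onto the X axis with x=sta.
- We changed the geometry to geom_jitter().
- We flipped the coordinates with coord_flip().
- We added a title and axis labels with labs().
Let’s try one more plot to further explore geometries and coordinate systems: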
p <- ggplot(df, aes(x=sta,y=value)) +
  coord_polar() +
  # 'fun' replaces the deprecated 'fun.y' argument (ggplot2 >= 3.3.0)
  stat_summary(fun=mean, geom="bar", color='lightblue', fill='lightblue') +
  geom_jitter(shape=1,alpha=0.5) +
  labs(title="'Percent Availability' of UW Stations with a BHZ channel (Mar-Apr 2015)",x="",y="")
print(p)
New features in this plot:
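- We switched to polar coordinates with coord_polar().
- We computed the mean for each station with stat_summary() and added it with its own geometry geom="bar".
- We placed geom_jitter() after the line with geom="bar" so that points would appear on top.
- We changed the point symbol and transparency with shape=1,alpha=0.5.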
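One way to convince yourself that the bars really are per-station means is to compare the values computed by stat_summary() with those from aggregate(). This is a rough sketch on made-up data; it assumes ggplot2 3.3.0 or later, where the summary function is passed as fun (older releases used fun.y), and it uses layer_data() to pull the computed values out of the first layer.

```r
library(ggplot2)

# Made-up availability values for three hypothetical stations
toy <- data.frame(
  sta   = factor(rep(c("AAA", "BBB", "CCC"), each = 4)),
  value = c(100, 100, 90, 70,  100, 100, 100, 100,  0, 50, 100, 100)
)

p <- ggplot(toy, aes(x = sta, y = value)) +
  stat_summary(fun = mean, geom = "bar", fill = "lightblue") +  # layer 1: bars
  geom_jitter(shape = 1, alpha = 0.5)                           # layer 2: points on top

# Heights of the bars as computed by ggplot2
barMeans <- layer_data(p, 1)$y

# The same means computed directly
directMeans <- aggregate(value ~ sta, data = toy, FUN = mean)$value

all.equal(as.numeric(barMeans), directMeans)
```

Layers are drawn in the order they are added, which is why the jittered points end up on top of the bars.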
Hopefully, this incremental approach has demonstrated how to build up plots a line at a time to explore the capabilities of ggplot2. One of the advantages of this style of adding plot elements on separate lines is that you can also remove them one at a time by simply commenting them out.
At this point it is worth spending some time with the ggplot2 documentation to become more familiar with the different geometries, coordinate systems and annotations.
Task 1: Explore geom_histogram
- use df <- subset(df,sta %in% c('TOLT','OMAK','LON','GRCC')) to create a 4-station subset of the dataframe
- try p <- ggplot(df, aes(x=value)) + geom_histogram(); print(p)
- explore geom_histogram() capabilities following the examples in the geom_histogram documentation
Task 2: Explore geom_line
- try p <- ggplot(df, aes(x=start,y=value,group=sta)) + geom_line(); print(p)
- explore geom_line() capabilities following the examples in the geom_line documentation
The ggplot2 package has the concept of a stylistic theme which is independent of other aspects of the plot. A few standard themes come with the package, and users and organizations can create their own themes. Built-in themes include theme_classic, theme_minimal, theme_bw and theme_gray. Modifying the appearance of your plot with a theme is as simple as adding it:
p <- ggplot(df, aes(x=sta,y=value)) +
  coord_polar() +
  # 'fun' replaces the deprecated 'fun.y' argument (ggplot2 >= 3.3.0)
  stat_summary(fun=mean, geom="bar", color='lightblue', fill='lightblue') +
  geom_jitter(shape=1,alpha=0.5) +
  labs(title="'Percent Availability' of UW Stations with a BHZ channel (Mar-Apr 2015)",x="",y="") +
  theme_bw()
print(p)
Modifications to an existing theme are made with the theme() function, whose documentation describes the full list of graphical elements that can be modified.
Note: In order to take full advantage of theme options we will have to first load the grid package which comes with R.
We will use some of these elements to modify our plot:
library(grid)
# BONUS: Create station label colors and font face based on average value
meanByStation <- aggregate(value ~ sta, data=df, FUN=mean)
colorIndices <- .bincode(meanByStation$value, breaks=c(0,60,80,90,100), include.lowest=TRUE)
labelColors <- c('red','darkorange','gold','black')[colorIndices]
fontFace <- rep(1,length.out=length(colorIndices))
fontFace[labelColors != 'black'] <- 4  # 4 = bold italic
p <- ggplot(df, aes(x=sta,y=value)) +
  coord_polar() +
  # 'fun' replaces the deprecated 'fun.y' argument (ggplot2 >= 3.3.0)
  stat_summary(fun=mean, geom="bar", color='lightblue', fill='lightblue') +
  geom_jitter(shape=1,alpha=0.5) +
  labs(title="'Percent Availability' of UW Stations with a BHZ channel (Mar-Apr 2015)",x="",y="") +
  theme_minimal() +
  # 'axis.ticks.margin' is omitted: it was deprecated in ggplot2 2.0.0 in favor
  # of the 'margin' argument of element_text()
  theme(axis.text.x=element_text(family='serif',size=12,color=labelColors,face=fontFace),
        axis.ticks.y=element_blank(),
        axis.text.y=element_blank(),
        plot.margin=unit(c(1,1,1,1), "lines"))
print(p)
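Because a theme is just an object that gets added to a plot, preferred settings can be wrapped in a function and reused, which is one way users and organizations create their own themes. The sketch below defines a hypothetical theme_lesson() on top of theme_minimal(); the function name, the toy data and the specific settings are invented for illustration.

```r
library(ggplot2)

# Hypothetical house theme: name and settings are illustrative only
theme_lesson <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      axis.text.x  = element_text(family = "serif", face = "bold"),
      axis.ticks.y = element_blank(),
      plot.title   = element_text(hjust = 0.5)   # center the title
    )
}

# Made-up data for illustration only
toy <- data.frame(sta = c("AAA", "BBB", "CCC"), value = c(90, 100, 62.5))

p <- ggplot(toy, aes(x = sta, y = value)) +
  geom_point() +
  labs(title = "Toy availability") +
  theme_lesson()
print(p)
```

Keeping the overrides in one function means every plot in a report picks up the same look with a single added line.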
Task 3: Explore themes
- comment out the coord_polar() line and theme() lines above to return to a basic jitter and barplot
- explore geom_bar() options and how they can be set in the stat_summary() line
- type theme_bw to see the code in the theme_bw() function
- try theme_wsj() from the ggthemes package
The ggplot2 concept of faceting is all about splitting up your data by factor to create small multiples of a desired graphic for a quick visual overview. The different facet functions allow you to use R’s ~ in formula fashion to specify that you want to break up the plots by some other variable in the dataframe. Up to now, we have been investigating a single factor: sta. In the next example we use the facet_wrap() function to create a grid of timeseries split up by station:
p <- ggplot(df, aes(x=start, y=value)) +
  geom_line() +
  geom_point(aes(color=value)) +
  scale_color_gradient(low="red",high="black") +
  labs(x="Date",y="Percent Availability") +
  facet_wrap( ~ sta, ncol=4)
print(p)
We can explore the facet_grid() function if we download some data with more than one factor.
url <- "http://service.iris.edu/mustang/measurements/1/query?net=UW&sta=SHUK|SP2|TOLT|WISH&timewindow=2015-03-01T00:00:00,2015-04-25T00:00:00&output=text&metric=percent_availability&orderby=start"
df <- getMetricDF(url)
p <- ggplot(df, aes(x=start, y=value)) +
  geom_line() +
  geom_point(aes(color=value)) +
  scale_color_gradient(low="red",high="black") +
  labs(x="Date",y="Percent Availability") +
  facet_grid(cha ~ sta)
print(p)
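The difference between the two faceting functions can also be seen without downloading anything. The following sketch uses a made-up dataframe with two hypothetical stations and two channels and counts the panels each function produces; reading the panel table from ggplot_build(p)$layout$layout is an assumption about ggplot2 internals that holds in recent 3.x releases but is not a documented, stable interface.

```r
library(ggplot2)

# Made-up data: 10 days x 2 stations x 2 channels (illustration only)
toy <- expand.grid(day = 1:10, sta = c("AAA", "BBB"), cha = c("BHZ", "EHZ"))
toy$value <- 100 - (toy$day %% 3) * 10

base <- ggplot(toy, aes(x = day, y = value)) +
  geom_line(aes(group = cha))

# One panel per station, wrapped into rows of up to 2 panels
p_wrap <- base + facet_wrap(~ sta, ncol = 2)

# Rows are channels, columns are stations
p_grid <- base + facet_grid(cha ~ sta)

# Count the panels in each layout
nWrap <- nrow(ggplot_build(p_wrap)$layout$layout)
nGrid <- nrow(ggplot_build(p_grid)$layout$layout)
c(wrap = nWrap, grid = nGrid)
```

In the formula, the variable on the left of ~ defines rows and the variable on the right defines columns, so facet_grid(cha ~ sta) gives one panel for every channel and station combination.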
The ggplot2 package likes what it calls ‘tidy’ dataframes where all data appear in the same column and there is another column identifying which variable or metric each datum is associated with. These are also called ‘long’ as opposed to ‘wide’ dataframes.
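As a small illustration of the difference, the sketch below builds a made-up ‘wide’ dataframe with one column per metric and reshapes it into the ‘long’ form using only base R. The column names are chosen to resemble MUSTANG metric names, and the numbers are invented.

```r
# Made-up 'wide' dataframe: one column per metric (values are illustrative)
wide <- data.frame(
  starttime  = as.POSIXct(c("2013-06-01", "2013-06-02"), tz = "GMT"),
  sample_max = c(1200, 1350),
  sample_min = c(-1100, -1400)
)

metricCols <- c("sample_max", "sample_min")

# 'Long' (tidy) version: one row per (starttime, metric) pair
long <- data.frame(
  starttime  = rep(wide$starttime, times = length(metricCols)),
  metricName = rep(metricCols, each = nrow(wide)),
  value      = unlist(wide[metricCols], use.names = FALSE),
  stringsAsFactors = FALSE
)

long
```

With the data in this shape, a single aes(y=value) mapping covers every metric, and the metricName column is available for faceting or coloring.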
The getSingleValueMetrics() function returns ‘tidy’ dataframes that are ready for use with ggplot():
library(IRISSeismic)
library(IRISMustangMetrics)
# Open a connection to IRIS DMC webservices (including the BSS)
iris <- new("IrisClient")
starttime <- as.POSIXct("2013-06-01", tz="GMT")
endtime <- starttime + 30*24*3600
metricName <- "sample_max,sample_min,sample_mean,sample_rms"
# Get the measurement dataframe
juneStats <- getSingleValueMetrics(iris,"IU","ANMO","00","BHZ",
                                   starttime,endtime,metricName)
p <- ggplot(juneStats, aes(x=starttime,y=value)) +
  geom_point() +
  facet_grid(metricName ~ .)
print(p)
Whew! Yet another whirlwind tour, this time focused on the power and flexibility of ggplot2. Armed with the available documentation and various examples on the web you should be ready to try your hand at creating data visualizations that answer your most pressing questions.
You should finish this lesson by exploring your own favorite data.
Task 4: Explore your favorite data