In this chapter we will learn more about mult-dimensional arrays and how to calculate statistics with the apply() function.

Multi-dimensional arrays

In R, an array is a multi-dimensional block of cells where every cell must have data of the same class. Arrays are created with the array() function but we will be reading in data arrays from a netcdf file. A two-dimensional array is known as a matrix.

Array elements are accessed using [i,j,k] style indexing with as many indices as there are dimensions. Each of the ‘i’, ‘j’, ‘k’ above can be a single integer, a vector of integers or a logical mask of the same length as that dimension. If any of ‘i’, ‘j’, or ‘k’ are omitted then no subsetting occurs along that dimension.

Let’s load the BlueSky v2 output file we careated in Lesson 07 and do some subsetting.

library(ncdf4)

# Open the file
fileName <- 'PNW-1.33km_2014092500_v2.nc'
nc <- nc_open(fileName)

# Read in all of the data
pm25 <- ncvar_get(nc, "PM25") 
time <- ncvar_get(nc, "time")
time <- as.POSIXct(time, origin="1970-1-1", tz="GMT")
lat <- ncvar_get(nc, "lat")
lon <- ncvar_get(nc, "lon")

# Close the file now that everything is in memory
nc_close(nc)

# First questions:
class(pm25)

## [1] "array"

dim(pm25)

## [1] 1201  421   47

summary(pm25)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      0.00      0.00      0.00      3.81      0.00 153300.00

# Whenever we have a third quartile value below the mean we suspect the
# data might be logarithmically distributed
hist(log(pm25))

# The map in Lesson 07 showed some smoke in eastern Washington.  Let's look at
# the statistics in a 1 degree (~80 km) box centered on Washington State University in
# Pullman, WA (-117.167 E, 46.733 N)

lonMask <- lon > -117.667 & lon < -116.667
latMask <- lat > 46.233 & lat < 47.233
Pullman <- pm25[lonMask,latMask,]
Plon <- lon[lonMask]
Plat <- lat[latMask]

# Add the data from Pullman on top
hist(log(Pullman), add=TRUE, col='red')
mtext("Values from around Pullman", col='red')

# We can use the 'plotrix' package multhist() function to plot side-by-side histograms
library(plotrix)

# Specifying particular hours could be done like this:
morningHours <- c(6:12,30:36)
afternoonHours <- c(13:20,37:44)
nightHours <- c(1:5,21:29,45:47)

dataList <- list(log(Pullman[,,morningHours]),
                 log(Pullman[,,afternoonHours]),
                 log(Pullman[,,nightHours]))

multhist(dataList, beside=TRUE, freq=FALSE,
         col=c('blue','red','gray'),
         xlab='log(PM 2.5)',
         main='Morning/Afternoon/Night Smoke Comparison for Pullman, WA')

legend('topright', legend=c('Morning','Afternoon','Night'), fill=c('blue','red','gray'))

If we wanted to know when and where the maximum value of PM2.5 occurred we might be tempted to do the following:

which(Pullman == max(Pullman))

## [1] 388580

Unfortunately, the which() function unravels our 3-D array into an integer vector and returns the position of the maximum in that vector. This doesn’t really tell us anything about the when or where of this maximum. Note that we could do the unraveling ourselves with:

PullmanVector <- as.numeric(Pullman)
class(PullmanVector)

## [1] "numeric"

length(PullmanVector)

## [1] 470000

So we know that our maximum value is approximately 39/47 of the way through this long, unraveled vector. But when did it occur? To figure out that we need to use the apply() function.

Multi-dimensional statistics with apply()

To apply a function like max() over only some of the dimensions of our array we have to use the apply() function. With apply() we can ask questions like “What is the maximum value of PM25 at each separate timestep. We give it the following arguments:

our array
the dimension(s) we want to loop over
the function to apply
optional arguments to the function

# Loop over T (apply function to X and Y) to get a timeseries
hourlyMax <- apply(Pullman, 3, max, na.rm=TRUE)
length(hourlyMax) # same as our time dimension

## [1] 47

which(hourlyMax == max(hourlyMax)) # hour of overall maximum

## [1] 39

# For a simple plot
hourlyMean <- apply(Pullman, 3, mean, na.rm=TRUE)

# NOTE:  The 'y ~ x' notation below uses a *formula* and can be read as 'y as a function of x'

plot(hourlyMean ~ time, type='b', las=1,
     main="Avergage PM 2.5 around Pullman, WA")

# Loop over X and Y (apply function to T) to get a map
library(maps)
sumXY <- apply(pm25, c(1,2), sum, na.rm=TRUE)
image(lon, lat, log(sumXY), las=1,
      xlab='Lon', ylab='Lat',
      main="Cumulative PM 2.5 Exposure")
map('state',add=TRUE)
map('county',col='gray80',add=TRUE)

The function that is applied can be one of R’s standard functions or it can be a function of your own creation as in the following example:

# Calculate the cumulative exposure for a particular time of day.
# All other times do not contribute.
#   x -- timeseries passed in
#   start -- starting hour
#   duration -- # of hours
timeOfDayExposure <- function(x, start, duration) {
  # Determine which hours should be set to zero
  hours <- seq(start,length.out=duration)
  hours <- c(hours,hours+24)
  exposureMask <- 1:47 %in% hours
  x[!exposureMask] <- 0
  # Calculate the cumulative exposure
  exposure <- sum(x)
  return(exposure)
}

# Create some 2-D fields by applying our function
morningJogExposure <- apply(Pullman, c(1,2), timeOfDayExposure, start=6, duration=1)
soccerExposure <- apply(Pullman, c(1,2), timeOfDayExposure, start=15, duration=3)

# Set up the AQI 1-3 hour breaks and colors
aqiBreaks_1_3 <- c(0, 39, 89,  139,   352,   526,   10000)
aqiBreaks_8   <- c(0, 23, 51,   80,   200,   300,   10000)
aqiBreaks_24  <- c(0, 12, 35.5, 55.5, 150.5, 250.5, 10000)
aqiColors <- c("#009966","#FFDE33","#FF9933","#CC0033","#660099","#730023")
aqiColors <- adjustcolor(aqiColors, 0.5)
aqiNames <- c('good','moderate','USG','unhealthy','very unhealthy','extreme')

# Side-by-side layout
layout(t(seq(2)))

# Map 1 (divide exposure by 2 to get single day cumulative)
image(Plon, Plat, morningJogExposure/2, las=1,
      xlab='', ylab='',
      col=aqiColors, breaks=aqiBreaks_24,
      main="Morning Jog Exposure")

# Add a map and label Pullman
map('county', add=TRUE, col='white', lwd=2)
points(-117.167, 46.733, pch='X', cex=1.2, lwd=2, col='white')
text(-117.167, 46.733, labels="Pullman ", cex=1.2, font=2, pos=2, col='white')

# Add a legend
legend('topleft', legend=rev(aqiNames), cex=0.8,
       fill=rev(aqiColors),
       bg='white')

# Map 2 (divide exposure by 2 to get single day cumulative)
image(Plon, Plat, soccerExposure/2, las=1,
      xlab='', ylab='',
      col=aqiColors, breaks=aqiBreaks_24,
      main="Soccer Practice Exposure")

# Add a map and label Pullman
map('county', add=TRUE, col='white', lwd=2)
points(-117.167, 46.733, pch='X', cex=1.2, lwd=2, col='white')
text(-117.167, 46.733, labels="Pullman ", cex=1.2, font=2, pos=2, col='white')

Lesson 08 – Working with Arrays

Mazama Science

Last updated: October 09, 2014

Multi-dimensional arrays

Multi-dimensional statistics with apply()