In this lesson we will learn more about the stringr package for string processing as well as the lubridate package and the POSIXct class for handling dates and times.
We should describe packages a little because they are fundamental to extending R’s core functionality.
Many of the most important packages are installed on your system when you install R. The installed.packages() function returns a matrix of information about installed packages, which you can index to learn about individual packages, as seen here:
# installed.packages returns a character matrix of information
ip <- installed.packages()
ip['stringr',c('Package','Version')]
## Package Version
## "stringr" "0.6.2"
Show all installed packages with rownames(installed.packages()).
You can review the 6000+ packages at CRAN (http://cran.r-project.org/web/packages/) and install new packages with install.packages() or with the ‘Install’ button in the ‘Packages’ tab in RStudio.
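As a minimal sketch of how the installed-package matrix can be used in a script, the helper below checks whether packages are present before they are needed. The function name is_installed and the made-up package name are illustrative assumptions, not part of the lesson.

```r
# Hypothetical helper: TRUE for each package name found among installed packages
is_installed <- function(pkgs) {
  pkgs %in% rownames(installed.packages())
}

# 'stats' and 'utils' ship with R; the last name is deliberately made up
is_installed(c("stats", "utils", "notARealPackage123"))

# Install only when missing (commented out because it needs network access)
# if (!is_installed("stringr")) install.packages("stringr")
```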
Packages define data and functions that can be made available in your session with the library() function. This attaches the package to the search path so that its functions are immediately available by name.
You can also use package functions with the package name prepended, like PACKAGE::FUNCTION(...). When you are using several packages, sometimes for just a few function calls, it is often helpful to use this explicit format so that you know which functions are associated with which packages. Being explicit also avoids namespace collisions when two packages use the same function name.
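The following sketch illustrates a name collision and how the explicit PACKAGE::FUNCTION(...) form sidesteps it. The replacement median() defined here is purely hypothetical and exists only to provoke the collision.

```r
# Define our own function that happens to share a name with stats::median()
median <- function(x) "my own median"

median(c(1, 3, 5))          # calls the function defined above
stats::median(c(1, 3, 5))   # explicitly calls the version from 'stats'

# Remove our definition so that median() again resolves to stats::median()
rm(median)
median(c(1, 3, 5))
```

The explicit form keeps working no matter which other definitions are in scope.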
To learn which functions are defined by a package, use for example: help(package='stringr').
R has a number of builtin functions for doing things with string variables but, in typical R fashion, they are not collected in one place and they accept different arguments in different orders, reflecting the chaotic early growth of R.
Instead of R’s builtin functions, we will utilize the highly regularized stringr package by Hadley Wickham. From the stringr package description:
stringr is a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent, all functions deal with NA’s and zero length character appropriately, and the output data structures from each function matches the input data structures of other functions.
The stringr package provides a number of sensibly named functions that ‘count’, ‘detect’, ‘extract’, ‘join’, ‘pad’, ‘replace’, ‘split’, ‘trim’, etc. All of these functions operate on vectorized character strings. We will load the dataframe we generated in Lesson 02 to demonstrate some basic string processing with stringr.
Note: Defining patterns for R’s regular expression matching (see ?regex) requires that we escape any ‘special’ characters with a backslash. But because backslash is itself a special character in R strings, we use ‘\\.’ as the regex pattern that matches a single period. The fixed() function from stringr is useful for turning off regular expression matching.
# Load 'stringr' functions into namespace so we don't have to use 'stringr::'
library(stringr)
# Load previously saved data
rms <- get(load('rms_example.RData'))
# Split 'target' column into a matrix with five components and show the first three rows
str_split_fixed(rms$target, pattern='\\.', 5)[1:3,]
## [,1] [,2] [,3] [,4] [,5]
## [1,] "IU" "ANMO" "10" "BH1" "M"
## [2,] "IU" "ANMO" "10" "BHZ" "M"
## [3,] "IU" "ANMO" "10" "HH1" "M"
# Replace '.' with '_'
str_replace_all(rms$target, pattern=fixed('.'), '_')[1:3]
## [1] "IU_ANMO_10_BH1_M" "IU_ANMO_10_BHZ_M" "IU_ANMO_10_HH1_M"
# Join pieces together (str_c() is used because str_join() was deprecated in later stringr releases)
str_c(rms$net,rms$sta,rms$loc,rms$cha, sep='.')[1:3]
## [1] "IU.ANMO.10.BH1" "IU.ANMO.10.BHZ" "IU.ANMO.10.HH1"
# Detect patterns
str_detect(rms$cha, 'HH.')[1:9]
## [1] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
str_detect(rms$cha, 'HHZ')[1:9]
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
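To make the difference between an unescaped period, an escaped period and fixed() concrete, here is a small self-contained sketch. The two identifiers are made up for illustration and do not come from the rms dataframe.

```r
library(stringr)

# Hypothetical identifiers; the second one contains no periods at all
ids <- c("IU.ANMO.10.BH1.M", "IUXANMO")

# Unescaped '.' matches any character, so both strings match
any_char <- str_detect(ids, "IU.ANMO")

# Escaped '\\.' matches only a literal period
escaped <- str_detect(ids, "IU\\.ANMO")

# fixed() turns off regex interpretation and gives the same result
literal <- str_detect(ids, fixed("IU.ANMO"))

any_char   # TRUE  TRUE
escaped    # TRUE FALSE
literal    # TRUE FALSE
```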
Task 1: SQL-style queries with logical masks
# Example logical masks
lo_med_hi <- .bincode(rms$value, breaks=quantile(rms$value, seq(0,1,1/3)), include.lowest=TRUE)
lo_mask <- lo_med_hi == 1
HH._mask <- str_detect(rms$cha, 'HH.')
..Z_mask <- str_detect(rms$cha, '..Z')
Use the above examples as templates to create a series of logical masks needed to create the subsets listed below. Hint: you will need to combine these masks with the element-wise logical operators & and | to create SQL style AND and OR queries.
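As a rough sketch of how masks combine, the example below uses a small made-up dataframe in place of rms. The dataframe df and its values are assumptions chosen only so the results are easy to check by eye.

```r
library(stringr)

# Hypothetical stand-in for the 'rms' dataframe
df <- data.frame(cha   = c("BH1", "BHZ", "HH1", "HHZ", "HH2", "LHZ"),
                 value = c(1, 2, 3, 4, 5, 6),
                 stringsAsFactors = FALSE)

# Bin values into thirds: 1 = low, 2 = medium, 3 = high
lo_med_hi <- .bincode(df$value, breaks = quantile(df$value, seq(0, 1, 1/3)),
                      include.lowest = TRUE)
lo_mask <- lo_med_hi == 1
HH_mask <- str_detect(df$cha, "HH.")
Z_mask  <- str_detect(df$cha, "..Z")

# AND: rows that are both HH channels and Z channels
df[HH_mask & Z_mask, ]

# OR: rows in the lowest third of values or on a Z channel
df[lo_mask | Z_mask, ]
```

Each mask is a logical vector with one element per row, so & and | combine them row by row, much like AND and OR in a SQL WHERE clause.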
Request a larger amount of data from MUSTANG and come up with other SQL style queries of your own design.
R provides a Date class for working with dates as well as POSIXlt and POSIXct for working with date-times. You should always use the POSIXct class for any work in the time dimension. Data of class POSIXct are stored as the number of seconds since 1970-01-01 00:00:00 UTC.
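The seconds-since-epoch storage can be inspected directly. This is a minimal sketch using only base R; the variable names are illustrative.

```r
quake <- as.POSIXct("2001-02-28 18:54:00", tz = "GMT")

# The underlying value is just a number of seconds since the epoch
secs <- as.numeric(quake)
secs

# Rebuild the same date-time from the raw number of seconds
rebuilt <- as.POSIXct(secs, origin = "1970-01-01", tz = "GMT")
rebuilt == quake

# Because the storage is numeric, arithmetic works in seconds
quake + 3600   # one hour later
```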
Use the strptime() function to parse ASCII strings into date-times and the strftime() function to format date-times into ASCII. Note that strptime() returns an object of class POSIXlt, which can be converted to POSIXct with as.POSIXct(). It is important to always specify the tz argument when using these functions as the default is to use the local timezone, not ‘GMT’. Some examples will demonstrate what can be done with time and date functions:
# Start of the Nisqually quake in two different ASCII representations
iso8601 <- "2001-02-28 18:54:00"
human <- "February 28, 2001 at 10:54am PST"
# Parsing
Nisqually <- strptime(iso8601,format="%Y-%m-%d %H:%M:%S",tz="GMT")
Nisqually
## [1] "2001-02-28 18:54:00 GMT"
Nisqually <- strptime(human,format="%B %d, %Y at %I:%M%p PST",tz="America/Los_Angeles")
Nisqually
## [1] "2001-02-28 10:54:00 PST"
# Formatting
strftime(Nisqually,format="Nisqually quake started on %d/%m/%Y, at %I:%M %p -- it was a %A")
## [1] "Nisqually quake started on 28/02/2001, at 10:54 AM -- it was a Wednesday"
strftime(Nisqually,format="Nisqually quake started in %Y on day %j")
## [1] "Nisqually quake started in 2001 on day 059"
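To see why the time zone argument matters, the sketch below parses one timestamp string under two different time zones and also checks which class each parsing function returns. The variable names are illustrative assumptions.

```r
stamp <- "2001-02-28 10:54:00"

# strptime() returns POSIXlt; as.POSIXct() converts it
parsed <- strptime(stamp, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
class(parsed)
class(as.POSIXct(parsed))

# The same text interpreted in two time zones is two different instants
as_gmt     <- as.POSIXct(stamp, tz = "GMT")
as_pacific <- as.POSIXct(stamp, tz = "America/Los_Angeles")
difftime(as_pacific, as_gmt, units = "hours")
```

Leaving out tz would silently use the local time zone of whichever machine runs the code, so the same script could give different instants on different computers.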
If your data are always in UTC then you must always specify tz="GMT" as an argument when you read in or write out dates. If your data are sometimes in local time then you should become familiar with the lubridate package. This package has a variety of utilities that make working with dates much easier.
Here are some examples using the lubridate package.
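# How many time zones are in Indiana?
# (Newer lubridate releases may no longer provide olson_time_zones();
# base R's OlsonNames() returns a similar list.)
zones <- lubridate::olson_time_zones()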
zones[str_detect(zones,'Indiana')]
## [1] "America/Indiana/Indianapolis" "America/Indiana/Knox"
## [3] "America/Indiana/Marengo" "America/Indiana/Petersburg"
## [5] "America/Indiana/Tell_City" "America/Indiana/Vevay"
## [7] "America/Indiana/Vincennes" "America/Indiana/Winamac"
# What time did clocks on the East Coast show when the Nisqually quake struck?
newyork <- lubridate::with_tz(Nisqually, tz="America/New_York")
newyork
## [1] "2001-02-28 13:54:00 EST"
# Let's make an 'airport clock' for the Nisqually quake
london <- lubridate::with_tz(Nisqually, tz="Europe/London")
moscow <- lubridate::with_tz(Nisqually, tz="Europe/Moscow")
hongkong <- lubridate::with_tz(Nisqually, tz="Asia/Hong_Kong")
text <- paste0(" Nisqually quake happened at\n",
strftime(Nisqually," %A %B %d %H:%M %Z\n"),
strftime(newyork," %A %B %d %H:%M %Z\n"),
strftime(london," %A %B %d %H:%M %Z\n"),
strftime(moscow," %A %B %d %H:%M %Z\n"),
strftime(hongkong," %A %B %d %H:%M %Z"))
cat(text)
## Nisqually quake happened at
## Wednesday February 28 10:54 PST
## Wednesday February 28 13:54 EST
## Wednesday February 28 18:54 GMT
## Wednesday February 28 21:54 MSK
## Thursday March 01 02:54 HKT
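The airport clock works because with_tz() changes only how an instant is displayed, not the instant itself. The sketch below contrasts it with lubridate's force_tz(), which keeps the clock reading and changes the instant. The variable names are illustrative.

```r
library(lubridate)

quake <- as.POSIXct("2001-02-28 18:54:00", tz = "GMT")

# with_tz(): same instant, different clock
ny_clock <- with_tz(quake, tzone = "America/New_York")
ny_clock == quake
format(ny_clock, "%H:%M %Z")

# force_tz(): same clock reading, different instant
ny_forced <- force_tz(quake, tzone = "America/New_York")
difftime(ny_forced, quake, units = "hours")
```

As a rule of thumb, with_tz() suits converting correctly labeled times for display, while force_tz() suits repairing times that were parsed with the wrong time zone.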
Task 2: lubridate functions
Many documentation pages for package functions have examples that you can cut and paste into the R console. Please explore the following lubridate functions by running the associated examples and experimenting on your own: