Visualizing Bikeshare Data

This entry is part 18 of 21 in the series Using R

Seattle’s Pronto bikeshare system recently announced a Data Challenge for data visualization using their first year of trip data. As avid cyclists and data analysis junkies, we of course took the bait. Below is a brief description of our Pronto Databrowser submission.

At Mazama Science we focus on creating interactive websites that allow people to thoroughly interrogate interesting datasets. Each project begins with a thorough investigation of the original data to figure out what stories can be told. This is then followed by an exploration of data visualization styles that help communicate the stories we have found. Finally, a user interface is developed that guides people to subset the data in ways that lead to interesting stories. Guiding people to achieve their own “Aha!” experiences is the best way to motivate data based decision making.

Pronto Data

The Pronto Data Challenge data consists of four datasets: trip data, station metadata, daily weather data and minute-to-minute station status (#docks empty/full/broken).

The raw csv files have the following sizes:

For any investigation of human behavior, the most interesting dataset is the trip data which contains 142,846 trip records with the following variables:

  • trip_id
  • starttime
  • stoptime
  • bikeid
  • tripduration
  • from_station_name
  • to_station_name
  • from_station_id
  • to_station_id
  • usertype
  • gender
  • birthyear

To generate maps, this must be combined with station metadata which has:

  • id
  • name
  • terminal
  • lat
  • long
  • dockcount
  • online

We can amend the station metadata by adding station elevations using the Google Maps Elevation API. Using as-the-crow-flies distances from R’s geosphere package, we can also add station-to-station distances to the station metadata and, through lookup, to the trip data.

Now we have a couple of rich datasets to play with.

Data Visualization

After playing around with our amended datasets,  we realized that, even though Pronto bikeshare usage is very seasonal, it isn’t that affected by weather. There aren’t really any muggy days in Seattle as temperature and humidity are inversely correlated. And it can be cool and wet any time of the year. (Though 2015 had unusually long stretches of nice weather.) Subsetting the trip data by gender and age also doesn’t lead to anything particularly revealing.

The most interesting stories we found have to do with the variations in usage patterns between annual members and short-term pass holders; with time-of-day and day-of-week; and depending on the departure station.

You can  explore the data yourself with a variety of visualizations at our Pronto Databrowser. Here are two plots from the Data Stories page that tell pretty good stories by themselves:

The time-of-day usage plot shows that: 1) annual pass holders use the bikes during the morning and evening commute; and 2) usage is heaviest in the summer.

 annual_weeklyUsageByHourOfDay

The elevation plot shows that annual pass holders vastly prefer coasting to humping up Seattle’s steep hills.

annual_coasting

We invite you to find and tell your own stories by poking around and exploring this very interesting dataset.

Happy Exploring!

Series NavigationFunction Argument Lists and missing()When k-means Clustering Fails
This entry was posted in Data Visualization, R and tagged , , . Bookmark the permalink.

2 Responses to Visualizing Bikeshare Data

  1. Terry says:

    Can you explain the elevation plot in more detail?  I can’t see how you get the interpretation that annual pass holders prefer downhill to uphill, since the plot doesn’t show the number of journeys in each direction.

    • Jonathan Callahan says:

      The data for the plot are calculated like this:

      • Assign an elevationDiff to every trip based on the elevations of the starting and ending stations
      • Assign every trip to a minute-of-the-day bin based on startingTime
      • For each minute-of-the-day, sum up the elevationDiffs in that bin
      • Divide by 550 (# of bikes) and 365 (# of days) to normalize

      The result is the “Minute-to-Minute Annually Averaged Elevation Difference per Bike per Day”.

      The sign of the elevationDiff is what keeps track of uphill/downhill (+/-).

      A different analysis might just count up the number of uphill vs. downhill trips without taking into account elevation gain/loss. but I’m not sure what the take home message from that analysis would be. As any cyclist knows, grade matters, so the total elevation gain or loss is very important.