Data producers vs. data consumers

In the marketplace, the needs of producers and consumers are often at odds: producers want higher prices, consumers lower ones; producers want easy assembly, consumers easy disassembly; producers want flexibility and rapid prototyping, consumers reliability and long-term support.

The same competing needs exist in scientific data management, where producers and consumers of data often operate in very different worlds with very different sets of tools.

Although examples could be drawn from any field of science, climate and environmental data show especially clearly how the world views of data producers and data consumers differ.

Synoptic vs. Time-series

Weather data are collected every hour and fed into data ingest models that output synoptic fields: descriptions of the state of the atmosphere at a specific time but on a broad spatial scale.  These fields are used as input to forecasting models that compare the most recent field with earlier fields and calculate a forecast of the state of the atmosphere at specific times in the future.  The organizing principle for data input and output files is xy-region by time point.

Climate models work the same way, calculating the state of the global climate one time point at a time.  The output of these models, even when stored as multi-dimensional NetCDF files, is typically organized as a series of snapshots at specific times.  This series of snapshots is the world view of these data producers.

Data consumers come in many varieties, of course.  Some are interested in generating maps of weather or climate at some future time or date:  the “map users”.  For these users, the snapshot world view is an excellent fit.

But what about someone who is interested in looking at a time-series representation of the daily weather or monthly climate at a particular location?  To assemble the data for this representation, our “time-series user” must open and read each (potentially multi-gigabyte) snapshot in order to extract a single value at their location of interest.  If each snapshot is on the order of a gigabyte, a time series of 1,000 points may require processing a terabyte of data.  Clearly the “time-series user” would prefer that the data be organized as “time-series by location”.  After opening the file for their location of interest, they would simply read all of the data.
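
The difference between the two layouts can be sketched in a few lines of Python.  This is only an illustration under simplifying assumptions: small nested lists stand in for the large snapshot files, and the function names and grid values are hypothetical, not taken from any particular model or library.

```python
# Hypothetical stand-in for model output: each snapshot is a small grid of
# values at one time step. Real snapshots would be large files on disk.

def make_snapshot(t, n_rows=3, n_cols=4):
    """Build a toy synoptic field for time step t (illustrative values only)."""
    return [[t * 100 + r * 10 + c for c in range(n_cols)] for r in range(n_rows)]


def extract_series(snapshots, row, col):
    """Time-series user's task today: touch every snapshot to get one value each."""
    return [grid[row][col] for grid in snapshots]


def reorganize_by_location(snapshots):
    """One-time regrouping from 'grid by time' into 'time series by location'."""
    series = {}
    for grid in snapshots:
        for r, row_values in enumerate(grid):
            for c, value in enumerate(row_values):
                series.setdefault((r, c), []).append(value)
    return series


snapshots = [make_snapshot(t) for t in range(5)]
by_location = reorganize_by_location(snapshots)

# Both routes give the same series; the second needs only one lookup
# once the data manager has done the regrouping.
print(extract_series(snapshots, 1, 2))
print(by_location[(1, 2)])
```

If the regrouping is done once by the data manager, every time-series user reads only the values for their own location instead of scanning every snapshot.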

Time-series vs. Synoptic

In the world of environmental science, the reverse scenario is often true.  Data are typically collected and organized by location through the use of a unique “Station ID”.  Samples taken at one location in different years have different timestamps but the same “Station ID”.  Any data consumer wishing to generate a synoptic view of the data (all stations for a particular year) must reorganize the data in order to generate the maps or other broad-scale representations they desire.
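
The reverse reorganization can be sketched the same way.  The station IDs, years and values here are made up for illustration, and `synoptic_view` is a hypothetical helper rather than part of any existing package.

```python
from collections import defaultdict

# Hypothetical records organized the way they were collected: by station.
# Each station maps to a list of (year, measured value) samples.
station_records = {
    "ST001": [(2001, 4.2), (2002, 4.8), (2003, 5.1)],
    "ST002": [(2001, 7.0), (2003, 6.4)],
    "ST003": [(2002, 3.3), (2003, 3.9)],
}


def synoptic_view(records):
    """Regroup {station: [(year, value), ...]} into {year: {station: value}}."""
    by_year = defaultdict(dict)
    for station_id, samples in records.items():
        for year, value in samples:
            by_year[year][station_id] = value
    return dict(by_year)


by_year = synoptic_view(station_records)

# All stations that reported in 2003, ready to be placed on a map.
print(by_year[2003])
```

Stations with no sample in a given year are simply absent from that year's view, which a map maker would need to handle as missing data.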

Serving Data Consumers

To our way of thinking, scientific data management should be about meeting the needs of the data consumers: scientists, policy makers and engaged members of the public.  In order for science to inform public policy, the process of working with scientific data must be made easier.  A tremendous amount of time and effort is spent reformatting data for use with specific analysis tools, and an equally tremendous amount of subtlety and detail is lost with each reformatting.  It is up to the data managers to make sure that data are made available in structures and formats that help the ultimate users.  Sometimes this means doing things in a less-than-cutting-edge manner.

Making data available via the latest XML-WSDL-web-service frameworks may fit with software engineering best practices but these data are unlikely to be useful to biologists, environmental consultants, geologists, hospital administrators, petroleum engineers, physicists or anyone else without a computer science degree.

Expecting these people to have access to computer staff who can help them is often a very poor assumption.  The chances that they will write Java, C or Python code to work with the data are slim.  The chances that they will reach for their favorite trusted analysis and visualization package (sadly, sometimes only Microsoft Excel) are high.  If their favorite package does not support a particular data format, those data are essentially unavailable to them.  (We are not recommending abandoning modern, information-age approaches to data delivery, only supplementing them with formats that are accessible to the huge number of intelligent individuals still working with bronze-age tools.)

Data Consumer Checklist

We provide the following checklist in the hope that it will inspire data managers to step out of their data producer and software engineering world views and think about what would be most useful to those at the other end of the data pipeline.

  1. Identify one or more groups of data consumers: people who want to do analysis and visualization with the data.
  2. Identify which software tools they use: statistical packages like R, S+, Statistica, etc.; multi-dimensional engines like Matlab, IDL, Octave, etc.; spreadsheets like Microsoft Excel or OpenOffice; specialized software for a particular community of practice.
  3. Identify any standards (formats, metadata conventions, variable names) that exist within a particular community of practice.
  4. Determine how they want to work with the data, e.g. synoptic vs. time-series.
  5. Seek out a representative from the identified users who will work as a guinea pig to test the data formats you create.
  6. Be prepared to offer the same data in multiple formats to satisfy the needs of different groups of consumers.
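
As a rough illustration of the last item, the sketch below writes the same small set of records as both CSV (for spreadsheet users) and JSON (for programmatic users) using only the Python standard library.  The records and field names are hypothetical; a real project would follow whatever conventions its community of practice has settled on.

```python
import csv
import io
import json

# Hypothetical records; the field names are illustrative only.
records = [
    {"station_id": "ST001", "year": 2003, "value": 5.1},
    {"station_id": "ST002", "year": 2003, "value": 6.4},
]


def to_csv(rows):
    """Spreadsheet-friendly text: one header row, then one line per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["station_id", "year", "value"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """The same records for consumers who work with code or web tools."""
    return json.dumps(rows, indent=2)


csv_text = to_csv(records)
json_text = to_json(records)
print(csv_text)
print(json_text)
```

Keeping a single source of records and generating each format from it avoids the loss of detail that comes from converting one delivery format into another.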

In the end, good scientific data management is about increasing the efficiency of the data consumers by anticipating and then meeting their needs.  If all goes well, our efforts at data management will scale terrifically, as every hour we spend making data more useful will be multiplied by the number of data consumers who no longer have to do this work.

We hope the current administration sees fit to support better data management within agencies with the same enthusiasm with which it supports improved metadata management at sites like Data.gov.

Finding data is a wonderful thing.  But being able to actually use data is equally important.


One Response to Data producers vs. data consumers

  1. The following appeared in a personal email and the author wishes to remain anonymous.

    When you write: “Making data available via the latest XML-WSDL-web-service frameworks may fit with software engineering best practices but these data are unlikely to be useful to biologists, [etc]” I think you are missing the point. I just spent some years in bioinformatics and I strongly suspect that it is the biologists who embraced data-as-a-service and that they did it for the following reason: because it gives them total control over what gets downloaded and how much. Through limited or inconvenient web interfaces or APIs, through outright restrictions at the back-end, they get to keep most of their data private while also getting to pretend they are making it public: they obey the letter while evading the spirit of data openness. Meanwhile the geeks, as usual, get blamed for the machinations of other people.

    This is a very important perspective and demonstrates the different experiences in different areas of science. Certainly, anyone who works in fields related to Big Pharma has an interest in controlling who gains access to their data.

    The term ‘biologists’ as used in the original post is far too broad to be meaningful. I had intended to refer to people who work in fields like environmental assessment or ecology and spend much of their time out-of-doors. Many of these individuals have computer skills that begin and end within the Microsoft Office suite of tools. In my experience, people working in these fields are ready to share their data but lack the programming skills to do so. Too often, they hand their data over to computer science graduates who don’t know the first thing about the science at hand or how the data will ultimately be used.

    It would be interesting to hear from folks in other disciplines about the technological or social barriers to improved access and utility of ‘public’ data.