Despite what they say, size does matter.
Successful data management is all about finding the proper tools and formats for dealing with your data. There is no one-size-fits-all solution. And the very first question you should be asking yourself is: “How much data are we talking about?”
It’s all relative
Most people are familiar with data and data management techniques from within their own field of study. Whether one has a large dataset or not is therefore a relative question. A dataset is considered large or small relative to some other collection of data. But the tools for dealing with data — hardware and software — are constantly improving and what may have been considered a ‘large’ amount of data a few years ago may no longer be so big. Consequently, the appropriate tools for dealing with your dataset may be changing as well. With RAM selling for under $50/GB, many datasets are starting to look a lot smaller.
We’ll begin our examination of data volumes by putting various datasets and databases on a logarithmic scale just to get a sense of their relative size. Remember, each tick mark represents a factor of 10 increase in size. (References are included in the list of links at the end of this article.)
Data volumes range over more than twelve orders of magnitude! Where does your dataset fit in?
Clearly, datasets in the single megabyte range (gray) are at the insignificant end of the scale. At about the 100 megabyte scale (green) it becomes important to have a plan for how to manage the data with an eye on potential software limitations. Still, it should be smooth sailing with respect to hardware until you start handling hundreds of gigabytes of data. At that point, hardware and software limitations will both impact your decision making.
From 100 gigabytes to perhaps 10 terabytes (yellow) you can still buy off-the-shelf components that will store that much data, but data managers must proceed with caution. Anything above 10 terabytes (orange) requires carefully designed, networked storage devices. This is the realm where computer scientists need to be part of your team. Projects that involve storing more than a petabyte of data (red) are at the cutting edge of what we are planning for the next decade.
For a more personal comparison, our recently purchased, plain vanilla iMac came with 4 gigabytes of RAM and one terabyte of disk. That would allow us to download and play with some of the largest datasets of climate measurements (as opposed to model output).
Megabyte sized (small) datasets
Several of the projects we have worked on involve data collections at the small end of the scale. Any dataset that involves humans in the collection, data entry, processing or validation of individual data points will always be under 100 megabytes. This is the area that contains what we like to refer to as “high value datasets” — those datasets that have actual measurements made by humans as opposed to model output generated by computers or streams of data generated by automated sensing devices.
Because these small datasets are not even as big as the available RAM on most machines, there is no requirement to store them in any particular compact format or access them with any particular software. You should always keep in mind that reading from and writing to disk are by far the slowest operations on your computer. Once data are read into RAM, any filtering, subsetting, and processing of the data should be lightning fast.
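To make the point concrete, here is a small sketch of the read-once-then-work-in-RAM pattern. The CSV is generated in memory purely for illustration (the column names and sizes are made up); any real file under the 100 megabyte mark behaves the same way.

```python
import csv
import io

# Build a synthetic 100,000-row CSV in memory. This stands in for a
# small dataset file on disk; the schema (site, year, value) is invented.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["site", "year", "value"])
for i in range(100_000):
    writer.writerow([f"site_{i % 50}", 1990 + i % 30, i * 0.1])

# One (slow) pass reads everything into RAM...
buf.seek(0)
rows = list(csv.DictReader(buf))

# ...after which filtering and subsetting are fast in-memory operations.
subset = [r for r in rows if r["site"] == "site_7" and int(r["year"]) > 2010]
print(len(rows), len(subset))
```

The brute-force list comprehension is exactly the kind of "simple and easy and flexible and fast" approach that suffices at this scale; no database or index is needed.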
When planning data management, we always subscribe to Einstein’s philosophy of: “As simple as possible, but not simpler.” Whatever is simple and easy and flexible and fast is the right choice for working with small datasets. Too many data management applications get bogged down in the complexity of using yesterday’s sexy computer science tools even though they are completely unjustified by the data volumes. Often it would be much more cost effective to simply buy some more memory and then use a brute force approach to filtering and subsetting. Computer memory is much cheaper than the human memory required to keep a complex system working.
For datasets of this size we often recommend storing the data in one or more simple CSV files or an SQLite database if no other solution is preordained.
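When a dataset has a little relational structure, SQLite is attractive because the whole database lives in a single file with no server to administer. A minimal sketch, using an invented "measurements" table and an in-memory database for illustration (a real project would pass a filename instead of ":memory:"):

```python
import sqlite3

# Hypothetical schema for a small set of field measurements.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (site TEXT, year INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [("A", 2009, 1.5), ("A", 2010, 2.0), ("B", 2010, 3.5)],
)
conn.commit()

# SQL provides filtering, aggregation, and joins with no extra code.
rows = conn.execute(
    "SELECT site, AVG(value) FROM measurements GROUP BY site ORDER BY site"
).fetchall()
print(rows)  # [('A', 1.75), ('B', 3.5)]
```

Either choice — flat CSV or a single SQLite file — keeps the data portable and readable by nearly any tool a collaborator is likely to use.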
Gigabyte sized (big) datasets
Once you start working with datasets that are a gigabyte in size and larger you will need to consider very carefully two components: 1) what software tools are used to access and analyze the data; and 2) what data format(s) the data should be stored in. These are not independent choices as a particular set of tools usually relies upon the data being available in a particular format.
Sometimes you must support formats required by a particular piece of software already used by the community of practice working with the data. Other times you may be at liberty to make recommendations. Our philosophy of “as simple as possible” is meant to apply both to the team in charge of managing the data and to the end users of the data. It is therefore very important to interview those who expect interactive access to the data to find out how they intend to use it: what tools they use, what kind of subsetting or querying is required, what kind of interactive access is expected, etc. Hopefully, certain themes will arise from these questions that will help guide your choices.
At this scale, the structure of your data will also have a lot to say about what options are available to you. If you have regularly gridded data, the NetCDF format, widely used in the climate data community, may be appropriate, while relational data will need to be stored in an RDBMS like MySQL or PostgreSQL.
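The reason gridded formats pay off at this scale can be sketched with nothing but the standard library. A regular grid needs no per-point coordinates: the dimensions are stored once and the values follow as a flat binary block. (This is only the core idea — real NetCDF adds named dimensions, units, metadata, and a portable encoding on top of it. The grid below is tiny made-up data.)

```python
import array
import struct

# A hypothetical 4 x 5 grid of temperature values.
nlat, nlon = 4, 5
values = [20.0 + 0.1 * i for i in range(nlat * nlon)]

grid = array.array("f", values)          # 4 bytes per value, no per-point labels
header = struct.pack("ii", nlat, nlon)   # dimensions stored exactly once
blob = header + grid.tobytes()

print(len(blob))  # 8 header bytes + 20 values * 4 bytes = 88
```

Compare that with a CSV or relational layout, where every row would repeat its own latitude and longitude; on a gigabyte-scale grid that repetition is most of the file.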
Terabyte sized and bigger (huge)
We have never worked with datasets larger than about a terabyte. In this size class, data management of necessity becomes its own separate activity with the associated specialists and funding. Usually, datasets this large consist of raw data that is important to keep but may not be what is needed for higher level analysis. It is often possible to generate and store partially processed versions of the raw data which reduce the data volume by several orders of magnitude and allow much simpler data management solutions. An example of this approach applied to genomic data is found in an excellent article from 2008: How much data is a human genome? It depends how you store it.
The executive summary describes the approach:
For those who don’t want to read through the tedious details that follow, here’s the take-home message: if you want to store the data in a raw format for later re-analysis, you’re looking at between 2 and 30 terabytes (one terabyte = 1,000 gigabytes). A much more user-friendly format, though, would be as a file containing each and every DNA letter in your genome, which would take up around 1.5 gigabytes (small enough for three genomes to fit on a standard data DVD). Finally, if you have very accurate sequence data and access to a high-quality reference genome you can squeeze your sequence down to around 20 megabytes.
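The article's third option — squeezing a genome down to megabytes — works by storing only the differences against a reference. The same diff-against-a-reference idea can be sketched in a few lines; the two "sequences" below are tiny invented strings, not real genome data, and real tools handle insertions, deletions, and quality scores as well.

```python
# Made-up reference and sample sequences for illustration only.
reference = "ACGTACGTACGTACGTACGT"
sample    = "ACGTACCTACGTACGAACGT"

# Store (position, new_letter) for each mismatch against the reference.
diffs = [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]
print(diffs)  # [(6, 'C'), (15, 'A')]

# The sample is fully recoverable from the reference plus the diff list.
rebuilt = list(reference)
for pos, letter in diffs:
    rebuilt[pos] = letter
assert "".join(rebuilt) == sample
```

Two tuples stand in for twenty letters here; at genome scale the same trade turns terabytes of raw reads into tens of megabytes, at the cost of requiring everyone to share the reference.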
Clearly, intelligent data management for very large datasets involves a lot of decisions about which users the data is being managed for. One set of users will require access to the raw data and be willing to wait for it, while at the other end of the spectrum are users who want instant access to summary information. As we said at the beginning, there is no one-size-fits-all solution, and good data management is about identifying different classes of users and finding the right solutions for each class.
The following list of links provides a little more detail on the data mentioned in the graphic above and a few interesting posts on working with very large data volumes:
- British Petroleum Statistical Review (BP Stat Review)
- Wadeable Streams Assessment (EPA WSA)
- Historical Statistics for Mineral and Material Commodities in the United States (USGS DS 140)
- How much data is a human genome? It depends how you store it. (2008)
- International Comprehensive Ocean-Atmosphere Data Set
- Earth System Grid
- Man behind MasterCard’s 100-terabyte data warehouse (2008)
- CERN – LHC Computing
- Sloan Digital Sky Survey
- The Petabyte Problem: Scrubbing, Curating and Publishing Big Data (2008)
- Petabytes on a budget: How to build cheap cloud storage (2009)