On the left we have zero, our integer measure of nothingness. On the right we have missing value, aka N/A, aka NA, our signal that the value of a datapoint is unknown. Everyone who deals with data has to deal with this important distinction. And far too often people get it wrong.
The importance of storing ‘missing’
One of the most important components of a data management strategy is the proper handling of missing values.
Every data collection effort is essentially an attempt to answer a set of questions of this form:
What was the value of _______ at such-and-such time and place?
A climatologist will store a numeric value in a particular set of units like temperature in degrees C:
What was the high temperature in degrees C in Saint. Louis on Jul 20, 2000?
A census worker will store a string of characters that represent an identifier like a name:
What was the name of the head of household at this address on April 3, 2010?
As in all human endeavors, success is not 100% guaranteed. So we must be ready to deal with the possibility that our data collection effort comes up short. Perhaps the thermometer in St. Louis malfunctioned on our date of interest. In this case the correct answer to the question “What was the temperature?” would be “We don’t know.” In the past , and disappointingly continuing into the present, some people would enter
0 (the number zero) when a datum was missing. Anyone examining the data at some later date would be quite surprised at the sudden cold snap that hit St. Louis on July 20, 2000!
And what about our census worker? If no one is at home should they enter into the NAME field the string: ”not at home”. That would be an unusual name. Perhaps there is a spelling error and they meant “Notatome” which is a valid surname. Or perhaps they combined typos with a first-last switch and are really identifying “Tome Nota” — a Portugese language Christian youth ministry. We can’t really know for sure unless we have a rock solid standard for storing the meaning “we don’t know” or “not measured” or “missing”.
Storing missing values in spreadsheets
Spreadsheet software like Microsoft Excel and OpenOffice Calc is ubiquitous. Given the prevalence of PC’s in the government and business world, most people have Excel on their desktop and many small datasets are generated in Excel and disseminated as Excel files. It is important, therefore, to learn how to use the features of Excel to properly encode the meaning “missing value” whenever necessary.
One of the reasons spreadsheet software is so popular is that it allows users to see the data they are working with, giving them a sense of control over the data. “Seeing is believing” and most spreadsheet users seem to trust their own eyes more than they trust their software tools. Although we agree that human visual perception is an incredible tool for hunting ripe blackberries, we would argue that it is not evolutionarily suited to validating numbers and text. One common example of the problem with human validation of numbers and text is our inability to tell the difference between an empty cell and one containing a whitespace character such as Unicode character ‘SPACE’ (U+0020). According to your eyes, there is no difference between the two. But for your computer there is.
This may not seem like a problem if you are generating charts inside of Excel and can clean up the data yourself when you run into problems. Things get trickier, however, when these data are merged with other data and further processed by people who have no specific knowledge of the dataset. If a column of unvalidated data containing the SPACE character is assumed by downstream software to be numeric, a data type error will typically result. If, instead, the data column is assumed to contain character strings, no error will be detected and the SPACE character will be used as a valid string. The burden of checking for cells containing only whitespace characters and then converting these to missing values falls to the programmers working with the merged dataset. It’s no wonder, then, that merged database projects so often go over budget when the incoming data is poorly validated.
One of the problems in using spreadsheet software to work with data is that the default data validation settings on these tools are too forgiving. By default, data entered into a cell can be of any type — string, float, decimal, date, etc. One of the most important things an Excel user can do is to harness the built in validation tools every time new data are added. Microsoft provides extensive documention on using and customizing data validation within Excel and users should become familiar with these features. Users can validate not only the data type but can set up valid ranges that alert users to problematic values when they are entered.
Getting back to our original theme, missing values in Excel are identified by cells that are truly empty. To understand what this means you have to understand the different cell types used by Excel. Internally, Excel recognizes the following types of cells:
Everything makes sense right up through BOOLEAN. The ERROR type is used when there is an internal Excel error message associated with the value of a cell — presumably due to problems with a formula. But what about BLANK? Well, BLANK cells are those that have formatting information but no stored value. From the data manager’s point of view they should be the same as EMPTY.
Unfortunately, older versions of Excel had a different idea. A Microsoft support article for Excel versions prior to 2002 has the following:
In Microsoft Excel, when you use a formula that tests for a zero value, you may see unexpected results if the cell is blank. Microsoft Excel interprets a blank cell as zero, and not as empty or blank. Therefore, any cells that are blank are evaluated as zero in the function.
Ahhh, spreadsheets. If you have to use them make sure that you have a recent version and that you learn to use the data validation tools available.
Storing missing values in CSV files
We are quite fond of CSV files as a lowest common demoninator format for data exchange. CSV files can be read by just about any software out there and greatly reduce the effort involved in reformatting data. Many datasets are inherently tabular in nature and it is straightforward to export Excel worksheets as CSV files. An earlier post described how to automatically convert an entire Acess database into a set of CSV files. The only problem with CSV as a standard is that it isn’t really standardized. So let us evaluate the options for storing the meaning “missing” in a CSV file.
In the bad old days, which are still with us in some areas, lots of different identifiers were used to store the idea of “missing value”. Here are a few we have encountered:
The hyphen is a convenient note to make on a paper spreadsheet to identify something like: “Measurement attempted but no value obtained”. However, it is not widely understood by analysis software.
Do this under penalty of death!!!
A perennial favorite, this is a valid numeric value and only conveys the meaning “missing” if one is intimately familiar with the valid domain of the data.
- “-9999″, “-999″, “9999″, “999999″, “9999999″
Yes, we’ve seen them all. Some sites use six 9′s and seven 9′s as separate missing value flags to identify which technician was responsible for working up the sample associated with each cell. If you ever expect your data to be used by anyone other than yourself please, please don’t do this.
This designator for “Not Available” is quite human readable but is not commonly understood by analysis software.
This designator is used in floating point computations and is not really appropriate for data that might be stored in spreadsheets or CSV files. For full details see the Wikipedia article on NaN.
The blank string of zero length is an excellent designator of “nothing here”. The only downside is that it is not explicit and some tools will not automatically populate cells with their own internal designator for “missing value” without some coaching.
This designator is widely understood as the ASCII representation of “missing value” and is the default in many packages including our personal favorite — the R statistical analysis package. Even here there is a problem, though, as “NA” is also the ISO 3066-1 identifier for Namibia. We have a rule of avoiding Namibian conflict whenever possible which brings us to our own preferred designator for “missing value”:
The R package, and presumably others, can be coached to accept this as the default missing value string when reading in CSV files and this designator avoids the problems mentioned above.
The bottom line is that we recommend either “na” or “NA” as the missing value flag for CSV files. Anything else is inviting trouble and expense for everyone downstream.
Overcorrecting — Storing ‘missing’ when you really mean zero
Have you ever heard anyone say “… for you and I …”? A lot of people have been taught that they overuse “me” and then overcorrect by using “I” even when the pronoun is the object of a verb or preposition. Well, the same thing happens with zero vs. missing.
There are times when the correct answer is zero and not “we don’t know.” Overcorrecting by storing the missing value deisgnator instead of zero can also cause problems for anyone working with the data.
Two examples should get the idea across:
The USGS DataSeries 140 is a fantastic, century-long history of production and use data for a long list of minerals. The lead-use Excel file from this dataset allows one to track the year-to-year evolution of lead use in different categories, one per column. The column labeled “Gasoline additives” decreases gradually from a high of 218,000 metric tons in 1976 (ouch!) to 4,000 metric tons in 1994. The cells for years after 1994 are of type EMPTY.
As we read above, cells of type EMPTY are to be interpreted as “we don’t know”. So, do the folks at USGS really not know how much lead was used for gasoline additives in say … 2002? Hardly. The Notes worksheet in the file states quite clearly:
From 1987 to 1995, data reported on gasoline additives depicts a progressive phaseout. The U.S. Environmental Protection Agency issued a direct final rule that prohibits the use of chemical additives in gasoline for highway vehicles, effective January 1, 1996.
So the amount of lead used for gasoline additives, at least post 1995, is in fact zero. And that is what should be stored in the spreadsheet, not the Excel designator for “we don’t know”. Anyone trying to create a pie plot showing percentage usage in different categories is stymied when the amount in one of the categories is “unknown”. No category value => no total amount => no percentages. It would be much better for downstream users of the data if zero were stored in these cells.
Even folks who work at the highest levels of government, data-driven science with some of the best software tools available for data handling and analysis are subject to this error. The International Combined Ocean Atmosphere Data Set (ICOADS) is an extremely high value dataset that contains ocean observations of climate variables like air and sea surface temperatures, wind speed and pressure. With this dataset one can create maps depicting the ocean surface temperature for any month going all the way back to January 1800. (Don’t expect to see much data before 1900, though.)
Along with the measured variables this dataset also includes variables containing the number of observations that were used to create the monthly average in a particular grid cell. The idea here is that you might have more faith in a temperature value if the lat-lon box for a particular month contained many observations as opposed to just one or two. So each NUMOBS_SST value is really an answer to the question:
How many observations of sea surface temperature were used to generate the SST value stored in the associated space-time grid cell?
The answer should never be “we don’t know”. But that is in fact what is stored. The NUMOBS_SST field has lots of missing values and a few “1″s in January of 1854. To our way of thinking it should contain a lot of “0″s and a few “1″s.
Perhaps we are nitpicking here. Users of this data do not seem unhappy with the data files as they are. We are merely using this as an example to highlight the difference between zero and missing in the hopes of fostering better data management in all fields.
Here’s to honest reporting of “we don’t know” and to not letting your zeroes go missing.