A recent post in the NY Times alerted us to a potentially interesting government web site. In a continuation of its policy of transparency and science-based decision making, the Obama administration recently opened Data.gov with the following mission statement:
The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government.
The idea of spending tax dollars to improve access and machine readability of data is near and dear to our hearts. So let’s take a closer look and see how well they are doing.
On May 25, one of the featured datasets was the “Residential Energy Consumption Survey (RECS)” generated by the Energy Information Administration. The Data.gov site helpfully includes the full 10.3 MB dataset and a smaller subset of particular interest — the “Consumption” portion which weighs in at only 921 KB. We click on the link in hopeful anticipation and arrive at the following metadata page:
http://www.data.gov/details/59
We generally disapprove of the hypercritical nature of scientific commentary, where all attention is focused on the negative, so let us first draw attention to what is good about this metadata page.
- It has the simplified Dublin Core metadata fields as well as several others. Standardized metadata fields are important when building search tools on top of metadata catalogs.
- Standard data formats include Comma Separated Values (CSV). ASCII CSV is the lowest common denominator for data sharing. The value of CSV lies in the fact that it is an essentially brain-dead format that makes no assumptions of any kind about the software that will be used to process it (see the short sketch after this list). Wherever reasonable, and we intentionally leave ‘reasonable’ vaguely defined, the government should offer up data as CSV files in addition to whatever other formats might be used within any associated community of practice.
- Links are provided to the original data files, an associated data dictionary and additional technical and other metadata.
- It looks attractive and uncluttered.
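To make the point about CSV concrete, here is a minimal sketch showing that nothing beyond the Python standard library is needed to read such a file. The filename is hypothetical, and “KWH” is one of the columns in the RECS Consumption file examined below.

```python
# A minimal sketch: reading a CSV file using nothing but the Python standard
# library. The filename is hypothetical; "KWH" is a column from the RECS file.
import csv

with open("RECS_consumption.csv", newline="") as f:
    reader = csv.DictReader(f)        # the first row supplies the column names
    for row in reader:
        print(row["KWH"])             # every value arrives as a plain string
```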
Well, so much for the good.
Despite our best efforts to avoid criticism, we have always taken the side of the little boy in “The Emperor’s New Clothes”. In order to move forward toward publicly useful public data it is important to point out where this new government effort falls short.
Let us begin at the beginning:
Is there an organizing principle behind the URLs?
A human can of course read the Data.gov web page, learn about the RECS Consumption dataset and manually click on the associated link. But can an interested programmer use any predefined scheme to generate a RESTful URL to this metadata page, perhaps something like “/EIA/RECS/Consumption/”? No, the data management masters at Data.gov have chosen (i.e. have allowed to be chosen for them) the less than informative “/details/59/”. Who needs an organizing principle?
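For illustration, a predictable scheme could be as simple as the following sketch. The URL pattern and the function are entirely our own invention; nothing of the sort exists on Data.gov today.

```python
# A hypothetical RESTful URL scheme built from dataset attributes.
# Neither the pattern nor the function exists on Data.gov; this is only
# the kind of predictability we would like to see.
def metadata_url(agency, program, subset):
    """Build a predictable metadata URL from agency, program and subset names."""
    return "http://www.data.gov/{}/{}/{}/".format(agency, program, subset)

print(metadata_url("EIA", "RECS", "Consumption"))
# http://www.data.gov/EIA/RECS/Consumption/
```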
Why is spell checking so hard?
It does not involve a superhuman effort to run blocks of text against a spell checker like aspell or hunspell. At a minimum, one should check that the characters in blocks of text are ASCII, or at least UTF-8, printable characters. That would keep typos like “the Census Bureau<81>fs statistical estimate” from appearing in the metadata. (The stray U+0081 control character is often displayed as a question mark within a diamond: �, which is what you may see.)
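A few lines of code are enough to catch this class of error before publication. The sketch below simply flags any character that is not printable ASCII; the helper name is ours.

```python
# Flag characters in a block of metadata text that are not printable ASCII.
# A hypothetical helper -- the sort of sanity check that should run before
# any description is published.
def suspect_characters(text):
    """Return (position, character) pairs for non-printable or non-ASCII characters."""
    return [(i, ch) for i, ch in enumerate(text)
            if not (ch.isprintable() and ord(ch) < 128)]

print(suspect_characters("the Census Bureau\x81fs statistical estimate"))
# [(17, '\x81')] -- the stray control character behind the garbled text above
```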
Can we please have a minimum standard for CSV files?
Let’s take a closer look at this Consumption data file the government has drawn our attention to and see where it can be improved. A small chunk is given below:
```
SURVEY,DOEID,REGIONC,DIVISION,LRGSTATE,TYPEHUQ,MQRESULT,NWEIGHT,HD65,CD65,KWH,BTUEL,CUFEETNG,BTUNG,GALLONFO,BTUFO, ...
2005,1,3,7,3,2,9,25677.9652467,1231,3281,13459,45922,365,37559,999999,9999999, ...
2005,2,4,9,2,2,1,24261.8102616,1663,1123,13051,44529,227,23320,999999,9999999, ...
2005,3,2,3,0,2,9,31806.2950159,5221,1286,19464,66411,682,70178,999999,9999999, ...
2005,4,4,9,0,3,9,22345.3974913,5261,667,28635,97703,999999,9999999,999999,9999999, ...
2005,5,3,6,0,2,9,18842.4554202,4392,1238,28658,97781,686,70589,999999,9999999, ...
2005,6,1,2,0,2,9,5665.8754185,5162,1236,13212,45079,1340,137865,999999,9999999, ...
```
- Column Names
It would be nice if the file were self-documenting, but at least a human can return to the metadata page and access the data dictionary to figure out what the column headers mean. You may have noticed that the column names are in all caps and have a maximum length of eight characters. These are the hallmarks of a FORTRAN data processing system from the 1970’s. Even if these cryptic names are embedded in code within the EIA, there is no reason in the world to make outside users suffer. Any software written in this millennium should have no difficulty with the actual names from the data dictionary. The following header line would be a vast improvement:

```
"Year Survey Conducted","4-digit identification number","Census Region","Census Division", ...
```
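If the cryptic names really must live on inside the EIA, a small translation step could still spare everyone else. The sketch below assumes a hand-built mapping from the data dictionary; the filenames and the mapping (shown only for the first few columns) are ours.

```python
# Rewrite the header row using descriptive names from the data dictionary.
# The mapping and filenames are hypothetical and cover only the first columns.
import csv

descriptive_names = {
    "SURVEY":   "Year Survey Conducted",
    "DOEID":    "4-digit identification number",
    "REGIONC":  "Census Region",
    "DIVISION": "Census Division",
    # ... remaining columns from the data dictionary
}

with open("RECS_consumption.csv", newline="") as src, \
     open("RECS_consumption_readable.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader)                                   # cryptic FORTRAN-era names
    writer.writerow([descriptive_names.get(h, h) for h in header])
    writer.writerows(reader)                                # data rows pass through untouched
```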
- Identifiers vs. Integers
There is a big difference between an identifier and an integer. James Bond had a not-so-secret identity: “007”. Yet Sean Connery was 32 when he appeared in Dr. No. Not a single 7-year-old was auditioned for the part. When working with data it is important to distinguish between identifiers, which are strings of characters, and numbers, which measure something on a linear scale. Column 2 of our dataset is the “4-digit identification number”. (The dataset creators are apparently confused on this distinction. The column should be named “4-digit identifier”.) AND it should have four digits. AND it should be enclosed in quotes (more on that below).
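In code, the distinction is simply a matter of type. A short sketch, using the first DOEID from the sample above:

```python
# Treat DOEID as an identifier (a string), not a number.
doeid_as_found = "1"                  # how it appears in the file today
identifier = doeid_as_found.zfill(4)  # "0001" -- four digits, as the name promises
print(identifier, type(identifier))   # 0001 <class 'str'>

# Arithmetic on an identifier is meaningless, and a string makes that explicit:
# identifier + 1 raises a TypeError, and rightly so.
```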
- Significant Figures
Despite the fact that significant figures are taught in freshman physics and chemistry, the concept seems to be far from the minds of those responsible for data management. The “NWEIGHT” column (given the informative name “The Final Weight” in the data dictionary) is apparently a statistical weighting applied to the survey region. The data providers can no doubt calculate this weighting to arbitrary precision, but do they really want to report it to 12 significant figures?
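Trimming a value to a defensible precision is a one-liner. In the sketch below, three significant figures is our own guess at what a survey weighting can actually support.

```python
# Round a value to a given number of significant figures.
def round_sig(x, figures=3):
    """Round x to the given number of significant figures."""
    return float(f"{x:.{figures}g}")

nweight = 25677.9652467            # as reported in the file
print(round_sig(nweight, 3))       # 25700.0
```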
- Missing Value Flags
The good news is that they understand the difference between “numeric zero” and “Not Available (NA)”. The bad news is that they are using physically possible numeric values as the missing value flag. This dataset actually uses two different missing value flags: “999999” (six nines) and “9999999” (seven nines). This is sometimes done to identify which individual or group is responsible for quality control of a particular variable. Like the FORTRAN column names, however, this is only confusing to the final consumer of the data. More dangerous is the possibility for the missing value flag to be interpreted as a real value: the “BTUNG” column has values that are less than two orders of magnitude away from the “9999999” missing value flag. Using even remotely possible values as a missing value flag is a big no-no. Once again, any worthwhile software for analyzing this data should be able to handle the relatively standard “NA” missing value flag, avoiding any possibility for confusion (see the sketch below).
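Until the files themselves improve, the sentinel values at least have to be declared on the way in. A sketch using pandas, assuming that library is acceptable and that the filename is the hypothetical one used above:

```python
# Convert the "all nines" sentinels to proper NA values while reading.
import pandas as pd

consumption = pd.read_csv("RECS_consumption.csv",
                          na_values=["999999", "9999999"])
print(consumption["BTUNG"].isna().sum())   # rows where natural gas use is simply unknown
```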
- Strings in Quotes
Both the “Identifiers vs. Integers” and “Missing Value Flags” issues call attention to the need for strings to be quoted within CSV files. Rigorous attention to this small detail would prevent a huge amount of misunderstanding and wasted effort. This is probably the single easiest change that could be made to render files like the RECS Consumption data more machine readable.
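The standard library already makes this trivial for anyone producing CSV files. A sketch, with made-up rows, showing every non-numeric field quoted on output:

```python
# Quote every field that is not truly a number when writing CSV.
import csv

rows = [
    ["Year Survey Conducted", "4-digit identifier", "Final Weight"],
    [2005, "0001", 25700.0],
    [2005, "0002", 24300.0],
]

with open("consumption_quoted.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerows(rows)

# Resulting file:
# "Year Survey Conducted","4-digit identifier","Final Weight"
# 2005,"0001",25700.0
# 2005,"0002",24300.0
```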
Most of our criticism up to this point has been leveled at the dataset itself. And the designers of the Data.gov system can rightfully say: “That’s not our problem — we’re just pointing to a dataset over at the EIA.” To which the only appropriate response is: “Yes, and that is the entire problem.”
You see, the Data.gov site is clearly being run by people who may know a lot about computer science but have no familiarity with or interest in the actual data they are attempting to provide access to. The computer science grads working on Data.gov see every problem as one for which the solution is high-level abstraction and imposed consistency. “If only we have a good looking, well organized metadata system”, they must be saying, “then everyone will be able to find what they need.” This is how computer science students are taught to think and what CS grads will always do to solve a problem.
What they are forgetting is that the EIA, the agency responsible for this dataset, is already doing a decent job of providing access to the data as well as a whole lot more including:
- collecting and processing the data in the first place
- making the data available at a URL within the EIA hierarchy
- apparently spell-checking their text descriptions
- providing a point of contact who is responsible for the dataset
- working directly with end users of the dataset
- etc.
Consumers of this data do not need yet another, non-targeted metadata system when each specific agency has several such systems of their own. It is hard to imagine how the Data.gov site in its present incarnation can do much besides give the appearance that this administration promotes open access to government data. (Although perhaps it is worth the effort just for that. It is wonderful to finally have an administration that at least appears to value science!)
But if the administration really wants to help make the data more accessible and machine readable it will need to hire individuals familiar with actual science and not just computer science. Scientific data is unlike bank data or business data. You don’t typically have error bars associated with your bank balance. But a careful scientist will always report them and a careful scientific data manager will make sure they don’t get lost. A working knowledge of topics like significant figures, units, precision, accuracy, standards, etc. is a fundamental requirement for those wishing to improve the usability of data. It is only when individuals with field-specific science knowledge work closely with software developers that progress on this front will be made.
Perhaps we can convince some of those physics Ph.D.s to abandon Wall Street banking for the less lucrative but more meaningful work of scientific data management.
One can always dream.