Have you ever asked yourself whether your telephone number is really a number? It’s got numbers in it but does it measure anything?
How about your credit card number? PO Box? Social Security Number? What would happen if you subtracted one of these from another?
As it turns out, many of the “numbers” we deal with every day are actually identifiers and not a measure of something. Sadly, too many data managers do not distinguish between the two even though making this distinction is quite simple.
Identification vs. Measurement
We have a check stub on our desk that contains the following information:
- CHECK NO. — 12345
- OUR REF. NO. — 290227
- YOUR INVOICE NO. — 090202
- INVOICE DATE — 01/30/09
- INVOICE AMOUNT — 100.00
- AMOUNT PAID — 100.00
The first three items are used to uniquely identify the check according to three different accounting systems. We can think of this information as metadata associated with the actual measurements that the check keeps track of. Although these items all have “NO.” (number) in their name, they should really be called numeric identifiers, names consisting of numbers.
The last three items, the date and amounts, are actual measurements which have units of “days” and “dollars”, respectively.
This check stub is not unlike many data records in that it contains identifiers (the “numbers”) , spatio-temporal locators (the date) and measurements (the amounts). The world of scientific data would be a much friendlier place if data managers understood the distinction between these categories and made this distinction obvious in their datasets.
The easiest way to tell the difference between an identifier and a measurement is to ask yourself whether there are any units involved. If there are no units involved then we are talking about an identifier, a name, a handle on an individual piece of information. Unique identifiers are important as they allow us to be sure we are talking about a particular piece of information. You don’t want all your invoices to go out with the same invoice number. In fact, you should make every effort to ensure that no two invoices share this number.
Measurements are different. They make no claim to being unique. You may send out an invoice for $100 as many times as you like. Measurements have units. Measurements can be represented as a distance along an axis. Measurements, unlike identifiers, can be used in mathematical equations.
Strings vs. Floats and Ints
To human readers, it appears that numeric identifiers and measurements are expressed the same way — as a series of “numbers”. However, when writing software to process data, it is important to differentiate between numeric strings — arrays of ASCII characters from the set [0…9], and floating point or integer values. In typed languages like Fortran, C or Java, this distinction is enforced. In untyped languages like Perl and python, any lack of a clear distinction between numeric identifiers and measurements can lead to some interesting results.
Here is some python code that adds two integers:
>>> a = 7 >>> b = 7 >>> c = a + b >>> c 14
Here is similar looking code that “adds” (i.e. concatenates) two “numbers”:
>>> a = '7' >>> b = '7' >>> c = a + b >>> c '77'
Some would say that this argues for the use of typed languages when working with scientific data. We do not share this judgement. Agile programming languages like python offer so many advantages with respect to programmer efficiency and concise readability that it would be folly to abandon them. Instead, we advocate a more careful approach to data management that can solve the problem for both typed and untyped languages.
Identifiers and Measurements in a CSV File
Comma Separated Value (CSV) files are the defacto standard for scientific data interchange involving limited amounts of data. CSV is the format of choice for the Obama administration’s Data.gov initiative and can be read in and spat out by any software that purports to work with data. Unfortunately there is no standard for exactly how to use the CSV format. The only aspect that everyone appears to agree upon is that fields should be delimited by the Unicode character ‘COMMA’ (U+002C). (Except of course for those datasets on Data.gov that are advertised as CSV files but actually use the TAB character as a delimiter. Sigh…)
To demonstrate the general confusion with strings, numbers, identifiers and measurements we will look at the Data.gov dataset named “Worldwide M1+ Eartquakes, Past Hour” (a USGS dataset):
Src,Eqid,Version,Datetime,Lat,Lon,Magnitude,Depth,NST,Region ci,14489992,1,"Wednesday, July 22, 2009 22:33:02 UTC",33.8516,-117.8411,2.0,9.80,53,"Greater Los Angeles area, California" ak,10001452,1,"Wednesday, July 22, 2009 22:32:20 UTC",51.4460,-177.9687,2.2,31.10,06,"Andreanof Islands, Aleutian Islands, Alaska" nc,71252401,0,"Wednesday, July 22, 2009 21:55:36 UTC",38.8125,-122.8158,1.5,3.00,18,"Northern California"
Like any dataset, this one contains identifiers, spatio-temporal locators and measurements. Without even searching for the separate file containing the descriptions of the column headers (embedded metadata will be the subject of a future post), we can make a pretty good guess as to which category each column falls into.
Src — Contains non-numeric characters so it must be either an identifier or a comment. It’s not a comment so it must be an identifier.
Eqid — Only contains numeric characters they but the column name is ‘Eqid‘ so it is an identifier.
Version — Version numbers are always identifiers.
Datetime, Lat, Lon, Depth — Self-explanatory: they are the spatio-termporal locators.
Magnitude — Now we finally see our first measurement. And notice that it is a floating point value.
NST — Contains only integers but one of them, ’06’, has a leading zero. This is an identifier.
Region — Apart from Datetime, this is the only quoted field. It could be a comment but is more likely drawn from a list of place names and thus an identifier.
Hurray! As humans, we can figure out which columns are identifiers, which ones are spatio-termporal locators and which ones are measurements. All we have to do is visually scan every single file we come across, use our intuition and hope that the file formats never change when the data is updated. (Un-hurray.)
It sounds like there could be some room for some improvement … and there is. We will introduce the term machine parseable to refer to data files that adhere to certain basic rules which allow software to go beyond simply reading in the data. Software reading machine parseable data can actually make initial steps toward ‘understanding’ what the data represent.
In the case of our earthquake data, all we have to do to make this file not only machine readable but also machine parseable is adopt two simple rules:
1. Use standard names for spatio-termporal locators
Our example file does a pretty good job in this regard — “Datetime”, “Lat”, “Lon” and “Depth” should be recognized as special columns by any data management software. If CSV files always used standard names like “Lat” (or “Latitude”, “lat” or “latitude”), software could be written that would always be aware of which columns contained spatio-termporal variables. (Issues of units are addressed in an earlier post.)
2. Surround identifiers in quotes
Our example uses quotes sporadically. It would make much more sense if quotes surrounded every identifier (and comment) in each record. Header rows may/(should?) also use quotes. The only elements not enclosed in quotes would be the actual measurements.
Reformatted in this manner our example file would appear thus:
"Src","Eqid","Version","Datetime","Lat","Lon","Magnitude","Depth","NST","Region" "ci","14489992","1","Wednesday, July 22, 2009 22:33:02 UTC",33.8516,-117.8411,2.0,9.80,"53","Greater Los Angeles area, California" "ak","10001452","1","Wednesday, July 22, 2009 22:32:20 UTC",51.4460,-177.9687,2.2,31.10,"06","Andreanof Islands, Aleutian Islands, Alaska" "nc","71252401","0","Wednesday, July 22, 2009 21:55:36 UTC",38.8125,-122.8158,1.5,3.00,"18","Northern California"
This file is no less human readable than the first but contains indicators, the quotes, that inform both human readers and software about the nature of the data in each column.
There are distinct advantages to creating CSV files in this manner. When this file is imported into a spreadsheet, for instance, the second record will contain a value of “06” for NST. Many spreadsheets that read in the original file will convert the unquoted version into the integer ‘6’ which does not really have the same meaning as the identifier “06”. (James Bond was “007”, not 7.)
That’s all. Just two simple rules to make scientific data more useful to more people. It doesn’t sound difficult because it isn’t.
Best hopes for creating machine parseable data!