What? Where? When?
These are key questions that every scientist or other collector of data must answer.
- What is the value of the thing we are measuring?
- Where are we taking the measurement?
- When are we taking the measurement?
In a previous post we discussed how to standardize “when”. But what about “where”?
It should not be a surprise that the representation of global position information is anything but standard. Humans have been trying to keep track of this information since the time of Eratosthenes. And we have been arguing about things like the prime meridian all the way up until the International Meridian Conference of 1884. Prior to that conference, French maps would likely measure degrees east relative to the Paris Meridian — now longitude 2° 20′ 14.025″ East. (Readers of Tintin may recall Tintin’s insight in Red Rackham’s Treasure that old French navigational charts used the Paris Meridian and not Greenwich.)
Now that the prime meridian is sorted out, in what ways can the latitudes and longitudes in data files be incompatible? It turns out there are several competing ways to represent latitudes and longitudes as text strings:
- lats and lons as degrees-minutes-seconds
- lats and lons as degrees-decimal minutes
- lats and lons as decimal degrees
- longitudes with W longitudes positive
- longitudes with W longitudes negative
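To make the differences concrete, here is a minimal Python sketch that reduces the first two representations to the third. The function name `dms_to_decimal` and its arguments are our own invention for illustration, not part of any library.

```python
def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees, minutes, seconds and a hemisphere letter to signed decimal degrees.

    Hypothetical helper: S and W are returned as negative values.
    """
    magnitude = abs(degrees) + minutes / 60.0 + seconds / 3600.0
    sign = -1.0 if hemisphere.upper() in ("S", "W") else 1.0
    return sign * magnitude


# Degrees-minutes-seconds: the Paris Meridian at 2° 20′ 14.025″ East
paris = dms_to_decimal(2, 20, 14.025, "E")

# Degrees-decimal minutes: leave the seconds at zero
portland = dms_to_decimal(122, 30.0, hemisphere="W")
```

Degrees-decimal minutes is simply the degrees-minutes-seconds case with the seconds folded into a fractional minute, so one function covers both.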
Scalars vs. Vectors
The disagreement on whether W longitudes are positive or negative numbers betrays a deeper confusion about whether longitude is a scalar or a vector. In freshman physics we learn that scalars measure magnitude whereas vectors measure magnitude and direction. So “-90 degrees longitude” is a scalar (minus ninety units along an axis with zero at the prime meridian), while “90 degrees West” is a vector (90 units from the start in a Westward direction). This same confusion also occurs in oceanography and atmospheric science, where “depth” and “height” are often stored as two separate measurements, both with positive numbers.
Physicists, who understand the difference between scalars and vectors, would of course insist that we throw out “depth” and “height” entirely and instead use ρ, the distance from the center of the Earth. They would also replace latitude and longitude with φ and θ and measure each in radians. But this seems an impractical solution. We will be happy if we can convince people that measurements should be reported as “-90” rather than “90W”. Physicists will need to be satisfied that this convention and the right hand rule place the Northern hemisphere on top where it was meant to be.
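As a rough illustration of the “-90” rather than “90W” convention, the Python sketch below collapses a magnitude-plus-direction pair into a single signed number, and does the same for separate “height” and “depth” values. Both function names are hypothetical, and treating depth as negative height is our assumption about the sign convention.

```python
def signed_coordinate(magnitude, direction):
    """Turn a positive magnitude and a compass letter into one signed value."""
    letter = direction.upper()
    if letter not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown direction: {direction!r}")
    if magnitude < 0:
        raise ValueError("magnitude must be non-negative when a direction is given")
    return -magnitude if letter in ("S", "W") else magnitude


def signed_height(height=None, depth=None):
    """Collapse separate positive 'height' and 'depth' values into one signed value."""
    if (height is None) == (depth is None):
        raise ValueError("supply exactly one of height or depth")
    return height if height is not None else -depth


west = signed_coordinate(90, "W")   # "90W" becomes -90
below = signed_height(depth=250.0)  # 250 units of depth becomes -250.0
```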
Mixed Latitude / Longitude Representations
One would hope that data managers would be careful not to mix differing representations of latitude and longitude, but such is not always the case. To explore a real-world example, let us take a look at the Recreation Information Database (RIDB) maintained by the Department of the Interior. This site brings together information on recreational sites from many different agencies and uses modern software tools to make this information available to programmers: data delivery in XML format with full schemas and WSDL descriptions. It tries to provide everything a modern computer science graduate could want.
But what of the actual data, rather than the techniques for data delivery? Compiling data sources from different agencies with different methodologies needs to be approached very carefully. Did the designers of the RIDB just lump any data named “latitude” or “longitude” together or did they make an effort to validate the data they were compiling, insisting that it at least adhere to a common representational format?
As you may have guessed, the “lump together anything named ‘latitude’” strategy was adopted. This once again points out the need to have end users (those wishing to create maps or graphs) working alongside the software engineering gurus to ensure that any data going into compilations are properly validated. Data validation is not an optional “add-on”. It is a core feature of any serious data management project.
To understand why, let us examine a few examples of the latitudes found in the RIDB:
0.0
0394655N
085939N
1053354N
1.0
1120930N
112.2
...
194332N
19.4427
22.2
225.5
252707N
...
60.9
610111N
610456N
-78.63957
821418N
...
973818N
N
It’s quite a collection. Where do we begin?
- Can we trust “0.0” to mean that we have recreational facilities on the equator somewhere? It’s possible but our faith in the integrity of the data is low.
- It looks like “1120930N” means 112° 09′ 30″ which matches the “112.2” we see in the next line. But is anyone bothered by the fact that 112 is well outside the -90:90 domain of latitudes?
- It is surprising that only a single negative latitude is found in the entire dataset.
- And what do we make of the 41 values stored as “N”?
You will not be surprised to find out that the longitudes are in a similar state of disrepair.
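A validating parser for values like these might look something like the Python sketch below. It assumes, as we guessed above, that strings such as “0394655N” pack degrees, minutes and seconds together with a hemisphere letter. That layout is our reading of the data, not a documented RIDB format, and `parse_latitude` is a hypothetical name.

```python
import re

# Assumed layout: 1-3 digits of degrees, 2 of minutes, 2 of seconds, then N or S
_COMPACT = re.compile(r"^(\d{1,3})(\d{2})(\d{2})([NS])$")


def parse_latitude(text):
    """Return latitude in decimal degrees, or None when the value cannot be used."""
    text = text.strip()
    match = _COMPACT.match(text)
    if match:
        degrees, minutes, seconds = (int(part) for part in match.group(1, 2, 3))
        if minutes >= 60 or seconds >= 60:
            return None
        value = degrees + minutes / 60.0 + seconds / 3600.0
        if match.group(4) == "S":
            value = -value
    else:
        try:
            value = float(text)
        except ValueError:
            return None
    # Anything outside -90:90 (including NaN) is not a latitude
    return value if -90.0 <= value <= 90.0 else None


samples = ["0394655N", "1120930N", "112.2", "19.4427", "-78.63957", "N"]
parsed = {sample: parse_latitude(sample) for sample in samples}
```

With this sketch, “1120930N”, “112.2” and “N” all come back as `None`. A value of “0.0” still passes, so a range check alone will not catch a missing value that was stored as zero.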
It seems an appropriate time to suggest a standard representation against which the latitudes and longitudes in this, or any, dataset can be validated. Naturally we look to the International Organization for Standardization for a suggestion. Their best effort to date is ISO 6709, which misses the mark by a mile. Rather than insist on a single representation, they try to accommodate many. And, by combining latitude and longitude into a single format, they preclude any possibility of treating latitude or longitude as a simple, numeric value that can be interpreted by software that might actually place the location on a map.
No, we are looking for something simpler that can be used with plotting or mapping software; something that would make it easy to generate the KML needed to display locations on Google Earth for example.
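To show how little work is needed once positions are plain signed numbers, here is a minimal Python sketch that writes a single KML placemark. The function name `placemark_kml` is ours. Note that KML lists coordinates in longitude, latitude, altitude order.

```python
from xml.sax.saxutils import escape


def placemark_kml(name, latitude, longitude):
    """Build a one-placemark KML document from decimal-degree coordinates."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "  <Placemark>\n"
        f"    <name>{escape(name)}</name>\n"
        # KML order is longitude first, then latitude, then altitude
        f"    <Point><coordinates>{longitude},{latitude},0</coordinates></Point>\n"
        "  </Placemark>\n"
        "</kml>\n"
    )


kml_text = placemark_kml("Site & Camp", 45.25, -122.5)
```

Nothing here needs to know about hemispheres, minutes or seconds, which is the point of storing signed decimal degrees in the first place.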
A Reasonable Standard
We will make a bold proposal for a new standard with a few simple suggestions:
- Latitudes should be stored as numeric values with units of decimal degrees on the domain -90:90 with negative values in the Southern hemisphere.
- Longitudes should be stored as numeric values with units of decimal degrees on the domain -180:180 with negative values in the Western hemisphere.
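Checking a value against these two rules takes only a few lines. The Python sketch below is one possible way to do it, with function names of our own choosing. It rejects text strings outright, since the proposal calls for numeric values.

```python
import math
from numbers import Real


def _in_range(value, low, high):
    """True only for finite real numbers within [low, high]."""
    if isinstance(value, bool) or not isinstance(value, Real):
        return False
    return math.isfinite(value) and low <= value <= high


def is_valid_latitude(value):
    """Decimal degrees on -90:90, negative in the Southern hemisphere."""
    return _in_range(value, -90.0, 90.0)


def is_valid_longitude(value):
    """Decimal degrees on -180:180, negative in the Western hemisphere."""
    return _in_range(value, -180.0, 180.0)
```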
For those out there wishing to retain the existing representations in raw files for use with older software we additionally recommend:
- Whenever a non-conforming “latitude” or “longitude” field already exists in a dataset, add a new “latitude_dd” or “longitude_dd” field for the new, standardized representation.
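In practice this could be as simple as the following Python sketch, which leaves the original fields untouched and adds the standardized ones alongside. The record layout, the function names and the W-positive source convention are all assumptions made up for the example.

```python
def add_standard_fields(record, convert_latitude, convert_longitude):
    """Return a copy of record with latitude_dd and longitude_dd added."""
    updated = dict(record)  # keep the raw fields for older software
    updated["latitude_dd"] = convert_latitude(record["latitude"])
    updated["longitude_dd"] = convert_longitude(record["longitude"])
    return updated


def west_positive_to_standard(text):
    """Hypothetical converter for a source that stores W longitudes as positive."""
    return -float(text)


record = {"name": "Example Campground", "latitude": "45.5", "longitude": "122.5"}
updated = add_standard_fields(record, float, west_positive_to_standard)
```

The converters are passed in because each source agency may need a different one. The new field names stay the same whatever the source looked like.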
We dream of a day when those working to make datasets more findable will devote some of their considerable energy and talent to also making data more uniform, more reliable, in short — more useful.
It doesn’t seem that hard really.