One of the big jokes among people who manage scientific datasets goes like this:
The great thing about standards is … there are so many to choose from!
While this one liner may never make it to late-night TV, there is much truth to it. Many “standards” exist, and many more are invented each month to accommodate the special needs of new types of data or new software for processing data.
There is, however, one exception that proves the rule: ISO 8601– the international standard for representing dates and times.
What is so hard about understanding dates?
Every school child knows how to write a date like “June 12, 2009”. For the purposes of internationalization we must of course substitute the month name with a number leading to the familiar “6/12/09” we use when writing a check. After the whole Y2K debacle a few years back we also learned that we really need to spell out the full year which leads us to “6/12/2009”. So what is wrong with this format?
Well, the French would write it as “12/6/2009” , the Germans as “12.6.2009”, and the Chinese as “2009-06-12”
It is clear that there is more we need to worry about if we wish to have machine readable, fully internationalized dates:
- Are they little, big or middle-endian (day-month-year, year-month-day or month-day-year)
- What separator is used?
- Are single digit months and days zero-padded?
- What is the relevant time zone? (If you didn’t think time zones were important, just remember that the attack on Pearl Harbor took place on December 8th according to Japanese time.)
In fact, humans have come up with a bewildering variety of ways to express calendar dates. Which is precisely what leads us to ISO 8601.
The International Organization for Standards Technical Corrigendum 8601:2004 describes an unambiguous standard for representing dates, times and time zone information. For only 130 Swiss Franks you can buy a copy for your library. (In the interest of brevity we will spare you the diatribe against charging for descriptions of international standards.)
As usual, the Wikipedia page has an excellent description of the indented of use of ISO 8601:
The scope of the standard covers representations for Gregorian dates, time of day, combined date and time of day, and time intervals. Dates can be represented in three forms: 1) year-month-day of month, 2) year-week number-day of week, and 3) year-day of year. Time of day is represented by using the 24-hour clock. Combined date and time is represented by merging both the date and time of day representations to denote a single time point.
Any date-time combination between the adoption of the Gregorian calender (1582-1929 on a per country basis) and “December 31, 9999” can be unambiguously represented with this format. (For datasets with historic data prior to 1929 we will review the Julian and Gregorian calendars and other calendar eras in a future post.)
As careful data managers, our job often involves cleaning up datasets that were generated by humans and human-oriented tools and making them consistent and machine readable. We are often surprised by the lengths to which data providers will go to make this task difficult:
- spreadsheets with 2-digit years, month names and non-zero padded days
- databases with fields for years and Julian days (i.e. day number between 1 and 365)
- XML output with dates, locations and local times rather than UTC times
- ambiguous all-numeric, non-zero padded formats like “2122009”
Each of these choices may have made sense to those collecting the data and may in fact be required in order to work with the hardware and software needed for their particular project. But that doesn’t mean that these choices must be inflicted on the ultimate end users of the data. Luckily, the ISO 8601 standard is supported by every important computer language. So parsing non-standard date information to create ISO 8601 date representations is relatively straightforward.
Whatever we are using for data storage, when working with dates it is important to include a field, column, element, etc. that contains the ASCII ISO 8601 representation of the date.
What this means in practical terms is that for every dataset we work with we must:
- Understand how to use ISO 8601 in our chosen programming language or analysis software.
- Identify whether the source data has an “ISO_date” field.
- If it does, validate each entry in this field.
- If it does not, process the available date information found in other fields and create the ISO 8601 date string which should be stored in an “ISO_date” field.
We do not worry about the data duplication involved in having “day”,”month”,”year” and “ISO_date” fields. Data storage is cheap. Human misunderstanding is expensive. Once we have the “ISO_date” field in our dataset we will have greatly enhanced the utility of the dataset for all downstream users, be they human or machine.
Dealing with the inconsistencies of date and time formats has resulted in a larger waste of human effort than probably any other issue in data management. Lets put the issue entirely to rest by insisting that every dataset with dates include an “ISO_date” field.