<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Working with Data</title>
	<atom:link href="http://mazamascience.com/blog/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://mazamascience.com/WorkingWithData</link>
	<description>Manning the helm during the &#039;data deluge&#039;.</description>
	<lastBuildDate>Sun, 08 Aug 2010 17:35:47 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Data Volumes</title>
		<link>http://mazamascience.com/WorkingWithData/?p=434</link>
		<comments>http://mazamascience.com/WorkingWithData/?p=434#comments</comments>
		<pubDate>Fri, 09 Apr 2010 20:41:28 +0000</pubDate>
		<dc:creator>Jonathan Callahan</dc:creator>
				<category><![CDATA[Data Management]]></category>
		<category><![CDATA[CSV]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[netcdf]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[SQLite]]></category>

		<guid isPermaLink="false">http://mazamascience.com/blog/?p=434</guid>
		<description><![CDATA[Despite what they say, size does matter. Successful data management is all about finding the proper tools and formats for dealing with your data.  There is no one-size-fits-all solution.  And the very first question you should be asking yourself is: &#8230; <a href="http://mazamascience.com/WorkingWithData/?p=434">&#160;&#160;read more ... </a>]]></description>
			<content:encoded><![CDATA[<p>Despite what they say, size <strong>does</strong> matter.</p>
<p>Successful data management is all about finding the proper tools and formats for dealing with your data.  There is no one-size-fits-all solution.  And the very first question you should be asking yourself is:  &#8221;How much data are we talking about?&#8221;<span id="more-434"></span></p>
<h2>It&#8217;s all relative</h2>
<p>Most people are familiar with data and data management techniques from within their own field of study.  Whether one has a large dataset or not is therefore a relative question.  A dataset is considered large or small relative to some other collection of data.  But the tools for dealing with data &#8212; hardware and software &#8212; are constantly improving and what may have been considered a &#8216;large&#8217; amount of data a few years ago may no longer be so big.  Consequently, the appropriate tools for dealing with your dataset may be changing as well.  With RAM selling for under $50/GB, many datasets are starting to look a lot smaller.</p>
<p>We&#8217;ll begin our examination of data volumes by putting various datasets and databases on a logarithmic scale just to get a sense of their relative size.  Remember, each tick mark represents a factor of 10 increase in size.  (References are included in the list of links at the end of this article.)</p>
<p><img class="aligncenter size-full wp-image-512" title="data volumes" src="http://mazamascience.com/blog/wp-content/uploads/2010/04/data-volumes1.jpg" alt="data volumes" width="421" height="646" /></p>
<p>Data volumes range over more than twelve orders of magnitude!  Where does your dataset fit in?</p>
<p>Clearly, datasets in the single megabyte range (gray) are at the insignificant end of the scale.  At about the 100 megabyte scale (green) it becomes important to have a plan for how to manage the data with an eye on potential software limitations.  Still, it should be smooth sailing with with respect to hardware until you start handling hundreds of Gigabytes of data.  At that point, hardware and software limitations will both impact your decision making.</p>
<p>From 100 gigabytes to perhaps 10 terabytes (yellow) you can still buy off-the-shelf components that will store that much data but data managers must proceed with caution.  Above 10 terabytes (orange) requires carefully designed, networked storage devices.  This is the realm where computer scientists need to be part of your team.  Those projects that involve storing above a petabyte of data (red) are at the cutting edge of what we are planning for the next decade.</p>
<p>For a more personal comparison, our recently purchased, plain vanilla iMac came with 4 gigabytes of RAM and one terabyte of disk.  That would allow us to download and play with some of the largest datasets of climate measurements (as opposed to model output).</p>
<h2>Megabyte sized (small) datasets</h2>
<p>Several of the projects we have worked on involve data collections at the small end of the scale.  Any dataset that involves humans in the collection, data entry, processing or validation of individual data points will always be under 100 megabytes.  This is the area that contains what we like to refer to as &#8220;high value datasets&#8221; &#8212; those datasets that have actual measurements made by humans as opposed to model output generated by computers or streams of data generated by automated sensing devices.</p>
<p>Because these small datasets are not even as big as available RAM on most machines, there is no requirement to store them in any particular compact format or access them with any particular software.  You should always keep in mind that reading from and writing to disk are by far the slowest operations on your computer.  Once data are read into RAM, any filtering and subsetting and processing of data should be lightning fast.</p>
<p>When planning data management, we always subscribe to Einstein&#8217;s philosophy of:  &#8220;As simple as possible, but not simpler.&#8221;  Whatever is simple and easy and flexible and fast is the right choice for working with small datasets.  Too many data management applications get bogged down in the complexity of using yesterday&#8217;s sexy computer science tools even though they are completely unjustified by the data volumes.  Often it would be much more cost effective to simply buy some more memory and then use a brute force approach to filtering and subsetting.  Computer memory is much cheaper than the human memory required to keep a complex system working.</p>
<p>For datasets of this size we often recommend storing the data in one or more simple CSV files or an <a href="http://sqlite.org/" target="_blank">SQLite</a> database if no other solution is preordained.</p>
<h2>Gigabyte sized (big) datasets</h2>
<p>Once you start working with datasets that are a gigabyte in size and larger you will need to consider very carefully two components:  1) what software tools are used to access and analyze the data; and 2) what data format(s) the data should be stored in.  These are not independent choices as a particular set of tools usually relies upon the data being available in a particular format.</p>
<p>Sometimes you will be required to support formats required by a particular piece of software already used by the community of practice working with this data.  Other times you may be at liberty to make recommendations.  Our philosophy of &#8220;as simple as possible&#8221; is meant to apply both to the team in charge of managing the data as well as end users of the data.  It is therefore very important to interview those who expect interactive access to data to find out how they intend to use the data:  what tools they use, what kind of subsetting or querying is required, what kind of interactive access is expected, <em>etc</em>.  Hopefully, certain themes will arise from these questions that will help guide your choices.</p>
<p>At this scale, the structure of your data will also have a lot to say about what options are available to you.  If you have regularly gridded data the <a href="http://www.unidata.ucar.edu/software/netcdf/" target="_blank">NetCDF</a> format, widely used in the climate data community, may be appropriate while relational data will need to be stored in a RDBMS like <a href="http://www.mysql.com/" target="_blank">MySQL</a> or <a href="http://www.postgresql.org/" target="_blank">PostgreSQL</a>.</p>
<h2>Terabyte sized and bigger (huge)</h2>
<p>We have never worked with datasets larger than about a terabyte.  In this size class data management of necessity becomes it&#8217;s own separate activity with the associated specialists and funding.  Usually, datasets this large consist of lots of <em>raw data</em> which is important to keep but may not be what is needed for higher level analysis.  It is often possible to generate and store partially processed versions of the raw data which reduce the data volume by several orders of magnitude and allow much simpler data management solutions.  An example of this approach applied to genomic data is found in an excellent article from 2008:  <a href="http://www.genetic-future.com/2008/06/how-much-data-is-human-genome-it.html" target="_blank">How much data is a human genome?  It depends how you store it.</a></p>
<p>The executive summary describes the approach:</p>
<blockquote><p>For those who don&#8217;t want to read through the tedious details that follow, here&#8217;s the take-home message: if you want to store the data in a raw format for later re-analysis, you&#8217;re looking at <strong>between 2 and 30 terabytes</strong> (one terabyte = 1,000 gigabytes). A much more user-friendly format, though, would be as a file containing each and every DNA letter in your genome, which would take up around <strong>1.5 gigabytes</strong> (small enough for three genomes to fit on a standard data DVD). Finally, if you have very accurate sequence data and access to a high-quality reference genome you can squeeze your sequence down to around <strong>20 megabytes</strong>.</p></blockquote>
<p>Clearly, intelligent data management for very large datasets involves a lot of decisions about which users the data is being managed for.  One set of users will require access to the raw data but are willing to wait while at the other end of the spectrum you have people who want instant access to the summary information.  As we said at the beginning, there is no one-size-fits-all solution and good data management is about identifying different classes of users and finding the right solutions for each class.</p>
<h2>Additional information</h2>
<p>The following list of links provides a little more detail on the data mentioned in graphic above and a few interesting posts on working with very large data volumes:</p>
<ul>
<li><a href="http://www.bp.com/productlanding.do?categoryId=6929&amp;contentId=7044622" target="_blank">British Petroleum Statistical Review</a> (BP Stat Review)</li>
<li><a href="http://www.epa.gov/owow/streamsurvey/" target="_blank">Wadeable Streams Assessment</a> (EPA WSA)</li>
<li><a href="http://minerals.usgs.gov/ds/2005/140/" target="_blank">Historical Statistics for Mineral and Material Commodities in the United States</a> (USGS DS 140)</li>
<li><a href="http://www.genetic-future.com/2008/06/how-much-data-is-human-genome-it.html" target="_blank">How much data is a human genome?  It depends how you store it.</a> (2008)</li>
<li><a href="http://icoads.noaa.gov/" target="_blank">International Comprehensive Ocean-Atmosphere Data Set</a></li>
<li><a href="http://esg-pcmdi.llnl.gov/" target="_blank">Earth System Grid</a></li>
<li><a href="http://news.idg.no/cw/art.cfm?id=32FCBC8A-17A4-0F78-316BA999B7AFE095" target="_blank">Man behind MasterCard&#8217;s 100-terabyte data warehouse</a> (2008)</li>
<li><a href="http://public.web.cern.ch/public/en/lhc/Computing-en.html" target="_blank">CERN &#8211; LHC Computing</a></li>
<li><a href="http://www.sdss.org/" target="_blank">Sloan Digital Sky Survey</a></li>
<li><a href="http://expertvoices.nsdl.org/roadreports/2008/06/20/the-petabyte-problem-scrubbing-curating-and-publishing-big-data/" target="_blank">The Petabyte Problem:  Scrubbing, Curating and Publishing Big Data</a> (2008)</li>
<li><a href="http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/" target="_blank">Petabytes on a budget:  How to build cheap cloud storage</a> (2009)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://mazamascience.com/WorkingWithData/?feed=rss2&amp;p=434</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data Management Questionnaire</title>
		<link>http://mazamascience.com/WorkingWithData/?p=459</link>
		<comments>http://mazamascience.com/WorkingWithData/?p=459#comments</comments>
		<pubDate>Wed, 24 Mar 2010 13:59:23 +0000</pubDate>
		<dc:creator>Jonathan Callahan</dc:creator>
				<category><![CDATA[Data Management]]></category>
		<category><![CDATA[validation]]></category>

		<guid isPermaLink="false">http://mazamascience.com/blog/?p=459</guid>
		<description><![CDATA[Sometimes merely filling out a questionnaire can cause you to think about problems in a new way.  When asked to answer a question that has never occurred to you before, you may find yourself reevaluating some of your core assumptions &#8230; <a href="http://mazamascience.com/WorkingWithData/?p=459">&#160;&#160;read more ... </a>]]></description>
			<content:encoded><![CDATA[<p>Sometimes merely filling out a questionnaire can cause you to think about problems in a new way.  When asked to answer a question that has never occurred to you before, you may find yourself reevaluating some of your core assumptions &#8212; assumptions you may not have known you had.  That is the power of asking questions.  Our data management questionnaire poses questions in 12 categories that will help you figure out what you need, what you want, and perhaps give you a hint of how to get there.<span id="more-459"></span></p>
<p>For those who always watch the credits at the end of movies we have a little more to say along the lines of philosophical underpinnings after we present the questionnaire.  For the impatient, here you go:</p>
<hr />Please answer each question to the best of your ability and be honest when the answer is &#8220;I don&#8217;t know.&#8221; or &#8220;I&#8217;m not sure.&#8221;</p>
<p><strong>Questions about the community involved in creating, using and managing the data</strong></p>
<h2>1) Data Providers</h2>
<ul>
<li>Why were the data collected?</li>
<li>Who are the data providers?</li>
<li>How do you get feedback from them?</li>
<li>What questions do they have about the data?</li>
<li>What software tools do they currently use to work with data?</li>
<li>What data formats do they prefer?</li>
<li>What leverage exists to get their cooperation?</li>
<li>What resources do they have?</li>
<li>What internal procedures do they have that impact when and how data are delivered?</li>
<li>What political needs must be met?</li>
<li><strong>What would make them <em>happy customers</em>?</strong></li>
</ul>
<h2>2) Data Consumers</h2>
<ul>
<li>Who are the data consumers?</li>
<li>How do you get feedback from them?</li>
<li>What questions do they have about the data?</li>
<li>What software tools do they currently use to work with data?</li>
<li>What data formats do they prefer?</li>
<li>How do they want to interact with the data?</li>
<li>Are there requirements to interact with other software systems?  (<em>e.g.</em> &#8220;web services&#8221;)</li>
<li><strong>What would make them <em>happy customers</em>?</strong></li>
</ul>
<h2>3) Data Management</h2>
<ul>
<li>Is data management a dedicated activity?</li>
<li>Who is responsible for data management?</li>
<li>Do they have a background in the field of study?</li>
<li>Is this data management project driven by the needs of software or of science?</li>
<li>What financial resources does the individual/team have?</li>
<li>What personnel resources do they have?</li>
<li>What hardware resources do they have?</li>
<li>How will they communicate with data providers and data consumers?</li>
<li>Who will provide long term support for data management?</li>
</ul>
<p><strong>Questions about the nature of the data.</strong></p>
<h2>4) Volume</h2>
<ul>
<li>How many actual numeric measurements (not including textual meta-data) are made in a year? &#8212; thousand/million/billion/trillion</li>
<li>How many are made in a day?</li>
</ul>
<h2>5) &#8216;Speed&#8217;</h2>
<ul>
<li>What temporal precision is needed for incoming data? &#8212; second/minute/hour/day/month/year</li>
<li>How up-to-date should the &#8216;released data&#8217; be? &#8212; Everything up to the last minute/hour/day/month/year?</li>
</ul>
<h2>6) &#8216;Shape&#8217;</h2>
<ul>
<li>How are data currently stored? &#8212; text file/CSV/XML/GIS/spreadsheet/RDBMS/binary file/other</li>
<li>Can the data be expressed as a single row-by-column table?</li>
<li>Are any data geo-spatially located?</li>
<li>Do any data have regular spacing along an axis with physical units? &#8212; latitude/longitude/depth/height/time</li>
<li>What additional metadata must be stored?</li>
</ul>
<p><strong>Questions about functionality to be achieved.</strong></p>
<h2>7) Validation</h2>
<ul>
<li>How are data currently being validated?</li>
<li>How are successfully validated data points being identified?</li>
</ul>
<h2>8) Versioning</h2>
<ul>
<li>How is raw data being versioned?  (<em>e.g</em>. How are changes to the data store being tracked?)</li>
<li>Can earlier versions be retrieved?</li>
<li>How is released data being versioned?</li>
</ul>
<h2>9) Provenance</h2>
<ul>
<li>How is the history and origin of each data point being tracked as data goes from individual submissions to larger aggregations?</li>
</ul>
<h2>10) Authorization</h2>
<ul>
<li>Who is allowed to enter data?</li>
<li>Who is allowed to extract data?</li>
<li>What should be open to the general public?</li>
<li>What kind of secure technology is mandated/desired?</li>
</ul>
<h2>11) Analysis</h2>
<ul>
<li>Are any specific kinds of analysis associated with the data?</li>
<li>Is it desirable to build a system that helps users perform appropriate analysis?</li>
<li>What software requirements does this impose upon this data management project?</li>
</ul>
<h2>12) Interactive Access</h2>
<ul>
<li>What sort of interactive access should be provided to data consumers? &#8212; subsetting/querying/reformatting/analysis/visualization</li>
</ul>
<hr />We hope that these questions are both self-explanatory and thought provoking.  By asking these questions we do not intend to overwhelm small projects with dreams that are too big for their budgets.  Rather we hope to save both providers and consumers of data time and money by getting them to think ahead a little to how their data will be used productively rather than simply ignored in our current <a href="http://www.economist.com/opinion/displaystory.cfm?story_id=15579717" target="_blank">data deluge</a>.</p>
<p>Our philosophy of the practice of data management is that it should always be in the service of both the data providers and data consumers (1 &amp; 2).  One must have a thorough understanding of the original questions that motivated the data collection effort and the additional questions that the data can be used to answer before attempting to come up with data management solutions.  The tools and formats that are currently in use as well as the political landscape in the provider and consumer communities set the fundamental frame for organizing data.</p>
<p>The questions about data management resources and responsibilities (3) are absolutely key.  We believe that <strong>data management should always be a dedicated component of any data gathering or data analysis activity</strong>.  Data management occupies a fundamental position between the worldview of the data providers and that of the data consumers (see the post on <a href="http://mazamascience.com/blog/?p=84" target="_blank">Data Producers vs. Data Consumers</a>).  Any ideas that these two groups are one-and-the-same must be left behind.  In this modern age of open data dissemination the expectation is that any data that are collected will be combined with other data and used by a wide variety of individuals, each with a different set of needs.  It is important, therefore, for data managers to identify target audiences and provide added benefit to these &#8216;customers&#8217;.  The simplest way to gauge the success of a data management project is to ask:  &#8221;Did we make the job of person X easier?&#8221;  We also feel very strongly that those involved in any data management project should have some familiarity with the discipline involved.  At a minimum they should be in regular contact with the providers and consumers of data.</p>
<p>Questions 4 &#8211; 6 are designed to get people to really think about what kind of structure might be best for their data.  We have plenty of war stories of SQL databases being set up with dozens of interrelated tables and complex schemas for data sets that were no longer being updated.  As it turned out, the data could be written out as a single CSV file of a thousand rows by a hundred columns and this simpler format was actually more useful to the downstream consumers of data.  Others build overly complex systems based on outdated ideas of how much data can be stored on disk or in memory.  The &#8216;speed&#8217; issue (5) has to do with transactional vs. archival databases and will be the topic of a future post but boils down to how up-to-date data access needs to be.  Are you designing an airline booking system that needs to be aware of transactions completed a few seconds ago or are your data only updated and released once a month?</p>
<p>The last section on functionality has several leading questions to get people thinking about what a data management system could have.  We believe that data should be validated, updated, trackable, <em>etc</em>. and these questions need to be answered before you design your data handling system.  We also believe that it is in the interest of everyone to make vetted data, analyses and visualizations available to as broad an audience as possible.  Publicly available data and analysis allow us to harness the interest and skills of people we may not even know about.  It is important to plan on internet access even if the budget only allows for a web page pointing to publicly available CSV files.  Even that would be a far sight better than no data access at all or keeping the data essentially locked up in a complicated software system.</p>
<p>Best of luck answering your own data management questions!</p>
]]></content:encoded>
			<wfw:commentRss>http://mazamascience.com/WorkingWithData/?feed=rss2&amp;p=459</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Two Minutes with Vivek Kundra</title>
		<link>http://mazamascience.com/WorkingWithData/?p=429</link>
		<comments>http://mazamascience.com/WorkingWithData/?p=429#comments</comments>
		<pubDate>Sun, 21 Mar 2010 18:09:55 +0000</pubDate>
		<dc:creator>Jonathan Callahan</dc:creator>
				<category><![CDATA[Data Management]]></category>
		<category><![CDATA[Data.gov]]></category>

		<guid isPermaLink="false">http://mazamascience.com/blog/?p=429</guid>
		<description><![CDATA[We recently had the privilege of attending Vivek Kundra&#8217;s talk when he spoke at the University of Washington.  We found his assessment of federal government information technology issues to be refreshingly frank.  After his talk we had a chance to &#8230; <a href="http://mazamascience.com/WorkingWithData/?p=429">&#160;&#160;read more ... </a>]]></description>
			<content:encoded><![CDATA[<p>We recently had the privilege of attending Vivek Kundra&#8217;s talk when he spoke at the University of Washington.  We found his assessment of federal government information technology issues to be refreshingly frank.  After his talk we had a chance to bring up the issue of data quality in historical datasets.  His response was encouragingly realistic.<span id="more-429"></span></p>
<p>For those who don&#8217;t know, <a href="http://en.wikipedia.org/wiki/Vivek_Kundra" target="_blank">Vivek Kundra</a> is Chief Information Officer of the Obama Administration and is the driving force behind the <a href="http://data.gov" target="_blank">Data.gov</a> initiative described in a previous post:  <a href="http://mazamascience.com/blog/?p=12" target="_blank">Reinventing the Wheel at Data.gov</a>.  In that post we complained that Data.gov was simply another new website providing links to agency data without any discipline specific evaluation of the data that was being served up.  This is a complaint we still level against that initiative but Kundra&#8217;s recent talk at the UW gave us a sense of what he is up against with a very limited set of resources.</p>
<p>Other&#8217;s have captured ideas from his talk in some detail:</p>
<ul>
<li><a href="http://blog.orenblog.org/2010/03/04/notes-from-federal-cio-vivek-kundra-talk-at-uw/" target="_blank">Notes from Federal CIO Vivek Kundra talk at UW</a></li>
<li><a href="http://www.fastcompany.com/1571834/vivek-kundra-government-it-plans" target="_blank">Vivek Kundra Outlines Ambitious Government Plans for I.T</a>.</li>
</ul>
<p>From our perspective, the take home messages were the following:</p>
<ul>
<li>Identifying and getting feedback from real &#8216;customers&#8217;, not just project managers, is key to developing useful tools.</li>
<li>The government needs to pull the plug on IT projects that are going nowhere.</li>
<li>Making more data open and available to the public with channels for feedback is the only realistic way to improve data quality.</li>
</ul>
<p>If you want to get a sense of the man and his ideas you can watch a <a href="http://www.ustream.tv/recorded/5186518" target="_blank">video of his talk</a>.  Our question and his response on the topic of data quality in historical datasets begins at 1:00:03.</p>
<p>We were pleased that his response acknowledged the sorry state of some historical data.  Refreshingly, his solution is not to set up an expensive government program to &#8216;fix&#8217; the bad data &#8212; we&#8217;ve been down that road before.  Instead, he believes the only hope is to daylight government data sets and crowdsource the solution.</p>
<p>Now this is a solution that we can get behind!</p>
<ol>
<li>make public versions of data available as soon as possible</li>
<li>open a channel for the public, the ultimate &#8216;customers&#8217;, to provide feedback on data quality and usability issues</li>
<li>have an individual who&#8217;s main job is to improve the quality, completeness and usability of datasets generated by their department</li>
</ol>
<p>We know that many agencies are used to going their own way with respect to managing their data.  And we believe that the needs of research require special tools and systems.  However, we also believe that there is no excuse for delaying access to <em>public versions</em> of agency data which should be made as early as possible.  And these public versions need not require extensive software engineering &#8212; don&#8217;t let perfect be the enemy of good.  For many, many government datasets, a simple CSV version of the data would go a long way toward making the data more useful.</p>
<p><em>(You might be surprised to find out how many complicated SQL schema&#8217;s ultimately boil down to a single row vs. column table.  We&#8217;ll save that as the topic for a future post.)</em></p>
<p><em><span style="font-style: normal;">Best Hopes for common sense data management.</span></em></p>
]]></content:encoded>
			<wfw:commentRss>http://mazamascience.com/WorkingWithData/?feed=rss2&amp;p=429</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Standard Country Names</title>
		<link>http://mazamascience.com/WorkingWithData/?p=129</link>
		<comments>http://mazamascience.com/WorkingWithData/?p=129#comments</comments>
		<pubDate>Mon, 01 Mar 2010 23:38:33 +0000</pubDate>
		<dc:creator>Jonathan Callahan</dc:creator>
				<category><![CDATA[Data Management]]></category>
		<category><![CDATA[Babel]]></category>
		<category><![CDATA[FIPS]]></category>
		<category><![CDATA[i18n]]></category>
		<category><![CDATA[ISO]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[standards]]></category>

		<guid isPermaLink="false">http://mazamascience.com/blog/?p=129</guid>
		<description><![CDATA[What&#8217;s in a name?  That which we call a rose By any other name would smell as sweet. Ahhh love.  Juliet speaks lovely poetry but we learn, as the story unfolds, that names and the identification they impart are in &#8230; <a href="http://mazamascience.com/WorkingWithData/?p=129">&#160;&#160;read more ... </a>]]></description>
			<content:encoded><![CDATA[<blockquote><p>What&#8217;s in a name?  That which we call a rose<br />
By any other name would smell as sweet.</p></blockquote>
<p>Ahhh love.  Juliet speaks lovely poetry but we learn, as the story unfolds, that names and the identification they impart are in fact extremely important.  This is no less true in data management where country names are anything but standardized.<span id="more-129"></span></p>
<p>Anyone care to explain the difference between &#8220;England&#8221;, &#8220;Great Britain&#8221; and the &#8220;United Kingdom&#8221;?  Sadly, these politically distinct entities are often confounded.  If we start with England and then add Scotland and Wales we get &#8220;Great Britain&#8221;.  If we then add Northern Ireland we end up with the &#8220;United Kingdom&#8221; or UK.  (But remember, even the UK does not officially include the Isle of Mann or the Channel Islands which are dependencies of the crown.)  Country names are inextricably bound to the vagaries of history and politics.</p>
<p>Our first example only looked at the name of one country in the language of that country.  Other countries of course have their own word for the UK: Royaume-Uni (French), Reino Unido (Spanish), Vereinigtes Königreich (German), Verenigd Koninkrijk (Dutch).  Linguists will recognize that these are all local versions of the English phrase &#8220;United Kingdom&#8221;.</p>
<p>Translating country names is not always this simple.  Each of Germany&#8217;s neighbors has a name for Germany that stems from the Germanic tribe that they had to deal with back in the Roman era:  Allemagne (French), Germania (Italy) (but <em>tedesco</em> for a German person), Duitsland (Dutch), Niemcy (Polish).  Whatever agreement these nations might reach in their assessment of Germany&#8217;s past, there is no agreement on what to call the Bundesrepublik Deutschland.</p>
<h2>Country name inconsistencies</h2>
<p>As this is an English language site, we will investigate country name inconsistencies a little further by looking at some problematic names found in a few English language datasets.  These datasets are all high value, well curated datasets that make every attempt to be standardized:</p>
<table border="0">
<thead>
<tr>
<th colspan="8" align="center">Country name inconsistencies</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://www.cia.gov/library/publications/the-world-factbook/">CIA World Fact Book</a></td>
<td>Ireland</td>
<td>Russia</td>
<td>Libya</td>
<td>Iran</td>
<td>Burma</td>
<td>Hong Kong</td>
<td>United States</td>
</tr>
<tr>
<td><a href="http://www.census.gov/ipc/www/idb/">US Census Bureau IDB</a></td>
<td>Ireland</td>
<td>Russia</td>
<td>Libya</td>
<td>Iran</td>
<td>Burma</td>
<td>Hong Kong</td>
<td>United States</td>
</tr>
<tr>
<td><a href="†">BP Statistical Review</a></td>
<td>Republic of Ireland</td>
<td>Russian Federation</td>
<td>Libya</td>
<td>Iran</td>
<td>Myanmar</td>
<td>China Hong Kong SAR</td>
<td>US</td>
</tr>
<tr>
<td><a href="http://www.unicef.org/infobycountry/index.html">UNICEF</a></td>
<td>Ireland</td>
<td>Russian Federation</td>
<td>Libyan Arab Jamahiriya</td>
<td>Iran, Islamic Republic of</td>
<td>Myanmar</td>
<td>Hong Kong, China</td>
<td>United States of America</td>
</tr>
<tr>
<td><a href="jodidata.org">Joint Oil Data Initiative</a></td>
<td>Ireland</td>
<td>Russian Federation</td>
<td>Libyan Arab Jamahiriya</td>
<td>Iran, Islamic Republic of</td>
<td>Myanmar</td>
<td>Hong Kong China</td>
<td>United States of America</td>
</tr>
</tbody>
</table>
<p>It&#8217;s quite an interesting mix.  The only &#8220;British&#8221; dataset is careful to use the full &#8220;Republic of Ireland&#8221; and apparently wants to appease China with its use of &#8220;China Hong Kong SAR&#8221;.  Meanwhile, the Riyadh based Joint Oil Data Initiative is careful about &#8220;Libyan Arab Jamahiriya&#8221; and the &#8220;Islamic Republic of&#8221; Iran.  By contrast, the American datasets seem almost colloquial, leaving off all the formal titles and even using &#8220;Burma&#8221; instead of &#8220;Myanmar&#8221;.  These names may be in common use in the US but they don&#8217;t always match other groups&#8217; standards.</p>
<p>As data managers we always look for standard identifiers to represent well defined concepts.  Political units known as nations are an excellent example of such a well defined concept.  No one disagrees about what is meant whether one types &#8220;US&#8221;, &#8220;USA&#8221;, United States&#8221; or &#8220;United States of America&#8221;.  But for anyone working with data, especially those attempting to merge data from different sources it makes a huge difference whether the national identifier is standardized or not.  Writing code to catch all possible variations of English language country names is just the sort of tedious exercise that can be avoided by the proper use of standard identifiers.</p>
<h2>ISO 3166-1 alpha-2</h2>
<p>Luckily, <a href="http://www.iso.org/iso/english_country_names_and_code_elements" target="_blank">ISO-3166-1 alpha-2</a>, is a widely adopted standard that will meet our needs.  This standard specifies a unique, 2-character code for each nation.  Once we have the codes, we can create tables of country names in different languages.  Here is another table with the ISO-3166-1 alpha-2 codes and the recommend English and French names available from the ISO site.</p>
<table border="0">
<thead>
<tr>
<th colspan="8" align="center">ISO 3166-1 alpha-2</th>
</tr>
</thead>
<tbody>
<tr>
<td>code element</td>
<td>IE</td>
<td>RU</td>
<td>LY</td>
<td>IR</td>
<td>MM</td>
<td>HK</td>
<td>US</td>
</tr>
<tr>
<td>English names</td>
<td>IRELAND</td>
<td>RUSSIAN FEDERATION</td>
<td>LIBYAN ARAB JAMAHIRIYA</td>
<td>IRAN, ISLAMIC REPUBLIC OF</td>
<td>MYANMAR</td>
<td>HONG KONG</td>
<td>UNITED STATES</td>
</tr>
<tr>
<td>French names</td>
<td>IRLANDE</td>
<td>RUSSIE, FÉDÉRATION DE</td>
<td>LIBYENNE, JAMAHIRIYA ARABE</td>
<td>IRAN, RÉPUBLIQUE ISLAMIQUE D&#8217;</td>
<td>MYANMAR</td>
<td>HONG-KONG</td>
<td>ÉTATS-UNIS</td>
</tr>
</tbody>
</table>
<p>One of the truisms of scientific data management is that it is always more robust in the long run to store identifiers in a standardized, computer readable fashion.  When human readable output is needed, code can be written to generate whatever local version of the standard identifier is needed.  This is much preferable to storing human friendly identifiers that prevent data from different sources from being combined.  Or worse, that work in some cases but not others.</p>
<p>It is quite possible to have your cake and eat it, too.  We are not recommending that you create Excel spreadsheets that are devoid of human readable names, requiring your users to know that DZ stands for Algeria.  Instead, just add another column labeled <em>country_code</em> or <em>ISO_3166_1_alpha_2</em> or similar that contains the appropriate code.  That way you and the users of your data will be able to write software to automatically process the data without worrying about whether a cell has &#8220;USA&#8221; or &#8220;United States&#8221;.</p>
<p>If, on the other hand, your identifiers live in a database and will be processed before a human ever sees them then you should only store the identifiers and have the presentation layer of software convert the coded identifiers into human readable strings, preferably in the readers language of choice.</p>
<h2><span style="color: #444444;">FIPS 10-4</span></h2>
<p>Over the years, of course, other standards have come and gone.  Perhaps the most important competing 2-character code was the <a href="http://en.wikipedia.org/wiki/FIPS_10-4">FIPS 10-4</a> standard developed by the US National Institute of Standards and Technology.  This standard was withdrawn by NIST in 2008 in favor of the ISO 3166 standard.</p>
<p>Unfortunately, not every US agency has abandoned their old fashioned ways.  Databases for the Energy Information Administration <a href="http://tonto.eia.doe.gov/country/index.cfm">Country Energy Profiles</a> and the CIA <a href="https://www.cia.gov/library/publications/the-world-factbook/index.html">World Factbook</a> still use the old FIPS 10-4 codes so you need to be careful if you are accessing those databases programatically.  (If you aren&#8217;t you may be surprised at Switzerland&#8217;s huge population and coal production &#8212; in ISO 3166 &#8216;CH&#8217; means Switzerland but in FIPS 10-4 &#8216;CH&#8217; means China.)</p>
<h2>python Babel module</h2>
<p>As an added incentive to encourage the use of these country codes we would like to introduce you to the <a href="http://babel.edgewall.org/" target="_blank">python Babel module</a>.  The Babel module helps with internationalization (aka <a href="http://en.wikipedia.org/wiki/Internationalization_and_localization" target="_blank">i18n</a>) and we will use it to create a table of country names in different languages from our standardized codes.  Do not be put off by the geeky nature of the home page &#8212; you don&#8217;t need to learn much to make use of this excellent module.</p>
<p>Installation is a snap on Ubuntu Linux:</p>
<pre style="padding-left: 30px;">sudo easy_install babel</pre>
<p>Translations for our country codes are found in Locale objects in the module, one for each language.  As you might expect, the supported languages are themselves encoded &#8212; this time according to <a href="http://www.loc.gov/standards/iso639-2/" target="_blank">ISO 639-2</a> language codes.  We&#8217;ll reserve detailed description of the Babel module for another post.  For now, lets cut to the chase and show the code that will generate HTML rows containing the list of country names in the languages of those countries:</p>
<pre>#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys
from babel import Locale

# NOTE:  'my' dropped as the Burmese characters won't display on my Mac
for language in ('ga','ru','ar','fa','zh','en'):
    locale = Locale(language)
    print('&lt;tr&gt;')
    print('&lt;td&gt;' + language + '&lt;/td&gt;')

    for country in ('IE','RU','LY','IR','MM','HK','US'):
        print('&lt;td&gt;' + locale.territories[country] + '&lt;/td&gt;')

    print('&lt;/tr&gt;')</pre>
<p>Eleven actual lines of code and here is the result:</p>
<table border="0">
<thead>
<tr>
<th colspan="8" align="center">Babel module output</th>
</tr>
</thead>
<tbody>
<tr>
<td>ga</td>
<td>Éire</td>
<td>Cónaidhm na Rúise</td>
<td>An Libia</td>
<td>An Iaráin</td>
<td>Maenmar</td>
<td>R.R.S. na Síne Hong Cong</td>
<td>Stáit Aontaithe Mheiriceá</td>
</tr>
<tr>
<td>ru</td>
<td>Ирландия</td>
<td>Россия</td>
<td>Ливия</td>
<td>Иран</td>
<td>Мьянма</td>
<td>Гонконг, Особый Административный Район Китая</td>
<td>США</td>
</tr>
<tr>
<td>ar</td>
<td>أيرلاندا</td>
<td>روسيا</td>
<td>ليبيا</td>
<td>ايران</td>
<td>ميانمار</td>
<td>هونج كونج الصينية</td>
<td>الولايات المتحدة الأمريكية</td>
</tr>
<tr>
<td>fa</td>
<td>ایرلند</td>
<td>روسیه</td>
<td>لیبی</td>
<td>ایران</td>
<td>میانمار</td>
<td>هنگ‌کنگ، ناحیهٔ ویژهٔ حکومتی چین</td>
<td>ایالات متحدهٔ امریکا</td>
</tr>
<tr>
<td>zh</td>
<td>爱尔兰</td>
<td>俄罗斯</td>
<td>利比亚</td>
<td>伊朗</td>
<td>缅甸</td>
<td>中国香港特别行政区</td>
<td>美国</td>
</tr>
<tr>
<td>en</td>
<td>Ireland</td>
<td>Russia</td>
<td>Libya</td>
<td>Iran</td>
<td>Myanmar</td>
<td>Hong Kong SAR China</td>
<td>United States</td>
</tr>
</tbody>
</table>
<p>Pretty amazing, eh?</p>
<p>So now you know why we always use standardized codes for country names and languages.  And you should too!</p>
]]></content:encoded>
			<wfw:commentRss>http://mazamascience.com/WorkingWithData/?feed=rss2&amp;p=129</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Zero vs. Missing</title>
		<link>http://mazamascience.com/WorkingWithData/?p=343</link>
		<comments>http://mazamascience.com/WorkingWithData/?p=343#comments</comments>
		<pubDate>Thu, 25 Feb 2010 22:43:49 +0000</pubDate>
		<dc:creator>Jonathan Callahan</dc:creator>
				<category><![CDATA[Data Management]]></category>
		<category><![CDATA[CSV]]></category>
		<category><![CDATA[MS Excel]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[USGS]]></category>
		<category><![CDATA[validation]]></category>

		<guid isPermaLink="false">http://mazamascience.com/blog/?p=343</guid>
		<description><![CDATA[On the left we have zero, our integer measure of nothingness.  On the right we have missing value, aka N/A, aka NA, our signal that the value of a datapoint is unknown.  Everyone who deals with data has to deal &#8230; <a href="http://mazamascience.com/WorkingWithData/?p=343">&#160;&#160;read more ... </a>]]></description>
			<content:encoded><![CDATA[<p>On the left we have zero, our integer measure of nothingness.  On the right we have <em>missing value</em>, aka <em>N/A</em>, aka <em>NA</em>, our signal that the value of a datapoint is unknown.  Everyone who deals with data has to deal with this important distinction.  And far too often people get it wrong.<span id="more-343"></span></p>
<h2>The importance of storing &#8216;missing&#8217;</h2>
<p>One of the most important components of a data management strategy is the proper handling of <em>missing values</em>.</p>
<p>Every data collection effort is essentially an attempt to answer a set of questions of this form:</p>
<p style="padding-left: 30px;">What was the value of _______ at such-and-such time and place?</p>
<p>A climatologist will store a numeric value in a particular set of units like temperature in degrees C:</p>
<p style="padding-left: 30px;">What was the <span style="text-decoration: underline;">high temperature</span> in degrees C in Saint. Louis on Jul 20, 2000?</p>
<p>A census worker will store a string of characters that represent an identifier like a name:</p>
<p style="padding-left: 30px;">What was the <span style="text-decoration: underline;">name of the head of household</span> at this address on April 3, 2010?</p>
<p>As in all human endeavors, success is not 100% guaranteed.  So we must be ready to deal with the possibility that our data collection effort comes up short.  Perhaps the thermometer in St. Louis malfunctioned on our date of interest.  In this case the correct answer to the question &#8220;What was the temperature?&#8221; would be &#8220;We don&#8217;t know.&#8221;  In the past , and disappointingly continuing into the present, some people would enter 0 (the number zero) when a datum was missing.  Anyone examining the data at some later date would be quite surprised at the sudden cold snap that hit St. Louis on July 20, 2000!</p>
<p>And what about our census worker?  If no one is at home should they enter into the NAME field the string:  &#8221;not at home&#8221;.  That would be an unusual name.  Perhaps there is a spelling error and they meant &#8220;Notatome&#8221; which is a valid surname.  Or perhaps they combined typos with a first-last switch and are really identifying &#8220;Tome Nota&#8221; &#8212; a Portugese language Christian youth ministry.  We can&#8217;t really know for sure unless we have a rock solid standard for storing the meaning &#8220;we don&#8217;t know&#8221; or &#8220;not measured&#8221; or &#8220;missing&#8221;.</p>
<h2 style="font-size: 1.5em;">Storing missing values in spreadsheets</h2>
<p>Spreadsheet software like MicroSoft Excel and OpenOffice Calc is ubiquitous.  Given the prevalence of PC&#8217;s in the government and business world, most people have Excel on their desktop and many small datasets are generated in Excel and disseminated as Excel files.  It is important, therefore, to learn how to use the features of Excel to properly encode the meaning &#8220;missing value&#8221; whenever necessary.</p>
<p>One of the reasons spreadsheet software is so popular is that it allows users to <em>see</em> the data they are working with, giving them a sense of control over the data.  &#8220;Seeing is believing&#8221; and most spreadsheet users seem to trust their own eyes more than they trust their software tools.  Although we agree that human visual perception is an incredible tool for hunting ripe blackberries, we would argue that it is not evolutionarily suited to validating numbers and text.  One common example of the problem with human validation of numbers and text is our inability to tell the difference between an empty cell and one containing a whitespace character such as <a href="http://www.fileformat.info/info/unicode/char/0020/index.htm" target="_blank">Unicode character &#8216;SPACE&#8217; (U+0020)</a>.  According to your eyes, there is no difference between the two.  But in reality, there is.</p>
<p>This may not seem like a problem if you are generating charts inside of Excel and can clean up the data yourself when you run into problems.  Things get trickier, however, when these data are merged with other data and further processed by people who have no specific knowledge of the dataset.  If a column of unvalidated data containing the SPACE character is assumed by downstream software to be numeric, a data type error will typically result.  If, instead, the data column is assumed to contain character strings, no error will be detected and the SPACE character will be used as a valid string.  The burden of checking for cells containing only whitespace characters and then converting these to missing values falls to the programmers working with the merged dataset.  It&#8217;s no wonder, then, that merged database projects so often go over budget when the incoming data is poorly validated.</p>
<p>One of the problems in using spreadsheet software to work with data is that the default data validation settings on these tools are too forgiving.  By default, data entered into a cell can be of any type &#8212; string, float, decimal, date, <em>etc</em>.  One of the most important things an Excel user can do is to harness the built in validation tools every time new data are added.  MicroSoft provides <a href="http://office.microsoft.com/en-us/excel/HP100726001033.aspx" target="_blank">extensive documention</a> on using and customizing data validation within Excel and users should become familiar with these features.  Users can validate not only the data type but can set up valid ranges that alert users to problematic values when they are entered.</p>
<p>Getting back to our original theme, missing values in Excel are identified by cells that are truly empty.  To understand what this means you have to understand the different cell types used by Excel.  Internally, Excel recognizes the following types of cells:</p>
<ul>
<li>EMPTY</li>
<li>TEXT</li>
<li>NUMBER</li>
<li>DATE</li>
<li>BOOLEAN</li>
<li>ERROR</li>
<li>BLANK</li>
</ul>
<p>Everything makes sense right up through BOOLEAN.  The ERROR type is used when there is an internal Excel error message associated with the value of a cell &#8212; presumably due to problems with a formula.  But what about BLANK?  Well, BLANK cells are those that have formatting information but no stored value.  From the data manager&#8217;s point of view they <em>should</em> be the same as EMPTY.</p>
<p>Unfortunately, older versions of Excel had a different idea.  A <a href="http://support.microsoft.com/kb/214244" target="_blank">MicroSoft support article</a> for Excel versions prior to 2002 has the following:</p>
<blockquote><p>In Microsoft Excel, when you use a formula that tests for a zero value, you may see unexpected results if the cell is blank. Microsoft Excel interprets a blank cell as zero, and not as empty or blank. Therefore, any cells that are blank are evaluated as zero in the function.</p></blockquote>
<p>Ahhh, spreadsheets.  If you have to use them make sure that you have a recent version and that you learn to use the data validation tools available.</p>
<h2 style="font-size: 1.5em;">Storing missing values in CSV files</h2>
<p>We are quite fond of CSV files as a lowest common demoninator format for data exchange.  CSV files can be read by just about any software out there and greatly reduce the the effort involved in reformatting data.  Many datasets are inherently tabular in nature and it is straightforward to export Excel worksheets as CSV files.  An <a href="http://mazamascience.com/blog/?p=168" target="_blank">earlier post</a> described how to automatically convert an entire Acess database into a set of CSV files.  The only problem with CSV as a standard is that it isn&#8217;t really standardized.  So let us evaluate the options for storing the meaning &#8220;missing&#8221; in a CSV file.</p>
<p>In the bad old days, which are still with us in some areas, lots of different identifiers were used to store the idea of &#8220;missing value&#8221;.  Here are a few we have encountered:</p>
<ul>
<li>&#8220;-&#8221;<br />
The hyphen is a convenient note to make on a paper spreadsheet to identify something like:  &#8220;Measurement attempted but no value obtained&#8221;.  However, it is not widely understood by analysis software.</li>
<li>&#8220;0&#8243;<br />
<strong>Do this under penalty of death!!!</strong></li>
<li>&#8220;999&#8243;<br />
A perennial favorite, this is a valid numeric value and only conveys the meaning &#8220;missing&#8221; if one is intimately familiar with the valid domain of the data.</li>
<li>&#8220;-9999&#8243;, &#8220;-999&#8243;, &#8220;9999&#8243;, &#8220;999999&#8243;, &#8220;9999999&#8243;<br />
Yes, we&#8217;ve seen them all.  Some sites use six 9&#8242;s and seven 9&#8242;s as separate missing value flags to identify which technician was responsible for working up the sample associated with each cell.  If you ever expect your data to be used by anyone other than yourself please, please don&#8217;t do this.</li>
<li>&#8220;N/A&#8221;<br />
This designator for &#8220;Not Available&#8221; is quite human readable but is not commonly understood by analysis software.</li>
<li>&#8220;NaN&#8221;<br />
This designator is used in floating point computations and is not really appropriate for data that might be stored in spreadsheets or CSV files.  For full details see the <a href="http://en.wikipedia.org/wiki/NaN" target="_blank">Wikipedia article</a>.</li>
<li>&#8220;&#8221;<br />
The blank string of zero length is an excellent designator of &#8220;nothing here&#8221;.  The only downside is that it is not explicit and some tools will not automatically populate cells with their own internal designator for &#8220;missing value&#8221; without some coaching.</li>
<li>&#8220;NA&#8221;<br />
This designator is widely understood as the ASCII representation of &#8220;missing value&#8221; and is the default in many packages including our personal favorite &#8212; the <a href="http://www.r-project.org/" target="_blank">R statistical analysis package</a>.  Even here there is a problem, though, as &#8220;NA&#8221; is also the <a href="http://www.iso.org/iso/english_country_names_and_code_elements" target="_blank">ISO 3066-1</a> identifier for Namibia.  We have a rule of avoiding Namibian conflict whenever possible which brings us to our own preferred designator for &#8220;missing value&#8221;:</li>
<li>&#8220;na&#8221;<br />
The R package, and presumably others, can be coached to accept this as the default missing value string when reading in CSV files and this designator avoids the problems mentioned above.</li>
</ul>
<p>The bottom line is that we recommend either &#8220;na&#8221; or &#8220;NA&#8221; as the missing value flag for CSV files.  Anything else is inviting trouble and expense for everyone downstream.</p>
<h2>Overcorrecting &#8212; Storing &#8216;missing&#8217; when you really mean zero</h2>
<p>Have you ever heard anyone say &#8220;&#8230; for you and I &#8230;&#8221;?  A lot of people have been taught that they overuse &#8220;me&#8221; and then overcorrect by using &#8220;I&#8221; even when the pronoun is the object of a verb or preposition.  Well, the same thing happens with zero <em>vs</em>. missing.</p>
<p>There are times when the correct answer is zero and not &#8220;we don&#8217;t know.&#8221;  Overcorrecting by storing the missing value deisgnator instead of zero can also cause problems for anyone working with the data.</p>
<p>Two examples should get the idea across:</p>
<p>The <a href="http://minerals.usgs.gov/ds/2005/140/" target="_blank">USGS DataSeries 140</a> is a fantastic, century-long history of production and use data for a long list of minerals.  The lead-use Excel file from this dataset allows one to track the year-to-year evolution of lead use in different categories, one per column.  The column labeled &#8220;Gasoline additives&#8221; decreases gradually from a high of 218,000 metric tons in 1976 (ouch!) to 4,000 metric tons in 1994.  The cells for years after 1994 are of type EMPTY.</p>
<p>As we read above, cells of type EMPTY are to be interpreted as &#8220;we don&#8217;t know&#8221;.  So, do the folks at USGS really not know how much lead was used for gasoline additives in say &#8230; 2002?  Hardly.  The Notes worksheet in the file states quite clearly:</p>
<blockquote><p>From 1987 to 1995, data reported on gasoline additives depicts a progressive phaseout. The U.S. Environmental Protection Agency issued a direct final rule that prohibits the use of chemical additives in gasoline for highway vehicles, effective January 1, 1996.</p></blockquote>
<p>So the amount of lead used for gasoline additives, at least post 1995, is in fact zero.  And that is what should be stored in the spreadsheet, not the Excel designator for &#8220;we don&#8217;t know&#8221;.  Anyone trying to create a pie plot showing percentage usage in different categories is stymied when the amount in one of the categories is &#8220;unknown&#8221;.  No category value =&gt; no total amount =&gt; no percentages.  It would be much better for downstream users of the data if zero were stored in these cells.</p>
<p>Even folks who work at the highest levels of government, data-driven science with some of the best software tools available for data handling and analysis are subject to this error.  The International Combined Ocean Atmosphere Data Set (<a href="http://icoads.noaa.gov/" target="_blank">ICOADS)</a> is an extremely high value dataset that contains ocean observations of climate variables like air and sea surface temperatures, wind speed and pressure.  With this dataset one can create maps depicting the ocean surface temperature for any month going all the way back to January 1800.  (Don&#8217;t expect to see much data before 1900, though.)</p>
<p>Along with the measured variables this dataset also includes variables containing the number of observations that were used to create the monthly average in a particular grid cell.  The idea here is that you might have more faith in a temperature value if the lat-lon box for a particular month contained many observations as opposed to just one or two.  So each NUMOBS_SST value is really an answer to the question:</p>
<p style="padding-left: 30px;">How many observations of sea surface temperature were used to generate the SST value stored in the associated space-time grid cell?</p>
<p>The answer should never be &#8220;we don&#8217;t know&#8221;.  But that is in fact what is stored.  The NUMOBS_SST field has lots of missing values and a few &#8220;1&#8243;s in January of 1854.  To our way of thinking it should contain a lot of &#8220;0&#8243;s and a few &#8220;1&#8243;s.</p>
<p>Perhaps we are nitpicking here.  Users of this data do not seem unhappy with the data files as they are.  We are merely using this as an example to highlight the difference between zero and missing in the hopes of fostering better data management in all fields.</p>
<p>Here&#8217;s to honest reporting of &#8220;we don&#8217;t know&#8221; and to not letting your zeroes go missing.</p>
]]></content:encoded>
			<wfw:commentRss>http://mazamascience.com/WorkingWithData/?feed=rss2&amp;p=343</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
