Working with data from varied sources can be frustrating — some data will be in CSV format; some in XML; some available as HTML pages; other data as relational databases or MS Excel spreadsheets.
This post will cover the UNIX tools that every data manager needs to be familiar with in order to work with varied data sources.
When data formats are inconsistent the best thing to do is to find some lowest common denominator that is shared by all. Ultimately, that common denominator is ASCII text and good data managers need to have a suite of tools to help them convert data files into ASCII and then to work with that ASCII text. (A previous post already covered converting Microsoft Access “.mdb” files into CSV files.)
Unix shell basics
Readers should not be concerned if they are not Unix gurus. We assume only a passing familiarity with Unix commands and access to a machine with a Unix shell terminal (e.g. Mac OS X). However, you should already be familiar with the following basic commands:
- pwd — present working directory
- mkdir — make a new directory
- cd — change directory
- ls — list contents of a directory
- rm — remove a file
- man — display a manual page for a command
- echo — print something to the stdout
- ; — separator between Unix commands (lets you run several commands on a single line)
- \ — line continuation (spread a long command over multiple input lines)
- | — pipe the output of one command as the input of another command
- > — send the output of a command to a file
- >> — append the output of a command to an existing file
Find out more about any Unix command by using the man command (e.g. man rm). When a man page is being displayed you can use CTRL-f and CTRL-b to page forward and back and q to quit and return to the shell prompt.
Here are the 10 commands you should try to master:
1) cat
The cat command concatenates one or more files and sends the result to standard output. You can use cat at the beginning of a Unix pipeline.
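For example (a quick sketch using hypothetical file names), cat can glue several files together or feed their contents into a pipeline:
cat part1.csv part2.csv > combined.csv   # concatenate two (hypothetical) files into one
cat combined.csv | wc -l                 # or start a pipeline with a file's contents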
2) head & tail
The head and tail commands allow you to trim a file to include only the first (or last) N lines.
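The two can also be chained to pull a block of lines out of the middle of a file (hypothetical file name here), a trick we will use later on:
head -20 datafile.txt              # the first 20 lines
tail -5 datafile.txt               # the last 5 lines
head -20 datafile.txt | tail -10   # lines 11 through 20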
3) wc
Use wc to count the number of lines, words or bytes in a file.
4) cut
The cut command cuts a column out of an ASCII file based on character position or some delimiter.
5) paste
The paste command can take two files with the same number of lines and join them together into a single file.
To exercise these first few commands we will need two sample files: rownames.txt and rowdata.txt. For mysterious reasons, these files begin life separated. (Detailed explanations of the use of each command are not given. Read the man pages to learn more about how command line flags change the default behavior of each command.)
"name" a b c d e f g h i j
"integer value","float value","2-digit ID" 1,1.0,"01" 2,2.0,"02" 3,3.0,"03" 4,4.0,"04" 5,5.0,"05" 6,6.0,"06" 7,7.0,"07" 8,8.0,"08" 9,9.0,"09" 10,10.0,"10"
Now we can use our Unix commands to do some useful work.
Assuming we had never seen these files before, we could take a quick peek at the beginning of the rowdata.txt file to see what it looks like:
head -3 rowdata.txt
"integer value","float value","2-digit ID"
1,1.0,"01"
2,2.0,"02"
After similarly checking on rownames.txt we might inquire whether the files have the same number of lines:
wc -l row*.txt
      11 rowdata.txt
      11 rownames.txt
      22 total
They each have eleven lines so we could use paste to join them into a single file. But perhaps we only want the “2-digit ID” column from rowdata.txt. Let’s first practice cutting out that column:
cut -d',' -f3 rowdata.txt
"2-digit ID"
"01"
"02"
"03"
"04"
"05"
"06"
"07"
"08"
"09"
"10"
Excellent. Now let’s create a new file by cutting out the third column of rowdata.txt and piping it as input to the paste command which we will tell to use the TAB delimiter:
cut -d',' -f3 rowdata.txt | paste -d'\t' rownames.txt -
"name"	"2-digit ID"
a	"01"
b	"02"
c	"03"
d	"04"
e	"05"
f	"06"
g	"07"
h	"08"
i	"09"
j	"10"
As we see, the real power of the Unix shell environment is the ability to string together commands that have tightly focused functionality.
Now on to the remaining, more complex commands.
6) grep
The grep command is a very powerful pattern-matching command that uses regular expressions to match individual lines in a file. Learning how to use regular expressions takes a while, but they are extremely powerful once you do.
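A few hypothetical examples of what grep patterns look like:
grep 'Widespread' datafile.txt   # lines containing a fixed string
grep '^[0-9]' datafile.txt       # lines beginning with a digit
grep -v '^#' datafile.txt        # -v inverts the match: lines NOT beginning with '#'
grep -c 'error' logfile.txt      # -c counts matching lines instead of printing them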
7) sort
The sort command takes a bunch of lines of text as input, sorts them and writes them to stdout.
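Some hypothetical examples of commonly used flags:
sort names.txt                   # sort lines alphabetically
sort -n numbers.txt              # sort numerically rather than alphabetically
sort -t',' -k2 -n datafile.csv   # sort a CSV file numerically on its second column
sort -u names.txt                # -u drops duplicate lines after sorting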
8) sed
The Unix stream editor processes input one line at a time and can do all kinds of manipulations on the text based on a special set of commands passed to sed. The sed command uses basic regular expressions, which differ slightly from the extended regular expressions available with grep -E. One common source of confusion: do not use the newline character '\n' in sed regular expressions. Thankfully, there are many examples to learn from. You should spend some time looking at the sed one-liners page to see what is possible.
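As a small taste (hypothetical file names), the workhorse is the 's' (substitute) command:
sed 's/,/ /g' datafile.csv   # replace every comma with a space
sed 's/"//g' datafile.csv    # strip out all double quote characters
sed '/^$/d' datafile.txt     # delete blank lines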
9) curl
Use the curl command to get the contents at a particular URL. This command has a scary number of arguments but is super helpful for grabbing data from a web site.
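For example (hypothetical URL), the two flags we lean on below are -s and -o:
curl -s http://example.com/data.csv | head -5   # -s silences the progress meter
curl http://example.com/data.csv -o data.csv    # -o saves the response to a local file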
10) xmllint
The xmllint command is very useful for ‘pretty printing’ data in XML format so that tools like grep and sed can work on the output one line at a time.
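A minimal sketch (hypothetical file name) of the usage we will rely on later:
xmllint --format messy.xml > pretty.xml   # re-indent the XML, one element per line
xmllint --format messy.xml | head -20     # or just peek at the beginning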
Putting it all Together
In the following three examples we will use our set of tools to perform the following tasks:
- process the latest data from the CDC to review the progress of the H1N1 flu virus
- convert a human-friendly but machine-unreadable set of unemployment data into a simple CSV file
- examine the consistency of data formats used within a Federal XML dataset
Getting regular updates from the CDC
Our first example will involve the weekly flu report from the CDC. The data are available at http://www.cdc.gov/flu/weekly/flureport.xml.
This is an XML dataset that can best be processed with XML tools. But for those who don’t live and breathe XML this can be an onerous task. Instead we will gather some information of interest using our Unix tools. Because this file is updated weekly we will want to create a script that works on the remote data. Looking at the data we determine that the last 273 lines correspond to the most recent week:
curl http://www.cdc.gov/flu/weekly/flureport.xml | tail -273
...
To figure out what reporting levels are possible we examine the contents of the <label> element for the entire XML document. We use curl in ‘silent mode’, grep for only those lines containing ‘label’ and then use cut to snip out that portion of each line between ‘>’ and ‘<‘. Finally, we sort with the ‘-u’ flag to remove duplicate results:
curl -s http://www.cdc.gov/flu/weekly/flureport.xml | grep label | \
  cut -d'>' -f2 | cut -d'<' -f1 | sort -u
Local Activity
No Activity
No Report
Regional
Sporadic
Widespread
For a quick update on how many states and territories are reporting ‘Widespread’ flu we can use grep and wc:
curl -s http://www.cdc.gov/flu/weekly/flureport.xml | tail -273 | \
  grep 'Widespread' | wc -l
24
This presentation of information isn’t as nice as the CDC’s page but serves to demonstrate how one can go about harvesting XML data from the web without writing any code.
2016-09-09 NOTE: Examples below here reference sources of data that no longer exist. They are left in for the command line examples they provide.
Converting human-friendly data into machine readable data
The dataset we will be working with this time is titled “Employment status of the civilian non-institutional population 16 years and over by sex, 1973 to date” from the Bureau of Labor Statistics. You may wish to look at the file in a new window.
First we download the data to a local file and look at the file to figure out which lines we want:
curl ftp://ftp.bls.gov/pub/special.requests/lf/aat2.txt -o Employment_Stats.txt
cat -n Employment_Stats.txt
...
To extract the data just for the women we will clip out the relevant section and grep for those lines containing '...':
head -103 Employment_Stats.txt | tail -39 | grep '\.\.\.'
...
Because this file has constant column widths we can cut out the columns for “Year”, “Percent of Total Population” and “Percent of Employed”, making sure to include one space character before the latter data columns:
head -103 Employment_Stats.txt | tail -39 | grep '\.\.\.' | cut -c 6-9,61-65,83-87
...
Now just convert the spaces to commas and include a header line:
echo '"year","percent of population","percent of employed"'; head -103 Employment_Stats.txt | \ tail -39 | grep '...' | cut -c 6-9,61-65,83-87 | \ sed 's/ /,/g' "year","percent of population","percent of employed" 1973,44.7,42.0 1974,45.7,42.6 1975,46.3,42.0 1976,47.3,43.2 ...
Now we have a file that can be understood by modern software tools. (Wasn’t that easier than writing a dedicated program to process this data?)
Investigating data consistency
For our last example we will work with the XML data contained in the Recreation Information Database (RIDB). The RIDB server seems a little overtaxed so we download the file for further processing:
curl 'http://www.recdata.gov/RIDBWebService/RIDBService.jws/getAllRecElementsForOrgID?anOrgID=-1' -o RIDB.xml
This database contains information compiled from various agencies and is riddled with inconsistencies. Let’s use our tools to examine just one. As grep and sed work on a single line at a time we will need to ‘pretty print’ our RIDB.xml file to put each XML element on its own line. This is where xmllint comes in:
xmllint --format RIDB.xml | head -20
<?xml version="1.0" encoding="utf-8"?>
...
Let’s take a few seconds to process this 25 MByte file to see which organizations have contributed data:
xmllint --format RIDB.xml | grep 'OrgName' | cut -d'>' -f2 | cut -d'<' -f1 | sort -u
American Battle Monuments Commission
Bureau of Engraving and Printing
Bureau of Land Management
Bureau of Reclamation
Department of Agriculture
Department of Commerce
Department of Defense
Department of the Interior
Department of the Treasury
Department of Transportation
Fairfax County
Fairfax County Park Authority
FEDERAL
Fish and Wildlife Service
National Archives and Records Administration
National Ocean Service
National Park Service
National Register of Historic Places
PRIVATE
ReserveAmerica
Smithsonian Institution
Smithsonian Institution Affiliations Program
STATE
Tennessee Valley Authority
US Air Force
US Army Corps of Engineers
USDA Forest Service
Wilderness Institute
Wilderness.net
Now let’s take a closer look at the phone numbers to see if they are consistently formatted. We’ll start by exporting them to their own file:
xmllint --format RIDB.xml | grep 'RecAreaPhone' | cut -d'>' -f2 | \
  cut -d'<' -f1 | sort -u -n > sorted_phone.txt
Reviewing the sorted_phone.txt file we see the various formats we might expect:
- (###) ###-####
- (###) ### ####
as well as some serious outliers. Most of the offenders can be found by checking for letters:
grep '[a-zA-Z]' sorted_phone.txt
207-827-6138, ext. 17
304-636-1800 x280
410-228-2692 x118
413-548-8002 ext. 111
505-761-8794 ext. 29
509-575-5848 X202
510-562-PARK
(706) 856-0300 or 888-893-0678
800-939-PARK
828-271-4779 ext 210
870-473-2869 Or 870-364-3167
907-246-4250 or 907-246-3339
907-456-0250 and 800-362-4546
970-882-2213 or 970-946-0461
Headquarters 202.485.9880 <br> Visitor Informaiton 202.426.6841 <br>
Headquarters 307-739-3300 <br>
Headquarters 503-861-2471 <br> Visitor Information 503-861-2471ext.214 <br> Visitor Information (TDD) 503-861-4408 <br> Education Program Information 503-861-4422 <br>
Headquarters 815/588-6040 <br>
Headquarters 907-442-8300 <br> Summer Visitor Information 907-442-3760 <br>
Headquarters (TDD) (615) 893-9501 <br> Visitor Information (TDD) (615) 893-9501 <br>
Visitor Information (301) 432-5124 <br> Headquarters (301) 432-7672 <br>
Visitor Information (308) 586-2581 <br>
Visitor Information 757-898-2410 <br>
Visitor Information (804) 732-3531 <br>
The last few entries are problematic. Storing text descriptions and escaped HTML markup in a field reserved for telephone numbers is not really considered best practice. The folks in charge of RIDB should do a little more data validation before merging the various sources that contribute to their database.
So there you have it. You can do data reformatting, remote data access and XML investigation with only a handful of Unix tools and essentially no programming. Hone these skills and you will become wonderfully efficient at working with the datasets that exist out there ‘in the wild’.