Ten UNIX commands every data manager should know

Working with data from varied sources can be frustrating: some data will be in CSV format, some in XML, some available only as HTML pages, and other data as relational databases or MS Excel spreadsheets.

This post will cover the UNIX tools that every data manager needs to be familiar with in order to work with varied data sources.

When data formats are inconsistent the best thing to do is to find some lowest common denominator that is shared by all.  Ultimately, that common denominator is ASCII text and good data managers need to have a suite of tools to help them convert data files into ASCII and then to work with that ASCII text.  (A previous post already covered converting Microsoft Access “.mdb” files into CSV files.)

Unix shell basics

Readers should not be concerned if they are not Unix gurus.  We assume only a passing familiarity with Unix commands and access to a machine with a Unix shell terminal (e.g. Mac OS X).  However, you should already be familiar with the following basic commands:

  • pwd — print the working directory
  • mkdir — make a new directory
  • cd — change directory
  • ls — list contents of a directory
  • rm — remove a file
  • man — display a manual page for a command
  • echo — print something to stdout
  • ; — separator between Unix commands (run multiple commands, one after another, from a single command line)
  • \ — line continuation (spread a long command over multiple input lines)
  • | — pipe the output of one command as the input of another command
  • > — send the output of a command to a file
  • >> — append the output of a command to an existing file

Find out more about any Unix command by using the man command (e.g. man rm).  When a man page is being displayed you can use CTRL-f and CTRL-b to page forward and back and q to quit and return to the shell prompt.

Here are the 10 commands you should try to master:

1) cat

The cat command concatenates one or more files and sends the result to standard output.  You can use cat at the beginning of a Unix pipeline.
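A minimal example, combining two hypothetical CSV fragments into a single file:

  $ cat part1.csv part2.csv > combined.csv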

2) head & tail

The head and tail commands allow you to trim a file to include only the first (or last) N lines.
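For example, using a hypothetical data.csv:

  $ head -n 5 data.csv     # first five lines
  $ tail -n 5 data.csv     # last five lines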

3) wc

Use wc to count the number of lines, words or bytes in a file.
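For example, again with a hypothetical data.csv:

  $ wc data.csv        # lines, words and bytes
  $ wc -l data.csv     # lines only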

4) cut

The cut command cuts one or more columns out of an ASCII file based on character position or a delimiter.
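Two quick sketches, one per mode, again using a hypothetical data.csv:

  $ cut -d',' -f2 data.csv     # second comma-delimited column
  $ cut -c 1-4 data.csv        # characters 1 through 4 of every line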

5) paste

The paste command can take two files with the same number of lines and join them together into a single file.
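For example, with two hypothetical files that line up row for row (paste uses TAB as its default delimiter):

  $ paste names.txt values.txt > table.txt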

Interlude

To exercise these first few commands we will need two sample files, one with rownames and another with rowdata.  For mysterious reasons, these files begin life separated.  (Detailed explanations of the use of each command are not given.  Read the man pages to learn more about how command line flags change the default behavior of each command.)

rownames.txt
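The contents shown in the original post are not preserved here.  If you want to follow along, create an eleven-line rownames.txt of your own; the names below are invented stand-ins:

  name
  alder
  birch
  cedar
  dogwood
  elm
  fir
  ginkgo
  hawthorn
  ironwood
  juniper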

rowdata.txt
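Likewise for rowdata.txt.  The values below are invented; they simply match the structure the later examples assume: eleven comma-separated lines whose third column, “2-digit ID”, holds a two-digit identifier:

  lat,lon,2-digit ID
  45.48,-122.61,11
  45.52,-122.68,23
  45.37,-122.85,31
  45.60,-122.44,42
  45.41,-122.73,57
  45.55,-122.50,64
  45.49,-122.91,78
  45.63,-122.59,83
  45.45,-122.77,96
  45.58,-122.66,19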

Now we can use our Unix commands to do some useful work.

Assuming we had never seen these files before, we could take a quick peek at the beginning of the rowdata.txt file to see what it looks like:
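A sketch of that step, using head (which prints the first ten lines by default):

  $ head -n 5 rowdata.txt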

After similarly checking on rownames.txt we might inquire whether the files have the same number of lines:
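The -l flag limits wc to line counts:

  $ wc -l rownames.txt rowdata.txt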

They each have eleven lines so we could use paste to join them into a single file.  But perhaps we only want the “2-digit ID” column from rowdata.txt.  Let’s first practice cutting out that column:
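Assuming the comma-delimited stand-in shown above, where the ID sits in the third column:

  $ cut -d',' -f3 rowdata.txt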

Excellent.  Now let’s create a new file by cutting out the third column of rowdata.txt and piping it as input to the paste command which we will tell to use the TAB delimiter:
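One way to write that pipeline (the output file name is invented; the ‘-’ tells paste to read its second file from the pipe):

  $ cut -d',' -f3 rowdata.txt | paste -d '\t' rownames.txt - > rownames_and_ids.txt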

As we see, the real power of the Unix shell environment is the ability to string together commands that have tightly focused functionality.

Now on to the remaining, more complex commands.

6) grep

The grep command is a very powerful pattern-matching command that uses regular expressions to match individual lines in a file.  Learning how to use regular expressions will take a while, but they are extremely powerful once you do.
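Two sketches, using a hypothetical file.txt:

  $ grep 'pattern' file.txt          # lines containing a fixed string
  $ grep -E '^[0-9]{4}' file.txt     # extended regex: lines beginning with four digits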

7) sort

The sort command takes a bunch of lines of text as input, sorts them and writes them to stdout.
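For example, with hypothetical input files:

  $ sort file.txt                   # alphabetical sort
  $ sort -u file.txt                # sort and drop duplicate lines
  $ sort -t',' -k2,2 -n file.csv    # numeric sort on the second comma-delimited field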

8) sed

The Unix stream editor processes input one line at a time and can do all kinds of manipulations on the text based on a special set of commands passed to sed.  The sed command uses basic regular expressions to find the text to manipulate; these differ slightly from the extended regular expressions that grep offers through its -E flag.  One common source of confusion: because sed handles each line separately, do not use the newline character, ‘\n’, in sed regular expressions.  Thankfully, there are many examples to learn from.  You should spend some time looking at the sed one-liners page to see what is possible.
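Two common sketches, again with a hypothetical file.txt:

  $ sed 's/ /,/g' file.txt       # replace every space with a comma
  $ sed -n '10,20p' file.txt     # print only lines 10 through 20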

9) curl

Use the curl command to get the contents at a particular URL.  This command has a scary number of arguments but is super helpful for grabbing data from a web site.
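For example (the URL is a placeholder):

  $ curl -s 'https://example.com/data.csv' > data.csv     # -s silences the progress meter
  $ curl -O 'https://example.com/data.csv'                # -O keeps the remote file name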

10) xmllint

The xmllint command is very useful for ‘pretty printing’ data in XML format so that tools like grep and sed can work on the output one line at a time.
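For example:

  $ xmllint --format data.xml > data_formatted.xml     # one element per line, nicely indented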

Putting it all Together

In the following three examples we will use our set of tools to perform the following tasks:

  1. process the latest data from the CDC to review the progress of the H1N1 flu virus
  2. convert a human-friendly but machine-unreadable set of unemployment data into a simple CSV file
  3. examine the consistency of data formats used within a Federal XML dataset

Getting regular updates from the CDC

Our first example will involve the weekly flu report from the CDC.  The data are available here.

This is an XML dataset that can best be processed with XML tools.  But for those who don’t live and breathe XML this can be an onerous task.  Instead we will gather some information of interest using our Unix tools.  Because this file is updated weekly, we will want to create a script that works on the remote data.  Looking at the data we determine that the last 273 lines correspond to the most recent week:
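A sketch of that step; the URL is assumed from the post’s description, so substitute the current location of the flu report XML if it has moved:

  $ curl -s 'https://www.cdc.gov/flu/weekly/flureport.xml' | tail -n 273 > thisweek.xml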

To figure out what reporting levels are possible we examine the contents of the <label> element for the entire XML document.  We use curl in ‘silent mode’, grep for only those lines containing ‘label’, and then use cut to snip out that portion of each line between ‘>’ and ‘<’.  Finally, we sort with the ‘-u’ flag to remove duplicate results:
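That pipeline might look like the following, where two cuts snip out the text between ‘>’ and ‘<’:

  $ curl -s 'https://www.cdc.gov/flu/weekly/flureport.xml' | grep 'label' | \
      cut -d'>' -f2 | cut -d'<' -f1 | sort -u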

For a quick update on how many states and territories are reporting ‘Widespread’ flu we can use grep and wc:
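For example, limiting the count to the most recent week:

  $ curl -s 'https://www.cdc.gov/flu/weekly/flureport.xml' | tail -n 273 | grep 'Widespread' | wc -l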

This presentation of information isn’t as nice as the CDC’s page but serves to demonstrate how one can go about harvesting XML data from the web without writing any code.


2016-09-09 NOTE:  Examples below here reference sources of data that no longer exist. They are left in for the command line examples they provide.


Converting human-friendly data into machine readable data

The dataset we will be working with this time is titled “Employment status of the civilian non-institutional population 16 years and over by sex, 1973 to date” from the Bureau of Labor Statistics.  You may wish to look at the file in a new window.

First we download the data to a local file and look at the file to figure out which lines we want:
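A sketch of those steps; the BLS URL from the post is gone, so a placeholder stands in for it:

  $ curl -s 'BLS_TABLE_URL' > laborforce.txt     # substitute the real location of the table
  $ head -n 40 laborforce.txt                    # eyeball the layout
  $ wc -l laborforce.txt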

To extract the data just for the women we will clip out the relevant section and grep for those lines with ‘…’:
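Since the source file no longer exists, the line range below is a placeholder, and the grep pattern assumes that the ‘…’ in the text refers to the dot leaders that follow each year in the BLS table:

  $ head -n 90 laborforce.txt | tail -n 46 | grep '\.\.\.' > women.txt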

Because this file has constant column widths we can cut out the columns for “Year”, “Percent of Total Population” and “Percent of Employed”, making sure to include one space character before the latter data columns:
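The character ranges below are placeholders to be adjusted against your copy of the file; note how each range after the year starts one character early to pick up a leading space:

  $ cut -c 1-4,18-23,40-45 women.txt > women_cut.txt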

Now just convert the spaces to commas and include a header line:
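A sketch of the finish, building on the women_cut.txt file from the previous step; the header names are invented, and sed collapses each run of spaces into a single comma:

  $ echo "year,pct_of_population,pct_of_employed" > women_employment.csv
  $ sed -E 's/ +/,/g' women_cut.txt >> women_employment.csv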

Now we have a file that can be understood by modern software tools.  (Wasn’t that easier than writing a dedicated program to process this data?)

Investigating data consistency

For our last example we will work with the XML data contained in the Recreation Information Database (RIDB).  The RIDB server seems a little overtaxed so we download the file for further processing:
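The RIDB download URL no longer exists, so a placeholder stands in for it:

  $ curl -s 'RIDB_XML_URL' > RIDB.xml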

This database contains information compiled from various agencies and is riddled with inconsistencies.  Let’s use our tools to examine just one.  Because grep and sed work one line at a time, we will need to ‘pretty print’ our RIDB.xml file to put each XML element on its own line.  This is where xmllint comes in:
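For example:

  $ xmllint --format RIDB.xml > RIDB_formatted.xml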

Let’s take a few seconds to process this 25 MByte file to see which organizations have contributed data:
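The element name below is a guess at the RIDB schema; substitute whatever tag actually holds the organization name in your copy of the file:

  $ grep '<OrgName>' RIDB_formatted.xml | cut -d'>' -f2 | cut -d'<' -f1 | sort -u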

Now let’s take a closer look at the phone numbers to see if they are consistently formatted.  We’ll start by exporting them to their own file:
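Again, the tag name is a guess; the sorted_phone.txt file name comes from the post:

  $ grep '<Phone>' RIDB_formatted.xml | cut -d'>' -f2 | cut -d'<' -f1 | sort > sorted_phone.txt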

Reviewing the sorted_phone.txt file we see the various formats we might expect:

  • (###) ###-####
  • (###) ### ####
  • ###-###-####
  • ###.###.####

as well as some serious outliers.  Most of the offenders can be found by checking for letters:
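For example:

  $ grep '[A-Za-z]' sorted_phone.txt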

The last few entries are problematic.  Storing text descriptions and escaped HTML markup in a field reserved for telephone numbers is not really considered best practice.  The folks in charge of RIDB should do a little more data validation before merging the various sources that contribute to their database.

So there you have it.  You can do data reformatting, remote data access and XML investigation with only a handful of Unix tools and essentially no programming.  Hone these skills and you will become wonderfully efficient at working with the datasets that exist out there ‘in the wild’.

Happy Exploring!

