Liberating data from web sites

The great thing about putting data files up on a public web site is that data consumers can download the data with a point and a click.  Unfortunately, when lots of individual data files are involved, the consumer of data must point and click … and then point and click again … then point and click … and point and click … and point … and … and … and …

This post describes how 11 lines of python code can take the human eyes and fingers out of the equation and speed things up a bit.

USGS Data Series 140

We are currently working on a project that involves using data from one of our favorite government agencies — the United States Geological Survey (USGS).  The data of interest, Data Series 140, demonstrate the value of having a government agency do the hard work of collecting and managing data over more than a century.  To understand the vast amount of work that has gone into compiling this dataset we only need to look at the description:

This USGS digital database is an online compilation of historical U.S. statistics on mineral and material commodities. The database contains information on approximately 90 mineral commodities, including production, imports, exports, and stocks; reported and apparent consumption; and unit value (the real and nominal price in U.S. dollars of a metric ton of apparent consumption). For many of the commodities, data are reported as far back as 1900. Each commodity file includes a document that describes of the units of measure, defines terms, and lists USGS contacts for additional information.

End-use tables complement these statistics by supplying, for most of these commodities, information about the distribution of apparent consumption.

This publication draws on more than 125 years of minerals information experience. At the request of the 47th Congress of the United States (1882; 22 Stat. 329), the U.S. Government began the collection and public distribution of these types of data. The Federal agencies responsible for the collection of the data have changed through time. For the years 1882-1924, the USGS collected and published these data; the U.S. Bureau of Mines (USBM) performed these tasks from 1925-95; and in 1996, the responsibilities once again passed to the USGS (following the closure of the USBM) (Mlynarski, 1998).

The USGS collects data on a monthly, quarterly, semiannual, and annual basis from more than 18,000 minerals-related producer and consumer establishments that cooperate with the USGS. These companies voluntarily complete about 40,000 canvass forms that survey production, consumption, recycling, stocks, shipments, and other essential information. Data are also gathered from site visits, memberships on domestic and international minerals-related committees, and coordination with other government organizations and trade associations.

The data are made available as both pdf files and Microsoft Excel files at the following location:

We applaud the USGS for getting many things right on this page:

  1. The dataset is versioned.
  2. A thorough description is given of the methodology used in gathering the source data.
  3. Data are made available in human readable (pdf) and machine readable (xls) formats.
  4. The column headers (i.e. ‘variables’)  found in individual data files are (mostly) consistent.
  5. The data are presented in consistent units throughout.
  6. Human contacts responsible for each data file are provided.
  7. The data are intelligently organized by mineral, each in its own file.

We really have no complaints.  For our project all we need to do is download the data, do some additional processing to catch a few small inconsistencies and then harmonize the individual data files so they can be looked at as a whole.

Well, actually we have one complaint: all the pointing and clicking needed to download 163 separate Excel files.  That is where BeautifulSoup comes in.


The python module BeautifulSoup epitomizes the best of the python open-source world as its designer understands what it takes to get things done in the real world and has made that work sinfully easy.  We’ll highlight the features we love from the description:

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

  1. Beautiful Soup won’t choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
  2. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don’t have to create a custom parser for each application.
  3. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings, unless the document doesn’t specify an encoding and Beautiful Soup can’t autodetect one. Then you just have to specify the original encoding.

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it “Find all the links”, or “Find all the links of class externalLink“, or “Find all the links whose urls match “”, or “Find the table heading that’s got bold text, then give me that text.”

Valuable data that was once locked up in poorly-designed websites is now within your reach. Projects that would have taken hours take only minutes with Beautiful Soup.

Assuming that you are working in a Unix/Linux environment (Mac OSX included) and have already installed EasyInstall, installation of BeautifulSoup is quite easy.  On Ubuntu just type:

sudo easy_install BeautifulSoup

Now we are ready to work with this module.

11 lines of python

Given the simple nature of our goal: download all the Excel files linked to from a web page, we would expect the python code required to achieve this to be reasonably succinct … and it is.  There are 11 lines of actual code in all and this short script reads like pseudocode.  That is truly the joy of programming in python.

#!/usr/bin/env python
# Python script to download all Excel spreadsheets that make up the USGS dataset:
#   "Historical Statistics for Mineral Commodoties in the United States, 
#    Data Series 2005-140"

import urllib
from BeautifulSoup import BeautifulSoup

location = ""
page = urllib.urlopen(location)
soup = BeautifulSoup(page)

# Find each <a href="...">XLS</a> and download the file pointed to by href="..."
for link in soup.findAll('a'):
    if link.string == 'XLS':
        filename = link.get('href')
        print("Retrieving " + filename)
        url = location + filename

Go ahead and try it out.  Just install BeautifulSoup, download this file and run it.  The combination of BeautifulSoup for parsing web pages and the suite of functionality found in the urllib module will make downloading and extracting data from web pages a much less tedious job than in the past.

In today’s world of data management increasing amounts of public data are being placed on line in wildly varying formats.  We should expect the need to work in this fashion to increase substantially before any sort of data dissemination standards make our lives easier. Even when standards are present we should never forget the data manager’s favorite joke:

The best thing about standards is … there are so many to choose from.

Best wishes for easy access to public data!

This entry was posted in Data Management and tagged , , , , . Bookmark the permalink.

Comments are closed.