Sunday, October 2, 2011

All public government data should be easily machine readable

The Bureau of Labor Statistics (BLS) has an annual budget of over $640 million (FY 2011), a budget  they use to create and then distribute detailed labor market data and analysis to policy makers, researchers, journalists and the general public. I can't speak to the "creation" part of their mission, but on the "distribution" part, the are failing---organizations with tiny fractions of their resources do a far better job.

It's not the case that government IT is invariably bad---the Federal Reserve Bank of St. Louis has an amazing interface (FRED) and API for working with their data. Unfortunately, not all government statistics are available here, especially some of the more interesting BLS series.

The essential problem with BLS is that all of their work products---reports, tables etc.---are designed to be printed out, not accessed electronically. Many BLS tables are embedded in PDFs, which makes the data they contain essentially impossible to extract; non-PDF, text-based tables, which are better, are difficult to parse electronically: structure is conveyed by tabs and white space, column headings are split over multiple lines with no separators; heading lengths vary etc.

Why does it matter? For one, when users can access data electronically, via an API,  they can combine it with other sources, look for patterns, test hypotheses, find bugs / measurement errors, create visualization and do all sorts of other things that make the data more useful.

BLS does offer a GUI tool for downloading data, but it's kludgy, requires a Java Applet, requires series to be hand-selected and then returns an Excel(!) spreadsheet w/ extraneous headers and formatting. Furthermore, it's not clear what series and what transformations are needed from GUI-data to make the more refined, aggregated tables.

To illustrate how hard it is to get the data out, I wrote a python script to extract the results this table (which shows the expected and estimated changes in employment for a number of industries). What I wanted to do was make this, which I think is far easier to understand than the table alone:

To actually create this figure, I needed to get data into in R by way of a CSV file.  The code required to get table data into a useful CSV file, while not rocket science, isn't trivial---there's lots of one-off/hacky things to work around the limitations of the table. Getting the nested structure of the industries e.g., ("Durable Goods" is a subset of "Manufacturing" and "Durable Goods" has 4 sub-classifications) required recursion (see the "bread_crumb" function). FWIW, here's the code:

Most of the code is dealing with the problems shows in this sketch:

My suggestion: BLS should borrow someone from FRED and help them create a proper API.


  1. Maybe they will migrate their data to soon.

  2. I did exactly this type of thing in a (very) old PHP system I wrote for dealing with census and home-building data. I remember doing a manual POST request against a URL I found by viewing the source of a form in an HTML page, stripping out the text data wrapped in HTML header and footer sections in the response, then parsing the text data into structures I could actually use.

    Here's a thought: if there are enough of us writing our own parsers for this stuff, would it be worth it to create our own API for it that others could use? Obviously would be better if the gov did it for us, but they seem uninterested in being helpful in this regard.