[liberationtech] Open Up Government Data

Yosem Companys companys at
Sun Mar 8 22:42:21 PDT 2009
Open Up Government Data From Wired How-To Wiki

Barack Obama rode into office with a high-tech, open source campaign that
digitized the book on campaigning.

Now, with his selection of a celebrated open data advocate as his Chief
Information Officer, Obama appears serious about bringing those same
principles to the executive branch's treasure trove of data.

Vivek Kundra, the new CIO, comes to the White House from a similar role as
the CTO of Washington, D.C., where he garnered kudos for his clear-headed
approach to making data feeds from dozens of city agencies accessible.

"I'm going to be working very closely with all Federal CIOs in terms of at
the agency level to make sure they are advancing an agenda that embraces
open government, an agenda that looks at how we could fundamentally
revolutionize technology in the public sector," Kundra said.

If you're a fan of free data flow into and out of the government, Vivek
Kundra seems like an ally. But we can't rest on our laurels. Now is exactly
the time when lobbying for particular data and documents to be made
accessible could be most effective. <> is coming: Let's help build it.

   - 1 The Problem
   - 2 The Solution: You
   - 3 How to Get Involved
   - 4 Government Datasets: The Good, the Bad and the Ugly
      - 4.1 Action Item: Require data to be available in at least one public
      or open-source format.
      - 4.2 U.S. Department of Agriculture
         - 4.2.1 Economic Research Service
            - Action Item: Turn current spreadsheet data on crops,
            livestock, etc., into XML feeds to enable easier and more
            dynamic reuse.
      - 4.3 Food and Drug Administration
         - 4.3.1<>
         - 4.3.2 FDA website (Drugs at FDA)
            - Action Items
      - 4.4 U.S. Geological Survey
      - 4.5 National Aeronautics and Space Administration
         - 4.5.1 Technical Reports Server
            - Action Item: OCR all technical reports server documents.
            - Action Item: Put all publicly available documents online.
            - Action Item: Attempt to identify data buried in documents.
   - 5 Models for Government Data Release, Transparency
      - 5.1 Washington, D.C. Data
      - 5.2 Human Genome Project:<>
      - 5.3 NOAA's Climate Database Modernization
      - 5.4 Earth Science Data and Information System Project at Goddard:
      - 5.5 Hubble Space Telescope
      - 5.6 TriMet and BART
   - 6 Government-Wide Changes
      - 6.1 Create Data Catalog of Every Agency's Data Streams
      - 6.2 Make Data Release the Rule, Rather Than the Exception
      - 6.3 View Data Release From the User's Point of View, not the Agency's
      - 6.4 Reward Making Datasets Publicly Available Through Grantmaking
      Bodies like
      - 6.5 Fund Data Reanalysis Projects
         - 6.5.1 Action Item: Crowdsource the problem by encouraging high
         school and college level science classes to add government
         data reanalysis to their curriculum
      - 6.6 Prioritize Open Government Investments and the Right to Know
   - 7 Models for Opening and Using Government Data
      - 7.1<>
      - 7.2<>
      - 7.3 Sunlight Labs' Apps for America
      - 7.4 GTFS Data Exchange

 The Problem

More than 100 government agencies collect
statistics and data. Though some agencies have done a great job of getting
their data and documents online, the accessibility and usability of
government data overall can be improved.

The Federal government is a big beast with lots of agencies collecting
and publishing data, some of it stretching back decades. Much of the data
about the workings of our country is stranded in PDFs, Excel spreadsheets
and other less-than-ideal formats. The net effect is that scientists,
lawmakers, journalists and citizens can't access key decision-making
information.

While green tech, the banking crisis and war dominate headlines, data
quietly underpins decision making in all of those areas.

The numbers — about how much corn we grow, what the universe looks like from
Hubble, how much coal we have, and how well drugs work — are the results
from the grand experiment of this country. We'll only know how to proceed,
making refinements to our politics, policies and science, if we know what's
happening in the world around us.

Vivek Kundra, Barack Obama's new appointee, is well aware of the problem and
wants to spark a revolution in the way government deals with data. But he's
going to need your help to steer the (big) government boat.
 The Solution: You

We've established this wiki to help focus attention on valuable data
resources that need to be made more accessible or usable. Do you know of a
legacy dataset in danger of being lost? How about a set of Excel (or —
shudder — Lotus 1-2-3) spreadsheets that would work better in another
format? Data locked up in PDFs?

*This is your place to report where government data is locked up by design,
neglect or misapplication of technology.* We want you to point out the
government data that you need or would like to have. Get involved!

Based on what you contribute here, we'll follow up with government agencies
to see what their plans are for that data — and track the results of the
emerging era of

With your help, we can combine the best of new social media and old-school
journalism to get more of the data we've already paid for in our hands.
 How to Get Involved

Just jump in and edit the wiki. Add links to data that's out of date or in
danger of being forgotten or that comes stored in a less-than-ideal format.
Help define how gets built by making sure that the data you need is

We're not writing a policy paper here. We're trying to highlight datasets
and sources of knowledge that the new Administration, and its
open-data-friendly CIO, could make more widely available and accessible with small,
concrete actions.

If you're not comfortable with the MediaWiki formatting language, feel free
to get in touch with staff writer, Alexis Madrigal, either by
e-mail alexis.madrigal[at] or on Twitter:

 Government Datasets: The Good, the Bad and the Ugly

Our government generates tons of data. It's a data-making machine. There are
three types of data that the government tends to make: information about
internal government functioning, statistics like those provided by the U.S.
Department of Agriculture and scientific data generated by the nation's
scientific agencies.

Legends like Carl Malamud and The Sunlight Foundation have focused most of
their efforts on exposing details about how the government works. Scientists
tend to push for the release of scientific data, and they're doing an
increasingly good job of making datasets available.

Those statistics, though, don't receive as much attention, even though they
tend to reflect most directly on making economic and political decisions.

We can judge government efforts on two criteria. First, how accessible is
the data? Is it a) online, b) as raw as possible, c) "feedable," and d)
fully downloadable? <> hosts an
expanded list of 8 Principles of Government
Data that seem reasonable and

Second, how easy is it to use? Is the data well-described? Are there
summaries available for less technical users? (Or, alternatively, is the
data structured so outside people are likely to build applications with it,
exposing more people to the data?)

Using these criteria, help us find out which valuable datasets we can
campaign to make better.
 Action Item: Require data to be available in at least one public or
open-source format.

In addition to providing data in closed, proprietary, or semi-proprietary
formats such as Excel, Word, or PDF, require that at least one open-source
or public format be given. This allows government data to be offered in
convenient forms for use in commercial applications and also protects the
data from being antiquated if those commercial applications cease to exist.

One example of a "dataset" that has generally been "locked up" in PDF is the
set of mission, vision, values, goals, and objectives statements documented
in agency strategic plans, which are required by the Government Performance
and Results Act (GPRA). The authoritative sources of such data should be XML
documents conforming to a voluntary consensus standard like Strategy Markup
Language (StratML).
 U.S. Department of Agriculture

 Economic Research Service

The USDA's Economic Research Service is an excellent example of an agency
that does a great job collecting data and making it available online.
However, the usability of its data is limited. For example, statistics on
crop harvests and fertilizer applications are stored in the form of
individual Excel spreadsheets and PDFs. That frustrates researchers like
Pamela Martin, a geochemist at the University of Chicago, who investigates
the nation's food system. She wants to more accurately determine the energy
inputs that go into various types of food so she can evaluate the carbon and
energy intensity of different types of diets.

Economic data is held separately from the chemical application data, so it's
tough to look at how fertilizer and pesticides have impacted yields. It'd
be easy if the USDA provided raw data, but they don't. They cut and massage
it into reports, which means Martin and her team have to try to back out the
aggregated stats into the data they want.

"They package it in a way that's useful for them, but what really would be
useful is if all that data were available in nearly raw form," Martin said.
"I'd want a single clearinghouse for their data."
 Action Item: Turn current spreadsheet data on crops, livestock, etc., into
XML feeds to enable easier and more dynamic reuse.
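
As a minimal sketch of what this action item could look like in practice, the snippet below converts spreadsheet rows exported as CSV into a simple XML document. The element names (`dataset`, `record`, `field`) and the sample rows are assumptions made for illustration; they are not a USDA schema or real USDA figures.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(csv_text: str, root_tag: str = "dataset", row_tag: str = "record") -> str:
    """Convert CSV text (header row plus data rows) into a simple XML string."""
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # Hypothetical convention: one <field name="..."> per spreadsheet
            # column, so column headers never have to be valid XML tag names.
            field = ET.SubElement(record, "field", name=column)
            field.text = value
    return ET.tostring(root, encoding="unicode")


# Invented sample rows, for illustration only.
SAMPLE_CSV = "crop,year,acres_harvested\ncorn,2001,100\nwheat,2001,50\n"

if __name__ == "__main__":
    print(rows_to_xml(SAMPLE_CSV))
```

Storing the column name in an attribute rather than in the tag name is a design choice: spreadsheet headers often contain spaces or symbols that XML tag names do not allow.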

 Food and Drug Administration

is a registry for clinical trials that began in 1999.
Registries allow you to track trials that were planned. That allows you to
see if a trial was published, but it doesn't tell you how the trial turned
out. Starting in late 2008, because of a new law that was passed (FDAAA),
began posting study results. This is great for future
drugs, but it doesn't help us with all the drugs on the market today.
 FDA website (Drugs at FDA)

The FDA is sitting on data on essentially all drugs currently marketed in
the US. They have results not only on the clinical trials that got
published, but they have data on lots of trials that haven't seen the light
of day. (And sometimes the FDA version of the trial results tells a different
story from what the journal articles say.)

Starting in 1997, the FDA began posting their reviews on drugs they've
approved. Unfortunately, this is a hit-or-miss proposition. Sometimes a drug
review is there, and sometimes it's not. The other problem is that, if you
want a review on a drug approved before 1997, you have to file an
old-fashioned "paper" Freedom of Information (FOIA) request. Then you get to
wait and wait, and maybe in a year or two you'll get it. When the FDA
receives the next request for the same drug, they reinvent the wheel all
over again and make that person wait and wait.

Drugs at FDA:
Article about this issue:
 Action Items

• Make the FDA's clinical trial data from before 1997 available online.
• Cover all drugs, not just some of them.

 U.S. Geological Survey

USGS data about the mineral resources in our country is some of the most
respected data in the world. Some of this data is available online through
the Mineral Resources Program. Could it be
improved? What's missing?
 National Aeronautics and Space Administration

 Technical Reports Server

The NASA technical reports server is a treasure trove of reports and
documents on topics from Shuttle launches to wind turbines. There are
175,631 full-text PDFs on the server, but none of them have been OCR'd. Only
the abstracts are searchable.

147,560 other documents have searchable abstracts but are not available
online. They have to be ordered through the Center for AeroSpace Information
— which costs nontrivial amounts of money. This is true even for documents
labeled, "Unclassified, No Copyright, Unlimited, Publicly available."
 Action Item: OCR all technical reports server documents.

This is a simple first step that would make a tremendous amount of knowledge
more freely accessible.
 Action Item: Put all publicly available documents online.

To be publicly available in today's world has to mean that the documents are
available online.
 Action Item: Attempt to identify data buried in documents.

Behind these documents lie vast datasets that are probably stored on
outdated media. From scanned reports, we can identify what the datasets
underlying NASA's reports are.

 Models for Government Data Release, Transparency

 Washington, D.C. Data

The city's Data Catalog is simple: It just provides downloadable data and
web-accessible feeds on all kinds of city information from juvenile arrests
to completed construction projects to roadkill pickups. Building on that
platform, the city even sponsored an Apps for Democracy contest, which saw
independent developers create 47 mashups from DC's data streams.
 Human Genome Project: <>

As Vivek Kundra mentioned in his CIO acceptance press conference, the public
release of the human genome has led to the creation of 500 new drugs that
are now in the FDA approval pipeline. This is widely considered one of the
best examples of data sharing.
 Help us out. *What made this effort so successful?*

 NOAA's Climate Database Modernization

The CDMP has digitized 53 million images and more than 7 terabytes of data,
making it available through a special web-based software interface. They are
saving priceless, often handwritten climate data.
On the other hand, access to the data is restricted to "U.S. government
employees and their contractors, educational institutions doing
environmental research, and other researchers associated with NOAA
projects." *What prevents this data from being opened up to the public?*

 Earth Science Data and Information System Project at Goddard:

The ESDIS program is currently saving crucial datasets from NASA's early
missions. These include data from the weather research NIMBUS
Heat Capacity Mapping
Mission and the
Radiation Budget Experiment.

*What are the best practices for finding high-value historical datasets? How
has the ESDIS program done it?*

  Hubble Space Telescope

Peter Fox, a computer scientist at Rensselaer Polytechnic Institute, noted
that the Hubble Space Telescope archive has been a smashing success. "Six
times the amount of data has been taken out of that archive than has gone
into it," he said. "The data has been used 6X more than you paid for."
 TriMet and BART

While most public transit agencies have resisted giving out even simple
digital equivalents of their paper schedules, Portland, Oregon's TriMet and
the San Francisco Bay Area's BART systems have led the way in
developer-friendliness. Not only do they provide their full timetables and
geographic information in
but both also provide real-time GPS information about where their
buses and trains are at any given moment. Both also maintain dedicated
developer areas on their websites and actively reach out to the transit
developer community.
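
As a rough sketch of what developers can do with published timetables and geographic information, the snippet below loads stop locations from a `stops.txt` file. It assumes the agency publishes its schedule as a GTFS feed, the format named in the GTFS Data Exchange section of this page. The sample rows are invented, and `load_stops` is a hypothetical helper, not part of any agency's API.

```python
import csv
import io
from typing import Dict, Tuple


def load_stops(stops_txt: str) -> Dict[str, Tuple[str, float, float]]:
    """Map each stop_id to (stop_name, latitude, longitude) from GTFS stops.txt text."""
    stops: Dict[str, Tuple[str, float, float]] = {}
    for row in csv.DictReader(io.StringIO(stops_txt)):
        stops[row["stop_id"]] = (
            row["stop_name"],
            float(row["stop_lat"]),
            float(row["stop_lon"]),
        )
    return stops


# Invented sample data, for illustration only.
SAMPLE_STOPS = (
    "stop_id,stop_name,stop_lat,stop_lon\n"
    "S1,Example Stop A,45.5,-122.6\n"
    "S2,Example Stop B,45.6,-122.7\n"
)

if __name__ == "__main__":
    for stop_id, (name, lat, lon) in load_stops(SAMPLE_STOPS).items():
        print(stop_id, name, lat, lon)
```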
 Government-Wide Changes

 Create Data Catalog of Every Agency's Data Streams

We agree with the Sunlight Foundation's Greg Elin that the single most
important thing any government agency could do to make itself more
transparent would be to create a data catalog of all its data streams.

"If there was one thing I could do, it would simply be creating a data
catalog at every agency at every department that has data," said Greg Elin
of the Sunlight Foundation, a group that promotes government transparency.
"Every website has an About Us. Every website has a Frequently Asked
Questions. Every website should have a data catalog." <> is the current attempt to do this,
but it clearly doesn't rise to the level of a true data catalog.

Under the previous administration, the Federal Enterprise Architecture (FEA)
Data Reference Model (DRM) was supposed to have become the governmentwide
data catalog. An XML schema <> (XSD) was
drafted to facilitate the sharing of DRM data (e.g., in readily searchable
data catalogs). However, agencies pushed back against the thought of
rendering their data descriptions in XML format, so the draft XSD was not
finalized and implemented. Under the new administration, the status of the
FEA models, including the DRM, is uncertain.
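
To make the idea of a per-agency data catalog concrete, here is a minimal sketch of a machine-readable catalog. Every field name in `CatalogEntry` is an assumption chosen for illustration; none of it comes from the DRM, the draft XSD, or any official schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    # Illustrative fields only, not an official catalog schema.
    title: str
    agency: str
    description: str
    formats: List[str] = field(default_factory=list)
    update_frequency: str = "unknown"


def catalog_to_json(entries: List[CatalogEntry]) -> str:
    """Serialize catalog entries to JSON, sorted by agency and then title."""
    ordered = sorted(entries, key=lambda e: (e.agency, e.title))
    return json.dumps([asdict(e) for e in ordered], indent=2)


if __name__ == "__main__":
    # Placeholder entries, not real datasets.
    demo = [
        CatalogEntry("Example dataset B", "Agency Z", "Placeholder description"),
        CatalogEntry("Example dataset A", "Agency A", "Placeholder description", ["csv", "xml"]),
    ]
    print(catalog_to_json(demo))
```

A flat list like this is enough for a first catalog page; the point is that each data stream is described once, in a form that other sites can index.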
 Make Data Release the Rule, Rather Than the Exception

Jessy Cowan-Sharp, who works on a team at NASA Ames looking at
data-accessibility issues, had this simple suggestion. "If we changed policy
to have automatic time frames within which data became publicly releasable,
or even had certain categories of data that were by default publicly
releasable, I think we would overcome one of the major hurdles to
accessibility — paperwork."
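
The "automatic time frame" idea can be sketched as a simple rule: data becomes releasable once a fixed hold period has passed, with no paperwork in between. The 365-day default below is a made-up number for illustration, and the function names are hypothetical; no agency policy is being quoted.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical policy value: data is held for one year, then released by default.
DEFAULT_HOLD_DAYS = 365


def release_date(collected_on: date, hold_days: int = DEFAULT_HOLD_DAYS) -> date:
    """Return the first day on which the data is publicly releasable."""
    return collected_on + timedelta(days=hold_days)


def is_releasable(
    collected_on: date,
    today: Optional[date] = None,
    hold_days: int = DEFAULT_HOLD_DAYS,
) -> bool:
    """True once the hold period has elapsed; no per-dataset approval step."""
    if today is None:
        today = date.today()
    return today >= release_date(collected_on, hold_days)


if __name__ == "__main__":
    print(release_date(date(2008, 1, 1), hold_days=30))
```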
 View Data Release From the User's Point of View, not the Agency's

The EPA web manager, Jeffrey Levy, has made a trenchant argument via Twitter
about this issue. "Why should ppl have to know which agency governs nat'l
parks to find info? or fixes potholes? or explains env. issues?"

In short, why should user-citizens have to know which government silo
handles which problem to get answers? creates the possibility that
at least in data, that won't be the case. But one key will be designing the
site to think about data use-cases, not agency needs.
 Reward Making Datasets Publicly Available Through Grantmaking Bodies like

Andy Maffei of the Woods Hole Oceanographic Institution speaks for many
scientists when he notes that when they make their data available, they
aren't rewarded with promotions and recognition. "They don't get much
attribution in terms of how much their data is used by other people," he
said.

*How can we encourage the NSF and other scientific grantmaking bodies to
reward data release?*
 Fund Data Reanalysis Projects

Peter Fox at RPI told us that only between three and ten percent of
scientific data is actually analyzed, which means that virtually no data is
reanalyzed. He argues — and we agree — that mining the data we already have
could be a low-cost, high-value means of scientific investigation.

The problem is, that doesn't sound like cutting-edge scientific research.
*How do we convince funding agencies that data analysis can be valuable
science, even if you do no more measurement?*

 Action Item: Crowdsource the problem by encouraging high school and college
level science classes to add government data reanalysis to their curriculum

It might be good to get students involved with the scientific process by
reanalyzing actual scientific data. It may be possible for the classes to
get into contact with the agency/team/scientist that generated the data, and
the students would get a chance to learn data analysis techniques that are
valuable later in their scientific careers (statistics, mathematical
modeling, etc.). If students or instructors wanted to tackle large data sets,
the students could be exposed to different
programming techniques as well (something that many professional scientists
aren't good at themselves).
 Prioritize Open Government Investments and the Right to Know

Government should not only make its data available for third-party use; it
must also make the information it places online more accessible and relevant
in a timely, personalized manner. The detailed report Moving Toward a 21st
Century Right-to-Know Agenda: Recommendations to President-elect Obama and
Congress and the short
article, Ten practical online steps for government support of
provide useful context.

*How do we ensure that e-government investments couched in terms of one-way
service delivery transactions are required in part to promote government
accountability, transparency, accessibility and engagement with the public?*
 Models for Opening and Using Government Data

Infochimps is dedicated to finding and hosting
free, redistributable datasets. It's a simple but absolutely enormous
mission. So far, they've got thousands waiting for you to use.

<> is the home of Carl
Malamud's many-tentacled government data extraction program. From public
safety codes to California's entire Code of Regulations to hundreds of
federally produced movies, the website provides a tantalizing peek at the
incredible extent of the government's information warehouses.
 Sunlight Labs' Apps for America

Sunlight Labs is an organization dedicated to
"turning government data into useful information." They are currently
hosting an Apps for America contest to design web
services that promote transparency in Congress.
 GTFS Data Exchange

GTFS Data Exchange is a site that
aggregates public transportation schedule data in
Both officially released and hand-entered/scraped schedules are
