Suppose that you, or your employers, take an interest in the data collected by the U.S. Office of Idle Inquiries, or its private-sector cousin, General Information, Inc. It would be good to get hold of quantities of this data, summarize it, total it, analyze it. But you don't have it. What then?
The data held by the government office may be somewhere on data.gov if you know how to find it. Otherwise, it may well be available under the Freedom of Information Act (FOIA). In theory, you can "FOIA it": put in a FOIA request, and eventually get back a DVD or a thumb drive with the data you want. In practice, you may get nothing but explanations about why FOIA does not apply; or a mass of metadata with no data; or data in a format that nobody still in the workforce has ever seen. And it may take a while. General Information, Inc., as a private company, is certainly not putting its information out on data.gov and not subject to the Freedom of Information Act.
But the data that you want is published in some fashion or another on the web. You can go to https://oii.gov or https/gii.com, use a search box, and get the data a record or a dozen at a time. What will you do?
Well, you might put together a script or two to retrieve and parse the data. If the site is amenable to the treatment, you can use Python's urllib.request module to retrieve the data a page at a time. If the site is heavy with JavaScript and wants you to click on a "Next" button for every score of records, you can use the Selenium module to drive a browser. And you can extend html.parser.HTMLParser to take apart what you have retrieved and reshape it. This works well enough.
On the other hand, you may run across a government website that has a few thousand records of much interest. To get what you want, you will have to use Selenium to drive a browser to retrieve the records ten at a time. Each set of ten records you will have to parse, and each of the records will require another retrieval and parsing. About then, you will say "There must be a better way!" As you do so, though, you probably will not pound your fist. More likely you will sigh.
The work is not especially difficult. A task such as I speak of might require a couple hundred lines of Python--emphatically not counting what's in the libraries, only what I'd have to write. The weariness comes from the reflection that
- A team of programmers has defined a database to hold structured data.
- Many persons, whether as their sole duty or as an aspect of it have populated that database with information about the agency's findings.
- Another team of programmers has taken pains to write programs that will display the data in a format humans can conveniently read when they go to the website.
- And here I come to pick the presentation format apart into data to load into a database.
- Can't somebody just send me the database?
(For those who use Python and have not heard of Raymond Hettinger, I strongly recommend going to YouTube and watching his presentation "Beyond PEP 8 -- Practices for beautiful intelligible code". I will not say that my code is beautiful, but I will say that it is less ugly than it was before I saw the presentation.)