Thursday, June 27, 2019

There Must Be a Better Way

Raymond Hettinger, a core contributor to the Python language and a frequent speaker at Python conferences, has a routine he uses in talks: he will bring his fist down on the podium, and the audience is to exclaim, "There must be a better way!" I thought of that last week, with Python only part of the context.

Suppose that you, or your employers, take an interest in the data collected by the U.S. Office of Idle Inquiries, or its private-sector cousin, General Information, Inc. It would be good to get hold of quantities of this data, summarize it, total it, analyze it. But you don't have it. What then?

The data held by the government office may be somewhere on data.gov if you know how to find it. Otherwise, it may well be available under the Freedom of Information Act (FOIA). In theory, you can "FOIA it": put in a FOIA request, and eventually get back a DVD or a thumb drive with the data you want. In practice, you may get nothing but explanations about why FOIA does not apply; or a mass of metadata with no data; or data in a format that nobody still in the workforce has ever seen. And it may take a while. General Information, Inc., as a private company, is certainly not putting its information out on data.gov and is not subject to the Freedom of Information Act.

But the data that you want is published in some fashion or another on the web. You can go to https://oii.gov or https://gii.com, use a search box, and get the data a record or a dozen at a time. What will you do?

Well, you might put together a script or two to retrieve and parse the data. If the site is amenable to the treatment, you can use Python's urllib.request module to retrieve the data a page at a time. If the site is heavy with JavaScript and wants you to click on a "Next" button for every score of records, you can use the Selenium module to drive a browser. And you can extend html.parser.HTMLParser to take apart what you have retrieved and reshape it. This works well enough.

On the other hand, you may run across a government website that has a few thousand records of much interest. To get what you want, you will have to use Selenium to drive a browser to retrieve the records ten at a time. Each set of ten records you will have to parse, and each of the records will require another retrieval and parsing. About then, you will say "There must be a better way!" As you do so, though, you probably will not pound your fist. More likely you will sigh.

The work is not especially difficult. A task such as I speak of might require a couple hundred lines of Python--emphatically not counting what's in the libraries, only what I'd have to write. The weariness comes from the reflection that
  • A team of programmers has defined a database to hold structured data.
  • Many persons, whether as their sole duty or as an aspect of it, have populated that database with information about the agency's findings.
  • Another team of programmers has taken pains to write programs that will display the data in a format humans can conveniently read when they go to the website.
  • And here I come to pick the presentation format apart into data to load into a database.
  • Can't somebody just send me the database?
But until someone tells me when I'll get the databases, I'll keep on writing scripts.

(For those who use Python and have not heard of Raymond Hettinger, I strongly recommend going to YouTube and watching his presentation "Beyond PEP 8 -- Practices for beautiful intelligible code". I will not say that my code is beautiful, but I will say that it is less ugly than it was before I saw the presentation.)

Wednesday, June 19, 2019

UMUC

The Monday Washington Post has an article on the University of Maryland University College (UMUC), the continuing-education arm of the University System of Maryland. The article takes up about half a page. It appears to me that the burden of the story is that
  • The administration wishes to increase revenue, which comes primarily from tuition.
  • The administration supposes that the students are looking primarily for credentials, to be gained as quickly as possible.
  • The administration infers that students prefer online courses.
  • The administration considers that students who cannot get the credentials as quickly and conveniently at UMUC will enroll with competitors instead.
  • The faculty thinks that it would be good to get paid better to teach.
  • The faculty thinks that it would be good to see the occasional student face to face, and for more weeks per class than eight.
I think it is well if students can learn quickly and conveniently, and I know it is the case that UMUC's students are mostly employed full time and otherwise busy.  But the faculty may be the better judge of the pace of learning than either students or administration. And I think that courses taught in a classroom are preferable to online courses, even those taught by an instructor online as the class proceeds.

About thirty years ago I took classes at UMUC, for I wished to learn about computers, about programming mostly. I think it was only the last class I took there that was supposed to have a distance-learning component, for the benefit of students at other sites; but it didn't work well and may have been dropped. The classes I took were all at the campus in College Park. I did not in fact acquire a credential--I think I could have with one more class. But I did learn a certain amount about programming and generally about computers.

A few years later I taught one course there several times. These were not online classes. I discovered, as most adjuncts must, that the payment for the first class or two, reckoned against the hours one spends, amounts to something less than minimum wage.  I enjoyed the teaching even so.

The only persons I ever met, not heads of departments, that I knew to make or have made a living as UMUC faculty were a couple of accountants, who had previously taught for the school in Europe at US military bases. The instructors for my classes were working engineers or programmers, with the occasional graduate student. Generally they were good to very good. Most must have made far more at a day job than they did teaching. I sympathize with the adjuncts there now; but as far as I know the reliance on adjuncts is not new at UMUC.

The last time I thought about UMUC before the Post's story was some months ago. Out of curiosity to see where ESL students might advance their education, I looked up the tuition and fees at some local community colleges. I tried to look up the tuition for classes at UMUC also and could not find the information--the school site offers a "Time and Tuition Estimator" for the cost of a degree program, but does not make it easy to find the price per credit hour. Thirty years ago, UMUC was not so coy.

Sunday, June 16, 2019

Father's Day Books

I have come to think of a certain range of books as "Father's Day books", namely volumes one can safely get Dad if he doesn't golf or need new ties. I complain of them only because now and then I must read one. Otherwise, I think they serve the public by keeping publishers and printers solvent.

Such books tend to involve history, generally American history. Military history, the history of exploration, or both (Lewis and Clark) serve well. There are a number of authors who have made something of an industry of these books. The reader who has received or read some will recognize the style. At worst it combines the didactic and the sloppy, giving one lessons to be learned with misstated facts to illustrate them. At the not quite worst it reads like a junior high school history pageant, where the greats come on stage, say a piece, bow, and make way for the next. At almost best it tends to bury the reader in details.

Now, the matter of the books largely overlaps with many books I think well of and re-read. I admire Henry Adams's history of the US during Jefferson's and Madison's administrations, which heaven knows has plenty of battles and some explorations. I have Parkman's histories on my shelves, and some of Samuel Eliot Morison's. On the shelves is Elkins and McKittrick's history of the Federalist era. There are memoirs of war service by Grant and Sherman, and by some who never achieved a commission.

Why do I find Adams, Morison, and to a lesser degree Parkman fascinating, and some of their would-be successors tiresome? I think that it must come down to perspective. The historians I admire master the details, but in service to a larger scope: the US coming into possession of what it had possessed on paper; the European discovery of America; France and England contending for North America. If a small-unit engagement is described in detail, it will be at Fort Defiance or Fallen Timbers--it will matter in some way. Above all, the masters know what to omit: when they quote, they quote for a purpose.

Anyway, Happy Father's Day to any father who may read this. If your offspring give you one of these books, consider the possibility that you may have failed to let them know your preferences clearly enough. Remember that it's the thought that counts.



Thursday, June 13, 2019

No Thirty-Seconds

Partly because I work with computers, the number thirty-two has often been on my mind. With thirty-two bits one can address about four billion locations in memory, or express a positive signed integer as large as about two billion. Going on forty-five or fifty years ago, the computer industry discovered that it needed thirty-two bits in an address. Tracy Kidder's book The Soul of a New Machine tells how Data General made its 32-bit computer. (Now and for some years, sixty-four bits has been standard, but quite recently I've heard from techies who couldn't process a file of size larger than two gigabytes.)
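The round figures above can be checked in a line or two of JavaScript. This is only a sketch of the arithmetic; the variable names are mine, and the link between the two-gigabyte file limit and a signed 32-bit size or offset is a presumption about what those techies ran into, not something they told me.

```javascript
// Number of distinct addresses expressible in 32 bits: 2^32.
var addressable = Math.pow(2, 32);      // 4294967296, about four billion

// Largest positive value of a signed 32-bit integer: 2^31 - 1.
var maxSigned = Math.pow(2, 31) - 1;    // 2147483647, about two billion

// Two gigabytes, counted in bytes, is exactly one more than maxSigned.
var twoGigabytes = 2 * Math.pow(1024, 3);

// JavaScript's bitwise operators coerce to signed 32-bit integers,
// so one past the maximum wraps around to the most negative value.
var wrapped = (maxSigned + 1) | 0;

console.log(addressable, maxSigned, twoGigabytes, wrapped);
```

A program that keeps a file size or offset in a signed 32-bit integer would presumably run out of positive values at just that two-gigabyte mark.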

And the compass has thirty-two points. In "boxing the compass", one recites them in order clockwise from North to North by West. Any high school geometry student should be able to make a compass rose with the points using only compass and straight edge.

The other day, though, I discovered a property of thirty-two that hadn't occurred to me: it is the smallest positive integer that cannot be a day of a month. I was looking to validate some input to a script we use, and discovered that JavaScript will happily parse '2/31/2019' as a date, but not '2/32/2019'. The former becomes March 3, 2019; the latter is not a valid date. The script was to run under the Windows Script Host (cscript.exe), but I found that this works the same in a browser console. In what follows, the output is in italic:
for (var i = 29; i < 33; i++) {
    var dateString = '2/' + i + '/2019';
    var d = new Date(dateString);
    console.log(dateString + ' -> ' + d);
}
2/29/2019 -> Fri Mar 01 2019 00:00:00 GMT-0500 (Eastern Standard Time)
2/30/2019 -> Sat Mar 02 2019 00:00:00 GMT-0500 (Eastern Standard Time)
2/31/2019 -> Sun Mar 03 2019 00:00:00 GMT-0500 (Eastern Standard Time)
2/32/2019 -> Invalid Date
I'm not sure what to say about this. Obviously, JavaScript has "thirty days hath September" worked out, and it is not saving time or space by skipping leap year calculations:
for (var i = 29; i < 33; i++) {
    var dateString = '2/' + i + '/2020';
    var d = new Date(dateString);
    console.log(dateString + ' -> ' + d);
}
2/29/2020 -> Sat Feb 29 2020 00:00:00 GMT-0500 (Eastern Standard Time)
2/30/2020 -> Sun Mar 01 2020 00:00:00 GMT-0500 (Eastern Standard Time)
2/31/2020 -> Mon Mar 02 2020 00:00:00 GMT-0500 (Eastern Standard Time)
2/32/2020 -> Invalid Date
I know how to work around this without much effort. I'm just a little surprised that I should have to.
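One possible workaround, offered as a sketch and not necessarily what went into our script: take the string apart, build the date from its numeric parts, and reject it if the parts do not survive the round trip. The function name parseStrictUSDate is my own invention for illustration, and it assumes input of the form M/D/YYYY with a four-digit year.

```javascript
// Return a Date for a valid M/D/YYYY string, or null otherwise.
// parseStrictUSDate is a hypothetical helper, not a built-in.
function parseStrictUSDate(dateString) {
    var match = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(dateString);
    if (!match) {
        return null;
    }
    var month = Number(match[1]);
    var day = Number(match[2]);
    var year = Number(match[3]);

    // The Date constructor takes a zero-based month and quietly
    // rolls out-of-range days over into the following month.
    var d = new Date(year, month - 1, day);

    // If anything rolled over, the parts no longer match the input.
    if (d.getFullYear() !== year ||
        d.getMonth() !== month - 1 ||
        d.getDate() !== day) {
        return null;
    }
    return d;
}

var samples = ['2/28/2019', '2/29/2019', '2/31/2019', '2/32/2019', '2/29/2020'];
for (var i = 0; i < samples.length; i++) {
    console.log(samples[i] + ' -> ' + (parseStrictUSDate(samples[i]) ? 'valid' : 'invalid'));
}
```

The sketch sticks to var and plain functions so that it should run under cscript.exe as well as in a browser console, though the console.log calls would need to become WScript.Echo there.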

[Edit: changed "two million" to "two billion"]

Friday, June 7, 2019

Information Wanted

Working as I do near Lafayette Square, I see demonstrations from time to time. Some I see marching to the square with banners up, some I see only when they are gathered in the park or on Pennsylvania Avenue. It is not always convenient to walk across the park to see what they are demonstrating for or against. Neither the National Park Service nor the Metropolitan Police Department makes it easy to find out who has permits for such demonstrations. I wish that someone would.

Demonstrations are not tied to a time of year, but graduations occur about the beginning of June. Many of them take place at Constitution Hall, at 17th and D NW, and sometimes at lunch time or after work I will see graduates in or carrying gowns and mortarboards, and wonder from what school. One day last week there was green garb in the morning and red in the afternoon. The Daughters of the American Revolution quite reasonably do not include private events, including graduations, on the Constitution Hall calendar.

Now, one can go to the websites of local school systems and find some of the graduations. For example, last Thursday morning's graduation in green was apparently Bethesda-Chevy Chase High School. Yesterday's in black was apparently Walt Whitman. So far, so good for Montgomery County. But what of the Virginia schools? Do the Fairfax or Arlington County schools cross the river? Maybe I should make a guide to the graduations. But who else would want it?