Thursday, January 12, 2012


Long ago, I was hired as a copy editor. The first manuscript handed to me discussed unemployment among farm workers; within a few pages it gave remarkably different figures for the number of farm workers and for the number unemployed, and the inconsistencies went on from there. As I remember it, I put together a list of questions about a page and a half long, single-spaced, for a manuscript that was about 20 pages double-spaced. It seemed remarkable to me that the author had not noticed and accounted for the discrepancies.

I have since discovered that data manipulation is engaging work, often well paid, and capable of revealing important and otherwise invisible patterns. Some time ago I read an interview about the early days of relational databases, which I can't now find; the subject (I think Jim Gray) said something like:

We'd talk to the customers and they'd say "The query took four hours to run." We'd say "Well, I guess we have to work on the optimizer," and they'd say "But don't get us wrong. We love it. We're learning things about the business we never knew."

But I have also discovered that much of the time data collection is tedious, ill-paid, and uncertain work. The early triumphs of data warehousing were made possible not just by the engineers who wrote the programs but by armies of operators entering data from grocery register tapes. Those operators at least had seats and shelter; canvassers going door to door are out in the weather and may have to write with a clipboard or a wall for a surface. The best typists make mistakes and the most conscientious canvassers now and then print unreadably.*

And I find that in many cases the analyst's confidence in the data is in inverse proportion to what he knows of the circumstances of its collection. One number looks just like another on a computer monitor or a piece of paper, and who is to say that the five-digit number is not made of identical units? The person, perhaps, who saw the canvass sheets, or listened to samples of the interviews, or spot-checked the data entry. But that person is not usually in the meeting where analyses are presented without distracting footnotes.

I believe it was Denis Healey who wrote in his memoirs of an incident from his WW II service, when he was convalescing from an operation and assigned to a railway station. His duty was to keep track of the numbers of all soldiers boarding or leaving trains at this station. It was a large station, and he was on crutches, so an actual count was not possible. He made his best estimates and checked them with the civilian station master, who was himself estimating. Healey wrote that when he became a cabinet minister in the 1960s and 1970s, the memory served as a caution against blind faith in the numbers he was given.

* These days, of course, with so many cash registers computerized, and so many purchases by card, the clerks and the customers do the data entry. And businesses love cell phones and smart phones--what better than having the public provide a lot of data on its comings, goings, and loiterings?
