The last time around, I set up some scripts to de-duplicate contact lists. The machine I ran them on was shut down and removed years ago, so over the last week I had a look to see what was involved in rewriting them. It wasn't difficult to write something that actually winnowed the junk. On the other hand, the sheer number of addresses was amazing.
How is it that the list has the email addresses of forty persons at the Ford Foundation? I have never, as far as I know, exchanged an email with any of them. Why a handful of persons at this or that university with which I have never had dealings? I suspect that I have the slightly misspelled email address of one of our department heads because somebody's fingers remembered "i before e" when it was not applicable. I know why I have the email addresses of persons who left our organization in 2010 or 2011, but do I really want to cull them one at a time?
Rather than grapple with these questions, I've set up a simple web page where the help desk techs may, if they wish, submit a comma-separated values (CSV) file and get back three files:
- The better, meaning that every record is no worse than any other for the same email address, where "better" has to do with whether the name fields look like names, or like something split automatically from an email address.
- The worse, meaning that every record is no better than at least one other for the same email address, using the definition of "better" given above.
- The bad, meaning that there is no email address, or that it is in a less useful format such as an LDAP path.
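The three-way split can be sketched in a few lines of Python. This is a minimal sketch rather than the script behind the web page: the column names `first`, `last`, and `email`, the scoring rule in `name_score`, and the address pattern `EMAIL_RE` are all assumptions made for illustration. It also assumes that when two records for the same address tie, the one seen first is kept as the better and the other goes to the worse.

```python
import re

# Assumed pattern: something@something.something, with no slashes or equals
# signs, so an LDAP-style path such as /o=Org/ou=Site/cn=jdoe does not match.
EMAIL_RE = re.compile(r"^[^@\s/=]+@[^@\s/=]+\.[^@\s/=]+$")


def name_score(row):
    """Count name fields that look typed by a person (assumed heuristic).

    A field counts if it is non-empty, starts with a capital letter, and is
    not a verbatim piece of the address's local part.
    """
    local = (row.get("email") or "").strip().split("@", 1)[0]
    pieces = set(re.split(r"[._\-]+", local))
    score = 0
    for field in ("first", "last"):  # hypothetical column names
        value = (row.get(field) or "").strip()
        if value and value[0].isupper() and value not in pieces:
            score += 1
    return score


def sort_records(rows):
    """Split rows into (better, worse, bad) lists."""
    better, worse, bad = {}, [], []
    for row in rows:
        email = (row.get("email") or "").strip()
        if not EMAIL_RE.match(email):
            bad.append(row)  # no address, or an unusable format
            continue
        key = email.lower()  # treat addresses as case-insensitive
        current = better.get(key)
        if current is None:
            better[key] = row
        elif name_score(row) > name_score(current):
            worse.append(current)  # the newcomer displaces the old best
            better[key] = row
        else:
            worse.append(row)  # ties keep the record seen first
    return list(better.values()), worse, bad
```

The rows would typically come from `csv.DictReader`, and each of the three lists can be written back out with `csv.DictWriter`. Keeping only one record per address in the better file is a design choice of this sketch; it is what makes the better file usable as the de-duplicated list.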