Thursday, February 2, 2012

CPAN is Your Friend

The past couple of weeks have offered some lessons in using Perl to manipulate HTML and talk HTTP.
  1. Don't try to tidy up weirdly tagged HTML with regular expressions: HTML::TreeBuilder is your friend. (But remember the implications if you use $node->replace_content.)
  2. Don't try to replace HTML entities (&aacute; and so on) with regular expressions: HTML::Entities is your friend.
  3. Remember that Encode::encode('utf8', $string) does not give you a Perl string of Unicode characters; it gives you a string of UTF-8 octets, which after all is what HTTP::Message wants. You do not want Encode::decode here.
  4. If you have multiple parameters of the same name to send to a URL with POST, call HTTP::Request::Common::POST with "Content => [p => 1, p => 2]". Do not call it with "Content => {p => [1,2]}", for in a form-data request this subroutine regards an array reference as specifying a file to upload. You will scratch your head wondering why your script is trying and failing to open a file named "1", and it will take you a moment to check back with the documentation and figure it out.
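Points 2 through 4 can be sketched in a few lines. This is only an illustration, not the script I actually wrote: the sample string, the parameter name p, and the example.com URL are placeholders, and it assumes HTML::Entities and HTTP::Request::Common are installed from CPAN.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTML::Entities qw(decode_entities);
use HTTP::Request::Common qw(POST);

# Point 2: let HTML::Entities turn entity references into characters.
my $text = decode_entities('caf&eacute;');      # 4 characters

# Point 3: encode() returns octets, not characters.
my $octets = encode('UTF-8', $text);            # 5 octets: the e-acute takes two

printf "%d characters, %d octets\n", length($text), length($octets);

# Point 4: repeated parameter names go in an array reference of
# name/value pairs. The URL is a placeholder; nothing is sent here,
# POST only builds the HTTP::Request object.
my $request = POST 'http://example.com/form',
    Content => [ p => 1, p => 2 ];

print $request->content, "\n";                  # p=1&p=2
```

Handing the request to LWP::UserAgent's request method would actually send it; building it first and printing the content is a cheap way to check what will go over the wire.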
I can go weeks or months without needing to look at CPAN, but what a wonderful resource it is! A couple of hours getting acquainted with HTML::TreeBuilder made it possible to write a script that saved hours of tedious work.
