Selling cookie info to third-parties is a classic example of you can make money without doing evil.

WWW::Mechanize::Plugin::Web::Scraper

2008/06/11 filed under /perl

Joffie asked me whether Web::Scraper could handle authentication while retrieving the website in question. A good question, and after digging into Tatsuhiko's code I noticed that you can simply pass HTML to the scrape function, instead of just the documented URI object.

I remembered Tatsuhiko mentioning integration with WWW::Mechanize somewhere, but I couldn't find anything yet. So I decided to write a little Mechanize plugin. Shockingly, and completely surprisingly, it now carries the name WWW::Mechanize::Plugin::Web::Scraper.
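To illustrate the HTML trick, here is a minimal sketch that hands a string of HTML straight to scrape. The markup and the 'headline' key are made up for this example; in practice the string could come from something like WWW::Mechanize's content method after logging in. This is not the plugin's own interface, just the underlying idea.

```perl
use strict;
use warnings;
use Web::Scraper;

# Made-up HTML standing in for a page fetched behind a login
my $html = '<html><body><h1 class="headline">Members only</h1></body></html>';

# 'headline' is an arbitrary key chosen for this sketch
my $scraper = scraper {
    process 'h1.headline', headline => 'TEXT';
};

# Pass the HTML string itself instead of a URI object
my $result = $scraper->scrape($html);
print $result->{headline}, "\n";
```

Because no URI is involved, Web::Scraper never touches the network here; fetching (and authenticating) is left entirely to whatever produced the HTML.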

Scrape the planet!

Posted by: B10m | permanent link | comments (1)

Perl code highlighters

2008/04/28 filed under /perl

Perl can be a real mess, yes. Everyone knows it and a few try to disagree, but in the end you can make Perl code look very cryptic. So maybe this post isn't really fair. Nevertheless, I'd like to point out an annoyance I have noticed for some time now.

All over the web, there are websites that let you paste some code and highlight it according to the chosen language. While this usually works fine, it often fails on Perl's $# syntax. Written as $#array, it gives the last index of the array @array. As you might guess, most highlighters see the hash and think: comment!

Let's use this code:

#!/usr/bin/perl

my @test = qw(Just another Perl Hacker);
print "Last index of the test list is:", $#test, "\n";
print "Oh, of course ... ", join(" ", @test), "\n";

This is fairly easy code to follow, even for a non-Perl programmer, I believe, so it's up to you to figure out what it does ;-)
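For readers who would rather not guess, here is a small sketch of how $# relates to the element count, including the form used on an array reference. The variable names are only for illustration.

```perl
use strict;
use warnings;

my @test = qw(Just another Perl Hacker);

my $last_index = $#test;          # index of the last element: 3
my $count      = scalar @test;    # number of elements: 4

# The same syntax works through an array reference
my $aref         = \@test;
my $last_via_ref = $#{$aref};     # also 3

print "last index: ", $last_index, ", count: ", $count, "\n";
print "last element: ", $test[$last_index], "\n";
```

Note that every # in the code above that follows a $ is part of a variable name, not the start of a comment, which is exactly what trips up the highlighters.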

Now, let's see how 10 random sites handle this:

Wrong (see the hash as a commenting prefix):

Correct:

Sad but true ...

Posted by: B10m | permanent link | comments (0)

CGI's UPLOAD_HOOK

2008/02/02 filed under /perl

Many a time, I see people asking about and messing with CGI uploads and progress bars.

First of all, I believe an upload progress bar is the responsibility of the browser (client) and not of the server. The client knows the file size it is uploading and how many bytes it sent over the wire. Regardless, progress bars are fairly nice, especially with large(r) files. So let's see how we can implement one.

Perl is very well suited to showing the upload progress (I believe it's trickier with PHP), thanks to the UPLOAD_HOOK facility of CGI.

The documentation isn't too extensive, so let's just look at an example. First of all, you need to understand what has to be done. After someone hits the upload button, we need to query the server over and over to get the upload status. This is where JavaScript kicks in.

To display the bar, I simply use an existing script, for a) it looks better than anything I'd ever create and b) it works :)

Bram.us's jsProgressBarHandler is the one I chose for this example.

OK, now take a look at the hook subroutine. First you have to create an instance of the CGI object and pass it a reference to the hook, like this:

my $q = CGI->new(\&hook);

The hook subroutine isn't too fancy either. I use File::Slurp to write the percentage to a file that we can query later.

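sub hook {
   # CGI.pm calls this for every chunk of the upload it reads
   my ($filename, $buffer, $bytes_read, $data) = @_;
   # Percentage of the request body received so far, as an integer
   my $perc = sprintf("%i", (($bytes_read / $ENV{CONTENT_LENGTH}) * 100));
   # write_file replaces the file's contents on every call
   write_file("/tmp/$ENV{REMOTE_ADDR}", $perc);
}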

That's all it takes.
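The arithmetic inside the hook can be tried on its own. The sketch below pulls it into a helper named upload_percentage, which is a hypothetical name and not part of CGI.pm. It also assumes you want to guard against a missing or zero CONTENT_LENGTH and to cap the value at 100, neither of which the hook above does.

```perl
use strict;
use warnings;

# Hypothetical helper: integer percentage of the request body read so far
sub upload_percentage {
    my ($bytes_read, $content_length) = @_;

    # Avoid dividing by zero when the length is missing or empty
    return 0 unless $content_length && $content_length > 0;

    my $perc = int( ( $bytes_read / $content_length ) * 100 );

    # Never report more than 100
    return $perc > 100 ? 100 : $perc;
}

print upload_percentage( 512,  2048 ), "\n";   # prints 25
print upload_percentage( 2048, 2048 ), "\n";   # prints 100
print upload_percentage( 10,   0 ),    "\n";   # prints 0
```

Keep in mind that CONTENT_LENGTH covers the whole request body, including the multipart boundaries and any other form fields, so the figure is an approximation of the file's own progress.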

Now, on the frontend, we simply query this file over and over, like this:

   function doUpload() {
      $('progress').show();
      var intervalID = window.setInterval(doProgress, 1000);
   }

   function doProgress() {
      var d = new Date;
      new Ajax.Request('progress.cgi?time='+d.getMinutes()+
                       '_'+d.getSeconds(), {
         method:'get',
         onSuccess: function(transport){
            myJsProgressBarHandler.setPercentage('progress', 
                                                 transport.responseText);
         }
      });
   }

The function doUpload shows the progress bar and calls doProgress every second. Since Internet Explorer seems to think that caching the Ajax.Request is a smart thing to do, I simply pass the minutes and seconds to the script as well, so each request URL differs from the previous one.

And progress.cgi isn't so fancy either:

#!/usr/bin/perl

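use strict;
use warnings;
use File::Slurp;

print "Content-type: text/plain\n\n";

# Report 0 until the upload hook has written its first percentage
my $file = "/tmp/$ENV{REMOTE_ADDR}";
my $perc = -e $file ? read_file($file) : 0;
print $perc;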

This works rather well on my machine(s) and it's really simple, as you can see. The only downside is that when two people sharing the same IP address start uploading at the same time, they'll probably get the wrong information. But hey, who cares? ;-)
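If you do care, one possible way around the shared-IP problem is to key the status file on an ID generated per upload instead of on REMOTE_ADDR. The sketch below is only an assumption about how that could look: status_file_for and the upload ID are hypothetical, and both the upload form and the polling request would have to send the same ID along (for example in the query string).

```perl
use strict;
use warnings;

# Hypothetical helper: map a client-supplied upload ID to a status file.
# Only letters and digits are accepted, so the ID cannot escape /tmp.
sub status_file_for {
    my ($upload_id) = @_;
    return unless defined $upload_id
        && $upload_id =~ /\A[A-Za-z0-9]{1,32}\z/;
    return "/tmp/upload-$upload_id";
}

my $ok_path  = status_file_for('a1b2c3');
my $bad_path = status_file_for('../etc/passwd');

print defined $ok_path  ? "$ok_path\n"  : "rejected\n";   # prints /tmp/upload-a1b2c3
print defined $bad_path ? "$bad_path\n" : "rejected\n";   # prints rejected
```

Validating the ID matters here because, unlike REMOTE_ADDR, it comes straight from the client and ends up in a file path.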

Posted by: B10m | permanent link | comments (4)

HTML::BBCode 2.0 released

2007/11/19 filed under /perl

HTML::BBCode, the module I wish I had never written, was plagued by XSS exploits (yeah, I didn't test enough), so I decided to run the HTML it generates through the awesome module HTML::StripScripts.

Due to these changes, some methods are no longer supported (see the POD, if you care enough) and that made me bump the version up to 2.0! Woohoo! The module that initially started out with a few sprintf's now really looks like a module. Hopefully I can now ignore this module for a while and no bugs will be spotted ;-)

Posted by: B10m | permanent link | comments (0)

Scraping Yahoo! Search with Web::Scraper

2007/09/02 filed under /perl

Scraping websites is usually pretty boring and annoying, but for some reason it always comes back. Tatsuhiko Miyagawa comes to the rescue! His Web::Scraper makes scraping the web easy and fast.

Since the documentation is scarce (there is the POD, and there are the slides of a presentation I missed), I'll use this blog entry to show how to effectively scrape Yahoo! Search.

First we'll define what we want to see. We're going to run a query for 'Perl'. From that page, we want to fetch the following things:

  • title (the linked text)
  • url (the actual link)
  • description (the text beneath the link)

So let's start our first little script:

use Data::Dumper;
use URI;
use Web::Scraper;

my $yahoo = scraper {
   process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
   process "div.yschabstr", 'description' => "TEXT";

   result 'description', 'title', 'url';
};

print Dumper $yahoo->scrape(URI->new("http://search.yahoo.com/search?p=Perl"));

Now what happens here? The important stuff can be found in the process statements. Basically, you may translate those lines to: "Fetch an A-element with the CSS class named 'yschttl', put its text in 'title' and its href value in 'url'. Then fetch the text of the div with the class named 'yschabstr' and put that in 'description'."

The result looks something like this:

$VAR1 = {
          'url' => 'http://www.perl.com/',
          'title' => 'Perl.com',
          'description' => 'Central resource for Perl developers. It contains
 the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited 
by Clay Irving.'
        };

Fun and a good start, but hey, do we really get only one result for a query on 'Perl'? No way! We need a loop!

The slides tell you to append '[]' to the key, to enable looping. The process lines then look like this:

   process "a.yschttl", 'title[]' => 'TEXT', 'url[]' => '@href';
   process "div.yschabstr", 'description[]' => "TEXT";

And when we run it now, the result looks like this:

$VAR1 = {
          'url' => [
                     'http://www.perl.com/',
                     'http://www.perl.org/',
                     'http://www.perl.com/download.csp',
                   ...
                   ],
          'title' => [
                       'Perl.com',
                       'Perl Mongers',
                       'Getting Perl',
                     ...
                     ],
          'description' => [
                             'Central resource for Perl developers. It contains 
the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by
 Clay Irving.',
                             'Nonprofit organization, established to support the
 Perl community.',
                             'Instructions on downloading a Perl interpreter for
 your computer platform. ... On CPAN, you will find Perl source in the /src 
directory. ...',
                           ...
                           ]
        };

That looks a lot better! We now get all the search results and could loop through the different arrays to match the right title with the right url. But we still shouldn't be satisfied, for we don't want three arrays, we want one array of hashes! For that we need a little trickery: another process line. Everything we grab already sits in one big ordered list (the OL-element), so let's find that one first, and for each list element (LI) find our title, url and description. For this we won't use a CSS selector but an XPath selector (heck, we can do both, so why not?).
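For comparison, looping through the three arrays by hand could look roughly like the sketch below. The sample data is shortened from the output above, and the approach assumes that every search result yields all three fields; if a single result lacks a description, the arrays drift out of step, which is one more reason to let the scraper do the grouping.

```perl
use strict;
use warnings;

# Sample data in the shape of the three-array result shown earlier
my $scraped = {
    title       => [ 'Perl.com', 'Perl Mongers' ],
    url         => [ 'http://www.perl.com/', 'http://www.perl.org/' ],
    description => [
        'Central resource for Perl developers.',
        'Nonprofit organization, established to support the Perl community.',
    ],
};

# Walk the indexes and build one hash per search result
my @results;
for my $i ( 0 .. $#{ $scraped->{title} } ) {
    push @results, {
        title       => $scraped->{title}[$i],
        url         => $scraped->{url}[$i],
        description => $scraped->{description}[$i],
    };
}

print "$_->{title}: $_->{url}\n" for @results;
```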

To grab an XPath I really suggest Firebug, a Firefox add-on. With its easy point-and-click interface, you can grab the path within seconds.

use Data::Dumper;
use URI;
use Web::Scraper;

my $yahoo = scraper {
   process "/html/body/div[5]/div/div/div[2]/ol/li", 'results[]' => scraper {
      process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
      process "div.yschabstr", 'description' => "TEXT";

      result 'description', 'title', 'url';
   };
   result 'results';
};

print Dumper $yahoo->scrape( URI->new("http://search.yahoo.com/search?p=Perl") );

You see that we switched our title, url and description fields back to the old notation (without []), for we don't want to loop over those fields. We've moved the looping one level up, to the li-elements. For each of them we run another scraper, which dumps its hash into the results array (note the '[]' in 'results[]').
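Since live search pages change, the nesting is easier to experiment with offline. The following sketch runs the same idea against a made-up HTML string that only imitates the structure described above. It uses the CSS selector 'ol li' instead of the long XPath, which is a substitution for this example and not what the script above does.

```perl
use strict;
use warnings;
use Web::Scraper;

# Made-up markup that imitates the structure described in the post
my $html = <<'HTML';
<html><body><ol>
  <li><a class="yschttl" href="http://www.perl.com/">Perl.com</a>
      <div class="yschabstr">Central resource for Perl developers.</div></li>
  <li><a class="yschttl" href="http://www.perl.org/">Perl Mongers</a>
      <div class="yschabstr">Nonprofit organization.</div></li>
</ol></body></html>
HTML

my $scraper = scraper {
    # One nested scraper run per list item; '[]' collects them in an array
    process 'ol li', 'results[]' => scraper {
        process 'a.yschttl',     title => 'TEXT', url => '@href';
        process 'div.yschabstr', description => 'TEXT';
    };
    result 'results';
};

my $results = $scraper->scrape($html);
print scalar(@$results), " results, first title: $results->[0]{title}\n";
```

Each element of the returned array is a hash with title, url and description, so the fields of one search result always stay together.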

The result is exactly what we wanted:

$VAR1 = [
          {
            'url' => 'http://www.perl.com/',
            'title' => 'Perl.com',
            'description' => 'Central resource for Perl developers. It 
contains the Perl Language, edited by Tom Christiansen, and the Perl Reference, 
edited by Clay Irving.'
          },
          {
            'url' => 'http://www.perl.org/',
            'title' => 'Perl Mongers',
            'description' => 'Nonprofit organization, established to support 
the Perl community.'
          },
          {
            'url' => 'http://www.perl.com/download.csp',
            'title' => 'Getting Perl',
            'description' => 'Instructions on downloading a Perl interpreter 
for your computer platform. ... On CPAN, you will find Perl source in the /src 
directory. ...'
          },
...
        ];

Again Tatsuhiko impresses me with a Perl module. Well done! Very well done!


Update: Tatsuhiko had some wise words on this article:

A couple of things:

You might just skip result() stuff if you're returning the entire hash, which is the default. (The API is stolen from Ruby's one that needs result() for some reason, but my perl port doesn't require) Now with less code :)

The use of nested scraper in your example seems pretty good, but using hash reference could be also useful, like:

my $yahoo = scraper {
   process "a.yschttl", 'results[]', {
      title => 'TEXT', url => '@href',
   };
};

This way you'll get title and url from TEXT and @href from a.yschttl, which would be handier if you don't need the description. TIMTOWTDI :)

Posted by: B10m | permanent link | comments (2)