Selling cookie info to third parties is a classic example of how you can make money without doing evil.

The face of anonymity

2007/09/29 filed under /web

The web is becoming one big social spot where we can define our friends (read: complete strangers) over and over again. Luckily for those who are terrible at remembering names, most sites let you upload a picture or avatar.

But what happens when you don't want to upload a picture and want to remain semi-anonymous? You'll get the default image! They all look very similar (well, most of them), yet all slightly different.

I've scanned a few sites and looked for their way of giving the anonymous user a face. Here are 15 examples (all scaled down to 48×48 pixels).

43things.com digg.com facebook.com flickr.com friendster.com
gofish.com jumpcut.com last.fm myspace.com newsvine.com
technorati.com vox.com Y! Answers Y! Movies youtube.com

Can't we get one global symbol for Mr./Mrs. Anonymous?

Posted by: B10m | permanent link | comments (0)

reCAPTCHA

2007/09/26 filed under /web

Sometimes, little Perl modules on CPAN can bring you to nice websites. In this case the website of reCAPTCHA.

The website opens bravely with the tagline STOP SPAM. READ BOOKS. In this day and age of text message (SMS) language where everything has to be as short as possible, reCAPTCHA scores fairly well with their motto. But if you do take the time to look a little further than that, you see the great concept behind the website.

What is a CAPTCHA? reCAPTCHA defines it as:

A CAPTCHA is a program that can generate and grade tests that humans can pass but current computer programs cannot. For example, humans can read distorted text [...], but current computer programs can't.

The term CAPTCHA (for Completely Automated Turing Test To Tell Computers and Humans Apart) was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas Hopper and John Langford of Carnegie Mellon University. At the time, they developed the first CAPTCHA to be used by Yahoo.

In the past, I have voiced my disagreement with the whole CAPTCHA movement on this blog, for I still believe CAPTCHAs are horribly annoying. But since they are everywhere now, why not use them for a good cause? reCAPTCHA did just that!

reCAPTCHA shows you an image that came out of an OCR process: a word the OCR software could not identify. By typing in that word, the user helps to digitize a book. That, in short, is what they do. Please do read their learn more page and see why this is a really awesome concept!

Posted by: B10m | permanent link | comments (1)

Big Brother Awards 2007

2007/09/23 filed under /news
Big Brother Award

Every year, the Big Brother Awards are given to persons, companies and governmental organizations that blatantly violate, ignore or disregard privacy. Of course, the name of the award is taken directly out of Orwell's 1984 (as is the image of the award itself, I assume ;-)

I, as one of the last Mohicans who value privacy over terrorism-FUD "safety", was pleased to see the result in the category "Persons". This year, the award went to "the Dutch citizen". The jury felt that Dutch citizens were the biggest threat to their own privacy, through indifference and the "I've got nothing to hide" point of view.

Wholeheartedly I applaud this award, for I have claimed for years that no one cares about privacy anymore. Only a few people see that PGP/GPG-encrypted mail is useful regardless of having something to hide. People dump their entire lives on facebook, myspace or any of the other completely useless sites, and they just don't seem to care about (or even know about) data retention proposals and laws. A lot of people don't care about mandatory identification laws either, and the list goes on and on.

I accept the award on behalf of my uninterested countrymen. Hopefully it does make the news (besides the geeky RSS feeds ;-)

Posted by: B10m | permanent link | comments (2)

Tour de Telegraaf

2007/09/15 filed under /personal
Telegraaf Building

Recently I discovered an XSS hole on the Telegraaf website and got invited for a tour of their building ("de Telegraaf" is one of the major newspapers in the Netherlands; founded in 1893). Unfortunately the presses weren't printing any papers when I was there, so I'll have to come back for that part of the tour some day.

I did, however, get to see the data center of de Telegraaf. Not sure what to expect, I went over there and was warmly welcomed. I got my official "thank you" for pointing out the hole (which got patched rather fast!). After a brief chat, I was allowed into the many halls packed with servers, backup tape robots and all the goodies with blinking LEDs. I was surprised by the sheer number of servers and network connections (mostly fiber, of course).

All in all I had a great time walking through the data center and chatting with the technicians, and I have to conclude that the IT department of de Telegraaf took my discovery very well. The tour and the friendliness have made me rethink my opinion of the newspaper. I still don't think it's a good newspaper, but at least they're a nice bunch of folks! ;-)

Posted by: B10m | permanent link | comments (4)

Scraping Yahoo! Search with Web::Scraper

2007/09/02 filed under /perl

Scraping websites is usually pretty boring and annoying, but for some reason it always comes back. Tatsuhiko Miyagawa comes to the rescue! His Web::Scraper makes scraping the web easy and fast.

Since the documentation is scarce (there is the POD, plus the slides of a presentation I missed), this entry shows how to scrape Yahoo! Search effectively.

First we'll define what we want to see. We're going to run a query for 'Perl'. From that page, we want to fetch the following things:

  • title (the linked text)
  • url (the actual link)
  • description (the text beneath the link)

So let's start our first little script:

use strict;
use warnings;
use Data::Dumper;
use URI;
use Web::Scraper;

my $yahoo = scraper {
   # Link text and href of the result link
   process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
   # Text of the abstract beneath the link
   process "div.yschabstr", 'description' => "TEXT";

   result 'description', 'title', 'url';
};

print Dumper $yahoo->scrape(URI->new("http://search.yahoo.com/search?p=Perl"));

Now what happens here? The important stuff can be found in the process statements. Basically, you may translate those lines to: "Fetch an A element with the CSS class 'yschttl', put its text in 'title' and its href value in 'url'. Then fetch the text of the div with the class 'yschabstr' and put that in 'description'."

The result looks something like this:

$VAR1 = {
          'url' => 'http://www.perl.com/',
          'title' => 'Perl.com',
          'description' => 'Central resource for Perl developers. It contains the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by Clay Irving.'
        };

Fun and a good start, but hey, do we really get only one result for a query on 'Perl'? No way! We need a loop!

The slides tell you to append '[]' to the key, to enable looping. The process lines then look like this:

   process "a.yschttl", 'title[]' => 'TEXT', 'url[]' => '@href';
   process "div.yschabstr", 'description[]' => "TEXT";

And when we run it now, the result looks like this:

$VAR1 = {
          'url' => [
                     'http://www.perl.com/',
                     'http://www.perl.org/',
                     'http://www.perl.com/download.csp',
                   ...
                   ],
          'title' => [
                       'Perl.com',
                       'Perl Mongers',
                       'Getting Perl',
                     ...
                     ],
          'description' => [
                             'Central resource for Perl developers. It contains the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by Clay Irving.',
                             'Nonprofit organization, established to support the Perl community.',
                             'Instructions on downloading a Perl interpreter for your computer platform. ... On CPAN, you will find Perl source in the /src directory. ...',
                           ...
                           ]
        };

That looks a lot better! We now get all the search results and could loop through the different arrays to match the right title with the right url. But we still shouldn't be satisfied: we don't want three arrays, we want one array of hashes! For that we need a little trickery: another process line. Everything we grab already sits in one big ordered list (the OL element), so let's find that first and then, for each list item (LI), find our title, url and description. For this outer step we won't use a CSS selector but an XPath selector (heck, we can do both, so why not?).
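For comparison, the manual route of looping through the three arrays and pairing entries by index might look like the sketch below. The sample data is hypothetical (a shortened copy of the output above), and the code assumes all three arrays have the same length and order.

```perl
use strict;
use warnings;

# Hypothetical sample data, shaped like the three-array result above
my $scraped = {
    url   => [ 'http://www.perl.com/', 'http://www.perl.org/' ],
    title => [ 'Perl.com', 'Perl Mongers' ],
    description => [
        'Central resource for Perl developers.',
        'Nonprofit organization, established to support the Perl community.',
    ],
};

# Build one hash per index; assumes the arrays line up one to one
my @results = map {
    my $i = $_;
    +{ map { ($_ => $scraped->{$_}[$i]) } qw(title url description) };
} 0 .. $#{ $scraped->{title} };

printf "%s <%s>\n", $_->{title}, $_->{url} for @results;
```

This only holds up while every hit has all three fields. As soon as one result lacks a description, the arrays drift out of step, which is one more reason to let the scraper do the grouping.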

To grab an XPath I really suggest Firebug, a Firefox add-on. With its easy point-and-click interface, you can grab the path within seconds.

use strict;
use warnings;
use Data::Dumper;
use URI;
use Web::Scraper;

my $yahoo = scraper {
   # Outer step: one nested scraper run per LI in the results list
   process "/html/body/div[5]/div/div/div[2]/ol/li", 'results[]' => scraper {
      process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
      process "div.yschabstr", 'description' => "TEXT";

      result 'description', 'title', 'url';
   };
   # Return just the array of hashes
   result 'results';
};

print Dumper $yahoo->scrape( URI->new("http://search.yahoo.com/search?p=Perl") );

You see that we switched our title, url and description fields back to the old notation (without []), for we don't want to loop over those fields anymore. We've moved the looping one level up, to the li elements. For each of them we run another scraper, which puts its hash into the results array (note the '[]' in 'results[]').

The result is exactly what we wanted:

$VAR1 = [
          {
            'url' => 'http://www.perl.com/',
            'title' => 'Perl.com',
            'description' => 'Central resource for Perl developers. It contains the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by Clay Irving.'
          },
          {
            'url' => 'http://www.perl.org/',
            'title' => 'Perl Mongers',
            'description' => 'Nonprofit organization, established to support the Perl community.'
          },
          {
            'url' => 'http://www.perl.com/download.csp',
            'title' => 'Getting Perl',
            'description' => 'Instructions on downloading a Perl interpreter for your computer platform. ... On CPAN, you will find Perl source in the /src directory. ...'
          },
...
        ];
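If you'd rather experiment without hitting Yahoo! on every run, the same nested pattern can be tried on a local snippet. Treat the following as a sketch under assumptions: the HTML is made up (only the yschttl and yschabstr class names come from the script above), the "//ol/li" XPath fits this sample page only, and it relies on scrape() accepting a reference to an HTML string instead of a URI, as described in the Web::Scraper POD.

```perl
use strict;
use warnings;
use Data::Dumper;
use Web::Scraper;

# Hypothetical stand-in for a results page
my $html = <<'HTML';
<html><body>
<ol>
  <li><a class="yschttl" href="http://www.perl.com/">Perl.com</a>
      <div class="yschabstr">Central resource for Perl developers.</div></li>
  <li><a class="yschttl" href="http://www.perl.org/">Perl Mongers</a>
      <div class="yschabstr">Nonprofit organization, established to support the Perl community.</div></li>
</ol>
</body></html>
HTML

my $offline = scraper {
    # Short XPath chosen for this sample only; one nested run per LI
    process "//ol/li", 'results[]' => scraper {
        process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
        process "div.yschabstr", 'description' => 'TEXT';

        result 'description', 'title', 'url';
    };
    result 'results';
};

# Pass a reference to the HTML string rather than a URI object
my $results = $offline->scrape(\$html);
print Dumper $results;
```

A short, relative XPath like this is also less brittle than the absolute path Firebug hands you, since it does not break when a wrapper div is added higher up in the page.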

Again Tatsuhiko impresses me with a Perl module. Well done! Very well done!


Update: Tatsuhiko had some wise words on this article:

A couple of things:

You might just skip result() stuff if you're returning the entire hash, which is the default. (The API is stolen from Ruby's one that needs result() for some reason, but my perl port doesn't require) Now with less code :)

The use of nested scraper in your example seems pretty good, but using hash reference could be also useful, like:

my $yahoo = scraper {
   process "a.yschttl", 'results[]', {
      title => 'TEXT', url => '@href',
   };
};

This way you'll get title and url from TEXT and @href from a.yschttl, which would be handier if you don't need the description. TIMTOWTDI :)
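To see what comes back when result() is left out, here is a small sketch of the hash reference form run against a local snippet. The two-link HTML is hypothetical (only the yschttl class name comes from the article), and passing a reference to an HTML string to scrape() is assumed to work as described in the Web::Scraper POD.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical two-link sample page
my $sample = <<'HTML';
<html><body>
<a class="yschttl" href="http://www.perl.com/">Perl.com</a>
<a class="yschttl" href="http://www.perl.org/">Perl Mongers</a>
</body></html>
HTML

my $links = scraper {
    # Each matching A element becomes one { title, url } hash
    process "a.yschttl", 'results[]' => { title => 'TEXT', url => '@href' };
};

# No result() call, so scrape() returns the whole hash
my $data = $links->scrape(\$sample);
for my $hit (@{ $data->{results} }) {
    print "$hit->{title}: $hit->{url}\n";
}
```

Note the difference in shape: without result 'results' the array sits under the results key of the returned hash, instead of being the return value itself.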

Posted by: B10m | permanent link | comments (2)