UNB/ CS/ David Bremner/ comments/ blog/ posts/ counting-symbols
From: "Peter Pöschl" <address deleted>
Subject: Re: Counting symbols from a debian symbols file
Date: Mon, 19 Oct 2009 10:37:58 +0200
Hi,

> # check the rest of the command line arguments for matches against symbols. 
> Omega(n^2), sigh.
> foreach my $file (@ARGV){
>  foreach my $sym (keys %count){
>    my $string=read_file($file);
>    if ($string =~ m/\b$sym\b/){
>      $count{$sym}++;
>    }
>  }
>}

>    my $string=read_file($file);
Why do you do this again and again for every symbol?
This should be moved in front of 'foreach my $sym'.

For searching multiple symbols at once I would use this:

  # construct regex which can match every key in %count
  my $rx1_sym = join '|', reverse sort keys %count;
  $rx1_sym = qr{\b($rx1_sym)\b}msx;

  foreach my $file (@ARGV){
    my $string=read_file($file);
    while ( $string =~ m{$rx1_sym}g ) {
      ++ $count{$1};
    }
  }

Best regards,

  Peter Poeschl


Date: Mon, 19 Oct 2009 08:19:57 -0300
From: "David Bremner" <address deleted>
Subject: Re: Counting symbols from a debian symbols file
Peter Poeschl wrote:

>>    my $string=read_file($file);
>Why do you do this again and again for every symbol?
>This should be moved in front of 'foreach my $sym'.

Thanks, that was really silly, and the script is now much faster.

>  # construct regex which can match every key in %count
>  my $rx1_sym = join '|', reverse sort keys %count;
>  $rx1_sym = qr{\b($rx1_sym)\b}msx;
>  foreach my $file (@ARGV){
>    my $string=read_file($file);
>    while ( $string =~ m{$rx1_sym}g ) {
>      ++ $count{$1};
>    }
>  }

This seems to count the total number of times a symbol occurs, rather
than the total number of files it occurs in. The latter is what I am
interested in.  In fact, it probably suffices to fine list of those
symbols that occur only once.  I think your code could be modified to
use a hash storing where the symbol was first found, and maybe an
auxilary hash marking those found more than once.  Whether it is
worthwhile depends on how enormous the symbol files in question are I
guess. In my unscientific experiments, after modifying the script to
only read each file once (which sped things up by a factor of 3 or
so), the other modification yields something like a 3% or 4%
improvement.

Thanks again for your suggestions.

David