Monday, September 10, 2012

How to Run the Charniak-Johnson Re-ranking Parser (BLLIP) as a Server

I have been using the BLLIP parser mainly for parsing biomedical text. The default parser and re-ranker models included in the package were trained on WSJ and are therefore unlikely to work well on biomedical text. However, there are publicly available models trained on biomedical text (namely the Genia corpus) that work pretty well on it, or at least better than the Stanford parser with its default models. I am writing the steps down here so that anybody can use them as a reference.

Download the BLLIP parser and decompress it.
 tar xvzf master   

If you don't have flex installed, install it.
 sudo apt-get install flex  
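
To install flex only when it is missing, a small check can go in front of the install command. This is a minimal sketch; `have_tool` is a hypothetical helper name, not something shipped with BLLIP.

```shell
# have_tool: hypothetical helper, succeeds if the named command is on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Usage (not run here because it needs sudo):
#   have_tool flex || sudo apt-get install flex
```

The same helper can be reused for other build prerequisites before compiling.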

Build the parser and re-ranker.
 cd BLLIP*
 make

Test the parser.
  <s> This is a test . </s>   
  [Ctrl-D to terminate]   

You should see the following output:
 (S1 (S (NP (DT This)) (VP (AUX is) (NP (DT a) (NN test))) (. .)))  

Download the biomedical model and decompress it:
  tar xvzf biopars*   

To test the biomedical model, use the following script. It expects BIOPARSINGMODEL to point to the directory of the decompressed model:
 #! /bin/sh   
  first-stage/PARSE/parseIt -l399 -N50 ${BIOPARSINGMODEL}/parser/ $* | second-stage/programs/features/best-parses -l ${BIOPARSINGMODEL}/reranker/features.gz ${BIOPARSINGMODEL}/reranker/weights.gz   
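
Before running the script, it can help to confirm that the model directory has the layout the command line refers to. The sketch below only checks the three paths used above (`parser/`, `reranker/features.gz`, `reranker/weights.gz`); `check_model_dir` is a hypothetical helper name.

```shell
# check_model_dir: hypothetical helper that verifies the paths the
# parse command above reads from the model directory.
check_model_dir() {
  dir=$1
  for f in "$dir/parser" \
           "$dir/reranker/features.gz" \
           "$dir/reranker/weights.gz"; do
    if [ ! -e "$f" ]; then
      echo "missing: $f" >&2   # report the first missing path
      return 1
    fi
  done
  return 0
}

# Usage:
#   check_model_dir "$BIOPARSINGMODEL" || exit 1
```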

To run the parser as a server, I modified a Perl script that accompanies the Illinois Semantic Role Labeler package; it was originally written to run the Charniak parser as a server. Here is the script:
 $MAXCHAR = 799;  
 $MAXWORD = 400;  
 $BIOPARSINGMODEL = "./biomodel";  
 $command = "first-stage/PARSE/parseIt -K -l399 -N50 $BIOPARSINGMODEL/parser/ | second-stage/programs/features/best-parses -l $BIOPARSINGMODEL/reranker/features.gz $BIOPARSINGMODEL/reranker/weights.gz";  
 #$charniakDir = "$ENV{CHARNIAK}";  
 #$command = "$charniakDir/PARSE/parseIt $charniakDir/DATA/EN/ -K -l$MAXWORD";  
 #$endProtocol = "\n\n\n";  
 $endProtocol = "\n";  
 $TIMEOUT = 60;         # undef if no timeout  
 $PORT = 4449;               # pick something not in use  
 #read port  
 $PORT = $ARGV[0] if (scalar(@ARGV) > 0);  
 use Expect;  
 #create the main program that we will communicate with through a pipe.
 $main = NewExpect($command);  
 sub NewExpect {  
  my $command = shift;  
  my $main;  
  print "[Initializing...]\n";  
  $main = new Expect();  
  $main->raw_pty(1);   # no local echo   
  $main->log_stdout(0); # no echo  
  $main->spawn($command) or die "Cannot start: $command\n";  
  $main->send("<s> This is a test . </s>\n"); #send input to main program  
  @res = $main->expect(undef,$endProtocol); # read output from main program  
  print $res[3];  
  print "[Done initializing.]\n";  
  return $main;
 }
 #server initialization matter  
 use IO::Socket;  
 use Net::hostent;          # for OO version of gethostbyaddr  
 $server = IO::Socket::INET->new( Proto   => 'tcp',  
                  LocalPort => $PORT,  
                  Listen  => SOMAXCONN,  
                  Reuse   => 1);  
 die "Can't setup server\n" unless $server;  
 #end server initialization  
 #set autoflush  
 $old_handle = select(STDOUT);  
 $| = 1;  
 $old_handle = select(STDERR);  
 $| = 1;  
 print "[Server $0 accepting clients]\n";  
 while ($client = $server->accept()) {  
  $main->expect(0); # flush old stuff if any  
  $main->clear_accum(); # clear buffer  
  $clientinfo = gethostbyaddr($client->peeraddr);  
  if (defined($clientinfo)) {  
   $clientname = ($clientinfo->name || $client->peerhost);  
  } else {
   $clientname = $client->peerhost;
  }
  printf "[Connect from %s]\n", $clientname;
  RunClient($client); # parse every line this client sends
  printf "[Connection closed from %s]\n", $clientname;
  close $client;
 }
 sub RunClient {  
  my $client = shift;  
  my $msg;  
  my $output;  
  my @res;  
  my $timeout;  
  my $sent;  
  while ($sent = <$client>) {  
   chomp $sent;  
   $sent =~ s/^\s+//;  
   $sent =~ s/\s+$//;  
   if ($sent =~ /^\s*$/) { # sending blank line will cause the parser to quit  
    $output = "\n\n";  
   } elsif (length($sent) > $MAXCHAR) { # input too long, return blank
    $output = "\n\n";  
   } else {  
    $msg = "<s> $sent </s>\n";  
    print "Parse: $msg";  
    $main->send("$msg"); #send input to main program  
    @res = $main->expect($TIMEOUT,$endProtocol); # read output from main program  
    # @res = ($mp, $er, $ms, $bf, $af);  
    # $mp is the position of the pattern that matched (undef if none)
    # $er is undef or 1:TIMEOUT  
    # $ms is the matched message  
    # $bf is the message before $ms  
    # $af is the message after $ms  
    $timeout = $res[1];  
    $out = $res[3];  
    if ($timeout) { # parser possibly gets stuck, restart it.  
     print "Time out!\n";  
     $output = "\n\n"; # output blank  
     print "Restart parser\n";  
     $main = NewExpect($command);  
    } else {  
     if ($out =~ /^Parse failed/) {  
      print "Parse failed\n";  
      $output = "\n\n";  
      @res = $main->expect($TIMEOUT,$endProtocol); # read off the original sentence  
      $timeout = $res[1];  
      if ($timeout) { # parser possibly gets stuck, restart it.  
       print "Time out when reading off the original sentence!\n";  
       print "Restart parser\n";  
       $main = NewExpect($command);
      }
     } elsif ($out =~ /^error:|^parseIt.*Assertion.*failed/) { # parser dies
      print "Parser died!\n";  
      $output = "\n\n"; # output blank  
      print "Restart parser\n";  
      $main = NewExpect($command);  
     } else {  
      print "Parse ok\n";  
      $output = "$out\n";  
      if ($out =~ /^\s*$/) { $numBlank = 1; }
      else { $numBlank = 0; }
     }
    }
   }
   $output = &fixoutput($sent, $output);
   print $client $output; # send output back to client
   $main->clear_accum(); # clear buffer
  }
 }
 sub fixoutput {  
  my ($input, $output) = @_;  
  my @input;  
  my @output;  
  my ($i, $length, $outlength);  
  @input = split /\s+/, replacesymbol($input);  
  $length = scalar(@input);  
  $outlength = 0;  
  while ($output =~ /[^\)]\)/g) { $outlength++; }  
  if ($outlength == 0) {  
   $output = "(S1 H:0 (X H:0";  
   for ($i = 0; $i < $length; $i++) {
    $output .= " (. H:0 $input[$i])";
   }
   $output .= "))\n\n\n";
  } elsif ($length != $outlength) {
   $output =~ s/\)\s*$//;
   for ($i = $outlength; $i < $length; $i++) {
    $output .= " (. H:0 $input[$i])";
   }
   $output .= ")\n\n\n";
  }
  return $output;
 }
 sub replacesymbol {  
  my $input = shift;  
  $input =~ s/\(/-LRB-/g;  
  $input =~ s/\)/-RRB-/g;  
  $input =~ s/\[/-LSB-/g;  
  $input =~ s/\]/-RSB-/g;  
  $input =~ s/\{/-LCB-/g;  
  $input =~ s/\}/-RCB-/g;  
  return $input;
 }

Run the server in the background:
  nohup perl ./ &  

The parser should now be listening on port 4449 for incoming requests. Each request should consist of a single tokenized line ending with an LF. If you want the parser to tokenize the text itself, remove the '-K' parameter from the $command definition near the top of the script. Each response likewise consists of a single line ending with an LF.
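
Since a request must be exactly one line, text that spans several lines has to be flattened before it is sent. The sketch below collapses whitespace into single spaces and applies the same bracket substitutions as the script's replacesymbol routine. Whether your input needs the bracket escaping is an assumption on my part, and `to_request_line` is a hypothetical helper name.

```shell
# to_request_line: hypothetical helper that turns already tokenized text
# into one request line terminated by a single LF.
to_request_line() {
  { printf '%s' "$1" | tr '\n\t' '  '; echo; } \
    | sed -e 's/  */ /g' -e 's/^ //' -e 's/ $//' \
          -e 's/(/-LRB-/g' -e 's/)/-RRB-/g' \
          -e 's/\[/-LSB-/g' -e 's/]/-RSB-/g' \
          -e 's/{/-LCB-/g' -e 's/}/-RCB-/g'
}

# Usage:
#   to_request_line "$text" | nc localhost 4449
```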

Test the server:
 echo This is a test . | nc localhost 4449  
 (S1 (S (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test)))) (. .)))  
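
To parse a whole file, the same nc call can be wrapped in a loop. This is a minimal sketch that opens one connection per sentence; `bllip_parse`, `bllip_parse_file`, and the `BLLIP_HOST`/`BLLIP_PORT` variables are hypothetical names, and the input file is assumed to hold one tokenized sentence per line.

```shell
# bllip_parse: send one tokenized sentence to the server and print the reply.
# BLLIP_HOST and BLLIP_PORT are hypothetical overrides for the defaults
# used in this post (localhost, 4449).
bllip_parse() {
  printf '%s\n' "$1" | nc "${BLLIP_HOST:-localhost}" "${BLLIP_PORT:-4449}"
}

# bllip_parse_file: parse a file with one tokenized sentence per line.
# Blank lines are skipped, since the server only answers them with blanks.
bllip_parse_file() {
  while IFS= read -r sentence; do
    [ -n "$sentence" ] || continue
    bllip_parse "$sentence"
  done < "$1"
}

# Usage:
#   bllip_parse_file sentences.txt > parses.txt
```

The server script also accepts several lines over one connection, so a single long-lived connection would work too; one connection per sentence just keeps the sketch close to the test command above.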

That's it! The server is now ready to serve!
