I have been using the BLLIP parser mainly to parse biomedical text. The default parser and re-ranker models included in the package were trained on WSJ, so they are unlikely to work well on biomedical text. However, there are publicly available models trained on biomedical text, namely the Genia corpus, which work pretty well on it, or at least better than the Stanford parser with its default models. I am writing the steps down here so that anybody can use them as a reference.
Step1:
Download BLLIP parser and decompress it.
wget https://github.com/BLLIP/bllip-parser/tarball/master
tar xvzf master
Step2:
If you don't have flex installed then install it.
sudo apt-get install flex
Step3:
Build the parser and re-ranker.
cd BLLIP*
make
Step4:
Test the parser.
Run the parser and type a sentence wrapped in <s> ... </s>:
./parse.sh
<s> This is a test . </s>
[Ctrl-D to terminate]
You should see the following output:
(S1 (S (NP (DT This)) (VP (AUX is) (NP (DT a) (NN test))) (. .)))
Step5:
Download the biomedical model and decompress it:
wget http://bllip.cs.brown.edu/download/bioparsingmodel-rel1.tar.gz
tar xvzf biopars*
Step6:
To test the biomedical model use the following script:
#! /bin/sh
BIOPARSINGMODEL=./biomodel
first-stage/PARSE/parseIt -l399 -N50 ${BIOPARSINGMODEL}/parser/ $* | second-stage/programs/features/best-parses -l ${BIOPARSINGMODEL}/reranker/features.gz ${BIOPARSINGMODEL}/reranker/weights.gz
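The script above hands its arguments (`$*`) to parseIt, and Step 4 shows that the parser expects each sentence wrapped in `<s> ... </s>`. As a minimal sketch, assuming your text is already split into one sentence per line, a helper like the following could prepare an input file. The function name `wrap_sentences` and the file names in the comments are hypothetical, as is the name `parse-bio.sh` for the script above.

```shell
#!/bin/sh
# wrap_sentences: read one sentence per line on stdin and print each one
# as "<s> ... </s>". Blank or whitespace-only lines are skipped.
wrap_sentences() {
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            *[![:space:]]*) printf '<s> %s </s>\n' "$line" ;;
        esac
    done
}

# Possible usage (file and script names are placeholders):
#   wrap_sentences < sentences.txt > sentences.sgml
#   ./parse-bio.sh sentences.sgml > sentences.ptb
```

Skipping blank lines keeps empty `<s> </s>` pairs out of the input.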
Step7:
To run the parser as a server, I modified a Perl script that accompanies the Illinois Semantic Role Labeler package. It was originally written to run the Charniak parser as a server. Here is the modified script:
#!/usr/bin/perl
$MAXCHAR = 799;
$MAXWORD = 400;
$BIOPARSINGMODEL = "./biomodel";
$command = "first-stage/PARSE/parseIt -K -l399 -N50 $BIOPARSINGMODEL/parser/ | second-stage/programs/features/best-parses -l $BIOPARSINGMODEL/reranker/features.gz $BIOPARSINGMODEL/reranker/weights.gz";
#$charniakDir = "$ENV{CHARNIAK}";
#$command = "$charniakDir/PARSE/parseIt $charniakDir/DATA/EN/ -K -l$MAXWORD";
#$endProtocol = "\n\n\n";
$endProtocol = "\n";
$TIMEOUT = 60; # undef if no timeout
$PORT = 4449; # pick something not in use
#read port
$PORT = $ARGV[0] if (scalar(@ARGV) > 0);
use Expect;
#create the main program (the parser) that we will communicate with through a pty.
$main = NewExpect($command);
sub NewExpect {
my $command = shift;
my $main;
print "[Initializing...]\n";
$main = new Expect();
$main->raw_pty(1); # no local echo
$main->log_stdout(0); # no echo
$main->spawn($command) or die "Cannot start: $command\n";
$main->send("<s> This is a test . </s>\n"); #send input to main program
@res = $main->expect(undef,$endProtocol); # read output from main program
print $res[3];
print "[Done initializing.]\n";
return $main;
}
#server initialization matter
use IO::Socket;
use Net::hostent; # for OO version of gethostbyaddr
$server = IO::Socket::INET->new( Proto => 'tcp',
LocalPort => $PORT,
Listen => SOMAXCONN,
Reuse => 1);
die "Can't setup server\n" unless $server;
#end server initialization
#set autoflush
$old_handle = select(STDOUT);
$| = 1;
select($old_handle);
$old_handle = select(STDERR);
$| = 1;
select($old_handle);
print "[Server $0 accepting clients]\n";
while ($client = $server->accept()) {
$main->expect(0); # flush old stuff if any
$main->clear_accum(); # clear buffer
$client->autoflush(1);
$clientinfo = gethostbyaddr($client->peeraddr);
if (defined($clientinfo)) {
$clientname = ($clientinfo->name || $client->peerhost);
} else {
$clientname = $client->peerhost;
}
printf "[Connect from %s]\n", $clientname;
&RunClient($client);
shutdown($client,2); # 2 = stop both reading and writing (valid values are 0, 1, 2)
close($client);
printf "[Connection closed from %s]\n", $clientname;
}
$main->hard_close();
sub RunClient {
my $client = shift;
my $msg;
my $output;
my @res;
my $timeout;
my $sent;
while ($sent = <$client>) {
chomp $sent;
$sent =~ s/^\s+//;
$sent =~ s/\s+$//;
if ($sent =~ /^\s*$/) { # sending blank line will cause the parser to quit
$output = "\n\n";
} elsif (length($sent) > $MAXCHAR) { # too long; a bare "length" would test $_ instead of $sent
$output = "\n\n";
} else {
$msg = "<s> $sent </s>\n";
print "Parse: $msg";
$main->send("$msg"); #send input to main program
@res = $main->expect($TIMEOUT,$endProtocol); # read output from main program
# @res = ($mp, $er, $ms, $bf, $af);
# $mp is the position of the pattern that matched
# $er is undef or 1:TIMEOUT
# $ms is the matched message
# $bf is the message before $ms
# $af is the message after $ms
$timeout = $res[1];
$out = $res[3];
if ($timeout) { # parser possibly gets stuck, restart it.
print "Time out!\n";
$output = "\n\n"; # output blank
print "Restart parser\n";
$main->hard_close();
$main = NewExpect($command);
} else {
if ($out =~ /^Parse failed/) {
print "Parse failed\n";
$output = "\n\n";
@res = $main->expect($TIMEOUT,$endProtocol); # read off the original sentence
$timeout = $res[1];
if ($timeout) { # parser possibly gets stuck, restart it.
print "Time out when reading off the original sentence!\n";
print "Restart parser\n";
$main->hard_close();
$main = NewExpect($command);
}
} elsif ($out =~ /^error:|^parseIt.*Assertion.*failed/) { # parser dies
print "Parser died!\n";
$output = "\n\n"; # output blank
print "Restart parser\n";
$main->hard_close();
$main = NewExpect($command);
} else {
print "Parse ok\n";
$output = "$out\n";
if ($out =~ /^\s*$/) { $numBlank = 1; }
else { $numBlank = 0; }
}
}
}
$output = &fixoutput($sent, $output);
print $client $output; # send output back to client
$main->clear_accum(); # clear buffer
}
}
sub fixoutput {
my ($input, $output) = @_;
my @input;
my @output;
my ($i, $length, $outlength);
@input = split /\s+/, replacesymbol($input);
$length = scalar(@input);
$outlength = 0;
while ($output =~ /[^\)]\)/g) { $outlength++; }
if ($outlength == 0) {
$output = "(S1 H:0 (X H:0";
for ($i = 0; $i < $length; $i++) {
$output .= " (. H:0 $input[$i])";
}
$output .= "))\n\n\n";
} elsif ($length != $outlength) {
$output =~ s/\)\s*$//;
for ($i = $outlength; $i < $length; $i++) {
$output .= " (. H:0 $input[$i])";
}
$output .= ")\n\n\n";
}
return $output;
}
sub replacesymbol {
my $input = shift;
$input =~ s/\(/-LRB-/g;
$input =~ s/\)/-RRB-/g;
$input =~ s/\[/-LSB-/g;
$input =~ s/\]/-RSB-/g;
$input =~ s/\{/-LCB-/g;
$input =~ s/\}/-RCB-/g;
return $input;
}
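The `fixoutput` routine in the script counts the leaves of the returned parse by matching `[^\)]\)`, that is, a closing bracket preceded by anything other than a closing bracket, and compares that count with the number of whitespace-separated input tokens. To see what that check does, here is a minimal shell sketch of the same counting. The function names are hypothetical, and `grep -o` is assumed to be available (it is in GNU and BSD grep).

```shell
#!/bin/sh
# count_leaves: count occurrences of "X)" where X is not ")".
# Each leaf such as "(NN test)" contributes exactly one.
count_leaves() {
    n=$(printf '%s\n' "$1" | grep -o '[^)])' | wc -l)
    echo $((n))
}

# count_tokens: count whitespace-separated tokens in the input sentence.
count_tokens() {
    n=$(printf '%s\n' "$1" | wc -w)
    echo $((n))
}

# Possible usage:
#   count_leaves '(S1 (S (NP (DT This)) (VP (AUX is) (NP (DT a) (NN test))) (. .)))'
#   count_tokens 'This is a test .'
```

When the two counts differ, `fixoutput` pads the parse with placeholder leaves so that every input token still appears in the output.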
Save the script as bioserver.pl and run the server in the background:
nohup perl ./bioserver.pl &
The parser should now be listening on port 4449 for incoming requests. Each request should consist of a single tokenized line ending with an LF. If you want the parser to tokenize the text itself, remove the '-K' parameter in line 5 of the script (the $command definition). Each response also consists of a single line ending with an LF.
Test the server:
echo This is a test . | nc localhost 4449
(S1 (S (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test)))) (. .)))
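To parse more than one sentence, the same `nc` call can be put in a loop. The sketch below assumes a file with one tokenized sentence per line. It also checks each line on the client side, because the server script is written to answer blank lines and lines longer than `$MAXCHAR` (799) characters with an empty result. The function names `request_ok` and `parse_file` are hypothetical.

```shell
#!/bin/sh
MAXCHAR=799   # mirrors $MAXCHAR in the server script

# request_ok: succeed only if the line is non-blank and at most MAXCHAR characters.
request_ok() {
    case "$1" in
        *[![:space:]]*) ;;     # contains at least one non-space character
        *) return 1 ;;         # empty or whitespace only
    esac
    [ "${#1}" -le "$MAXCHAR" ]
}

# parse_file FILE [HOST] [PORT]: send each usable line to the server,
# one connection per line, and print the responses in order.
parse_file() {
    file=$1
    host=${2:-localhost}
    port=${3:-4449}
    while IFS= read -r line || [ -n "$line" ]; do
        if request_ok "$line"; then
            printf '%s\n' "$line" | nc "$host" "$port"
        else
            echo "skipped: blank or too long" >&2
        fi
    done < "$file"
}

# Possible usage (file name is a placeholder):
#   parse_file sentences.tok > sentences.ptb
```

The server handles one client at a time, so requests are sent sequentially. Some `nc` variants close the connection as soon as their input ends, in which case the response may be cut off; check your `nc` manual if that happens.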
That's it! The server is now ready to serve!