Saturday, September 29, 2012

Mallet and LibSVM

Mallet and LibSVM are the two machine learning libraries I use the most, and I wanted a way to use LibSVM directly from Mallet. As I mentioned in another post, I made a lightly refactored version of the Java implementation of LibSVM, mainly for easy integration of custom kernel functions. Doing that gave me a better understanding of how LibSVM works and consequently helped me integrate it with Mallet.
For classification tasks, a Mallet instance pipe creates a FeatureVector out of an instance, so it is quite straightforward to transform it into a format suitable for LibSVM. However, custom kernel functions that work on data structures other than vectors need to be handled differently. The current version has no option for passing an arbitrary data structure from the Mallet end, but the code can easily be tweaked for that.
Mallet and LibSVM, being separate libraries, handle class labels differently. All I had to do in SVMClassifier was align the class labels and scores from the two libraries. I have kept an option to tell LibSVM whether to predict probabilities, which is required if you need not only the best class but also the scores given to the other classes.
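To make the label alignment idea concrete, here is a minimal sketch in Python rather than Java. It is not the code in SVMClassifier. It only assumes that LibSVM reports probability estimates in the order of the labels stored in its model, and that the consumer wants them in the order of its own label list. The function and parameter names are hypothetical.

```python
def align_scores(libsvm_labels, libsvm_probs, target_labels):
    """Reorder LibSVM probability estimates into another label order.

    libsvm_labels: class labels in the order LibSVM stores them in the model
    libsvm_probs:  probability estimates, parallel to libsvm_labels
    target_labels: the same labels in the order the caller expects
                   (hypothetical stand-in for a Mallet label alphabet)
    """
    by_label = dict(zip(libsvm_labels, libsvm_probs))
    # A label LibSVM never saw during training gets a score of 0.0.
    return [by_label.get(label, 0.0) for label in target_labels]


# LibSVM saw class 1 first, then 0, then 2; the caller wants 0, 1, 2.
scores = align_scores([1, 0, 2], [0.7, 0.2, 0.1], [0, 1, 2])
print(scores)
```

Keying the scores by label, instead of relying on positions, keeps the result correct regardless of the order in which the classes appeared in the training data.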
If you are interested, get it from GitHub. Let me know if you have any suggestions.

Wednesday, September 12, 2012

Writing Custom Kernel Functions in Java for LibSVM

For my research on protein-protein interaction extraction I had to experiment with several custom kernel functions. I looked into the two most prevalent support vector machine libraries, SVMLight and LibSVM. In SVMLight one can plug in a custom kernel function through the kernel.h header file. LibSVM, on the other hand, does not allow custom kernel functions directly; however, one can pre-compute the kernel matrix (or Gram matrix) and feed it as input to the SVM. At first SVMLight seemed the way to go. But then I found that LibSVM comes with an official Java implementation, so I looked for a library that modifies that Java port to allow direct integration of kernel functions. I found jlibsvm, which might have worked if it had come with a little documentation. In the end I decided to write a lightly refactored LibSVM of my own. It did not take much effort, and I have been using it ever since. If you prefer to write your custom kernel functions in Java, you can give it a try.
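For comparison, the pre-computed kernel route mentioned above can be sketched as follows. This is an illustrative Python sketch, not part of the refactored library. It writes training lines in the format the stock LibSVM tools expect for pre-computed kernels (kernel type `-t 4`): the label, then `0:<serial number>`, then one `j:K(x, x_j)` entry per training instance. The helper names and the toy data are made up for the example.

```python
def linear(a, b):
    """Plain dot product, standing in for any custom kernel function."""
    return sum(x * y for x, y in zip(a, b))


def precomputed_lines(labels, rows, columns, kernel):
    """Yield lines of the form '<label> 0:<id> 1:K(x,x1) ... n:K(x,xn)'.

    rows:    instances to write (training or test instances)
    columns: the training instances that define the kernel matrix columns
    """
    for serial, (label, x) in enumerate(zip(labels, rows), start=1):
        values = " ".join(
            f"{j}:{kernel(x, c):g}" for j, c in enumerate(columns, start=1)
        )
        yield f"{label} 0:{serial} {values}"


train = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
labels = [1, -1, 1]
for line in precomputed_lines(labels, train, train, linear):
    print(line)
```

The drawback is visible even in this toy case: every instance needs a kernel value against every training instance, so the file grows quadratically with the training set.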

Writing a kernel function could hardly be easier. All you have to do is implement the CustomKernel interface. Here is how you can write a linear kernel:
 /**
  * <code>LinearKernel</code> implements a linear kernel function.
  * @author Syeed Ibn Faiz
  */
 public class LinearKernel implements CustomKernel {
   @Override
   public double evaluate(svm_node x, svm_node y) {
     // Assumes the payload is exposed as svm_node.data and that SparseVector provides dot()
     if (!(x.data instanceof SparseVector) || !(y.data instanceof SparseVector)) {
       throw new RuntimeException("Could not find sparse vectors in svm_nodes");
     }
     SparseVector v1 = (SparseVector) x.data;
     SparseVector v2 = (SparseVector) y.data;
     return v1.dot(v2);
   }
 }
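
What a linear kernel over sparse vectors has to compute can be shown independently of the library. The following Python sketch is only an illustration of the idea. It represents each sparse vector as a dictionary from feature index to value, which is an assumption made for the example and not how the library's SparseVector is implemented.

```python
def sparse_dot(v1, v2):
    """Dot product of two sparse vectors stored as {feature_index: value}."""
    # Iterate over the smaller vector; only shared indices contribute.
    if len(v1) > len(v2):
        v1, v2 = v2, v1
    return sum(value * v2[index] for index, value in v1.items() if index in v2)


a = {1: 2.0, 5: 1.0}
b = {5: 3.0, 9: 4.0}
print(sparse_dot(a, b))  # only feature 5 is shared
```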

The kernel function you want to use should then be registered with the KernelManager. The following code snippet may give you a better idea of the whole workflow:
 public static void testLinearKernel(String[] args) throws IOException, ClassNotFoundException {  
     String trainFileName = args[0];  
     String testFileName = args[1];  
     String outputFileName = args[2];  
     //Read training file  
     Instance[] trainingInstances = DataFileReader.readDataFile(trainFileName);      
     //Register kernel function  
     KernelManager.setCustomKernel(new LinearKernel());      
     //Setup parameters  
     svm_parameter param = new svm_parameter();          
     //Train the model  
     System.out.println("Training started...");  
     svm_model model = SVMTrainer.train(trainingInstances, param);  
     System.out.println("Training completed.");              
     //Read test file  
     Instance[] testingInstances = DataFileReader.readDataFile(testFileName);  
     //Predict results  
     double[] predictions = SVMPredictor.predict(testingInstances, model, true);
 }

Monday, September 10, 2012

Using phpSyntaxTree to Visualize Parse Tree

phpSyntaxTree is a very nice PHP library for generating graphical syntax trees. I have been using it to visualize both syntax trees and dependency trees. Analysing the graphical version is much more convenient than looking at the text and imagining its structure. I made a simple interface to the library, which I am going to dump here.

I modified the file stgraph.png.php so that it now accepts GET requests. Here is the patch:
 < if ( !isset( $_SESSION['data'] ) )  
 > if ( !isset( $_GET['data'] ) )  
 < $data = $_SESSION['data'];  
 > $data = $_GET['data'];  
 < $color   = isset( $_SESSION['color'] )   ? $_SESSION['color']   : 0;  
 < $triangles = isset( $_SESSION['triangles'] ) ? $_SESSION['triangles'] : FALSE;  
 < $antialias = isset( $_SESSION['antialias'] ) ? $_SESSION['antialias'] : 0;  
 < $autosub  = isset( $_SESSION['autosub'] )  ? $_SESSION['autosub']  : 0;  
 < $font   = isset( $_SESSION['font'] )   ? $_SESSION['font']   : 'Vera.ttf';  
 < $fontsize = isset( $_SESSION['fontsize'] ) ? $_SESSION['fontsize'] : 8;  
 > $color   = isset( $_GET['color'] )   ? $_GET['color']   : 1;  
 > $triangles = isset( $_GET['triangles'] ) ? $_GET['triangles'] : FALSE;  
 > $antialias = isset( $_GET['antialias'] ) ? $_GET['antialias'] : 1;  
 > $autosub  = isset( $_GET['autosub'] )  ? $_GET['autosub']  : 0;  
 > $font   = isset( $_GET['font'] )   ? $_GET['font']   : 'Vera.ttf';  
 > $fontsize = isset( $_GET['fontsize'] ) ? $_GET['fontsize'] : 8;  

The patched version was named draw.php. This is my interface to the library. To test it I wrote the following script:
 <?php
 // Convert Penn-style round brackets to the square brackets phpSyntaxTree expects.
 $phrase = $_GET['data'];
 $phrase = str_replace("(", "[", $phrase);
 $phrase = str_replace(")", "]", $phrase);
 $color   = isset( $_GET['color'] )   ? $_GET['color']   : 1;
 $antialias = isset( $_GET['antialias'] ) ? $_GET['antialias'] : 1;
 $font   = isset( $_GET['font'] )   ? $_GET['font']   : 'Vera.ttf';
 $fontsize = isset( $_GET['fontsize'] ) ? $_GET['fontsize'] : 8;
 // Build the query string for draw.php; the phrase contains spaces and brackets, so encode it.
 $query = "data=" . urlencode($phrase);
 $query .= "&" . "color=" . $color;
 $query .= "&" . "antialias=" . $antialias;
 $query .= "&" . "font=" . $font;
 $query .= "&" . "fontsize=" . $fontsize;
 $img  = sprintf( "<img src=\"draw.php?%s\" alt=\"\" title=\"%s\"/>", $query, htmlspecialchars($phrase) );
 echo $img;
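
If the image URL is built from another program instead of the PHP test page, the same two steps apply: switch to square brackets and URL-encode the query. Here is a hedged Python sketch of that. The function names are hypothetical, and it assumes the patched script is reachable as draw.php and reads the GET parameters shown in the patch.

```python
from urllib.parse import urlencode


def to_square_brackets(parse):
    """Rewrite a Penn-style bracketing with the square brackets used above."""
    return parse.replace("(", "[").replace(")", "]")


def draw_url(parse, script="draw.php", **options):
    """Build a URL for the patched drawing script, with all values URL-encoded."""
    query = {"data": to_square_brackets(parse), **options}
    return f"{script}?{urlencode(query)}"


print(draw_url("(NP (DT a) (NP ball))", color=1, antialias=1, fontsize=8))
```

Encoding matters because the bracketed phrase contains spaces and brackets, which are not safe to place in a URL unescaped.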

Running the script like:
 test.php?data=(NP (DT a) (NP ball))   
generates an image of the corresponding syntax tree.
That's it!

How to Run the Charniak-Johnson Re-ranking Parser (BLLIP) as a Server

I have been using the BLLIP parser mainly for parsing biomedical text. The default parser and re-ranker models included in the package were trained on WSJ text and are therefore unlikely to work well on biomedical text. However, there are publicly available models trained on biomedical text, namely the GENIA corpus, which work quite well on such text, or at least better than the Stanford parser with its default models. I am writing the steps down here so that anybody can use them as a reference.

Download BLLIP parser and decompress it.
 tar xvzf master   

If you don't have flex installed then install it.
 sudo apt-get install flex  

Build the parser and re-ranker.
 cd BLLIP*
 make

Test the parser.
  ./parse.sh
  <s> This is a test . </s>
  [Ctrl-D to terminate]   

You should see the following output:
 (S1 (S (NP (DT This)) (VP (AUX is) (NP (DT a) (NN test))) (. .)))  
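
Downstream code usually needs this bracketed string as a data structure. The following Python sketch is one simple way to read it and is not part of the BLLIP package. It turns the bracketing into nested lists and collects the (word, tag) pairs from the leaves. It assumes well-formed output in which literal brackets inside tokens have already been escaped.

```python
def read_tree(text):
    """Parse a bracketed tree such as '(NP (DT a) (NN test))' into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return [label] + children

    return node()


def tagged_words(tree):
    """Return (word, tag) pairs from the preterminal nodes, left to right."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return [(tree[1], tree[0])]
    return [pair for child in tree[1:] if isinstance(child, list)
            for pair in tagged_words(child)]


tree = read_tree("(S1 (S (NP (DT This)) (VP (AUX is) (NP (DT a) (NN test))) (. .)))")
print(tagged_words(tree))
```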

Download the biomedical model and decompress it:
  tar xvzf biopars*   

To test the biomedical model use the following script:
 #! /bin/sh   
  first-stage/PARSE/parseIt -l399 -N50 ${BIOPARSINGMODEL}/parser/ $* | second-stage/programs/features/best-parses -l ${BIOPARSINGMODEL}/reranker/features.gz ${BIOPARSINGMODEL}/reranker/weights.gz   

To run the parser as a server I modified a Perl script that accompanies the Illinois Semantic Role Labeler package. It was originally written to run the Charniak parser as a server. Here is the Perl script:
 $MAXCHAR = 799;  
 $MAXWORD = 400;  
 $BIOPARSINGMODEL = "./biomodel";  
 $command = "first-stage/PARSE/parseIt -K -l399 -N50 $BIOPARSINGMODEL/parser/ | second-stage/programs/features/best-parses -l $BIOPARSINGMODEL/reranker/features.gz $BIOPARSINGMODEL/reranker/weights.gz";  
 #$charniakDir = "$ENV{CHARNIAK}";  
 #$command = "$charniakDir/PARSE/parseIt $charniakDir/DATA/EN/ -K -l$MAXWORD";  
 #$endProtocol = "\n\n\n";  
 $endProtocol = "\n";  
 $TIMEOUT = 60;         # undef if no timeout  
 $PORT = 4449;               # pick something not in use  
 #read port  
 $PORT = $ARGV[0] if (scalar(@ARGV) > 0);  
 use Expect;  
 #create main program that will be communicating through pipe.
 $main = NewExpect($command);
 sub NewExpect {
  my $command = shift;
  my $main;
  print "[Initializing...]\n";
  $main = new Expect();
  $main->raw_pty(1);   # no local echo
  $main->log_stdout(0); # no echo
  $main->spawn($command) or die "Cannot start: $command\n";
  $main->send("<s> This is a test . </s>\n"); #send input to main program
  @res = $main->expect(undef,$endProtocol); # read output from main program
  print $res[3];
  print "[Done initializing.]\n";
  return $main;
 }
 #server initialization matter
 use IO::Socket;  
 use Net::hostent;          # for OO version of gethostbyaddr  
 $server = IO::Socket::INET->new( Proto   => 'tcp',  
                  LocalPort => $PORT,  
                  Listen  => SOMAXCONN,  
                  Reuse   => 1);  
 die "Can't setup server\n" unless $server;  
 #end server initialization  
 #set autoflush  
 $old_handle = select(STDOUT);  
 $| = 1;  
 $old_handle = select(STDERR);  
 $| = 1;  
 print "[Server $0 accepting clients]\n";  
 while ($client = $server->accept()) {
  $main->expect(0); # flush old stuff if any
  $main->clear_accum(); # clear buffer
  $clientinfo = gethostbyaddr($client->peeraddr);
  if (defined($clientinfo)) {
   $clientname = ($clientinfo->name || $client->peerhost);
  } else {
   $clientname = $client->peerhost;
  }
  printf "[Connect from %s]\n", $clientname;
  RunClient($client); # serve this client until it disconnects
  printf "[Connection closed from %s]\n", $clientname;
  close $client;
 }
 sub RunClient {
  my $client = shift;
  my $msg;
  my $output;
  my @res;
  my $timeout;
  my $sent;
  while ($sent = <$client>) {
   chomp $sent;
   $sent =~ s/^\s+//;
   $sent =~ s/\s+$//;
   if ($sent =~ /^\s*$/) { # sending blank line will cause the parser to quit
    $output = "\n\n";
   } elsif (length($sent) > $MAXCHAR) { # refuse overly long input
    $output = "\n\n";
   } else {
    $msg = "<s> $sent </s>\n";
    print "Parse: $msg";
    $main->send("$msg"); #send input to main program
    @res = $main->expect($TIMEOUT,$endProtocol); # read output from main program
    # @res = ($mp, $er, $ms, $bf, $af);
    # $mp is the position of the matched pattern
    # $er is undef or 1:TIMEOUT
    # $ms is the matched message
    # $bf is the message before $ms
    # $af is the message after $ms
    $timeout = $res[1];
    $out = $res[3];
    if ($timeout) { # parser possibly gets stuck, restart it.
     print "Time out!\n";
     $output = "\n\n"; # output blank
     print "Restart parser\n";
     $main = NewExpect($command);
    } else {
     if ($out =~ /^Parse failed/) {
      print "Parse failed\n";
      $output = "\n\n";
      @res = $main->expect($TIMEOUT,$endProtocol); # read off the original sentence
      $timeout = $res[1];
      if ($timeout) { # parser possibly gets stuck, restart it.
       print "Time out when reading off the original sentence!\n";
       print "Restart parser\n";
       $main = NewExpect($command);
      }
     } elsif ($out =~ /^error:|^parseIt.*Assertion.*failed/) { # parser dies
      print "Parser died!\n";
      $output = "\n\n"; # output blank
      print "Restart parser\n";
      $main = NewExpect($command);
     } else {
      print "Parse ok\n";
      $output = "$out\n";
      if ($out =~ /^\s*$/) { $numBlank = 1; }
      else { $numBlank = 0; }
     }
    }
   }
   $output = &fixoutput($sent, $output);
   print $client $output; # send output back to client
   $main->clear_accum(); # clear buffer
  }
 }
 sub fixoutput {
  my ($input, $output) = @_;
  my @input;
  my @output;
  my ($i, $length, $outlength);
  @input = split /\s+/, replacesymbol($input);
  $length = scalar(@input);
  $outlength = 0;
  while ($output =~ /[^\)]\)/g) { $outlength++; } # count leaf nodes in the parse
  if ($outlength == 0) { # no parse: emit a flat tree over the input tokens
   $output = "(S1 H:0 (X H:0";
   for ($i = 0; $i < $length; $i++) {
    $output .= " (. H:0 $input[$i])";
   }
   $output .= "))\n\n\n";
  } elsif ($length != $outlength) { # pad the parse with the missing tokens
   $output =~ s/\)\s*$//;
   for ($i = $outlength; $i < $length; $i++) {
    $output .= " (. H:0 $input[$i])";
   }
   $output .= ")\n\n\n";
  }
  return $output;
 }
 sub replacesymbol {  
  my $input = shift;  
  $input =~ s/\(/-LRB-/g;  
  $input =~ s/\)/-RRB-/g;  
  $input =~ s/\[/-LSB-/g;  
  $input =~ s/\]/-RSB-/g;  
  $input =~ s/\{/-LCB-/g;  
  $input =~ s/\}/-RCB-/g;  
  return $input;
 }
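
The bracket substitutions made by replacesymbol are easy to mirror on the client side, for instance to escape tokens before sending a sentence or to restore them when reading a parse back. The sketch below is a hypothetical Python helper. Only the six mappings are taken from the script above.

```python
BRACKET_ESCAPES = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}


def escape_brackets(sentence):
    """Replace literal brackets with their Penn Treebank escape tokens."""
    return "".join(BRACKET_ESCAPES.get(ch, ch) for ch in sentence)


def unescape_brackets(text):
    """Inverse of escape_brackets, for text taken from the leaves of a parse."""
    for raw, escaped in BRACKET_ESCAPES.items():
        text = text.replace(escaped, raw)
    return text


print(escape_brackets("IL-2 ( interleukin ) [ 1 ]"))
```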

Run the server in the background:
  nohup perl ./ &  

The parser should now be listening on port 4449 for incoming requests. Each request should consist of a single tokenized line ending with an LF. If you want the parser to tokenize the text itself, remove the '-K' option from the $command definition near the top of the script. A response also consists of a single line ending with an LF.

Test the server:
 echo This is a test . | nc localhost 4449  
 (S1 (S (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test)))) (. .)))  
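
The same exchange can be done from a program. Below is a minimal client sketch in Python, assuming the protocol described above: one tokenized line ending with an LF goes in, and the first line of the reply is the parse. The function name and the default timeout are arbitrary choices made for the example.

```python
import socket


def parse_sentence(sentence, host="localhost", port=4449, timeout=90.0):
    """Send one tokenized sentence to the parser server and return its reply line."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # One request is a single line terminated by LF.
        conn.sendall(sentence.strip().encode("utf-8") + b"\n")
        with conn.makefile("r", encoding="utf-8") as reader:
            # The first line of the response holds the parse.
            return reader.readline().strip()
```

With the server running, `parse_sentence("This is a test .")` should return the same bracketed parse that the nc command prints. The client timeout is set longer than the server's own 60 second limit so that the server gets a chance to give up and restart the parser first.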

That's it! The server is now ready to serve!