Sunday, January 6, 2013

Convert Weka Instances to Mallet InstanceList

If you ever need to do this conversion, the following code snippet may help:
 //Imports needed by this snippet  
 import java.util.Arrays;  
 import cc.mallet.pipe.Noop;  
 import cc.mallet.types.Alphabet;  
 import cc.mallet.types.FeatureVector;  
 import cc.mallet.types.Instance;  
 import cc.mallet.types.InstanceList;  
 import cc.mallet.types.Label;  
 import cc.mallet.types.LabelAlphabet;  
 import weka.core.Attribute;  
 import weka.core.Instances;  

 /**  
    * Converts Weka Instances to a Mallet InstanceList  
    * @param instances Weka instances  
    * @return Mallet InstanceList  
    */  
   public static InstanceList wekaInstances2MalletInstanceList(Instances instances) {  
     Alphabet dataAlphabet = new Alphabet();  
     LabelAlphabet targetAlphabet = new LabelAlphabet();  
     InstanceList instanceList = new InstanceList(new Noop(dataAlphabet, targetAlphabet));  
     int classIndex = instances.classIndex();  
     int numAttributes = instances.numAttributes();      
     for (int i = 0; i < numAttributes; i++) {  
       if (i == classIndex) {  
         continue;  
       }  
       Attribute attribute = instances.attribute(i);  
       dataAlphabet.lookupIndex(attribute.name());        
     }  
     Attribute classAttribute = instances.attribute(classIndex);  
     int numClasses = classAttribute.numValues();      
     for (int i = 0; i < numClasses; i++) {        
       targetAlphabet.lookupLabel(classAttribute.value(i));  
     }  
     int numInstance = instances.numInstances();  
     for (int i = 0; i < numInstance; i++) {  
       weka.core.Instance instance = instances.instance(i);  
       double[] values = instance.toDoubleArray();  
       int indices[] = new int[numAttributes];  
       int count = 0;  
       for (int j = 0; j < values.length; j++) {  
        if (j != classIndex && values[j] != 0.0) {  
          values[count] = values[j];  
          //map the Weka attribute to its index in the Mallet data alphabet;  
          //the raw Weka index would be off by one for attributes after the class attribute  
          indices[count] = dataAlphabet.lookupIndex(instances.attribute(j).name());  
          count++;  
        }  
       }  
       indices = Arrays.copyOf(indices, count);  
       values = Arrays.copyOf(values, count);  
       FeatureVector fv = new FeatureVector(dataAlphabet, indices, values);  
       String classValue = instance.stringValue(classIndex);  
       Label classLabel = targetAlphabet.lookupLabel(classValue);  
       Instance malletInstance = new Instance(fv, classLabel, null, null);  
       instanceList.addThruPipe(malletInstance);  
     }  
     return instanceList;  
   }  
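
For reference, here is a minimal usage sketch. The ARFF file name is just a placeholder, the call assumes the method above lives in the same class, and it should run from code that handles IOException:
  //Usage sketch (placeholder file name)  
  Instances wekaData = new Instances(new java.io.BufferedReader(new java.io.FileReader("data.arff")));  
  wekaData.setClassIndex(wekaData.numAttributes() - 1); //the converter needs the class index to be set  
  InstanceList malletData = wekaInstances2MalletInstanceList(wekaData);  
  System.out.println("Converted " + malletData.size() + " Mallet instances.");  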

Saturday, September 29, 2012

Mallet and LibSVM

Mallet and LibSVM are the two machine learning libraries that I have been using the most, and I felt the need for a way to use LibSVM directly from Mallet. As I mentioned in another post, I made a lightly refactored version of the Java implementation of LibSVM, mainly for easy integration of custom kernel functions. Doing that gave me a better understanding of how LibSVM works and consequently helped me integrate it with Mallet.
For classification tasks, a Mallet pipe produces a FeatureVector for each instance, so it is quite straightforward to transform it into a format suitable for LibSVM. However, custom kernel functions that work on data structures other than vectors need to be handled differently. The current version does not offer a way to pass an arbitrary data structure from the Mallet side, but the code can easily be tweaked for that.
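To give a rough idea of that transformation, a FeatureVector can be mapped to LibSVM's sparse representation along the following lines. This is only an illustrative sketch using the stock LibSVM Java classes (libsvm.svm_node with its index and value fields), not the exact code in the repository:
 //Illustrative sketch only: uses stock libsvm.svm_node, not the refactored classes in my repository  
 public static svm_node[] featureVector2SvmNodes(FeatureVector fv) {  
   svm_node[] nodes = new svm_node[fv.numLocations()];  
   for (int i = 0; i < fv.numLocations(); i++) {  
     svm_node node = new svm_node();  
     node.index = fv.indexAtLocation(i) + 1; //LibSVM feature indices conventionally start at 1  
     node.value = fv.valueAtLocation(i);  
     nodes[i] = node;  
   }  
   return nodes;  
 }  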
Mallet and LibSVM, being separate libraries, handle class labels differently, so all I had to do in SVMClassifier was align the class labels and scores between the two. I have also kept an option to tell LibSVM whether to predict probabilities, which you need if you want not just the best class but also the scores assigned to the other classes.
If you are interested, get it from GitHub. Let me know if you have any suggestions.

Wednesday, September 12, 2012

Writing Custom Kernel Functions in Java for LibSVM

For my research on protein-protein interaction extraction I had to experiment with several different custom kernel functions. For that I looked into the two most prevalent support vector machine libraries, SVMLight and LibSVM. In SVMLight one can plug in a custom kernel function through the kernel.h header file. LibSVM, on the other hand, does not allow custom kernel functions directly; however, one can pre-compute the kernel matrix (or Gram matrix) beforehand and feed it as input to the SVM. To me it seemed SVMLight would be the way to go. But then I found that LibSVM comes with an official Java implementation. I looked for a library that modifies that Java port to allow direct integration of kernel functions and found jlibsvm, which might have worked if I had found a little documentation for it. Then I decided to write a lightly refactored LibSVM on my own. Without much effort I have done that and have been using it ever since. If you prefer to write your custom kernel functions in Java, you can give it a try:
https://github.com/syeedibnfaiz/libsvm-java-kernel.git 

Writing a kernel function could not be easier: all you have to do is implement the CustomKernel interface. Here is how you can write a linear kernel:
 /**  
  * <code>LinearKernel</code> implements a linear kernel function.  
  * @author Syeed Ibn Faiz  
  */  
 public class LinearKernel implements CustomKernel {  
   @Override  
   public double evaluate(svm_node x, svm_node y) {              
     if (!(x.data instanceof SparseVector) || !(y.data instanceof SparseVector)) {  
       throw new RuntimeException("Could not find sparse vectors in svm_nodes");  
     }      
     SparseVector v1 = (SparseVector) x.data;  
     SparseVector v2 = (SparseVector) y.data;  
     return v1.dot(v2);  
   }    
 }  
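
Other kernels follow the same pattern. For example, here is a sketch of a polynomial kernel, (gamma * <v1, v2> + coef0)^degree; the gamma, coef0 and degree values are just illustrative constants, and only the dot product of the sparse vectors is needed:
 /**  
  * <code>PolynomialKernel</code> sketches a polynomial kernel function.  
  */  
 public class PolynomialKernel implements CustomKernel {  
   private final double gamma = 1.0;  //illustrative values  
   private final double coef0 = 1.0;  
   private final int degree = 2;  
   @Override  
   public double evaluate(svm_node x, svm_node y) {  
     if (!(x.data instanceof SparseVector) || !(y.data instanceof SparseVector)) {  
       throw new RuntimeException("Could not find sparse vectors in svm_nodes");  
     }  
     SparseVector v1 = (SparseVector) x.data;  
     SparseVector v2 = (SparseVector) y.data;  
     return Math.pow(gamma * v1.dot(v2) + coef0, degree);  
   }  
 }  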

The kernel function you want to use should then be registered with the KernelManager. The following code snippet may give you a better idea of the whole workflow:
 public static void testLinearKernel(String[] args) throws IOException, ClassNotFoundException {  
     String trainFileName = args[0];  
     String testFileName = args[1];  
     String outputFileName = args[2];  
       
     //Read training file  
     Instance[] trainingInstances = DataFileReader.readDataFile(trainFileName);      
       
     //Register kernel function  
     KernelManager.setCustomKernel(new LinearKernel());      
       
     //Setup parameters  
     svm_parameter param = new svm_parameter();          
       
     //Train the model  
     System.out.println("Training started...");  
     svm_model model = SVMTrainer.train(trainingInstances, param);  
     System.out.println("Training completed.");              
       
     //Read test file  
     Instance[] testingInstances = DataFileReader.readDataFile(testFileName);  
     //Predict results  
     double[] predictions = SVMPredictor.predict(testingInstances, model, true);    
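      //Write the predictions to the output file  
      //(added here for completeness; the original snippet left outputFileName unused)  
      java.io.PrintWriter writer = new java.io.PrintWriter(new java.io.FileWriter(outputFileName));  
      for (double prediction : predictions) {  
        writer.println(prediction);  
      }  
      writer.close();  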
   }  

Monday, September 10, 2012

Using phpSyntaxTree to Visualize Parse Tree

phpSyntaxTree is a very nice PHP library for generating graphical syntax trees. I have been using it to visualize both syntax trees and dependency trees. Analysing the graphical version is a lot more convenient than looking at the text and imagining its structure. I made a simple interface to the library, which I am going to dump here.

I modified the file stgraph.png.php so that it now accepts GET requests. Here is the patch:
 40c40  
 < if ( !isset( $_SESSION['data'] ) )  
 ---  
 > if ( !isset( $_GET['data'] ) )  
 43c43  
 < $data = $_SESSION['data'];  
 ---  
 > $data = $_GET['data'];  
 45,50c45,50  
 < $color   = isset( $_SESSION['color'] )   ? $_SESSION['color']   : 0;  
 < $triangles = isset( $_SESSION['triangles'] ) ? $_SESSION['triangles'] : FALSE;  
 < $antialias = isset( $_SESSION['antialias'] ) ? $_SESSION['antialias'] : 0;  
 < $autosub  = isset( $_SESSION['autosub'] )  ? $_SESSION['autosub']  : 0;  
 < $font   = isset( $_SESSION['font'] )   ? $_SESSION['font']   : 'Vera.ttf';  
 < $fontsize = isset( $_SESSION['fontsize'] ) ? $_SESSION['fontsize'] : 8;  
 ---  
 > $color   = isset( $_GET['color'] )   ? $_GET['color']   : 1;  
 > $triangles = isset( $_GET['triangles'] ) ? $_GET['triangles'] : FALSE;  
 > $antialias = isset( $_GET['antialias'] ) ? $_GET['antialias'] : 1;  
 > $autosub  = isset( $_GET['autosub'] )  ? $_GET['autosub']  : 0;  
 > $font   = isset( $_GET['font'] )   ? $_GET['font']   : 'Vera.ttf';  
 > $fontsize = isset( $_GET['fontsize'] ) ? $_GET['fontsize'] : 8;  
 91a92  
 >   

I named the patched version draw.php; it serves as my interface to the library. To test it I wrote the following script:
 <html>  
 <body>  
 <?php  
 $phrase = $_GET['data'];  
 $phrase = str_replace("(", "[", $phrase);  
 $phrase = str_replace(")", "]", $phrase);  
 $color   = isset( $_GET['color'] )   ? $_GET['color']   : 1;  
 $antialias = isset( $_GET['antialias'] ) ? $_GET['antialias'] : 1;  
 $font   = isset( $_GET['font'] )   ? $_GET['font']   : 'Vera.ttf';  
 $fontsize = isset( $_GET['fontsize'] ) ? $_GET['fontsize'] : 8;  
 $query = "data=" . $phrase;  
 $query .= "&" . "color=" . $color;  
 $query .= "&" . "antilias=" . $antilias;  
 $query .= "&" . "font=" . $font;  
 $query .= "&" . "fontsize=" . $fontsize;  
 $img  = sprintf( "<img src=\"draw.php?%s\" alt=\"\" title=\"%s\"/>", $query, $phrase );  
 echo $img;  
 ?>  
 </body>  
 </html>  

Running the script like:
 test.php?data=(NP (DT a) (NP ball))   
generates the following image:
That's it!
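
If you ever want to build the request URL from Java instead of PHP (say, straight from the code that produces the parses), a small helper like the following sketch will do. The draw.php location is a placeholder, and the query parameters simply mirror the defaults used in the script above:
 import java.net.URLEncoder;  

 public class SyntaxTreeUrl {  
   //Builds a draw.php URL for a bracketed parse such as "(NP (DT a) (NP ball))".  
   public static String buildUrl(String baseUrl, String parse) throws java.io.UnsupportedEncodingException {  
     String phrase = parse.replace('(', '[').replace(')', ']'); //phpSyntaxTree expects square brackets  
     return baseUrl + "?data=" + URLEncoder.encode(phrase, "UTF-8")  
         + "&color=1&antialias=1&font=Vera.ttf&fontsize=8";  
   }  

   public static void main(String[] args) throws Exception {  
     //placeholder location for the patched draw.php  
     System.out.println(buildUrl("http://localhost/phpsyntaxtree/draw.php", "(NP (DT a) (NP ball))"));  
   }  
 }  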

How to Run the Charniak-Johnson Re-ranking Parser (BLLIP) as a Server

I have been using the BLLIP parser mainly for parsing biomedical text. The default parser and re-ranker models included in the package were trained on the WSJ corpus and therefore are not likely to work very well on biomedical text. However, there are publicly available models trained on biomedical text, namely the Genia corpus, and they work pretty well, or at least better than the Stanford parser with its default models. Here I am writing the steps down so that anybody can use them as a reference.

Step 1:
Download the BLLIP parser and decompress it.
 wget https://github.com/BLLIP/bllip-parser/tarball/master   
 tar xvzf master   

Step 2:
If you don't have flex installed, install it.
 sudo apt-get install flex  

Step 3:
Build the parser and re-ranker.
 cd BLLIP*   
 make  

Step 4:
Test the parser.
 ./parse.sh   
  <s> This is a test . </s>   
  [Ctrl-D to terminate]   

You should see the following output:
 (S1 (S (NP (DT This)) (VP (AUX is) (NP (DT a) (NN test))) (. .)))  

Step 5:
Download the biomedical model and decompress it:
 wget http://bllip.cs.brown.edu/download/bioparsingmodel-rel1.tar.gz   
  tar xvzf biopars*   

Step 6:
To test the biomedical model, use the following script:
 #! /bin/sh   
  BIOPARSINGMODEL=./biomodel   
  first-stage/PARSE/parseIt -l399 -N50 ${BIOPARSINGMODEL}/parser/ $* | second-stage/programs/features/best-parses -l ${BIOPARSINGMODEL}/reranker/features.gz ${BIOPARSINGMODEL}/reranker/weights.gz   

Step 7:
To run the parser as a server, I modified a Perl script that accompanies the Illinois Semantic Role Labeler package; it was originally written to run the Charniak parser as a server. Here is the script:
 #!/usr/bin/perl  
 $MAXCHAR = 799;  
 $MAXWORD = 400;  
 $BIOPARSINGMODEL = "./biomodel";  
 $command = "first-stage/PARSE/parseIt -K -l399 -N50 $BIOPARSINGMODEL/parser/ | second-stage/programs/features/best-parses -l $BIOPARSINGMODEL/reranker/features.gz $BIOPARSINGMODEL/reranker/weights.gz";  
 #$charniakDir = "$ENV{CHARNIAK}";  
 #$command = "$charniakDir/PARSE/parseIt $charniakDir/DATA/EN/ -K -l$MAXWORD";  
 #$endProtocol = "\n\n\n";  
 $endProtocol = "\n";  
 $TIMEOUT = 60;         # undef if no timeout  
 $PORT = 4449;               # pick something not in use  
 #read port  
 $PORT = $ARGV[0] if (scalar(@ARGV) > 0);  
 use Expect;  
 #create main program that will be communicating through a pipe.  
 $main = NewExpect($command);  
 sub NewExpect {  
  my $command = shift;  
  my $main;  
  print "[Initializing...]\n";  
  $main = new Expect();  
  $main->raw_pty(1);   # no local echo   
  $main->log_stdout(0); # no echo  
  $main->spawn($command) or die "Cannot start: $command\n";  
  $main->send("<s> This is a test . </s>\n"); #send input to main program  
  @res = $main->expect(undef,$endProtocol); # read output from main program  
  print $res[3];  
  print "[Done initializing.]\n";  
  return $main;  
 }  
 #server initialization matter  
 use IO::Socket;  
 use Net::hostent;          # for OO version of gethostbyaddr  
 $server = IO::Socket::INET->new( Proto   => 'tcp',  
                  LocalPort => $PORT,  
                  Listen  => SOMAXCONN,  
                  Reuse   => 1);  
 die "Can't setup server\n" unless $server;  
 #end server initialization  
 #set autoflush  
 $old_handle = select(STDOUT);  
 $| = 1;  
 select($old_handle);  
 $old_handle = select(STDERR);  
 $| = 1;  
 select($old_handle);  
 print "[Server $0 accepting clients]\n";  
 while ($client = $server->accept()) {  
  $main->expect(0); # flush old stuff if any  
  $main->clear_accum(); # clear buffer  
  $client->autoflush(1);  
  $clientinfo = gethostbyaddr($client->peeraddr);  
  if (defined($clientinfo)) {  
   $clientname = ($clientinfo->name || $client->peerhost);  
  } else {  
   $clientname = $client->peerhost;  
  }  
  printf "[Connect from %s]\n", $clientname;  
  &RunClient($client);  
  shutdown($client,3);  
  close($client);  
  printf "[Connection closed from %s]\n", $clientname;  
 }  
 $main->hard_close();  
 sub RunClient {  
  my $client = shift;  
  my $msg;  
  my $output;  
  my @res;  
  my $timeout;  
  my $sent;  
  while ($sent = <$client>) {  
   chomp $sent;  
   $sent =~ s/^\s+//;  
   $sent =~ s/\s+$//;  
   if ($sent =~ /^\s*$/) { # sending blank line will cause the parser to quit  
    $output = "\n\n";  
   } elsif (length > $MAXCHAR) {  
    $output = "\n\n";  
   } else {  
    $msg = "<s> $sent </s>\n";  
    print "Parse: $msg";  
    $main->send("$msg"); #send input to main program  
    @res = $main->expect($TIMEOUT,$endProtocol); # read output from main program  
    # @res = ($mp, $er, $ms, $bf, $af);  
    # $mp is ???  
    # $er is undef or 1:TIMEOUT  
    # $ms is the matched message  
    # $bf is the message before $ms  
    # $af is the message after $ms  
    $timeout = $res[1];  
    $out = $res[3];  
    if ($timeout) { # parser possibly gets stuck, restart it.  
     print "Time out!\n";  
     $output = "\n\n"; # output blank  
     print "Restart parser\n";  
     $main->hard_close();  
     $main = NewExpect($command);  
    } else {  
     if ($out =~ /^Parse failed/) {  
      print "Parse failed\n";  
      $output = "\n\n";  
      @res = $main->expect($TIMEOUT,$endProtocol); # read off the original sentence  
      $timeout = $res[1];  
      if ($timeout) { # parser possibly gets stuck, restart it.  
       print "Time out when reading off the original sentence!\n";  
       print "Restart parser\n";  
       $main->hard_close();  
       $main = NewExpect($command);  
      }  
     } elsif ($out =~ /^error:|^parseIt.*Assertion.*failed/) { # parser dies  
      print "Parser died!\n";  
      $output = "\n\n"; # output blank  
      print "Restart parser\n";  
      $main->hard_close();  
      $main = NewExpect($command);  
     } else {  
      print "Parse ok\n";  
      $output = "$out\n";  
      if ($out =~ /^\s*$/) { $numBlank = 1; }  
      else { $numBlank = 0; }  
     }  
    }  
   }  
   $output = &fixoutput($sent, $output);  
   print $client $output; # send output back to client  
   $main->clear_accum(); # clear buffer  
  }  
 }  
 sub fixoutput {  
  my ($input, $output) = @_;  
  my @input;  
  my @output;  
  my ($i, $length, $outlength);  
  @input = split /\s+/, replacesymbol($input);  
  $length = scalar(@input);  
  $outlength = 0;  
  while ($output =~ /[^\)]\)/g) { $outlength++; }  
  if ($outlength == 0) {  
   $output = "(S1 H:0 (X H:0";  
   for ($i = 0; $i < $length; $i++) {  
    $output .= " (. H:0 $input[$i])";  
   }  
   $output .= "))\n\n\n";  
  } elsif ($length != $outlength) {  
   $output =~ s/\)\s*$//;  
   for ($i = $outlength; $i < $length; $i++) {  
    $output .= " (. H:0 $input[$i])";  
   }  
   $output .= ")\n\n\n";  
  }  
  return $output;  
 }  
 sub replacesymbol {  
  my $input = shift;  
  $input =~ s/\(/-LRB-/g;  
  $input =~ s/\)/-RRB-/g;  
  $input =~ s/\[/-LSB-/g;  
  $input =~ s/\]/-RSB-/g;  
  $input =~ s/\{/-LCB-/g;  
  $input =~ s/\}/-RCB-/g;  
  return $input;  
 }  


Run the server in the background:
  nohup perl ./bioserver.pl &  

The parser should now be listening on port 4449 for incoming requests. Each request should consist of a single tokenized line ending with an LF. If you want the parser to tokenize the text itself, remove the '-K' parameter on line 5 of the script. Each response likewise consists of a single line ending with an LF.

Test the server:
 echo This is a test . | nc localhost 4449  
 (S1 (S (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test)))) (. .)))  
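
If you would rather talk to the server from Java than from the shell, a minimal client sketch looks like this. The host, port and example sentence are placeholders, and the sentence must already be tokenized, one per line:
 import java.io.BufferedReader;  
 import java.io.InputStreamReader;  
 import java.io.PrintWriter;  
 import java.net.Socket;  

 public class ParserClient {  
   public static void main(String[] args) throws Exception {  
     Socket socket = new Socket("localhost", 4449); //adjust host/port as needed  
     PrintWriter out = new PrintWriter(socket.getOutputStream(), true);  
     BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));  
     out.println("This is a test ."); //one tokenized sentence per request  
     String parse = in.readLine();    //the parse comes back as a single line  
     System.out.println(parse);  
     socket.close();  
   }  
 }  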

That's it! The server is now ready to serve!

Monday, September 26, 2011

Mallet and Weka

I have been using Mallet for some time now. I have also used Weka, but I preferred Mallet for various reasons. However, since I still use Weka from time to time, I wrote a converter from a Mallet InstanceList to the Weka ARFF format. This also lets me use the classifiers in Weka quite easily.


 package ca.uwo.csd.ai.nlp.weka;  
 import cc.mallet.types.Alphabet;  
 import cc.mallet.types.FeatureVector;  
 import cc.mallet.types.Instance;  
 import cc.mallet.types.InstanceList;  
 import java.io.IOException;  
 import java.io.StringReader;  
 import weka.core.Instances;  
 /**  
  * Converts Mallet instanceList to Weka ARFF/Instances  
  * @author Syeed Ibn Faiz  
  */  
 public class Converter {  
   /**  
    * Converts Mallet InstanceList into Weka ARFF format  
    * @param instances Mallet instances  
    * @param description a String description required by Weka  
    * @return ARFF representation of the InstanceList  
    */  
   public static String convert2ARFF(InstanceList instances, String description) {  
     Alphabet dataAlphabet = instances.getDataAlphabet();  
     Alphabet targetAlphabet = instances.getTargetAlphabet();  
     StringBuilder sb = new StringBuilder();  
     sb.append("@Relation \"").append(description).append("\"\n\n");  
     int size = dataAlphabet.size();  
     for (int i = 0; i < size; i++) {  
       sb.append("@attribute \"").append(dataAlphabet.lookupObject(i).toString().replaceAll("\\s+", "_")).append("_").append(i);  
       sb.append("\" numeric\n");  
     }  
     sb.append("@attribute target {");  
     for (int i = 0; i < targetAlphabet.size(); i++) {  
       if (i != 0) sb.append(",");  
       sb.append(targetAlphabet.lookupObject(i).toString().replace(",", ";"));  
     }  
     sb.append("}\n\n@data\n");  
     for (int i = 0; i < instances.size(); i++) {  
       Instance instance = instances.get(i);  
       sb.append("{");  
       FeatureVector fv = (FeatureVector) instance.getData();  
       int[] indices = fv.getIndices();  
       double[] values = fv.getValues();  
       boolean[] attrFlag = new boolean[size];  
       double[] attrValue = new double[size];  
       for (int j = 0; j < indices.length; j++) {  
         attrFlag[indices[j]] = true;  
         attrValue[indices[j]] = values[j];  
       }        
       for (int j = 0; j < attrFlag.length; j++) {          
         if (attrFlag[j]) {            
           //sb.append(j).append(" 1, ");            
           sb.append(j).append(" ").append(attrValue[j]).append(", ");  
         }          
       }  
       sb.append(attrFlag.length).append(" ").append(instance.getTarget().toString().replace(",", ";"));  
       sb.append("}\n");        
     }  
     return sb.toString();  
   }  
   /**  
    * Converts Mallet InstanceList into Weka Instances  
    * @param instanceList  
    * @return  
    * @throws IOException   
    */  
   public static Instances convert2WekaInstances(InstanceList instanceList) throws IOException {  
     String arff = convert2ARFF(instanceList, "DESC");  
     StringReader reader = new StringReader(arff);  
     Instances instances = new Instances(reader);  
     instances.setClassIndex(instances.numAttributes() - 1);  
     return instances;  
   }  
 }  

It is now quite straightforward to call a classifier in Weka, as shown in the following example:

 public static void main(String[] args) throws IOException, Exception {  
     ArrayList<Pipe> pipes = new ArrayList<Pipe>();  
     pipes.add(new Target2Label());  
     pipes.add(new CharSequence2TokenSequence());  
     pipes.add(new TokenSequence2FeatureSequence());  
     pipes.add(new FeatureSequence2FeatureVector());  
     SerialPipes pipe = new SerialPipes(pipes);  
     //prepare training instances  
     InstanceList trainingInstanceList = new InstanceList(pipe);  
     trainingInstanceList.addThruPipe(new CsvIterator(new FileReader("webkb-train-stemmed.txt"),  
         "(.*)\t(.*)", 2, 1, -1));  
     //prepare test instances  
     InstanceList testingInstanceList = new InstanceList(pipe);  
     testingInstanceList.addThruPipe(new CsvIterator(new FileReader("webkb-test-stemmed.txt"),  
         "(.*)\t(.*)", 2, 1, -1));  
     //Using a classifier in Mallet  
     ClassifierTrainer trainer = new NaiveBayesTrainer();  
     Classifier classifier = trainer.train(trainingInstanceList);  
     System.out.println("Accuracy[Mallet]: " + classifier.getAccuracy(testingInstanceList));  
     //Getting Weka Instances  
     Instances trainingInstances = Converter.convert2WekaInstances(trainingInstanceList);  
     Instances testingInstances = Converter.convert2WekaInstances(testingInstanceList);  
     //Using a classifier in Weka  
     NaiveBayesMultinomial naiveBayesMultinomial = new NaiveBayesMultinomial();  
     naiveBayesMultinomial.buildClassifier(trainingInstances);  
     Evaluation evaluation = new Evaluation(testingInstances);  
     evaluation.evaluateModel(naiveBayesMultinomial, testingInstances);  
     System.out.println("Accuracy[Weka]: " + evaluation.correct() / testingInstanceList.size());      
   }  
Using the WebKB dataset I got the following output:

 Accuracy[Mallet]: 0.836676217765043  
 Accuracy[Weka]: 0.836676217765043  

Friday, July 29, 2011

Porting Genia POS Tagger 3.0.1 to Windows

I have just ported the latest version of the Genia POS Tagger (3.0.1) to Windows. One can use it on Windows without porting, via Cygwin; in fact, that is what I had been doing so far. Lately I decided to run the tagger as a server, so I wrote a Java wrapper. Today, when I scanned through the code again, I found that only a few adjustments are needed to port it to Windows. All one has to do is change the put_stop_watch function in bidir.cpp: I replaced gettimeofday, which is not available on Windows, with GetTickCount() from Windows.h. A few other minor adjustments are also required, such as changing the paths to the model files.

You can download the Windows port from here.