Wednesday, 2 September 2015

Co-reference Resolution in Stanford CoreNLP


In the previous blog, we discussed dependency parsing. Now we will discuss how to identify the expressions or entities that refer to the same person, thing, or object. This problem is solved using co-reference resolution.

Co-reference resolution (or anaphora resolution) is the task of finding all the expressions that refer to the same entity across one or more sentences.

Example:  James said that he would go out for dinner.

Here you can see that ‘James’ and ‘he’ both refer to the same person.
Co-reference resolution is an important step in many Natural Language Processing tasks, e.g. information retrieval, question answering, etc.

Now we’ll see how we can implement it using the Stanford CoreNLP package in Java.


   import java.io.IOException;
   import java.util.List;
   import java.util.Map;
   import java.util.Properties;
   import java.util.Set;

   import edu.stanford.nlp.dcoref.CorefChain;
   import edu.stanford.nlp.dcoref.CorefChain.CorefMention;
   import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
   import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
   import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
   import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
   import edu.stanford.nlp.ling.CoreLabel;
   import edu.stanford.nlp.pipeline.Annotation;
   import edu.stanford.nlp.pipeline.StanfordCoreNLP;

   public class CoRefExample {

       public static void main(String[] args) throws IOException {
           Properties props = new Properties();
           props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
           StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

           // read some text in the text variable
           String text = "The Revolutionary War occurred in the 1700s. It was the first war in the US states.";

           // create an empty Annotation just with the given text
           Annotation document = new Annotation(text);

           // run all Annotators on this text
           pipeline.annotate(document);

           // This is the coreference link graph.
           // Each chain stores a set of mentions that link to each other,
           // along with a method for getting the most representative mention.
           // Both sentence and token offsets start at 1!
           Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);

           for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
               CorefChain c = entry.getValue();

               // build the text of the representative mention of this chain
               CorefMention cm = c.getRepresentativeMention();
               String clust = "";
               List<CoreLabel> tks = document.get(SentencesAnnotation.class)
                       .get(cm.sentNum - 1).get(TokensAnnotation.class);
               for (int i = cm.startIndex - 1; i < cm.endIndex - 1; i++)
                   clust += tks.get(i).get(TextAnnotation.class) + " ";
               clust = clust.trim();
               System.out.println("representative mention: \"" + clust
                       + "\" is mentioned by:");

               // walk every mention in the chain and print it,
               // skipping the representative (self) mention
               for (Set<CorefMention> mentionSet : c.getMentionMap().values()) {
                   for (CorefMention m : mentionSet) {
                       String clust2 = "";
                       tks = document.get(SentencesAnnotation.class).get(m.sentNum - 1)
                               .get(TokensAnnotation.class);
                       for (int i = m.startIndex - 1; i < m.endIndex - 1; i++)
                           clust2 += tks.get(i).get(TextAnnotation.class) + " ";
                       clust2 = clust2.trim();
                       // don't need the self mention
                       if (clust.equals(clust2))
                           continue;
                       System.out.println("\t" + clust2);
                   }
               }
           }
       }
   }

Once you execute the above code, you will see that “Revolutionary War” and “It” are resolved to the same entity.
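With the sample text above, the printed output should look roughly like the following (exact mention boundaries depend on the CoreNLP version and models, and other mentions in the chain may also be listed):

     representative mention: "The Revolutionary War" is mentioned by:
             It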
Now it’s your turn to try it out. You can find the full code on GitHub.

Sunday, 30 August 2015

Dependency Parsing in Stanford CoreNLP


If you are working on Natural Language Processing, this post will be useful for extracting triplets from documents.
Here we assume you have basic knowledge of concepts such as Part-of-Speech tagging and tokenization. Let’s discuss Dependency Parsing first.

Stanford Dependency Parsing:
Stanford dependencies provide a representation of grammatical relations between the words in a sentence. Each dependency is a triplet: the name of the relation, the governor, and the dependent.
Here is an example sentence :
Bell, based in Los Angeles, makes and distributes electronic, computer and building products.

We can see, for instance, that the subject of the verb ‘distributes’ is ‘Bell’. For the above sentence, the Stanford dependencies (SD) representation is:

     nsubj(makes-8, Bell-1)
     nsubj(distributes-10, Bell-1)
     vmod(Bell-1, based-3)
     nn(Angeles-6, Los-5)
     prep_in(based-3, Angeles-6)
     root(ROOT-0, makes-8)
     conj_and(makes-8, distributes-10)
     amod(products-16, electronic-11)
     conj_and(electronic-11, computer-13)
     amod(products-16, computer-13)
     conj_and(electronic-11, building-15)
     amod(products-16, building-15)
     dobj(makes-8, products-16)





In the above representation, the first term is the dependency tag, which names the relation between the governor (2nd term) and the dependent (3rd term).
There are various dependency tags, all of which are listed in the Stanford Dependencies manual.

Following are the two types of dependency representations:
  •  Basic/Non-collapsed: This representation gives the basic dependencies as well as the extra ones (which break the tree structure), without any collapsing or propagation of conjuncts. E.g.
                prep(based-7, in-8)
                pobj(in-8, LA-9)
  •  Collapsed: In the collapsed representation, dependencies involving prepositions and conjuncts, as well as information about the referents of relative clauses, are collapsed to obtain direct dependencies between content words. For instance, the dependencies involving the preposition “in” in this example:
               prep(based-7, in-8)
               pobj(in-8, LA-9)
         will become:  prep_in(based-7, LA-9)  (see the short API sketch after this list)
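Both representations are available from the GrammaticalStructure API. Here is a minimal sketch, assuming gs is a GrammaticalStructure built exactly as in the full ParserDemo program further below:

     // Assumes gs is a GrammaticalStructure (see the ParserDemo program below).

     // Basic (non-collapsed) dependencies, e.g. prep(based-7, in-8), pobj(in-8, LA-9)
     for (TypedDependency td : gs.typedDependencies())
         System.out.println(td);

     // Collapsed dependencies, e.g. prep_in(based-7, LA-9)
     for (TypedDependency td : gs.typedDependenciesCollapsed())
         System.out.println(td);

     // Collapsed dependencies with propagation of conjunct dependencies
     // (this is what the ParserDemo program below prints)
     for (TypedDependency td : gs.typedDependenciesCCprocessed())
         System.out.println(td);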

Now we’ll see how we can get these dependencies using a complete Java program.

import java.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class ParserDemo {
    public static void main(String[] args) {
        // load the English PCFG model shipped with the Stanford parser
        LexicalizedParser lp = LexicalizedParser
                .loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
        lp.setOptionFlags(new String[] { "-maxLength", "80",
                "-retainTmpSubcategories" });

        // parse a pre-tokenized sentence and print the constituency tree
        String[] sent = { "This", "is", "an", "easy", "sentence", "." };
        List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
        Tree parse = lp.apply(rawWords);
        parse.pennPrint();
        System.out.println();

        // build the grammatical structure and print the typed dependencies
        // (typedDependencies() would give the basic representation instead)
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
        System.out.println(tdl);

        // print the tree and the collapsed dependencies together
        TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
        tp.printTree(parse);
    }
}
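
Each TypedDependency in tdl already carries the three parts of a triplet, so turning the parse into (relation, governor, dependent) triplets is a short loop. Here is a minimal sketch (not part of the original demo) that can be dropped into main after tdl is built:

     // Walk the typed dependencies built above and print them as triplets.
     for (TypedDependency td : tdl) {
         String relation = td.reln().toString();   // dependency tag, e.g. nsubj
         String governor = td.gov().value();       // governor word
         String dependent = td.dep().value();      // dependent word
         System.out.println(relation + "(" + governor + ", " + dependent + ")");
     }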


 Now you can easily extract triplets from a document. You can find the full example code in the GitHub repo.


Monday, 3 August 2015

Setting up Hadoop Cluster in Pseudo-distributed mode on Ubuntu


Here we’ll discuss setting up a Hadoop cluster in pseudo-distributed mode on a Linux environment. We are using Hadoop 2.x for this.
Pre-requisites:
     -     Java 7
     -      Adding a dedicated user
     -     Configuring ssh

Step 1: Install Java:

324532@ubuntu:~$ sudo apt-get install openjdk-7-jdk
324532@ubuntu:~$ java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (IcedTea 2.5.5) (7u79-2.5.5-0ubuntu0.14.04.2)
OpenJDK Client VM (build 24.79-b02, mixed mode, sharing)

Step 2: Add a dedicated hadoop user

Though it is not mandatory, we create a dedicated user to separate the Hadoop installation from other packages.
324532@ubuntu:~$ sudo addgroup hadoop
324532@ubuntu:~$ sudo adduser --ingroup hadoop hduser
This adds the user hduser to the hadoop group.

Step 3: Install ssh:

324532@ubuntu:~$ sudo apt-get install ssh
324532@ubuntu:~$ sudo apt-get install openssh-server
Once it is installed, make sure the ssh service is running.

Step 4: Configure ssh

Hadoop uses ssh to manage its nodes, so we need ssh running and configured for key-based (passwordless) authentication.

First generate an SSH key for hduser.

324532@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$

Once the key is generated, append the public key to the authorized_keys file.

hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Once the key is copied, verify that you can ssh to localhost, and then continue with the Hadoop setup.

hduser@ubuntu:~$ ssh localhost


Step 5: Set up the Hadoop cluster

Download a release from the Apache download mirrors and extract it into a folder, e.g. ‘/usr/local/hadoop/’. Set JAVA_HOME and the other Hadoop-related environment variables in the .bash_profile file of hduser.
# set to the root of your Java installation
 export JAVA_HOME=/usr/java/latest
 export HADOOP_INSTALL=/usr/local/hadoop
 export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
 export HADOOP_COMMON_HOME=$HADOOP_INSTALL
 export HADOOP_HDFS_HOME=$HADOOP_INSTALL
 export YARN_HOME=$HADOOP_INSTALL
 export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
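After editing the file, reload it in the current shell (or log out and log back in) so the variables take effect. Note that /usr/java/latest is just a placeholder; point JAVA_HOME at your actual JDK, which for the OpenJDK 7 package installed earlier is typically /usr/lib/jvm/java-7-openjdk-amd64 on 64-bit Ubuntu.

hduser@ubuntu:~$ source ~/.bash_profile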
Hadoop can run in 3 modes:
     1.    Local (standalone) mode
     2.    Pseudo-distributed mode
     3.    Fully distributed mode

Local (Standalone) Mode: All of Hadoop runs in a non-distributed manner as a single Java process, and the local filesystem is used for data storage.

Pseudo-Distributed Mode: Hadoop runs on a single node, but each daemon runs as a separate Java process.

Fully Distributed Mode: Hadoop runs on multiple nodes in a master-slave architecture, with each daemon running as a separate Java process and the daemons spread across the machines of the cluster.

Configuration:

Following is the minimal configuration you need to add to the configuration files to start a pseudo-distributed cluster.

etc/hadoop/core-site.xml 
<configuration>
    <property>
        <name>fs.defaultFS</name>
<!-- The NameNode (default) filesystem URI -->
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>


etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

To run MapReduce jobs on YARN: etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value> <!--other values are local, classic -->
    </property>
</configuration>

etc/hadoop/yarn-site.xml:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Execution
Once the above configuration is done, the next step is to format the namenode.

To format the filesystem:
  hduser@ubuntu:~$ bin/hdfs namenode -format

To start the NameNode and DataNode daemons:
  hduser@ubuntu:~$ sbin/start-dfs.sh

Browse the NameNode web interface; by default it is at http://localhost:50070

To start the YARN daemons (ResourceManager and NodeManager), run the following:
hduser@ubuntu:~ $ sbin/start-yarn.sh

You can browse the ResourceManager at http://localhost:8088

If you want to start all the daemons together, you can run the following (start-all.sh is deprecated in Hadoop 2.x in favour of start-dfs.sh and start-yarn.sh):
hduser@ubuntu:~$ sbin/start-all.sh

Now your cluster is started successfully. You can verify that all the Hadoop daemons are running with the jps command, as in the example below.
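A rough example of what jps should show once everything is up in pseudo-distributed mode (the process IDs will differ on your machine):

hduser@ubuntu:~$ jps
4865 NameNode
5012 DataNode
5243 SecondaryNameNode
5424 ResourceManager
5713 NodeManager
5986 Jps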
Now start writing MapReduce jobs!

Thursday, 2 July 2015

Writing a custom NameFinder model in OpenNLP

Though various pre-trained NER models are available in OpenNLP, entity extraction doesn’t end with the existing ones. We may need to find entities from clinical, biological, sports, banking or other domains.
So should we restrict ourselves to the models already provided? No.
We can build our own name finder model. The steps required to do this are: get a sample training dataset, build the model, and test it.

What type of data should we have for training a model?
Sentences should be separated by a newline character (\n). Entity values should be separated from the <START:type> and <END> tags by a space character.

<START:medicine> Augmentin-Duo <END> is a penicillin antibiotic that contains two medicines - <START:medicine> amoxicillin trihydrate <END> and <START:medicine> potassium clavulanate <END>. They work together to kill certain types of bacteria and are used to treat certain types of bacterial infections.

You can refer to a sample dataset (such as the sentence above) for the format. The training data should contain at least 15,000 sentences to get good results.
The model can be trained via the command line tool as well as via the Java training API:

Command line tool:
There are various arguments which you need to pass while building the model, as follows:
$ opennlp TokenNameFinderTrainer
Usage: opennlp TokenNameFinderTrainer[.bionlp2004|.conll03|.conll02|.ad] [-resources resourcesDir] \
               [-type modelType] [-featuregen featuregenFile] [-params paramsFile] \
               [-iterations num] [-cutoff num] -model modelFile -lang language \
               -data sampleData [-encoding charsetName]

Arguments description:
        -resources resourcesDir
                The resources directory
        -type modelType
                The type of the token name finder model
        -featuregen featuregenFile
                The feature generator descriptor file
        -params paramsFile
                training parameters file.
        -iterations num
                number of training iterations, ignored if -params is used. Default value is 100.
        -cutoff num
                minimal number of times a feature must be seen, ignored if -params is used. Default value is 5.
        -model modelFile
                output model file.
        -lang language
                language which is being processed.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system default is used.

Now let’s say we want to build a model “en-ner-drugs.bin” from the data file “drugsDetails.txt” for the English language.

$opennlp TokenNameFinderTrainer -model en-ner-drugs.bin -lang en -data drugsDetails.txt -encoding UTF-8

Now we’ll see how we can train the same model using the Java API.

Steps :
  • Open a sample data stream
  • Call the NameFinderME.train method
  • Save the TokenNameFinderModel to a file

Here is the example.
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class DrugsClassifierTrainer {
    // output model file
    static String onlpModelPath = "en-ner-drugs.bin";
    // training data set
    static String trainingDataFilePath = "D:/NLPTools/Datasets/drugsDetails.txt";

    public static void main(String[] args) throws IOException {
        Charset charset = Charset.forName("UTF-8");
        // read the annotated training sentences line by line
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream(trainingDataFilePath), charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

        TokenNameFinderModel model = null;
        try {
            // train a "drugs" name finder model: 100 iterations, cutoff 4
            model = NameFinderME.train("en", "drugs", sampleStream,
                    Collections.<String, Object> emptyMap(), 100, 4);
        } finally {
            sampleStream.close();
        }

        // serialize the trained model to disk
        BufferedOutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(onlpModelPath));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null)
                modelOut.close();
        }
    }
}

The above code will generate the “en-ner-drugs.bin” model.

Now you are all set to use this model to find entities, just like with the other NER models. A minimal usage sketch is shown below.
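Here is a minimal sketch (not from the original post) of loading the newly trained model and applying it with NameFinderME; the file path and example sentence are only assumptions for illustration.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;

public class DrugsFinderDemo {
    public static void main(String[] args) throws Exception {
        // load the model trained above (path is an assumption)
        InputStream modelIn = new FileInputStream("en-ner-drugs.bin");
        TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
        modelIn.close();

        NameFinderME nameFinder = new NameFinderME(model);

        // the name finder expects an already tokenized sentence
        String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(
                "Augmentin-Duo contains amoxicillin trihydrate and potassium clavulanate .");

        // find entity spans and print each one with its type
        Span[] spans = nameFinder.find(tokens);
        for (Span span : spans) {
            StringBuilder entity = new StringBuilder();
            for (int i = span.getStart(); i < span.getEnd(); i++)
                entity.append(tokens[i]).append(" ");
            System.out.println(span.getType() + " : " + entity.toString().trim());
        }

        // clear adaptive data between documents
        nameFinder.clearAdaptiveData();
    }
}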

For more details, you can go through the OpenNLP documentation.