Sunday, 30 August 2015

Dependency Parsing in Stanford CoreNLP

If you are working on Natural Language Processing, this post will help you extract triplets from documents.
We assume you have basic knowledge of concepts such as part-of-speech tagging and tokens. Let’s discuss dependency parsing first.

Stanford Dependency Parsing:
Stanford dependencies provide a representation of grammatical relations between words in a sentence. Each dependency is a triplet: the name of the relation, the governor and the dependent.
Here is an example sentence:
Bell, based in Los Angeles, makes and distributes electronic, computer and building products.

We can see that the subject of the verb ‘distributes’ is Bell. For the above sentence, the Stanford dependencies (SD) representation is:

     nsubj(makes-8, Bell-1)
     nsubj(distributes-10, Bell-1)
     vmod(Bell-1, based-3)
     nn(Angeles-6, Los-5)
     prep_in(based-3, Angeles-6)
     root(ROOT-0, makes-8)
     conj_and(makes-8, distributes-10)
     amod(products-16, electronic-11)
     conj_and(electronic-11, computer-13)
     amod(products-16, computer-13)
     conj_and(electronic-11, building-15)
     amod(products-16, building-15)
     dobj(makes-8, products-16)

In the above representation, the first term is the dependency tag, which names the relation between the governor (second term) and the dependent (third term). The number after each word is its position in the sentence.
The available dependency tags are listed in the Stanford Dependencies manual.
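
To make the triplet structure concrete, here is a minimal plain-Java sketch that splits one printed dependency line into its relation, governor and dependent. It does not use CoreNLP; the class name TypedDependencyLine and its fields are hypothetical names chosen for this illustration, and the line format is assumed to be exactly the one shown in the listing above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of CoreNLP): holds one parsed dependency line.
class TypedDependencyLine {
    // Matches lines such as "nsubj(makes-8, Bell-1)".
    private static final Pattern LINE =
            Pattern.compile("^\\s*(\\w+)\\((.+)-(\\d+),\\s*(.+)-(\\d+)\\)\\s*$");

    final String relation;
    final String governor;
    final int governorIndex;
    final String dependent;
    final int dependentIndex;

    private TypedDependencyLine(String relation, String governor, int governorIndex,
                                String dependent, int dependentIndex) {
        this.relation = relation;
        this.governor = governor;
        this.governorIndex = governorIndex;
        this.dependent = dependent;
        this.dependentIndex = dependentIndex;
    }

    // Parses "relation(governor-i, dependent-j)" or throws if the line has another shape.
    static TypedDependencyLine parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a typed dependency: " + line);
        }
        return new TypedDependencyLine(
                m.group(1),
                m.group(2), Integer.parseInt(m.group(3)),
                m.group(4), Integer.parseInt(m.group(5)));
    }

    public static void main(String[] args) {
        TypedDependencyLine d = parse("nsubj(distributes-10, Bell-1)");
        System.out.println(d.relation + ": governor=" + d.governor
                + ", dependent=" + d.dependent);
    }
}
```

When you work with CoreNLP directly you would normally read these three parts from the TypedDependency objects instead of parsing printed text.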

There are two types of dependency representations:
  •  Basic/Non-collapsed: This representation gives the basic dependencies as well as the extra ones (which break the tree structure), without any collapsing or propagation of conjuncts. E.g.
                prep(based-7, in-8)
                pobj(in-8, LA-9) 
  •  Collapsed: In the collapsed representation, dependencies involving prepositions, conjuncts, as well as information about the referent of relative clauses are collapsed to get direct dependencies between content words. For instance, for the phrase “based in LA”, the dependencies involving the preposition “in” are collapsed into one single relation:
               prep(based-7, in-8)
               pobj(in-8, LA-9)
         will become:  prep_in(based-7, LA-9)
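
The collapsing of a prep/pobj pair can be sketched in a few lines of plain Java. This is only an illustration of the idea under simplifying assumptions, not the algorithm CoreNLP implements: it handles prepositions only (no conjuncts or relative clauses), and CollapseSketch, Dep and collapsePrepositions are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CollapseSketch {
    // One dependency: relation(governor, dependent), tokens written as "word-index".
    static final class Dep {
        final String rel, gov, dep;
        Dep(String rel, String gov, String dep) {
            this.rel = rel;
            this.gov = gov;
            this.dep = dep;
        }
        @Override public String toString() {
            return rel + "(" + gov + ", " + dep + ")";
        }
    }

    // Word without its "-index" suffix, lower-cased: "in-8" becomes "in".
    static String word(String token) {
        int i = token.lastIndexOf('-');
        return (i < 0 ? token : token.substring(0, i)).toLowerCase(Locale.ROOT);
    }

    // Merges each prep(x, p) + pobj(p, y) pair into prep_<p>(x, y); other edges are kept.
    static List<Dep> collapsePrepositions(List<Dep> basic) {
        List<Dep> out = new ArrayList<Dep>();
        for (Dep d : basic) {
            if (d.rel.equals("prep")) {
                boolean merged = false;
                for (Dep p : basic) {
                    if (p.rel.equals("pobj") && p.gov.equals(d.dep)) {
                        out.add(new Dep("prep_" + word(d.dep), d.gov, p.dep));
                        merged = true;
                    }
                }
                if (!merged) {
                    out.add(d); // preposition without an object stays as it is
                }
            } else if (d.rel.equals("pobj")) {
                boolean hasPrep = false;
                for (Dep p : basic) {
                    if (p.rel.equals("prep") && p.dep.equals(d.gov)) {
                        hasPrep = true;
                    }
                }
                if (!hasPrep) {
                    out.add(d); // object whose preposition edge is missing stays as it is
                }
            } else {
                out.add(d);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Dep> basic = Arrays.asList(
                new Dep("prep", "based-7", "in-8"),
                new Dep("pobj", "in-8", "LA-9"));
        System.out.println(collapsePrepositions(basic));
    }
}
```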

Now let’s see how to get these dependencies using Java code.

import java.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class ParserDemo {
    public static void main(String[] args) {
        // Assumption: load the default English PCFG model from the CoreNLP models jar.
        LexicalizedParser lp = LexicalizedParser.loadModel();
        lp.setOptionFlags(new String[] { "-maxLength", "80",
                "-retainTmpSubcategories" });

        // Parse a pre-tokenized sentence into a phrase-structure tree.
        String[] sent = { "This", "is", "an", "easy", "sentence", "." };
        List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
        Tree parse = lp.apply(rawWords);

        // Convert the tree into typed dependencies (collapsed, conjuncts propagated).
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
        System.out.println(tdl);

        // Print the Penn tree followed by the collapsed typed dependencies.
        TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
        tp.printTree(parse);
    }
}

Now you can easily extract triplets from a document. You can find the example code in the GitHub repo.
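
As a rough illustration of one way to turn dependencies into triplets, the plain-Java sketch below pairs every nsubj edge with the dobj edges of the same verb to form (subject, verb, object) triplets. Treating a triplet as subject-verb-object is an assumption made for this example, the code does not call CoreNLP, and SvoSketch, Dep and extract are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SvoSketch {
    // One dependency: relation(governor, dependent), tokens written as "word-index".
    static final class Dep {
        final String rel, gov, dep;
        Dep(String rel, String gov, String dep) {
            this.rel = rel;
            this.gov = gov;
            this.dep = dep;
        }
    }

    // Returns {subject, verb, object} for each nsubj/dobj pair sharing a governor.
    static List<String[]> extract(List<Dep> deps) {
        // Index direct objects by the verb that governs them.
        Map<String, List<String>> objectsByVerb = new LinkedHashMap<String, List<String>>();
        for (Dep d : deps) {
            if (d.rel.equals("dobj")) {
                List<String> objects = objectsByVerb.get(d.gov);
                if (objects == null) {
                    objects = new ArrayList<String>();
                    objectsByVerb.put(d.gov, objects);
                }
                objects.add(d.dep);
            }
        }
        // Join every subject with the objects of the same verb.
        List<String[]> triplets = new ArrayList<String[]>();
        for (Dep d : deps) {
            if (!d.rel.equals("nsubj")) {
                continue;
            }
            List<String> objects = objectsByVerb.get(d.gov);
            if (objects == null) {
                continue; // verb without a direct object yields no triplet
            }
            for (String object : objects) {
                triplets.add(new String[] { d.dep, d.gov, object });
            }
        }
        return triplets;
    }

    public static void main(String[] args) {
        List<Dep> deps = Arrays.asList(
                new Dep("nsubj", "makes-8", "Bell-1"),
                new Dep("nsubj", "distributes-10", "Bell-1"),
                new Dep("dobj", "makes-8", "products-16"));
        for (String[] t : extract(deps)) {
            System.out.println(Arrays.toString(t));
        }
    }
}
```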

Monday, 3 August 2015

Setting up Hadoop Cluster in Pseudo-distributed mode on Ubuntu

Here we’ll discuss setting up a Hadoop cluster in pseudo-distributed mode in a Linux environment. We are using Hadoop 2.x for this. The prerequisites are:
     -     Java 7
     -     Adding a dedicated user
     -     Configuring ssh

Step 1: Install Java

324532@ubuntu:~$ sudo apt-get install openjdk-7-jdk
324532@ubuntu:~$ java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (IcedTea 2.5.5) (7u79-2.5.5-0ubuntu0.14.04.2)
OpenJDK Client VM (build 24.79-b02, mixed mode, sharing)

Step 2: Add a dedicated Hadoop user

Though it is not mandatory, we create a dedicated user to separate the Hadoop installation from other packages.
324532@ubuntu:~$ sudo addgroup hadoop
324532@ubuntu:~$ sudo adduser --ingroup hadoop hduser
This adds the user hduser to the hadoop group.

Step 3: Install ssh

324532@ubuntu:~$ sudo apt-get install ssh
324532@ubuntu:~$ sudo apt-get install openssh-server
Once it is installed, make sure the ssh service is running.

Step 4: Configure ssh

Hadoop uses ssh to manage its nodes, so ssh needs to be running and configured for key-based authentication.

First generate an SSH key for hduser.

324532@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:

Once the key is generated, copy the public key to authorized keys.

hduser@ubuntu:~$ cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys

Once the key is copied, you can ssh to localhost and continue the Hadoop setup.

hduser@ubuntu:~$ ssh localhost

Step 5: Set up the Hadoop cluster

Download a release from the Apache download mirrors and extract it into a folder such as ‘/usr/local/hadoop/’. Set JAVA_HOME and the other Hadoop-related environment variables in the .bash_profile file of hduser.
# set to the root of your Java installation
 export JAVA_HOME=/usr/java/latest
 export HADOOP_INSTALL=/usr/local/hadoop
Hadoop can run in 3 modes:
     1.    Single node (standalone) mode
     2.    Pseudo-distributed mode
     3.    Fully distributed mode
Single node (standalone) mode: Hadoop runs in a non-distributed manner as a single Java process. The local filesystem is used for data storage.

Pseudo-distributed mode: Hadoop runs on a single node, with each daemon running as a separate Java process.

Fully distributed mode: Hadoop runs on multiple nodes in a master-slave architecture, with each daemon running as a separate Java process.

Configuration :

The following is the minimal configuration you need to add to the configuration files to start a cluster.

<!-- It is namenode filesystem path -->


For yarn daemons: etc/hadoop/mapred-site.xml
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value> <!-- other values are local, classic -->
        </property>
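
To show how such site files are laid out, here is a small Java sketch that renders name/value pairs into the `<configuration>` format Hadoop reads. The property names and values (fs.defaultFS with hdfs://localhost:9000, dfs.replication with 1, yarn.nodemanager.aux-services with mapreduce_shuffle) follow the Apache Hadoop 2.x single-node setup guide and are assumptions here, not necessarily the values used for this cluster. HadoopSiteXmlSketch and render are hypothetical names.

```java
import java.util.Collections;
import java.util.Map;

class HadoopSiteXmlSketch {
    // Renders properties as a Hadoop <configuration> document.
    // Assumes names and values contain no XML special characters.
    static String render(Map<String, String> properties) {
        StringBuilder sb = new StringBuilder("<configuration>\n");
        for (Map.Entry<String, String> e : properties.entrySet()) {
            sb.append("  <property>\n")
              .append("    <name>").append(e.getKey()).append("</name>\n")
              .append("    <value>").append(e.getValue()).append("</value>\n")
              .append("  </property>\n");
        }
        return sb.append("</configuration>\n").toString();
    }

    public static void main(String[] args) {
        // Assumed values, taken from the Hadoop 2.x single-node guide.
        System.out.println("etc/hadoop/core-site.xml\n"
                + render(Collections.singletonMap("fs.defaultFS", "hdfs://localhost:9000")));
        System.out.println("etc/hadoop/hdfs-site.xml\n"
                + render(Collections.singletonMap("dfs.replication", "1")));
        System.out.println("etc/hadoop/mapred-site.xml\n"
                + render(Collections.singletonMap("mapreduce.framework.name", "yarn")));
        System.out.println("etc/hadoop/yarn-site.xml\n"
                + render(Collections.singletonMap("yarn.nodemanager.aux-services", "mapreduce_shuffle")));
    }
}
```

In practice you would edit the four files by hand; the sketch only makes the expected structure of each file explicit.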


Once the above configuration is done, the next step is to format the NameNode.

To format the filesystem:
  hduser@ubuntu:~$ bin/hdfs namenode -format

To start the NameNode and DataNode daemons:
  hduser@ubuntu:~$ sbin/

Browse the NameNode web interface. By default it is available at http://localhost:50070

To start the YARN daemons (ResourceManager and NodeManager), run the following:
hduser@ubuntu:~$ sbin/

You can browse the ResourceManager web interface at http://localhost:8088

If you want to start all daemons together, you can run the following:
hduser@ubuntu:~$ sbin/

Your cluster is now started. You can see all running Hadoop daemons using the jps command.
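
If you want to check the jps listing programmatically, the Java sketch below takes captured jps output and reports which daemons are missing. The expected daemon names (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager) are an assumption about a Hadoop 2.x pseudo-distributed setup with both HDFS and YARN started, the process IDs in the sample are made up, and JpsCheckSketch and missingDaemons are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class JpsCheckSketch {
    // Assumed daemon set for pseudo-distributed mode with HDFS and YARN running.
    static final List<String> EXPECTED = Arrays.asList(
            "NameNode", "DataNode", "SecondaryNameNode",
            "ResourceManager", "NodeManager");

    // Each jps line has the form "<pid> <main class name>".
    static List<String> missingDaemons(String jpsOutput) {
        Set<String> running = new HashSet<String>();
        for (String line : jpsOutput.split("\\r?\\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 2) {
                running.add(parts[1]);
            }
        }
        List<String> missing = new ArrayList<String>();
        for (String daemon : EXPECTED) {
            if (!running.contains(daemon)) {
                missing.add(daemon);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample output with made-up process IDs.
        String sample = "4001 NameNode\n4102 DataNode\n4250 SecondaryNameNode\n"
                + "4400 ResourceManager\n4899 Jps\n";
        System.out.println("Missing daemons: " + missingDaemons(sample));
    }
}
```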
Now start writing MapReduce jobs!