Wednesday, 2 September 2015

Co-reference Resolution in Stanford CoreNLP


In the previous post, we discussed dependency parsing. Now we will look at how to identify the expressions that refer to the same person, thing, or other object. This problem is addressed by co-reference resolution.

Co-reference resolution (or anaphora resolution) is the task of finding all the expressions that refer to the same entity, whether within one sentence or across several.

Example: James said that he would go out for dinner.

Here, ‘James’ and ‘he’ both refer to the same person.
Co-reference resolution is an important step in natural language processing tasks such as information retrieval and question answering.
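
To make the idea concrete, here is a minimal plain-Java sketch of the kind of output a resolver produces for the example sentence above: mentions grouped into chains, one chain per entity. The class name ToyCorefChains, its methods, and the chain ids are assumptions made up for illustration. The chains are written by hand, nothing is resolved automatically, and this is not the CoreNLP API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model, NOT Stanford CoreNLP's API: the chains are written by hand
// to show the shape of a resolver's output; no resolution is performed here.
public class ToyCorefChains {

    // Hand-labelled chains for the example sentence: chain id -> mentions of one entity.
    static Map<Integer, List<String>> handLabelledChains() {
        Map<Integer, List<String>> chains = new LinkedHashMap<>();
        chains.computeIfAbsent(1, id -> new ArrayList<>()).add("James");
        chains.computeIfAbsent(1, id -> new ArrayList<>()).add("he");
        chains.computeIfAbsent(2, id -> new ArrayList<>()).add("dinner");
        return chains;
    }

    // Two mentions corefer when they sit in the same chain.
    static boolean corefer(Map<Integer, List<String>> chains, String a, String b) {
        for (List<String> mentions : chains.values()) {
            if (mentions.contains(a) && mentions.contains(b)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> chains = handLabelledChains();
        System.out.println(corefer(chains, "James", "he"));     // prints "true"
        System.out.println(corefer(chains, "James", "dinner")); // prints "false"
    }
}
```

A real resolver has to discover these groups itself; the sketch only shows what "same entity" means once the grouping exists.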

Now we’ll see how to implement it in Java using the Stanford CoreNLP package.


    // These imports match the dcoref packages of CoreNLP 3.5.x;
    // newer releases may expose the coreference classes under edu.stanford.nlp.coref instead.
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import edu.stanford.nlp.dcoref.CorefChain;
    import edu.stanford.nlp.dcoref.CorefChain.CorefMention;
    import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    public class CoRefExample {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            // read some text in the text variable
            String text = "The Revolutionary War occurred in the 1700s. It was the first war in the US states";

            // create an empty Annotation just with the given text
            Annotation document = new Annotation(text);

            // run all Annotators on this text
            pipeline.annotate(document);

            // This is the coreference link graph
            // Each chain stores a set of mentions that link to each other,
            // along with a method for getting the most representative mention
            // Both sentence and token offsets start at 1!
            Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);

            for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
                CorefChain c = entry.getValue();

                // skip single-mention chains: they only print self references, which aren't that useful
                if (c.getMentionsInTextualOrder().size() <= 1)
                    continue;

                // rebuild the text of the representative mention from its tokens
                CorefMention cm = c.getRepresentativeMention();
                String clust = "";
                List<CoreLabel> tks = document.get(SentencesAnnotation.class)
                        .get(cm.sentNum - 1).get(TokensAnnotation.class);
                for (int i = cm.startIndex - 1; i < cm.endIndex - 1; i++)
                    clust += tks.get(i).get(TextAnnotation.class) + " ";
                clust = clust.trim();
                System.out.println("representative mention: \"" + clust
                        + "\" is mentioned by:");

                // walk every mention in the chain, not just the representative one
                for (CorefMention m : c.getMentionsInTextualOrder()) {
                    String clust2 = "";
                    tks = document.get(SentencesAnnotation.class).get(m.sentNum - 1)
                            .get(TokensAnnotation.class);
                    for (int i = m.startIndex - 1; i < m.endIndex - 1; i++)
                        clust2 += tks.get(i).get(TextAnnotation.class) + " ";
                    clust2 = clust2.trim();
                    // don't need the self mention
                    if (clust.equals(clust2))
                        continue;
                    System.out.println("\t" + clust2);
                }
            }
        }
    }

Once you run the above code, “Revolutionary War” and “It” should be reported as mentions of the same entity.
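
The least obvious part of the listing is the index arithmetic: sentence and token offsets start at 1, and the loops treat the end index as exclusive. Here is a minimal plain-Java sketch of just that conversion, with no CoreNLP dependency. The class MentionSpanSketch, the method mentionText, and the hand-tokenized sentence are assumptions for illustration; a real tokenizer may split the text differently.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the offset arithmetic; class and method names are hypothetical.
public class MentionSpanSketch {

    // startIndex is 1-based and inclusive, endIndex is 1-based and exclusive,
    // which is how the loops in the listing above read the mention offsets.
    static String mentionText(List<String> tokens, int startIndex, int endIndex) {
        // Shift both bounds down by one to get 0-based list positions.
        return String.join(" ", tokens.subList(startIndex - 1, endIndex - 1));
    }

    public static void main(String[] args) {
        // Hand-tokenized first sentence (an assumption, not tokenizer output).
        List<String> tokens = Arrays.asList(
                "The", "Revolutionary", "War", "occurred", "in", "the", "1700s", ".");
        System.out.println(mentionText(tokens, 1, 4)); // prints "The Revolutionary War"
    }
}
```

Joining a sub-list also avoids the trailing space that the string concatenation in the listing has to trim afterwards.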
Now it’s your turn to try it out. You can find the full code on GitHub.