Thursday, 2 July 2015

Writing a custom NameFinder model in OpenNLP

OpenNLP ships with several pre-trained NER models, but entity extraction doesn't have to stop at the existing ones. We may need to find entities specific to a domain such as clinical, biological, sports, or banking text.
So should we restrict ourselves to the models already provided? No.
We can build our own Name Finder model. The steps are: get a sample training dataset, build the model, and test it.

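What kind of data do we need to train a model?
Each sentence goes on its own line, so sentences are separated by the newline character (\n). Each entity is wrapped in <START:type> and <END> tags, and the tags must be separated from the entity text and from the surrounding tokens by a space character.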

<START:medicine> Augmentin-Duo <END> is a penicillin antibiotic that contains two medicines - <START:medicine> amoxicillin trihydrate <END> and <START:medicine> potassium clavulanate <END> . They work together to kill certain types of bacteria and are used to treat certain types of bacterial infections.
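
To make the format concrete, here is a minimal sketch in plain Java (no OpenNLP dependency) that pulls the annotated entities out of one training line. The class name AnnotatedSentenceReader and the regular expression are assumptions made for illustration only; OpenNLP's own reader for this format is NameSampleDataStream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnnotatedSentenceReader {

    // Matches "<START:type> entity text <END>", requiring whitespace around the entity text.
    private static final Pattern SPAN =
            Pattern.compile("<START:([^>\\s]+)>\\s+(.+?)\\s+<END>");

    /** Returns each annotated entity as "type=text", in order of appearance. */
    static List<String> entities(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher m = SPAN.matcher(sentence);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "<START:medicine> Augmentin-Duo <END> is a penicillin antibiotic that "
                + "contains two medicines - <START:medicine> amoxicillin trihydrate <END> and "
                + "<START:medicine> potassium clavulanate <END> .";
        for (String entity : entities(line)) {
            System.out.println(entity);
        }
    }
}
```

Each entity keeps its type label, which is the same label a trained model reports for the spans it finds.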

You can refer to a sample dataset as an example. The training data should contain at least 15,000 sentences to get good results.
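
Since badly formatted data is a common cause of training failures, a quick check of the file before training can save time. The sketch below is plain Java with no OpenNLP dependency; the class TrainingDataCheck and its messages are hypothetical. It assumes, as described above, that every tag must be a standalone token separated by whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TrainingDataCheck {

    /** Returns one message per problem found; an empty list means nothing was detected. */
    static List<String> problems(List<String> lines) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String where = "line " + (i + 1) + ": ";
            boolean open = false;   // true while inside a <START...> span
            int tokensInSpan = 0;   // entity tokens seen since the last <START...>
            for (String tok : lines.get(i).trim().split("\\s+")) {
                if (tok.matches("<START(:[^>\\s]+)?>")) {
                    if (open) found.add(where + "<START> inside an open span");
                    open = true;
                    tokensInSpan = 0;
                } else if (tok.equals("<END>")) {
                    if (!open) found.add(where + "<END> without <START>");
                    else if (tokensInSpan == 0) found.add(where + "empty span");
                    open = false;
                } else if (tok.contains("<START") || tok.contains("<END")) {
                    // A tag glued to other characters, e.g. "<END>." or "<START:x>word".
                    found.add(where + "tag not separated by spaces: " + tok);
                } else if (open) {
                    tokensInSpan++;
                }
            }
            if (open) found.add(where + "span never closed");
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "<START:medicine> Augmentin-Duo <END> is a penicillin antibiotic .",
                "It contains <START:medicine> amoxicillin trihydrate <END>.");
        for (String problem : problems(lines)) {
            System.out.println(problem);
        }
    }
}
```

In this example the second line is reported twice: once for the glued "<END>." token and once because the span is then never closed.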
The model can be trained with the command line tool or with the Java training API:

Command line tool:
The trainer takes several arguments when building the model, as follows:
$ opennlp TokenNameFinderTrainer
Usage: opennlp TokenNameFinderTrainer[.bionlp2004|.conll03|.conll02|.ad] [-resources resourcesDir] \
               [-type modelType] [-featuregen featuregenFile] [-params paramsFile] \
               [-iterations num] [-cutoff num] -model modelFile -lang language \
               -data sampleData [-encoding charsetName]

Arguments description:
        -resources resourcesDir
                The resources directory
        -type modelType
                The type of the token name finder model
        -featuregen featuregenFile
                The feature generator descriptor file
        -params paramsFile
                training parameters file.
        -iterations num
                number of training iterations, ignored if -params is used. Default value is 100.
        -cutoff num
                minimal number of times a feature must be seen, ignored if -params is used. Default value is 5.
        -model modelFile
                output model file.
        -lang language
                language which is being processed.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system default is used.
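
The -cutoff description is terse, so here is a rough illustration of the idea in plain Java: count how often each feature occurs and keep only those seen at least cutoff times. The class CutoffSketch and the feature strings are made up for illustration. This is a conceptual sketch and not OpenNLP's actual indexing code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class CutoffSketch {

    /** Returns the features that occur at least {@code cutoff} times, in sorted order. */
    static Set<String> keep(List<String> observed, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String feature : observed) {
            counts.merge(feature, 1, Integer::sum);
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= cutoff) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical feature strings; real feature names are generated by the trainer.
        List<String> observed = Arrays.asList(
                "w=amoxicillin", "w=amoxicillin", "w=amoxicillin",
                "w=trihydrate", "w=trihydrate", "w=even");
        System.out.println(keep(observed, 2)); // prints [w=amoxicillin, w=trihydrate]
        System.out.println(keep(observed, 5)); // prints []
    }
}
```

Under this reading, a very small training file may leave few or no features above the cutoff, so a lower -cutoff value may be worth trying for small experiments.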

Now let's say we want to build a model “en-ner-drugs.bin” from the data file “drugsDetails.txt” for the English language.

$ opennlp TokenNameFinderTrainer -model en-ner-drugs.bin -lang en -data drugsDetails.txt -encoding UTF-8

Now let's see how to train the same model using the Java API.

Steps :
  • Open a sample data stream
  • Call the NameFinderME.train method
  • Save the TokenNameFinderModel to a file

Here is an example:
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class DrugsClassifierTrainer {
    // output model file
    static String onlpModelPath = "en-ner-drugs.bin";
    // training data set
    static String trainingDataFilePath = "D:/NLPTools/Datasets/drugsDetails.txt";

    public static void main(String[] args) throws IOException {
        Charset charset = Charset.forName("UTF-8");

        // Step 1: open a sample data stream over the annotated training file
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream(trainingDataFilePath), charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

        // Step 2: train with 100 iterations and a cutoff of 4
        // (this train signature is the one from opennlp-tools 1.5.3)
        TokenNameFinderModel model = null;
        try {
            model = NameFinderME.train("en", "drugs", sampleStream,
                    Collections.<String, Object>emptyMap(), 100, 4);
        } finally {
            sampleStream.close();
        }

        // Step 3: save the TokenNameFinderModel to a file
        BufferedOutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(onlpModelPath));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null) {
                modelOut.close();
            }
        }
    }
}

The above code will generate the “en-ner-drugs.bin” model.

Now you are all set to use this model to find entities, just like the other NER models!
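
When the model is used, NameFinderME.find takes an array of tokens and returns spans whose start index is inclusive and whose end index is exclusive. The plain-Java sketch below shows how such a span maps back to text. The class SpanTextSketch and the hard-coded span are hypothetical stand-ins, and the sketch assumes the input is whitespace-tokenized in the same way as the training data.

```java
final class SpanTextSketch {

    /** Joins tokens[start] .. tokens[end - 1] with single spaces (end is exclusive). */
    static String textOf(String[] tokens, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++) {
            if (i > start) {
                sb.append(' ');
            }
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Whitespace tokenization, matching how the training data is tokenized.
        String[] tokens = "Even amoxicillin trihydrate works fine .".split("\\s+");
        // Hypothetical span [1, 3), standing in for what a name finder might return.
        System.out.println(textOf(tokens, 1, 3)); // prints amoxicillin trihydrate
    }
}
```

Note that with whitespace tokenization a trailing period stays attached to a word unless the text separates it, so "Augmentin-Duo." and "Augmentin-Duo" are different tokens.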

For more details, see the OpenNLP documentation.

23 comments:

  1. Hi, I keep getting this error whenever I train in CLI or in the code...
    Exception in thread "main" java.lang.IllegalArgumentException: Model not compatible with name finder!
    at opennlp.tools.namefind.TokenNameFinderModel.<init>(TokenNameFinderModel.java:103)
    at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:444)
    at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:473)
    at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:491)

  2. I figured it out! The data was in the wrong format.
    Thank you for this information, I've been searching up and down for this. Thanks a lot.
    Happy coding :)

    Replies
    1. Great! Yes, this error occurs when the data is in the wrong format, so to avoid this kind of error, make sure the data is annotated properly.

  3. Now I'm getting an exception on this line;

    model = NameFinderME.train("en","drugs",sampleStream, Collections. emptyMap());

    Replies
    1. You need to define the map type in the argument, like this:
      model = NameFinderME.train("en", "drugs", sampleStream, Collections.<String, Object>emptyMap());

      It will fix the exception.

  4. Here's the full error information;

    Indexing events using cutoff of 5

    Computing event counts... done. 0 events
    Indexing... done.
    Sorting and merging events... Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(Unknown Source)
    at java.util.ArrayList.get(Unknown Source)
    at opennlp.tools.ml.model.AbstractDataIndexer.sortAndMerge(AbstractDataIndexer.java:89)
    at opennlp.tools.ml.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:105)
    at opennlp.tools.ml.AbstractEventTrainer.getDataIndexer(AbstractEventTrainer.java:74)
    at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:91)
    at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:419)
    at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:473)
    at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:491)

    The other first error was this line;
    model = NameFinderME.train("en","drugs",sampleStream, Collections. emptyMap(),100,4);

    Eclipse complained that;
    The method train(String, String, ObjectStream, Map) in the type NameFinderME is not applicable for the arguments (String, String, ObjectStream, Map, int, int)

    Could you post the libraries you are using? I might have the wrong import statements.
    I wanted to ask you; is en-ner-drugs.bin a file that exists somewhere on the system (where?) or is it created inside the main method?
    Likewise is the drugDetails.txt file empty at the beginning or is it full of formatted data?

    These are not really errors (warnings) but I'm not sure how they are impacting the method
    The first one is on the lineStream object;
    The constructor PlainTextByLineStream(InputStream, Charset) is deprecated

    The second one is on the model object, it says;
    The method train(String, String, ObjectStream, Map) from the type NameFinderME is deprecated

    Is it ok to use the deprecated constructor and method?

    Replies
    1. I am using the opennlp-tools 1.5.3 package.
      In that version, PlainTextByLineStream(InputStream, Charset) and NameFinderME.train(String, String, ObjectStream, Map) are not deprecated.

      To resolve the error "The method train(String, String, ObjectStream, Map) in the type NameFinderME is not applicable for the arguments (String, String, ObjectStream, Map, int, int)", define the map type in the train method:

      model = NameFinderME.train("en", "medicine", sampleStream, Collections.<String, Object>emptyMap(), 100, 4);

      en-ner-drugs.bin is the output model file, which is created in the main method.

      The drugDetails.txt file contains the properly annotated dataset on which the model is trained.

      You can use @SuppressWarnings if you have to use a deprecated method. That won't affect the code.

      Can you share your dataset so I can check how you have annotated it?

  5. Hi, I am trying to figure out how to use different features for training, for instance stems, POS tags and suffixes. In addition to that, is there any option to use gazetteers for finding named entities? I should also note that my target language is Turkish.
    Thank you.

  7. Hi,
    how can i use your code for Arabic model, can you help me on that?

  8. If we have our data with multiple tags like ,, <START: mothername how can we get a bin file for a person ner?

  9. Hi.. I have followed the same set of instructions. I have used the same training data. That is:
    drug.txt:

    Augmentin-Duo is a penicillin antibiotic that contains two medicines - amoxicillin trihydrate and potassium clavulanate . They work together to kill certain types of bacteria and are used to treat certain types of bacterial infections.

    For the input : "Augmentin-Duo is the medicine I have recommended for my friends. Even amoxicillin trihydrate works fine."

    I am getting the output as: Augmentin-Duo is the medicine I have recommended for my friends. Even amoxicillin trihydrate works fine.

    But for the input: "The medicine I have recommended for my fiends is Augmentin-Duo. Even amoxicillin trihydrate works fine"

    The output I got is : The medicine I have recommended for my fiends is Augmentin-Duo.
    Even amoxicillin trihydrate works fine.

    Why the outputs are different? I know that for training the data, we need at least 15000 sentences. But, for this example, at least the outputs should be the same.

    Replies
    1. Hi.. there is some problem with the output I have typed:

      I am retyping the question here:

      Hi.. I have followed the same set of instructions. I have used the same training data.

      For the input : "Augmentin-Duo is the medicine I have recommended for my friends. Even amoxicillin trihydrate works fine."

      In the output: Augmentin-Duo is identified correctly.

      But for the input: "The medicine I have recommended for my fiends is Augmentin-Duo. Even amoxicillin trihydrate works fine"

      In the output: It identifies "even" as the medicine. But fails to recognize the actual medicine.

      Why the outputs are different? I know that for training the data, we need at least 15000 sentences. But, for this example, at least the outputs should be the same.

    2. Entities are identified based on annotations. Can you please share the text with annotations?

    3. Hi.. As tags are not shown properly, I am using <*... *>

      For input1:
      input.txt: Augmentin-Duo is the medicine I have recommended for my friends.
      Even amoxicillin trihydrate works fine
      Output: <*START:medicine*> Augmentin-Duo <*END*> is the medicine I have recommended for my friends.
      <*START:medicine*> Even <*END*> amoxicillin trihydrate works fine
      For input2:
      inputnext.txt: The medicine I have recommended for my fiends is Augmentin-Duo.
      Even amoxicillin trihydrate works fine.
      Output:
      The medicine I have recommended for my fiends is Augmentin-Duo.
      <*START:medicine*> Even <*END*> amoxicillin trihydrate works fine



    4. The medicine I have recommended for my fiends is Augmentin-Duo.
      <*START:medicine*> Even <*END*> amoxicillin trihydrate works fine

      In the above text, you have annotated 'Even' instead of 'amoxicillin trihydrate' medicine, that's why it returns 'Even'.
      Instead use <*START:medicine*> amoxicillin trihydrate <*END*>

  10. I have annotated the text correctly. But, whatever be the training data, it always returns the first word of every sentence as the output.

  11. Hello, I'm trying to create a model for parsing. It does not follow the same structure as the NER one. Could you help me with some information?

  13. Can I get a sample text file for training data other than the person model sample? I have tried to build a custom model for crime, using many synonyms of the word "crime" in my training data. In the end it didn't give me any output. It only matches if I give the exact sentences that are used in my training data. I also need to understand how this named entity finder works so that I can train on my data in a better way. Thanks in advance.

    Replies
    1. You can refer to this file: https://github.com/mccraigmccraig/opennlp/blob/master/src/test/resources/opennlp/tools/namefind/AnnotatedSentencesWithTypes.txt
      Entities are defined in terms of tokens, so it shouldn't be matching against whole sentences.
      Whatever name you want to label as a person, put it between <START:person> and <END> tags.

  14. I am getting the following error with opennlp 1.7.1
    The method train(String, String, ObjectStream, TrainingParameters, TokenNameFinderFactory) in the type NameFinderME is not applicable for the arguments (String, String, ObjectStream, Map, int, int)
