Saturday, 21 February 2015

Writing Java UDF in Apache Pig


Isn’t it good that Pig lets us write user-defined functions (UDFs) for custom processing? Here we’ll talk about writing a UDF in Java.

How to write the Java UDF:
First of all, add the Pig dependency to the Java project.
Now define a UDF class (e.g. HexConversion). Each UDF extends the EvalFunc<T> class, where ‘T’ is the return type, e.g. DataByteArray, DataBag, Tuple, String, etc.
Implement the exec(Tuple input) method in the UDF; Pig invokes it on every input tuple. The tuple holds the input parameters in the order they are passed to the function in the Pig script.

In the following example, we write a UDF that converts the first field of the input tuple to hexadecimal.

package com.test.udf;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;

/**
 * UDF to convert ASCII to hexadecimal.
 * It returns the string in hex format as a DataByteArray.
 */
public class HexConversion extends EvalFunc<DataByteArray> {

    @Override
    public DataByteArray exec(final Tuple input) throws IOException {
        // Nothing to convert: return null instead of dereferencing a missing value.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        final String str = input.get(0).toString();
        final StringBuilder builder = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            // Append the hex code of each character, e.g. 'A' -> "41".
            builder.append(Integer.toHexString(str.charAt(i)).toUpperCase());
        }
        return new DataByteArray(builder.toString());
    }
}
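
To see what the conversion produces without setting up Pig, here is a minimal standalone sketch of the same character-to-hex loop. The class name HexSketch and the method name toHex are made up for this illustration and are not part of the UDF or of Pig.

```java
// Standalone sketch of the conversion loop only; no Pig classes are needed.
// HexSketch and toHex are hypothetical names used just for this illustration.
final class HexSketch {

    // Converts every character of the input to its upper-case hex code.
    static String toHex(String str) {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            builder.append(Integer.toHexString(str.charAt(i)).toUpperCase());
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex("Pig")); // prints 506967
        System.out.println(toHex("AB"));  // prints 4142
    }
}
```

Note that Integer.toHexString does not pad its result, so a character below 0x10 (a tab, for example) yields a single hex digit. If you need a fixed two-digit width, something like String.format("%02X", (int) ch) would be one way to get it.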

Schema:
When a UDF returns a Tuple or a DataBag, the schema information needs to be passed explicitly through the outputSchema method. You need to import the following two classes and implement this method:

import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.data.DataType;

    @Override
    public Schema outputSchema(Schema input) {
        try {
            // The output tuple holds the first two input fields in swapped order.
            Schema tupleSchema = new Schema();
            tupleSchema.add(input.getField(1));
            tupleSchema.add(input.getField(0));
            return new Schema(new Schema.FieldSchema(
                    getSchemaName(this.getClass().getName().toLowerCase(), input),
                    tupleSchema, DataType.TUPLE));
        } catch (Exception e) {
            return null;
        }
    }

Build the above UDF as a jar file: hexConvertor.jar
Now let’s see how to call this UDF in a Pig script. Register the jar file and call the function by its fully qualified class name.

  REGISTER hexConvertor.jar;
  A = LOAD 'sample_data' AS (field1: bytearray, age: int);
  B = FOREACH A GENERATE com.test.udf.HexConversion(field1);
  DUMP B;

Now you can also write your own UDF. Cheers..!!!

Friday, 20 February 2015

Penn Treebank POS Tags in Natural Language Processing

Part-of-speech (POS) tags are among the most commonly used things in natural language processing. Let's say we parse a sentence and get the following parse tree:

(TOP
  (SBARQ
    (WHADVP (WRB How))
    (SQ (VBZ is)
      (NP
        (NP (DT the) (NN author))
        (PP (IN of)
          (NP
            (NP (DT The) (NNP Call))
            (PP (IN of)
              (NP (DT the) (NNP Wild?)))))))))
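
If you want to pull the word-level tags out of such a bracketed parse yourself, a small regular expression is enough. The sketch below is only an illustration: the class name ParseLeaves and the method name leaves are made up here, and it assumes that every leaf has the form "(TAG word)" with no parentheses inside the word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// ParseLeaves and leaves are hypothetical names used only for this sketch.
final class ParseLeaves {

    // Matches an innermost bracket such as "(NN author)": a tag, whitespace, a word.
    private static final Pattern LEAF =
            Pattern.compile("\\(([^\\s()]+)\\s+([^\\s()]+)\\)");

    // Returns the leaves of a bracketed parse as "TAG/word" strings, left to right.
    static List<String> leaves(String tree) {
        List<String> result = new ArrayList<>();
        Matcher m = LEAF.matcher(tree);
        while (m.find()) {
            result.add(m.group(1) + "/" + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String tree = "(TOP (SBARQ (WHADVP (WRB How)) (SQ (VBZ is) (NP (NP (DT the) (NN author)) "
                + "(PP (IN of) (NP (NP (DT The) (NNP Call)) (PP (IN of) (NP (DT the) (NNP Wild?)))))))))";
        // Prints one TAG/word pair per line, starting with WRB/How and VBZ/is.
        for (String leaf : leaves(tree)) {
            System.out.println(leaf);
        }
    }
}
```

Phrase-level labels such as SBARQ or NP never match the pattern, because they are followed by another opening bracket rather than by a word.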

Now the problem arises: what do these POS tags (e.g. SBARQ, WHADVP, VBZ) stand for?
To avoid this confusion, I am consolidating the POS tags here:

Tag     Description
CC      Coordinating conjunction
CD      Cardinal number
DT      Determiner
EX      Existential there
FW      Foreign word
IN      Preposition or subordinating conjunction
JJ      Adjective
JJR     Adjective, comparative
JJS     Adjective, superlative
LS      List item marker
MD      Modal
NN      Noun, singular or mass
NNS     Noun, plural
NNP     Proper noun, singular
NNPS    Proper noun, plural
PDT     Predeterminer
POS     Possessive ending
PRP     Personal pronoun
PRP$    Possessive pronoun
RB      Adverb
RBR     Adverb, comparative
RBS     Adverb, superlative
RP      Particle
S       Simple declarative clause, i.e. one that is not introduced by a (possibly empty) subordinating conjunction or a wh-word and that does not exhibit subject-verb inversion
SBAR    Clause introduced by a (possibly empty) subordinating conjunction
SBARQ   Direct question introduced by a wh-word or a wh-phrase. Indirect questions and relative clauses should be bracketed as SBAR, not SBARQ
SINV    Inverted declarative sentence, i.e. one in which the subject follows the tensed verb or modal
SQ      Inverted yes/no question, or main clause of a wh-question, following the wh-phrase in SBARQ
SYM     Symbol
VBD     Verb, past tense
VBG     Verb, gerund or present participle
VBN     Verb, past participle
VBP     Verb, non-3rd person singular present
VBZ     Verb, 3rd person singular present
WDT     Wh-determiner
WP      Wh-pronoun
WP$     Possessive wh-pronoun
WRB     Wh-adverb
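
In code, a list like this is naturally a lookup map. Here is a rough sketch that collects every label used in the example parse tree and looks it up. The names TagLookup, tagsIn and describe are made up for this illustration, the map holds only a handful of the entries listed above (long descriptions shortened), and labels that are not in the list are reported as such rather than guessed.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// TagLookup, tagsIn and describe are hypothetical names used only for this sketch.
final class TagLookup {

    // A small subset of the list above; the long descriptions are shortened.
    private static final Map<String, String> DESCRIPTIONS = new HashMap<>();
    static {
        DESCRIPTIONS.put("DT", "Determiner");
        DESCRIPTIONS.put("IN", "Preposition or subordinating conjunction");
        DESCRIPTIONS.put("NN", "Noun, singular or mass");
        DESCRIPTIONS.put("NNP", "Proper noun, singular");
        DESCRIPTIONS.put("SBARQ", "Direct question introduced by a wh-word or a wh-phrase");
        DESCRIPTIONS.put("SQ", "Inverted yes/no question, or main clause of a wh-question");
        DESCRIPTIONS.put("VBZ", "Verb, 3rd person singular present");
        DESCRIPTIONS.put("WRB", "Wh-adverb");
    }

    // A label is whatever follows an opening bracket, up to whitespace or a bracket.
    private static final Pattern TAG = Pattern.compile("\\(([^\\s()]+)");

    // Returns the distinct labels of a bracketed parse in order of first appearance.
    static Set<String> tagsIn(String tree) {
        Set<String> tags = new LinkedHashSet<>();
        Matcher m = TAG.matcher(tree);
        while (m.find()) {
            tags.add(m.group(1));
        }
        return tags;
    }

    // Looks up a label; anything missing from the map is flagged, not guessed.
    static String describe(String tag) {
        return DESCRIPTIONS.getOrDefault(tag, "(not in the list)");
    }

    public static void main(String[] args) {
        String tree = "(TOP (SBARQ (WHADVP (WRB How)) (SQ (VBZ is) (NP (NP (DT the) (NN author)) "
                + "(PP (IN of) (NP (NP (DT The) (NNP Call)) (PP (IN of) (NP (DT the) (NNP Wild?)))))))))";
        for (String tag : tagsIn(tree)) {
            System.out.println(tag + "\t" + describe(tag));
        }
    }
}
```

Run against the example tree, labels such as TOP, WHADVP, NP and PP come back as "(not in the list)", since they are not covered by the list yet.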

I will keep updating the list with new POS tags. Cheers!!!