Saturday, 21 February 2015

Writing a Java UDF in Apache Pig


Isn’t it good that Pig lets us write user-defined functions (UDFs) for custom processing? Here we’ll talk about writing a UDF in Java.

How to write the Java UDF:
First of all, add the Pig dependency to the Java project.
Now, define a UDF class (e.g. HexConversion). Each eval UDF extends the EvalFunc<T> class. Here ‘T’ denotes the return type, e.g. DataByteArray, DataBag, Tuple, String, etc.
The UDF implements the exec(Tuple input) method, which is invoked once for every input tuple. The tuple holds the input parameters in the order they are passed to the function in the Pig script.

In the following example, we write a UDF that converts the first field of the input tuple into hexadecimal.

package com.test.udf;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;

public class HexConversion extends EvalFunc<DataByteArray> {
    /**
     * UDF to convert ASCII to hexadecimal. It returns the first field
     * of the input tuple as an uppercase hex string in a DataByteArray.
     */
    @Override
    public DataByteArray exec(final Tuple input) throws IOException {
        // Nothing to convert: return null so Pig emits null for this record
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        try {
            final String str = input.get(0).toString();
            final StringBuilder builder = new StringBuilder();
            for (int i = 0; i < str.length(); i++) {
                // Append the hex code of each character, e.g. 'A' -> "41"
                builder.append(Integer.toHexString(str.charAt(i)).toUpperCase());
            }
            return new DataByteArray(builder.toString());
        } catch (final Exception e) {
            // On any failure, fall back to an empty byte array
            return new DataByteArray(new byte[0]);
        }
    }
}
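
The conversion loop can also be kept in a plain Java helper, so it can be checked without a Pig installation. The following is only a minimal sketch: the class name HexSketch and the method toHex are made up for illustration and are not part of Pig. Unlike the UDF above, it pads every character to two hex digits, assuming the input contains only characters below U+0100 (plain ASCII, for example).

```java
final class HexSketch {
    private HexSketch() {
        // Utility class, not meant to be instantiated
    }

    /**
     * Returns the uppercase hex code of every character in the text.
     * Each character is zero-padded to two digits (assumption: chars below U+0100).
     */
    static String toHex(String text) {
        StringBuilder hex = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); i++) {
            hex.append(String.format("%02X", (int) text.charAt(i)));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex("Pig")); // prints 506967
    }
}
```

The padding is a design choice: Integer.toHexString returns a single digit for characters below 0x10 (a tab becomes "9"), which makes the output ambiguous to decode, while a fixed width of two digits per character keeps it reversible.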

Schema:
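When the return type is Tuple or DataBag, the schema of the result has to be declared explicitly in the outputSchema method. Import the following two classes and implement the method. The example below describes an output tuple that holds the second and first input fields, in that order: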

import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.data.DataType;

    public Schema outputSchema(Schema input) {
        try {
            Schema tupleSchema = new Schema();
            tupleSchema.add(input.getField(1));
            tupleSchema.add(input.getField(0));
            return new Schema(new Schema.FieldSchema(
                    getSchemaName(this.getClass().getName().toLowerCase(), input),
                    tupleSchema, DataType.TUPLE));
        } catch (Exception e) {
            return null;
        }
    }

Build the above UDF into a jar file: hexConvertor.jar
Now let’s see how to call this UDF in a Pig script. Register the jar file and call the function by its fully qualified class name.

  REGISTER hexConvertor.jar;
  A = LOAD 'sample_data' AS (field1: bytearray, age: int);
  B = FOREACH A GENERATE com.test.udf.HexConversion(field1);
  DUMP B;

Now you can also write your own UDF. Cheers..!!!
