Apache Pig user defined functions (UDFs)
Python UDF example 
• Motivation 
– Simple tasks like string manipulation and math 
computations are easier with a scripting language. 
– Users can also develop custom scripting engines 
– Currently only Python is supported due to the 
availability of Jython 
• Example 
– Calculate the square of a column 
– Write Hello World
Python UDF 
• Pig script 
register 'test.py' using jython as myfuncs;
-- or, equivalently, name the script engine class explicitly:
register 'test.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myfuncs;
b = foreach a generate myfuncs.helloworld(), myfuncs.square(3);
• test.py 
@outputSchema("x:{t:(word:chararray)}")
def helloworld():
    return ('Hello, World')

@outputSchema("y:{t:(word:chararray,num:long)}")
def complex(word):
    return (str(word), long(word) * long(word))

@outputSchemaFunction("squareSchema")
def square(num):
    return ((num) * (num))

@schemaFunction("squareSchema")
def squareSchema(input):
    return input
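Outside Pig the decorators used in test.py are undefined, so the file cannot be run directly for a quick check. The following is a minimal sketch, assuming that Pig's Jython engine supplies outputSchema, outputSchemaFunction and schemaFunction when the script is registered. The stand-in decorators here are hypothetical: they only record their argument, which is enough to exercise the function bodies under plain Python.

```python
# Hypothetical stand-ins for the decorators that Pig's Jython engine is assumed
# to provide at registration time. Each one only records its argument.
def _recording_decorator(attr):
    def decorator(value):
        def wrap(func):
            setattr(func, attr, value)
            return func
        return wrap
    return decorator


outputSchema = _recording_decorator("outputSchema")
outputSchemaFunction = _recording_decorator("outputSchemaFunction")
schemaFunction = _recording_decorator("schemaFunction")


@outputSchema("x:{t:(word:chararray)}")
def helloworld():
    return 'Hello, World'


@outputSchemaFunction("squareSchema")
def square(num):
    return num * num


@schemaFunction("squareSchema")
def squareSchema(input):
    # The output schema is whatever the input schema was
    return input


if __name__ == "__main__":
    print(helloworld())
    print(square(3))
```

The functions stay ordinary Python callables, so they can be unit tested before being registered in a Pig script.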
UDFs
• UDFs are user defined functions and are of
the following types:
– EvalFunc
• Used in the FOREACH clause
– FilterFunc
• Used in the FILTER ... BY clause
– LoadFunc 
• Used in the LOAD clause 
– StoreFunc 
• Used in the STORE clause
Writing a Simple EvalFunc 
• EvalFunc is the most common type of UDF and is used in the
FOREACH statement of Pig
--myscript.pig 
REGISTER myudfs.jar; 
A = LOAD 'student_data' AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE myudfs.UPPER(name); 
DUMP B;
Source for UPPER UDF 
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String>
{
    public String exec(Tuple input) throws IOException
    {
        // An empty or missing input tuple yields null rather than an error
        if (input == null || input.size() == 0)
            return null;
        try
        {
            String str = (String)input.get(0);
            return str.toUpperCase();
        }
        catch(Exception e)
        {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
EvalFuncs returning Complex Types
Create a jar of the UDFs 
$ ls ExpectedClick/Evals
LineAdToMatchtype.java
$ javac -cp pig.jar ExpectedClick/Evals/*.java
$ jar -cf ExpectedClick.jar ExpectedClick/Evals/*
Use your function in the Pig Script 
register ExpectedClick.jar; 
offer = LOAD '/user/viraj/dataseta' USING Loader() AS (a,b,c); 
… 
offer_projected = FOREACH offer_filtered GENERATE
(chararray)a#'canon_query' AS a_canon_query,
FLATTEN(ExpectedClick.Evals.LineAdToMatchtype((chararray)a#'source')) AS matchtype, …
EvalFuncs returning Complex Types
package ExpectedClick.Evals;

public class LineAdToMatchtype extends EvalFunc<DataBag>
{
    private String lineAdSourceToMatchtype (String lineAdSource)
    {
        if (lineAdSource.equals("0"))
        { return "1"; }
        else if (lineAdSource.equals("9")) { return "2"; }
        else if (lineAdSource.equals("13")) { return "3"; }
        else return "0";
    }
    …
EvalFuncs returning Complex Types
    public DataBag exec (Tuple input) throws IOException
    {
        if (input == null || input.size() == 0)
            return null;
        String lineAdSource;
        try {
            lineAdSource = (String)input.get(0);
        } catch(Exception e) {
            System.err.println("ExpectedClick.Evals.LineAdToMatchType: Can't convert field to a string; error = " + e.getMessage());
            return null;
        }
        // Allocate one field up front; set(0, ...) fails on a zero-length tuple
        Tuple t = DefaultTupleFactory.getInstance().newTuple(1);
        try {
            t.set(0,lineAdSourceToMatchtype(lineAdSource));
        }catch(Exception e) {}
        DataBag output = DefaultBagFactory.getInstance().newDefaultBag();
        output.add(t);
        return output;
    }
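For comparison, the same mapping could be written as a Python UDF. This is only a sketch, assuming the Jython engine converts a Python list of tuples into a Pig bag of tuples. The function name line_ad_to_matchtype, the schema string and the fallback decorator are illustrative and not part of the original ExpectedClick code.

```python
try:
    outputSchema  # assumed to be defined already when Pig registers the script
except NameError:
    # Hypothetical fallback so the file also runs under plain Python
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap

# Same lookup as lineAdSourceToMatchtype in the Java version
_MATCHTYPES = {"0": "1", "9": "2", "13": "3"}


@outputSchema("matchtypes:{t:(matchtype:chararray)}")
def line_ad_to_matchtype(line_ad_source):
    if line_ad_source is None:
        return None
    # A list holding one single-field tuple stands for a bag with one tuple
    return [(_MATCHTYPES.get(str(line_ad_source), "0"),)]
```

Returning a one-element list mirrors the Java version, which adds a single tuple to the output bag so that FLATTEN produces exactly one matchtype per input row.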
FilterFunc 
• Filter functions are eval functions that return a boolean value 
• Filter functions can be used anywhere a Boolean expression is 
appropriate 
– FILTER operator or Bincond 
• Example: use a FilterFunc to implement an outer join
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = LOAD 'voter_data' AS (name: chararray, age: int, registration: chararray, contributions: float);
C = COGROUP A BY name, B BY name;
D = FOREACH C GENERATE group, flatten((IsEmpty(A) ? null : A)), flatten((IsEmpty(B) ? null : B));
dump D;
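How the COGROUP plus IsEmpty combination yields an outer join can be seen on a few rows. The following is a simplified model in plain Python, not Pig's implementation: cogroup, flatten_or_null, outer_join and the sample rows are hypothetical, and a whole tuple (or None) stands in for the flattened fields.

```python
from collections import OrderedDict
from itertools import product


def cogroup(a_rows, b_rows, key=lambda row: row[0]):
    """Group both inputs by key; every key gets a (possibly empty) bag per input."""
    groups = OrderedDict()
    for row in a_rows:
        groups.setdefault(key(row), ([], []))[0].append(row)
    for row in b_rows:
        groups.setdefault(key(row), ([], []))[1].append(row)
    return groups


def flatten_or_null(bag):
    """Model of flatten((IsEmpty(bag) ? null : bag)): an empty bag becomes one null."""
    return bag if len(bag) > 0 else [None]


def outer_join(a_rows, b_rows):
    result = []
    for group, (a_bag, b_bag) in cogroup(a_rows, b_rows).items():
        # Flattening two bags in one GENERATE produces their cross product
        for a, b in product(flatten_or_null(a_bag), flatten_or_null(b_bag)):
            result.append((group, a, b))
    return result


students = [("alice", 20, 3.5), ("bob", 22, 3.1)]
voters = [("bob", 22, "democrat", 10.0), ("carol", 30, "green", 5.0)]
for row in outer_join(students, voters):
    print(row)
```

Without the IsEmpty test, flattening an empty bag would produce no rows for that group, which gives inner-join behaviour instead.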
IsEmpty FilterFunc
import java.io.IOException;
import java.util.Map;
import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.DataType;
import org.apache.pig.impl.util.WrappedIOException;

public class IsEmpty extends FilterFunc
{
    public Boolean exec(Tuple input) throws IOException
    {
        if (input == null || input.size() == 0) return null;
        try {
            Object values = input.get(0);
            if (values instanceof DataBag)
                return ((DataBag)values).size() == 0;
            else if (values instanceof Map)
                return ((Map)values).size() == 0;
            else {
                throw new IOException("Cannot test a " + DataType.findTypeName(values) + " for emptiness.");
            }
        }
        catch (ExecException ee) {
            throw WrappedIOException.wrap("Caught exception processing input row ", ee);
        }
    }
}
LoadFunc 
• LoadFunc abstract class has the main methods for loading data 
• 3 important interfaces 
– LoadMetadata has methods to deal with metadata 
– LoadPushDown has methods to push operations from pig runtime into 
loader implementations 
– LoadCaster has methods to convert byte arrays to specific types 
• implement this interface if your loader casts (implicit or explicit) from
DataByteArray fields to other types
• Methods to be implemented
– getInputFormat()
– setLocation()
– prepareToRead()
– getNext()
– setUdfContextSignature() (optional; LoadFunc provides a default)
– relativeToAbsolutePath() (optional; LoadFunc provides a default)
Regexp Loader Example 
public class RegexLoader extends LoadFunc {
    private LineRecordReader in = null;
    long end = Long.MAX_VALUE;
    private final Pattern pattern;

    public RegexLoader(String regex) {
        pattern = Pattern.compile(regex);
    }

    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    public void prepareToRead(RecordReader reader, PigSplit split)
            throws IOException {
        in = (LineRecordReader) reader;
    }

    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }
Regexp Loader 
    public Tuple getNext() throws IOException {
        Matcher matcher = pattern.matcher("");
        TupleFactory mTupleFactory = DefaultTupleFactory.getInstance();
        String line;
        // Advance the reader on every pass so a non-matching line is skipped
        // instead of being re-read forever
        while (in.nextKeyValue()) {
            Text val = in.getCurrentValue();
            if (val == null) {
                break;
            }
            line = val.toString();
            // Drop a trailing carriage return left by Windows-style line endings
            if (line.length() > 0 && line.charAt(line.length() - 1) == '\r') {
                line = line.substring(0, line.length() - 1);
            }
            matcher = matcher.reset(line);
            ArrayList<DataByteArray> list = new ArrayList<DataByteArray>();
            if (matcher.find()) {
                for (int i = 1; i <= matcher.groupCount(); i++) {
                    list.add(new DataByteArray(matcher.group(i)));
                }
                return mTupleFactory.newTuple(list);
            }
        }
        return null;
    }
}
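The per-line logic of getNext() can be checked without Hadoop. This is a sketch in Python under that assumption: parse_lines, the sample lines and the sample pattern are hypothetical, and plain strings stand in for DataByteArray fields.

```python
import re


def parse_lines(lines, regex):
    """Yield one tuple of capture groups per matching line; skip the rest."""
    pattern = re.compile(regex)
    for line in lines:
        # Drop a single trailing carriage return, as the loader does
        if line.endswith("\r"):
            line = line[:-1]
        match = pattern.search(line)  # search() plays the role of Matcher.find()
        if match:
            yield match.groups()


sample = ["GET /index.html 200\r", "garbage", "POST /login 302"]
for fields in parse_lines(sample, r"(\w+) (\S+) (\d+)"):
    print(fields)
```

Each capture group becomes one field of the output tuple, so the regular expression passed to the loader's constructor effectively defines the column layout.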
End of session 
Day – 3: Apache Pig user defined functions (UDFs)
