Channel: SquareCog's SquareBlog » hadoop

Upcoming Features in Pig 0.8: Dynamic Invokers

Pig release 0.8 is scheduled to be feature-frozen and branched at the end of August 2010. This release has many, many useful new features, mostly addressing usability. In this series of posts, I will demonstrate some of my favorites from this release.

Pig 0.8 will have a family of built-in UDFs called Dynamic Invokers. The idea is simple: frequently, Pig users need to use a simple function that is already provided by standard Java libraries, but for which a UDF has not been written. Dynamic Invokers allow a Pig programmer to refer to Java functions without having to wrap them in custom Pig UDFs, at the cost of doing some Java reflection on every function call.

An Example

Let’s start off with a quick motivating example. Imagine we have a bunch of URL-encoded strings that we want to decode. In Java, this is done by simply calling:

String decoded = URLDecoder.decode(encoded, "UTF-8");

In Pig, there is no built-in function to do this, but it’s easy enough to write your own, wrapping the URLDecoder function:

package org.squarecog.pig;

import java.io.IOException;
import java.net.URLDecoder;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UrlDecode extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        String encoded = (String) input.get(0);
        String encoding = (String) input.get(1);
        return URLDecoder.decode(encoded, encoding);
    }
}

This is about the least amount of code you can get away with. It doesn’t check for failing casts, non-existent fields, or all kinds of other problems, but it does the job most of the time. Having written this class, the next step is to compile it, test it, and package it into a jar. Only then is the decoder ready to be used in Pig:

REGISTER squarecogs_pig_stuff.jar;

encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
decoded_strings = FOREACH encoded_strings GENERATE org.squarecog.pig.UrlDecode(encoded, 'UTF-8');

What a pain. There must be an easier way, right? Well, now there is. With Pig 0.8 all you have to do is put this in your Pig script:

DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');

That’s it. No Java, no compilation. Just use it.

Usage

Currently, Dynamic Invokers can be used for any static method that takes no arguments or some combination of Strings, ints, longs, doubles, floats, or arrays of the same, and returns a String, an int, a long, a double, or a float. Numeric arguments must be primitives; the capitalized numeric wrapper classes are not supported as arguments. Depending on the return type, a specific kind of Invoker must be used: InvokeForString, InvokeForInt, InvokeForLong, InvokeForDouble, or InvokeForFloat.

The DEFINE keyword is used to bind an alias to a Java method, as above. The first argument to the InvokeFor* constructor is the fully qualified name of the desired method (package, class, and method name). The second argument is a space-delimited, ordered list of the classes of the method arguments. It can be omitted, or given as an empty string, if the method takes no arguments. Valid class names are String, Long, Float, Double, and Int. Invokers can also work with array arguments, represented in Pig as DataBags of single-tuple elements. To declare one, append brackets to the class name, as in string[]. Class names are not case-sensitive.
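To make the two constructor arguments concrete, here is a minimal sketch of how a method name and a space-delimited type list could be turned into a callable method with plain Java reflection. This is not Pig's actual Invoker code; the class InvokerSketch and its helpers typeFor and resolve are hypothetical names made up for illustration, and array types are left out to keep it short.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Locale;

/** Hypothetical sketch; NOT Pig's actual Invoker implementation. */
final class InvokerSketch {

    /** Maps a case-insensitive type name from the spec string to a Java class. */
    static Class<?> typeFor(String name) {
        switch (name.toLowerCase(Locale.ROOT)) {
            case "string": return String.class;
            case "int":    return int.class;     // primitives only for numbers
            case "long":   return long.class;
            case "float":  return float.class;
            case "double": return double.class;
            default:
                throw new IllegalArgumentException("Unsupported type: " + name);
        }
    }

    /** Resolves e.g. ("java.net.URLDecoder.decode", "String String") to a static Method. */
    static Method resolve(String fullName, String argSpec)
            throws ReflectiveOperationException {
        int dot = fullName.lastIndexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("Expected package.Class.method: " + fullName);
        }
        Class<?> owner = Class.forName(fullName.substring(0, dot));
        String methodName = fullName.substring(dot + 1);

        // An omitted or empty spec means the method takes no arguments.
        String trimmed = (argSpec == null) ? "" : argSpec.trim();
        String[] names = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        Class<?>[] types = new Class<?>[names.length];
        for (int i = 0; i < names.length; i++) {
            types[i] = typeFor(names[i]);
        }

        Method method = owner.getMethod(methodName, types);
        if (!Modifier.isStatic(method.getModifiers())) {
            throw new IllegalArgumentException("Not a static method: " + fullName);
        }
        return method;
    }

    public static void main(String[] args) throws Exception {
        Method decode = resolve("java.net.URLDecoder.decode", "String String");
        // The method is static, so the receiver argument is null.
        Object decoded = decode.invoke(null, "a%20b%2Bc", "UTF-8");
        System.out.println(decoded); // prints "a b+c"
    }
}
```

The lookup happens once in resolve, while invoke is what would run for every input row. That per-call invoke is the reflection cost mentioned at the top of this post.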

Speed

I tested the speed of these Invokers by using them to take the log of the numbers from 0 to 1,000,000 in a tight loop. For this experiment, using the dynamic InvokeForDouble UDF was about twice as slow as using the Log UDF directly. I find this to be an acceptable cost to pay for the speed and convenience of development when writing prototypes and one-off exploratory scripts. Naturally, if you are trying to squeeze every bit of performance out of your scripts, you should use regular UDFs.
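For a rough feel of where the overhead comes from, the following sketch compares a direct call to Math.log with the same call made through reflection, outside of Pig entirely. It is not the experiment described above: the class ReflectionCostSketch is a hypothetical name, the loop starts at 1 rather than 0 to avoid log(0), and a naive timing loop like this is sensitive to JIT warm-up, so the numbers it prints will vary by machine and JVM and say nothing definitive about Pig itself.

```java
import java.lang.reflect.Method;

/** Hypothetical micro-comparison; not the Pig experiment described in the post. */
final class ReflectionCostSketch {

    /** Sums log(i) for i in 1..n using a direct static call. */
    static double sumDirect(int n) {
        double sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += Math.log(i);
        }
        return sum;
    }

    /** Sums log(i) for i in 1..n, going through Method.invoke on every call. */
    static double sumReflective(int n) throws ReflectiveOperationException {
        Method log = Math.class.getMethod("log", double.class);
        double sum = 0;
        for (int i = 1; i <= n; i++) {
            // The argument is boxed to Double and the result is unboxed again.
            sum += (Double) log.invoke(null, (double) i);
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        int n = 1_000_000;

        long t0 = System.nanoTime();
        double direct = sumDirect(n);
        long t1 = System.nanoTime();
        double reflective = sumReflective(n);
        long t2 = System.nanoTime();

        System.out.printf("direct:     %.3f in %d ms%n", direct, (t1 - t0) / 1_000_000);
        System.out.printf("reflective: %.3f in %d ms%n", reflective, (t2 - t1) / 1_000_000);
    }
}
```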

Arrays

As mentioned, Pig 0.8 invokers will support array arguments. This makes methods like those in org.apache.commons.math.stat.StatUtils available for processing the results of grouping your datasets, for example. This is very nice, but a word of caution: the resulting UDF will of course not be optimized for Hadoop, and the very significant benefits one gains from implementing the Algebraic and Accumulator interfaces are lost here. Be careful with this one.
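The caution about arrays is easier to see in code. In the sketch below, a List of Doubles stands in for a Pig bag of single-field tuples, and a local mean method stands in for a library method such as StatUtils.mean; the class ArrayInvokerSketch and its methods are hypothetical names, not Pig's implementation. The point it illustrates is that every value in the group has to be copied into one in-memory array before the method can be called at all.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of passing a grouped bag to a method that takes double[]. */
final class ArrayInvokerSketch {

    /** Stand-in for a library method with a double[] parameter. */
    public static double mean(double[] values) {
        if (values.length == 0) {
            return Double.NaN;
        }
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Copies the whole "bag" into a primitive array; the full group is held in memory. */
    static double[] toArray(List<Double> bag) {
        double[] out = new double[bag.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = bag.get(i);
        }
        return out;
    }

    /** Looks up a public static method(double[]) on this class and invokes it on the bag. */
    static double invokeOnBag(String methodName, List<Double> bag)
            throws ReflectiveOperationException {
        Method method = ArrayInvokerSketch.class.getMethod(methodName, double[].class);
        // Cast to Object so the array is passed as a single argument.
        return (Double) method.invoke(null, (Object) toArray(bag));
    }

    public static void main(String[] args) throws Exception {
        List<Double> bag = Arrays.asList(1.0, 2.0, 6.0);
        System.out.println(invokeOnBag("mean", bag)); // prints 3.0
    }
}
```

Because the method only sees a finished array, nothing can be computed incrementally or partially on the map side, which is what the Algebraic and Accumulator interfaces exist to allow.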

Future Work

If people find these Invokers useful, more features can be added, such as support for booleans, bytes, and the various Number classes (rather than just primitives). Let me know what you would like to see, either in the comments, or, even better, on the Pig user mailing list.


