Using Globs and Wildcards with MapReduce

October 1, 2013 Dev Ayon

A common question when writing MapReduce jobs is how to include multiple source files in a job. A cursory glance at FileInputFormat shows that there are a few approaches:

static void setInputPathFilter(Job job, Class<? extends PathFilter> filter)
          Set a PathFilter to be applied to the input paths for the map-reduce job.

static void setInputPaths(Job job, Path... inputPaths)
          Set the array of Paths as the list of inputs for the map-reduce job.

static void setInputPaths(Job job, String commaSeparatedPaths)
          Sets the given comma separated paths as the list of inputs for the map-reduce job.

So, in essence, one could write a filter that includes only certain kinds of files. Alternatively, one could pass a comma-separated list of inputs, or a list of Path objects. These kinds of approaches are necessary for a couple of use cases. For example, suppose you have a MapReduce job A that runs every hour. The outputs of this job are written into directories such as:

/user/jobrunner/jobA/MMddyyyy/HH

Say a MapReduce job B uses this output as its input, and runs every day. Thus, the input to job B for 9 September 2013 is:

/user/jobrunner/jobA/09092013/01,/user/jobrunner/jobA/09092013/02,/user/jobrunner/jobA/09092013/03….

And so on…
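To make the tedium concrete, here is a minimal sketch of how that comma-separated list might be assembled by hand. The class and method names are hypothetical, and the hour range 00 through 23 is an assumption about how job A names its HH directories; the resulting string is the kind of value the setInputPaths(Job job, String commaSeparatedPaths) overload listed above accepts.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Hadoop: builds the explicit list of
// hourly input directories for one day of job A's output.
final class HourlyInputPaths {
    // Matches the MMddyyyy directory layout described in the text.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("MMddyyyy");

    static String commaSeparated(String baseDir, LocalDate day) {
        List<String> paths = new ArrayList<>();
        // Assumption: hourly directories are named 00 through 23.
        for (int hour = 0; hour < 24; hour++) {
            paths.add(String.format(Locale.ROOT, "%s/%s/%02d", baseDir, day.format(DAY), hour));
        }
        return String.join(",", paths);
    }
}
```

Calling HourlyInputPaths.commaSeparated("/user/jobrunner/jobA", LocalDate.of(2013, 9, 9)) yields 24 comma-separated paths, one per hour, which job B would then have to regenerate for every day it runs.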

Sounds tedious, right? Well, one could argue that you could write a PathFilter to look for subdirectories. However, as it turns out, there is a simpler feature for problems like these: globs.

Globs, put simply, are wildcard expressions you can use to name files and directories. Globs in HDFS (and MapRFS, if you use MapR) follow a syntax very similar to that of typical Unix-based OSes such as Linux:

*         Matches zero or more characters
?         Matches a single character
[ab]      Matches a single character in the set {a, b}
[^ab]     Matches a single character that is not in the set {a, b}
[a-b]     Matches a single character in the (closed) range [a, b], where a is lexicographically less than or equal to b
[^a-b]    Matches a single character that is not in the (closed) range [a, b], where a is lexicographically less than or equal to b
{a,b}     Matches either expression a or b
\c        Matches character c when it is a metacharacter
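To get a feel for these rules without a cluster, the following is a simplified sketch that translates the syntax in the table into a java.util.regex pattern. It is not Hadoop's own glob implementation: SimpleGlob is a hypothetical class, nested braces and unusual bracket contents are only loosely handled, and it assumes that wildcards never match the path separator "/".

```java
import java.util.regex.Pattern;

// Hypothetical, simplified glob-to-regex translator for experimenting with
// the syntax in the table. Not Hadoop's implementation.
final class SimpleGlob {
    static Pattern compile(String glob) {
        StringBuilder regex = new StringBuilder();
        int braceDepth = 0;
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            if (c == '\\') {
                // \c : the next character is taken literally.
                if (++i >= glob.length()) {
                    throw new IllegalArgumentException("Dangling escape in: " + glob);
                }
                regex.append(Pattern.quote(String.valueOf(glob.charAt(i))));
            } else if (c == '*') {
                regex.append("[^/]*");          // zero or more characters, same path component
            } else if (c == '?') {
                regex.append("[^/]");           // exactly one character
            } else if (c == '[') {
                // [ab], [^ab], [a-b], [^a-b] share regex character-class syntax.
                int close = glob.indexOf(']', i + 1);
                if (close <= i + 1) {
                    throw new IllegalArgumentException("Empty or unclosed [ in: " + glob);
                }
                String body = glob.substring(i + 1, close)
                        .replace("\\", "\\\\").replace("[", "\\[").replace("&", "\\&");
                // The lookahead keeps negated sets from matching "/".
                regex.append("(?!/)[").append(body).append(']');
                i = close;
            } else if (c == '{') {
                regex.append("(?:");            // {a,b} becomes the alternation (?:a|b)
                braceDepth++;
            } else if (c == '}' && braceDepth > 0) {
                regex.append(')');
                braceDepth--;
            } else if (c == ',' && braceDepth > 0) {
                regex.append('|');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        if (braceDepth != 0) {
            throw new IllegalArgumentException("Unclosed { in: " + glob);
        }
        return Pattern.compile(regex.toString());
    }

    static boolean matches(String glob, String path) {
        return compile(glob).matcher(path).matches();
    }

    public static void main(String[] args) {
        String hour = "/user/jobrunner/jobA/09092013/01";
        System.out.println(matches("/user/jobrunner/jobA/09092013/*", hour));  // true
        System.out.println(matches("/user/jobrunner/jobA/*", hour));           // false: * stops at "/"
        System.out.println(matches("/user/jobrunner/jobA/09092013/0?", hour)); // true
    }
}
```

Translating to a regular expression keeps the sketch short; a real file system would also have to list directories and expand each path component in turn.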

 

Thus, the input to our job can be given as

/user/jobrunner/jobA/09092013/*/

Simple as that!
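As a rough illustration, the daily input for job B then reduces to a single string. The sketch below builds that glob for a given date; DailyGlob and forDay are hypothetical names, and the base directory is the one from the example above. The result could be handed to the setInputPaths(Job job, String commaSeparatedPaths) overload in place of the long explicit list.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Hadoop: builds the one-line glob that
// covers every hourly directory of job A for a given day.
final class DailyGlob {
    // Matches the MMddyyyy directory layout described in the text.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("MMddyyyy");

    static String forDay(String baseDir, LocalDate day) {
        // "*" stands in for every HH subdirectory under the day's directory.
        return baseDir + "/" + day.format(DAY) + "/*/";
    }
}
```

For 9 September 2013, DailyGlob.forDay("/user/jobrunner/jobA", LocalDate.of(2013, 9, 9)) returns the same /user/jobrunner/jobA/09092013/*/ expression shown above.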

Note: MapR suggests that globs don't work very fast on MapRFS, and that instead of using globs it is better to write explicit code that lists all files and then filters them. In our tests at Zaloni, however, we see cases where globs on MapR do work faster than explicit code: when the glob expression is very large and the resulting set of files is also large, globs work better than the list-and-filter approach.


About the Author

Dev Ayon

Dev Ayon is a software engineer at Zaloni and specializes in scalable solutions with MapReduce and Spark. Interests include scalable data anonymization, semantic data and reasoning, continuous build and deployment systems, and meddling in other people's business.
