The Impatient series is a set of tutorials to get you started with Cascading 2.5.

This set of progressive coding examples starts with a simple file copy and builds up to a MapReduce implementation of the TF-IDF algorithm.

Prerequisites

The code for this tutorial is hosted on GitHub. Clone it to your local disk like so:

$ git clone https://github.com/Cascading/Impatient

Each part has its own sub-directory in the repository.

To follow the tutorial, you also need Gradle and Apache Hadoop installed on your computer. You do not need a Hadoop cluster; local mode is sufficient.

Gradle

Everything has been tested with Gradle 1.6. You can check your version of Gradle like this:

$ gradle -v
------------------------------------------------------------
Gradle 1.6
------------------------------------------------------------
...

Hadoop

For Hadoop, please install the latest stable version from the 1.x series. At the time of this writing, that is Apache Hadoop 1.2.1. You can check your version like this:

$ hadoop version
Hadoop 1.2.1
...

Cascading is compatible with a number of Hadoop distributions and versions. Check the compatibility page to see whether your distribution is supported.

IDE support

While an IDE is not strictly required to follow the tutorials, it is certainly useful. You can create an IntelliJ IDEA-compatible project in each part of the tutorial like this:

$ gradle ideaModule

If you prefer Eclipse, you can run:

$ gradle eclipse

Part 1

  • Implements the simplest Cascading app possible

  • Copies each TSV line from source tap to sink tap

  • Takes about a dozen lines of code

  • Physical plan: 1 Mapper

Part 2

  • Implements a simple example of WordCount

  • Uses a regex to split the input text lines into a token stream

  • Generates a DOT file, to show the Cascading flow graphically

  • Physical plan: 1 Mapper, 1 Reducer
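
To make the WordCount idea concrete before opening Part 2, here is a minimal sketch in plain Java. It is not Cascading code: the class name WordCountSketch, the method countWords, and the delimiter regex are all assumptions made for illustration, and the tutorial's own regex and pipe assembly may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Plain-Java sketch of WordCount. Hypothetical names; not the Cascading API.
final class WordCountSketch {
    // Assumed delimiter set: whitespace plus a few punctuation characters.
    private static final Pattern DELIMITERS = Pattern.compile("[\\s\\[\\](),.]+");

    // Splits each line into a token stream and counts occurrences per token.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : DELIMITERS.split(line)) {
                // split() can yield an empty first element when a line
                // starts with a delimiter, so skip empty tokens.
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("rain shadow", "a rain shadow is dry");
        System.out.println(countWords(lines));
    }
}
```

Conceptually, the tokenizing loop corresponds to the map side and the per-token counting to the reduce side, which matches the 1 Mapper, 1 Reducer plan listed above.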

Part 3

  • Uses a custom Function to scrub the token stream

  • Discusses when to use standard Operations vs. creating custom ones

  • Physical plan: 1 Mapper, 1 Reducer
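
As a rough illustration of what "scrubbing" a token stream can mean, the following plain-Java sketch trims and lowercases each token and drops tokens that end up empty. The class ScrubSketch and its methods are hypothetical, and the exact cleanup rules are an assumption; in Part 3 this kind of logic lives inside a custom Cascading Function rather than a static helper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Plain-Java sketch of token scrubbing. Hypothetical names; not the Cascading API.
final class ScrubSketch {
    // Assumed cleanup rule: strip surrounding whitespace and lowercase.
    static String scrub(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    // Applies scrub() to every token and drops tokens that become empty.
    static List<String> scrubAll(List<String> tokens) {
        List<String> cleaned = new ArrayList<>();
        for (String token : tokens) {
            String scrubbed = scrub(token);
            if (!scrubbed.isEmpty()) {
                cleaned.add(scrubbed);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(scrubAll(Arrays.asList(" Rain", "SHADOW ", "   ")));
    }
}
```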

Part 4

  • Shows how to use a HashJoin on two pipes

  • Filters a list of stop words out of the token stream

  • Physical plan: 1 Mapper, 1 Reducer
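
The stop-word step can be pictured as a lookup of each token against a small in-memory set, which is the kind of situation a hash join suits when one side of the join is small. The sketch below is plain Java with hypothetical names (StopWordFilterSketch, removeStopWords); it does not use the Cascading HashJoin API, and the sample stop words are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java sketch of stop-word filtering. Hypothetical names; not the Cascading API.
final class StopWordFilterSketch {
    // Keeps only the tokens that do not appear in the stop-word set.
    // The set plays the role of the small side of the join, held in memory.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative stop words only; the tutorial ships its own list.
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is", "the"));
        List<String> tokens = Arrays.asList("a", "rain", "shadow", "is", "dry");
        System.out.println(removeStopWords(tokens, stopWords));
    }
}
```

Because the lookup happens per token as the stream passes through, no extra grouping step is needed for the filtering itself, which is consistent with the plan staying at 1 Mapper, 1 Reducer.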

Part 5

  • Calculates TF-IDF using an ExpressionFunction

  • Shows how to use a CountBy, SumBy, and a CoGroup

  • Physical plan: 10 Mappers, 8 Reducers
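
For readers new to TF-IDF, the following plain-Java sketch shows the arithmetic on a tiny in-memory corpus. It assumes one common smoothed form, tf * ln(nDocs / (1 + df)); the expression used in Part 5 may differ, and the names TfIdfSketch, tfIdf, and score are hypothetical. The per-document counts, per-token document frequencies, and the final lookup stand in loosely for the counting and joining steps that Part 5 builds with Cascading.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of TF-IDF. Hypothetical names; not the Cascading API.
final class TfIdfSketch {
    // Assumed smoothed form: tf * ln(nDocs / (1 + df)).
    static double tfIdf(long tfCount, long dfCount, long nDocs) {
        return tfCount * Math.log((double) nDocs / (1.0 + dfCount));
    }

    // Returns docId -> (token -> TF-IDF score) for a small in-memory corpus.
    static Map<String, Map<String, Double>> score(Map<String, List<String>> docs) {
        Map<String, Map<String, Integer>> tf = new TreeMap<>();
        Map<String, Integer> df = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            // Term frequency: occurrences of each token within one document.
            Map<String, Integer> counts = new TreeMap<>();
            for (String token : doc.getValue()) {
                counts.merge(token, 1, Integer::sum);
            }
            tf.put(doc.getKey(), counts);
            // Document frequency: number of documents containing the token.
            for (String token : counts.keySet()) {
                df.merge(token, 1, Integer::sum);
            }
        }
        int nDocs = docs.size();
        Map<String, Map<String, Double>> scores = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> doc : tf.entrySet()) {
            Map<String, Double> row = new TreeMap<>();
            for (Map.Entry<String, Integer> e : doc.getValue().entrySet()) {
                // Combine tf with the matching df, like a join on the token.
                row.put(e.getKey(), tfIdf(e.getValue(), df.get(e.getKey()), nDocs));
            }
            scores.put(doc.getKey(), row);
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new TreeMap<>();
        docs.put("doc1", List.of("rain", "rain", "shadow"));
        docs.put("doc2", List.of("shadow"));
        docs.put("doc3", List.of("dry"));
        docs.put("doc4", List.of("dry"));
        System.out.println(score(docs));
    }
}
```

With this smoothing, a token that appears in nearly every document scores near or below zero, so rarer tokens such as "rain" in the sample corpus rank higher than shared ones such as "shadow".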

Part 6

  • Includes unit tests in the build

  • Shows how to use other TDD features: checkpoints, assertions, traps, debug

  • Physical plan: 11 Mappers, 8 Reducers

Other versions

Also, compare these other excellent implementations of the example apps here: