The Impatient series is a set of tutorials to get you started with Cascading 2.5.
This set of progressive coding examples starts with a simple file copy and builds up to a MapReduce implementation of the TF-IDF algorithm.
Prerequisites
The code of this tutorial is hosted on github. Please clone it onto your local disk like so:
$ git clone https://github.com/Cascading/Impatient
Each part has its own sub-directory in the repository.
In order to follow the tutorial, you will also have to have gradle and apache hadoop installed on your computer. You do not need a hadoop cluster, local mode is sufficient.
Gradle
Everything has been tested with gradle 1.6. You can check your version of gradle like this:
$ gradle -v ------------------------------------------------------------ Gradle 1.6 ------------------------------------------------------------ ...
Hadoop
For hadoop please install the latest stable version from the 1.x series. At the time of this writing, this means apache hadoop 1.2.1.
$ hadoop version Hadoop 1.2.1 ...
Cascading is compatible with a number of hadoop distributions and versions. You can see on the compatibility page, if your distribution is supported.
IDE support
While an IDE is not strictly required to follow the tutorials, it is certainly useful. You can easily create an IntelliJ IDEA compatible project in each part of the tutorial like this:
$ gradle ideaModule
If you prefer eclipse, you can run:
gradle eclipse
Part 1
-
Implements simplest Cascading app possible
-
Copies each TSV line from source tap to sink tap
-
Roughly, in about a dozen lines of code
-
Physical plan: 1 Mapper
Part 2
-
Implements a simple example of WordCount
-
Uses a regex to split the input text lines into a token stream
-
Generates a DOT file, to show the Cascading flow graphically
-
Physical plan: 1 Mapper, 1 Reducer
Part 3
-
Uses a custom Function to scrub the token stream
-
Discusses when to use standard Operations vs. creating custom ones
-
Physical plan: 1 Mapper, 1 Reducer
Part 4
-
Shows how to use a HashJoin on two pipes
-
Filters a list of stop words out of the token stream
-
Physical plan: 1 Mapper, 1 Reducer
Part 5
-
Calculates TF-IDF using an ExpressionFunction
-
Shows how to use a CountBy, SumBy, and a CoGroup
-
Physical plan: 10 Mappers, 8 Reducers
Part 6
-
Includes unit tests in the build
-
Shows how to use other TDD features: checkpoints, assertions, traps, debug
-
Physical plan: 11 Mappers, 8 Reducers
Other versions
Also, compare these other excellent implementations of the example apps here:
-
Scalding for the Impatient by Sujit Pal in Scalding and
-
Cascalog for the Impatient by Paul Lam in Cascalog.