Monday 19 January 2015

Working with Apache Crunch - Overview

Apache Crunch is a thin layer on top of MapReduce for building simple and efficient MapReduce pipelines. It gives developers a high-level API for writing and testing complex MapReduce programs that require multiple processing stages. Crunch is modeled after Google's FlumeJava.

What is advantageous about Crunch?

  • Common operations such as join, aggregation, and sorting are available in the Crunch API.
  • Crunch can process data from various input sources that conform to different data models, such as serialized object formats, time series data, and HBase rows and columns. It does not impose a single data type.
  • Crunch provides type safety.
  • The 'MemPipeline' implementation runs a pipeline in memory, which makes testing much easier.
  • Crunch manages pipeline execution itself, so the stages of a multi-stage pipeline do not have to be chained together by hand.


3 most important Crunch APIs

The Crunch API is centered around three immutable, distributed data set abstractions:
  1. PCollection
  2. PTable
  3. PGroupedTable 


PCollection - Lazily evaluated parallel collection

PCollection<T> provides an immutable, distributed and unsorted collection of elements of type T.

          e.g.: PCollection<String>


PCollection provides a method called parallelDo, which applies the logic of a DoFn in parallel to all the elements of the source PCollection and returns a transformed PCollection as the output. parallelDo allows element-wise computation over an input PCollection.
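
To make the parallelDo contract concrete, here is a minimal plain-Java sketch of its semantics. It does not use the Crunch library: the nested `Emitter` and `DoFn` interfaces are hypothetical stand-ins that only mirror the shape of Crunch's types (a function that receives one input and emits zero or more outputs), and the loop runs sequentially where Crunch would distribute the work.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ParallelDoSketch {
    /** Hypothetical stand-in for a Crunch Emitter: receives the outputs of a function. */
    interface Emitter<T> {
        void emit(T value);
    }

    /** Hypothetical stand-in for a Crunch DoFn: may emit zero, one, or many outputs per input. */
    interface DoFn<S, T> {
        void process(S input, Emitter<T> emitter);
    }

    /** Applies fn to every element and returns a new read-only list; the input is left untouched. */
    static <S, T> List<T> parallelDo(List<S> input, DoFn<S, T> fn) {
        List<T> output = new ArrayList<>();
        for (S element : input) {
            // Each element is processed independently of the others,
            // which is what makes the real operation easy to parallelize.
            fn.process(element, output::add);
        }
        return Collections.unmodifiableList(output);
    }

    /** Example function: split each line into words, skipping blank input. */
    static List<String> splitWords(List<String> lines) {
        DoFn<String, String> splitter = (line, emitter) -> {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    emitter.emit(word);
                }
            }
        };
        return parallelDo(lines, splitter);
    }
}
```

Calling `ParallelDoSketch.splitWords` on the lines "a b" and "c" yields the words a, b, c. One input line can produce several outputs or none, so the transformed collection need not have the same size as the source.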


PTable - Sub-interface of PCollection<Pair<K,V>>

PTable<K,V> provides an immutable, distributed and unsorted multimap of key type K and value type V.

          e.g.: PTable<String, String>

PTable provides the parallelDo, groupByKey, join, and cogroup operations.
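
As an illustration of what a join over two multimaps means, the following plain-Java sketch models a table as a list of key-value entries (so a key may repeat) and performs an inner join on the common keys. This is an assumption-laden model, not Crunch code: the class `TableJoinSketch` and its `row` helper are hypothetical, and the nested loop is only there to show the result, not how Crunch would execute a join.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TableJoinSketch {
    /** Hypothetical helper: builds one key-value row of a table. */
    static <K, V> Map.Entry<K, V> row(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    /** Inner join: one output row for every pair of rows that share a key. */
    static <K, V, U> List<Map.Entry<K, Map.Entry<V, U>>> join(
            List<Map.Entry<K, V>> left, List<Map.Entry<K, U>> right) {
        List<Map.Entry<K, Map.Entry<V, U>>> joined = new ArrayList<>();
        for (Map.Entry<K, V> l : left) {
            for (Map.Entry<K, U> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    // Pair the two values under the shared key.
                    Map.Entry<V, U> pair = new SimpleEntry<>(l.getValue(), r.getValue());
                    joined.add(new SimpleEntry<>(l.getKey(), pair));
                }
            }
        }
        return joined;
    }
}
```

Joining a users table (u1 -> alice, u2 -> bob) with an orders table (u1 -> book, u1 -> pen, u3 -> ink) produces two rows, both under u1. Keys that appear on only one side (u2, u3) are dropped, and a repeated key on one side produces one row per match.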


PGroupedTable - Result of groupByKey function

PGroupedTable<K,V> is a distributed, sorted map from keys of type K to an Iterable<V> of values that may be iterated over only once. PGroupedTable<K,V> provides the parallelDo and combineValues operations.
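
The grouping and combining steps can be modeled in plain Java as follows. This is only a sketch of the semantics under stated assumptions: the class `GroupedTableSketch` and its methods are hypothetical, a `TreeMap` stands in for the sorted, distributed map, and a `BinaryOperator` stands in for whatever combining function is supplied. The combine step reads each group's values in a single pass, in keeping with the iterate-once restriction described above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

final class GroupedTableSketch {
    /** Models groupByKey: collects every value under its key, with keys kept in sorted order. */
    static <K extends Comparable<K>, V> SortedMap<K, List<V>> groupByKey(List<Map.Entry<K, V>> table) {
        SortedMap<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> row : table) {
            grouped.computeIfAbsent(row.getKey(), k -> new ArrayList<>()).add(row.getValue());
        }
        return grouped;
    }

    /** Models combineValues: reduces each key's values to a single value in one pass. */
    static <K extends Comparable<K>, V> SortedMap<K, V> combineValues(
            SortedMap<K, List<V>> grouped, BinaryOperator<V> combiner) {
        SortedMap<K, V> combined = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : grouped.entrySet()) {
            V result = null;
            for (V value : group.getValue()) {
                // The first value seeds the result; later values are folded in.
                result = (result == null) ? value : combiner.apply(result, value);
            }
            combined.put(group.getKey(), result);
        }
        return combined;
    }

    /** Example: count words by pairing each word with 1, grouping, then summing. */
    static SortedMap<String, Long> countWords(List<String> words) {
        List<Map.Entry<String, Long>> ones = new ArrayList<>();
        for (String word : words) {
            ones.add(new SimpleEntry<>(word, 1L));
        }
        return combineValues(groupByKey(ones), Long::sum);
    }
}
```

For the words b, a, b, `GroupedTableSketch.countWords` returns a map with a -> 1 and b -> 2, with the keys in sorted order.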



In summary, each pipeline concern maps to a Crunch abstraction:

Input Data - Source
Output Data - Target
Data Container Abstraction - PCollection
Data Format and Serialization - POJOs and PTypes
Data Transformation - DoFn
