Bigdata In Our Palm

Working with Apache Crunch - Overview

Apache Crunch is a thin layer on top of MapReduce for building simple and efficient MapReduce pipelines. It gives developers a high-level API for writing and testing complex MapReduce programs that require multiple processing stages. Crunch is modeled after Google's FlumeJava.

What is advantageous about Crunch?

Common operations such as JOIN, AGGREGATION, and SORTING are available in the Crunch API.
Crunch can process data from input sources that conform to different input types, such as serialized object formats, time series data, and HBase rows and columns. It does not impose a single data type.
Crunch provides type safety.
The 'MemPipeline' feature runs a pipeline in memory, which makes testing much easier.
Crunch manages the execution of the pipeline, so the developer does not have to coordinate the individual processing stages by hand.

3 most important Crunch APIs

The Crunch APIs are centered on three immutable, distributed data sets:

PCollection
PTable
PGroupedTable

PCollection - Lazily evaluated parallel collection

PCollection<T> provides an immutable, distributed and unsorted collection of elements of type T.

e.g.: PCollection<String>

PCollection provides a method called parallelDo, which applies the logic in a DoFn to all the elements of the source PCollection in parallel and returns a transformed PCollection as the output. parallelDo allows element-wise computation over an input PCollection.
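To make the parallelDo idea concrete, here is a minimal sketch in plain Java. It does not use the Crunch library: ToyDoFn and this local parallelDo method are hypothetical stand-ins that only mimic the behaviour described above (one function applied independently to every element, with the source collection left unchanged). In real Crunch the elements are processed by distributed tasks, not by a simple loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class ParallelDoSketch {

    // Hypothetical stand-in for a DoFn: it sees one input element and may
    // emit zero, one, or many output elements through the emitter callback.
    public interface ToyDoFn<S, T> {
        void process(S input, Consumer<T> emitter);
    }

    // Plain-Java model of parallelDo: applies fn to each element on its own
    // and returns a new list, leaving the source list untouched.
    public static <S, T> List<T> parallelDo(List<S> source, ToyDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S element : source) {
            fn.process(element, out::add);
        }
        return out;
    }

    // Example function: split each line into words, dropping empty tokens.
    public static List<String> splitWords(List<String> lines) {
        return parallelDo(lines, (line, emitter) -> {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    emitter.accept(word);
                }
            }
        });
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data", "in our palm");
        System.out.println(splitWords(lines)); // prints [big, data, in, our, palm]
    }
}
```

Because each element is handled without reference to any other element, the work can be split across many machines, which is what makes this style of function suitable for a distributed collection.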

PTable - Sub interface of PCollection<Pair<K,V>>

PTable<K,V> provides an immutable, distributed and unsorted multimap of key type K and value type V.

e.g.: PTable<String, String>

In addition to parallelDo, PTable provides groupByKey, join, and cogroup operations.
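As a rough illustration of what a join over two multimaps produces, the following plain-Java sketch models an inner join on common keys. It is not the Crunch API: the pair and join helpers here are hypothetical, the "tables" are ordinary lists of key/value entries, and the nested loop is only for clarity (a distributed engine would bring matching keys together in a shuffle).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class TableJoinSketch {

    // Hypothetical helper: builds one key/value pair of a toy "table".
    public static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Plain-Java model of an inner join between two multimaps: every left
    // value is paired with every right value that shares the same key.
    public static <K, U, V> List<Map.Entry<K, Map.Entry<U, V>>> join(
            List<Map.Entry<K, U>> left, List<Map.Entry<K, V>> right) {
        List<Map.Entry<K, Map.Entry<U, V>>> out = new ArrayList<>();
        for (Map.Entry<K, U> l : left) {
            for (Map.Entry<K, V> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    out.add(pair(l.getKey(), pair(l.getValue(), r.getValue())));
                }
            }
        }
        return out;
    }

    // Example data (made up for illustration): the key "u1" appears twice on
    // the left, which a multimap allows; "u2" and "u3" have no match.
    public static List<String> joinedRows() {
        List<Map.Entry<String, String>> names = Arrays.asList(
                pair("u1", "alice"), pair("u2", "bob"), pair("u1", "al"));
        List<Map.Entry<String, String>> cities = Arrays.asList(
                pair("u1", "paris"), pair("u3", "oslo"));
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Map.Entry<String, String>> e : join(names, cities)) {
            rows.add(e.getKey() + ": " + e.getValue().getKey() + ", " + e.getValue().getValue());
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : joinedRows()) {
            System.out.println(row);
        }
    }
}
```

Running the example prints one row for each matching pair of values under "u1"; the unmatched keys are dropped, as expected for an inner join.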

PGroupedTable - Result of groupByKey function

PGroupedTable<K,V> is a distributed, sorted map of keys of type K to an Iterable of values of type V that may be iterated over only once. In addition to parallelDo, PGroupedTable<K,V> provides the combineValues operation.
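The grouping and combining steps can be sketched in plain Java as follows. This is a model, not Crunch code: the groupByKey and combineValues methods below are hypothetical local helpers, a TreeMap stands in for the sorted map of keys, and a BinaryOperator stands in for the combining function. The sketch assumes the operator is associative, so that values can be folded in any grouping.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

public class GroupedTableSketch {

    // Model of groupByKey: gathers all values that share a key.
    // The TreeMap keeps the keys in sorted order.
    public static <K extends Comparable<K>, V> TreeMap<K, List<V>> groupByKey(
            List<Map.Entry<K, V>> table) {
        TreeMap<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> e : table) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return grouped;
    }

    // Model of combineValues: folds the values of each key into a single
    // value, reading each group's values exactly once.
    public static <K extends Comparable<K>, V> TreeMap<K, V> combineValues(
            TreeMap<K, List<V>> grouped, BinaryOperator<V> combiner) {
        TreeMap<K, V> combined = new TreeMap<>();
        for (Map.Entry<K, List<V>> e : grouped.entrySet()) {
            V acc = null;
            for (V value : e.getValue()) {
                acc = (acc == null) ? value : combiner.apply(acc, value);
            }
            combined.put(e.getKey(), acc);
        }
        return combined;
    }

    // Example: count words by summing a 1 for every occurrence.
    public static TreeMap<String, Integer> wordCounts() {
        List<Map.Entry<String, Integer>> table = new ArrayList<>();
        for (String word : Arrays.asList("palm", "data", "palm", "big")) {
            table.add(new SimpleEntry<>(word, 1));
        }
        return combineValues(groupByKey(table), Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(wordCounts()); // prints {big=1, data=1, palm=2}
    }
}
```

The single pass over each group mirrors the "may be iterated over only once" restriction: the combining function never needs to revisit a value it has already folded in.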


Concept	- In Crunch
Input Data	- Source
Output Data	- Target
Data Container Abstraction	- PCollection
Data Format and Serialization	- POJOs and PTypes
Data Transformation	- DoFn


Monday, 19 January 2015
