Working with Apache Crunch - Overview
Apache Crunch is a thin layer on top of MapReduce for building simple, efficient MapReduce pipelines. It gives developers a high-level API for writing and testing complex MapReduce programs that require multiple processing stages. Crunch is modeled after Google's FlumeJava.
What is advantageous about Crunch?
- Common operations such as join, aggregation, and sorting are available in the Crunch API.
- Crunch can process data from various input sources that conform to different input types, such as serialized object formats, time series data, and HBase rows and columns. It does not impose a single data type.
- Crunch provides type safety.
- MemPipeline, an in-memory pipeline implementation, makes testing much easier.
- Crunch manages pipeline execution, planning the processing stages into MapReduce jobs so that they do not have to be chained by hand.
3 most important Crunch APIs
The Crunch API is centered around three immutable, distributed data sets:
- PCollection
- PTable
- PGroupedTable
PCollection - Lazily evaluated parallel collection
PCollection<T> provides an immutable, distributed and unsorted collection of elements of type T.
e.g.: PCollection<String>
PCollection provides a method called parallelDo, which applies the logic in a DoFn in parallel to all the elements of the source PCollection and returns a transformed PCollection as the output. parallelDo allows element-wise computation over an input PCollection.
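The element-wise behaviour described above can be modeled in plain Java. This is only a sketch of the semantics, not the Crunch API: the `Fn` and `Emitter` interfaces and the `tokenize` helper are hypothetical stand-ins, and a `List` takes the place of a distributed collection. It shows that the function sees one element at a time, may emit zero or more outputs per element, and leaves the input collection untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of parallelDo semantics; this is NOT the Crunch API.
class ParallelDoModel {
    // Hypothetical output sink: a function may emit zero or more values per input.
    interface Emitter<T> {
        void emit(T value);
    }

    // Hypothetical per-element function, playing the role of a DoFn.
    interface Fn<S, T> {
        void process(S input, Emitter<T> emitter);
    }

    // Applies fn to each element independently and returns a new list,
    // leaving the input list unmodified.
    static <S, T> List<T> parallelDo(List<S> input, Fn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S element : input) {
            fn.process(element, out::add);
        }
        return out;
    }

    // Example function: split each line into lowercase words, skipping blanks.
    static List<String> tokenize(List<String> lines) {
        return parallelDo(lines, (line, emitter) -> {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    emitter.emit(word.toLowerCase());
                }
            }
        });
    }

    public static void main(String[] args) {
        // prints [apache, crunch, map, reduce]
        System.out.println(tokenize(Arrays.asList("Apache Crunch", "", "Map Reduce")));
    }
}
```

Because each element is processed without reference to any other, a real pipeline is free to spread the work across many tasks; the loop here is sequential only to keep the model small.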
PTable - Sub-interface of PCollection<Pair<K,V>>
PTable<K,V> provides an immutable, distributed and unsorted multimap of key type K and value type V.
e.g.: PTable<String, String>
PTable provides the parallelDo, groupByKey, join, and cogroup operations.
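To make the multimap idea concrete, here is a plain-Java model of joining two tables on their keys. It is a sketch only and does not use the Crunch API: the tables are lists of key/value pairs, the `pair` helper is hypothetical, and the join is assumed to be an inner join (keys present on only one side are dropped).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map.Entry;

// Plain-Java model of joining two multimaps; this is NOT the Crunch API.
class JoinModel {
    // Hypothetical helper to build a key/value pair.
    static <K, V> Entry<K, V> pair(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Inner join: one output row per matching (left, right) combination.
    // Duplicate keys are allowed on both sides, as in a multimap.
    static <K, V, U> List<Entry<K, Entry<V, U>>> join(
            List<Entry<K, V>> left, List<Entry<K, U>> right) {
        List<Entry<K, Entry<V, U>>> out = new ArrayList<>();
        for (Entry<K, V> l : left) {
            for (Entry<K, U> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    Entry<V, U> values = pair(l.getValue(), r.getValue());
                    out.add(pair(l.getKey(), values));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry<String, String>> names =
                Arrays.asList(pair("u1", "Ann"), pair("u2", "Bob"));
        List<Entry<String, String>> orders =
                Arrays.asList(pair("u1", "book"), pair("u1", "pen"), pair("u3", "ink"));
        // "u1" matches twice; "u2" and "u3" have no partner and are dropped.
        for (Entry<String, Entry<String, String>> row : join(names, orders)) {
            System.out.println(row.getKey() + " -> "
                    + row.getValue().getKey() + ", " + row.getValue().getValue());
        }
    }
}
```

The nested loop is quadratic and only suitable for illustration; a distributed join would instead bring matching keys together before pairing their values.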
PGroupedTable - Result of groupByKey function
PGroupedTable<K,V> is a distributed, sorted map from keys of type K to an Iterable of values of type V that may be iterated only once. PGroupedTable<K,V> provides the parallelDo and combineValues operations.
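The grouping step and the value-combining step that follows it can be modeled in plain Java as a word count. This is a sketch of the semantics under stated assumptions, not the Crunch API: a `TreeMap` stands in for the sorted, distributed map, and the method names simply mirror the operations discussed here. One difference to keep in mind is that the `List` values in this model can be traversed repeatedly, whereas the grouped values in a real pipeline may be iterated only once.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Plain-Java model of grouping by key and combining values; NOT the Crunch API.
class GroupByKeyModel {
    // Collects all values that share a key; TreeMap keeps the keys sorted.
    static <K extends Comparable<K>, V> TreeMap<K, List<V>> groupByKey(
            List<Map.Entry<K, V>> table) {
        TreeMap<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> row : table) {
            grouped.computeIfAbsent(row.getKey(), k -> new ArrayList<>())
                   .add(row.getValue());
        }
        return grouped;
    }

    // Reduces each key's values to a single value with an associative combiner.
    static <K, V> TreeMap<K, V> combineValues(
            TreeMap<K, List<V>> grouped, BinaryOperator<V> combiner) {
        TreeMap<K, V> combined = new TreeMap<>(grouped.comparator());
        for (Map.Entry<K, List<V>> group : grouped.entrySet()) {
            group.getValue().stream()
                 .reduce(combiner)
                 .ifPresent(v -> combined.put(group.getKey(), v));
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> table = new ArrayList<>();
        table.add(new SimpleEntry<>("hadoop", 1L));
        table.add(new SimpleEntry<>("crunch", 1L));
        table.add(new SimpleEntry<>("crunch", 1L));

        TreeMap<String, List<Long>> grouped = groupByKey(table);
        System.out.println(grouped);                           // {crunch=[1, 1], hadoop=[1]}
        System.out.println(combineValues(grouped, Long::sum)); // {crunch=2, hadoop=1}
    }
}
```

The combiner is required to be associative in this model so that partial results could, in principle, be merged in any grouping, which is what makes combining before the full shuffle possible.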
Concept | Crunch abstraction
Input Data | Source
Output Data | Target
Data Container Abstraction | PCollection
Data Format and Serialization | POJOs and PTypes
Data Transformation | DoFn