Wednesday 28 January 2015

Word Count Program using Crunch

WordCount Class

import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineResult;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * A word count example for Apache Crunch.
 */
public class WordCount extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(exitCode);
    }

    @Override
    public int run(String[] args) throws Exception {

        if (args.length != 2) {
            System.err.println("Usage: <input path> <output path>");
            GenericOptionsParser.printGenericCommandUsage(System.err);
            return 1;
        }

        String inputPath = args[0];
        String outputPath = args[1];

        // Create an object to coordinate pipeline creation and execution.
        Pipeline pipeline = new MRPipeline(WordCount.class, getConf());

        PCollection<String> lines = pipeline.readTextFile(inputPath);

        // Define a function that splits each line in a PCollection of Strings
        // into a PCollection made up of the individual words in the file.
        // The second argument sets the serialization format.
        PCollection<String> words = lines.parallelDo(new Tokenizer(),
                Writables.strings());

        // The count method applies a series of Crunch primitives and returns
        // a map of the unique words in the input PCollection to their counts.
        PTable<String, Long> counts = words.count();

        // Instruct the pipeline to write the resulting counts to a text file.
        pipeline.writeTextFile(counts, outputPath);

        // Execute the pipeline as MapReduce jobs and report success or failure.
        PipelineResult result = pipeline.done();
        return result.succeeded() ? 0 : 1;
    }
}



Tokenizer Class


import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

/**
 * Splits a line of text, filtering known stop words.
 */
public class Tokenizer extends DoFn<String, String> {

    // Tokenize the line using pipe as the delimiter.
    @Override
    public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\|")) {
            emitter.emit(word);
        }
    }
}
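One subtlety of the split call above is that String.split drops trailing empty strings by default, so a line ending in a delimiter emits fewer tokens than you might expect. A minimal plain-Java sketch (the class and method names here are mine, for illustration only):

```java
import java.util.Arrays;

public class TokenizerDemo {
    // Same split logic as Tokenizer: pipe-delimited fields.
    // Note: String.split discards trailing empty strings by default.
    public static String[] tokenize(String line) {
        return line.split("\\|");
    }

    public static void main(String[] args) {
        // Prints [hadoop, crunch] -- the trailing empty field is dropped.
        System.out.println(Arrays.toString(tokenize("hadoop|crunch|")));
    }
}
```

If empty fields matter for your data, pass a negative limit, e.g. line.split("\\|", -1).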

Monday 19 January 2015

Working with Apache Crunch - Overview

Apache Crunch is a thin veneer on top of MapReduce for developing simple, efficient data pipelines. It gives developers a high-level API for writing and testing complex MapReduce programs that require multiple processing stages. Crunch is modeled after Google's FlumeJava.

What is advantageous about Crunch?

  • Common operations such as joins, aggregations, and sorting are built into the Crunch API.
  • Crunch can process data from various input sources that conform to different input types, such as serialized object formats, time series data, and HBase rows and columns. It does not impose a single data type.
  • Crunch provides type safety.
  • The in-memory 'MemPipeline' implementation makes unit testing much easier.
  • Crunch manages pipeline execution efficiently, planning the processing stages into as few MapReduce jobs as possible.
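To illustrate the MemPipeline point, here is a minimal sketch of testing a counting stage entirely in the local JVM, with no cluster or MapReduce job. It assumes the Crunch jars are on the classpath; the class and method names are mine:

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

import java.util.Map;

public class MemPipelineTest {
    // Runs the count stage entirely in memory -- no cluster required.
    public static Map<String, Long> countWords(String... words) {
        PCollection<String> coll =
                MemPipeline.typedCollectionOf(Writables.strings(), words);
        PTable<String, Long> counts = coll.count();
        // materializeToMap pulls the results back into a plain java.util.Map.
        return counts.materializeToMap();
    }

    public static void main(String[] args) {
        System.out.println(countWords("crunch", "hadoop", "crunch"));
    }
}
```

Because MemPipeline evaluates eagerly in-process, such tests run in milliseconds and can use ordinary assertions.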


3 most important Crunch APIs

The Crunch APIs are centered on three immutable, distributed data set abstractions.
  1. PCollection
  2. PTable
  3. PGroupedTable 


PCollection - Lazily evaluated parallel collection

PCollection<T> provides an immutable, distributed and unsorted collection of elements of type T.

          e.g.: PCollection<String>


PCollection provides a method called parallelDo, which applies the logic in a DoFn, in parallel, to every element of the source PCollection and returns a transformed PCollection as output. parallelDo thus enables element-wise transformation of an input PCollection.
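A small sketch of parallelDo, run through MemPipeline so it works without a cluster (Crunch jars on the classpath assumed; the demo class and method names are mine):

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

import java.util.ArrayList;
import java.util.List;

public class ParallelDoDemo {
    // parallelDo applies the DoFn to each element of the source PCollection.
    public static List<String> upperCase(String... inputs) {
        PCollection<String> source =
                MemPipeline.typedCollectionOf(Writables.strings(), inputs);
        PCollection<String> upper = source.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String s, Emitter<String> emitter) {
                // One input element may emit zero, one, or many outputs;
                // here we emit exactly one.
                emitter.emit(s.toUpperCase());
            }
        }, Writables.strings());

        List<String> result = new ArrayList<String>();
        for (String s : upper.materialize()) {
            result.add(s);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(upperCase("crunch", "hadoop"));
    }
}
```

The second argument to parallelDo (here Writables.strings()) tells Crunch how to serialize the output elements.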


PTable - Subinterface of PCollection<Pair<K,V>>

PTable<K,V> provides an immutable, distributed and unsorted multimap of key type K and value type V.

          e.g.: PTable<String, String>

PTable provides the parallelDo, groupByKey, join, and cogroup operations.
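A sketch of join on two in-memory PTables sharing a String key (Crunch jars on the classpath assumed; the names and sample data are mine):

```java
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

import java.util.Map;

public class JoinDemo {
    // Inner-joins two PTables on their shared String key.
    public static Map<String, Pair<String, Long>> joinTables() {
        PTable<String, String> cities = MemPipeline.typedTableOf(
                Writables.tableOf(Writables.strings(), Writables.strings()),
                "alice", "pune", "bob", "delhi");
        PTable<String, Long> ages = MemPipeline.typedTableOf(
                Writables.tableOf(Writables.strings(), Writables.longs()),
                "alice", 30L, "bob", 25L);

        // The joined value type is a Pair of the two value types.
        return cities.join(ages).materializeToMap();
    }

    public static void main(String[] args) {
        System.out.println(joinTables());
    }
}
```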


PGroupedTable - Result of groupByKey function

PGroupedTable<K,V> is a distributed, sorted map of keys of type K to an Iterable of values that may be iterated over once. PGroupedTable provides the parallelDo and combineValues operations.
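A sketch of the groupByKey/combineValues flow, again using MemPipeline (Crunch jars on the classpath assumed; the demo class name and sample data are mine):

```java
import org.apache.crunch.PTable;
import org.apache.crunch.fn.Aggregators;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

import java.util.Map;

public class CombineValuesDemo {
    // groupByKey turns the PTable into a PGroupedTable; combineValues
    // then folds each key's values down with an Aggregator.
    public static Map<String, Long> sumByKey() {
        PTable<String, Long> raw = MemPipeline.typedTableOf(
                Writables.tableOf(Writables.strings(), Writables.longs()),
                "crunch", 1L, "hadoop", 2L, "crunch", 3L);
        return raw.groupByKey()
                  .combineValues(Aggregators.SUM_LONGS())
                  .materializeToMap();
    }

    public static void main(String[] args) {
        System.out.println(sumByKey());
    }
}
```

On a real cluster, combineValues is what lets Crunch run the aggregation as a MapReduce combiner as well as in the reducer.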



Input Data - Source
Output Data - Target
Data Container Abstraction - PCollection
Data Format and Serialization - POJOs and PTypes
Data Transformation - DoFn

Friday 16 January 2015

'Bigdata In Our Palm' is a technology blog where I share the latest trends in the big data industry. It is also a forum where I discuss the various challenges I have faced working with the Hadoop ecosystem.