Wednesday 28 January 2015

Word Count Program using Crunch

WordCount Class

import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineResult;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * A word count example for Apache Crunch.
 */
public class WordCount extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(exitCode);
    }

    @Override
    public int run(String[] args) throws Exception {

        if (args.length != 2) {
            System.err.println("Usage: <input path> <output path>");
            GenericOptionsParser.printGenericCommandUsage(System.err);
            return 1;
        }

        String inputPath = args[0];
        String outputPath = args[1];

        // Create an object to coordinate pipeline creation and execution.
        Pipeline pipeline = new MRPipeline(WordCount.class, getConf());

        PCollection<String> lines = pipeline.readTextFile(inputPath);

        // Define a function that splits each line in a PCollection of Strings
        // into a PCollection made up of the individual words in the file.
        // The second argument sets the serialization format.
        PCollection<String> words = lines.parallelDo(new Tokenizer(),
                Writables.strings());

        // The count method applies a series of Crunch primitives and returns
        // a map of the unique words in the input PCollection to their counts.
        PTable<String, Long> counts = words.count();

        // Instruct the pipeline to write the resulting counts to a text file.
        pipeline.writeTextFile(counts, outputPath);

        // Execute the pipeline as MapReduce jobs and report success or failure.
        PipelineResult result = pipeline.done();
        return result.succeeded() ? 0 : 1;
    }
}



Tokenizer Class


import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

/**
 * Splits a line of text, filtering known stop words.
 */
public class Tokenizer extends DoFn<String, String> {

    // Tokenize the line using pipe as the delimiter.
    @Override
    public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\|")) {
            emitter.emit(word);
        }
    }
}
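One subtlety of the split call above is that String.split drops trailing empty strings by default, so a line ending in a delimiter emits fewer tokens than you might expect. A minimal plain-Java sketch (the class and method names here are mine, for illustration only):

```java
import java.util.Arrays;

public class TokenizerDemo {
    // Same split logic as Tokenizer: pipe-delimited fields.
    // Note: String.split discards trailing empty strings by default.
    public static String[] tokenize(String line) {
        return line.split("\\|");
    }

    public static void main(String[] args) {
        // Prints [hadoop, crunch] -- the trailing empty field is dropped.
        System.out.println(Arrays.toString(tokenize("hadoop|crunch|")));
    }
}
```

If empty fields matter for your data, pass a negative limit, e.g. line.split("\\|", -1).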

Monday 19 January 2015

Working with Apache Crunch - Overview

Apache Crunch is a thin veneer on top of MapReduce for developing simple, efficient data pipelines. It gives developers a high-level API for writing and testing complex MapReduce programs that require multiple processing stages. Crunch is modeled after Google's FlumeJava.

What is advantageous about Crunch?

  • Common operations such as joins, aggregations, and sorting are built into the Crunch API.
  • Crunch can process data from various input sources that conform to different input types, such as serialized object formats, time series data, and HBase rows and columns. It does not impose a single data type.
  • Crunch provides type safety.
  • The in-memory 'MemPipeline' implementation makes unit testing much easier.
  • Crunch manages pipeline execution efficiently, planning the processing stages into as few MapReduce jobs as possible.
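To illustrate the MemPipeline point, here is a minimal sketch of testing a counting stage entirely in the local JVM, with no cluster or MapReduce job. It assumes the Crunch jars are on the classpath; the class and method names are mine:

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

import java.util.Map;

public class MemPipelineTest {
    // Runs the count stage entirely in memory -- no cluster required.
    public static Map<String, Long> countWords(String... words) {
        PCollection<String> coll =
                MemPipeline.typedCollectionOf(Writables.strings(), words);
        PTable<String, Long> counts = coll.count();
        // materializeToMap pulls the results back into a plain java.util.Map.
        return counts.materializeToMap();
    }

    public static void main(String[] args) {
        System.out.println(countWords("crunch", "hadoop", "crunch"));
    }
}
```

Because MemPipeline evaluates eagerly in-process, such tests run in milliseconds and can use ordinary assertions.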


3 most important Crunch APIs

The Crunch APIs are centered on three immutable, distributed data set abstractions.
  1. PCollection
  2. PTable
  3. PGroupedTable 


PCollection - Lazily evaluated parallel collection

PCollection<T> provides an immutable, distributed and unsorted collection of elements of type T.

          e.g.: PCollection<String>


PCollection provides a method called parallelDo, which applies the logic in a DoFn, in parallel, to every element of the source PCollection and returns a transformed PCollection as output. parallelDo thus enables element-wise transformation of an input PCollection.
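A small sketch of parallelDo, run through MemPipeline so it works without a cluster (Crunch jars on the classpath assumed; the demo class and method names are mine):

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

import java.util.ArrayList;
import java.util.List;

public class ParallelDoDemo {
    // parallelDo applies the DoFn to each element of the source PCollection.
    public static List<String> upperCase(String... inputs) {
        PCollection<String> source =
                MemPipeline.typedCollectionOf(Writables.strings(), inputs);
        PCollection<String> upper = source.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String s, Emitter<String> emitter) {
                // One input element may emit zero, one, or many outputs;
                // here we emit exactly one.
                emitter.emit(s.toUpperCase());
            }
        }, Writables.strings());

        List<String> result = new ArrayList<String>();
        for (String s : upper.materialize()) {
            result.add(s);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(upperCase("crunch", "hadoop"));
    }
}
```

The second argument to parallelDo (here Writables.strings()) tells Crunch how to serialize the output elements.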


PTable - Subinterface of PCollection<Pair<K,V>>

PTable<K,V> provides an immutable, distributed and unsorted multimap of key type K and value type V.

          e.g.: PTable<String, String>

PTable provides the parallelDo, groupByKey, join, and cogroup operations.
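A sketch of join on two in-memory PTables sharing a String key (Crunch jars on the classpath assumed; the names and sample data are mine):

```java
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

import java.util.Map;

public class JoinDemo {
    // Inner-joins two PTables on their shared String key.
    public static Map<String, Pair<String, Long>> joinTables() {
        PTable<String, String> cities = MemPipeline.typedTableOf(
                Writables.tableOf(Writables.strings(), Writables.strings()),
                "alice", "pune", "bob", "delhi");
        PTable<String, Long> ages = MemPipeline.typedTableOf(
                Writables.tableOf(Writables.strings(), Writables.longs()),
                "alice", 30L, "bob", 25L);

        // The joined value type is a Pair of the two value types.
        return cities.join(ages).materializeToMap();
    }

    public static void main(String[] args) {
        System.out.println(joinTables());
    }
}
```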


PGroupedTable - Result of groupByKey function

PGroupedTable<K,V> is a distributed, sorted map of keys of type K to an Iterable of values that may be iterated over once. PGroupedTable provides the parallelDo and combineValues operations.
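A sketch of the groupByKey/combineValues flow, again using MemPipeline (Crunch jars on the classpath assumed; the demo class name and sample data are mine):

```java
import org.apache.crunch.PTable;
import org.apache.crunch.fn.Aggregators;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

import java.util.Map;

public class CombineValuesDemo {
    // groupByKey turns the PTable into a PGroupedTable; combineValues
    // then folds each key's values down with an Aggregator.
    public static Map<String, Long> sumByKey() {
        PTable<String, Long> raw = MemPipeline.typedTableOf(
                Writables.tableOf(Writables.strings(), Writables.longs()),
                "crunch", 1L, "hadoop", 2L, "crunch", 3L);
        return raw.groupByKey()
                  .combineValues(Aggregators.SUM_LONGS())
                  .materializeToMap();
    }

    public static void main(String[] args) {
        System.out.println(sumByKey());
    }
}
```

On a real cluster, combineValues is what lets Crunch run the aggregation as a MapReduce combiner as well as in the reducer.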



Input Data - Source
Output Data - Target
Data Container Abstraction - PCollection
Data Format and Serialization - POJOs and PTypes
Data Transformation - DoFn

Friday 16 January 2015

'Bigdata In Our Palm' is a technology blog where I share the latest trends in the big data industry. It is also a forum where I discuss the various challenges I have faced working with the Hadoop ecosystem.