Tuesday, August 19, 2014

Pig - Getting started.


Pig Basics

Pig Latin is a data flow language. This means it allows users to describe how data from
one or more inputs should be read, processed, and then stored to one or more outputs
in parallel.

#1


A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, salary:float);
B = FOREACH A GENERATE name, age, salary;
DUMP B;

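·         LOAD reads the input files from the file system (HDFS when Pig runs on a Hadoop cluster) and applies the schema given in the AS clause.
·         FOREACH ... GENERATE processes every row of a relation and selects or computes the columns to keep.
·         DUMP runs the script and prints the resulting relation to the console, with the columns chosen in the FOREACH statement.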
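
As a small variation on #1, the sketch below shows FOREACH computing new columns instead of only copying existing ones. It assumes the same 'student' file; the aliases next_age and yearly_salary are made up for illustration, and treating salary as a monthly figure is only an assumption.

```pig
-- Assumes the same tab-delimited 'student' file as in #1.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, salary:float);

-- GENERATE can compute expressions as well as project existing columns.
-- next_age and yearly_salary are hypothetical names for this example.
B = FOREACH A GENERATE name, age + 1 AS next_age, salary * 12 AS yearly_salary;

-- DESCRIBE prints the schema of B without running a job.
DESCRIBE B;

-- DUMP triggers execution and prints the rows of B.
DUMP B;
```

DESCRIBE is a quick way to check column names and types before running a potentially long job with DUMP or STORE.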

#2

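A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, salary:float);
B = FOREACH A GENERATE name;
STORE B INTO '/usr/log/result.log';

·         STORE writes the result to the specified location. Pig creates a directory at that path and writes the output as part files inside it, so the path must not already exist.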
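
STORE also accepts a USING clause, in the same way as LOAD. The sketch below writes two columns as comma-separated text; the output path 'student_names' is a hypothetical name chosen for this example.

```pig
-- Assumes the same 'student' file as in #2.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, salary:float);
B = FOREACH A GENERATE name, age;

-- PigStorage(',') writes each row as comma-separated fields
-- instead of the default tab-separated format.
-- 'student_names' is a hypothetical output directory.
STORE B INTO 'student_names' USING PigStorage(',');
```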

#3


A = LOAD 'myfile.csv' USING PigStorage(',') AS (t, u, v);
B = GROUP A BY t;
C = FOREACH B GENERATE group, COUNT(A.t) AS mycount;
D = ORDER C BY mycount;
E = LIMIT D 100;
STORE E INTO 'mysortedcount' USING PigStorage();

·         In the first statement, each row of 'myfile.csv' is split on ',' and loaded into relation A.
·         To extract the unique values of a column in a relation, you can use DISTINCT or GROUP BY followed by FOREACH ... GENERATE.
·         LIMIT returns at most the specified number of records. When it directly follows ORDER BY, it returns the first records in that sort order.
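
The two ways of getting unique values mentioned above can be sketched as follows, using the same 'myfile.csv' input. The relation names T, U1, U2 and G are arbitrary. The last part sorts with DESC, which is an assumption about the intent: it returns the 100 most frequent values of t, whereas the default ascending order used in #3 returns the 100 least frequent.

```pig
A = LOAD 'myfile.csv' USING PigStorage(',') AS (t, u, v);

-- Option 1: project the column, then remove duplicate rows.
T  = FOREACH A GENERATE t;
U1 = DISTINCT T;

-- Option 2: group by the column and keep only the group key.
G  = GROUP A BY t;
U2 = FOREACH G GENERATE group AS t;

-- Count the rows per value of t, sort by count (largest first),
-- and keep the first 100 records of that ordering.
C = FOREACH G GENERATE group AS t, COUNT(A) AS mycount;
D = ORDER C BY mycount DESC;
E = LIMIT D 100;
DUMP E;
```

GROUP is the better fit when a count or another aggregate is needed per value. DISTINCT is shorter when only the unique values themselves are wanted.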




