PIG Basics
Pig Latin is a data flow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel.
#1:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, salary:float);
B = FOREACH A GENERATE name, age, salary;
DUMP B;
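· The LOAD statement reads the input file from HDFS.
· The FOREACH statement works on the columns of each record; here GENERATE selects which columns to keep.
· DUMP runs the script and prints the result to the console, with the columns listed in the FOREACH statement.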
#2
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, salary:float);
B = FOREACH A GENERATE name;
STORE B INTO '/usr/log/result.log';
· The STORE statement writes the result to the specified location instead of printing it.
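#3
-- Load the comma-separated file; t, u and v are untyped (bytearray) fields.
A = LOAD 'myfile.csv' USING PigStorage(',') AS (t, u, v);
-- Count the rows for each value of t.
B = GROUP A BY t;
C = FOREACH B GENERATE group, COUNT(A.t) AS mycount;
-- Sort by count (ascending by default) and keep the first 100 rows.
D = ORDER C BY mycount;
E = LIMIT D 100;
STORE E INTO 'mysortedcount' USING PigStorage();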
· In the first statement, each row of "myfile.csv" is split on ',' and loaded into relation A.
· To extract the unique values of a column in a relation, use DISTINCT, or GROUP BY followed by FOREACH ... GENERATE.
· LIMIT returns the specified number of records. Applied after ORDER BY, it returns the first records in that sort order.
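
The two ways of extracting unique values mentioned above can be sketched as follows. This is only a minimal sketch: it assumes the same 'student' input as example #1, and the relation names (names, uniq_names, by_name, uniq_names2) are made up for illustration.

```pig
-- Assumption: the same 'student' input and schema as in example #1.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, salary:float);

-- Option 1: project the column, then drop duplicate tuples with DISTINCT.
names = FOREACH A GENERATE name;
uniq_names = DISTINCT names;

-- Option 2: group by the column and keep only the group key.
by_name = GROUP A BY name;
uniq_names2 = FOREACH by_name GENERATE group AS name;

DUMP uniq_names;
DUMP uniq_names2;
```

DISTINCT compares whole tuples, which is why the column is projected first. The GROUP BY form is the more convenient one when a per-value aggregate such as COUNT is needed as well.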