Tuesday, August 19, 2014

Pig - Getting started.


Pig Basics

Pig Latin is a data flow language. It lets users describe how data from
one or more inputs should be read, processed, and then stored to one or more outputs
in parallel.

#1


A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, salary:float);
B = FOREACH A GENERATE name, age, salary;
DUMP B;

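·         LOAD reads the input files from HDFS. PigStorage() with no argument expects tab-delimited text, and the AS clause names and types the fields.
·         FOREACH ... GENERATE works on columns: it is applied to every row and emits the listed fields.
·         DUMP runs the script and prints the resulting tuples to the console.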
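
To make the three statements concrete, here is a rough Python sketch of what the script computes. It is only an analogy, not how Pig executes the script (Pig compiles it into distributed jobs). The sample rows are made up, the helper names load and foreach_generate are hypothetical, and the input is assumed to be tab-delimited because that is the default for PigStorage().

```python
# Rough Python analogue of script #1 (illustrative only).
SAMPLE = "john\t21\t1500.5\nmary\t25\t2300.0\n"  # made-up, tab-delimited input

def load(text):
    """Mimic LOAD ... AS (name:chararray, age:int, salary:float)."""
    rows = []
    for line in text.splitlines():
        name, age, salary = line.split("\t")
        rows.append((name, int(age), float(salary)))
    return rows

def foreach_generate(rows, *indexes):
    """Mimic FOREACH ... GENERATE by projecting the chosen columns of every row."""
    return [tuple(row[i] for i in indexes) for row in rows]

A = load(SAMPLE)
B = foreach_generate(A, 0, 1, 2)   # GENERATE name, age, salary
for row in B:                      # DUMP B
    print(row)
```

Passing a single index, as in foreach_generate(A, 0), corresponds to GENERATE name alone.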

#2

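A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, salary:float);
B = FOREACH A GENERATE name;
STORE B INTO '/usr/log/result.log';

·         STORE writes the result to the given location. Pig creates that path as a directory of part files, so '/usr/log/result.log' ends up as a directory rather than a single log file.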

#3


A = LOAD 'myfile.csv' USING PigStorage(',') AS (t, u, v);
B = GROUP A BY t;
C = FOREACH B GENERATE group, COUNT(A.t) AS mycount;
D = ORDER C BY mycount;
E = LIMIT D 100;
STORE E INTO 'mysortedcount' USING PigStorage();

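·         In statement A, each line of 'myfile.csv' is split on ',' and loaded into the fields t, u, and v.
·         To extract the unique values of a column, use either DISTINCT or GROUP BY followed by FOREACH ... GENERATE group (as in B and C above).
·         LIMIT keeps the first N rows (100 here) of the relation in the order set by ORDER BY. ORDER BY sorts ascending by default, so add DESC to get the highest counts first.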
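
The group, count, order, and limit steps above can be sketched in plain Python as follows. This is an analogy for what the script computes, not Pig's execution model. The sample rows and the function names sorted_counts and distinct_t are made up for illustration.

```python
from collections import Counter

# Made-up comma-separated rows standing in for myfile.csv (columns t, u, v).
SAMPLE = "a,1,x\nb,2,y\na,3,z\nc,4,w\na,5,q\nb,6,r\n"

def sorted_counts(text, limit):
    """Count rows per value of t, sort by count ascending, keep the first `limit`."""
    rows = [line.split(",") for line in text.splitlines()]   # LOAD ... PigStorage(',')
    counts = Counter(row[0] for row in rows)                  # GROUP A BY t, then COUNT
    ordered = sorted(counts.items(), key=lambda kv: kv[1])    # ORDER C BY mycount
    return ordered[:limit]                                    # LIMIT D <limit>

def distinct_t(text):
    """Unique values of t, like DISTINCT applied to a single projected column."""
    return sorted({line.split(",")[0] for line in text.splitlines()})

print(sorted_counts(SAMPLE, 2))
print(distinct_t(SAMPLE))
```

Because the sort is ascending, the limit keeps the least frequent values. Sorting with reverse=True would correspond to ORDER C BY mycount DESC.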





Monday, August 18, 2014

Hive - Create Database and table in Hive

Hive is a Hadoop component that provides a SQL-like interface to data stored in HDFS, adding data warehousing facilities on top of HDFS.
HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster. For anyone with a SQL or relational database background, this section will look very familiar. As with any database management system (DBMS), you can run your Hive queries in many ways.
Create Database syntax:

CREATE DATABASE IF NOT EXISTS <dbname>
COMMENT 'Holds all db tables'
LOCATION '/lib/warehouse/sample'
WITH DBPROPERTIES ('Use' = 'Demos', 'SchemaInfo' = 'db schema information');

·         IF NOT EXISTS is useful for scripts that should create a database on the fly only if it does not already exist.
·         LOCATION overrides the default directory in which the database is stored.

Create Table syntax:


CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
addr STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
COMMENT 'Description of the table'
PARTITIONED BY (country STRING, state STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/mydb.db/employees'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02', ...);

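·         STRING, FLOAT, ARRAY, MAP, and STRUCT are some of the available data types.
·         addr is a STRUCT: a fixed set of named, typed fields, accessed with dot notation (for example addr.city).
·         deductions is a MAP of key-value pairs, with STRING keys and FLOAT values.
·         subordinates is an ARRAY<STRING>, so every item in it is a string.
·         With FIELDS TERMINATED BY ',', columns are separated by commas, so the data files are comma-delimited text, similar to CSV.
·         LINES TERMINATED BY '\n' means each row ends with a newline.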
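
To show how the delimiters map one line of a text file onto the columns above, here is a small Python sketch. It is an illustration under stated assumptions, not Hive's actual reader: the sample row and the function name parse_employee are made up, and it assumes struct fields are separated by the same collection-item delimiter as array items, which is how Hive's delimited text format lays them out.

```python
FIELD, ITEM, KEY = ",", "\x02", "\x03"   # delimiters from the CREATE TABLE statement

def parse_employee(line):
    """Split one text row into the five non-partition columns of `employees`."""
    name, salary, subs, deds, addr = line.rstrip("\n").split(FIELD)
    street, city, state, zip_code = addr.split(ITEM)
    deductions = {}
    if deds:
        for pair in deds.split(ITEM):        # each map entry is key \003 value
            key, value = pair.split(KEY)
            deductions[key] = float(value)
    return {
        "name": name,
        "salary": float(salary),
        "subordinates": subs.split(ITEM) if subs else [],
        "deductions": deductions,
        "addr": {"street": street, "city": city, "state": state, "zip": int(zip_code)},
    }

# Made-up sample row. The country/state partition values are not stored in the
# data file; Hive takes them from the partition directory names.
row = FIELD.join([
    "John Doe",
    "100000.0",
    ITEM.join(["Mary Smith", "Todd Jones"]),
    ITEM.join(["Federal Taxes" + KEY + "0.2", "State Taxes" + KEY + "0.05"]),
    ITEM.join(["1 Michigan Ave.", "Chicago", "IL", "60600"]),
])
print(parse_employee(row))
```

A comma inside any value would break this split, which is one reason Hive's default field delimiter is the non-printing character '\001' rather than ','.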