Hadoop - Pig

Ambari

  • Ambari provides a web dashboard for the cluster.
  • Enable the admin user (SSH into the machine, switch to root, then reset the Ambari admin password):

ssh [email protected] -p 2222
su root
ambari-admin-password-reset

Pig Concept

  • Ways to run Pig:
  1. Grunt (the interactive shell)
  2. Script (a saved Pig Latin file)
  3. Ambari (the Pig View in the web UI)

Example: find the oldest 5-star movie (here, a movie whose average rating is above 4.0)

  • Create a new script in the Pig View
  • Load data:

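-- u.data is tab-delimited, which is PigStorage's default, so no USING clause is needed.
ratings = LOAD 'ml-100k/u.data' AS (userID:int, movieID:int, rating:int, ratingTime:int);

-- u.item is pipe-delimited; only its first five fields are named here.
metadata = LOAD 'ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRelease:chararray, imdbLink:chararray);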
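Before building on these relations, it can help to confirm that they loaded with the intended schema. A minimal sketch, assuming the two LOAD statements above have already run in the same script or Grunt session; the alias ratingsSample is made up for illustration and is not part of the original script:

```pig
-- Print the schema Pig has attached to each relation.
DESCRIBE ratings;
DESCRIBE metadata;

-- Look at a handful of rows instead of dumping the whole file.
-- ratingsSample is a hypothetical alias used only for this check.
ratingsSample = LIMIT ratings 5;
DUMP ratingsSample;
```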

  • FOREACH/GENERATE:

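-- movieID must be kept because the later JOIN is keyed on it.
-- Release dates in u.item look like 01-Jan-1995, hence the 'dd-MMM-yyyy' pattern.
nameLookup = FOREACH metadata GENERATE movieID, movieTitle, ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;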

  • Group By

ratingsByMovie = GROUP ratings BY movieID;
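GROUP produces one row per key: a field named group holding the movieID, plus a bag named after the input relation (ratings) that contains every rating row for that movie. A small sketch for inspecting that shape, assuming ratingsByMovie from the line above; groupSample is a hypothetical alias:

```pig
-- Show the nested schema: the group key plus a bag of rating tuples.
DESCRIBE ratingsByMovie;

-- Print a single group to see the bag contents.
-- groupSample is a hypothetical alias used only for this check.
groupSample = LIMIT ratingsByMovie 1;
DUMP groupSample;
```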

  • Return Result:

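-- Average rating per movie; "group" holds the movieID key of each group.
avgRatings = FOREACH ratingsByMovie GENERATE group AS movieID, AVG(ratings.rating) AS avgRating;
-- Keep movies whose average rating is above 4.0.
fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;
-- Attach the title and release time to each remaining movie.
fiveStarsWithData = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;
-- Sort ascending by release time, so the oldest movie comes first.
oldestFiveStarMovie = ORDER fiveStarsWithData BY nameLookup::releaseTime;

DUMP oldestFiveStarMovie;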
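
The dump above prints every qualifying movie, sorted oldest first. To answer the question with a single row, the sorted relation can be cut down to its first entry. A sketch, assuming the oldestFiveStarMovie alias above; oldestOnly is a hypothetical alias:

```pig
-- Keep only the first row of the sorted relation, i.e. the oldest movie.
-- oldestOnly is a hypothetical alias, not part of the original script.
oldestOnly = LIMIT oldestFiveStarMovie 1;
DUMP oldestOnly;
```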

  • Result (runtime: several minutes)
  • With Tez as the execution engine, the runtime can drop to around 1 minute.
