An Introduction to Apache Pig

An introduction to Apache Pig: what is it used for, how does it work,
and why use it instead of native MapReduce code?

Mike Frampton

July 26, 2013

Transcript

  1. Apache Pig
     • What is it?
     • How does it work?
     • Why use it?
     • Pig Latin Data Types
     • Pig Latin Maths
     • Pig Latin Example
     www.semtech-solutions.co.nz [email protected]
  2. Pig – What is it?
     • A high-level language
     • Used to analyse large data sets
     • Used to create MapReduce jobs
     • Abstracts the definition of jobs
     • Uses Pig Latin to define jobs
     • Less code needed
     • Compiles to MapReduce code
  3. Pig – How does it work?
     • Three ways to use it
       – Grunt, Pig's interactive shell
       – Write Pig Latin in a script file
       – Embed Pig commands in another language
     • Run modes
       – Local mode: runs on a single machine
       – Hadoop mode: runs on a Hadoop/MapReduce cluster
     • Creates MapReduce code automatically
  4. Pig – Why use it?
     • It is quicker
     • It is data omnivorous
     • It is easy to learn
     • It is widely used
     • Minor performance loss
       – Compared to native code
     • It can be extended via user-defined functions (UDFs)
  5. Pig Latin Data Types
     • Int
     • Long
     • Float
     • Double
     • Chararray
     • Bytearray
     • Tuple
     • Bag
     • Map
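
To make the type list on slide 5 more concrete, the sketch below pairs each Pig Latin type with its nearest Python analogue. This is only an illustration and not part of the original deck: the name `PIG_TO_PYTHON` and the sample values are hypothetical, and the mapping is approximate (Python integers are unbounded, and a Pig bag is unordered whereas a Python list is ordered).

```python
# Approximate Python analogues of the Pig Latin data types (illustrative only).
PIG_TO_PYTHON = {
    "int": int,          # 32-bit signed integer in Pig
    "long": int,         # 64-bit signed integer in Pig
    "float": float,      # 32-bit floating point in Pig
    "double": float,     # 64-bit floating point in Pig
    "chararray": str,    # character string
    "bytearray": bytes,  # uninterpreted bytes (blob)
    "tuple": tuple,      # ordered set of fields
    "bag": list,         # collection of tuples (unordered in Pig)
    "map": dict,         # key/value pairs with chararray keys
}

# Hypothetical sample values showing how the complex types nest.
sample_tuple = ("pig", 3)                      # one row with two fields
sample_bag = [("pig", 3), ("hadoop", 5)]       # a bag holds tuples
sample_map = {"language": "Pig Latin"}         # a map holds key/value pairs
```

The three complex types compose: a bag contains tuples, and a tuple's fields can themselves be bags or maps, which is what makes the grouping step in the later example possible.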
  6. Pig Latin Maths
     Some of the built-in maths functions:
     • ABS
     • CEIL
     • EXP
     • FLOOR
     • LOG
     • ROUND
     • SIN
     • TAN
  7. Pig Latin Example
     Example borrowed from Wikipedia:

     input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

     -- Extract words from each line and put them into a pig bag
     -- datatype, then flatten the bag to get one word on each row
     words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

     -- filter out any words that are just white spaces
     filtered_words = FILTER words BY word MATCHES '\\w+';

     -- create a group for each word
     word_groups = GROUP filtered_words BY word;

     -- count the entries in each group
     word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

     -- order the records by count
     ordered_word_count = ORDER word_count BY count DESC;
     STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
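
For readers who do not know Pig Latin, the following is a minimal single-machine Python sketch of what the script on slide 7 computes. It is an illustration under stated assumptions, not how Pig executes the job: Pig compiles the script to MapReduce jobs, whereas this runs in memory. The function name `word_count` and the sample input are hypothetical, and the delimiter set is assumed to match TOKENIZE's defaults (space, double quote, comma, parentheses, asterisk).

```python
import re
from collections import Counter

# Assumed to mirror TOKENIZE's default word separators.
DELIMITERS = re.compile(r'[ ",()*]+')
# MATCHES requires the whole string to match, hence fullmatch below.
WORD = re.compile(r"\w+", re.ASCII)


def word_count(lines):
    """Return (count, word) rows ordered by count, highest first."""
    # words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word
    words = [tok for line in lines for tok in DELIMITERS.split(line) if tok]
    # filtered_words = FILTER words BY word MATCHES '\\w+'
    filtered_words = [w for w in words if WORD.fullmatch(w)]
    # word_groups = GROUP ... BY word; word_count = ... COUNT(...)
    counts = Counter(filtered_words)
    rows = [(count, word) for word, count in counts.items()]
    # ordered_word_count = ORDER word_count BY count DESC
    return sorted(rows, key=lambda row: row[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical input standing in for the LOAD step.
    for count, word in word_count(["the cat, the dog", "the (cat) -- *"]):
        print(count, word)
```

Each Python statement corresponds to one relation in the Pig Latin script, which is why the Pig version stays short: the grouping, counting and sorting that would each need hand-written map and reduce code are single statements.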
  8. Contact Us
     • Feel free to contact us at
       – www.semtech-solutions.co.nz
       – [email protected]
     • We offer IT project consultancy
     • We are happy to hear about your problems
     • You can pay for just the hours you need to solve your problems