Apache™ Pig - An Introduction

Quick introduction to Apache™ Pig

Moty Michaely

June 03, 2015

Transcript

  1. Apache™ Pig is

    - A high-level scripting language: Pig Latin
    - A Hadoop MapReduce/Tez compiler
    - Commonly used
    - Open source
  2. Apache™ Pig is

    - A high-level scripting language
    - A Hadoop MapReduce/Tez compiler: supports MapReduce and Tez
    - Commonly used
    - Open source
  3. Apache™ Pig is

    - A high-level scripting language
    - A Hadoop MapReduce/Tez compiler
    - Commonly used: Netflix, Xplenty, eBay, Yahoo, Wix...
    - Open source
  4. Apache™ Pig is

    - A high-level scripting language
    - A Hadoop MapReduce/Tez compiler
    - Commonly used
    - Open source: backed by the community
  5. Why Pig (the name)?

    - Pigs eat anything: relational, nested and unstructured data; files, key/value stores, databases
    - Pigs live anywhere
    - Pigs are domestic animals
    - Pigs fly
  6. Why Pig (the name)?

    - Pigs eat anything
    - Pigs live anywhere: not tied to a particular framework
    - Pigs are domestic animals
    - Pigs fly
  7. Why Pig (the name)?

    - Pigs eat anything
    - Pigs live anywhere
    - Pigs are domestic animals: easily controlled; integration of user code
    - Pigs fly
  8. Why Pig (the name)?

    - Pigs eat anything
    - Pigs live anywhere
    - Pigs are domestic animals
    - Pigs fly: faster development; improved performance
  9. Motivation for Pig

    - Increase productivity: 10 lines of Pig Latin ≈ 200 lines of Java; 4 hours of Java ≈ 15 minutes of Pig Latin
    - Open to non-Java developers
    - Optimization opportunities
    - Extensibility
  10. Motivation for Pig

    - Increase productivity
    - Open to non-Java developers: it’s like SQL
    - Optimization opportunities
    - Extensibility
  11. Motivation for Pig

    - Increase productivity
    - Open to non-Java developers
    - Optimization opportunities: no need to tune Hadoop for your needs; execution plan, optimizer
    - Extensibility
  12. Motivation for Pig

    - Increase productivity
    - Open to non-Java developers
    - Optimization opportunities
    - Extensibility: user-defined functions; integration with Python, Ruby and JS
  13. Pig Latin (for Apache™ Pig)

    “A high-level language that allows you to write data processing and analysis programs.”
  14. Pig Latin (for Apache™ Pig): Relations

    - A relation (table) is a bag
    - A bag is a collection of tuples
    - A tuple (row) is an ordered set of fields
    - A field is a piece of data
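
To make the relation / bag / tuple / field hierarchy concrete, here is a minimal plain-Python sketch of the same nesting. It is only a model for illustration, not how Pig stores data, and the names and sample values (`alice`, `bob`, `users`) are hypothetical.

```python
# Plain-Python model of the Pig Latin data model (illustrative only).
# All names and sample values below are hypothetical.

# A field is a single piece of data.
name_field = "alice"
age_field = 31

# A tuple (row) is an ordered set of fields.
alice = (name_field, age_field)
bob = ("bob", 27)

# A bag is a collection of tuples. A list stands in for the bag here
# because it can hold the same tuple more than once.
users = [alice, bob, alice]

# A relation is the outermost bag, so `users` plays the role of a relation.
# Fields are addressed by position, much like $0 and $1 in Pig Latin.
names = [row[0] for row in users]
ages = [row[1] for row in users]

print(names)
print(ages)
```

Reading a field by position here mirrors the `$0` reference used in the word count script later in the deck.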
  15. Running Pig - Two execution modes

    - Interactive Mode

        $ cd /path/to/pig/bin/
        $ pig
        grunt> a = LOAD '/path/to/file';
        grunt> DUMP a;

    - Batch Mode
  16. Running Pig - Pig supports two execution modes

    - Interactive Mode
    - Batch Mode

        $ cd /path/to/pig/bin/
        $ pig -f /path/to/pig/file.pig
  17. Word Count in Pig

        -- Word Count Script (wordcount.pig)
        text = LOAD 'word_count_text.txt';
        words = FOREACH text GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
        grouped_words = GROUP words BY word;
        counts = FOREACH grouped_words GENERATE group AS word, COUNT(words) AS count;
        STORE counts INTO 'wordcount';
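
As a rough aid to reading the script, the sketch below mirrors each Pig Latin statement in plain Python. It is an approximation under stated assumptions: the function name `word_count` is hypothetical, the input is a list of lines already in memory, and splitting on whitespace only approximates what `TOKENIZE` does. It says nothing about how Pig actually executes the script on Hadoop.

```python
from collections import defaultdict


def word_count(lines):
    """Approximate the word count dataflow on an in-memory list of lines."""
    # words = FOREACH text GENERATE FLATTEN(TOKENIZE(...)) AS word;
    # Assumption: whitespace splitting stands in for TOKENIZE.
    words = [word for line in lines for word in line.split()]

    # grouped_words = GROUP words BY word;
    # Each key maps to the bag of records that share that key.
    grouped_words = defaultdict(list)
    for word in words:
        grouped_words[word].append(word)

    # counts = FOREACH grouped_words GENERATE group AS word, COUNT(words) AS count;
    return {group: len(bag) for group, bag in grouped_words.items()}


# Hypothetical sample input, standing in for word_count_text.txt.
sample = ["a b a", "b c"]
print(word_count(sample))
```

The dictionary returned here plays the role of the `counts` relation, with one entry per distinct word.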
  24. Word Count in Pig - Sorted

        -- Word Count Script (wordcount.pig)
        text = LOAD 'word_count_text.txt';
        words = FOREACH text GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
        grouped_words = GROUP words BY word;
        counts = FOREACH grouped_words GENERATE group AS word, COUNT(words) AS count;
        sorted_counts = ORDER counts BY count DESC, word ASC;
        STORE counts INTO 'wordcount';
  25. Word Count in Pig - Sorted

        -- Word Count Script (wordcount.pig)
        text = LOAD 'word_count_text.txt';
        words = FOREACH text GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
        grouped_words = GROUP words BY word;
        counts = FOREACH grouped_words GENERATE group AS word, COUNT(words) AS count;
        sorted_counts = ORDER counts BY count DESC, word ASC;
        STORE sorted_counts INTO 'wordcount_sorted';
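
The `ORDER counts BY count DESC, word ASC` step sorts on two keys in opposite directions. A minimal plain-Python sketch of that ordering follows; the function name `order_counts` and the sample data are hypothetical, and the sketch only models the resulting order, not Pig's distributed sort.

```python
def order_counts(counts):
    """Sort (word, count) pairs by count descending, then word ascending."""
    # Negating the count makes the largest counts come first, while the
    # word itself breaks ties in ascending alphabetical order.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical (word, count) pairs, standing in for the counts relation.
sample_counts = [("b", 2), ("a", 2), ("c", 5), ("d", 1)]
print(order_counts(sample_counts))
# prints [('c', 5), ('a', 2), ('b', 2), ('d', 1)]
```

Words with equal counts ("a" and "b" above) come out alphabetically, which is what the secondary `word ASC` key provides.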
  26. Word Count - MR vs. Pig

    - MapReduce: 63 lines of code
    - Pig: 5 lines of code

        -- Word Count Script (wordcount.pig)
        text = LOAD 'word_count_text.txt';
        words = FOREACH text GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
        grouped_words = GROUP words BY word;
        counts = FOREACH grouped_words GENERATE group AS word, COUNT(words) AS count;
        STORE counts INTO 'wordcount';
  27. “80% of the work in any data project is in cleaning the data.” (DJ Patil, Data Jujitsu)
  28. Pig is great for

    - Web log processing
    - Data processing for web search platforms
    - Ad hoc queries across large data sets
    - Rapid prototyping of algorithms for processing large data sets
  32. Resources

    - Apache Pig Philosophy
    - Apache Pig 0.14 Documentation
    - Word Count Example
    - Introduction to Apache Tez
    - Pig (Language)
    - Pig for dummies
    - Pig Latin (Language Game)
    - Xplenty
    - Data Jujitsu: The art of turning data into product
    - Pig Cheat Sheet