Hadoop and Spring Data Hadoop by Kailash Kutti

Michael Isvy

January 28, 2014

Transcript

  1. A NEW PLATFORM FOR A NEW ERA


  2. Spring Data for Hadoop
     January 2014

  3. About the speaker
     • Kailashnath Kutti
       – Technical Architect @ Pivotal
       – 3+ years of Hadoop experience

  4. Agenda
     • Hadoop in 10 minutes
       – HDFS
       – MapReduce
       – Code example
     • Hadoop using Spring
       – Code examples
       – Hadoop-specific configuration
       – Pig through Spring
     • Questions

  5. About Hadoop

  6. Why is Hadoop important?
     • Delivers performance and scalability at low cost
     • Handles large amounts of data
     • Resilient to infrastructure failures
     • Scales applications transparently

  7. Hadoop Overview
     • Open-source Apache project, out of Yahoo! in 2006
     • Distributed, fault-tolerant data storage and batch processing
     • Linear scalability on commodity hardware

  8. Hadoop Overview
     • Great at
       – Reliable storage for huge data sets
       – Batch queries and analytics
       – Changing schemas
     • Not so great at
       – Changes to files (can't do it…)
       – Low-latency responses (like OLTP applications)
       – Analyst usability

  9. HDFS Overview
     • Hierarchical, UNIX-like file system for data storage
       – sort of
     • Splits large files into blocks
     • Distributes and replicates blocks across nodes
     • Two key services
       – Master NameNode
       – Many DataNodes
     • Secondary/Checkpoint Node
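     Outside the shell, clients drive HDFS through Hadoop's Java FileSystem API. A minimal sketch, assuming a NameNode at hdfs://localhost:9000 and hypothetical paths:

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.FileSystem;
     import org.apache.hadoop.fs.Path;

     public class HdfsClientSketch {
         public static void main(String[] args) throws Exception {
             Configuration conf = new Configuration();
             conf.set("fs.default.name", "hdfs://localhost:9000"); // assumed NameNode address
             FileSystem fs = FileSystem.get(conf);

             Path inputDir = new Path("/wc/input");   // hypothetical directory
             if (!fs.exists(inputDir)) {
                 fs.mkdirs(inputDir);
             }
             // HDFS itself splits the file into blocks and replicates them to DataNodes
             fs.copyFromLocalFile(new Path("/tmp/words.txt"), inputDir);
             fs.close();
         }
     }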

  10. How HDFS Works – Writes
      [Diagram: Client, NameNode, and DataNodes A–D; blocks A1–A4]
      1. Client contacts the NameNode to write data
      2. NameNode says write it to these nodes
      3. Client sequentially writes blocks to the DataNodes

  11. How HDFS Works – Writes
      [Diagram: blocks A1–A4 replicated across DataNodes A–D]
      DataNodes replicate the data blocks, orchestrated by the NameNode

  12. How HDFS Works – Reads
      [Diagram: Client, NameNode, and DataNodes A–D holding replicated blocks A1–A4]
      1. Client contacts the NameNode to read data
      2. NameNode says you can find it here
      3. Client sequentially reads blocks from the DataNodes

  13. Hadoop MapReduce 1.x
      • Moves the code to the data
      • JobTracker
        – Master service that monitors jobs
      • TaskTracker
        – Multiple services that run tasks
        – Same physical machine as a DataNode
      • A job contains many tasks
      • A task contains one or more task attempts

  14. How MapReduce Works
      [Diagram: Client, JobTracker, and TaskTrackers A–D co-located with DataNodes A–D; input blocks A1–A4, output blocks B1–B4]
      1. Client submits the job to the JobTracker
      2. JobTracker submits tasks to the TaskTrackers
      3. Job output is written to the DataNodes with replication
      4. JobTracker reports metrics

  15. MapReduce Paradigm
      • Data processing system with two key phases
      • Map
        – Perform a map function on input key/value pairs
      • Reduce
        – Perform a reduce function on key/value groups
      • Groups are created by sorting the map output

  16. Word Count
      • Count the number of times each word is used in a body of text
      • Map input is a line of text
      • Reduce output is a word and its count

  17. MapReduce Significant Steps (WordCount): Split → Map → Shuffle → Reduce

      Split:   "Hadoop is fun" | "I love Hadoop" | "Pig is more fun"

      Map:     (Hadoop, 1) (is, 1) (fun, 1)
               (I, 1) (love, 1) (Hadoop, 1)
               (Pig, 1) (is, 1) (more, 1) (fun, 1)

      Shuffle: (Hadoop, {1, 1}) (is, {1, 1}) (fun, {1, 1})
               (I, 1) (love, 1) (Pig, 1) (more, 1)

      Reduce:  (Hadoop, 2) (is, 2) (fun, 2)
               (I, 1) (love, 1) (Pig, 1) (more, 1)

  18. Mapper Code

      public class WordMapper
              extends Mapper<LongWritable, Text, Text, IntWritable> {

          private final static IntWritable ONE = new IntWritable(1);
          private Text word = new Text();

          @Override
          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                  word.set(tokenizer.nextToken());
                  context.write(word, ONE); // emit (word, 1) for every token
              }
          }
      }

  19. Reducer Code

      public class IntSumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {

          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                  sum += val.get(); // add up the 1s emitted for this word
              }
              context.write(key, new IntWritable(sum));
          }
      }

  20. Counting Words – Configuring M/R
      • Standard Hadoop APIs

      Configuration conf = new Configuration();
      Job job = new Job(conf, "wordcount");
      job.setJarByClass(WordMapper.class);
      job.setMapperClass(WordMapper.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.waitForCompletion(true);

  21. Spring Data for Hadoop
      Simplify developing Hadoop applications

  22. Developer observations on Hadoop
      • Hadoop has a poor out-of-the-box programming model
      • Non-trivial applications often become a collection of scripts calling Hadoop command-line applications
      • Spring aims to simplify developing Hadoop applications
        – Leverages several Spring ecosystem projects

  23. Spring Data for Hadoop – Features
      • Consistent programming and declarative configuration model
        – Create, configure, and parameterize Hadoop connectivity and all job types
        – Environment profiles – easily move an application from dev to QA to production
      • Developer productivity
        – Create well-formed applications, not spaghetti-script applications
        – Simplify HDFS access and the FsShell API, with support for JVM scripting
        – Runner classes for MR/Pig/Hive/Cascading for small workflows
        – Helper "Template" classes for Pig/Hive/HBase

  24. Spring Data for Hadoop – Use Cases
      • Applies across a wide range of use cases
        – Ingest: events/JDBC/NoSQL/files to HDFS
        – Orchestrate: Hadoop jobs
        – Export: HDFS to JDBC/NoSQL
      • Spring Integration and Spring Batch make this possible

  25. Counting Words – Configuring M/R
      • Standard Hadoop APIs (repeated for comparison with the Spring configuration that follows)

      Configuration conf = new Configuration();
      Job job = new Job(conf, "wordcount");
      job.setJarByClass(WordMapper.class);
      job.setMapperClass(WordMapper.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.waitForCompletion(true);

  26. Configuring Hadoop with Spring

      <hdp:configuration>
          fs.default.name=${hd.fs}
      </hdp:configuration>

      <hdp:job id="wordcountJob"
          input-path="${input.path}"
          output-path="${output.path}"
          jar="hadoop-examples.jar"
          mapper="examples.WordCount.WordMapper"
          reducer="examples.WordCount.IntSumReducer"/>

  27. Injecting Jobs
      • Use dependency injection to obtain a reference to the Hadoop Job
        – Perform additional runtime configuration and submit

      public class WordService {

          @Autowired
          private Job mapReduceJob;

          public void processWords() throws Exception {
              mapReduceJob.submit();
          }
      }
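      A minimal bootstrap sketch for the service above; the context file name is an assumption for illustration:

      import org.springframework.context.support.ClassPathXmlApplicationContext;

      public class Main {
          public static void main(String[] args) throws Exception {
              // assumes applicationContext.xml declares the wordcount job and a WordService bean
              ClassPathXmlApplicationContext ctx =
                      new ClassPathXmlApplicationContext("applicationContext.xml");
              try {
                  ctx.getBean(WordService.class).processWords();
              } finally {
                  ctx.close();
              }
          }
      }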

  28. Streaming Jobs and Environment Configuration

      Command line:
      bin/hadoop jar hadoop-streaming.jar \
          -input /wc/input -output /wc/output \
          -mapper /bin/cat -reducer /bin/wc \
          -files stopwords.txt

      Spring configuration:
      <hdp:streaming id="streamingJob"
          input-path="${input.path}" output-path="${output.path}"
          mapper="${cat}" reducer="${wc}"
          files="classpath:stopwords.txt"/>

      hadoop-dev.properties:
      input.path=/wc/input/
      output.path=/wc/word/
      hd.fs=hdfs://localhost:9000

      Launch, selecting the environment-specific properties file:
      env=dev java -jar SpringLauncher.jar applicationContext.xml

  29. HDFS and Hadoop Shell as APIs
      • Access all "bin/hadoop fs" commands through Spring's FsShell helper class
        – mkdir, chmod, test, …

      class MyScript {

          @Autowired
          private FsShell fsh;

          @PostConstruct
          void init() {
              String outputDir = "/data/output";
              if (fsh.test(outputDir)) { // does the path exist?
                  fsh.rmr(outputDir);    // recursively remove it
              }
          }
      }

  30. HDFS and Hadoop Shell as APIs
      • FsShell is designed to support JVM scripting languages

      copy-files.groovy:
      // use the shell (made available under variable fsh)
      if (!fsh.test(inputDir)) {
          fsh.mkdir(inputDir)
          fsh.copyFromLocal(sourceFile, inputDir)
          fsh.chmod(700, inputDir)
      }
      if (fsh.test(outputDir)) {
          fsh.rmr(outputDir)
      }

  31. HDFS and Hadoop Shell as APIs
      • Reference the script and supply variables in the application configuration

      appCtx.xml:
      <script id="setupScript" location="copy-files.groovy">
          <property name="inputDir" value="${wordcount.input.path}"/>
          <property name="outputDir" value="${wordcount.output.path}"/>
          <property name="sourceFile" value="${localSourceFile}"/>
      </script>

  32. Small workflows
      • Often need the following steps
        – Execute HDFS operations before the job
        – Run the MapReduce job
        – Execute HDFS operations after the job completes
      • Spring's JobRunner helper class sequences these steps
        – Can reference multiple scripts with comma-delimited names

      <job-runner id="runner"
          pre-action="setupScript"
          job="wordcountJob"
          post-action="tearDownScript"/>

  33. Runner classes
      • Similar runner classes are available for Hive and Pig
      • Implement the JDK Callable interface
      • Easy to schedule for simple needs using Spring
      • Can later 'graduate' to Spring Batch for more complex workflows
        – Start simple and grow, reusing the existing configuration

      <job-runner id="runner"
          pre-action="setupScript"
          job="wordcountJob"
          post-action="tearDownScript"/>

  34. Spring's PigRunner
      • Execute a small Pig workflow

      <script id="hdfsScript" location="copy-files.groovy">
          <property name="sourceFile" value="${localSourceFile}"/>
          <property name="inputDir" value="${inputDir}"/>
          <property name="outputDir" value="${outputDir}"/>
      </script>

      <pig-runner id="pigRunner" pre-action="hdfsScript" run-at-startup="true">
          <script location="password-analysis.pig">
              <arguments>
                  inputDir=${inputDir}
                  outputDir=${outputDir}
              </arguments>
          </script>
      </pig-runner>

  35. PigTemplate – Configuration
      • Helper class that simplifies the programmatic use of Pig
        – Common tasks are one-liners
      • Similar template helper classes exist for Hive and HBase

  36. PigTemplate – Programmatic Use

      public class PigPasswordRepository implements PasswordRepository {

          @Autowired
          private PigTemplate pigTemplate;

          @Autowired
          private String outputDir;

          private String pigScript = "classpath:password-analysis.pig";

          public void processPasswordFile(String inputFile) {
              Properties scriptParameters = new Properties();
              scriptParameters.put("inputDir", inputFile);
              scriptParameters.put("outputDir", outputDir);
              pigTemplate.executeScript(pigScript, scriptParameters);
          }
      }
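      Called from application code, the repository hides all Pig plumbing; a hedged usage sketch, where the context variable and input path are assumptions:

      // fetch the repository bean and run the analysis on one input file
      PasswordRepository repository = ctx.getBean(PigPasswordRepository.class);
      repository.processPasswordFile("/data/passwd/input"); // hypothetical HDFS path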

  37. Big Data problems are also integration problems
      [Diagram: Collect → Transform → RT Analysis → Ingest → Batch Analysis → Distribute → Use]
      [Technologies: Twitter Search & Gardenhose; Spring Integration & Data; Spring Hadoop + Batch; Spring MVC; Redis; GemFire (CQ)]

  38. Spring Integration
      • Implementation of Enterprise Integration Patterns
        – Mature, since 2007
        – Apache 2.0 License
      • Separates integration concerns from processing logic
        – Framework handles message reception and method invocation
          ▪ e.g. polling vs. event-driven
        – Endpoints written as POJOs (see the sketch below)
          ▪ Increases testability
      [Diagram: message flow between endpoints]
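      A minimal POJO endpoint sketch; the class name and channel name are assumptions, not from the deck. The framework delivers each message and invokes the method, so the class contains no messaging code:

      import org.springframework.integration.annotation.ServiceActivator;

      public class LogLineEndpoint {

          // invoked by Spring Integration for each message on the "logLines" channel
          @ServiceActivator(inputChannel = "logLines")
          public void handle(String line) {
              System.out.println("received: " + line);
          }
      }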

  39. Spring Batch
      • Framework for batch processing
        – Basis for JSR-352
      • Features (see the configuration sketch below)
        – parsers, mappers, readers, writers
        – automatic retries after failure
        – periodic commits
        – synchronous and asynchronous processing
        – parallel processing
        – partial processing (skipping records)
        – non-sequential processing
        – job tracking and restart
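      A minimal sketch of a chunk-oriented step illustrating the retry, commit, and skip features above, assuming Spring Batch's Java configuration; the reader and writer beans are hypothetical:

      import java.io.IOException;
      import org.springframework.batch.core.Job;
      import org.springframework.batch.core.Step;
      import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
      import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
      import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
      import org.springframework.batch.item.ItemReader;
      import org.springframework.batch.item.ItemWriter;
      import org.springframework.beans.factory.annotation.Autowired;
      import org.springframework.context.annotation.Bean;
      import org.springframework.context.annotation.Configuration;

      @Configuration
      @EnableBatchProcessing
      public class IngestBatchConfig {

          @Autowired private JobBuilderFactory jobs;
          @Autowired private StepBuilderFactory steps;

          @Bean
          public Job ingestJob(ItemReader<String> reader, ItemWriter<String> writer) {
              Step step = steps.get("ingestStep")
                      .<String, String>chunk(100)             // periodic commits: one transaction per 100 items
                      .reader(reader)
                      .writer(writer)
                      .faultTolerant()
                      .retryLimit(3).retry(IOException.class) // automatic retries after failure
                      .skipLimit(10).skip(IOException.class)  // partial processing: skip bad records
                      .build();
              return jobs.get("ingestJob").start(step).build();
          }
      }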

  40. Spring Integration and Batch for Hadoop Ingest/Export
      • Event streams – Spring Integration
        – Examples
          ▪ Consume syslog events, transform, and write to HDFS
          ▪ Consume Twitter search results and write to HDFS
      • Batch – Spring Batch
        – Examples
          ▪ Read log files on the local file system, transform, and write to HDFS
          ▪ Read from HDFS, transform, and write to JDBC, HBase, MongoDB, …

  41. Spring Data, Integration & Batch for Analytics
      • Real-time analytics – Spring Integration & Data
        – Examples – a Service Activator that (see the sketch below)
          ▪ Increments counters in Redis or MongoDB using Spring Data helper libraries
          ▪ Creates GemFire Continuous Queries using Spring GemFire
      • Batch analytics – Spring Batch
        – Orchestrate Hadoop-based workflows with Spring Batch
        – Also orchestrate non-Hadoop-based workflows
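      A hedged sketch of the first example; the class, channel, and key names are assumptions. A Service Activator bumps a Redis counter per incoming hashtag via Spring Data Redis:

      import org.springframework.beans.factory.annotation.Autowired;
      import org.springframework.data.redis.core.StringRedisTemplate;
      import org.springframework.integration.annotation.ServiceActivator;

      public class HashtagCounter {

          @Autowired
          private StringRedisTemplate redisTemplate;

          // one message per hashtag arrives on the (hypothetical) "hashtags" channel
          @ServiceActivator(inputChannel = "hashtags")
          public void count(String hashtag) {
              redisTemplate.opsForValue().increment("count:" + hashtag, 1);
          }
      }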

  42. Hadoop analytical workflow managed by Spring Batch
      • Reuse the same Batch infrastructure and knowledge to manage Hadoop workflows
      • A step can be any Hadoop job type or HDFS script

  43. Relationship between Spring projects
      [Stack diagram, top to bottom]
      Spring Data for Hadoop – simplify Hadoop
      Spring Batch – batch on and off Hadoop
      Spring Integration – Enterprise Integration Patterns, event-driven applications
      Spring Framework – DI, AOP, Web, Messaging, Scheduling

  44. Next Steps – Spring XD
      • New open-source umbrella project to support common big data use cases
        – High-throughput distributed data ingestion into HDFS
          ▪ From a variety of input sources
        – Real-time analytics at ingestion time
          ▪ Gathering metrics, counting values, GemFire CQ, …
        – On- and off-Hadoop workflow orchestration
        – High-throughput data export
          ▪ From HDFS to an RDBMS or NoSQL database
      • XD = eXtreme Data

  45. Resources
      • Pivotal – goPivotal.com
      • Spring Data
        – http://www.springsource.org/spring-data
        – http://www.springsource.org/spring-hadoop
      • Spring Data Book – http://bit.ly/sd-book
        – Part III on Big Data
      • Example code – https://github.com/SpringSource/spring-data-book
      • Spring XD – http://github.com/springsource/spring-xd

  46. Q & A

  47. A NEW PLATFORM FOR A NEW ERA
