
An easy hadoop scheduler with JRuby

Pere Urbón
February 06, 2014


A short description of the scheduling situation for Hadoop jobs, including custom tooling we built in-house.


Transcript

  1. A simple scheduler for Hadoop
     Pere Urbon-Bayes
    [email protected]
    http://www.purbon.com


  2. We've got a situation
     State of the art
     Our internal solution
     Still on the TODO list
     Talk.schedule();


  3. (image slide)

  4. We've got a situation
     State of the art
     Our internal solution
     Still on the TODO list
     Talk.schedule();


  5. (image slide)


  6. The situation
     We build solutions for very large solar fields, for
     example in Templin, north of Berlin, Germany:
     - 214 hectares
     - 1,500,000 photovoltaic modules
     - 128.48 MW
     - 205 million euros
     We need to gain inside knowledge about the
     installation, because we want to know:
     - whether everything is working properly
     - better ways to produce the energy


  7. The situation
     So we need a platform: the scheduler.


  8. The situation
     The scheduler should:
     - Be able to run Hadoop jobs, Pig scripts, and/or
     shell scripts.
     - Have a REST API (a client interaction is sketched
     below).
     - Support workflows, but also cron-like jobs.
     - Be able to run jobs on demand.
     - Be able to track running jobs.
     - Provide live statistics about the running jobs.

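To make these requirements concrete, here is a minimal sketch of how a client might submit a job on demand and poll its status over such a REST API. The endpoint paths and the JSON fields are assumptions made up for illustration; they are not the actual API of the scheduler presented later in this deck.

    # Hypothetical client for a scheduler REST API (endpoints and
    # payload fields are assumptions, not the real interface).
    require 'net/http'
    require 'json'
    require 'uri'

    BASE = URI('http://scheduler.example.com:4567') # assumed host and port

    # Submit a job on demand; returns the id assigned by the server.
    def submit_job(name, command)
      req = Net::HTTP::Post.new('/jobs', 'Content-Type' => 'application/json')
      req.body = { name: name, command: command }.to_json
      res = Net::HTTP.start(BASE.host, BASE.port) { |http| http.request(req) }
      JSON.parse(res.body)['id']
    end

    # Poll the job status until it reaches a terminal state.
    def wait_for(job_id)
      loop do
        res = Net::HTTP.get_response(BASE.host, "/jobs/#{job_id}", BASE.port)
        status = JSON.parse(res.body)['status']
        return status if %w[succeeded failed].include?(status)
        sleep 5
      end
    end

    puts wait_for(submit_job('daily-report', 'pig -f reports/daily.pig'))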

  9. We've got a situation
     State of the art
     Our internal solution
     Still on the TODO list
     Talk.schedule();


  10. State of the art


  11. State of the art
      (architecture diagram) An Oozie Client talks to the
      Oozie Server over a REST API; the server keeps its
      state in a database and drives MapReduce (JobTracker,
      TaskTracker) and HDFS.
      http://oozie.apache.org/


  12. State of the art
      (architecture diagram) A "Client" talks to the Azkaban
      Server, which shares a database with the Azkaban
      Executor; the executor drives MapReduce (JobTracker,
      TaskTracker).
      http://azkaban.github.io/azkaban2/


  13. State of the art
      (architecture diagram) A Client submits work to the
      Luigi Worker Pool, coordinated by the Luigi scheduler;
      the workers drive MapReduce (JobTracker, TaskTracker)
      and HDFS. Deployment is through codebase copy.
      https://github.com/spotify/luigi


  14. State of the art
      - Oozie: Java; high complexity (> 105k LOC);
        frameworks: Pig, Hive, Sqoop, MapReduce; logs:
        decentralized; community: good (ASF); docs: excellent.
      - Azkaban: Java; moderate complexity (> 26k LOC);
        frameworks: Pig, Hive, MapReduce; logs: centralized,
        accessible from the UI; community: few users; docs: good.
      - Luigi: Python; simple (> 5.9k LOC); frameworks: Hive,
        Postgres, Scalding, Python, streaming; logs:
        decentralized; community: few users; docs: good.
      http://www.slideshare.net/jcrobak/data-engineermeetup-201309


  15. State of the art
      - Oozie: configuration via command line, property files,
        and XML; replay with "oozie job -rerun"; customization
        is difficult; testing with MiniOozie; authorization:
        Kerberos, simple, custom.
      - Azkaban: configuration bundled inside the workflow zip
        plus system defaults; partial rerun from the UI;
        customization via plugins; testing: ?; authorization:
        XML-based, custom.
      - Luigi: configuration via command line and a Python init
        file; replay by removing output (idempotent);
        customization by subclassing; testing with Python unit
        tests; authorization: Linux/OS-based.
      http://www.slideshare.net/jcrobak/data-engineermeetup-201309


  16. State of the art
      So the most obvious choice is Oozie, because it:
      - Has a great community and feature set.
      - Is integrated with the common Hadoop tooling.
      - Has good documentation.
      but:
      - XML-based configuration.
      - Scheduling based on map tasks.
      - Complicated setup.
      - Confusing object model.
      - Our lack of Java experience.


  17. (image slide)

  18. We've got a situation
      State of the art
      Our internal solution
      Still on the TODO list
      Talk.schedule();


  19. Our internal solution
      What do we want:
      i. Have easy configuration and deployment methods
      ii. Be easy to interact with through a REST interface
      iii. Deal with workflows and cron-like jobs
      iv. Have a centralized logging infrastructure
      v. Use not just Hadoop tooling, but also "scripts"
      vi. Be easy to extend, maintain, and adapt to our
      needs and wishes


  20. Our internal solution
      What do we know / who are we?
      - A small, highly distributed team.
      - A diverse knowledge base, mostly C# and Microsoft
      technologies, but a few people with Java and Ruby skills.
      - Lots of ops but just a few devs.
      - Experience with data and databases.
      - An ecosystem that is not very technical.
      - A rapidly changing environment.


  21. Our internal solution


  22. Our internal solution


  23. Our internal solution
      (architecture diagram) Components: a Client, a REST API,
      the Scheduler Server (hosting a Registry, a Cron Manager,
      and a Workflow Manager; the cron part is sketched below),
      a database, a GraphDB, Scheduler Workers, the file
      system, and MapReduce (JobTracker, TaskTracker).

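The Cron Manager box above can be pictured with a small sketch. I use the rufus-scheduler gem here purely to illustrate cron-like triggering; the deck does not say which library the real Cron Manager is built on.

    # Illustration of cron-like triggering with rufus-scheduler
    # (an assumption; the deck does not name the actual library).
    require 'rufus-scheduler'

    scheduler = Rufus::Scheduler.new

    # Fire the nightly workflow at 03:00 every day.
    scheduler.cron '0 3 * * *' do
      puts 'triggering workflow: nightly-report'
      # the real scheduler would enqueue the workflow's start node here
    end

    scheduler.join # keep the process alive so the jobs keep firing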

  24. Our internal solution
      We like Ruby, so we chose JRuby: it also helps with
      concurrency, memory management, possibly speed, Java
      integration, etc.
      From Neo4j we borrow the graph features, so we can deal
      with a workflow (a DAG) in a much easier way.
      Sinatra is our choice for the REST API.
      But we also need to track our service's evolution; for
      this we use Redis.

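As an illustration of that stack, here is a minimal sketch of a Sinatra endpoint that registers jobs and records their state in Redis. Sinatra and Redis are the tools named on the slide, but the route names, the key layout, and the queue list are assumptions made for this example.

    # Minimal Sinatra + Redis sketch (routes and key layout are
    # assumptions, not the scheduler's real code).
    require 'sinatra'
    require 'redis'
    require 'json'
    require 'securerandom'

    redis = Redis.new

    # Register a new job and mark it as queued.
    post '/jobs' do
      job = JSON.parse(request.body.read)
      id = SecureRandom.uuid
      redis.hmset("job:#{id}", 'name', job['name'], 'status', 'queued')
      redis.lpush('jobs:queue', id) # a worker would pop from this list
      JSON.generate('id' => id)
    end

    # Report the current status of a job.
    get '/jobs/:id' do
      status = redis.hget("job:#{params[:id]}", 'status')
      halt 404 if status.nil?
      JSON.generate('id' => params[:id], 'status' => status)
    end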

  25. Our internal solution
      Our workflows are defined as a Directed Acyclic Graph
      (DAG). (diagram, roughly: Start -> Task 1 -> fork into
      Tasks 2, 3 and 4 -> join -> Task 5 -> End)

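A fork/join workflow like the one above can be represented as a plain dependency hash and executed in topological order, which is the essence of what a DAG-based workflow manager does. This is a self-contained sketch built on Ruby's TSort standard library, not the Neo4j-backed implementation the deck describes.

    # Sketch: run a fork/join workflow in topological order
    # (the real scheduler keeps the DAG in Neo4j instead).
    require 'tsort'

    class Workflow
      include TSort

      def initialize
        @deps = Hash.new { |h, k| h[k] = [] }
      end

      # Declare that `name` runs only after every task in `after`.
      def task(name, after: [])
        @deps[name].concat(Array(after))
      end

      def tsort_each_node(&block)
        @deps.each_key(&block)
      end

      def tsort_each_child(node, &block)
        @deps[node].each(&block)
      end

      # TSort yields dependencies first, giving a valid run order.
      def run
        tsort.each { |t| yield t }
      end
    end

    wf = Workflow.new
    wf.task :task1
    wf.task :task2, after: [:task1]  # fork: 2, 3 and 4 follow 1
    wf.task :task3, after: [:task1]
    wf.task :task4, after: [:task1]
    wf.task :task5, after: [:task2, :task3, :task4]  # join
    wf.run { |t| puts "running #{t}" }  # plug real job execution in here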

  26. Our internal solution


  27. Our internal solution


  28. Our internal solution


  29. We've got a situation
      State of the art
      Our internal solution
      Still on the TODO list
      Talk.schedule();


  30. The TODO list
      However, we already have a functional, ready-to-use
      scheduler that can:
      - Run Hadoop jobs, Pig scripts, and shell scripts (a
      runner is sketched below).
      - Track job evolution and most of the errors.
      - Support workflows and cron-like jobs.
      - Provide a simple interface for clients.
      - Deploy new jobs.

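The first point, running Pig scripts and shell commands, boils down to spawning a process and inspecting its exit status. A minimal sketch; the class name and the log location are invented for illustration:

    # Sketch of a job runner that shells out to Pig, Hadoop or
    # plain shell commands (class name and log path are invented).
    require 'open3'

    class JobRunner
      LOG_DIR = '/var/log/scheduler' # assumed location

      # Run a command, capture its output, return a success flag.
      def run(job_id, command)
        stdout, stderr, status = Open3.capture3(command)
        File.write(File.join(LOG_DIR, "#{job_id}.log"), stdout + stderr)
        status.success?
      end
    end

    ok = JobRunner.new.run('daily-report', 'pig -f reports/daily.pig')
    puts ok ? 'job succeeded' : 'job failed'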

  31. The TODO list
      We still have to improve:
      - The deployment procedure.
      - The workflow management process.
      - The replay capabilities.
      - The reporting module.
      - Idempotency and data integration.


  32. We've got a situation
      State of the art
      Our internal solution
      Still on the TODO list
      Talk.schedule();


  33. Thank you!
      Questions?
      A simple scheduler for Hadoop
      Pere Urbon-Bayes
    [email protected]
    http://www.purbon.com
