An easy Hadoop scheduler with JRuby

Pere Urbón
February 06, 2014

A short description of the scheduling situation for Hadoop jobs, including custom tooling we built in-house.


Transcript

  1. We have a situation / State of the art / Our internal solution / Still in the TODO list / Talk.schedule();
  2. We have a situation / State of the art / Our internal solution / Still in the TODO list / Talk.schedule();
  3. The situation
     • We build solutions for very large solar fields, for example in Templin, north of Berlin, DE:
       - 214 hectares
       - 1,500,000 photovoltaic modules
       - 128.48 MW
       - 205 million euros
     • We need insight into the installation, because we want to know:
       - whether everything is working properly
       - better ways to produce the energy
  4. The situation
     The scheduler should:
     • Be able to run Hadoop jobs, PIG scripts and/or shell scripts.
     • Have a REST API.
     • Support workflows, but also cron-alike jobs.
     • Be able to run jobs on demand.
     • Be able to track running jobs.
     • Provide live statistics about the running jobs.
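     To make those requirements concrete, a job handed to such a scheduler could be described with a handful of fields; the field names below are assumptions made for this sketch, not the tool's real schema.

         # Minimal sketch of a job description covering the requirements above.
         # All field names are illustrative assumptions.
         job = {
           name:     'daily_module_stats',
           type:     :pig,                        # :hadoop, :pig or :shell
           command:  'pig -f /jobs/module_stats.pig',
           schedule: '0 3 * * *',                 # cron-alike trigger
           track:    true                         # expose state and statistics over REST
         }
         puts "registering #{job[:name]} (#{job[:type]})"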
  5. We have a situation / State of the art / Our internal solution / Still in the TODO list / Talk.schedule();
  6. State of the art: Oozie
     [Architecture diagram: Oozie Client → REST API → Oozie Server → Database; the server drives MapReduce (JobTracker/TaskTracker) and HDFS]
     http://oozie.apache.org/
  7. State of the art: Azkaban
     [Architecture diagram: “Client” → Azkaban Server → Azkaban Executor → MapReduce (JobTracker/TaskTracker), backed by a Database]
     http://azkaban.github.io/azkaban2/
  8. State of the art: Luigi
     [Architecture diagram: Client → Luigi scheduler → Worker Pool → MapReduce (JobTracker/TaskTracker) and HDFS; deployment through codebase copy]
     https://github.com/spotify/luigi
  9. State of the art
     Tool    | Lang   | Complexity (LOC) | Framework                                   | Logs                       | Community  | Docs
     oozie   | Java   | high, > 105k     | Pig, Hive, Sqoop, MapReduce                 | decentralized              | good (ASF) | excellent
     azkaban | Java   | moderate, > 26k  | Pig, Hive, MapReduce                        | centralized, UI-accessible | few users  | good
     luigi   | Python | simple, > 5.9k   | Hive, Postgres, Scalding, Python, streaming | decentralized              | few users  | good
     http://www.slideshare.net/jcrobak/data-engineermeetup-201309
  10. State of the art
      Tool    | Configuration                                    | Replay                     | Customization | Testing          | Authorization
      oozie   | command line, property files, XML                | Oozie job rerun            | difficult     | MiniOozie        | Kerberos, simple, custom
      azkaban | bundled inside the workflow zip, system defaults | partial rerun              | UI plugin     | ?                | XML-based, custom
      luigi   | command line, Python init file                   | remove output (idempotent) | subclass      | Python unit test | Linux, OS-based
      http://www.slideshare.net/jcrobak/data-engineermeetup-201309
  11. State of the art
      So the most obvious choice is Oozie, because it:
      - has a great community and feature set,
      - is integrated with the common Hadoop tooling,
      - has good documentation.
      But:
      - XML-based configuration.
      - Scheduling based on map tasks.
      - Complicated setup.
      - Confusing object model.
      - Our lack of Java experience.
  12. We have a situation / State of the art / Our internal solution / Still in the TODO list / Talk.schedule();
  13. Our internal solution
      What do we want?
      i.   Easy configuration and deployment methods.
      ii.  Easy interaction through a REST interface (a client sketch follows this list).
      iii. Support for workflow and cron-alike jobs.
      iv.  A centralized logging infrastructure.
      v.   Not just Hadoop tooling, but also “scripts”.
      vi.  Easy to extend, maintain and adapt to our needs and wishes.
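      To make point ii concrete, talking to such a scheduler from any client could look like the following; the host, routes and payload fields are assumptions made for this sketch, not the actual service's API.

          require 'net/http'
          require 'json'
          require 'uri'

          # Hypothetical endpoint: submit a job on demand, then poll its state.
          base = URI('http://scheduler.internal:4567')

          req = Net::HTTP::Post.new('/jobs', 'Content-Type' => 'application/json')
          req.body = { name: 'backfill_2014_01', type: 'shell',
                       command: './bin/backfill.sh 2014-01' }.to_json
          res = Net::HTTP.start(base.host, base.port) { |http| http.request(req) }
          job_id = JSON.parse(res.body)['id']

          # Read back the tracked state of the job we just submitted.
          puts Net::HTTP.get(base.host, "/jobs/#{job_id}", base.port)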
  14. Our internal solution
      What do we know / who are we?
      - Small team, highly distributed.
      - Diverse knowledge base: mostly C# and Microsoft technologies, but a few with Java and Ruby skills.
      - Lots of ops but just a few devs.
      - Experience with data and databases.
      - In a not very tech-savvy ecosystem.
      - Rapidly changing environment.
  15. Our internal solution
      [Architecture diagram: Client → REST API → Scheduler Server (Registry, Cron Manager, Workflow Manager) → Scheduler Worker → MapReduce (JobTracker/TaskTracker); state kept in a Database, a GraphDB and the File System]
  16. Our internal solution
      We like Ruby, so we chose JRuby, because it also helps with concurrency, memory management, possibly speed, Java integration, etc. From Neo4j we borrow the graph features, so we can handle a workflow (a DAG) more easily. Sinatra is our man for the REST API. We also need to track our service's evolution; for that we use Redis.
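      A minimal sketch of how Sinatra and Redis could fit together here; the routes, Redis keys and fields are assumptions for illustration, not the actual service's API.

          require 'sinatra'
          require 'redis'
          require 'json'

          # Sinatra serves the REST API; Redis keeps per-job state so the
          # service's evolution can be tracked. Keys and routes are illustrative.
          redis = Redis.new

          post '/jobs' do
            job = JSON.parse(request.body.read)
            id  = redis.incr('jobs:next_id')
            redis.hset("job:#{id}", 'name',  job['name'])
            redis.hset("job:#{id}", 'state', 'queued')
            { id: id }.to_json
          end

          get '/jobs/:id' do
            redis.hgetall("job:#{params[:id]}").to_json
          end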
  17. Our internal solution
      Our workflows are defined as a Directed Acyclic Graph (DAG).
      [Diagram: a fork/join workflow from S to E through Task 1, a Fork into Task 2 and Task 3, Task 4, a Join, and Task 5]
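      In plain Ruby, a fork/join DAG like the one on the slide can be held as an adjacency list and linearized into an execution order with the standard library's TSort. The topology below is one plausible reading of the diagram; in the real service the graph lives in Neo4j instead.

          require 'tsort'

          # A Hash-based DAG that knows how to topologically sort itself.
          class Dag < Hash
            include TSort
            alias tsort_each_node each_key
            def tsort_each_child(node, &block)
              fetch(node).each(&block)
            end
          end

          # Adjacency list: each task points at the tasks that follow it.
          dag = Dag[
            start: [:task1], task1: [:fork],  fork:  [:task2, :task3],
            task2: [:task4], task3: [:join],  task4: [:join],
            join:  [:task5], task5: [:end],   end:   []
          ]

          # TSort emits successors first, so reverse to run from :start.
          puts dag.tsort.reverse.inspect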
  18. We have a situation / State of the art / Our internal solution / Still in the TODO list / Talk.schedule();
  19. The TODO list
      Still, we have a functional and ready-to-use scheduler that:
      - runs Hadoop jobs, PIG and shell scripts,
      - tracks job evolution and most of the errors,
      - supports workflow and cron-alike jobs,
      - provides a simple interface for clients,
      - deploys new jobs.
  20. The TODO list
      We still have to improve:
      - the deployment procedure,
      - the workflow management process,
      - the replay capabilities,
      - the reporting module,
      - and add idempotency and data integration.
  21. We have a situation / State of the art / Our internal solution / Still in the TODO list / Talk.schedule();