● We build solutions for very large solar fields, for example in Templin, north of Berlin, Germany:
- 214 hectares
- 1,500,000 photovoltaic modules
- 128.48 MW
- 205 million euros
● We need inside knowledge about the installation, because we want to know:
- if everything is working properly
- better ways to produce the energy
The situation
The scheduler should:
• Be able to run Hadoop jobs, Pig scripts and/or shell scripts.
• Have a REST API.
• Support workflow jobs, but also cron-like jobs (see the sketch after this list).
• Be able to run jobs on demand.
• Be able to track running jobs.
• Provide live statistics about the running jobs.
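As a hedged illustration only (the job names, fields and script paths below are invented, not part of any tool mentioned here), these requirements suggest two kinds of job descriptions, a cron-like job and a workflow of dependent tasks:

    # Hypothetical job descriptions matching the requirements above.
    # Field names, schedules and scripts are illustrative assumptions only.
    cron_job = {
      name:     "daily-module-report",
      schedule: "0 3 * * *",                          # cron-like: every day at 03:00
      command:  "pig -f reports/module_report.pig"    # a Pig script, but any shell command would do
    }

    workflow_job = {
      name:  "ingest-and-aggregate",
      tasks: {
        "import"    => { command: "./import_measurements.sh", depends_on: [] },
        "aggregate" => { command: "pig -f aggregate.pig",     depends_on: ["import"] },
        "export"    => { command: "./export_report.sh",       depends_on: ["aggregate"] }
      }
    }

A scheduler meeting the list above would accept such descriptions on demand, track them while they run and expose them through a REST API.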
State of the art
[Architecture diagram: client, Luigi scheduler, Luigi worker pool, MapReduce (JobTracker, TaskTracker), HDFS]
- Deployment through codebase copy.
https://github.com/spotify/luigi
State of the art
- Oozie: Java; high complexity (> 105k LOC); runs Pig, Hive, Sqoop and MapReduce; decentralized logs; good community (ASF); excellent docs.
- Azkaban: Java; moderate complexity (> 26k LOC); runs Pig, Hive and MapReduce; centralized logs, accessible from the UI; few users; good docs.
- Luigi: Python; simple (> 5.9k LOC); runs Hive, Postgres, Scalding, Python and streaming; decentralized logs; few users; good docs.
Source: http://www.slideshare.net/jcrobak/data-engineermeetup-201309
State of the art
- Oozie: configuration via command line, property files and XML; replay via job rerun; customization is difficult; testing with Mini Oozie; authorization via Kerberos, simple or custom.
- Azkaban: configuration bundled inside the workflow zip plus system defaults; partial rerun; customization via UI plugins; testing: ?; authorization XML based or custom.
- Luigi: configuration via command line and Python init file; replay by removing output (tasks are idempotent); customization by subclassing; testing with Python unit tests; authorization Linux/OS based.
Source: http://www.slideshare.net/jcrobak/data-engineermeetup-201309
State of the art
So the most obvious choice is Oozie, because it:
- has a great community and feature set,
- is integrated with the common Hadoop tooling,
- has good documentation.
But:
- XML-based configuration.
- Scheduling based on map tasks.
- Complicated setup.
- Confusing object model.
- We lack Java experience.
Our internal solution
What do we want?
i. Easy configuration and deployment methods.
ii. Easy interaction through a REST interface.
iii. Support for workflow and cron-like jobs.
iv. A centralized logging infrastructure.
v. Not just Hadoop tooling, but also “scripts”.
vi. Easy to extend, maintain and adapt to our needs and wishes (see the sketch after this list).
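To illustrate points v and vi, here is a minimal sketch, assuming invented class names and command strings (this is not our actual code), of how Hadoop tooling and plain scripts could sit behind one small, easily extended runner interface:

    # Illustrative sketch only: a tiny runner interface so Pig jobs and plain
    # shell scripts are handled the same way, and a new job type only needs
    # one more subclass.
    class Runner
      def run(job)
        raise NotImplementedError
      end
    end

    class ShellRunner < Runner
      def run(job)
        system(job[:command])              # any shell script or binary
      end
    end

    class PigRunner < Runner
      def run(job)
        system("pig", "-f", job[:script])  # delegate to the Pig command line
      end
    end

    RUNNERS = { shell: ShellRunner.new, pig: PigRunner.new }

    def execute(job)
      RUNNERS.fetch(job[:type]).run(job)
    end

    execute({ type: :shell, command: "echo 'hello from the scheduler'" })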
Our internal solution
What do we know / who are we?
- A small, highly distributed team.
- A diverse knowledge base: mostly C# and Microsoft technologies, but a few people with Java and Ruby skills.
- Lots of ops people but just a few devs.
- Experience with data and databases.
- An ecosystem that is not very technical.
- A rapidly changing environment.
Our internal solution
We like Ruby, so we chose JRuby, because it also helps with concurrency, memory management, possibly speed, Java integration, etc.
From Neo4j we borrow the graph features, so we can deal with a workflow (a DAG) more easily.
Sinatra is our man for the REST API.
But we also need to track our service's evolution; for this we use Redis.
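A minimal sketch of that stack, assuming hypothetical route names and Redis keys (this is not our actual internal API, and the Neo4j-backed workflow DAG is left out for brevity):

    # Minimal Sinatra + Redis sketch; routes and key layout are assumptions.
    require "sinatra"
    require "redis"
    require "json"

    redis = Redis.new

    # Submit a job: store its definition and mark it as queued.
    post "/jobs" do
      job = JSON.parse(request.body.read)
      redis.hset("job:#{job['name']}", "definition", job.to_json)
      redis.hset("job:#{job['name']}", "state", "queued")
      { name: job["name"], state: "queued" }.to_json
    end

    # Track a job: return its current state.
    get "/jobs/:name" do
      { name: params[:name],
        state: redis.hget("job:#{params[:name]}", "state") }.to_json
    end

Both Sinatra and the redis gem run on JRuby, so the Java integration mentioned above stays available for the Hadoop side.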
The TODO list
However, we have a functional and ready-to-use scheduler that can:
- run Hadoop jobs, Pig and shell scripts,
- track job evolution and most of the errors,
- support workflow and cron-like jobs,
- provide a simple interface for clients (see the usage sketch below),
- deploy new jobs.
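For illustration only, a client could submit a job through that interface roughly like this (host, port and payload shape are assumptions of the sketch, not the real deployment):

    # Hypothetical client call against the scheduler's REST interface.
    require "net/http"
    require "uri"
    require "json"

    uri = URI("http://scheduler.example.local:4567/jobs")
    job = {
      name:     "daily-module-report",
      type:     "pig",
      script:   "reports/module_report.pig",
      schedule: "0 3 * * *"
    }

    response = Net::HTTP.post(uri, job.to_json, "Content-Type" => "application/json")
    puts response.body   # e.g. {"name":"daily-module-report","state":"queued"}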
The TODO list
We still have to improve:
- The deployment procedure.
- The workflow management process.
- The replay capabilities.
- The reporting module.
- Add idempotency and data integration.