● We build solutions for very large solar fields, for example in Templin, north of Berlin, Germany:
- 214 hectares
- 1,500,000 photovoltaic modules
- 128.48 MW
- 205 million euros
● We need inside knowledge about the installation, because we want to know:
- if everything is working properly
- better ways to produce the energy
The situation
The scheduler should:
• Be able to run Hadoop jobs, Pig scripts and/or shell scripts.
• Have a REST API.
• Support workflow jobs, but also cron-like jobs (see the sketch after this list).
• Be able to run jobs on demand.
• Be able to track running jobs.
• Provide live statistics about the running jobs.
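As a hedged illustration only (the job names, fields and script paths below are invented, not part of any tool mentioned here), these requirements suggest two kinds of job descriptions, a cron-like job and a workflow of dependent tasks:

    # Hypothetical job descriptions matching the requirements above.
    # Field names, schedules and scripts are illustrative assumptions only.
    cron_job = {
      name:     "daily-module-report",
      schedule: "0 3 * * *",                          # cron-like: every day at 03:00
      command:  "pig -f reports/module_report.pig"    # a Pig script, but any shell command would do
    }

    workflow_job = {
      name:  "ingest-and-aggregate",
      tasks: {
        "import"    => { command: "./import_measurements.sh", depends_on: [] },
        "aggregate" => { command: "pig -f aggregate.pig",     depends_on: ["import"] },
        "export"    => { command: "./export_report.sh",       depends_on: ["aggregate"] }
      }
    }

A scheduler meeting the list above would accept such descriptions on demand, track them while they run and expose them through a REST API.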
State of the art
[Architecture diagram: client, Luigi scheduler, Luigi worker pool, MapReduce (JobTracker, TaskTracker), HDFS]
- Deployment through codebase copy.
https://github.com/spotify/luigi
State of the art
- Oozie: Java; high complexity (> 105k LOC); runs Pig, Hive, Sqoop and MapReduce; decentralized logs; good community (ASF); excellent docs.
- Azkaban: Java; moderate complexity (> 26k LOC); runs Pig, Hive and MapReduce; centralized logs, accessible from the UI; few users; good docs.
- Luigi: Python; simple (> 5.9k LOC); runs Hive, Postgres, Scalding, Python and streaming; decentralized logs; few users; good docs.
Source: http://www.slideshare.net/jcrobak/data-engineermeetup-201309
State of the art
- Oozie: configuration via command line, property files and XML; replay via job rerun; customization is difficult; testing with Mini Oozie; authorization via Kerberos, simple or custom.
- Azkaban: configuration bundled inside the workflow zip plus system defaults; partial rerun; customization via UI plugins; testing: ?; authorization XML based or custom.
- Luigi: configuration via command line and Python init file; replay by removing output (tasks are idempotent); customization by subclassing; testing with Python unit tests; authorization Linux/OS based.
Source: http://www.slideshare.net/jcrobak/data-engineermeetup-201309
State of the art
So the most obvious choice is Oozie, because it:
- has a great community and feature set,
- is integrated with the common Hadoop tooling,
- has good documentation.
But:
- XML-based configuration.
- Scheduling based on map tasks.
- Complicated setup.
- Confusing object model.
- We lack Java experience.
Our internal solution
What do we want?
i. Easy configuration and deployment methods.
ii. Easy interaction through a REST interface.
iii. Support for workflow and cron-like jobs.
iv. A centralized logging infrastructure.
v. Not just Hadoop tooling, but also “scripts”.
vi. Easy to extend, maintain and adapt to our needs and wishes (see the sketch after this list).
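To illustrate points v and vi, here is a minimal sketch, assuming invented class names and command strings (this is not our actual code), of how Hadoop tooling and plain scripts could sit behind one small, easily extended runner interface:

    # Illustrative sketch only: a tiny runner interface so Pig jobs and plain
    # shell scripts are handled the same way, and a new job type only needs
    # one more subclass.
    class Runner
      def run(job)
        raise NotImplementedError
      end
    end

    class ShellRunner < Runner
      def run(job)
        system(job[:command])              # any shell script or binary
      end
    end

    class PigRunner < Runner
      def run(job)
        system("pig", "-f", job[:script])  # delegate to the Pig command line
      end
    end

    RUNNERS = { shell: ShellRunner.new, pig: PigRunner.new }

    def execute(job)
      RUNNERS.fetch(job[:type]).run(job)
    end

    execute({ type: :shell, command: "echo 'hello from the scheduler'" })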
Our internal solution
What do we know / who are we?
- A small, highly distributed team.
- A diverse knowledge base: mostly C# and Microsoft technologies, but a few people with Java and Ruby skills.
- Lots of ops people but just a few devs.
- Experience with data and databases.
- An ecosystem that is not very technical.
- A rapidly changing environment.
Our internal solution
We like Ruby, so we chose JRuby, because it also helps with concurrency, memory management, possibly speed, Java integration, etc.
From Neo4j we borrow the graph features, so we can deal with a workflow (a DAG) more easily.
Sinatra is our man for the REST API.
But we also need to track our service's evolution; for this we use Redis.
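A minimal sketch of that stack, assuming hypothetical route names and Redis keys (this is not our actual internal API, and the Neo4j-backed workflow DAG is left out for brevity):

    # Minimal Sinatra + Redis sketch; routes and key layout are assumptions.
    require "sinatra"
    require "redis"
    require "json"

    redis = Redis.new

    # Submit a job: store its definition and mark it as queued.
    post "/jobs" do
      job = JSON.parse(request.body.read)
      redis.hset("job:#{job['name']}", "definition", job.to_json)
      redis.hset("job:#{job['name']}", "state", "queued")
      { name: job["name"], state: "queued" }.to_json
    end

    # Track a job: return its current state.
    get "/jobs/:name" do
      { name: params[:name],
        state: redis.hget("job:#{params[:name]}", "state") }.to_json
    end

Both Sinatra and the redis gem run on JRuby, so the Java integration mentioned above stays available for the Hadoop side.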
The TODO list
However, we have a functional and ready-to-use scheduler that can:
- run Hadoop jobs, Pig and shell scripts,
- track job evolution and most of the errors,
- support workflow and cron-like jobs,
- provide a simple interface for clients (see the usage sketch below),
- deploy new jobs.
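For illustration only, a client could submit a job through that interface roughly like this (host, port and payload shape are assumptions of the sketch, not the real deployment):

    # Hypothetical client call against the scheduler's REST interface.
    require "net/http"
    require "uri"
    require "json"

    uri = URI("http://scheduler.example.local:4567/jobs")
    job = {
      name:     "daily-module-report",
      type:     "pig",
      script:   "reports/module_report.pig",
      schedule: "0 3 * * *"
    }

    response = Net::HTTP.post(uri, job.to_json, "Content-Type" => "application/json")
    puts response.body   # e.g. {"name":"daily-module-report","state":"queued"}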
The TODO list
We still have to improve:
- The deployment procedure.
- The workflow management process.
- The replay capabilities.
- The reporting module.
- Add idempotency and data integration.