Helping Data Teams with Puppet / Puppet Camp London - Apr 13, 2015

Puppet is widely known in the DevOps community but is less popular among data teams. Nevertheless, Puppet can easily empower your data teams. This talk presents hands-on experience of using Puppet for different data topics, starting with configuring a Windows machine for Business Intelligence and finishing with an advanced ranking infrastructure based on Puppet.
The talk walks you through the process of setting up a standalone Puppet configuration that is used for provisioning a Windows machine for Business Intelligence purposes such as Tableau and Talend Big Data configurations, ETL scheduling, etc. The second part of the talk covers a use case of Puppet for enabling a lean ranking infrastructure.

Sergii Khomenko

April 13, 2015

Transcript

  1. STYLIGHT.COM
    Helping Data Teams with Puppet
    Sergii Khomenko, Data Scientist, sergii.khomenko@stylight.com, @lc0d3r
  2. AGENDA
    •  Who? What? Why?
    •  Setting up your BI with Puppet.
    •  Small tips and tricks
    •  Puppet your ranking
  3. Data scientist at one of the biggest fashion communities, STYLIGHT.
    Data analysis and visualization hobbyist.
    Speaker at Berlin Buzzwords 2014, ApacheCon Europe 2014.
    Founder and speaker at Munich Golang UG, Munich Tableau UG.
    Speaker at Munich UseR Group, Munich Search UG, Munich Quantified Self UG.
    Sergii Khomenko
    Milos Radovanovic
    Passionate about DevOps stuff:
    1.  microservices
    2.  docker
    3.  12 factor apps
    4.  continuous integration/deployment
  4. STYLIGHT – international community
    Live in 12 countries
  5. STYLIGHT.COM
    Setting up your BI with Puppet.
  6. Minimum Viable BI
    •  Tableau - reporting and ad-hocs
    •  Python/Talend ETL tools
  7. Minimum Viable BI: RUNNING PUPPET IN A STANDALONE MODE
    We use Puppet for *nix servers and can't merge that setup with the Windows machines.
    Standalone mode for Puppet
    – easier to start and develop
    – Windows machines are separated from *nix ones
  8. Minimum Viable BI: RUNNING PUPPET IN A STANDALONE MODE
    cd c:\folder\with\our-bi
    git pull origin master
    IF %ERRORLEVEL% NEQ 0 set context=GIT_FAILURE && goto error_handler
    puppet apply --modulepath=puppet\modules puppet\win-node-name.net.pp
    IF %ERRORLEVEL% NEQ 0 set context=PUPPET_FAILURE && goto error_handler
    goto end
  9. Minimum Viable BI: RUNNING PUPPET IN A STANDALONE MODE
    :error_handler
    echo entering error_handler
    EVENTCREATE /T ERROR /L APPLICATION /SO Puppet_Scheduler /ID 100 /D "EXECUTION FAILED REASON %context%"
    goto end
    :end
    echo DONE
  10. Minimum Viable BI: RUNNING PUPPET IN A STANDALONE MODE
    Standalone mode for Puppet
    – configuration is totally separated
    – custom modules --modulepath=puppet\modules
    – GitHub hosted configuration
    – Error handling via Windows event log
  11. Minimum Viable BI: SCHEDULING IS IMPORTANT
    node 'win-node-name.net' {
      scheduled_task { 'refresh-1':
        ensure    => present,
        enabled   => true,
        command   => 'C:\path\to\your\script.bat',
        arguments => 'some args',
  12. Minimum Viable BI: SCHEDULING IS IMPORTANT
        user      => 'your-user',
        password  => 'your-password',
        trigger   => {
          schedule   => daily,
          start_time => '06:00',
        }
      }
  13. Minimum Viable BI: SYNC MY CONFIGURATION EVERY 15 MIN
    # Can't use the Puppet's scheduled_task as it does not support to run the schedule task every 5 minutes.
    https://github.com/sdliangzhihua/windows-puppet-example/blob/master/manifest.pp#L68
  14. Minimum Viable BI: SYNC MY CONFIGURATION EVERY 15 MIN
    $cmd = 'C:\Windows\system32\cmd.exe'
    $job_name = 'sync_code'
    exec { 'CreateCodeSyncScheduledTask':
      command => "${cmd} /C schtasks /create /sc MINUTE /mo 15 /tn ${job_name} /tr C:\\your\\puppet.bat /ru administrator /f",
      onlyif  => ["${cmd} /C schtasks /query /tn ${job_name} & if errorlevel 1 (exit /b 0) else exit /b 1"],
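The `onlyif` guard on this slide inverts the exit code of `schtasks /query` so that the task is only created when it is missing. A minimal sketch of the same idea using the `exec` type's `unless` parameter, which skips the command when the check exits with 0, might look like the following. The task name and script path are the placeholders from the slide; this variant is an assumption for illustration, not the speaker's code.

```puppet
# Sketch only: the same scheduled-task bootstrap as on the slide, but guarded
# with `unless` instead of an inverted `onlyif`.
$cmd      = 'C:\Windows\system32\cmd.exe'
$job_name = 'sync_code'

exec { 'CreateCodeSyncScheduledTask':
  # Create a task that runs the sync script every 15 minutes.
  command => "${cmd} /C schtasks /create /sc MINUTE /mo 15 /tn ${job_name} /tr C:\\your\\puppet.bat /ru administrator /f",
  # `schtasks /query` exits with 0 when the task already exists, so the exec
  # is skipped on later runs and only fires while the task is missing.
  unless  => "${cmd} /C schtasks /query /tn ${job_name}",
}
```

Either guard keeps the resource idempotent: the first Puppet run creates the task and later runs leave it alone.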
  15. STYLIGHT.COM
    Small tips and tricks
    do not repeat yourself and other tricks
  16. Minimum Viable BI: SCHEDULING IS IMPORTANT
    node 'win-node-name.net' {
      scheduled_task { 'refresh-1':
        ensure    => present,
        enabled   => true,
        command   => 'C:\path\to\your\script.bat',
        arguments => 'some args',
  17. Small tips and tricks
    class job_scheduler(
      $ensure      = $job_scheduler::params::ensure,
      $enabled     = $job_scheduler::params::enabled,
      $user        = $job_scheduler::params::user,
      $password    = $job_scheduler::params::password,
      $working_dir = $job_scheduler::params::working_dir,
    ) inherits job_scheduler::params {
    }
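The class on this slide pulls its defaults from `job_scheduler::params`, which is not shown in the deck. A minimal sketch of what such a params class could look like follows; every value is a hypothetical placeholder, not a setting taken from the talk.

```puppet
# Hypothetical sketch of the params class that job_scheduler inherits from.
# None of these values come from the talk; replace them with your own.
class job_scheduler::params {
  $ensure      = 'present'
  $enabled     = true
  $user        = 'your-user'
  $password    = 'your-password'
  $working_dir = 'C:\folder\with\our-bi'
}
```

With this pattern a node can accept the defaults via `include job_scheduler`, or override individual values with a resource-like class declaration.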
  18. Small tips and tricks
    define job_scheduler::job (
      $arguments     = 'tableau_adobe.py',
      $command       = 'c:\Py27-32\python.exe',
      $schedule_type = 'daily',
      $start_time    = '08:15',
      $day_of_week   = 'every',
    ) {
  19. Small tips and tricks
    define job_scheduler::tableau_job (
      $arguments     = 'default-tableau',
      $command       = 'c:\folder\tableau.bat',
      $schedule_type = 'daily',
      $start_time    = '21:00',
      $day_of_week   = 'every',
    ) {
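The slides show only the parameter lists of the defined types, not their bodies. One plausible body is sketched here, under the assumption that each job simply wraps a `scheduled_task` resource and reads the shared settings from the `job_scheduler` class. The trigger handling in particular is a guess at how the `'every'` default might be treated, not the speaker's implementation.

```puppet
# Sketch only: a possible body for the defined type from the slide.
define job_scheduler::tableau_job (
  $arguments     = 'default-tableau',
  $command       = 'c:\folder\tableau.bat',
  $schedule_type = 'daily',
  $start_time    = '21:00',
  $day_of_week   = 'every',
) {
  # Shared settings (user, password, working_dir, ...) come from the main class.
  include job_scheduler

  # Assumption: only weekly triggers carry an explicit list of days.
  if $schedule_type == 'weekly' {
    $trigger = {
      schedule    => $schedule_type,
      start_time  => $start_time,
      day_of_week => $day_of_week,
    }
  } else {
    $trigger = {
      schedule   => $schedule_type,
      start_time => $start_time,
    }
  }

  # The resource title (for example 'weekly update') becomes the task name.
  scheduled_task { $title:
    ensure      => $job_scheduler::ensure,
    enabled     => $job_scheduler::enabled,
    command     => $command,
    arguments   => $arguments,
    working_dir => $job_scheduler::working_dir,
    user        => $job_scheduler::user,
    password    => $job_scheduler::password,
    trigger     => $trigger,
  }
}
```

Keeping the credentials and working directory in one class means each job declaration only has to state what differs, typically the start time and arguments.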
  20. Small tips and tricks
    # Params with default values for the tableau job
    # that might be changed in a job definition
    #
    # 1. $arguments ='default-argument',
    # 2. $command ='c:\folder\script.bat',
    # 3. $schedule_type ='daily',
    # 4. $start_time ='21:00',
    # 5. $day_of_week ='every',
    ####################
  21. Small tips and tricks
    job_scheduler::tableau_job {
      'some job':
        start_time => '01:00',
        arguments  => 'args';
      'default refresh-1':
        start_time => '06:00';
      'default refresh-2':
        start_time => '10:00';
      'weekly update':
        start_time    => '03:35',
        arguments     => 'weekly-update',
        schedule_type => weekly,
        day_of_week   => ['mon'];
    }
  22. Small tips and tricks
    job_scheduler::redshift_job {
      'RS tagged products':
        start_time => '00:40',
        params     => '..\datasources\something.tds';
      'RS another job':
        start_time => '00:50',
        params     => '..\datasources\else.tds'
  23. STYLIGHT.COM
    Puppet your ranking
    Lean, flexible, powerful
  24. A ranking is a relationship between a set of items such that, for any two items, the first is either 'ranked higher than', 'ranked lower than' or 'ranked equal to' the second.
  25. Ranking specifics:
    •  Seasonal influence
    •  Trends
    •  Cold start of new countries, shops
    •  Multiple dimensions of ranking model
  26. Lean approach to Ranking: MULTIPLE POINTS OF EVALUATION
    Requirements:
    •  Decreasing time to implement new ranking model
    •  Keeping working infrastructure alive
    •  A/B testing without changing entire infrastructure
    •  Performance level - “still fast” and “transparent”
  27. Updated infrastructure
    [Diagram; labels: Front-end loadbalancer, Jboss, Solr-loadbalancer, nginx, Solr]
  28. Lean approach to Ranking
    q = +brand:adidas shop:monshowroom^3

    q = +adidas monshowroom
    defType = dismax
    qf = brand shop^3
    sort = user_ratings desc, score desc

    qq = adidas
    q = {!boost b=$b defType=dismax v=$qq}
    b = prod(popularity, clicks)
  29. Lean approach to Ranking
    solr0x.node.company.pp
    include nginx
    nginx::config { "solr_dev":
    }
    nginx::solr-ranking { "delta2":
      urls => [
        "/some.thing?gender=women&brand=2271&tag=1161&tag=877&tag=468",
        "/some.thing?gender=men&brand=11235&tag=10203&tag=10299&tag=10326"
      ],
  30. Lean approach to Ranking
    nginx / templates / conf / solr-rewrites.conf.erb
    <% urls.each do |url| -%>
    if ($args ~* <% if url['gender'] > 0 -%>gender_id%3A<%= url['gender'] %>.*<% end -%><% url['tags'].each do |tag| -%>tag_id%3A<%= tag %>.*<% end -%><% if url['brand'] > 0 -%>brand_id%3A%28<%= url['brand'] %>%29<% end -%>) {
      set $orig $args;
      set $args "q={!boost+b=%24b+defType=dismax+v=%24qq}&qq=id:*";
      rewrite ^(.*)$ "$1?$orig" break;
    }
    <% end -%>
  31. Lean approach to Ranking: MULTIPLE POINTS OF EVALUATION
    Stages to evaluate a model:
    •  R ranking model
    •  Independent Solr-node
      1.  For internal use-cases
      2.  Testing for some of pages
      3.  A/B roll out for % of users
    •  Production roll out
  32. STYLIGHT.COM
    Sergii Khomenko
    Data Scientist
    STYLIGHT GmbH
    sergii.khomenko@stylight.com
    @lc0d3r
    Nymphenburger Straße 86
    80636 Munich, Germany