Scaling Ansible

Ansible's agentless nature and simplicity are two of the aspects that determined its incredible success as a configuration management and automation tool. However, despite being very powerful, agentless automation has some limitations that very often increase the complexity of the Ansible code needed to manage complex infrastructures. In this talk we will dive deep into the Ansible execution model and discuss some workarounds for effectively implementing Ansible as a configuration management tool at scale without compromising performance or the readability of the code.

Andrea Tosatto

March 08, 2019

Transcript

  1. Scaling Ansible Andrea Tosatto @Open-Xchange Loadays 2019 @_hilbert_ atosatto andrea@tosatto.me

  2. WHOAMI

  3. POWERDNS
    • Started in 1999
    • Open-Source since 2007
    • Part of Open-Xchange since 2015
    • Powering more than 30% of the internet hosted domains in the world, 75% of the DNSSEC domains in the world, 150 million internet users
    • We code mainly in C++, but we also use Lua, Python and Golang.
    • We use Ansible to automate the rollout and management of most of our customers' deployments.
    • @PowerDNS powerdns/pdns
  4. WHY ANSIBLE

  5. #ANSI-LOVE <3
    • ANSIBLE IS EASY
    • ANSIBLE HAS BEEN INCREDIBLY EFFECTIVE FOR US
    • IT ALLOWED US TO QUICKLY ON-BOARD NEW TEAM MEMBERS
    • IT ALLOWED US TO EASILY TRAIN CUSTOMERS TO OPERATE OUR SOLUTION
    • IT INCREASED THE VELOCITY OF THE ROLLOUT OF OUR PLATFORM IN COMPLEX ENVIRONMENTS
    • IT ALLOWED US TO MAKE ALL THE DEPLOYMENTS OF OUR SOLUTION UNIFORM
    • ANSIBLE IS EASY TO IMPLEMENT
    • YOU JUST NEED SSH AND PYTHON
    • NO “MASTER” NO “CERTIFICATES” NO “AGENTS”
  6. #ANSI-LOVE <3
  7. SO WHAT?!

  8. DEPLOYING POWERDNS
    • 160 VMS (76 more coming in 4 weeks)
    • 5 DATACENTERS (2 more in the pipeline)
    • 1 BASTION HOST
    • 80 ROLES
    • 56 PLAYS (including facts fetching)
    • 832 TASKS
  9. IT’S RUNNING!

  10. WHERE TO START?!

  11. TEST ENVIRONMENT (1)
    • 9 VMS
    • 1 DATACENTER
    • 1 BASTION HOST
    • 80 ROLES
    • 56 PLAYS (including facts fetching)
    • 832 TASKS
    • ANSIBLE 2.6.14
    • PYTHON 2.7.16
  12. TEST ENVIRONMENT (2)
    Diagram labels: Infrastructure Code, Terraform, Ansible, Ansible Inventory, Deployment Code, Application Deployment, Behave, E2E Testing Code
  13. ANSIBLE EXECUTION TIME (1)
    • WE HAVE A MIXED USAGE PATTERN FOR OUR ANSIBLE CODE:
      • FOR SOME USE-CASES WE MAKE USE OF EPHEMERAL ENVIRONMENTS, SO ANSIBLE IS ALWAYS USED TO INSTALL FROM SCRATCH
      • FOR OTHER USE-CASES WE CONSTANTLY RUN ANSIBLE AGAINST THE SAME SET OF HOSTS TO CONTINUOUSLY UPGRADE THE APPLICATION DEPLOYMENT, SO ANSIBLE IS ALWAYS USED TO UPDATE THE SOFTWARE AND ITS CONFIGURATION
    • NONE OF OUR CUSTOMERS MAKES USE OF IMMUTABLE INFRASTRUCTURE
    • DUE TO THE IMPORTANCE TO THE BUSINESS OF THE SERVICES WE ARE DEPLOYING AND TO THE SIZE OF THE DEPLOYMENTS, ANSIBLE EXECUTION TIME IS VERY IMPORTANT TO US, BECAUSE KEEPING IT LOW ALLOWS US TO:
      • QUICKLY AUDIT THE CONFIGURATION / STATUS OF EACH INSTANCE
      • INCREASE THE CONFIDENCE WHEN ROLLING OUT CHANGES ACROSS THE INFRASTRUCTURE
  14. ANSIBLE EXECUTION TIME (2)

  15. ANSIBLE EXECUTION TIME (3)
    THE TIMING INFORMATION REPORTED IN THE NEXT SET OF SLIDES HAS BEEN MEASURED IN THE FOLLOWING CONDITIONS:
    • AGAINST AN ALREADY CONVERGED INFRASTRUCTURE (NO CHANGES / NO ERRORS)
    • AGAINST OUR TESTING ENVIRONMENT (AS DESCRIBED BEFORE) RUNNING ON OPENSTACK
    • RUNNING ANSIBLE THROUGH A VPN CONNECTION TO BE ABLE TO REACH THE TARGET HOSTS
  16. ANSIBLE EXECUTION TIME (4)

  17. SOME EASY IMPROVEMENTS

  18. TAGS AND LIMITS (1)
    • WE USE TAGS TO SELECT THE PLAYS TO BE EXECUTED
    • WE USE LIMITS TO IMPLEMENT ROLLING CONFIGURATION AND SOFTWARE UPDATES
    • WE RUN THE FULL ANSIBLE PLAYBOOK IN “CHECK MODE” TO CONTINUOUSLY AUDIT THE CONVERGENCE STATUS OF THE PLATFORM AND SPOT MISCONFIGURATIONS / HOSTS NOT YET ROLLED OUT
    • WE COLLECT ALL THE ANSIBLE EXECUTION RESULTS WITH OPENSTACK ARA
    • NO, WE DON’T USE TOWER (YET?!) ;-)
  19. TAGS AND LIMITS (2)

  20. AVOID UNNECESSARY LOOPS

  21. FACTS GATHERING AND CACHING (1)
    • THE DEFAULT ANSIBLE FACTS GATHERING STRATEGY, “IMPLICIT”, FETCHES FACTS AT EVERY PLAY EXECUTION
    • IN OUR CASE, THAT MEANS THAT DURING A FULL RUN OF OUR PLAYBOOK ANSIBLE WILL FETCH FACTS FROM THE SAME HOSTS 50 TIMES!!!!
    • IMHO, “SMART” GATHERING – FETCHING FACTS ONLY FOR HOSTS NOT YET DISCOVERED – SHOULD BE THE DEFAULT!
    • FACTS CACHING IS A MUST-HAVE WHEN ABUSING LIMITS TO RESTRICT THE EXECUTION OF AN ANSIBLE PLAYBOOK AND USING FACTS TO CROSS-REFERENCE HOSTVARS IN TEMPLATES
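
The gathering and caching behaviour discussed on this slide is controlled from the `[defaults]` section of `ansible.cfg`. A minimal sketch, assuming the built-in `jsonfile` cache plugin; the cache directory and the timeout are illustrative values, not the ones used in the talk:

```ini
[defaults]
# Gather facts only for hosts that do not already have facts available.
gathering = smart
# Persist facts between runs using the built-in JSON file cache plugin.
fact_caching = jsonfile
# Assumed cache location; any directory writable by the Ansible user works.
fact_caching_connection = /tmp/ansible_fact_cache
# Assumed expiry of cached facts, in seconds (24 hours).
fact_caching_timeout = 86400
```

With a persistent cache, a run restricted by a limit can still read the facts of hosts outside the limit through hostvars, provided those facts were gathered in an earlier run and have not expired.
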
  22. FACTS GATHERING AND CACHING (2)

  23. FACTS GATHERING AND CACHING (3)

  24. FACTS GATHERING AND CACHING (4)

  25. TUNING SSH CONNECTIONS

  26. ANSIBLE EXECUTION MODEL
    Diagram labels: inventory, playbook, Playbook Executor, TaskQueue Manager, Strategy, PlayIterator + Task Executor, host, fork, ssh, Worker Process
  27. CONNECTIONS MULTIPLEXING
    • ANSIBLE SHIPS WITH SOME SANE CONNECTION MULTIPLEXING CONFIGURATIONS
        ssh_args = -C -o ControlMaster=auto -o ControlPersist=60s
    • HOWEVER, DEPENDING ON THE PLAYBOOK EXECUTION DURATION AND THE NUMBER OF HOSTS, THESE CAN BE IMPROVED TO MITIGATE THE IMPACT OF THE SSH CONNECTION OVERHEAD
        ssh_args = -C -o ControlMaster=auto -o ControlPersist=900s
  28. FORKS
    • FORKS CONTROLS THE DEFAULT NUMBER OF PROCESSES TO SPAWN WHEN COMMUNICATING WITH REMOTE HOSTS
    • IN CASE OF A LARGE NUMBER OF HOSTS, HIGHER VALUES WILL MAKE ACTIONS ACROSS ALL OF THE HOSTS COMPLETE FASTER
    • SETTING A BIG FORKS NUMBER WILL RESULT IN HIGHER NETWORK AND CPU LOAD ON YOUR ANSIBLE HOST
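
Forks is set in the `[defaults]` section of `ansible.cfg`, or per run with the `-f` / `--forks` option of `ansible-playbook`. The value below is only an illustration; the slides do not state which value was used:

```ini
[defaults]
# Ansible's default is 5; 50 is an assumed example value, to be tuned
# against the CPU and network capacity of the host running Ansible.
forks = 50
```
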
  29. PIPELINING (1)
    • BY DEFAULT, EACH ANSIBLE MODULE EXECUTION RESULTS IN AT LEAST THE FOLLOWING ACTIONS
      • (1) GENERATION OF THE COMPILED MODULE PYTHON FILE WITH ITS PARAMETERS FOR REMOTE EXECUTION (LOCAL)
      • (2) CONNECTION TO THE TARGET MACHINE TO DETECT THE USER HOME DIRECTORY (SSH)
      • (3) CREATION OF THE TEMP DIRECTORIES ON THE TARGET MACHINE (SSH)
      • (4) TRANSFER OF THE MODULE SOURCE CODE AND BINARY ARGUMENTS (SFTP)
      • (5) EXECUTION OF THE CODE AND CLEANUP OF THE TEMPORARY DIRECTORIES (SSH)
      • (6) RETRIEVAL OF THE MODULE'S RESULTS FROM THE SSH CONNECTION STANDARD OUTPUT (LOCAL)
  30. PIPELINING (2)
    • PIPELINING REDUCES THE NUMBER OF SSH OPERATIONS, CUTTING THE STEPS EXECUTED BY ANSIBLE TO RUN A MODULE DOWN TO
      • (1) GENERATION OF THE COMPILED MODULE PYTHON FILE WITH ITS PARAMETERS FOR REMOTE EXECUTION (LOCAL)
      • (2) CONNECTION VIA SSH TO EXECUTE THE PYTHON INTERPRETER (SSH)
      • (3) SENDING OF THE PYTHON FILE CONTENT TO THE INTERPRETER'S STANDARD INPUT (LOCAL)
      • (4) RETRIEVAL OF THE MODULE'S RESULT FROM STANDARD OUTPUT (LOCAL)
    • WHEN USING “SUDO” OPERATIONS YOU MUST FIRST DISABLE “REQUIRETTY” ON ALL THE MANAGED HOSTS
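
Pipelining is disabled by default and can be switched on in the `[ssh_connection]` section of `ansible.cfg`. A minimal sketch:

```ini
[ssh_connection]
# Off by default. When tasks use sudo, requiretty must be disabled in the
# sudoers configuration of every managed host before enabling this.
pipelining = True
```
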
  31. TUNING SSH CONNECTIONS (1)

  32. TUNING SSH CONNECTIONS (2)

  33. CHANGING STRATEGY

  34. LINEAR VS FREE STRATEGY (1)
    • ANSIBLE SHIPS WITH TWO DIFFERENT TASK EXECUTION STRATEGIES: LINEAR AND FREE
    • THE DEFAULT TASK EXECUTION STRATEGY USED BY ANSIBLE IS LINEAR
    • WITH THE LINEAR STRATEGY, ALL THE HOSTS WILL RUN EACH TASK BEFORE ANY HOST STARTS THE NEXT TASK, USING THE NUMBER OF FORKS TO PARALLELIZE
  35. LINEAR VS FREE STRATEGY (2)
    • THE FREE STRATEGY ALLOWS EACH HOST TO RUN UNTIL THE END OF THE PLAY AS FAST AS IT CAN
    • THE FREE STRATEGY IS GENERALLY VERY EFFECTIVE IN PLAYS WITHOUT INTER-NODE DEPENDENCIES
    • NOTE THAT STRATEGIES ARE TIED TO PLAYS. YOU CAN MIX STRATEGIES IN YOUR PLAYBOOK
    • IN OUR USE CASE, GIVEN THE NUMBER OF PLAYS, USING THE FREE STRATEGY DIDN’T RESULT IN ANY RELEVANT SPEED-UP OF THE FULL PLAYBOOK EXECUTION
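
Besides the per-play `strategy` keyword, a default strategy for all plays can be set in `ansible.cfg`. A sketch, assuming free is wanted as the default and plays with inter-node dependencies override it with `strategy: linear`:

```ini
[defaults]
# Ansible's default is linear; individual plays can still override this
# value with their own strategy keyword.
strategy = free
```
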
  36. MITOGEN (1)
    • MITOGEN FOR ANSIBLE IS A COMPLETELY REDESIGNED UNIX CONNECTION LAYER AND MODULE RUNTIME FOR ANSIBLE
    • CODE IS EPHEMERALLY CACHED IN RAM, REDUCING BANDWIDTH USAGE BY AN ORDER OF MAGNITUDE COMPARED TO SSH PIPELINING
    • SINCE SIGNIFICANT STATE IS MAINTAINED IN RAM BETWEEN STEPS, NO REPEATED AUTHENTICATION EVENTS OCCUR ON THE HOSTS
    • IN-RAM CACHING ALLOWS FEWER WRITES TO THE TARGET FILESYSTEM. IN TYPICAL CONFIGURATIONS, ANSIBLE REPEATEDLY REWRITES AND EXTRACTS ZIP FILES TO MULTIPLE TEMPORARY DIRECTORIES ON THE TARGET.
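
Following the Mitogen for Ansible documentation linked at the end of the deck, the extension is enabled by pointing Ansible at Mitogen's strategy plugins and selecting one of them. A sketch, where the path is a placeholder that depends on where Mitogen is unpacked or installed:

```ini
[defaults]
# Placeholder path: adjust to the actual location of the Mitogen sources.
strategy_plugins = /path/to/mitogen/ansible_mitogen/plugins/strategy
# Mitogen-backed counterpart of the default linear strategy.
strategy = mitogen_linear
```
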
  37. MITOGEN (2)

  38. MITOGEN (3)

  39. DON’T WAIT FOR ANSIBLE TO COMPLETE

  40. PUT YOUR ANSIBLE ON STEROIDS

  41. USEFUL LINKS
    • Speeding up Ansible Playbook runs - https://chrisbergeron.com/2018/06/08/ansible_performance_tuning/
    • Ansible – Faster Then Light - https://github.com/Fobhep/cfmgmt2019/blob/master/cfgmgt2019.pdf
    • List of Ansible settings - https://docs.ansible.com/ansible/latest/reference_appendices/config.html
    • Mitogen for Ansible - https://mitogen.readthedocs.io/en/latest/ansible.html#
    • Test Driven Infrastructure Development - https://speakerdeck.com/atosatto/test-driven-infrastructure-development
  42. THANK YOU PS: WE’RE HIRING - https://www.open-xchange.com/jobs/