Slide 1

Slide 1 text

Scaling Ansible Andrea Tosatto @Open-Xchange Loadays 2019 @_hilbert_ atosatto [email protected]

Slide 2

Slide 2 text

WHOAMI

Slide 3

Slide 3 text

POWERDNS
Started in 1999
Open-Source since 2007
Part of Open-Xchange since 2015
Powering more than 30% of the internet hosted domains in the world, 75% of the DNSSEC domains in the world, and 150 million internet users
We code mainly in C++, but we also use Lua, Python and Golang.
We use Ansible to automate the rollout and management of most of our customers' deployments.
@PowerDNS powerdns/pdns

Slide 4

Slide 4 text

WHY ANSIBLE

Slide 5

Slide 5 text

#ANSI-LOVE <3
• ANSIBLE IS EASY
• ANSIBLE HAS BEEN INCREDIBLY EFFECTIVE FOR US
• IT ALLOWED US TO QUICKLY ON-BOARD NEW TEAM MEMBERS
• IT ALLOWED US TO EASILY TRAIN CUSTOMERS TO OPERATE OUR SOLUTION
• IT INCREASED THE VELOCITY OF THE ROLLOUT OF OUR PLATFORM IN COMPLEX ENVIRONMENTS
• IT ALLOWED US TO STANDARDIZE ALL THE DEPLOYMENTS OF OUR SOLUTION
• ANSIBLE IS EASY TO IMPLEMENT
• YOU JUST NEED SSH AND PYTHON
• NO “MASTER” NO “CERTIFICATES” NO “AGENTS”

Slide 7

Slide 7 text

SO WHAT?!

Slide 8

Slide 8 text

DEPLOYING POWERDNS
• 160 VMS (76 more coming in 4 weeks)
• 5 DATACENTERS (2 more in the pipeline)
• 1 BASTION HOST
• 80 ROLES
• 56 PLAYS (including facts fetching)
• 832 TASKS

Slide 9

Slide 9 text

IT’S RUNNING!

Slide 10

Slide 10 text

WHERE TO START?!

Slide 11

Slide 11 text

TEST ENVIRONMENT (1)
• 9 VMS
• 1 DATACENTER
• 1 BASTION HOST
• 80 ROLES
• 56 PLAYS (including facts fetching)
• 832 TASKS
• ANSIBLE 2.6.14
• PYTHON 2.7.16

Slide 12

Slide 12 text

TEST ENVIRONMENT (2) Infrastructure Code Terraform Ansible Ansible Inventory Deployment Code Application Deployment Behave E2E Testing Code

Slide 13

Slide 13 text

ANSIBLE EXECUTION TIME (1)
• WE HAVE A MIXED USAGE PATTERN FOR OUR ANSIBLE CODE:
  • FOR SOME USE-CASES WE USE EPHEMERAL ENVIRONMENTS, SO ANSIBLE IS ALWAYS USED TO INSTALL FROM SCRATCH
  • FOR OTHER USE-CASES WE CONSTANTLY RUN ANSIBLE AGAINST THE SAME SET OF HOSTS TO CONTINUOUSLY UPGRADE THE APPLICATION DEPLOYMENT, SO ANSIBLE IS ALWAYS USED TO UPDATE THE SOFTWARE AND ITS CONFIGURATION
• NONE OF OUR CUSTOMERS MAKES USE OF IMMUTABLE INFRASTRUCTURE
• GIVEN THE BUSINESS IMPORTANCE OF THE SERVICES WE DEPLOY AND THE SIZE OF THE DEPLOYMENTS, ANSIBLE EXECUTION TIME IS VERY IMPORTANT TO US. A SHORT EXECUTION TIME ALLOWS US TO:
  • QUICKLY AUDIT THE CONFIGURATION / STATUS OF EACH INSTANCE
  • INCREASE CONFIDENCE WHEN ROLLING OUT CHANGES ACROSS THE INFRASTRUCTURE

Slide 14

Slide 14 text

ANSIBLE EXECUTION TIME (2)

Slide 15

Slide 15 text

ANSIBLE EXECUTION TIME (3)
THE TIMING INFORMATION REPORTED IN THE NEXT SET OF SLIDES HAS BEEN MEASURED UNDER THE FOLLOWING CONDITIONS:
• AGAINST AN ALREADY CONVERGED INFRASTRUCTURE (NO CHANGES / NO ERRORS)
• AGAINST OUR TESTING ENVIRONMENT (AS DESCRIBED BEFORE) RUNNING ON OPENSTACK
• RUNNING ANSIBLE THROUGH A VPN CONNECTION TO BE ABLE TO REACH THE TARGET HOSTS

Slide 16

Slide 16 text

ANSIBLE EXECUTION TIME (4)

Slide 17

Slide 17 text

SOME EASY IMPROVEMENTS

Slide 18

Slide 18 text

TAGS AND LIMITS (1)
• WE USE TAGS TO SELECT THE PLAYS TO BE EXECUTED
• WE USE LIMITS TO IMPLEMENT ROLLING CONFIGURATION AND SOFTWARE UPDATES
• WE RUN THE FULL ANSIBLE PLAYBOOK IN “CHECK MODE” TO CONTINUOUSLY AUDIT THE STATUS OF CONVERGENCE OF THE PLATFORM AND SPOT MISCONFIGURATIONS / HOSTS NOT YET ROLLED OUT
• WE COLLECT ALL THE ANSIBLE EXECUTION RESULTS WITH OPENSTACK ARA
• NO, WE DON’T USE TOWER (YET?!) ;-)

Slide 19

Slide 19 text

TAGS AND LIMITS (2)

Slide 20

Slide 20 text

AVOID UNNECESSARY LOOPS

Slide 21

Slide 21 text

FACTS GATHERING AND CACHING (1)
• THE DEFAULT ANSIBLE FACTS GATHERING STRATEGY, “IMPLICIT”, FETCHES FACTS AT EVERY PLAY EXECUTION
• IN OUR CASE, THAT MEANS THAT DURING A FULL RUN OF OUR PLAYBOOK ANSIBLE WILL FETCH FACTS FROM THE SAME HOSTS 50 TIMES!!!!
• IMHO, “SMART” GATHERING – FETCHING FACTS ONLY FOR HOSTS NOT YET DISCOVERED – SHOULD BE THE DEFAULT!
• FACTS CACHING IS A MUST-HAVE WHEN ABUSING LIMITS TO RESTRICT THE EXECUTION OF AN ANSIBLE PLAYBOOK AND USING FACTS TO CROSS-REFERENCE HOSTVARS IN TEMPLATES
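
A minimal ansible.cfg sketch of the smart gathering and facts caching described above. The choice of the jsonfile cache plugin, the cache directory and the timeout are illustrative assumptions, not the settings used in this talk.

```ini
[defaults]
# Gather facts only for hosts that have none yet in this run or in the cache
gathering = smart
# Persist facts between runs; jsonfile is one of the built-in cache plugins (assumed choice)
fact_caching = jsonfile
# Hypothetical cache directory on the Ansible control host
fact_caching_connection = /tmp/ansible_fact_cache
# Cache lifetime in seconds (assumed value: 24 hours)
fact_caching_timeout = 86400
```

With a cache in place, a run restricted with a limit can still read cached facts of hosts outside the limit through hostvars, provided those hosts were gathered before the cache expired.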

Slide 22

Slide 22 text

FACTS GATHERING AND CACHING (2)

Slide 23

Slide 23 text

FACTS GATHERING AND CACHING (3)

Slide 24

Slide 24 text

FACTS GATHERING AND CACHING (4)

Slide 25

Slide 25 text

TUNING SSH CONNECTIONS

Slide 26

Slide 26 text

ANSIBLE EXECUTION MODEL inventory playbook Playbook Executor TaskQueue Manager Strategy PlayIterator + Task Executor host fork ssh Worker Process

Slide 27

Slide 27 text

CONNECTIONS MULTIPLEXING
• ANSIBLE SHIPS WITH A SANE CONNECTION MULTIPLEXING CONFIGURATION
  ssh_args = -C -o ControlMaster=auto -o ControlPersist=60s
• HOWEVER, DEPENDING ON THE PLAYBOOK EXECUTION DURATION AND THE NUMBER OF HOSTS, THIS CAN BE IMPROVED TO MITIGATE THE IMPACT OF THE SSH CONNECTION OVERHEAD
  ssh_args = -C -o ControlMaster=auto -o ControlPersist=900s
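
For context, a sketch of where the longer ControlPersist value would live in ansible.cfg. The 900s value is the one from the slide; whether it suits a given environment depends on how long the playbook runs.

```ini
[ssh_connection]
# Keep each SSH control master open for 15 minutes instead of the default 60 seconds,
# so later tasks and plays reuse the existing connection
ssh_args = -C -o ControlMaster=auto -o ControlPersist=900s
```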

Slide 28

Slide 28 text

FORKS
• FORKS CONTROLS THE DEFAULT NUMBER OF PROCESSES TO SPAWN WHEN COMMUNICATING WITH REMOTE HOSTS
• WITH A LARGE NUMBER OF HOSTS, HIGHER VALUES WILL MAKE ACTIONS ACROSS ALL OF THE HOSTS COMPLETE FASTER
• SETTING A LARGE FORKS VALUE WILL RESULT IN HIGHER NETWORK AND CPU LOAD ON YOUR ANSIBLE HOST
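
A sketch of raising the forks setting in ansible.cfg. The value 50 is an assumption for illustration only; the right number depends on the CPU and network capacity of the Ansible host.

```ini
[defaults]
# Ansible's default is 5 parallel worker processes; 50 is an assumed example value
forks = 50
```

The same setting can be overridden for a single run with the `--forks` option of `ansible-playbook`.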

Slide 29

Slide 29 text

PIPELINING (1)
• BY DEFAULT, EACH ANSIBLE MODULE EXECUTION RESULTS IN AT LEAST THE FOLLOWING ACTIONS:
  • (1) GENERATION OF THE COMPILED MODULE PYTHON FILE WITH ITS PARAMETERS FOR REMOTE EXECUTION (LOCAL)
  • (2) CONNECTION TO THE TARGET MACHINE TO DETECT THE USER HOME DIRECTORY (SSH)
  • (3) CREATION OF THE TEMP DIRECTORIES ON THE TARGET MACHINE (SSH)
  • (4) TRANSFER OF THE MODULE SOURCE CODE AND BINARY ARGUMENTS (SFTP)
  • (5) EXECUTION OF THE CODE AND CLEANUP OF THE TEMPORARY DIRECTORIES (SSH)
  • (6) RETRIEVAL OF THE MODULE RESULTS FROM THE SSH CONNECTION STANDARD OUTPUT (LOCAL)

Slide 30

Slide 30 text

PIPELINING (2)
• PIPELINING REDUCES THE NUMBER OF SSH OPERATIONS, CUTTING THE STEPS ANSIBLE EXECUTES TO RUN A MODULE DOWN TO:
  • (1) GENERATION OF THE COMPILED MODULE PYTHON FILE WITH ITS PARAMETERS FOR REMOTE EXECUTION (LOCAL)
  • (2) CONNECTION VIA SSH TO EXECUTE THE PYTHON INTERPRETER (SSH)
  • (3) SENDING OF THE PYTHON FILE CONTENT TO THE INTERPRETER'S STANDARD INPUT (LOCAL)
  • (4) RETRIEVAL OF THE MODULE'S RESULT FROM STANDARD OUTPUT (LOCAL)
• WHEN USING “SUDO” OPERATIONS YOU MUST FIRST DISABLE “REQUIRETTY” ON ALL THE MANAGED HOSTS
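
Pipelining is off by default. A minimal sketch of enabling it in ansible.cfg:

```ini
[ssh_connection]
# Send the module to the remote Python interpreter over the SSH session
# instead of copying it to a temporary directory first
pipelining = True
```

On hosts where sudoers still enforces requiretty, privileged tasks are likely to fail with this setting until that option is disabled (for example with `Defaults !requiretty` in the sudoers configuration).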

Slide 31

Slide 31 text

TUNING SSH CONNECTIONS (1)

Slide 32

Slide 32 text

TUNING SSH CONNECTIONS (2)

Slide 33

Slide 33 text

CHANGING STRATEGY

Slide 34

Slide 34 text

LINEAR VS FREE STRATEGY (1)
• ANSIBLE SHIPS WITH TWO DIFFERENT TASK EXECUTION STRATEGIES: LINEAR AND FREE
• THE DEFAULT TASK EXECUTION STRATEGY USED BY ANSIBLE IS LINEAR
• WITH THE LINEAR STRATEGY, ALL THE HOSTS WILL RUN EACH TASK BEFORE ANY HOST STARTS THE NEXT TASK, USING THE NUMBER OF FORKS TO PARALLELIZE

Slide 35

Slide 35 text

LINEAR VS FREE STRATEGY (2)
• THE FREE STRATEGY ALLOWS EACH HOST TO RUN UNTIL THE END OF THE PLAY AS FAST AS IT CAN
• THE FREE STRATEGY IS GENERALLY VERY EFFECTIVE IN PLAYS WITHOUT INTER-NODE DEPENDENCIES
• NOTE THAT STRATEGIES ARE TIED TO PLAYS. YOU CAN MIX STRATEGIES IN YOUR PLAYBOOK
• IN OUR USE CASE, GIVEN THE NUMBER OF PLAYS, USING THE FREE STRATEGY DIDN’T RESULT IN ANY RELEVANT SPEED-UP OF THE FULL PLAYBOOK EXECUTION
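
A sketch of making free the default strategy through ansible.cfg. This is only one way to select it; since strategies are tied to plays, a single play can instead set the `strategy` keyword itself and leave the global default untouched.

```ini
[defaults]
# Applies to every play that does not set its own "strategy" keyword
strategy = free
```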

Slide 36

Slide 36 text

MITOGEN (1)
• MITOGEN FOR ANSIBLE IS A COMPLETELY REDESIGNED UNIX CONNECTION LAYER AND MODULE RUNTIME FOR ANSIBLE
• CODE IS EPHEMERALLY CACHED IN RAM, REDUCING BANDWIDTH USAGE BY AN ORDER OF MAGNITUDE COMPARED TO SSH PIPELINING
• SINCE SIGNIFICANT STATE IS MAINTAINED IN RAM BETWEEN STEPS, NO REPEATED AUTHENTICATION EVENTS OCCUR ON THE HOSTS
• IN-RAM CACHING RESULTS IN FEWER WRITES TO THE TARGET FILESYSTEM. IN TYPICAL CONFIGURATIONS, ANSIBLE REPEATEDLY REWRITES AND EXTRACTS ZIP FILES TO MULTIPLE TEMPORARY DIRECTORIES ON THE TARGET.
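
Mitogen is enabled as a strategy plugin. A sketch following the Mitogen for Ansible documentation; the plugin path below is a placeholder and must point at wherever the Mitogen release was unpacked.

```ini
[defaults]
# Hypothetical path: the ansible_mitogen/plugins/strategy directory of the unpacked Mitogen release
strategy_plugins = /path/to/mitogen/ansible_mitogen/plugins/strategy
# Mitogen-backed equivalent of the default linear strategy
strategy = mitogen_linear
```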

Slide 37

Slide 37 text

MITOGEN (2)

Slide 38

Slide 38 text

MITOGEN (3)

Slide 39

Slide 39 text

DON’T WAIT FOR ANSIBLE TO COMPLETE

Slide 40

Slide 40 text

PUT YOUR ANSIBLE ON STEROIDS

Slide 41

Slide 41 text

USEFUL LINKS
• Speeding up Ansible Playbook runs - https://chrisbergeron.com/2018/06/08/ansible_performance_tuning/
• Ansible – Faster Then Light - https://github.com/Fobhep/cfmgmt2019/blob/master/cfgmgt2019.pdf
• List of Ansible settings - https://docs.ansible.com/ansible/latest/reference_appendices/config.html
• Mitogen for Ansible - https://mitogen.readthedocs.io/en/latest/ansible.html#
• Test Driven Infrastructure Development - https://speakerdeck.com/atosatto/test-driven-infrastructure-development

Slide 42

Slide 42 text

THANK YOU
PS: WE’RE HIRING - https://www.open-xchange.com/jobs/