Improving your services, the DevOps way

Slide 1

Slide 1 text

Marco Marongiu Telenor Digital AS Improving your services, the DevOps way DevOps techniques for a non-DevOps shop

Slide 2

Slide 2 text

Who I am ● IT Manager & Head of IT ● Started Nov 2016 ● Senior System Administrator ● Feb 2010 – Oct 2016 ● Lots of Config Management! Special thanks to Michael Link, CIO at Opera Software, for allowing me to hold this presentation.

Slide 3

Slide 3 text

Agenda ● The old email infrastructure ● The problem(s) ● The new email infrastructure... ● ...and how we got there

Slide 4

Slide 4 text

Internal systems Inbound MX Mail routing layer

Slide 5

Slide 5 text

LEGACY

Slide 6

Slide 6 text

S.P.O.F.

Slide 7

Slide 7 text

SNOWFLAKES

Slide 8

Slide 8 text

INCONSISTENT CONFIGURATION

Slide 9

Slide 9 text

UNEXPECTED INTERACTIONS

Slide 10

Slide 10 text

UNEXPECTED BEHAVIOURS

Slide 11

Slide 11 text

Considerations ● The architecture makes sense, but the implementation s***s ● We needed reliability: remove SPOFs by having more than one machine per role, in different locations, and adding more machines must be easy; ● We needed resilience: each machine should be easy to restore in case of a failure; ● We needed consistency: the configuration should match the role of each machine and always be up to date

Slide 12

Slide 12 text

We needed CONFIGURATION MANAGEMENT

Slide 13

Slide 13 text

Policy hubs Inbound MX Mail routing layer Internal systems

Slide 14

Slide 14 text

...but how does one reproduce SNOWFLAKES?

Slide 15

Slide 15 text

YOU DON'T!!! (silly!)

Slide 16

Slide 16 text

Reproduce what you want, not what you have ● We are not interested in reproducing the existing machines as much as we are interested in reproducing their behaviour! – List the behaviours for each role; – Write tests for those behaviours – Ensure that the existing machines pass all tests, or check your tests and fix them ● Build the configurations for the new machines and use the tests to validate them.
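As an illustration of the "list behaviours, write tests, run them against old and new machines" loop, here is a minimal Python sketch. It is not the test suite used in the talk: the role names, the `Behaviour` class, `answers_smtp` and `run_suite` are all hypothetical names chosen for this example, and the only behaviour checked is that a host answers an SMTP NOOP.

```python
import smtplib
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Behaviour:
    role: str                     # hypothetical role name, e.g. "inbound_mx"
    description: str              # human-readable statement of the behaviour
    check: Callable[[str], bool]  # given a host, True if the behaviour holds


def answers_smtp(host: str, port: int = 25, timeout: float = 10.0) -> bool:
    """Example check: the host accepts a connection and replies 250 to NOOP."""
    try:
        with smtplib.SMTP(host, port, timeout=timeout) as conn:
            code, _ = conn.noop()
            return code == 250
    except (OSError, smtplib.SMTPException):
        return False


def run_suite(
    behaviours: List[Behaviour],
    hosts_by_role: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Run every behaviour against every host of its role.

    Returns a list of (host, description) pairs for the checks that failed.
    """
    failures = []
    for behaviour in behaviours:
        for host in hosts_by_role.get(behaviour.role, []):
            if not behaviour.check(host):
                failures.append((host, behaviour.description))
    return failures
```

The same suite is run first against the existing machines (where a failure means either the machine or the test is wrong) and later against the newly built ones, so that both are held to the same list of behaviours.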

Slide 17

Slide 17 text

Use your (company's) full potential ● Don't go to war alone: if you have knowledgeable colleagues get help from them; ● If they can't help you to do the actual work, they can still help you in the design phase, or checking that the final result is sound; ● Remember to do DevOps – e.g.: include network and security specialists in the picture...

Slide 18

Slide 18 text

Start debt-free ● Use an up-to-date version of the Operating System: if the existing systems use an outdated OS, don't take the shortcut: start with a fresh version; – Yes, it still holds even if that means you have to switch from SysV init to systemd ● Reuse existing configurations where it makes sense; ● Manage the whole configuration from the very beginning; don't end up with a snowflake again.

Slide 19

Slide 19 text

Start with the high level, code the details ● Whatever the role of the machine, the high-level operations will be the same: – Ensure that some packages are installed – Ensure that some services are running – Ensure that configuration changes are detected and picked up – Ensure that configurations are reloaded when they change – Ensure that services are restarted when there is a significant configuration change where a reload isn't enough
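The last three operations (detect a change, reload, restart only when a reload isn't enough) boil down to a small decision that is the same for every role. The sketch below shows that decision in Python rather than in a configuration management language; the function names and the file names used in the usage note are assumptions made for the example, not taken from the talk.

```python
import hashlib
from typing import Dict, Iterable


def fingerprint(content: bytes) -> str:
    """Content hash used to detect that a configuration file has changed."""
    return hashlib.sha256(content).hexdigest()


def plan_action(
    old: Dict[str, str],
    new: Dict[str, str],
    restart_required: Iterable[str],
) -> str:
    """Decide what to do with a service after a configuration run.

    `old` and `new` map file paths to fingerprints before and after the run.
    `restart_required` lists the files whose change needs a full restart.
    Returns "none", "reload" or "restart".
    """
    changed = {
        path
        for path in set(old) | set(new)
        if old.get(path) != new.get(path)
    }
    if not changed:
        return "none"
    if changed & set(restart_required):
        return "restart"
    return "reload"
```

For instance, with a hypothetical `mta/main.conf` listed in `restart_required` and a hypothetical `mta/routing.map` that is not, a change to the map alone yields `"reload"`, while a change to the main file yields `"restart"`.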

Slide 20

Slide 20 text

Start with the high level, code the details ● Most of the high-level operations are so generic that they can be coded in a reusable way, like subroutines in a programming language – Depending on your CM tool they will be called bundles, classes... ● Even better, they may be available in libraries or frameworks that are ready to use – In our case (CFEngine), we used the NCF framework from Normation, the fine makers of the Rudder Project ● Creating generic building blocks or using existing frameworks can save you lots of time!

Slide 21

Slide 21 text

Be lazy on similarities, smart on differences ● E.g.: what differentiates an inbound MX from an SMTP router? – In the MX we would install more packages (antivirus, spam filter...); – The SMTP router has a milter, a custom service that must start at boot; – ... ● Where configurations are similar across roles but with some key differences (e.g. for the MTA), a template should definitely be used; ● Configurations for “unique” services can be distributed as plain files, unless they depend on information local to the machine, in which case a template is necessary;
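To make the template idea concrete, here is a small Python sketch using the standard library's `string.Template`. The configuration syntax, the parameter names and the per-role values are invented for illustration and do not correspond to any real MTA or to the configuration used in the talk; the point is only that one template serves several roles, with the differences and the machine-local values (here the hostname) passed in as data.

```python
from string import Template

# Hypothetical MTA configuration: one template shared by all roles.
MTA_TEMPLATE = Template(
    "# Managed by configuration management; do not edit by hand\n"
    "hostname = $hostname\n"
    "role = $role\n"
    "content_filter = $content_filter\n"
)

# Per-role differences expressed as data (values are made up).
ROLES = {
    "inbound_mx": {"content_filter": "spamfilter"},
    "smtp_router": {"content_filter": "none"},
}


def render(role: str, hostname: str) -> str:
    """Render the MTA configuration for one machine of the given role.

    Raises KeyError if the role is unknown or a template variable is missing,
    so an incomplete role definition is caught before anything is deployed.
    """
    params = dict(ROLES[role], role=role, hostname=hostname)
    return MTA_TEMPLATE.substitute(params)
```

Calling `render("inbound_mx", "mx1.example.com")` and `render("smtp_router", "router1.example.com")` produces two files that differ only in the lines driven by the role and the hostname.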

Slide 22

Slide 22 text

Small plan, big wins ● The most complex piece to put together was the inbound MX, which required a fair amount of work to get (almost) right; ● The outbound MX and the SMTP router were similar enough that they could use the same “driver” – all in all, they were simple SMTP servers with some configuration differences – the same CFEngine bundle could configure both of them ● The investment in time due to the adoption of a new framework and writing tests ahead paid us back quickly!

Slide 23

Slide 23 text

Testing with production traffic

Slide 24

Slide 24 text

Testing with production traffic ● iptables has a feature called “weighted connections” where it will act on a user-defined percentage of connections, randomly chosen; ● We used that in the PREROUTING chain on the existing inbound MX to DNAT a few incoming connections on port 25, forwarding them to the new MX; ● On the new MX, connections through port 25 were marked by iptables, and return packets were routed back via the old MX using a dedicated routing table.
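The slide does not show the actual rules, so the following is only a sketch of what such a setup could look like. It assumes the random selection is done with the iptables `statistic` match (`--mode random --probability`), that replies on the new MX are marked in the mangle table's OUTPUT chain, and that the old MX is directly reachable as a next hop; the addresses, mark value and routing table number are placeholders. The Python code only builds the command lines so they can be reviewed before anyone runs them as root.

```python
from typing import List


def dnat_sample_rule(probability: float, new_mx: str, port: int = 25) -> List[str]:
    """Command for the old MX: DNAT a random share of SMTP connections."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return [
        "iptables", "-t", "nat", "-A", "PREROUTING",
        "-p", "tcp", "--dport", str(port),
        "-m", "statistic", "--mode", "random",
        "--probability", f"{probability:.4f}",
        "-j", "DNAT", "--to-destination", f"{new_mx}:{port}",
    ]


def return_path_rules(
    old_mx: str, mark: int = 1, table: int = 100, port: int = 25
) -> List[List[str]]:
    """Commands for the new MX: mark SMTP replies and route them via the old MX."""
    return [
        # Mark locally generated packets leaving from the SMTP port.
        ["iptables", "-t", "mangle", "-A", "OUTPUT",
         "-p", "tcp", "--sport", str(port),
         "-j", "MARK", "--set-mark", str(mark)],
        # Marked packets are looked up in a dedicated routing table...
        ["ip", "rule", "add", "fwmark", str(mark), "table", str(table)],
        # ...whose default route points back at the old MX.
        ["ip", "route", "add", "default", "via", old_mx, "table", str(table)],
    ]


def as_shell(commands: List[List[str]]) -> str:
    """Join the commands into a reviewable, one-command-per-line listing."""
    return "\n".join(" ".join(cmd) for cmd in commands)
```

For example, `as_shell([dnat_sample_rule(0.05, "192.0.2.10")])` gives the single rule that would send roughly 5% of new connections to a new MX at the documentation address 192.0.2.10. Since NAT decisions are made per connection, the probability applies to connections rather than to individual packets.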

Slide 25

Slide 25 text

The last mile: distributing configurations

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

Small plan, big wins – again ● Push routing maps to inbound MX, SMTP router; unmanaged configs ● Push routing maps to distribution points; all servers pull full configs

Slide 28

Slide 28 text

What you have is what you tested...

Slide 29

Slide 29 text

...but fixes are a snap away now! ● Unless you made some huge blunder, you now have the means to easily deploy configuration updates across the whole infrastructure and have fixes applied consistently within minutes – ...which is what happened to our mail infrastructure during the first couple of months

Slide 30

Slide 30 text

Emergency reconfigurations are not a problem

Slide 31

Slide 31 text

Conclusions ● We started with an architecture with good foundations but full of SPOFs ● We grew it into a resilient, distributed, scalable architecture ● We did it by using techniques from test-driven development, agile, DevOps, collaboration in general ● We didn't do a perfect job, but we got the tools in place to improve it along the way ● And you can do that, too!

Slide 32

Slide 32 text

Thank you! Marco Marongiu Email: [email protected] Twitter: @brontolinux Web: http://syslog.me/