Slide 1

Slide 1 text

5/29/15 @dougbarth IPsec mesh network: perfect for the cloud? VELOCITY SANTA CLARA 2015

Slide 2

Slide 2 text

5/29/15 MAKING PAGERDUTY MORE RELIABLE USING PXC

Slide 3

Slide 3 text

5/29/15 IPSEC MESH NETWORK: PERFECT FOR THE CLOUD?

Slide 4

Slide 4 text

Traditional setup
PagerDuty’s setup
Rollout
Tales from production
Should you do this?

Slide 5

Slide 5 text

Traditional setup

Slide 6

Slide 6 text

[Diagram labels: PUBLIC, PRIVATE, DMZ]

Slide 7

Slide 7 text

TLS landmines Finagle’s TLS + IPsec IPsec only TLS penalty in MySQL

Slide 8

Slide 8 text


Slide 9

Slide 9 text

[Diagram labels: VPN, VPN, TUNNEL]

Slide 10

Slide 10 text

[Diagram labels: VPC, NO VPC]

Slide 11

Slide 11 text

PagerDuty’s setup

Slide 12

Slide 12 text

Hosting setup
Cloud providers only
Multi-datacenter deployment
Several different technologies
  Percona XtraDB Cluster
  Zookeeper
  Cassandra
  Ruby/Rails
  Scala/Finagle
  nginx
[Diagram labels: US-WEST-2, US-WEST-1, LINODE]

Slide 13

Slide 13 text

PagerDuty’s networking goals
Encrypted by default
Failures at the instance level (not the DC level)
Throughput scales as our infrastructure grows

Slide 14

Slide 14 text

Mesh network
Every box handles its own encryption
Policy enforcement is distributed
Configuration management (CM) makes this manageable

Slide 15

Slide 15 text

Ubuntu 10.04 & 14.04
ipsec-tools
racoon

Slide 16

Slide 16 text

Phase 1
# This configures how Phase 1 key exchange occurs. We keep the remote end
# anonymous so we don't have to bounce racoon (and therefore lose our SAs) when
# new boxes are added.
remote anonymous {
  …

Slide 17

Slide 17 text

Phase 1
exchange_mode main;
[Diagram labels: POLICY, DHE, AUTH, INITIATOR, RESPONDER]

Slide 18

Slide 18 text

Phase 1
proposal {
  authentication_method pre_shared_key;
  dh_group modp3072;
  encryption_algorithm aes;
  hash_algorithm sha256;
}

Slide 19

Slide 19 text

Phase 1
# Per connection PSK
# box01
10.0.0.1 0xdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef
50.0.0.1 0xdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef
# box02
10.0.0.2 0x0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
50.0.0.2 0x0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
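
The slide does not say how the per-connection keys are produced or distributed. As a minimal sketch, assuming one random 32-byte key per unordered pair of boxes and a key file rendered separately for each host, the layout above could be generated like this. The function names and the `addresses` mapping are hypothetical and are not part of ipsec-tools.

```python
import secrets
from itertools import combinations


def generate_pair_keys(names):
    """Return one random 32-byte key (0x-prefixed hex) per unordered pair of boxes."""
    return {
        frozenset(pair): "0x" + secrets.token_hex(32)
        for pair in combinations(sorted(names), 2)
    }


def psk_lines_for(local, addresses, pair_keys):
    """Render the PSK lines that `local` needs in order to talk to every other box.

    `addresses` maps a box name to its list of addresses (LAN and WAN).
    Every address of a peer gets the key shared between `local` and that peer.
    """
    lines = ["# Per connection PSK"]
    for peer in sorted(addresses):
        if peer == local:
            continue
        key = pair_keys[frozenset((local, peer))]
        lines.append(f"# {peer}")
        lines.extend(f"{addr} {key}" for addr in addresses[peer])
    return lines
```

A mesh of n boxes needs n(n-1)/2 keys under this assumption, which is one reason the key files are better generated than hand-edited.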

Slide 20

Slide 20 text

Phase 1
# EC2 doesn’t route ESP
nat_traversal force;
[Diagram labels: UDP, ESP]

Slide 21

Slide 21 text

Phase 1
lifetime time 24 hours;

Slide 22

Slide 22 text

Phase 1
dpd_delay 20;
[Diagram labels: INITIATOR, RESPONDER, R-U-THERE, R-U-THERE-ACK]

Slide 23

Slide 23 text

Phase 2
# This configures the SA parameters. Again, anonymous so we don't need to
# bounce racoon when new boxes are added.
sainfo anonymous {
  pfs_group modp3072;
  encryption_algorithm aes;
  authentication_algorithm hmac_sha256;
  compression_algorithm deflate;
  lifetime time 8 hours;
}

Slide 24

Slide 24 text

SPD: WAN & LAN encryption
# WAN
spdadd 50.0.0.1 10.0.0.2 any -P out ipsec esp/transport//require;
spdadd 10.0.0.2 50.0.0.1 any -P in ipsec esp/transport//require;
spdadd 10.0.0.1 50.0.0.2 any -P out ipsec esp/transport//require;
spdadd 50.0.0.2 10.0.0.1 any -P in ipsec esp/transport//require;
# LAN
spdadd 10.0.0.1 10.0.0.2 any -P out ipsec esp/transport//require;
spdadd 10.0.0.2 10.0.0.1 any -P in ipsec esp/transport//require;
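
With one block like this per peer, the policy file grows with the size of the mesh, so it is a natural thing to generate. The sketch below reproduces the six entries on the slide for a local box and one peer. The `Box` type and the function names are hypothetical, and the WAN/LAN pairing is simply copied from the slide (local WAN to peer LAN, local LAN to peer WAN, local LAN to peer LAN).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    name: str
    lan: str  # private address, e.g. 10.0.0.1
    wan: str  # public address, e.g. 50.0.0.1


def policy_pair(local_ip, remote_ip, level="require"):
    """Return the out/in spdadd lines for one local/remote address pair."""
    rule = f"ipsec esp/transport//{level}"
    return [
        f"spdadd {local_ip} {remote_ip} any -P out {rule};",
        f"spdadd {remote_ip} {local_ip} any -P in {rule};",
    ]


def policies_for(local, peer, level="require"):
    """Return the setkey lines `local` needs for `peer`, in the slide's order."""
    lines = ["# WAN"]
    lines += policy_pair(local.wan, peer.lan, level)
    lines += policy_pair(local.lan, peer.wan, level)
    lines.append("# LAN")
    lines += policy_pair(local.lan, peer.lan, level)
    return lines


if __name__ == "__main__":
    box01 = Box("box01", lan="10.0.0.1", wan="50.0.0.1")
    box02 = Box("box02", lan="10.0.0.2", wan="50.0.0.2")
    print("\n".join(policies_for(box01, box02)))
```

The `level` parameter is there so the same generator can emit `use` instead of `require` while a rollout is in progress.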

Slide 25

Slide 25 text

SPD: SSH excluded
spdadd 0.0.0.0/0 0.0.0.0/0[22] tcp -P out prio def + 100 none;
spdadd 0.0.0.0/0 0.0.0.0/0[22] tcp -P in prio def + 100 none;
spdadd 0.0.0.0/0[22] 0.0.0.0/0 tcp -P out prio def + 100 none;
spdadd 0.0.0.0/0[22] 0.0.0.0/0 tcp -P in prio def + 100 none;

Slide 26

Slide 26 text

SPD: ICMP excluded too
# Exclude ICMP from IPsec.
#
# Having ICMP encrypted makes it difficult for us to investigate networking
# issues. traceroute between machines doesn't work because traceroute doesn't
# realize that the TTL expired packet referencing the UDP-encap packet is meant
# for it. A similar issue exists for mtr.
spdadd 0.0.0.0/0 0.0.0.0/0 icmp -P out prio def + 100 none;
spdadd 0.0.0.0/0 0.0.0.0/0 icmp -P in prio def + 100 none;

Slide 27

Slide 27 text

pd-sync-policies
Found 2589 existing policies, 1718 existing point-to-point policies
Loading policies from ["/etc/ipsec-tools.conf", "/etc/ipsec-tools.d/*.conf"]
Found 1718 policies in the config file
Found 0 changed policies
Found 0 new policies
Found 0 old policies
setkey returned successfully
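
The slide only shows the output of pd-sync-policies, not how it works. The "changed / new / old" counts suggest a diff between the policies loaded in the kernel and the ones in the config files. A minimal sketch of that kind of diff follows, assuming each policy is keyed by (source, destination, direction) and carries its rule as the value. The function name and data shape are hypothetical.

```python
def diff_policies(existing, desired):
    """Compare loaded policies with configured ones.

    Both arguments map a (src, dst, direction) tuple to a rule string.
    Returns three dicts: policies to add, policies to delete, and
    policies whose rule differs (with the desired rule as the value).
    """
    new = {k: v for k, v in desired.items() if k not in existing}
    old = {k: v for k, v in existing.items() if k not in desired}
    changed = {
        k: desired[k]
        for k in desired.keys() & existing.keys()
        if desired[k] != existing[k]
    }
    return new, old, changed


if __name__ == "__main__":
    existing = {("10.0.0.1", "10.0.0.2", "out"): "esp/transport//use"}
    desired = {("10.0.0.1", "10.0.0.2", "out"): "esp/transport//require"}
    new, old, changed = diff_policies(existing, desired)
    print(f"Found {len(changed)} changed policies")
    print(f"Found {len(new)} new policies")
    print(f"Found {len(old)} old policies")
```

Applying only the difference, rather than flushing and reloading the whole SPD, avoids a window in which no policies are loaded.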

Slide 28

Slide 28 text

Rollout

Slide 29

Slide 29 text

NONE NONE NONE

Slide 30

Slide 30 text

USE NONE NONE

Slide 31

Slide 31 text

USE USE USE

Slide 32

Slide 32 text

REQUIRE USE USE

Slide 33

Slide 33 text

REQUIRE REQUIRE REQUIRE
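
The five rollout slides step three boxes from no policy, through `use`, to `require`. The ordering rule sketched below is an assumption drawn from the setkey level semantics (`use` applies an SA when one is available and otherwise carries on in cleartext, `require` insists on an SA). The slides do not state it explicitly, and `can_promote` is a hypothetical helper.

```python
STAGES = ("none", "use", "require")


def can_promote(host, stages):
    """Return True if `host` can move to its next stage without dropping traffic.

    `stages` maps each box name to "none", "use" or "require".
    Assumed rule: moving to "use" is always safe because cleartext is still
    accepted; moving to "require" is safe only once every peer is at least
    at "use", since a peer with no policy would keep sending cleartext.
    """
    current = stages[host]
    if current == "require":
        return False  # already at the final stage
    next_stage = STAGES[STAGES.index(current) + 1]
    if next_stage == "use":
        return True
    return all(stage != "none" for name, stage in stages.items() if name != host)
```

Under this rule the sequence on the slides is valid at each step: every box reaches `use` before the first box is switched to `require`.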

Slide 34

Slide 34 text

Scraping lots of metrics
$ sudo setkey -DP                    # SPD entries
$ sudo racoonctl -l show-sa isakmp   # Phase 1 relationships
$ sudo setkey -D                     # Phase 2 relationships
$ sudo ip xfrm state                 # Phase 2 relationships (newer format)

Slide 35

Slide 35 text

auditd has useful events
type=MAC_IPSEC_EVENT msg=audit(1432651251.889:2222847): op=SA-replayed-pkt …
type=MAC_IPSEC_EVENT msg=audit(1432651251.889:2222848): op=SA-notfound …
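
One way to turn these audit records into a metric is to count them per `op` value. The sketch below matches only the fields visible on the slide (type, timestamp, serial, op) and ignores whatever follows the elided part. The function name is hypothetical, and reading the audit log itself is left to the caller.

```python
import re
from collections import Counter

# Matches only the prefix shown on the slide; trailing fields are ignored.
EVENT_RE = re.compile(
    r"type=MAC_IPSEC_EVENT msg=audit\((?P<ts>\d+\.\d+):(?P<serial>\d+)\): "
    r"op=(?P<op>\S+)"
)


def count_ipsec_ops(lines):
    """Count MAC_IPSEC_EVENT audit records by their op field."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[match.group("op")] += 1
    return counts
```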

Slide 36

Slide 36 text

CONFIG_XFRM_STATISTICS
$ cat /proc/net/xfrm_stat
XfrmInError 0
XfrmInBufferError 0
XfrmInHdrError 0
XfrmInNoStates 1238
XfrmInStateProtoError 8
XfrmInStateModeError 0
XfrmInStateSeqError 500
XfrmInStateExpired 0
XfrmInStateMismatch 0
XfrmInStateInvalid 0
XfrmInTmplMismatch 0
XfrmInNoPols 0
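
The file is a plain list of name/value pairs, so scraping it takes only a few lines. A minimal sketch follows, assuming the text has already been read from `/proc/net/xfrm_stat`; the function names are hypothetical.

```python
def parse_xfrm_stat(text):
    """Parse the contents of /proc/net/xfrm_stat into a {name: count} dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Each valid line is a counter name followed by a non-negative integer.
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def nonzero(counters):
    """Keep only the counters that have fired at least once."""
    return {name: value for name, value in counters.items() if value}
```

For the values on the slide, `nonzero` would leave `XfrmInNoStates`, `XfrmInStateProtoError` and `XfrmInStateSeqError`, the three counters that are not zero.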

Slide 37

Slide 37 text

Tales from production

Slide 38

Slide 38 text

Lessons learned
AES works well, even without hardware acceleration
Network probing is more difficult
Path MTU breaks after route timeout. Still investigating.

Slide 39

Slide 39 text

Ordering issues in DPD
DPD is on phase 1 relationship
No liveness check on phase 2 relationships
SAs can get out of sync
Requires manually clearing the relationships

Slide 40

Slide 40 text

Linux 3.7 - 3.13
xfrm4_gc_thresh
2.6 used a large dynamic value
3.7 switched to static value of 1024
3.13 bumped to 32K after performance issues
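
On an affected kernel the setting can be checked before it causes trouble. The sketch below assumes the value is read from the sysctl file `/proc/sys/net/ipv4/xfrm4_gc_thresh` and treats the 32K figure from the slide (32768) as the minimum. The helper name and the choice of threshold are assumptions, not a recommendation from the talk.

```python
GC_THRESH_PATH = "/proc/sys/net/ipv4/xfrm4_gc_thresh"


def gc_thresh_too_low(raw_value, minimum=32768):
    """Return True if the xfrm4_gc_thresh value read from sysctl is below `minimum`.

    `raw_value` is the text content of the sysctl file, e.g. "1024\n".
    """
    return int(raw_value.strip()) < minimum
```

A monitoring check could read `GC_THRESH_PATH` and alert when this returns True on a 3.7 to 3.12 kernel.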

Slide 41

Slide 41 text

Linux 3.9
Client kernel panics on server disconnect
http://sourceforge.net/p/ipsec-tools/bugs/86/

Slide 42

Slide 42 text

“Corruption during AES encryption in Xen v4.1 or v3.4 paravirtual guests running a Linux 3.0+ kernel, combined with the lack of TCP checksum validation in IPSec Transport mode, which leads to the admission of corrupted TCP data on a ZooKeeper node, resulting in an unhandled exception from which ZooKeeper is unable to recover. Jeez. Talk about a needle in a haystack…”

Slide 43

Slide 43 text

Should you do this?

Slide 44

Slide 44 text

Need an agent
Automatically manage policies
Handle metrics collection and emission
Implement phase 2 liveness checks
GC old relationships

Slide 45

Slide 45 text

[email protected] PAGERDUTY.COM/JOBS

Slide 46

Slide 46 text

pagerduty.com Thank you!