IPsec mesh network: Perfect for the cloud?

Velocity 2015 session: http://velocityconf.com/devops-web-performance-2015/public/schedule/detail/41454
Video: https://www.youtube.com/watch?v=320KfBkGNyY

Private networks are traditionally assumed secure, and traffic crossing the public internet is secured via VPN. For data center-to-data center traffic, that involves a site-to-site tunnel with a VPN concentrator on either side of the encrypted tunnel.

This hub-and-spoke architecture presents a simple solution for protecting network traffic over the internet, but making that solution highly available involves added complexity. Multiple tunnels need to exist, with traffic load balanced across them. Automation must exist to detect down (and half down) links and redirect traffic. Also, these tunnels should scale up as the amount of traffic passing between data centers increases. And this solution does nothing to protect network traffic on the private network—a private network that is increasingly managed by cloud providers and shared with other companies.

When traffic is flowing over networks that we don’t manage (both over the WAN and the LAN), it is time to rethink our network security practices. By using DevOps practices in our network systems, PagerDuty was able to get rid of the hub-and-spoke model and instead use an IPsec mesh architecture. Each server in our system establishes a security association with each of its peers and transmits all traffic using IPsec transport mode. Each host manages encryption and decryption of its own traffic, so our ability to protect that traffic naturally scales up as we add new infrastructure.
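
As a rough illustration of what a mesh means for a single host, the sketch below generates the security policy entries one box would need for every peer. It borrows the setkey `spdadd` syntax that appears in the talk's slides; the function name `mesh_policies` and the sample addresses are hypothetical, and this is not PagerDuty's actual tooling.

```python
from typing import Iterable, List


def mesh_policies(local_ip: str, peer_ips: Iterable[str]) -> List[str]:
    """Return setkey `spdadd` lines that require ESP transport mode
    between local_ip and every peer, in both directions."""
    lines = []
    for peer in sorted(set(peer_ips)):
        if peer == local_ip:
            continue  # a host needs no policy for traffic to itself
        lines.append(
            f"spdadd {local_ip} {peer} any -P out ipsec esp/transport//require;"
        )
        lines.append(
            f"spdadd {peer} {local_ip} any -P in ipsec esp/transport//require;"
        )
    return lines


if __name__ == "__main__":
    # Hypothetical three-host fleet, seen from 10.0.0.1.
    for line in mesh_policies("10.0.0.1", ["10.0.0.1", "10.0.0.2", "10.0.0.3"]):
        print(line)
```

Each host carries two entries per peer, so a fleet of n hosts means 2(n-1) entries on every box. That growth is why generating the policies automatically matters.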

This talk will focus on how we implemented that model on our Linux fleet. We will dig into the details of our configuration including the policies we use, and the encryption and authentication mechanisms in place. We will talk about how this model performs on our systems and the impact it has on the production workload. Finally, we will discuss how it handles failure, bugs we’ve found along the way, and how we see this model changing as our infrastructure continues to grow.

In the end, I hope to have given everyone a better understanding of how VPNs work, and how through combining the development and operations disciplines we can produce a solution that was previously considered impractical.

Doug Barth

May 28, 2015

Transcript

  1. 5/29/15 @dougbarth IPsec mesh network: perfect for the cloud? VELOCITY SANTA CLARA 2015
  2. 5/29/15 MAKING PAGERDUTY MORE RELIABLE USING PXC

  3. 5/29/15 IPSEC MESH NETWORK: PERFECT FOR THE CLOUD?

  4. Traditional setup; PagerDuty’s setup; Rollout; Tales from production; Should you do this?
  5. Traditional setup

  6. PUBLIC PRIVATE DMZ
  7. TLS landmines Finagle’s TLS + IPsec IPsec only TLS penalty in MySQL
  8. 5/29/15 IPSEC MESH NETWORK: PERFECT FOR THE CLOUD?

  9. VPN VPN TUNNEL
  10. VPC NO VPC
  11. PagerDuty’s setup

  12. Hosting setup: cloud providers only; multi-datacenter deployment; several different technologies (Percona XtraDB Cluster, Zookeeper, Cassandra, Ruby/Rails, Scala/Finagle, nginx). Diagram labels: US-WEST-2, US-WEST-1, LINODE
  13. PagerDuty’s networking goals: encrypted by default; failures at the instance level (not the DC level); throughput scales as our infrastructure grows
  14. Mesh network: every box handles its own encryption; policy enforcement distributed; CM makes this manageable
  15. Ubuntu 10.04 & 14.04; ipsec-tools; racoon
  16. Phase 1
      # This configures how Phase 1 key exchange occurs. We keep the remote end
      # anonymous so we don't have to bounce racoon (and therefore lose our SAs) when
      # new boxes are added.
      remote anonymous {
        …
  17. Phase 1
      exchange_mode main;
      Diagram labels: POLICY, DHE, AUTH, INITIATOR, RESPONDER
  18. Phase 1
      proposal {
        authentication_method pre_shared_key;
        dh_group modp3072;
        encryption_algorithm aes;
        hash_algorithm sha256;
      }
  19. Phase 1
      # Per connection PSK
      # box01
      10.0.0.1 0xdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef
      50.0.0.1 0xdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef
      # box02
      10.0.0.2 0x0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
      50.0.0.2 0x0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
  20. Phase 1
      # EC2 doesn’t route ESP
      nat_traversal force;
      Diagram labels: UDP, ESP
  21. Phase 1
      lifetime time 24 hours;
  22. Phase 1
      dpd_delay 20;
      Diagram labels: INITIATOR, RESPONDER, R-U-THERE, R-U-THERE-ACK
  23. Phase 2
      # This configures the SA parameters. Again, anonymous so we don't need to
      # bounce racoon when new boxes are added.
      sainfo anonymous {
        pfs_group modp3072;
        encryption_algorithm aes;
        authentication_algorithm hmac_sha256;
        compression_algorithm deflate;
        lifetime time 8 hours;
      }
  24. SPD: WAN & LAN encryption
      # WAN
      spdadd 50.0.0.1 10.0.0.2 any -P out ipsec esp/transport//require;
      spdadd 10.0.0.2 50.0.0.1 any -P in ipsec esp/transport//require;
      spdadd 10.0.0.1 50.0.0.2 any -P out ipsec esp/transport//require;
      spdadd 50.0.0.2 10.0.0.1 any -P in ipsec esp/transport//require;
      # LAN
      spdadd 10.0.0.1 10.0.0.2 any -P out ipsec esp/transport//require;
      spdadd 10.0.0.2 10.0.0.1 any -P in ipsec esp/transport//require;
  25. SPD: SSH excluded
      spdadd 0.0.0.0/0 0.0.0.0/0[22] tcp -P out prio def + 100 none;
      spdadd 0.0.0.0/0 0.0.0.0/0[22] tcp -P in prio def + 100 none;
      spdadd 0.0.0.0/0[22] 0.0.0.0/0 tcp -P out prio def + 100 none;
      spdadd 0.0.0.0/0[22] 0.0.0.0/0 tcp -P in prio def + 100 none;
  26. SPD: ICMP excluded too
      # Exclude ICMP from IPsec.
      #
      # Having ICMP encrypted makes it difficult for us to investigate networking
      # issues. traceroute between machines doesn't work because traceroute doesn't
      # realize that the TTL expired packet referencing the UDP-encap packet is meant
      # for it. A similar issue exists for mtr.
      spdadd 0.0.0.0/0 0.0.0.0/0 icmp -P out prio def + 100 none;
      spdadd 0.0.0.0/0 0.0.0.0/0 icmp -P in prio def + 100 none;
  27. pd-sync-policies
      Found 2589 existing policies, 1718 existing point-to-point policies
      Loading policies from ["/etc/ipsec-tools.conf", "/etc/ipsec-tools.d/*.conf"]
      Found 1718 policies in the config file
      Found 0 changed policies
      Found 0 new policies
      Found 0 old policies
      setkey returned successfully
  28. Rollout

  29. NONE NONE NONE
  30. USE NONE NONE
  31. USE USE USE
  32. REQUIRE USE USE
  33. REQUIRE REQUIRE REQUIRE
  34. Scraping lots of metrics
      $ sudo setkey -DP # SPD entries
      $ sudo racoonctl -l show-sa isakmp # Phase 1 relationships
      $ sudo setkey -D # Phase 2 relationships
      $ sudo ip xfrm state # Phase 2 relationships (newer format)
  35. auditd has useful events
      type=MAC_IPSEC_EVENT msg=audit(1432651251.889:2222847): op=SA-replayed-pkt …
      type=MAC_IPSEC_EVENT msg=audit(1432651251.889:2222848): op=SA-notfound …
  36. CONFIG_XFRM_STATISTICS
      $ cat /proc/net/xfrm_stat
      XfrmInError 0
      XfrmInBufferError 0
      XfrmInHdrError 0
      XfrmInNoStates 1238
      XfrmInStateProtoError 8
      XfrmInStateModeError 0
      XfrmInStateSeqError 500
      XfrmInStateExpired 0
      XfrmInStateMismatch 0
      XfrmInStateInvalid 0
      XfrmInTmplMismatch 0
      XfrmInNoPols 0
  37. Tales from production

  38. Lessons learned: AES works well, even without hardware acceleration; network probing is more difficult; Path MTU breaks after route timeout (still investigating)
  39. Ordering issues in DPD: DPD is on phase 1 relationship; no liveness check on phase 2 relationships; SAs can get out of sync; requires manually clearing the relationships
  40. Linux 3.7 - 3.13, xfrm4_gc_thresh: 2.6 used a large dynamic value; 3.7 switched to static value of 1024; 3.13 bumped to 32K after performance issues
  41. Linux 3.9: client kernel panics on server disconnect. http://sourceforge.net/p/ipsec-tools/bugs/86/
  42. “Corruption during AES encryption in Xen v4.1 or v3.4 paravirtual guests running a Linux 3.0+ kernel, combined with the lack of TCP checksum validation in IPSec Transport mode, which leads to the admission of corrupted TCP data on a ZooKeeper node, resulting in an unhandled exception from which ZooKeeper is unable to recover. Jeez. Talk about a needle in a haystack…”
  43. Should you do this?
  44. Need an agent: automatically manage policies; handle metrics collection and emission; implement phase 2 liveness checks; GC old relationships
  45. doug@pagerduty.com PAGERDUTY.COM/JOBS

  46. pagerduty.com Thank you!
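
Slide 44 lists metrics collection as one job for an agent, and slide 36 shows the counters the kernel exposes in /proc/net/xfrm_stat when built with CONFIG_XFRM_STATISTICS. The sketch below shows one way such counters could be parsed and compared between two samples. The function names are hypothetical, the earlier sample value is made up for illustration, and this is not the tooling used in the talk.

```python
from typing import Dict


def parse_xfrm_stat(text: str) -> Dict[str, int]:
    """Parse `name value` lines as found in /proc/net/xfrm_stat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, value = parts
        counters[name] = int(value)
    return counters


def counter_deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return only the counters that increased between two samples."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


if __name__ == "__main__":
    # The earlier sample is invented; the later one reuses values from slide 36.
    earlier = parse_xfrm_stat("XfrmInNoStates 1200\nXfrmInStateSeqError 500\n")
    later = parse_xfrm_stat("XfrmInNoStates 1238\nXfrmInStateSeqError 500\n")
    print(counter_deltas(earlier, later))
```

Reporting deltas rather than raw totals makes a sudden rise in a counter such as XfrmInNoStates easier to spot. On a real host the text would come from reading /proc/net/xfrm_stat at each collection interval.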