IPsec mesh network: Perfect for the cloud?

Velocity 2015 session: http://velocityconf.com/devops-web-performance-2015/public/schedule/detail/41454
Video: https://www.youtube.com/watch?v=320KfBkGNyY

Private networks are traditionally assumed secure, and traffic crossing the public internet is secured via VPN. For data center-to-data center traffic, that involves a site-to-site tunnel with a VPN concentrator on either side of the encrypted tunnel.

This hub-and-spoke architecture presents a simple solution for protecting network traffic over the internet, but making that solution highly available involves added complexity. Multiple tunnels need to exist, with traffic load balanced across them. Automation must exist to detect down (and half down) links and redirect traffic. Also, these tunnels should scale up as the amount of traffic passing between data centers increases. And this solution does nothing to protect network traffic on the private network—a private network that is increasingly managed by cloud providers and shared with other companies.

When traffic is flowing over networks that we don’t manage (both over the WAN and the LAN), it is time to rethink our network security practices. By using DevOps practices in our network systems, PagerDuty was able to get rid of the hub-and-spoke model and instead use an IPsec mesh architecture. Each server in our system establishes a security association with each of its peers and transmits all traffic using IPsec transport mode. Each host manages encryption and decryption of its own traffic, so our ability to protect that traffic naturally scales up as we add new infrastructure.

This talk will focus on how we implemented that model on our Linux fleet. We will dig into the details of our configuration including the policies we use, and the encryption and authentication mechanisms in place. We will talk about how this model performs on our systems and the impact it has on the production workload. Finally, we will discuss how it handles failure, bugs we’ve found along the way, and how we see this model changing as our infrastructure continues to grow.

In the end, I hope to have given everyone a better understanding of how VPNs work, and how through combining the development and operations disciplines we can produce a solution that was previously considered impractical.

Doug Barth

May 28, 2015

Transcript

  1. 5/29/15
    @dougbarth
    IPsec mesh network: perfect for the cloud?
    VELOCITY SANTA CLARA 2015

  2. 5/29/15
    MAKING PAGERDUTY MORE RELIABLE USING PXC

  3. 5/29/15
    IPSEC MESH NETWORK: PERFECT FOR THE CLOUD?

  4. 5/29/15
    Traditional setup
    PagerDuty’s setup
    Rollout
    Tales from production
    Should you do this?

  5. 5/29/15
    Traditional setup

  6. 5/29/15
    [Diagram: PUBLIC, PRIVATE, DMZ]

  7. 5/29/15
    TLS landmines
    Finagle’s TLS + IPsec IPsec only
    TLS penalty in MySQL

  8. 5/29/15

  9. 5/29/15
    [Diagram: VPN, TUNNEL, VPN]

  10. 5/29/15
    [Diagram: VPC, NO VPC]

  11. 5/29/15
    PagerDuty’s setup

  12. 5/29/15
    Hosting setup
    Cloud providers only
    Multi-datacenter deployment
    Several different technologies
      Percona XtraDB Cluster
      ZooKeeper
      Cassandra
      Ruby/Rails
      Scala/Finagle
      nginx
    [Diagram: US-WEST-2, US-WEST-1, LINODE]

  13. 5/29/15
    PagerDuty’s networking goals
    Encrypted by default
    Failures at the instance level (not the DC level)
    Throughput scales as our infrastructure grows

  14. 5/29/15
    Mesh network
    Every box handles its own encryption
    Policy enforcement distributed
    CM makes this manageable

  15. 5/29/15
    Ubuntu 10.04 & 14.04
    ipsec-tools
    racoon

  16. 5/29/15
    Phase 1
    # This configures how Phase 1 key exchange occurs. We keep the remote end
    # anonymous so we don't have to bounce racoon (and therefore lose our SAs) when
    # new boxes are added.
    remote anonymous {

  17. 5/29/15
    Phase 1
    exchange_mode main;
    [Diagram: INITIATOR, RESPONDER; POLICY, DHE, AUTH]

  18. 5/29/15
    Phase 1
    proposal {
      authentication_method pre_shared_key;
      dh_group modp3072;
      encryption_algorithm aes;
      hash_algorithm sha256;
    }

  19. 5/29/15
    Phase 1
    # Per connection PSK
    # box01
    10.0.0.1 0xdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef
    50.0.0.1 0xdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef
    # box02
    10.0.0.2 0x0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
    50.0.0.2 0x0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
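
The PSK file layout on this slide (one key per box, repeated for each of its addresses) could be produced by configuration management along these lines. This is only a sketch: `psk_lines` and the `boxes` mapping are hypothetical names, and the talk does not show how PagerDuty actually generates its keys.

```python
import secrets


def psk_lines(boxes):
    """Return psk.txt-style lines: one random key per box, listed per address.

    `boxes` maps a box name to its list of IP addresses (an assumed layout).
    """
    lines = []
    for name, addresses in boxes.items():
        # 32 random bytes rendered as 64 hex digits, matching the key length
        # shown on the slide.
        key = "0x" + secrets.token_hex(32)
        lines.append(f"# {name}")
        for addr in addresses:
            lines.append(f"{addr} {key}")
    return lines
```

Calling `psk_lines({"box01": ["10.0.0.1", "50.0.0.1"]})` returns a comment line followed by two address lines that share one key.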

  20. 5/29/15
    Phase 1
    # EC2 doesn’t route ESP
    nat_traversal force;
    [Diagram: UDP, ESP]

  21. 5/29/15
    Phase 1
    lifetime time 24 hours;

  22. 5/29/15
    Phase 1
    dpd_delay 20;
    [Diagram: INITIATOR, RESPONDER; R-U-THERE, R-U-THERE-ACK]

  23. 5/29/15
    Phase 2
    # This configures the SA parameters. Again, anonymous so we don't need to
    # bounce racoon when new boxes are added.
    sainfo anonymous {
      pfs_group modp3072;
      encryption_algorithm aes;
      authentication_algorithm hmac_sha256;
      compression_algorithm deflate;
      lifetime time 8 hours;
    }

  24. 5/29/15
    SPD: WAN & LAN encryption
    # WAN
    spdadd 50.0.0.1 10.0.0.2 any -P out ipsec esp/transport//require;
    spdadd 10.0.0.2 50.0.0.1 any -P in ipsec esp/transport//require;
    spdadd 10.0.0.1 50.0.0.2 any -P out ipsec esp/transport//require;
    spdadd 50.0.0.2 10.0.0.1 any -P in ipsec esp/transport//require;
    # LAN
    spdadd 10.0.0.1 10.0.0.2 any -P out ipsec esp/transport//require;
    spdadd 10.0.0.2 10.0.0.1 any -P in ipsec esp/transport//require;
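
In a mesh, every host needs an out/in policy pair like the LAN pair above for each of its peers, which is why the slides lean on configuration management. A minimal sketch of that generation step follows; `mesh_policies` is a hypothetical helper, it covers only the single-address (LAN-style) case, and it is not PagerDuty's actual tooling.

```python
def mesh_policies(local_ip, peer_ips, level="require"):
    """Return setkey `spdadd` lines for transport-mode ESP between
    local_ip and each peer, in the same shape as the LAN lines on the slide."""
    lines = []
    for peer in sorted(peer_ips):
        if peer == local_ip:
            continue  # no policy for traffic to ourselves
        lines.append(
            f"spdadd {local_ip} {peer} any -P out ipsec esp/transport//{level};"
        )
        lines.append(
            f"spdadd {peer} {local_ip} any -P in ipsec esp/transport//{level};"
        )
    return lines
```

Each host ends up with two policies per peer, so a fleet of n hosts carries 2 × (n − 1) point-to-point policies on every box under this scheme.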

  25. 5/29/15
    SPD: SSH excluded
    spdadd 0.0.0.0/0 0.0.0.0/0[22] tcp -P out prio def + 100 none;
    spdadd 0.0.0.0/0 0.0.0.0/0[22] tcp -P in prio def + 100 none;
    spdadd 0.0.0.0/0[22] 0.0.0.0/0 tcp -P out prio def + 100 none;
    spdadd 0.0.0.0/0[22] 0.0.0.0/0 tcp -P in prio def + 100 none;

  26. 5/29/15
    SPD: ICMP excluded too
    # Exclude ICMP from IPsec.
    #
    # Having ICMP encrypted makes it difficult for us to investigate networking
    # issues. traceroute between machines doesn't work because traceroute doesn't
    # realize that the TTL expired packet referencing the UDP-encap packet is meant
    # for it. A similar issue exists for mtr.
    spdadd 0.0.0.0/0 0.0.0.0/0 icmp -P out prio def + 100 none;
    spdadd 0.0.0.0/0 0.0.0.0/0 icmp -P in prio def + 100 none;

  27. 5/29/15
    pd-sync-policies
    Found 2589 existing policies, 1718 existing point-to-point policies
    Loading policies from ["/etc/ipsec-tools.conf", "/etc/ipsec-tools.d/*.conf"]
    Found 1718 policies in the config file
    Found 0 changed policies
    Found 0 new policies
    Found 0 old policies
    setkey returned successfully
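
The changed/new/old counts in this output suggest a diff between the policies loaded in the kernel and those in the config files. The internals of `pd-sync-policies` are not shown in the talk, so the following is only a sketch of such a comparison; `diff_policies` and the selector tuples are hypothetical.

```python
def diff_policies(existing, desired):
    """Compare two {selector: policy} mappings.

    Returns (changed, new, old) as sorted lists of selectors:
      changed: present in both, but with a different policy
      new:     only in the desired (config file) set
      old:     only in the existing (kernel) set
    """
    changed = sorted(
        k for k in desired if k in existing and existing[k] != desired[k]
    )
    new = sorted(k for k in desired if k not in existing)
    old = sorted(k for k in existing if k not in desired)
    return changed, new, old
```

Applying only the resulting delta, rather than flushing and reloading the whole SPD, would avoid a window in which traffic has no policy at all.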

  28. 5/29/15
    Rollout

  29. 5/29/15
    [Diagram: NONE, NONE, NONE]

  30. 5/29/15
    [Diagram: USE, NONE, NONE]

  31. 5/29/15
    [Diagram: USE, USE, USE]

  32. 5/29/15
    [Diagram: REQUIRE, USE, USE]

  33. 5/29/15
    [Diagram: REQUIRE, REQUIRE, REQUIRE]
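
The rollout diagrams move hosts from no policy to `use` and only then to `require`. One way to express that ordering is sketched below; `rollout_stages` is a hypothetical helper that emits every intermediate step, whereas the slides show only a few of them.

```python
def rollout_stages(hosts):
    """Yield {host: level} snapshots for a staged rollout.

    Every host is moved to "use" before any host is moved to "require",
    so no snapshot ever pairs a "require" host with a "none" host.
    """
    levels = {h: "none" for h in hosts}
    yield dict(levels)
    for target in ("use", "require"):
        for h in hosts:
            levels[h] = target
            yield dict(levels)
```

The point of the ordering is that a `use` policy still accepts cleartext, while a `require` host would drop traffic from a peer that has no policy yet.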

  34. 5/29/15
    Scraping lots of metrics
    $ sudo setkey -DP # SPD entries
    $ sudo racoonctl -l show-sa isakmp # Phase 1 relationships
    $ sudo setkey -D # Phase 2 relationships
    $ sudo ip xfrm state # Phase 2 relationships (newer format)

  35. 5/29/15
    auditd has useful events
    type=MAC_IPSEC_EVENT msg=audit(1432651251.889:2222847): op=SA-replayed-pkt …
    type=MAC_IPSEC_EVENT msg=audit(1432651251.889:2222848): op=SA-notfound …
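
Audit events like these can be turned into per-operation counters for a metrics system. The sketch below relies only on the `type=` and `op=` fields visible on the slide; `count_ipsec_ops` is a hypothetical helper, and the rest of each record (truncated on the slide) is ignored.

```python
import re
from collections import Counter

# Match the op= field on MAC_IPSEC_EVENT records; other audit types are skipped.
_OP = re.compile(r"type=MAC_IPSEC_EVENT\b.*?\bop=(\S+)")


def count_ipsec_ops(lines):
    """Count MAC_IPSEC_EVENT audit records by their op= value."""
    counts = Counter()
    for line in lines:
        match = _OP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Fed the two lines from the slide, it yields one `SA-replayed-pkt` and one `SA-notfound`.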

  36. 5/29/15
    CONFIG_XFRM_STATISTICS
    $ cat /proc/net/xfrm_stat
    XfrmInError 0
    XfrmInBufferError 0
    XfrmInHdrError 0
    XfrmInNoStates 1238
    XfrmInStateProtoError 8
    XfrmInStateModeError 0
    XfrmInStateSeqError 500
    XfrmInStateExpired 0
    XfrmInStateMismatch 0
    XfrmInStateInvalid 0
    XfrmInTmplMismatch 0
    XfrmInNoPols 0
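
The `/proc/net/xfrm_stat` counters are simple name/value pairs, so scraping them takes only a few lines. This is a minimal sketch that takes the file contents as a string; `parse_xfrm_stat` and `nonzero` are hypothetical helper names.

```python
def parse_xfrm_stat(text):
    """Parse /proc/net/xfrm_stat-style text into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "Name <integer>" lines.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def nonzero(stats):
    """Return only the counters that have recorded at least one event."""
    return {name: value for name, value in stats.items() if value}
```

For the sample on the slide, `nonzero` leaves just `XfrmInNoStates`, `XfrmInStateProtoError`, and `XfrmInStateSeqError`.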

  37. 5/29/15
    Tales from production

  38. 5/29/15
    Lessons learned
    AES works well, even without hardware acceleration
    Network probing is more difficult
    Path MTU breaks after route timeout. Still investigating.

  39. 5/29/15
    Ordering issues in DPD
    DPD is on phase 1 relationship
    No liveness check on phase 2 relationships
    SAs can get out of sync
    Requires manually clearing the relationships

  40. 5/29/15
    Linux 3.7 - 3.13
    xfrm4_gc_thresh
    2.6 used a large dynamic value
    3.7 switched to static value of 1024
    3.13 bumped to 32K after performance issues

  41. 5/29/15
    Linux 3.9
    Client kernel panics on server disconnect
    http://sourceforge.net/p/ipsec-tools/bugs/86/

  42. 5/29/15
    “Corruption during AES encryption in Xen v4.1 or v3.4 paravirtual guests running a
    Linux 3.0+ kernel, combined with the lack of TCP checksum validation in IPSec
    Transport mode, which leads to the admission of corrupted TCP data on a ZooKeeper
    node, resulting in an unhandled exception from which ZooKeeper is unable to
    recover. Jeez. Talk about a needle in a haystack…”

  43. 5/29/15
    Should you do this?

  44. 5/29/15
    Need an agent
    Automatically manage policies
    Handle metrics collection and emission
    Implement phase 2 liveness checks
    GC old relationships

  45. 5/29/15
    [email protected]
    PAGERDUTY.COM/JOBS

  46. 5/29/15
    pagerduty.com
    Thank you!