Elastic{ON} 2018 - Lyft's Wild Ride from Amazon ES to Self-Managed Elasticsearch

Elastic Co

March 01, 2018

Transcript

  1. In Three Acts
     • Pre-ES: Logging-as-a-Service, Splunk Cloud, $$$
     • ES on AWS: Drop-in Replacement, Funky Pipeline, Log Everything!, “ES is Broken Again”
     • Post-AWS: ???
  2. About Me
     • In tech for 7 years
     • At Lyft Observability since Aug 2016
     • Loggly, 2013-2014
     • Pilot
     • Bartender
     • (not at the same time)
  3. About Lyft
     In 2017:
     • 375.5M rides given (up from 162.5M in 2016)
     • >2,000 drop-offs/sec Halloween 2017
     • >2M rides given NYE 2017
     • ~2,100 employees (up from ~1,100); >700 engineers
     • 200+ microservices
     • 10,000+ EC2 instances
     = lots of logs
  4. Logs and Logs and Logs
     • Services
     • Envoy proxy: errors, outliers
     • Security: SSH auth, sysdig, syslog, osquery, SAML
     • Deployments
     • Data platform
     • Client errors
     • nginx access/errors
     • Logs about logs
  5. Logging at Lyft: The Before Times
     • Splunk Cloud
       ‒ Pro:
         · Powerful query language
         · No predefined schema
       ‒ Con:
         · ~14 days retention
         · High load ⇒ ingest backs up (logs up to 30 minutes late)
         · $$$
     • Splunk contract up for renewal Oct 2016
     • Let’s use Elasticsearch, that’s what the cool kids are doing
  6. Elasticsearch, Great!
     Flashback: Loggly, 2013
     • Elasticsearch 0.96
       ‒ (actually 0.2)
     • Learned the basics of keeping a cluster alive
       ‒ Cluster state!
       ‒ Mappings!
       ‒ Routing!
       ‒ Hot/warm!
       ‒ Index management!
     • Forgot most of it just in time to do the same thing all over again at Lyft
  7. Three Years Later
     • Elasticsearch 2.3
     • Mostly the same experience
       ‒ Stable APIs are great (except when they’re not)
     • Still time-based/manually time-sharded indices
     • Still Logstash*/Kibana
       ‒ And their warts/quirks
  8. E(L)K
     Lyft has an interesting logging pipeline
     • Heka tails logs and emits to Firehose
     • S3 CreateObject triggers ingest
     • Ingest unpacks objects, parses events, bulk indexes
     • Custom retry logic (DLQ)
     • Bulk retry
     • _id is hash of event (idempotent ingest; see the sketch below)
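     A minimal sketch of the idempotent-ingest idea above, assuming the elasticsearch-py client; the endpoint, index name, doc_type, and SHA-1 hashing are illustrative choices, not Lyft's actual pipeline code.

     import hashlib
     import json

     from elasticsearch import Elasticsearch, helpers

     es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint

     def event_id(event):
         # Hash the canonicalized event so a retried delivery indexes the
         # same _id and overwrites itself instead of creating a duplicate.
         canonical = json.dumps(event, sort_keys=True).encode("utf-8")
         return hashlib.sha1(canonical).hexdigest()

     def bulk_index(events, index="logs-2018.03.01", doc_type="event"):
         actions = (
             {
                 "_op_type": "index",
                 "_index": index,
                 "_type": doc_type,       # doc_type still in play on ES 2.x/5.x
                 "_id": event_id(event),  # idempotent: same event, same _id
                 "_source": event,
             }
             for event in events
         )
         # helpers.bulk raises on errors; the real pipeline routes failures
         # to a dead-letter queue and retries the bulk request.
         return helpers.bulk(es, actions)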
  9. ES++
     Elasticsearch 2.3 was great, and we wanted to jump to Elasticsearch 5, but:
     • Amazon was dragging their feet on upgrades
       ‒ They got better towards the end
     • Amazon makes parts of the recommended index lifecycle difficult
       ‒ Shrink in particular
     • Not Amazon’s fault: some parts of the lifecycle are counterproductive
       ‒ Shrinking turns out to be bad for query performance
     • Definitely Amazon’s fault: EBS
       ‒ Newer instance types are EBS-only, and EBS performance/reliability is sub-optimal for Elasticsearch at scale
       ‒ Instance storage is limited and bound to instance type
  10. So, About Amazon
     • Everything was fine for 4 months
       ‒ Ingest timeouts? Retention shrinking? Kibana slow? Scale up!
     • 100k epm → 1.5M epm
       ‒ Amazon’s biggest cluster
     • Then we hit Amazon’s cluster node limit
       ‒ 20 nodes at the time, eventually 40
     • Then…
  11. Everything Is Broken and We Don’t Know Why
     Elasticsearch started getting the hiccups
     • Cluster’s red, we’re not sure why*
       ‒ * more on this later
     • Can infer through CloudWatch that one node is sick
       ‒ High CPU, JVM memory pressure (GC death spiral)
     • Not unusual, relatively simple to fix (see the sketch below):
       ‒ Just restart Elasticsearch
       ‒ If that doesn’t work:
         · Add a replacement node
         · Disable routing to sick node
         · Wait for shards to evacuate
         · Decommission sick node
     But on AWS...
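     On a self-managed cluster, the “drain the sick node” steps above boil down to a couple of settings calls; a minimal sketch assuming elasticsearch-py, with the endpoint and node name purely hypothetical.

     import time

     from elasticsearch import Elasticsearch

     es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint
     SICK_NODE = "data-node-17"                     # hypothetical node name

     # Stop allocating shards to the sick node; ES starts moving its shards away.
     es.cluster.put_settings(body={
         "transient": {"cluster.routing.allocation.exclude._name": SICK_NODE}
     })

     def shards_on(node):
         # Count shards still assigned to the node via the cat shards API.
         return sum(1 for s in es.cat.shards(format="json") if s.get("node") == node)

     # Wait for the evacuation to finish before decommissioning the instance.
     while shards_on(SICK_NODE) > 0:
         time.sleep(30)

     # Safe to terminate the node now; clear the exclusion so its
     # replacement can take shards.
     es.cluster.put_settings(body={
         "transient": {"cluster.routing.allocation.exclude._name": ""}
     })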
  12. “You have opened a new Support case”
     1. Open a support ticket
        ‒ Wait (sometimes for hours) (during business hours)
        ‒ First-line support: “I see that your cluster is red”
        ‒ “Please give us the output of these API endpoints …”
     2. Escalate to ES team engineers
        ‒ “We see that one of your nodes needs to be shot”
        ‒ “We see JVM memory pressure is high, please try to reduce it”
        ‒ “Can you maybe stop logging so much?”
        ‒ Wait some more
     3. Expedite, option 1: call the TAM
        ‒ Eventually started going directly through the TAM to engineers, who knew the routine
     4. Expedite, option 2: roll the cluster
        ‒ Trivial change to IAM role ⇒ get an entirely new cluster (blue/green deploy)
        ‒ Would often get stuck “between” deploys, old nodes sticking around
        ‒ Still requires manual intervention by AWS support
  13. Apologia Pro Vita Sua: AWS Elasticsearch
     What AWS Elasticsearch is:
     • Push-button solution
     • Great for many use cases
     What it isn’t: a fully functional Elasticsearch cluster
     • The whole thing is behind a gateway
       ‒ Round-robin load balancer
       ‒ 60s timeout (on everything)
     • Most APIs are obfuscated
     • Configuration change ⇒ whole new cluster
  14. The Decision
     • Considered Elastic Cloud
       ‒ Price was an issue
     • We had enough experience in house
     • Small team, but really good infrastructure
     • ~2 weeks to fully transition
  15. After the Jump
     • Cluster composition
       ‒ Hot? Warm? Cold? Ingest? Tribe?
       ‒ How many instances?
       ‒ i2? r3? c4?
       ‒ How many nodes per instance?
     • Index lifecycle management (see the rollover sketch below)
       ‒ Rollover
       ‒ Alias management
       ‒ Bootstrap? Move? Shrink?
     • Find the land mines (read: character-building opportunities)
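     For the rollover piece above, a minimal sketch assuming elasticsearch-py against ES 5.x; the alias, index name, and conditions are illustrative.

     from elasticsearch import Elasticsearch

     es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint

     # Bootstrap: one concrete index behind a write alias.
     es.indices.create(index="logs-000001", body={"aliases": {"logs-write": {}}})

     # Run periodically: once the current index is old or big enough,
     # create the next one and repoint the write alias to it.
     es.indices.rollover(alias="logs-write", body={
         "conditions": {
             "max_age": "1d",
             "max_docs": 500000000,
         }
     })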
  16. When Your Customer is the Company
     Logging for a big enough company starts to look a lot like Logging-as-a-Service (but you can yell at your customers)
     Who’s logging?
     • All engineers
       ‒ Owned services
       ‒ Upstream services
     • Security
       ‒ Enriched audit logs
     • Data teams
     Some logs are more important than others
     • Info vs. warn/error/critical
     • 200 vs. 500
  17. When Your Customer is the Company
     QoS is critical
     • Ingest rate limiting
       ‒ Prioritized
     • Query rate/complexity limiting
       ‒ Kibana doesn’t really make this possible by itself
       ‒ Reverse proxies do
     • Mapping limits (see the sketch below)
       ‒ Field cardinality
     • Failure isolation
       ‒ Multiple index series, multiple clusters
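     One way to enforce the mapping limits mentioned above (on ES 5.x) is an index template; a minimal sketch assuming elasticsearch-py, with the template name, pattern, and limit values chosen for illustration.

     from elasticsearch import Elasticsearch

     es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint

     # Cap how many fields a noisy logger can create per index so one
     # misbehaving service can't blow up the mapping.
     es.indices.put_template(name="logs-defaults", body={
         "template": "logs-*",
         "settings": {
             "index.mapping.total_fields.limit": 2000,
             "index.mapping.depth.limit": 5,
         },
     })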
  18. GIGO
     • Many different log formats
     • doc_types are a bit of a pitfall
     • Same index, multiple types
     • Namespacing is a must
     • Mapping conflicts cause missing logs
       ‒ Mitigated (mostly) by namespaces (see the example below)
     • Perfect world:
       ‒ Stable event IDs
       ‒ One doc_type
       ‒ Better-behaved logs
     • “Log everything” ≠ “log anything”
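     To make the mapping-conflict point concrete, a hypothetical example (not from the talk) of two services logging the same field with incompatible types, and the namespaced form that avoids the collision.

     # Whichever of these reaches a fresh index first wins the mapping for
     # "response"; documents from the other service are then rejected,
     # i.e. missing logs.
     event_a = {"service": "rides", "response": 500}                 # number
     event_b = {"service": "locations", "response": {"code": 500}}   # object

     # Namespacing each service's fields under its own key keeps the
     # mappings from colliding.
     event_a_namespaced = {"rides": {"response": 500}}
     event_b_namespaced = {"locations": {"response": {"code": 500}}}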
  19. Logs to Love and Loathe
     Good: structured events (JSON)
       {"ts": "2017-06-14T19:19:59.628Z", …}
     Okay: key-value
       ts=2017-06-14T19:19:59.628Z uuid=97027b76-7001-4be8-b49a-894807ecc174 app=locations name=locations.map_matching.v1b5 lvlname=INFO [...]
     Bad: some unparseable mess
     • Unescaped embedded data structures
     • Multi-line exceptions
     • Complicated regex
  20. It Builds Character
     Kibana offers “opportunities for adventure”
     • “Refresh field list”
       ‒ Would reliably kill a large enough cluster
       ‒ Hacked periodic manual updates as a workaround
     • “View surrounding documents”
       ‒ Also used to murder the cluster (by blasting a search to every single index)
     • Lots of mappings?
       ‒ Refreshing mappings in Kibana console can break in several ways
  21. It Builds Character
     • _cluster/stats
       ‒ We had a bug that was hammering this endpoint
       ‒ The overhead acted as a load multiplier and reliably brought us down
     • Allocation settings (see the sketch below)
       ‒ "enable": "none" (the “page me at 3am” button)
     • Routing settings
       ‒ Easy to mess these up and end up with eternally unassigned shards and a red cluster
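     For context on the “page me at 3am” setting above, a minimal sketch (assuming elasticsearch-py) of toggling shard allocation around a restart and checking what is left unassigned; the settings names are stock Elasticsearch, not anything AWS-specific.

     from elasticsearch import Elasticsearch

     es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint

     # Temporarily stop all shard allocation, e.g. for a rolling restart.
     # Forgetting to re-enable this is how shards stay unassigned forever.
     es.cluster.put_settings(body={
         "transient": {"cluster.routing.allocation.enable": "none"}
     })

     # ... restart the node ...

     # Re-enable allocation and confirm nothing is left unassigned.
     es.cluster.put_settings(body={
         "transient": {"cluster.routing.allocation.enable": "all"}
     })
     print(es.cluster.health()["unassigned_shards"])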
  22. It Builds Character
     Garbage Collector
     • CMS is a disaster
       ‒ Daily GC spirals
     • Use G1GC (see the config sketch below)
       ‒ Seriously, turn it on right now
       ‒ Lots of FUD online about data corruption
       ‒ No more GC spirals (at all) (ever)
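     A sketch of what the switch looks like in a stock Elasticsearch 5.x config/jvm.options (assuming the default flags shipped with ES): comment out the CMS lines and enable G1.

     ## GC configuration
     # -XX:+UseConcMarkSweepGC
     # -XX:CMSInitiatingOccupancyFraction=75
     # -XX:+UseCMSInitiatingOccupancyOnly
     -XX:+UseG1GC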
  23. It Builds Character
     fstrim
     • Enabled by default on NVMe instances (i3+)
     • Cluster died at 11:45pm sharp every Saturday
     • Mystified us for weeks
     • Looked at random instance metrics
     • “Hmm, why is it stuck in iowait for 2 hours?”
  24. In Conclusion
     AWS ES is good for what it’s good at
     • Engineering and support are improving
     Elasticsearch is great, but
     • Never intended to be a TSDB
     • Need to add your own tools
     Know what you’re getting into
     • Know your scale
     • Know your data
     • No Wrong Way to get logs into ES
       ‒ (but some are better than others)