Elastic{ON} 2018 - Lyft's Wild Ride from Amazon ES to Self-Managed Elasticsearch

Elastic Co

March 01, 2018

Transcript

  1. In Three Acts
     • Pre-ES: Logging-as-a-Service, Splunk Cloud, $$$
     • ES on AWS: Drop-in Replacement, Funky Pipeline, Log Everything!, “ES is Broken Again”
     • Post-AWS: ???
  2. About Me
     • In tech for 7 years
     • At Lyft Observability since Aug 2016
     • Loggly, 2013-2014
     • Pilot
     • Bartender
     • (not at the same time)
  3. About Lyft
     In 2017:
     • 375.5M rides given (up from 162.5M in 2016)
     • >2,000 drop-offs/sec Halloween 2017
     • >2M rides given NYE 2017
     • ~2,100 employees (up from ~1,100); >700 engineers
     • 200+ microservices
     • 10,000+ EC2 instances
     = lots of logs
  4. Logs and Logs and Logs
     • Services
     • Envoy proxy: errors, outliers
     • Security: SSH auth, sysdig, syslog, osquery, SAML
     • Deployments
     • Data platform
     • Client errors
     • nginx access/errors
     • Logs about logs
  5. Logging at Lyft: The Before Times
     • Splunk Cloud
       ‒ Pro:
         · Powerful query language
         · No predefined schema
       ‒ Con:
         · ~14 days retention
         · High load ⇒ ingest backs up (logs up to 30 minutes late)
         · $$$
     • Splunk contract up for renewal Oct 2016
     • Let’s use Elasticsearch, that’s what the cool kids are doing
  6. Elasticsearch, Great!
     Flashback: Loggly, 2013
     • Elasticsearch 0.96
       ‒ (actually 0.2)
     • Learned the basics of keeping a cluster alive
       ‒ Cluster state!
       ‒ Mappings!
       ‒ Routing!
       ‒ Hot/warm!
       ‒ Index management!
     • Forgot most of it just in time to do the same thing all over again at Lyft
  7. Three Years Later
     • Elasticsearch 2.3
     • Mostly the same experience
       ‒ Stable APIs are great (except when they’re not)
     • Still time-based/manually time-sharded indices
     • Still Logstash*/Kibana
       ‒ And their warts/quirks
  8. E(L)K
     Lyft has an interesting logging pipeline
     • Heka tails logs and emits to Firehose
     • S3 CreateObject triggers ingest
     • Ingest unpacks objects, parses events, bulk indexes
     • Custom retry logic (DLQ)
     • Bulk retry
     • _id is hash of event (idempotent ingest; see the sketch below)
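     A minimal sketch of the idempotent-ingest idea above, assuming the elasticsearch-py client; the endpoint, index name, doc_type, and SHA-1 hashing are illustrative choices, not Lyft's actual pipeline code.

     import hashlib
     import json

     from elasticsearch import Elasticsearch, helpers

     es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint

     def event_id(event):
         # Hash the canonicalized event so a retried delivery indexes the
         # same _id and overwrites itself instead of creating a duplicate.
         canonical = json.dumps(event, sort_keys=True).encode("utf-8")
         return hashlib.sha1(canonical).hexdigest()

     def bulk_index(events, index="logs-2018.03.01", doc_type="event"):
         actions = (
             {
                 "_op_type": "index",
                 "_index": index,
                 "_type": doc_type,       # doc_type still in play on ES 2.x/5.x
                 "_id": event_id(event),  # idempotent: same event, same _id
                 "_source": event,
             }
             for event in events
         )
         # helpers.bulk raises on errors; the real pipeline routes failures
         # to a dead-letter queue and retries the bulk request.
         return helpers.bulk(es, actions)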
  9. ES++
     Elasticsearch 2.3 was great, and we wanted to jump to Elasticsearch 5, but:
     • Amazon was dragging their feet on upgrades
       ‒ They got better towards the end
     • Amazon makes parts of the recommended index lifecycle difficult
       ‒ Shrink in particular
     • Not Amazon’s fault: some parts of the lifecycle are counterproductive
       ‒ Shrinking turns out to be bad for query performance
     • Definitely Amazon’s fault: EBS
       ‒ Newer instance types are EBS-only, and EBS performance/reliability is sub-optimal for Elasticsearch at scale
       ‒ Instance storage is limited and bound to instance type
  10. So, About Amazon
     • Everything was fine for 4 months
       ‒ Ingest timeouts? Retention shrinking? Kibana slow? Scale up!
     • 100k epm → 1.5M epm
       ‒ Amazon’s biggest cluster
     • Then we hit Amazon’s cluster node limit
       ‒ 20 nodes at the time, eventually 40
     • Then…
  11. Everything Is Broken and We Don’t Know Why
     Elasticsearch started getting the hiccups
     • Cluster’s red, we’re not sure why*
       ‒ * more on this later
     • Can infer through CloudWatch that one node is sick
       ‒ High CPU, JVM memory pressure (GC death spiral)
     • Not unusual, relatively simple to fix (see the sketch below):
       ‒ Just restart Elasticsearch
       ‒ If that doesn’t work:
         · Add a replacement node
         · Disable routing to sick node
         · Wait for shards to evacuate
         · Decommission sick node
     But on AWS...
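     On a self-managed cluster, the “drain the sick node” steps above boil down to a couple of settings calls; a minimal sketch assuming elasticsearch-py, with the endpoint and node name purely hypothetical.

     import time

     from elasticsearch import Elasticsearch

     es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint
     SICK_NODE = "data-node-17"                     # hypothetical node name

     # Stop allocating shards to the sick node; ES starts moving its shards away.
     es.cluster.put_settings(body={
         "transient": {"cluster.routing.allocation.exclude._name": SICK_NODE}
     })

     def shards_on(node):
         # Count shards still assigned to the node via the cat shards API.
         return sum(1 for s in es.cat.shards(format="json") if s.get("node") == node)

     # Wait for the evacuation to finish before decommissioning the instance.
     while shards_on(SICK_NODE) > 0:
         time.sleep(30)

     # Safe to terminate the node now; clear the exclusion so its
     # replacement can take shards.
     es.cluster.put_settings(body={
         "transient": {"cluster.routing.allocation.exclude._name": ""}
     })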
  12. “You have opened a new Support case”
     1. Open a support ticket
        ‒ Wait (sometimes for hours) (during business hours)
        ‒ First-line support: “I see that your cluster is red”
        ‒ “Please give us the output of these API endpoints …”
     2. Escalate to ES team engineers
        ‒ “We see that one of your nodes needs to be shot”
        ‒ “We see JVM memory pressure is high, please try to reduce it”
        ‒ “Can you maybe stop logging so much?”
        ‒ Wait some more
     3. Expedite, option 1: call the TAM
        ‒ Eventually started going directly through the TAM to engineers, who knew the routine
     4. Expedite, option 2: roll the cluster
        ‒ Trivial change to IAM role ⇒ get an entirely new cluster (blue/green deploy)
        ‒ Would often get stuck “between” deploys, old nodes sticking around
        ‒ Still requires manual intervention by AWS support
  13. Apologia Pro Vita Sua: AWS Elasticsearch
     What AWS Elasticsearch is:
     • Push-button solution
     • Great for many use cases
     What it isn’t: a fully functional Elasticsearch cluster
     • The whole thing is behind a gateway
       ‒ Round-robin load balancer
       ‒ 60s timeout (on everything)
     • Most APIs are obfuscated
     • Configuration change ⇒ whole new cluster
  14. The Decision
     • Considered Elastic Cloud
       ‒ Price was an issue
     • We had enough experience in house
     • Small team, but really good infrastructure
     • ~2 weeks to fully transition
  15. After the Jump
     • Cluster composition
       ‒ Hot? Warm? Cold? Ingest? Tribe?
       ‒ How many instances?
       ‒ i2? r3? c4?
       ‒ How many nodes per instance?
     • Index lifecycle management (see the rollover sketch below)
       ‒ Rollover
       ‒ Alias management
       ‒ Bootstrap? Move? Shrink?
     • Find the land mines (read: character-building opportunities)
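     For the rollover piece above, a minimal sketch assuming elasticsearch-py against ES 5.x; the alias, index name, and conditions are illustrative.

     from elasticsearch import Elasticsearch

     es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint

     # Bootstrap: one concrete index behind a write alias.
     es.indices.create(index="logs-000001", body={"aliases": {"logs-write": {}}})

     # Run periodically: once the current index is old or big enough,
     # create the next one and repoint the write alias to it.
     es.indices.rollover(alias="logs-write", body={
         "conditions": {
             "max_age": "1d",
             "max_docs": 500000000,
         }
     })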
  16. When Your Customer is the Company
     Logging for a big enough company starts to look a lot like Logging-as-a-Service (but you can yell at your customers)
     Who’s logging?
     • All engineers
       ‒ Owned services
       ‒ Upstream services
     • Security
       ‒ Enriched audit logs
     • Data teams
     Some logs are more important than others
     • Info vs. warn/error/critical
     • 200 vs. 500
  17. When Your Customer is the Company
     QoS is critical
     • Ingest rate limiting
       ‒ Prioritized
     • Query rate/complexity limiting
       ‒ Kibana doesn’t really make this possible by itself
       ‒ Reverse proxies do
     • Mapping limits (see the sketch below)
       ‒ Field cardinality
     • Failure isolation
       ‒ Multiple index series, multiple clusters
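     One way to enforce the mapping limits mentioned above (on ES 5.x) is an index template; a minimal sketch assuming elasticsearch-py, with the template name, pattern, and limit values chosen for illustration.

     from elasticsearch import Elasticsearch

     es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint

     # Cap how many fields a noisy logger can create per index so one
     # misbehaving service can't blow up the mapping.
     es.indices.put_template(name="logs-defaults", body={
         "template": "logs-*",
         "settings": {
             "index.mapping.total_fields.limit": 2000,
             "index.mapping.depth.limit": 5,
         },
     })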
  18. GIGO
     • Many different log formats
     • doc_types are a bit of a pitfall
     • Same index, multiple types
     • Namespacing is a must
     • Mapping conflicts cause missing logs
       ‒ Mitigated (mostly) by namespaces (see the example below)
     • Perfect world:
       ‒ Stable event IDs
       ‒ One doc_type
       ‒ Better-behaved logs
     • “Log everything” ≠ “log anything”
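     To make the mapping-conflict point concrete, a hypothetical example (not from the talk) of two services logging the same field with incompatible types, and the namespaced form that avoids the collision.

     # Whichever of these reaches a fresh index first wins the mapping for
     # "response"; documents from the other service are then rejected,
     # i.e. missing logs.
     event_a = {"service": "rides", "response": 500}                 # number
     event_b = {"service": "locations", "response": {"code": 500}}   # object

     # Namespacing each service's fields under its own key keeps the
     # mappings from colliding.
     event_a_namespaced = {"rides": {"response": 500}}
     event_b_namespaced = {"locations": {"response": {"code": 500}}}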
  19. Logs to Love and Loathe
     Good: structured events (JSON)
       {"ts": "2017-06-14T19:19:59.628Z", …}
     Okay: key-value
       ts=2017-06-14T19:19:59.628Z uuid=97027b76-7001-4be8-b49a-894807ecc174 app=locations name=locations.map_matching.v1b5 lvlname=INFO [...]
     Bad: some unparseable mess
     • Unescaped embedded data structures
     • Multi-line exceptions
     • Complicated regex
  20. It Builds Character
     Kibana offers “opportunities for adventure”
     • “Refresh field list”
       ‒ Would reliably kill a large enough cluster
       ‒ Hacked periodic manual updates as a workaround
     • “View surrounding documents”
       ‒ Also used to murder the cluster (by blasting a search to every single index)
     • Lots of mappings?
       ‒ Refreshing mappings in Kibana console can break in several ways
  21. It Builds Character
     • _cluster/stats
       ‒ We had a bug that was hammering this endpoint
       ‒ The overhead acted as a load multiplier and reliably brought us down
     • Allocation settings (see the sketch below)
       ‒ "enable": "none" (the “page me at 3am” button)
     • Routing settings
       ‒ Easy to mess these up and end up with eternally unassigned shards and a red cluster
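     For context on the “page me at 3am” setting above, a minimal sketch (assuming elasticsearch-py) of toggling shard allocation around a restart and checking what is left unassigned; the settings names are stock Elasticsearch, not anything AWS-specific.

     from elasticsearch import Elasticsearch

     es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint

     # Temporarily stop all shard allocation, e.g. for a rolling restart.
     # Forgetting to re-enable this is how shards stay unassigned forever.
     es.cluster.put_settings(body={
         "transient": {"cluster.routing.allocation.enable": "none"}
     })

     # ... restart the node ...

     # Re-enable allocation and confirm nothing is left unassigned.
     es.cluster.put_settings(body={
         "transient": {"cluster.routing.allocation.enable": "all"}
     })
     print(es.cluster.health()["unassigned_shards"])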
  22. It Builds Character
     Garbage Collector
     • CMS is a disaster
       ‒ Daily GC spirals
     • Use G1GC (see the config sketch below)
       ‒ Seriously, turn it on right now
       ‒ Lots of FUD online about data corruption
       ‒ No more GC spirals (at all) (ever)
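     A sketch of what the switch looks like in a stock Elasticsearch 5.x config/jvm.options (assuming the default flags shipped with ES): comment out the CMS lines and enable G1.

     ## GC configuration
     # -XX:+UseConcMarkSweepGC
     # -XX:CMSInitiatingOccupancyFraction=75
     # -XX:+UseCMSInitiatingOccupancyOnly
     -XX:+UseG1GC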
  23. It Builds Character
     fstrim
     • Enabled by default on NVMe instances (i3+)
     • Cluster died at 11:45pm sharp every Saturday
     • Mystified us for weeks
     • Looked at random instance metrics
     • “Hmm, why is it stuck in iowait for 2 hours?”
  24. In Conclusion
     AWS ES is good for what it’s good at
     • Engineering and support are improving
     Elasticsearch is great, but
     • Never intended to be a TSDB
     • Need to add your own tools
     Know what you’re getting into
     • Know your scale
     • Know your data
     • No Wrong Way to get logs into ES
       ‒ (but some are better than others)