Etsy on Migrating to Kafka (in three short years)

Migrating to Kafka in Three Short Years
A look at the choices that defined the Etsy analytics stack

Path Dependence

Decisions made in the past limit options in the present,
even if the circumstances under which those past decisions were made are no longer relevant.

In other words, we can’t upgrade the Hadoop cluster until
we port all of the Cascading.jruby jobs to Scalding.

Sneak Preview
1. How Etsy built its original analytics stack
2. Handling changes prepared us to rebuild our data pipeline
3. Kafka!

Starting from scratch

Choice #1: Acquire Adtuitive

Before you can work on search, you need real analytics

Choice #2: Build a zero-impact analytics stack

Etsy is not a cloud company, but the first analytics
stack was cloud-based

(Illustration: browser, CDN, EMR, S3, MySQL, FTP)

Legacy effects:
• 24-hour latency on events
• 48-hour latency on visits

Choice #3: Cascading.jruby

Hadoop Cascading Cascading.jruby

Choice #4: Use the GA _utma cookie to define visits

Benefits:
• Simpler ETL
• Visits computed on the client side
• Easy to reconcile against Google Analytics

Choice #5: Use the existing feature library for A/B tests

Leveraged existing experience with operational ramp-ups

Low impact: just required a logging change

Choice #6: Build analytics stack around visit-level metrics

Great for search and ads, less great for measuring engagement

Changing the tires without stopping the car

How do we instrument the iOS app?
Summer 2012

1. Native app visits should have the same structure as
Web visits

2. Native app events should use the existing data pipeline

3. The native app should buffer events and send them
when convenient

Solution:
1. App uploads bundles of events to API endpoint
2. Backend event logger curls the beacon for every event
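
The two-step solution above can be sketched roughly as follows. This is a minimal illustration, not Etsy's code: the `EventBuffer` class, the `BEACON_URL` value, and the event field names are all hypothetical, and the network calls are replaced by plain functions so the flow is easy to follow.

```python
from urllib.parse import urlencode

# Hypothetical beacon address; the real endpoint is not given in the talk.
BEACON_URL = "https://beacon.example.com/b.gif"


class EventBuffer:
    """Client-side buffer: collect events, then upload them as one bundle."""

    def __init__(self, upload):
        self._upload = upload  # callable that receives a list of events
        self._events = []

    def record(self, event_type, **fields):
        self._events.append({"event_type": event_type, **fields})

    def flush(self):
        """Send everything buffered so far, then clear the buffer."""
        if not self._events:
            return 0
        bundle, self._events = self._events, []
        self._upload(bundle)
        return len(bundle)


def beacon_urls(bundle):
    """Backend side: turn one uploaded bundle into one beacon URL per event."""
    return [BEACON_URL + "?" + urlencode(event) for event in bundle]


# Usage: the "upload" here simply replays the bundle through the backend logger.
requested = []
app_buffer = EventBuffer(upload=lambda bundle: requested.extend(beacon_urls(bundle)))
app_buffer.record("view_listing", listing_id=42)
app_buffer.record("search", query="wool scarf")
sent = app_buffer.flush()
```

In this sketch the app makes one upload per flush, while the backend still produces one beacon request per event, so the existing beacon-based pipeline sees native app events in the same shape as Web events.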

Side effect: We have a backend event logger that
is now used all over the place

CDN diversification project
Fall 2012

Migrated to our own beacon infrastructure

Data pipeline based on Apache, PHP, logrotate, and cron

We built our own Hadoop cluster: Etsydoop
Fall 2012

We hired the Scalding guy
Fall 2012

Hadoop Cascading Cascading.jruby Scalding

Uh oh, the Google Analytics JS hurts performance
Fall 2012

The event logger’s GA dependency precluded async loading, hurting performance

First idea: duplicate the _utma functionality in our own code

The trouble with backend events

Visit  Time   Logger    Event Type
1      12:01  frontend  home
1      12:03  backend   login
1      12:03  frontend  view listing
1      1:31   backend   logout
2      1:31   frontend  view listing
2      1:32   frontend  search
2      1:33   frontend  view listing

(Slide annotation: wrong visit)

Complete rewrite of our ETL jobs
Spring/Summer 2013

Backend page-view events
Fall 2013

2014: the next phase

EventPipe goals

Use POST rather than multiple GET requests to prevent data
loss

Use JSON rather than query strings for comprehensibility

Validate beacon data before it enters the data pipeline

Use a binary serialization format for long-term storage

Use Kafka for data transfer to escape the batch paradigm

Eliminate individual beacon servers as points of failure
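
The first three goals (one POST instead of many GETs, JSON instead of query strings, validation before data enters the pipeline) could look something like the sketch below. It is only an illustration under stated assumptions: the talk does not list the real beacon fields, so `REQUIRED_FIELDS`, `validate_beacon`, and `handle_post` are hypothetical names with a made-up schema.

```python
import json

# Hypothetical schema: the talk does not list the real beacon fields.
REQUIRED_FIELDS = {"event_type": str, "timestamp": int, "visit_id": str}


def validate_beacon(beacon):
    """Return a list of problems; an empty list means the beacon is valid."""
    if not isinstance(beacon, dict):
        return ["beacon is not a JSON object"]
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in beacon:
            problems.append(f"missing field: {name}")
        elif not isinstance(beacon[name], expected):
            problems.append(f"wrong type for field: {name}")
    return problems


def handle_post(body):
    """Split one POST body (a JSON array of beacons) into accepted and rejected."""
    try:
        batch = json.loads(body)
    except json.JSONDecodeError:
        return [], [(body, ["body is not valid JSON"])]
    if not isinstance(batch, list):
        return [], [(batch, ["body is not a JSON array"])]
    accepted, rejected = [], []
    for beacon in batch:
        problems = validate_beacon(beacon)
        if problems:
            rejected.append((beacon, problems))
        else:
            accepted.append(beacon)
    return accepted, rejected


# Usage: one request carries two beacons; the second fails validation.
body = json.dumps([
    {"event_type": "search", "timestamp": 1400000000, "visit_id": "abc"},
    {"event_type": "view_listing", "timestamp": "yesterday"},
])
accepted, rejected = handle_post(body)
```

Only the accepted beacons would be passed further down the pipeline; the rejected ones are kept together with their problems so bad data can be inspected instead of silently dropped.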

How do we handle the impedance mismatch between Apache/PHP and
Kafka?

Wrote a server in Go to serialize beacons in Thrift
and send them to Kafka

Use Apache for SSL termination

Still to come

Real-ish time ETL

Streaming infrastructure

Offline processing for more products

Other Kafka applications

Takeaways

Every choice you make has long-term implications

Fixing stuff creates new opportunities

@rafeco http://rc3.org
