Slide 1

Slide 1 text

A Pragmatic Guide to Modernizing your Big Data Stack Andrew Johnson Etsy @theajsquared

Slide 2

Slide 2 text

Who am I? • Data Platform Engineer at Etsy • Manage data infrastructure - Hadoop, Vertica, Kafka, etc. • All in-house in our own DC - no cloud • Scalding is primary interface for Hadoop data

Slide 3

Slide 3 text

Big Data

Slide 4

Slide 4 text

If you started today...

Slide 5

Slide 5 text

Spoilt for Choice • Lots of technology to choose from • Core technology is maturing • Training, books, conferences ... • Plenty of vendors selling solutions and services

Slide 6

Slide 6 text

What if you already are invested in big data?

Slide 7

Slide 7 text

The Ghost of Big Data Past • Basically had just Hadoop + MapReduce • Cascading, Crunch, Hive, Pig, etc. just DSLs for MR • WordCount was king • Everything was rough around the edges • Lots of custom work to integrate everything together • This is where Etsy was!

Slide 8

Slide 8 text

Big Data at Etsy • 2010 acquisition of Adtuitive • cascading.jruby on EMR • In-house Hadoop cluster in 2012 • Avi Bryant (briefly) at Etsy, leaves us with some Scalding • Lots of users, directly and indirectly, of the data stack

Slide 9

Slide 9 text

How do you modernize without compromising quality?

Slide 10

Slide 10 text

Why Modernize? • Big Data ecosystem is fast-moving • Unable to adopt best practices • Forced to spend time + money building custom versions of existing technology

Slide 11

Slide 11 text

Assumptions • Critical production processes • Extended downtime unacceptable • Large scale • Platform/infrastructure team providing service to users • Building own stack instead of paying a vendor

Slide 12

Slide 12 text

Three Aspects of Modernization 1. Removing old technology 2. Upgrading systems you want to keep 3. Adding new technology

Slide 13

Slide 13 text

Removing Old Technology

Slide 14

Slide 14 text

cascading.jruby • JRuby DSL for Cascading • Utility code still had to be Java/Scala • Required ancient versions of everything • Unmaintained and unused outside of Etsy • Everyone hated it

Slide 15

Slide 15 text

Enter Scalding • Scala DSL for Cascading • Popular outside of Etsy and maintained by Twitter • Reuse our existing utility code

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Have Clear Messaging • Explain why it has to be removed • What can’t we do today that removing this will let us do? • Focus on the user’s experience, not the infrastructure • Easier to get users on board with helping with the removal

Slide 20

Slide 20 text

Removing Things is Hard • Unimportant jobs go first • Long tail of critical processes • Users value their products continuing to work more than removing old tech

Slide 21

Slide 21 text

Budget Time for the Long Term • In early 2014 we said we’d be done by the end of that summer • Not a full-time job, but always things to be doing

Slide 22

Slide 22 text

It’s Never Really Gone • Decisions made to support removed tech will shape the codebase for a long time • Tackle with other technical debt

Slide 23

Slide 23 text

Upgrading

Slide 24

Slide 24 text

Upgrades at Etsy • Big data stack unchanged since late 2012/early 2013 • Cascading 2.2 • Scalding 0.8.5 • Scala 2.9 • Hadoop 1 • Java 6

Slide 25

Slide 25 text

Why Upgrade? • Unable to add new things • Online documentation no longer available • Few or no bugfixes or improvements

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

Upgrades at Etsy • Upgraded everything over a few months • Cascading 2.2 -> 2.6 • Scalding 0.8.5 -> 0.12.0 • Scala 2.9 -> 2.10 • Hadoop 1 -> Hadoop 2 (YARN) • Java 6 -> 7

Slide 28

Slide 28 text

Know Why You are Upgrading • What can’t we do today that upgrading this will let us do? • Unblocking another upgrade is totally valid! • Important to have a clear value proposition • Helps with concerns if things do break

Slide 29

Slide 29 text

Upgrade Dependencies are Complicated • Required versions may force a particular upgrade order • Having old technology in the mix makes it more complicated
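
One way to make a forced upgrade order explicit is to write the version constraints down as a graph and sort it topologically. The sketch below is illustrative only: the component names come from the slides, but the prerequisite edges are assumptions invented for the example, not the constraints Etsy actually hit, and `upgrade_order` is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

# ASSUMPTION: these edges are made up for illustration. Each key is an
# upgrade; its value is the set of upgrades that must land before it.
PREREQUISITES = {
    "Java 7": set(),
    "Hadoop 2": {"Java 7"},
    "Scala 2.10": {"Java 7"},
    "Cascading 2.6": {"Hadoop 2"},
    "Scalding 0.12.0": {"Scala 2.10", "Cascading 2.6"},
}


def upgrade_order(prerequisites):
    """Return one valid upgrade order, or raise ValueError on a cycle."""
    try:
        return list(TopologicalSorter(prerequisites).static_order())
    except CycleError as err:
        # A cycle means two upgrades each require the other first,
        # so they have to be done together or the constraint revisited.
        raise ValueError(f"circular upgrade constraint: {err.args[1]}") from err


if __name__ == "__main__":
    for step, name in enumerate(upgrade_order(PREREQUISITES), start=1):
        print(step, name)
```

Several orders can satisfy the same constraints, so the output is one valid plan rather than the only one. Adding a node for an old technology that pins an old version shows quickly how it blocks everything downstream of it.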

Slide 30

Slide 30 text

Computers are Terrible, Everything Will Break • The longer between upgrades the more things depend on the accidents of current circumstances • Prepare and test a rollback plan • Invest in automated test suites for upgrades • Always learn from failure incidents • https://codeascraft.com/2012/05/22/blameless-postmortems/
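
An automated upgrade test can be as simple as running the same job on the old and new stack and comparing the results. The following is a minimal sketch under stated assumptions: the outputs are small enough (or sampled) to fit in memory, each row is a hashable tuple, and row order does not matter. `diff_outputs` is a hypothetical helper, not part of any tool named in this talk.

```python
from collections import Counter


def diff_outputs(old_rows, new_rows):
    """Compare two job outputs as multisets of rows.

    Returns (missing, unexpected): rows the old stack produced that the
    new one did not, and rows only the new stack produced. Both are
    Counters, so duplicated rows are tracked by count.
    """
    old_counts = Counter(old_rows)
    new_counts = Counter(new_rows)
    missing = old_counts - new_counts      # present before, gone after
    unexpected = new_counts - old_counts   # absent before, present after
    return missing, unexpected


if __name__ == "__main__":
    # Hypothetical sample rows: (listing_id, view_count)
    before = [("a", 1), ("b", 2), ("b", 2)]
    after = [("b", 2), ("a", 1), ("b", 3)]
    missing, unexpected = diff_outputs(before, after)
    print("missing:", dict(missing))
    print("unexpected:", dict(unexpected))
```

An empty result on both sides is the signal that the job is safe to move; anything else is a concrete starting point for debugging before the upgrade reaches production.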

Slide 31

Slide 31 text

Remember Your ABUs: Always Be Upgrading • The more often you upgrade, the better the process becomes • More frequent upgrades are less disruptive • Keep the value proposition in mind, but don’t get on a multi-year “upgrade everything at once” cycle

Slide 32

Slide 32 text

Adding New Technology

Slide 33

Slide 33 text

New Tech at Etsy • This was the goal of all the previous modernization work, but is an ongoing process • Kafka • Parquet instead of Cascading Tuples • SQL on Hadoop • Streaming analytics • Spark • Maybe more...

Slide 34

Slide 34 text

Choosing Where to Add New Tech • What problem does it solve? • How does it fit into the existing stack? • What adds the most value?

Slide 35

Slide 35 text

What Not to Do • “My friend on Twitter said Impala was pretty cool” • “I saw a blog post that said Spark will solve all our problems” • “Let’s just install Presto in production so we can try it out on real data”

Slide 36

Slide 36 text

What Not to Do • Blogs, Twitter, etc. can start the conversation about something new • Use case and scale can be very different • May not discuss issues or false starts

Slide 37

Slide 37 text

Every Choice is for the Long Term • Installing something in production is a commitment • Once something is available, people will build critical functionality on top of it • Back to the long process of removing technology if it wasn’t a good choice

Slide 38

Slide 38 text

Knowing is Half the Battle • Talk to others who have it in production • Try to break it • Have a standard method for evaluations • Check pain points from current stack or other new tools

Slide 39

Slide 39 text

Build a Proof of Concept • Choose a real problem • If possible, compare two different approaches • This affects how you state problems! • Make clear the gap between “successful PoC” and “in production” • Throwing it away is okay

Slide 40

Slide 40 text

The Road to Production • Hold an architecture review and operability review • Ramp up slowly • Infrastructure teams should be heavily involved at the start and focus on reducing the need for their involvement with each project • Have a rollback/rollforward plan • Lots of new bugs and issues will only crop up at scale in production
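
"Ramp up slowly" can be sketched as a deterministic percentage gate: hash each job name into a fixed bucket and send only jobs below the current percentage to the new system. This is an assumed approach for illustration, not a description of how Etsy ramped anything, and `in_rollout` is a hypothetical helper.

```python
import hashlib


def in_rollout(job_name: str, percent: int) -> bool:
    """Return True if this job should use the new system.

    The bucket depends only on the job name, so a job that is included
    at 10% stays included at 25%, 50%, and so on. Setting percent to 0
    routes everything back to the old system, which doubles as a
    rollback switch.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    return bucket < percent


if __name__ == "__main__":
    # Hypothetical job names
    jobs = ["daily_visits", "search_ranking_features", "shop_stats"]
    for pct in (0, 10, 50, 100):
        print(pct, [job for job in jobs if in_rollout(job, pct)])
```

Because the assignment is stable, any bug that shows up at a given percentage can be traced to a fixed set of jobs rather than a random sample that changes on every run.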

Slide 41

Slide 41 text

How Much is Too Much? • Limited capacity for understanding new things • Lots of simultaneous changes make it hard to track down errors • Ensure stability before moving on to adding something else

Slide 42

Slide 42 text

The Big Picture

Slide 43

Slide 43 text

Don’t Forget Social Aspects • The process of modernizing is a social problem before it is a technical one • Need buy-in from users for success

Slide 44

Slide 44 text

Know Thyself • Have clear goals defined ahead of time • Focus on what can bring the most value • What does this allow us to do that we couldn’t do before? • Always opportunities for learning

Slide 45

Slide 45 text

A Process, Not an Event • Changes in the big data ecosystem are still coming fast • This could change in the future as the ecosystem matures • Keeping up lets you move quickly • Continue evaluating current state versus new developments

Slide 46

Slide 46 text

Not Just Big Data • Big data is where I went through this all at Etsy • Newer and faster moving than some other areas of technology • Can apply these ideas outside of big data

Slide 47

Slide 47 text

Questions? @theajsquared andrewjamesjohnson.com