A Pragmatic Guide to Modernizing Your Big Data Stack

From Codemash 2016:
The big data ecosystem changes at a rapid pace. It can seem like almost every week there’s a new framework or technology that will solve all your data problems. If you’re starting from scratch, this provides a lot of flexibility in choosing tools that will meet your needs. But what if you already have a mature big data stack in production? How do you keep up to date and introduce new tools while not breaking what you already have? This talk will cover Etsy’s journey towards a modern big data stack. We will discuss the choices we made, the reasoning behind those choices, and pitfalls to watch out for. You will leave this talk with new ideas about how to evaluate and select new technologies for your own big data stack.


Andrew Johnson

January 08, 2016


  1. A Pragmatic Guide to Modernizing your Big Data Stack • Andrew Johnson • Etsy • @theajsquared
  2. Who am I? • Data Platform Engineer at Etsy • Manage data infrastructure - Hadoop, Vertica, Kafka, etc. • All in-house in our own DC - no cloud • Scalding is primary interface for Hadoop data
  3. Big Data

  4. If you started today...

  5. Spoilt for Choice • Lots of technology to choose from • Core technology is maturing • Training, books, conferences ... • Plenty of vendors selling solutions and services
  6. What if you already are invested in big data?

  7. The Ghost of Big Data Past • Basically had just Hadoop + MapReduce • Cascading, Crunch, Hive, Pig, etc. just DSLs for MR • WordCount was king • Everything was rough around the edges • Lots of custom work to integrate everything together • This is where Etsy was!
  8. Big Data at Etsy • 2010 acquisition of Adtuitive • cascading.jruby on EMR • In-house Hadoop cluster in 2012 • Avi Bryant (briefly) at Etsy, leaves us with some Scalding • Lots of users, directly and indirectly, of the data stack
  9. How do you modernize without compromising quality?

  10. Why Modernize? • Big Data ecosystem is fast-moving • Unable to adopt best practices • Forced to spend time + money building custom versions of existing technology
  11. Assumptions • Critical production processes • Extended downtime unacceptable • Large scale • Platform/infrastructure team providing service to users • Building own stack instead of paying a vendor
  12. Three Aspects of Modernization 1. Removing old technology 2. Upgrading systems you want to keep 3. Adding new technology
  13. Removing Old Technology

  14. cascading.jruby • JRuby DSL for Cascading • Utility code still had to be Java/Scala • Required ancient versions of everything • Unmaintained and unused outside of Etsy • Everyone hated it
  15. Enter Scalding • Scala DSL for Cascading • Popular outside of Etsy and maintained by Twitter • Reuse our existing utility code
  19. Have Clear Messaging • Explain why it has to be removed • What can’t we do today that removing this will let us do? • Focus on the user’s experience, not the infrastructure • Easier to get users on board with helping with the removal
  20. Removing Things is Hard • Unimportant jobs go first • Long tail of critical processes • Users value their products continuing to work more than removing old tech
  21. Budget Time for the Long Term • In early 2014 we said we’d be done by the end of that summer • Not a full-time job, but always things to be doing
  22. It’s Never Really Gone • Decisions made to support removed tech will shape the codebase for a long time • Tackle with other technical debt
  23. Upgrading

  24. Upgrades at Etsy • Big data stack unchanged since late 2012/early 2013 • Cascading 2.2 • Scalding 0.8.5 • Scala 2.9 • Hadoop 1 • Java 6
  25. Why Upgrade? • Unable to add new things • Online documentation no longer available • Few or no bugfixes or improvements
  27. Upgrades at Etsy • Upgraded everything over a few months • Cascading 2.2 -> 2.6 • Scalding 0.8.5 -> 0.12.0 • Scala 2.9 -> 2.10 • Hadoop 1 -> Hadoop 2 (YARN) • Java 6 -> 7
  28. Know Why You are Upgrading • What can’t we do today that upgrading this will let us do? • Unblocking another upgrade is totally valid! • Important to have a clear value proposition • Helps with concerns if things do break
  29. Upgrade Dependencies are Complicated • Required versions may force a particular upgrade order • Having old technology in the mix makes it more complicated
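One way to reason about a forced upgrade order is to treat it as a dependency graph and sort it topologically. The sketch below is only an illustration: the upgrade names come from the slides, but the prerequisite edges in `PREREQUISITES` are hypothetical and are not Etsy's actual constraints. It relies on `graphlib` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical prerequisite map: each upgrade lists the upgrades that must
# land before it. These edges are illustrative assumptions only.
PREREQUISITES = {
    "Java 6 -> 7": set(),
    "Scala 2.9 -> 2.10": {"Java 6 -> 7"},
    "Hadoop 1 -> Hadoop 2 (YARN)": {"Java 6 -> 7"},
    "Cascading 2.2 -> 2.6": {"Hadoop 1 -> Hadoop 2 (YARN)"},
    "Scalding 0.8.5 -> 0.12.0": {"Scala 2.9 -> 2.10", "Cascading 2.2 -> 2.6"},
}


def upgrade_order(prerequisites):
    """Return one valid order in which every prerequisite comes first.

    Raises graphlib.CycleError (a ValueError) if the constraints are circular.
    """
    return list(TopologicalSorter(prerequisites).static_order())


ORDER = upgrade_order(PREREQUISITES)

if __name__ == "__main__":
    for step, name in enumerate(ORDER, start=1):
        print(step, name)
```

Several orders can satisfy the same constraints, so the output is one valid plan rather than the only one. A cycle error is itself useful information: it means two upgrades have to ship together.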
  30. Computers are Terrible, Everything Will Break • The longer between upgrades the more things depend on the accidents of current circumstances • Prepare and test a rollback plan • Invest in automated test suites for upgrades • Always learn from failure incidents • https://codeascraft.com/2012/05/22/blameless-postmortems/
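As a rough idea of what an automated upgrade test could look like, the sketch below compares the output of the same job run on the current stack and on the upgraded stack. It is an assumption about one possible check, not a description of Etsy's test suite; the function names are hypothetical, and rows are assumed to be hashable values such as tuples.

```python
from collections import Counter


def diff_outputs(baseline_rows, candidate_rows):
    """Compare two job outputs as multisets, ignoring row order.

    Returns (missing, unexpected): rows the upgraded stack dropped, and rows
    it produced that the baseline did not, each with a count.
    """
    baseline = Counter(baseline_rows)
    candidate = Counter(candidate_rows)
    missing = baseline - candidate      # Counter subtraction keeps positive counts only
    unexpected = candidate - baseline
    return missing, unexpected


def outputs_match(baseline_rows, candidate_rows):
    """True when both runs produced exactly the same rows, in any order."""
    missing, unexpected = diff_outputs(baseline_rows, candidate_rows)
    return not missing and not unexpected


if __name__ == "__main__":
    old_run = [("shop_1", 10), ("shop_2", 7)]
    new_run = [("shop_2", 7), ("shop_1", 10)]
    print(outputs_match(old_run, new_run))
```

Ignoring row order matters because a distributed job rarely guarantees it, and an upgrade can change it without changing the result.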
  31. Remember Your ABUs: Always Be Upgrading • The more often you upgrade, the better the process becomes • More frequent upgrades are less disruptive • Keep the value proposition in mind, but don’t get on a multi-year “upgrade everything at once” cycle
  32. Adding New Technology

  33. New Tech at Etsy • This was the goal of all the previous modernization work, but is an ongoing process • Kafka • Parquet instead of Cascading Tuples • SQL on Hadoop • Streaming analytics • Spark • Maybe more...
  34. Choosing Where to Add New Tech • What problem does it solve? • How does it fit into the existing stack? • What adds the most value?
  35. What Not to Do • “My friend on Twitter said Impala was pretty cool” • “I saw a blog post that said Spark will solve all our problems” • “Let’s just install Presto in production so we can try it out on real data”
  36. What Not to Do • Blogs, Twitter, etc. can start the conversation about something new • Use case and scale can be very different • May not discuss issues or false starts
  37. Every Choice is for the Long Term • Installing something in production is a commitment • Once something is available, people will build critical functionality on top of it • Back to the long process of removing technology if it wasn’t a good choice
  38. Knowing is Half the Battle • Talk to others who have it in production • Try to break it • Have a standard method for evaluations • Check pain points from current stack or other new tools
  39. Build a Proof of Concept • Choose a real problem • If possible, compare two different approaches • This affects how you state problems! • Make clear the gap between “successful PoC” and “in production” • Throwing it away is okay
  40. The Road to Production • Hold an architecture review and operability review • Ramp up slowly • Infrastructure teams should be heavily involved at the start and focus on reducing the need for their involvement with each project • Have a rollback/rollforward plan • Lots of new bugs and issues will only crop up at scale in production
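A slow ramp can be made deterministic by hashing each job name into a fixed bucket and raising a percentage threshold over time. The sketch below is a minimal illustration under that assumption; `in_rollout` and the job names are hypothetical and do not describe how Etsy ramped up new systems.

```python
import hashlib


def in_rollout(job_name, percent):
    """Decide whether a job is inside the first `percent` of the ramp.

    The job name is hashed into a stable bucket in 0..99, so a job that is
    included at 10% stays included at every higher percentage.
    """
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    jobs = ["job_%d" % i for i in range(20)]  # hypothetical job names
    for percent in (0, 10, 50, 100):
        selected = [job for job in jobs if in_rollout(job, percent)]
        print(percent, len(selected))
```

Because the bucket depends only on the job name, setting the threshold back to 0 moves every job to the old system again, which fits the rollback/rollforward plan the slide asks for.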
  41. How Much is Too Much? • Limited capacity for understanding new things • Lots of simultaneous changes make it hard to track down errors • Ensure stability before moving on to adding something else
  42. The Big Picture

  43. Don’t Forget Social Aspects • The process of modernizing is a social problem before it is a technical one • Need buy-in from users for success
  44. Know Thyself • Have clear goals defined ahead of time • Focus on what can bring the most value • What does this allow us to do that we couldn’t do before? • Always opportunities for learning
  45. A Process, Not an Event • Changes in the big data ecosystem are still coming fast • This could change in the future as the ecosystem matures • Keeping up lets you move quickly • Continue evaluating current state versus new developments
  46. Not Just Big Data • Big data is where I went through this all at Etsy • Newer and faster moving than some other areas of technology • Can apply these ideas outside of big data
  47. Questions? @theajsquared andrewjamesjohnson.com