A Pragmatic Guide to Modernizing Your Big Data Stack

From Codemash 2016:
The big data ecosystem changes at a rapid pace. It can seem like almost every week there’s a new framework or technology that will solve all your data problems. If you’re starting from scratch, this provides a lot of flexibility in choosing tools that will meet your needs, but what if you already have a mature big data stack in production? How do you keep up to date and introduce new tools without breaking what you already have? This talk will cover Etsy’s journey towards a modern big data stack. We will discuss the choices we made, the reasoning behind those choices, and pitfalls to watch out for. You will leave this talk with new ideas about how to evaluate and select new technologies for your own big data stack.

Andrew Johnson

January 08, 2016

Transcript

  1. Who am I? • Data Platform Engineer at Etsy • Manage data infrastructure - Hadoop, Vertica, Kafka, etc. • All in-house in our own DC - no cloud • Scalding is primary interface for Hadoop data
  2. Spoilt for Choice • Lots of technology to choose from • Core technology is maturing • Training, books, conferences ... • Plenty of vendors selling solutions and services
  3. The Ghost of Big Data Past • Basically had just Hadoop + MapReduce • Cascading, Crunch, Hive, Pig, etc. just DSLs for MR • WordCount was king • Everything was rough around the edges • Lots of custom work to integrate everything together • This is where Etsy was!
  4. Big Data at Etsy • 2010 acquisition of Adtuitive • cascading.jruby on EMR • In-house Hadoop cluster in 2012 • Avi Bryant (briefly) at Etsy, leaves us with some Scalding • Lots of users, directly and indirectly, of the data stack
  5. Why Modernize? • Big Data ecosystem is fast-moving • Unable to adopt best practices • Forced to spend time + money building custom versions of existing technology
  6. Assumptions • Critical production processes • Extended downtime unacceptable • Large scale • Platform/infrastructure team providing service to users • Building own stack instead of paying a vendor
  7. Three Aspects of Modernization 1. Removing old technology 2. Upgrading systems you want to keep 3. Adding new technology
  8. cascading.jruby • JRuby DSL for Cascading • Utility code still had to be Java/Scala • Required ancient versions of everything • Unmaintained and unused outside of Etsy • Everyone hated it
  9. Enter Scalding • Scala DSL for Cascading • Popular outside of Etsy and maintained by Twitter • Reuse our existing utility code
  10. Have Clear Messaging • Explain why it has to be removed • What can’t we do today that removing this will let us do? • Focus on the user’s experience, not the infrastructure • Easier to get users on board with helping with the removal
  11. Removing Things is Hard • Unimportant jobs go first • Long tail of critical processes • Users value their products continuing to work more than removing old tech
  12. Budget Time for the Long Term • In early 2014 we said we’d be done by the end of that summer • Not a full-time job, but always things to be doing
  13. It’s Never Really Gone • Decisions made to support removed tech will shape the codebase for a long time • Tackle with other technical debt
  14. Upgrades at Etsy • Big data stack unchanged since late 2012/early 2013 • Cascading 2.2 • Scalding 0.8.5 • Scala 2.9 • Hadoop 1 • Java 6
  15. Why Upgrade? • Unable to add new things • Online documentation no longer available • Few or no bugfixes or improvements
  16. Upgrades at Etsy • Upgraded everything over a few months • Cascading 2.2 -> 2.6 • Scalding 0.8.5 -> 0.12.0 • Scala 2.9 -> 2.10 • Hadoop 1 -> Hadoop 2 (YARN) • Java 6 -> 7
  17. Know Why You are Upgrading • What can’t we do today that upgrading this will let us do? • Unblocking another upgrade is totally valid! • Important to have a clear value proposition • Helps with concerns if things do break
  18. Upgrade Dependencies are Complicated • Required versions may force a particular upgrade order • Having old technology in the mix makes it more complicated
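
The ordering constraint on this slide can be sketched as a topological sort over "must be upgraded first" relationships. The edges below are hypothetical and chosen only for illustration; the talk does not say which upgrade actually blocked which. The sketch assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical constraints: each key can only be upgraded after every
# component in its set. These edges are illustrative assumptions, not
# the actual constraints Etsy faced.
blocked_by = {
    "Hadoop 2": {"Java 7"},
    "Cascading 2.6": {"Hadoop 2"},
    "Scalding 0.12.0": {"Cascading 2.6", "Scala 2.10"},
}

def upgrade_order(constraints):
    """Return one valid upgrade order (prerequisites first).

    Raises graphlib.CycleError if the constraints contradict each other.
    """
    return list(TopologicalSorter(constraints).static_order())

order = upgrade_order(blocked_by)
print(" -> ".join(order))
```

If two upgrades each require the other, `static_order()` raises `CycleError`, which is a useful early signal that some component has to be upgraded in lockstep or replaced outright.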
  19. Computers are Terrible, Everything Will Break • The longer between upgrades the more things depend on the accidents of current circumstances • Prepare and test a rollback plan • Invest in automated test suites for upgrades • Always learn from failure incidents • https://codeascraft.com/2012/05/22/blameless-postmortems/
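
One possible shape for an automated upgrade check, offered as a sketch and not as the approach described in the talk: run the same job on the old and new stack and compare the outputs as multisets, since distributed jobs generally do not guarantee record order. The function name and sample records are hypothetical.

```python
from collections import Counter

def diff_outputs(old_records, new_records):
    """Compare two job outputs as multisets, ignoring record order.

    Returns (missing, unexpected): records that appear only in the old
    output and records that appear only in the new output, with counts.
    """
    old_counts = Counter(old_records)
    new_counts = Counter(new_records)
    missing = old_counts - new_counts      # in old run, absent from new run
    unexpected = new_counts - old_counts   # in new run, absent from old run
    return dict(missing), dict(unexpected)

# Hypothetical rows standing in for the output of the two runs.
old_run = [("shop_1", 3), ("shop_2", 5), ("shop_3", 1)]
new_run = [("shop_2", 5), ("shop_1", 3), ("shop_3", 2)]

missing, unexpected = diff_outputs(old_run, new_run)
print("missing:", missing)
print("unexpected:", unexpected)
```

Two empty dictionaries mean the upgraded stack reproduced the old output exactly; anything else is a concrete list of records to investigate before cutting over.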
  20. Remember Your ABUs: Always Be Upgrading • The more often you upgrade, the better the process becomes • More frequent upgrades are less disruptive • Keep the value proposition in mind, but don’t get on a multi-year “upgrade everything at once” cycle
  21. New Tech at Etsy • This was the goal of all the previous modernization work, but is an ongoing process • Kafka • Parquet instead of Cascading Tuples • SQL on Hadoop • Streaming analytics • Spark • Maybe more...
  22. Choosing Where to Add New Tech • What problem does it solve? • How does it fit into the existing stack? • What adds the most value?
  23. What Not to Do • “My friend on Twitter said Impala was pretty cool” • “I saw a blog post that said Spark will solve all our problems” • “Let’s just install Presto in production so we can try it out on real data”
  24. What Not to Do • Blogs, Twitter, etc. can start the conversation about something new • Use case and scale can be very different • May not discuss issues or false starts
  25. Every Choice is for the Long Term • Installing something in production is a commitment • Once something is available, people will build critical functionality on top of it • Back to the long process of removing technology if it wasn’t a good choice
  26. Knowing is Half the Battle • Talk to others who have it in production • Try to break it • Have a standard method for evaluations • Check pain points from current stack or other new tools
  27. Build a Proof of Concept • Choose a real problem • If possible, compare two different approaches • This affects how you state problems! • Make clear the gap between “successful PoC” and “in production” • Throwing it away is okay
  28. The Road to Production • Hold an architecture review and operability review • Ramp up slowly • Infrastructure teams should be heavily involved at the start and focus on reducing the need for their involvement with each project • Have a rollback/rollforward plan • Lots of new bugs and issues will only crop up at scale in production
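
"Ramp up slowly" could be implemented in many ways; the following is a minimal sketch of one of them, a deterministic percentage rollout, and is not taken from the talk. The function name and job names are hypothetical. Each job hashes to a stable bucket from 0 to 99, so raising the percentage only ever adds jobs to the new system, and setting it to 0 acts as the rollback.

```python
import hashlib

def in_rollout(job_name, percent):
    """Return True if job_name falls within the first `percent` of 100 buckets.

    The bucket is derived from a stable hash, so a job that is included
    at 10% is still included at 50%.
    """
    digest = hashlib.sha256(job_name.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Hypothetical job names, used only to exercise the function.
jobs = [f"job_{i}" for i in range(1000)]

for percent in (0, 10, 50, 100):
    selected = sum(in_rollout(job, percent) for job in jobs)
    print(f"{percent:3d}% ramp -> {selected} of {len(jobs)} jobs on the new system")
```

Hashing the job name instead of picking jobs at random keeps the assignment reproducible, which makes it easier to tell whether a failure came from the new system or from a change in which jobs were routed to it.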
  29. How Much is Too Much? • Limited capacity for understanding new things • Lots of simultaneous changes make it hard to track down errors • Ensure stability before moving on to adding something else
  30. Don’t Forget Social Aspects • The process of modernizing is a social problem before it is a technical one • Need buy-in from users for success
  31. Know Thyself • Have clear goals defined ahead of time • Focus on what can bring the most value • What does this allow us to do that we couldn’t do before? • Always opportunities for learning
  32. A Process, Not an Event • Changes in the big data ecosystem are still coming fast • This could change in the future as the ecosystem matures • Keeping up lets you move quickly • Continue evaluating current state versus new developments
  33. Not Just Big Data • Big data is where I went through this all at Etsy • Newer and faster moving than some other areas of technology • Can apply these ideas outside of big data