You Have a Data Lake, Now What?

Alison
September 13, 2017

You have a data lake and you’re worried about drowning in it. This talk will address solutions and process for using what data you’ve collected effectively with your team and the rest of the organization. Practical and hands-on lessons covering the glamorous and not-so-glamorous next steps.

You’ve collected a ton of data and it’s just sitting there. You want to use it, but where do you start? This talk will give you a map so you can navigate your unique situation by asking and answering questions such as: What kind of data do you have and why does it matter? What things will come back to bite you if you don’t consider them up-front? What does collaboration that rocks look like? What problems will you run into and what strategies are useful for troubleshooting them? How do you choose what to do first? Why is interdisciplinarity important? Once it works, how do you automate it?

Transcript

  1. You Have a Data Lake - Now What? Alison Stanton, Chief Problem Solver, Stanton Ventures @stantonventures

  2. Why listen to me?

  3. To begin, REALITY CHECK

  4. You Don't Have A Lake

  5. You have lakeSSSSSS

  6. What are the things that would make your data lakeS valuable?
    • Monetize it $$$
      ◦ Productize it by folding into your product
      ◦ Sell it to others
      ◦ Get it to managers and employees
    • Save Money $$$
      ◦ Get it to managers and employees to make decisions on it
    • Improve people's (quality of) lives
      ◦ Get it to managers and employees to make decisions on it

  7. What has to happen for it to be valuable?
    • Accessible
    • Accurate
    • On-Time
    • Understandable
    • Segmentable
      ◦ Random sample
      ◦ All the water that has a specific fish
      ◦ All the water that has a swordfish in it
      ◦ All the water in the lake
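
The "Segmentable" bullets on slide 7 can be sketched in a few lines of Python. This is only an illustration: the `ROWS` data and the helper names `random_sample`, `with_species`, and `everything` are hypothetical and do not come from the talk.

```python
import random

# Hypothetical rows standing in for records in a data lake.
ROWS = [
    {"id": 1, "species": "trout"},
    {"id": 2, "species": "swordfish"},
    {"id": 3, "species": "trout"},
    {"id": 4, "species": "perch"},
    {"id": 5, "species": "swordfish"},
]


def random_sample(rows, k, seed=0):
    """Return k rows chosen at random (seeded so the sample is repeatable)."""
    return random.Random(seed).sample(rows, k)


def with_species(rows, species):
    """Return every row that matches one specific attribute value."""
    return [row for row in rows if row["species"] == species]


def everything(rows):
    """Return the whole data set as a new list."""
    return list(rows)


sample = random_sample(ROWS, 2)              # "Random sample"
swordfish = with_species(ROWS, "swordfish")  # "All the water that has a swordfish in it"
whole_lake = everything(ROWS)                # "All the water in the lake"
```

If any of these three slices is hard to produce from your own data, that is a hint the data is not yet segmentable in the sense of the slide.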

  8. What kind of data do you have and why does it matter?
    Who Has It
    • Internal
    • External
    • Doesn't exist
    Format
    • Formatted and (relatively) easy
      ◦ CSV, SQL, xls, xlsx
    • NoSQL
    • PDF
    • Web (scraping)
    • Pieces of paper

  9. Where do I start? (Philosophically)

  10. Philosophically
    • Do ELT not ETL.
    • You can't keep your data in the schema it is currently in.
    • Start with the easiest, fastest use case
    • You MUST have written definitions of what terms mean.
    How do I choose what to do first?

  11. Do ELT not ETL
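
One way to picture "Do ELT not ETL" is the minimal sketch below, which uses Python's built-in `sqlite3` as a stand-in warehouse. The CSV content, the table names `raw_orders` and `orders`, and the cleaning rules are all assumptions made for illustration; the slides do not name a tool or a schema.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; its inconsistent formatting is left untouched.
RAW_CSV = "order_id,amount,country\n1, 19.99 ,us\n2,5.00,US\n3,,ca\n"

conn = sqlite3.connect(":memory:")

# Extract + Load: copy the rows verbatim into a raw table, every column as text.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)")
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
conn.executemany(
    "INSERT INTO raw_orders VALUES (:order_id, :amount, :country)", rows
)

# Transform: done afterwards, inside the warehouse. It can be rerun or changed
# at any time because the raw rows are still there.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER)  AS order_id,
           CAST(TRIM(amount) AS REAL) AS amount,
           UPPER(TRIM(country))       AS country
    FROM raw_orders
    WHERE TRIM(amount) <> ''
    """
)

clean = conn.execute(
    "SELECT order_id, amount, country FROM orders ORDER BY order_id"
).fetchall()
```

The point of the ordering is that a mistake in the transform only means rewriting a query, not extracting from the source again.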

  12. You can't keep the schema you have

  13. Start with the easiest, fastest use case

  14. WRITTEN definitions of terms

  15. Where do I start? (Practically)

  16. Practically
    • Survey
    • Identify easiest, fastest use case
    • EL those data sources while SIMULTANEOUSLY working on written definitions
    • Do the Ts, Build the report, Launch it
    • Collect the missing data
    • Do 10-20 more times
    • Iterate first 10-20 (Schema time!)
    • ALL the data sources
    • BI Tool
    • Work on highest impact reporting
    • Iterate on existing reporting & stay up to date on new product features/company initiatives

  17. What do you end up with?
    • Data warehouse
      ◦ Incoming schemas
      ◦ Individual developer schemas
      ◦ Master Data warehouse schema
    • ELT infrastructure
    • BI infrastructure
    • Reports used by the business that are accurate, timely, and useful for decision making
    • Documentation
      ◦ Data Dictionary
        ▪ What goes in a data dictionary?
      ◦ ELT environment and script documentation
      ◦ Conventions

  18. Survey of what you have and where you're at

  19. None

  20. Identify the easiest, fastest use case

  21. Extract & load those data sources while simultaneously working on written definitions

  22. Do the transforms, build the report, launch it

  23. Start collecting the data you need but aren't collecting yet

  24. Do that cycle 10-20 more times

  25. Iterate on the first 10-20 cases
    Why is interdisciplinarity important?

  26. If you don't have all the data sources in one place, consider adding them all

  27. If you don't have a BI tool, consider getting it now

  28. Work on highest impact reporting - company-wide reporting including financials

  29. Iterate on existing reporting & stay up to date on new product features / company initiatives

  30. What will come back to bite me?
    • Reconciliation
    • Commissions
    • Data Architecture of your Data Warehouse
      ◦ Schema design / Object selection
      ◦ Relationships
      ◦ Levels
      ◦ Binary
    • Tech stack

  31. What problems will you run into?
    • Data Problems
    • Tech Problems
    • People Problems

  32. What strategies are useful for troubleshooting problems?
    • Iteration
    • Data test suite
    • Communication
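
The "Data test suite" bullet on slide 32 is not spelled out in the transcript. A minimal sketch of what such a suite might check is shown here; the `ORDERS` rows and the helpers `check_unique`, `check_not_null`, and `run_checks` are hypothetical and not taken from the talk.

```python
# Hypothetical warehouse rows; in practice these would come from a query.
ORDERS = [
    {"order_id": 1, "amount": 19.99, "country": "US"},
    {"order_id": 2, "amount": 5.00, "country": "US"},
    {"order_id": 2, "amount": 5.00, "country": "US"},  # duplicate key
    {"order_id": 3, "amount": None, "country": "CA"},  # missing amount
]


def check_unique(rows, key):
    """Report every value of `key` that appears more than once."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[key]
        if value in seen:
            dupes.add(value)
        else:
            seen.add(value)
    return [f"duplicate {key}: {value}" for value in sorted(dupes)]


def check_not_null(rows, column):
    """Report the position of every row where `column` is missing."""
    return [
        f"missing {column} in row {index}"
        for index, row in enumerate(rows)
        if row[column] is None
    ]


def run_checks(rows):
    """Run all checks and collect failures instead of stopping at the first."""
    failures = []
    failures += check_unique(rows, "order_id")
    failures += check_not_null(rows, "amount")
    return failures


failures = run_checks(ORDERS)
```

Collecting every failure in one pass, rather than raising on the first, gives a fuller picture to bring to the people who own the source data.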

  33. How do you automate it?

  34. What does collaboration that rocks look like?

  35. Advanced Lessons
    • Data layer

  36. Questions?

  37. Thanks! Images from:
    • http://www.transcendentjourney.com/tag/reality-check/
    • https://www.mnn.com/earth-matters/wilderness-resources/blogs/14-of-the-most-striking-crater-lakes-on-earth
    • http://www.lovethesepics.com/2011/10/kaleidoscope-of-autumn-colors-is-heaven-on-earth-46-pics/

  38. ◦ Timing ◦ Supporters, Power Users, and Requestors ◦ Infrastructure importance