Powering a Startup with Apache Spark

VCNC
October 26, 2017

Apache Summit 2017 presentation slides

Transcript

  1. Kevin, Between (VCNC)
    [email protected]
    Powering a Startup with
    Apache Spark
    #EUent8

  2. Seoul, South Korea

  3. Gangnam, Hongdae, Itaewon, Myungdong

  4. (image slide)

  5. Beta users
    Release, …M downloads
    …M downloads, global launches
    Between, …M downloads
    Between
    Starts monetization, …M downloads
    Global expansion, new business, team of …

  6. Kevin Kim
    • Came from Seoul, South Korea

    • Co-founder, used to be a product
    developer

    • Now a data analyst, engineer, team
    leader

    • Founder of Korea Spark User Group

    • Committer and PMC member of
    Apache Zeppelin

  7. Between Data Team

  8. Intro to Between Data Team
    • Data engineers × 4

    – Manager: an engineer with a broad stack of knowledge and experience

    – Junior engineer, formerly a server engineer

    – Senior engineer with extensive experience and skills

    – Data engineer, formerly a top-level Android developer

    • Hiring a data analyst and a machine learning expert

  9. Between Data Team is doing
    • Analysis

    – Service monitoring

    – Analyzing usage of new features and building product strategies

    • Data Infrastructure

    – Build and manage infrastructure

    – Spark, Zeppelin, AWS, BI Tools, etc

    • Third Party Management

    – Mobile Attribution Tools for marketing (Kochava, Tune, Appsflyer, etc)

    – Google Analytics, Firebase, etc

    – Ad Networks

  10. Between Data Team is doing
    • Machine Learning Study & Research

    – For the next business model

    • Support team

    – To build business, product, monetization strategies

    • Performance Marketing Analysis

    – Monitoring effectiveness of marketing budgets

    • Product Development

    – Improving client performance, server architecture, etc.

  11. (image slide)

  12. Sunset @ Between Office

  13. Technologies

  14. Requirements
    • Big Data
    – 2TB/day of log data from millions of DAU

    – 20M users

    • Small Team
    – A team of 4 needs to support 50 people

    • Tiny Budget
    – The company is just past BEP (break-even point)

    • We need a very efficient tech stack!

  15. The Way We Work
    • Use Apache Spark as a general processing engine

    • Scriptify everything with Apache Zeppelin

    • Heavy utilization of AWS and Spot instances to cut cost

    • Proper selection of BI Dashboard Tools

  16. Apache Spark, a General Engine
    • Definitely the best way to deal with big data (as you all know!)

    • Its performance and agility exactly meet startup requirements

    – We have used Spark since 2014

    • A great match with cloud services, especially Spot Instances

    – Utilizing the bursty nature of the cloud

  17. Scriptify Everything with Zeppelin
    • Doing everything on Zeppelin!

    • Daily batch tasks in the form of Spark scripts (using the
    Zeppelin scheduler)

    • Ad hoc analysis

    • Cluster control scripts

    • The world's first user of Zeppelin!

    • More than 200 Zeppelin notebooks
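
A concrete sketch of the daily-batch pattern above: a scheduled notebook paragraph typically starts by computing the previous day's log location before handing it to Spark. The bucket name and yyyy/MM/dd layout here are illustrative assumptions, not Between's actual setup.

```python
from datetime import date, timedelta

def daily_log_prefix(run_date: date, bucket: str = "between-logs") -> str:
    """Build the S3 prefix for the previous day's logs.

    The bucket name and yyyy/MM/dd layout are invented for illustration.
    """
    target = run_date - timedelta(days=1)
    return f"s3://{bucket}/{target:%Y/%m/%d}/"

# A daily batch notebook would then read this prefix with Spark,
# e.g. spark.read.json(daily_log_prefix(date.today())).
print(daily_log_prefix(date(2017, 10, 26)))  # s3://between-logs/2017/10/25/
```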

  18. AWS Cloud
    • Spot Instances are my friend!

    – Mostly use Spot Instances for analysis

    – Only 10-20% of the cost of On-Demand Instances

    • Dynamic cluster launch with Auto Scaling

    – Launch clusters automatically for batch analysis

    – Manually launch more clusters from Zeppelin, with an Auto Scaling script

    – Automatically shut down clusters when idle
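
A back-of-the-envelope sketch of the cost claim above. The hourly rates and cluster size are made-up illustrative numbers, not actual AWS pricing:

```python
# Compare the cost of a short-lived analysis cluster at on-demand
# vs. spot pricing (hypothetical rates).
def cluster_cost(nodes: int, hours: float, hourly_rate: float) -> float:
    return nodes * hours * hourly_rate

on_demand = cluster_cost(nodes=20, hours=6, hourly_rate=0.50)   # $60.00
spot      = cluster_cost(nodes=20, hours=6, hourly_rate=0.075)  # $9.00, i.e. 15%
print(f"on-demand: ${on_demand:.2f}, spot: ${spot:.2f}, ratio: {spot/on_demand:.0%}")
```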

  19. BI Dashboard Tools
    • Use Zeppelin as a dashboard via Spark SQL, with ZEPL

    • Holistics (holistics.io) or Dash (plot.ly/products/dash/)

  20. Questions &amp; Challenges

  21. RDD API or DataFrame API
    • Spark now has two very different styles of API

    – Programmatic RDD API

    – SQL-like DataFrame / Dataset API

    • For many simple ad hoc queries

    – DataFrame works

    • For more complex, deep-dive analytic questions

    – RDD works

    • For now we mostly use RDD, and DataFrame for ML or simple ad hoc tasks
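
The contrast between the two styles can be sketched without a Spark cluster: below, a plain-Python functional pipeline stands in for the RDD API, and SQL (via sqlite3) stands in for the DataFrame/Spark SQL style. The event data is invented.

```python
import sqlite3
from collections import Counter

events = [("alice", "msg"), ("bob", "photo"), ("alice", "msg"), ("alice", "login")]

# RDD-like style: an explicit, programmatic pipeline over records.
rdd_style = Counter(user for user, _ in events)

# DataFrame/SQL-like style: declare the aggregation, let the engine plan it.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user TEXT, action TEXT)")
db.executemany("INSERT INTO events VALUES (?, ?)", events)
sql_style = dict(db.execute("SELECT user, COUNT(*) FROM events GROUP BY user"))

# Both express the same per-user event count.
assert rdd_style == Counter(sql_style)
print(sorted(sql_style.items()))  # [('alice', 3), ('bob', 1)]
```

The programmatic style pays off when the logic no longer fits a single GROUP BY, which matches the slide's split between simple ad hoc queries and deep-dive questions.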

  22. Sushi or Cooked Data
    • Keep data in as raw a form as possible!

    – ETL usually causes trouble and increases management cost

    – The Sushi Principle (Joseph &amp; Robert at Strata)

    – Drastically reduces operation &amp; management cost

    – Apache Spark is a great tool for extracting insight from raw data
    fresh data!
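
The "sushi" idea in miniature: parse raw JSON log lines at read time (schema on read) instead of maintaining an ETL pipeline. The log format here is invented for illustration, not Between's actual schema.

```python
import json

raw_lines = [
    '{"user": "u1", "event": "send_message", "ts": 1509000000}',
    '{"user": "u2", "event": "open_app", "ts": 1509000005}',
    '{"user": "u1", "event": "send_message", "ts": 1509000010}',
]

# Apply the schema when reading; malformed lines are simply skipped,
# so no upstream cleaning job has to exist.
def parse(line):
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None

records = [r for r in map(parse, raw_lines) if r]
messages = sum(1 for r in records if r["event"] == "send_message")
print(messages)  # 2
```

With Spark the same pattern is a map over raw text files; the point is that the raw lines stay the source of truth.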

  23. To Hire a Data Analyst or Not
    • For a data analyst, the expected skill set is…

    – Excel, SQL, R, …

    • These skills are not expected…

    – Programmatic APIs like Spark RDD

    – Cooking raw data

    • We prefer data engineers with analytic skills

    • We may need to add some ETL tasks to work with data analysts

  24. Better, Faster Team Support
    • Better - Zeppelin is great for analyzing data, but not enough for sharing data with the team

    – We have very few alternatives

    – Increase use of BI dashboard tools?

    – Still looking for a good way

    • Faster - launching a Spark cluster takes a few minutes

    – Not bad, but we want it faster

    – Google BigQuery or AWS Athena

    – SQL database with ETL

  25. Future Plan
    • Prepare for an exploding # of data operations!

    – The team is growing, the business is growing

    – # of tasks

    – # of 3rd-party data products

    – Communication cost

    • Operations with machine learning &amp; deep learning

    – A better way to manage task &amp; data flow

  26. Let's wrap up

  27. What Matters for Us
    • Support Team

    – Each team should see proper data and make good decisions from it

    – Regular meetings, fast response to ad hoc data requests

    – Ultimately, our every activity should be related to the company's business

    • Technical Lead

    – Technical investments for the competence of both the company and individuals

    – Working at Between should be the best experience for each individual

    • Social Impact

    – Does our work have a valuable impact on society?

    – Open source, activity in the community

  28. How Apache Spark is Powering a Startup
    • One great general-purpose tool

    – Daily batch tasks

    – Agile, ad hoc analysis

    – Drawing dashboards

    – Many more…

    • Helps save time and reduce the cost of data operations

    • A great experience for engineers and analysts

    • Sharing know-how to / from the community

  29. Working as a Data Engineer at a Startup
    • Fascinating, fast evolution of tech

    • Requires hard work and labor

    • Data work will shine only when it is understood and used by teammates
    Two Peasants Digging, Vincent van Gogh
    Two Men Digging, Jean-Francois Millet

  30. Thank you