Case studies of Spark Extension development by data scientists.

Takahiro Yoshinaga
LINE Data Science Team 4 Data Scientist
https://linedevday.linecorp.com/jp/2019/sessions/S1-17

LINE DevDay 2019

November 20, 2019

Transcript

  1. 2019 DevDay: Case Studies of Spark Extension Development by Data Scientists. Takahiro Yoshinaga, LINE Data Science Team 4, Data Scientist
  2. Self Introduction: Takahiro Yoshinaga, Ph.D. (Science)
    - 2018-02: Joined LINE Corporation as a Data Scientist
    - Responsible for data analysis and development of services for corporations
  3. OASIS: Spark Application in LINE Corporation
    - Web application in a notebook format
      - Create / execute queries
      - Visualize and share easily
      - Access to R and Python (SparkR, PySpark)
    - Extension
      - Create UDFs to make data analysis convenient
      - Add a stand-alone JAR in spark-submit options
  4. Extension Is Convenient, but It's a Hassle…
    - Upload the JAR file manually
    - Create a build environment (Scala) in local/prod
    - Manage the JAR file (versioning) manually
  5. Our Team Uses CI / CD
    - Upload the JAR file automatically
    - Create the build environment (Scala) in Docker
    - Manage the JAR file (versioning) in GitHub
    - Drone.io
  6. As Is / To Be
    As Is:
    - Write Scala code
    - Build on local machine
    - sbt test
    - Upload JAR to HDFS
    - Check it on OASIS
    - Review & Merge on GitHub
    - Re-build and versioning
    - Upload to OASIS
    To Be:
    - Write Scala code
    - <deleted>
    - <deleted>
    - <deleted>
    - git push & check it on OASIS
    - Review & Merge on GitHub
    - <deleted>
    - <deleted>
  7. Case Studies
    - Register Metrics and Automate Group by
      - Before: ad hoc aggregation by requirements
      - After: auto calculation by frequently used metrics
    - Mapping
      - Before: long long case when …
      - After: only one UDF
  8. Summary
    - Our team utilizes CI / CD in our development.
    - LINE has an environment in which data scientists can develop in a modern way.
    - We achieve high performance in data analysis thanks to our development.
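
The two case studies on slide 7 can be illustrated with a small sketch. This is not the team's extension, which the slides say is written in Scala and shipped as a JAR; it is plain Python that only mirrors the idea. Every name here (`REGION_BY_COUNTRY`, `map_region`, `METRICS`, `aggregate`) and the sample rows are hypothetical, since the slides do not show the actual categories or metrics. In a Spark setting, a function like `map_region` would presumably be wrapped as a UDF so that a query calls it once instead of spelling out a long CASE WHEN.

```python
from collections import defaultdict

# Hypothetical lookup table standing in for a long CASE WHEN chain.
REGION_BY_COUNTRY = {
    "JP": "East Asia",
    "TW": "East Asia",
    "TH": "Southeast Asia",
    "ID": "Southeast Asia",
}


def map_region(country_code):
    """Mapping case study: one function instead of many CASE WHEN branches."""
    return REGION_BY_COUNTRY.get(country_code, "Other")


# Hypothetical registry of frequently used metrics.
# Each metric is a function from a list of rows to a single value.
METRICS = {
    "users": lambda rows: len({row["user_id"] for row in rows}),
    "events": lambda rows: len(rows),
}


def aggregate(rows, key_fn, metric_names):
    """Group rows by key_fn, then compute each registered metric per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    return {
        key: {name: METRICS[name](members) for name in metric_names}
        for key, members in groups.items()
    }


# Made-up sample data for illustration only.
rows = [
    {"user_id": 1, "country": "JP"},
    {"user_id": 1, "country": "JP"},
    {"user_id": 2, "country": "TH"},
    {"user_id": 3, "country": "US"},
]

result = aggregate(rows, lambda row: map_region(row["country"]), ["users", "events"])
for region, values in sorted(result.items()):
    print(region, values)
```

Keeping the metrics in one registry means a new analysis names the metrics it needs rather than rewriting the aggregation, which is the "before / after" contrast the slide describes.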