Case studies of Spark Extension development by data scientists.

Takahiro Yoshinaga
LINE Data Science Team 4 Data Scientist
https://linedevday.linecorp.com/jp/2019/sessions/S1-17

LINE DevDay 2019

November 20, 2019

Transcript

  1. 2019 DevDay: Case Studies of Spark Extension Development by Data Scientists

    - Takahiro Yoshinaga
    - LINE Data Science Team 4 Data Scientist
  2. Self Introduction

    - 2018-02: Joined LINE Corporation as a Data Scientist
    - Responsible for data analysis and development of services for corporations
    - Takahiro Yoshinaga, Ph.D. (Science)
  3. Agenda

    - OASIS and its extension
    - Entire System
    - Case Studies
  4. OASIS: Spark Application in LINE Corporation

    - Web application in a notebook format
      - Create / execute queries
      - Visualize and share easily
      - Have access to R and Python (SparkR, PySpark)
    - Extension
      - Create UDFs to make data analysis convenient
      - Add a stand-alone JAR in spark-submit options
  5. Extension Is Convenient, but It's a Hassle…

    - Upload JAR file manually
    - Create build environment (Scala) in local/prod
    - Manage JAR file (versioning) manually
  6. Our Team Uses CI / CD

    - Upload JAR file automatically
    - Create build environment (Scala) in Docker
    - Manage JAR file (versioning) in GitHub
    - Drone.io
  7. Entire System

  8. As Is / To Be

    - As Is
      - Write Scala code
      - Build on local machine
      - sbt test
      - Upload JAR to HDFS
      - Check it on OASIS
      - Review & merge on GitHub
      - Re-build and versioning
      - Upload to OASIS
    - To Be
      - Write Scala code
      - <deleted>
      - <deleted>
      - <deleted>
      - git push & check it on OASIS
      - Review & merge on GitHub
      - <deleted>
      - <deleted>
  9. Case Studies

    - Register Metrics and Automate Group by
      - Before: Ad hoc aggregation by requirements
      - After: Auto calculation by frequently used metrics
    - Mapping
      - Before: a long, long CASE WHEN …
      - After: only one UDF
  10. Summary

    - Our team utilizes CI / CD in our development.
    - LINE has an environment in which data scientists can develop in a modern way.
    - We achieve high performance in data analysis thanks to our development.
  11. Thank You
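
The "Mapping" case study on slide 9 replaces a long CASE WHEN with a single UDF. The slides do not show the actual code, so the following is only a minimal sketch of that idea in Scala. The object name `CategoryMapping`, the function `toLabel`, and the codes and labels in the lookup table are all hypothetical. The logic is kept as a plain function so that it could be covered by `sbt test` without a Spark session.

```scala
// Hypothetical lookup table standing in for the branches of a long CASE WHEN.
// The real codes and labels are not given in the slides.
object CategoryMapping {
  private val labels: Map[String, String] = Map(
    "01" -> "news",
    "02" -> "music",
    "03" -> "shopping"
  )

  // Pure function: returns the label for a code, or "other" for
  // unknown or null input (null can appear in a nullable SQL column).
  def toLabel(code: String): String =
    Option(code).flatMap(labels.get).getOrElse("other")
}

// With Spark on the classpath, a function like this could be registered once,
// for example:
//   spark.udf.register("to_label", CategoryMapping.toLabel _)
// and then called from a query as to_label(category_code).
// The UDF name and column name here are assumptions for illustration.
```

Packaged into a JAR and passed via the spark-submit options, a function of this kind would let a notebook query call one UDF instead of repeating the same CASE WHEN expression.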