Slide 1

Slide 1 text

2019 DevDay Case Studies of Spark Extension Development by Data Scientists. > Takahiro Yoshinaga > LINE Data Science Team 4 Data Scientist

Slide 2

Slide 2 text

Self Introduction > Takahiro Yoshinaga, Ph.D. (Science) > 2018-02 Joined LINE Corporation as a Data Scientist > Responsible for data analysis and development of services for corporations.

Slide 3

Slide 3 text

Agenda > OASIS and its extension > Entire System > Case Studies

Slide 4

Slide 4 text

OASIS

Spark Application in LINE Corporation
- Web application in a notebook format
- Create / Execute query
- Visualize and Share easily
- Have access to R, Python (SparkR, PySpark)

Extension
- Create UDF to make data analysis convenient
- Add stand-alone JAR in spark-submit options
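The last bullet refers to attaching a pre-built JAR when a Spark job is launched. Below is a minimal sketch of what such an invocation might look like, assembled in Python. `--jars` is a standard spark-submit option; the JAR path and application file name are hypothetical placeholders, and the slides do not describe how OASIS passes these options internally.

```python
import shlex


def build_spark_submit(app, extension_jars):
    """Return a spark-submit argv that ships extra JARs with the job."""
    argv = ["spark-submit"]
    if extension_jars:
        # --jars takes a comma-separated list of JAR locations.
        argv += ["--jars", ",".join(extension_jars)]
    argv.append(app)
    return argv


# Hypothetical paths, for illustration only (not from the talk).
cmd = build_spark_submit(
    "analysis_job.py",
    ["hdfs:///path/to/extension-assembly.jar"],
)
print(shlex.join(cmd))
```

Once the JAR is on the job's classpath, the functions it contains can be registered as UDFs and called from queries in the notebook.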

Slide 5

Slide 5 text

Extension Is Convenient, but It's a Hassle…
- Upload JAR file manually
- Create build environment (Scala) in local/prod
- Manage JAR file (versioning) manually

Slide 6

Slide 6 text

Our Team Uses CI / CD
- Upload JAR file automatically
- Create build environment (Scala) in Docker
- Manage JAR file (versioning) in GitHub

Drone.io

Slide 7

Slide 7 text

Entire System

Slide 8

Slide 8 text

As Is / To Be

As Is
- Write Scala code
- Build on local machine
- sbt test
- Upload JAR to HDFS
- Check it on OASIS
- Review & Merge on GitHub
- Re-build and versioning
- Upload to OASIS

To Be
- Write Scala code
- (no manual step)
- (no manual step)
- (no manual step)
- git push & check it on OASIS
- Review & Merge on GitHub
- (no manual step)
- (no manual step)

Slide 9

Slide 9 text

Case Studies

Register Metrics and Automate Group By
- Before: ad hoc aggregation for each requirement
- After: automatic calculation of frequently used metrics

Mapping
- Before: a long, long CASE WHEN …
- After: only one UDF
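The two case studies can be illustrated with a small, Spark-free sketch in plain Python. The team's actual extensions are Scala UDFs shipped as a JAR; the metric names, row fields, and mapping table below are hypothetical and only show the idea: metrics are registered once and reused by every group-by, and a long CASE WHEN chain collapses into a single lookup function.

```python
from collections import defaultdict

# Registry of frequently used metrics: name -> function over a list of rows.
# Metric names and row fields are hypothetical.
METRICS = {}


def register_metric(name):
    """Decorator that adds an aggregation function to the registry."""
    def wrap(fn):
        METRICS[name] = fn
        return fn
    return wrap


@register_metric("users")
def unique_users(rows):
    return len({r["user_id"] for r in rows})


@register_metric("revenue")
def total_revenue(rows):
    return sum(r["amount"] for r in rows)


def auto_group_by(rows, key):
    """Group rows by `key` and compute every registered metric per group."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    return {
        k: {name: fn(g) for name, fn in METRICS.items()}
        for k, g in groups.items()
    }


# Mapping: one lookup function instead of a long CASE WHEN chain.
COUNTRY_TO_REGION = {"JP": "East Asia", "TW": "East Asia", "TH": "Southeast Asia"}


def to_region(country_code):
    return COUNTRY_TO_REGION.get(country_code, "Other")


rows = [
    {"user_id": 1, "country": "JP", "amount": 100},
    {"user_id": 1, "country": "JP", "amount": 50},
    {"user_id": 2, "country": "TH", "amount": 30},
]
for r in rows:
    r["region"] = to_region(r["country"])

result = auto_group_by(rows, "region")
print(result)
```

Adding a new metric then means registering one more function; no existing aggregation query has to be rewritten.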

Slide 10

Slide 10 text

Summary
- Our team uses CI / CD in its development.
- LINE has an environment in which data scientists can develop in a modern way.
- We achieve high performance in data analysis thanks to this development work.

Slide 11

Slide 11 text

Thank You