OASIS : LINE’s Data Analysis Tool Using Apache Spark

LINE Developers

March 14, 2019

Transcript

  1. OASIS : LINE’S DATA ANALYSIS TOOL USING APACHE SPARK
    Keiji Yoshida - LINE Corporation
  2. OASIS
    • A web-based data analysis platform
    • Enables employees to analyze their service’s data
  3. Agenda 1. Motivation 2. Features 3. Use Cases

  4. Agenda 1. Motivation 2. Features 3. Use Cases

  5. DATA PLATFORM
    [Diagram: Hadoop Cluster (Data Lake) with LINE Ads Platform, LINE Creators Market, LINE NEWS, LINE Pay, LINE LIVE, and LINE MOBILE; stages labeled ETL, Analysis, BI / Reporting]
  6. DATA OPEN
    • Makes the Hadoop cluster public within LINE
    • Enables employees to analyze their service’s data as they like
    • Speeds up their data analysis process and decision making
    [Diagram: Multi-tenant Hadoop Cluster with LINE Ads Platform and LINE Creators Market]
  7. REQUIREMENTS 1. Security 2. Stability 3. Features

  8. 1. SECURITY
    • Strict access control
    • Allows employees to access only their service’s data
    [Diagram: Multi-tenant Hadoop Cluster with LINE Ads Platform and LINE Creators Market]
  9. 1. SECURITY • Kerberos authentication • Apache Ranger for authorization
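
The access rule on the two SECURITY slides (employees may reach only their own service's data) can be illustrated with a small toy model. This is only a sketch: it does not use Apache Ranger's API, and the `Policy` class, the `is_allowed` function, the service names, and the `/data/...` paths are all hypothetical names chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical allow rule: members of `service` may access `path_prefix`."""
    service: str
    path_prefix: str


def is_allowed(user_services: set[str], path: str, policies: list[Policy]) -> bool:
    """Return True if any policy grants one of the user's services access to `path`."""
    for policy in policies:
        prefix = policy.path_prefix.rstrip("/")
        # Match the directory itself or anything below it, but not siblings
        # that merely share a name prefix (e.g. /data/line_news_private).
        under_prefix = path == prefix or path.startswith(prefix + "/")
        if policy.service in user_services and under_prefix:
            return True
    return False


# Hypothetical policy table: one data directory per service.
POLICIES = [
    Policy("line_news", "/data/line_news"),
    Policy("line_pay", "/data/line_pay"),
]
```

In this model a user who belongs only to `line_news` is allowed under `/data/line_news` and denied everywhere else; in the deck, authentication (Kerberos) and authorization (Apache Ranger) are handled by the cluster itself, not by application code like this.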

  10. 2. STABILITY
    • Isolation of applications
    • Resource control
    [Diagram: Multi-tenant Hadoop Cluster running App 1 through App 6]
  11. 2. STABILITY
    • Apache Spark on YARN
    • Uses Apache YARN’s resource control mechanism
    [Diagram: Multi-tenant Hadoop Cluster]
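
As a rough illustration of running Spark on YARN under YARN's resource control, the sketch below assembles a `spark-submit` command line that pins an application to a YARN queue and caps its executors. The flags are standard `spark-submit` options; the helper `build_spark_submit`, the queue name, the resource values, and the choice of cluster deploy mode are assumptions, not OASIS's actual configuration.

```python
def build_spark_submit(app, queue, num_executors=2, executor_memory="2g",
                       executor_cores=1, deploy_mode="cluster"):
    """Build a spark-submit argument list for a YARN-managed application.

    The queue and executor limits are what let YARN isolate tenants;
    all default values here are illustrative assumptions.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--queue", queue,                      # YARN queue for this tenant
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        app,
    ]


# Hypothetical usage: submit an ETL script to a per-service queue.
command = build_spark_submit("etl_job.py", queue="line_news")
```

The resulting list can be passed to `subprocess.run` on a host where `spark-submit` is installed and configured for the cluster.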
  12. 3. FEATURES
    Skill: SQL | Skill: Programming | Skill: Data Science | Role           | Required Features
    X          | X                  | X                   | Manager        | Result Sharing
    O          | X                  | X                   | Planner        | Query Result Visualization
    O          | O                  | X                   | Engineer       | ETL
    O          | O                  | O                   | Data Scientist | Ad Hoc Data Analysis
  13. APACHE ZEPPELIN 0.7.3 : SECURITY • Configurable execution user

  14. APACHE ZEPPELIN 0.7.3 : SECURITY
    • Launches a Spark application under another user’s account
    • Circumvents Apache Ranger’s access control
    [Diagram: User A, Apache Zeppelin, Spark Application : User B, HDFS / Apache Ranger]
  15. APACHE ZEPPELIN 0.7.3 : STABILITY
    • Runs only on a single server
    • Does not support the “yarn-cluster” mode
    • Freezes easily
    [Diagram: Apache Zeppelin Server with Apache Zeppelin and Driver Programs 1 to 5]
  16. OASIS

  17. Agenda 1. Motivation 2. Features 3. Use Cases

  18. SYSTEM ARCHITECTURE
    [Diagram components: End Users, OASIS, Frontend / API, Job Scheduler, Spark Interpreter, MySQL, Redis, Hadoop YARN Cluster, HDFS / Apache Ranger]
  19. NOTEBOOK CREATION

  20. SPARK APPLICATION
    • Launched per notebook session
    • Uses the notebook author’s account to access HDFS
    • Supports Spark, Spark SQL, PySpark, and SparkR
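
The "one Spark application per notebook session, running as the notebook's author" behavior can be sketched as a small registry. This is a toy model, not OASIS code: `SessionManager`, `SparkApp`, and the generated `app-N` identifiers are hypothetical, and no real Spark application is started.

```python
import itertools
from dataclasses import dataclass


@dataclass
class SparkApp:
    app_id: str
    run_as: str  # account used for HDFS access (the notebook's author)


class SessionManager:
    """Keeps one (simulated) Spark application per notebook session."""

    def __init__(self):
        self._apps = {}
        self._ids = itertools.count(1)

    def app_for(self, session_id, notebook_author):
        # Reuse the application already attached to this session, if any;
        # otherwise "launch" a new one under the author's account.
        if session_id not in self._apps:
            app_id = f"app-{next(self._ids)}"
            self._apps[session_id] = SparkApp(app_id, run_as=notebook_author)
        return self._apps[session_id]

    def close(self, session_id):
        # Drop the application when the notebook session ends.
        self._apps.pop(session_id, None)
```

Because the account is taken from the notebook's author rather than from a free-form setting, every HDFS access can be checked against that author's permissions.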
  21. SPARK APPLICATION SHARING

  22. NOTEBOOK SHARING
    • Notebooks can be shared within a “space”
    • “space” : root directory of notebooks for each LINE service
    • Access rights: “read write”, “read only”
    [Diagram: Space 1 and Space 2, each with Read Write Users, Read Only Users, and Notebooks]
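
The space model above (one space per service, with "read write" and "read only" users) maps naturally onto a small data structure. The sketch below is an illustration only; the `Space` and `Access` names and the helper functions are hypothetical and say nothing about how OASIS stores permissions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Access(Enum):
    NONE = 0
    READ_ONLY = 1
    READ_WRITE = 2


@dataclass
class Space:
    """Root directory of notebooks for one service, with two user groups."""
    name: str
    read_write_users: set = field(default_factory=set)
    read_only_users: set = field(default_factory=set)

    def access_for(self, user):
        # "read write" takes precedence if a user appears in both groups.
        if user in self.read_write_users:
            return Access.READ_WRITE
        if user in self.read_only_users:
            return Access.READ_ONLY
        return Access.NONE


def can_view(space, user):
    return space.access_for(user) is not Access.NONE


def can_edit(space, user):
    return space.access_for(user) is Access.READ_WRITE
```

A user outside both groups gets `Access.NONE`, which matches the idea that notebooks are shared only within a space.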
  23. SCHEDULING
    • Executes a notebook automatically
    • Keeps contents of a notebook up to date
    • Runs ETL processing periodically
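
Periodic notebook execution can be sketched as a fixed-interval scheduler that runs every notebook whose next run time has passed. This is a simplified model under stated assumptions: the deck does not describe how the OASIS Job Scheduler works, and `Schedule`, `run_due`, and the notebook names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Schedule:
    notebook: str
    interval: timedelta
    next_run: datetime


def run_due(schedules, now, execute):
    """Run every notebook that is due at `now`; return the names that ran."""
    ran = []
    for s in schedules:
        if s.interval <= timedelta(0):
            raise ValueError("interval must be positive")
        if s.next_run <= now:
            execute(s.notebook)
            ran.append(s.notebook)
            # Skip any missed slots so the notebook runs once, not once per
            # missed interval, then resumes its regular cadence.
            while s.next_run <= now:
                s.next_run += s.interval
    return ran
```

Calling `run_due` from a loop or a cron-like trigger keeps report notebooks current and runs ETL notebooks on their interval.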
  24. MULTIPLE SERVERS
    • Scalable
    • Highly available
    [Diagram components: OASIS, Frontend / API, Job Scheduler, Spark Interpreter, MySQL, Redis]
  25. Agenda 1. Motivation 2. Features 3. Use Cases

  26. STATS Users 60+ Spaces 3,500+ Notebooks 550+

  27. HADOOP CLUSTER (DATA LAKE)
    • 500 DataNodes / NodeManagers
    • HDFS usage : 30PB
    • 150+ Hive databases
    • 1,500+ Hive tables
  28. USE CASES 1. Report 2. Interactive dashboard 3. ETL 4. Ad hoc analysis
  29. None
  30. None
  31. None
  32. None
  33. FUTURE WORK
    • Multiple Hadoop clusters
    • Visualization (Charts : Area, Scatter, Pie, …)
    • OSS
    • Authentication
    • Refactoring
    • Internal review
  34. DATA ENGINEERING MEETUP #1 https://dem.connpass.com/event/120994/

  35. THANK YOU