OASIS : LINE’s Data Analysis Tool Using Apache Spark

LINE Developers

March 14, 2019

Transcript

  1. OASIS : LINE’S DATA ANALYSIS TOOL
    USING APACHE SPARK
    Keiji Yoshida - LINE Corporation

  2. OASIS
    • A web-based data analysis platform
    • Enables employees to analyze their service’s data

  3. Agenda
    1. Motivation
    2. Features
    3. Use Cases

  4. Agenda
    1. Motivation
    2. Features
    3. Use Cases

  5. DATA PLATFORM
    [Diagram: data from LINE services flows into the Hadoop cluster and is used for ETL, analysis, and BI / reporting]
    • Services: LINE Ads Platform, LINE Creators Market, LINE NEWS, LINE Pay, LINE LIVE, LINE MOBILE
    • Storage: Hadoop Cluster (Data Lake)
    • Uses: ETL, Analysis, BI / Reporting

  6. DATA OPEN
    • Makes the Hadoop cluster public within LINE
    • Enables employees to analyze their service’s data as they like
    • Speeds up their data analysis process and decision making
    [Diagram: Multi-tenant Hadoop Cluster shared by LINE Ads Platform and LINE Creators Market]

  7. REQUIREMENTS
    1. Security
    2. Stability
    3. Features

  8. 1. SECURITY
    • Strict access control
    • Allows employees to access only their service’s data
    [Diagram: Multi-tenant Hadoop Cluster shared by LINE Ads Platform and LINE Creators Market]

  9. 1. SECURITY
    • Kerberos authentication
    • Apache Ranger for authorization
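
The requirement that employees reach only their own service's data can be illustrated with a deny-by-default check. This is a simplified model, not Apache Ranger's API: the `POLICIES` table, the group names, and `is_allowed` are all hypothetical names chosen for illustration.

```python
from typing import Dict, Set

# Hypothetical policy table: Hive database -> groups allowed to query it.
# A real deployment would define such rules in Apache Ranger, not in code.
POLICIES: Dict[str, Set[str]] = {
    "ads_platform": {"ads-team"},
    "creators_market": {"creators-team"},
}


def is_allowed(user_groups: Set[str], database: str) -> bool:
    """Deny by default; allow only if the user is in a permitted group."""
    permitted = POLICIES.get(database, set())
    return bool(permitted & user_groups)


print(is_allowed({"ads-team"}, "ads_platform"))     # same service
print(is_allowed({"ads-team"}, "creators_market"))  # another service
```

In this model authentication (who the user is) is assumed to have happened already, which is the role Kerberos plays on the slide; the function covers only the authorization step.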

  10. 2. STABILITY
    • Isolation of applications
    • Resource Control
    [Diagram: Multi-tenant Hadoop Cluster running App 1 to App 6 side by side]

  11. 2. STABILITY
    • Apache Spark on YARN
    • Utilizes Apache Hadoop YARN’s resource control mechanism
    Multi-tenant Hadoop Cluster
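
As a rough illustration of running Spark on YARN with bounded resources, the sketch below assembles a `spark-submit` command line. The flags are standard `spark-submit` options; the queue name, application file, and limits are assumptions made up for the example, and the slides do not state which settings OASIS actually uses.

```python
from typing import List


def build_spark_submit(app: str, queue: str, max_executors: int,
                       executor_memory: str = "4g",
                       executor_cores: int = 2) -> List[str]:
    """Build a spark-submit command that runs on YARN in a bounded queue."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",  # driver runs inside the cluster
        "--queue", queue,            # YARN queue that caps this tenant's share
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        "--conf", "spark.dynamicAllocation.enabled=true",
        "--conf", f"spark.dynamicAllocation.maxExecutors={max_executors}",
        app,
    ]


# Hypothetical application file and queue name.
cmd = build_spark_submit("notebook_job.py", queue="ads_platform", max_executors=20)
print(" ".join(cmd))
```

Assigning each tenant its own queue lets the YARN scheduler enforce per-tenant capacity, while the executor cap keeps a single application from taking the whole queue.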

  12. 3. FEATURES
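    Role           | SQL | Programming | Data Science | Required Features
    Manager        | X   | X           | X            | Result Sharing
    Planner        | O   | X           | X            | Query Result Visualization
    Engineer       | O   | O           | X            | ETL
    Data Scientist | O   | O           | O            | Ad Hoc Data Analysis
    (SQL, Programming, and Data Science are skills: O = has the skill, X = does not)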

  13. APACHE ZEPPELIN 0.7.3 : SECURITY
    • Configurable execution user

  14. APACHE ZEPPELIN 0.7.3 : SECURITY
    • Can launch a Spark application under another user’s account
    • This circumvents Apache Ranger’s authorization
    [Diagram: User A → Apache Zeppelin → Spark Application (User B) → HDFS / Apache Ranger]

  15. APACHE ZEPPELIN 0.7.3 : STABILITY
    • Runs only on a single server
    • Does not support the “yarn-cluster” mode
    • Freezes easily
    [Diagram: a single Apache Zeppelin Server hosting Apache Zeppelin and Driver Programs 1 to 5]

  16. OASIS

  17. Agenda
    1. Motivation
    2. Features
    3. Use Cases

  18. SYSTEM ARCHITECTURE
    [Diagram components: End Users; OASIS; Frontend / API; Spark Interpreter; Job Scheduler; MySQL; Redis; Hadoop YARN Cluster; HDFS / Apache Ranger]

  19. NOTEBOOK CREATION

  20. SPARK APPLICATION
    • One Spark application is launched per notebook session
    • Uses the notebook author’s account to access HDFS
    • Supports Spark, Spark SQL, PySpark, and SparkR
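
The two rules above (one application per notebook session, running under the notebook author's account) can be modeled in a few lines. This is a toy sketch, not OASIS's implementation: `SparkApp` and `SessionManager` are hypothetical names, and no real Spark application is started. With real Spark, `spark-submit --proxy-user` is one way to run an application as another user, but the slides do not say which mechanism OASIS uses.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class SparkApp:
    session_id: str
    run_as: str  # account used for HDFS access


@dataclass
class SessionManager:
    apps: Dict[str, SparkApp] = field(default_factory=dict)

    def get_or_launch(self, session_id: str, notebook_author: str) -> SparkApp:
        """Reuse the session's application, or create one run as the author."""
        if session_id not in self.apps:
            self.apps[session_id] = SparkApp(session_id, run_as=notebook_author)
        return self.apps[session_id]


manager = SessionManager()
first = manager.get_or_launch("session-1", notebook_author="alice")
again = manager.get_or_launch("session-1", notebook_author="alice")
other = manager.get_or_launch("session-2", notebook_author="bob")
print(first is again, other.run_as)
```

Because the account is tied to the notebook's author rather than chosen freely by whoever runs it, the execution user is not a configurable setting in this model.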

  21. SPARK APPLICATION SHARING

  22. NOTEBOOK SHARING
    • Notebooks can be shared within a “space”
    • “space” : root directory of notebooks for each LINE service
    • Access rights: “read write”, “read only”
    [Diagram: Space 1 and Space 2 each contain their own Notebooks, Read Write users, and Read Only users]
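
The sharing model described on this slide can be sketched as a small data structure. The class and method names below are hypothetical, and the assumption that "read write" implies read access is mine rather than something the slides state.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Access(Enum):
    READ_ONLY = "read only"
    READ_WRITE = "read write"


@dataclass
class Space:
    """Root directory of notebooks for one service, with per-user rights."""
    name: str
    members: Dict[str, Access] = field(default_factory=dict)

    def can_read(self, user: str) -> bool:
        # Any member may read; non-members have no access at all.
        return user in self.members

    def can_write(self, user: str) -> bool:
        return self.members.get(user) is Access.READ_WRITE


space = Space("Space 1", {"alice": Access.READ_WRITE, "bob": Access.READ_ONLY})
print(space.can_write("alice"), space.can_write("bob"), space.can_read("carol"))
```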

  23. SCHEDULING
    • Executes a notebook automatically
    • Keeps contents of a notebook up to date
    • Runs ETL processing periodically
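
Periodic execution can be pictured as a check for which notebooks are due at a given moment. The sketch below is a minimal illustration under my own assumptions (a fixed interval per notebook and a recorded last run time); the slides do not describe how the OASIS job scheduler is implemented, and the notebook names are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Schedule:
    notebook: str
    interval: timedelta
    last_run: datetime


def due_notebooks(schedules: List[Schedule], now: datetime) -> List[str]:
    """Return notebooks whose interval has elapsed since their last run."""
    return [s.notebook for s in schedules if now - s.last_run >= s.interval]


now = datetime(2019, 3, 14, 12, 0)
schedules = [
    Schedule("daily_report", timedelta(days=1), datetime(2019, 3, 13, 11, 0)),
    Schedule("hourly_etl", timedelta(hours=1), datetime(2019, 3, 14, 11, 30)),
]
print(due_notebooks(schedules, now))
```

Here only `daily_report` is due: 25 hours have passed against a 24 hour interval, while `hourly_etl` last ran 30 minutes ago.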

  24. MULTIPLE SERVERS
    • Scalable
    • Highly available
    [Diagram components: OASIS; Frontend / API; Spark Interpreter; Job Scheduler; MySQL; Redis]

  25. Agenda
    1. Motivation
    2. Features
    3. Use Cases

  26. STATS
    Users
    60+
    Spaces
    3,500+
    Notebooks
    550+

  27. HADOOP CLUSTER (DATA LAKE)
    • 500 DataNodes / NodeManagers
    • HDFS usage : 30PB
    • 150+ Hive databases
    • 1,500+ Hive tables

  28. USE CASES
    1. Report
    2. Interactive dashboard
    3. ETL
    4. Ad hoc analysis

  29 to 32. (Image-only slides; no transcript text)

  33. FUTURE WORK
    • Multiple Hadoop clusters
    • Visualization (Charts : Area, Scatter, Pie, …)
    • OSS
    • Authentication
    • Refactoring
    • Internal review

  34. DATA ENGINEERING MEETUP #1
    https://dem.connpass.com/event/120994/

  35. THANK YOU
