OASIS : LINE’s Data Analysis Tool Using Apache Spark

LINE Developers

March 14, 2019

Transcript

  1. OASIS : LINE’S DATA ANALYSIS TOOL
    USING APACHE SPARK
    Keiji Yoshida - LINE Corporation

  2. OASIS
    • A web-based data analysis platform
    • Enables employees to analyze their service’s data

  3. Agenda
    1. Motivation
    2. Features
    3. Use Cases

  4. Agenda
    1. Motivation
    2. Features
    3. Use Cases

  5. DATA PLATFORM
    [Diagram]
    Services: LINE Ads Platform, LINE Creators Market, LINE NEWS, LINE Pay, LINE LIVE, LINE MOBILE
    Hadoop Cluster (Data Lake): holds the data of each service listed above
    Workloads: ETL, Analysis, BI / Reporting

  6. DATA OPEN
    • Makes the Hadoop cluster public within LINE
    • Enables employees to analyze their service’s data as they like
    • Speeds up their data analysis process and decision making
    [Diagram: Multi-tenant Hadoop Cluster shared by LINE Ads Platform and LINE Creators Market]

  7. REQUIREMENTS
    1. Security
    2. Stability
    3. Features

  8. 1. SECURITY
    • Strict access control
    • Allows employees to access only their service’s data
    [Diagram: Multi-tenant Hadoop Cluster shared by LINE Ads Platform and LINE Creators Market]

  9. 1. SECURITY
    • Kerberos authentication
    • Apache Ranger for authorization

  10. 2. STABILITY
    • Isolation of applications
    • Resource Control
    [Diagram: Multi-tenant Hadoop Cluster running App 1 through App 6]

  11. 2. STABILITY
    • Apache Spark on YARN
    • Uses Apache Hadoop YARN’s resource control mechanism
    Multi-tenant Hadoop Cluster
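
The resource control on this slide can be pictured as submitting each Spark application to a YARN queue with explicit caps. The sketch below only builds a `spark-submit` command line. The helper `build_spark_submit`, the queue name, the script name, and the limits are assumptions made for illustration, not OASIS's actual configuration. The flags and `spark.*` properties are standard Spark options.

```python
from typing import List


def build_spark_submit(app: str, queue: str, max_executors: int,
                       executor_memory: str = "4g",
                       executor_cores: int = 2) -> List[str]:
    """Build a spark-submit command that runs on YARN inside a capped queue."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--queue", queue,  # YARN queue; the name is hypothetical
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        # Dynamic allocation on YARN relies on the external shuffle service.
        "--conf", "spark.shuffle.service.enabled=true",
        "--conf", "spark.dynamicAllocation.enabled=true",
        "--conf", f"spark.dynamicAllocation.maxExecutors={max_executors}",
        app,
    ]


cmd = build_spark_submit("report.py", queue="line_news", max_executors=20)
print(" ".join(cmd))
```

With this shape, YARN's scheduler enforces the queue's share of the cluster, while the executor cap keeps one application from growing without bound inside that queue.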

  12. 3. FEATURES
    Role            | SQL | Programming | Data Science | Required Features
    Manager         |  X  |      X      |      X       | Result Sharing
    Planner         |  O  |      X      |      X       | Query Result Visualization
    Engineer        |  O  |      O      |      X       | ETL
    Data Scientist  |  O  |      O      |      O       | Ad Hoc Data Analysis
    (Skill columns: O = has the skill, X = does not)

  13. APACHE ZEPPELIN 0.7.3 : SECURITY
    • Configurable execution user

  14. APACHE ZEPPELIN 0.7.3 : SECURITY
    • Can launch a Spark application under another user's account
    • Bypasses Apache Ranger's authorization
    [Diagram: User A, Apache Zeppelin, Spark Application running as User B, HDFS / Apache Ranger]
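
A simplified model of the problem on this slide: authorization is decided by the account the Spark application runs as, so if the logged-in user can choose that account freely, the check never sees who is really asking. This is not Apache Ranger's API. The policy table, paths, and function names are hypothetical.

```python
from typing import Dict, Set

# Hypothetical policy table: which account may read which HDFS path prefix.
POLICIES: Dict[str, Set[str]] = {
    "user_a": {"/data/service_a"},
    "user_b": {"/data/service_b"},
}


def is_allowed(effective_user: str, path: str) -> bool:
    """Authorize by the account the Spark application runs as."""
    prefixes = POLICIES.get(effective_user, set())
    return any(path == p or path.startswith(p + "/") for p in prefixes)


def read_via_notebook(login_user: str, execution_user: str, path: str) -> bool:
    """Model a notebook whose execution user is configurable."""
    # Only execution_user reaches the check; login_user is never consulted.
    return is_allowed(execution_user, path)


# User A running as themselves is denied access to service B's data ...
print(read_via_notebook("user_a", "user_a", "/data/service_b/logs"))
# ... but is granted access after setting the execution user to User B.
print(read_via_notebook("user_a", "user_b", "/data/service_b/logs"))
```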

  15. APACHE ZEPPELIN 0.7.3 : STABILITY
    • Runs only on a single server
    • Does not support the “yarn-cluster” mode
    • Freezes easily
    [Diagram: a single Apache Zeppelin Server hosting Apache Zeppelin and Driver Programs 1 through 5]

  16. Agenda
    1. Motivation
    2. Features
    3. Use Cases

  17. SYSTEM ARCHITECTURE
    [Diagram]
    End Users
    OASIS: Frontend / API, Job Scheduler, Spark Interpreter
    MySQL, Redis
    Hadoop YARN Cluster
    HDFS / Apache Ranger

  18. NOTEBOOK CREATION

  19. SPARK APPLICATION
    • Launched once per notebook session
    • Uses the notebook author’s account to access HDFS
    • Supports Spark, Spark SQL, PySpark, and SparkR
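
A minimal sketch of the "one application per notebook session, running as the author" idea. It only records the command that would be launched and does not start Spark. `SessionManager`, `SparkApp`, and `interpreter.py` are hypothetical names. `--proxy-user` is a real `spark-submit` flag, but the slides do not say whether OASIS uses it or another impersonation mechanism.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SparkApp:
    session_id: str
    run_as: str          # account used for HDFS access
    command: List[str]   # command line that would start the application


@dataclass
class SessionManager:
    apps: Dict[str, SparkApp] = field(default_factory=dict)

    def get_or_launch(self, session_id: str, author: str) -> SparkApp:
        """Return the session's application, creating it on first use."""
        if session_id not in self.apps:
            command = ["spark-submit", "--master", "yarn",
                       "--proxy-user", author, "interpreter.py"]
            self.apps[session_id] = SparkApp(session_id, author, command)
        return self.apps[session_id]


manager = SessionManager()
first = manager.get_or_launch("notebook-42/session-1", author="alice")
again = manager.get_or_launch("notebook-42/session-1", author="alice")
print(first is again, first.run_as)
```

Keying the registry by session means repeated paragraph runs in one session reuse the same application, while a second session gets its own.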

  20. SPARK APPLICATION SHARING

  21. NOTEBOOK SHARING
    • Notebooks can be shared within a “space”
    • “space” : root directory of notebooks for each LINE service
    • Access rights: “read write”, “read only”
    [Diagram]
    Space 1: Read Write users, Read Only users, Notebooks
    Space 2: Read Write users, Read Only users, Notebooks
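
The two access rights per space can be modelled as follows. This is an illustrative sketch and not OASIS's data model. The class and method names are assumptions, and it assumes that users outside a space cannot see its notebooks at all.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Access(Enum):
    READ_ONLY = "read only"
    READ_WRITE = "read write"


@dataclass
class Space:
    """Root directory of notebooks for one service, with per-user rights."""
    name: str
    members: Dict[str, Access] = field(default_factory=dict)

    def can_read(self, user: str) -> bool:
        # Any member of the space may read its notebooks.
        return user in self.members

    def can_write(self, user: str) -> bool:
        # Only "read write" members may edit notebooks.
        return self.members.get(user) is Access.READ_WRITE


space = Space("LINE NEWS", {"alice": Access.READ_WRITE,
                            "bob": Access.READ_ONLY})
print(space.can_write("alice"), space.can_write("bob"), space.can_read("carol"))
```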

  22. SCHEDULING
    • Executes a notebook automatically
    • Keeps contents of a notebook up to date
    • Runs ETL processing periodically
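
Periodic notebook execution can be sketched as a loop over schedules that picks the ones that are due and advances their next run time. The slides do not describe how OASIS's job scheduler works, so `Schedule`, `run_due`, and the fixed-interval model are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Schedule:
    notebook_id: str
    interval: timedelta
    next_run: datetime


def run_due(schedules: List[Schedule], now: datetime) -> List[str]:
    """Return the notebooks due at `now` and advance their next run time."""
    executed = []
    for s in schedules:
        if s.interval <= timedelta(0):
            raise ValueError("interval must be positive")
        if s.next_run <= now:
            executed.append(s.notebook_id)  # a real system would run it here
            # Skip any missed slots so the notebook runs once, not many times.
            while s.next_run <= now:
                s.next_run += s.interval
    return executed


start = datetime(2019, 3, 14)
jobs = [Schedule("daily-report", timedelta(days=1), start),
        Schedule("hourly-etl", timedelta(hours=1), start + timedelta(hours=2))]
print(run_due(jobs, start + timedelta(hours=1)))
```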

  23. MULTIPLE SERVERS
    • Scalable
    • Highly available
    [Diagram: multiple OASIS servers (Frontend / API, Job Scheduler, Spark Interpreter) with MySQL and Redis]

  24. Agenda
    1. Motivation
    2. Features
    3. Use Cases

  25. STATS
    Users
    60+
    Spaces
    3,500+
    Notebooks
    550+

  26. HADOOP CLUSTER (DATA LAKE)
    • 500 DataNodes / NodeManagers
    • HDFS usage : 30PB
    • 150+ Hive databases
    • 1,500+ Hive tables

  27. USE CASES
    1. Report
    2. Interactive dashboard
    3. ETL
    4. Ad hoc analysis

  28. FUTURE WORK
    • Multiple Hadoop clusters
    • Visualization (Charts : Area, Scatter, Pie, …)
    • OSS
    • Authentication
    • Refactoring
    • Internal review

  29. DATA ENGINEERING MEETUP #1
    https://dem.connpass.com/event/120994/
