OASIS : LINE’s Data Analysis Tool Using Apache Spark

LINE Developers

March 14, 2019

Transcript

  1. DATA PLATFORM
    [Diagram labels: LINE services (LINE Ads Platform, LINE Creators Market, LINE NEWS, LINE Pay, LINE LIVE, LINE MOBILE); Hadoop Cluster (Data Lake); ETL; Analysis; BI / Reporting]
  2. DATA OPEN
    • Makes the Hadoop cluster public within LINE
    • Enables employees to analyze their service’s data as they like
    • Speeds up their data analysis process and decision making
    [Diagram labels: Multi-tenant Hadoop Cluster; LINE Ads Platform; LINE Creators Market]
  3. 1. SECURITY
    • Strict access control
    • Allows employees to access only their service’s data
    [Diagram labels: Multi-tenant Hadoop Cluster; LINE Ads Platform; LINE Creators Market]
  4. 2. STABILITY
    • Isolation of applications
    • Resource control
    [Diagram labels: Multi-tenant Hadoop Cluster; App 1 to App 6]
  5. 2. STABILITY
    • Apache Spark on YARN
    • Utilizes YARN’s resource control mechanism
    [Diagram label: Multi-tenant Hadoop Cluster]
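As a rough illustration of the resource control mentioned above, the sketch below renders per-tenant queue settings for YARN's Capacity Scheduler. The slides do not say which YARN scheduler or queue layout LINE uses, so the choice of the Capacity Scheduler, the queue names, and the percentages are all assumptions made for the example.

```python
# Hypothetical tenant queues; names and percentages are illustrative only
# and do not come from the slides.
QUEUE_CAPACITY_PERCENT = {
    "ads_platform": 40.0,
    "creators_market": 25.0,
    "news": 35.0,
}


def capacity_scheduler_properties(capacities):
    """Render Capacity Scheduler properties for queues directly under root."""
    total = sum(capacities.values())
    # Sibling queue capacities must add up to 100 percent.
    if abs(total - 100.0) > 1e-9:
        raise ValueError(f"queue capacities sum to {total}, expected 100")
    props = {"yarn.scheduler.capacity.root.queues": ",".join(capacities)}
    for name, percent in capacities.items():
        props[f"yarn.scheduler.capacity.root.{name}.capacity"] = str(percent)
    return props


properties = capacity_scheduler_properties(QUEUE_CAPACITY_PERCENT)
for key, value in properties.items():
    print(f"{key} = {value}")
```

Giving each service its own queue is one way to keep a heavy job in one tenant from starving the others; the real cluster may use a different scheduler or layout.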
  6. 3. FEATURES
    Skills (O = yes, X = no)
    SQL   Programming   Data Science   Role             Required Features
    X     X             X              Manager          Result Sharing
    O     X             X              Planner          Query Result Visualization
    O     O             X              Engineer         ETL
    O     O             O              Data Scientist   Ad Hoc Data Analysis
  7. APACHE ZEPPELIN 0.7.3 : SECURITY
    • Launches a Spark application with another user’s account
    • Bypasses Apache Ranger’s access control
    [Diagram labels: User A; Apache Zeppelin; Spark Application : User B; HDFS / Apache Ranger]
  8. APACHE ZEPPELIN 0.7.3 : STABILITY
    • Runs only on a single server
    • Does not support the “yarn-cluster” mode
    • Freezes easily
    [Diagram labels: Apache Zeppelin Server; Apache Zeppelin; Driver Program 1 to Driver Program 5]
  9. SYSTEM ARCHITECTURE
    [Diagram labels: End Users; OASIS (Frontend / API, Job Scheduler, Spark Interpreter); MySQL; Redis; Hadoop YARN Cluster; HDFS / Apache Ranger]
  10. SPARK APPLICATION
    • Launched once per notebook session
    • Uses the notebook author’s account to access HDFS
    • Supports Spark, Spark SQL, PySpark, and SparkR
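One way to picture "one Spark application per notebook session, running as the notebook author" is a `spark-submit` invocation built per session. The sketch below only assembles the argument list and does not run anything. Whether OASIS actually uses `spark-submit`, `--proxy-user`, or cluster deploy mode is not stated in the slides; the function name, user name, queue name, and `interpreter.py` path are hypothetical.

```python
import shlex


def build_spark_submit(author, notebook_id, queue, app_path):
    """Return the argument list for one notebook session's Spark application."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",  # the driver runs inside YARN, not on the web server
        "--queue", queue,            # hypothetical per-service YARN queue
        "--proxy-user", author,      # HDFS access is evaluated against this account
        "--name", f"notebook-{notebook_id}",
        app_path,
    ]


# Hypothetical session: user "alice" opens notebook 42.
command = build_spark_submit("alice", "42", "ads_platform", "interpreter.py")
print(shlex.join(command))
```

Running the driver in cluster mode keeps driver programs off the notebook server, and impersonating the author means HDFS and Apache Ranger evaluate the request against the right account, which addresses the two Zeppelin 0.7.3 concerns listed earlier.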
  11. NOTEBOOK SHARING
    • Notebooks can be shared within a “space”
    • “space” : root directory of notebooks for each LINE service
    • Access rights: “read write”, “read only”
    [Diagram labels: Space 1 (Read Write Users, Read Only Users, Notebooks); Space 2 (Read Write Users, Read Only Users, Notebooks)]
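The space model above can be sketched as a small data structure: each space maps its members to one of the two access rights, and users outside the space see nothing. This is a minimal illustration, not OASIS's implementation; the class names, method names, and users are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Access(Enum):
    READ_ONLY = "read only"
    READ_WRITE = "read write"


@dataclass
class Space:
    """Root directory of notebooks for one service, with its member list."""
    name: str
    members: dict = field(default_factory=dict)  # user name -> Access

    def can_read(self, user):
        # Any member of the space may read its notebooks.
        return user in self.members

    def can_write(self, user):
        # Only "read write" members may modify notebooks.
        return self.members.get(user) is Access.READ_WRITE


# Hypothetical users in a hypothetical space.
space = Space("space-1", {"alice": Access.READ_WRITE, "bob": Access.READ_ONLY})
print(space.can_write("alice"), space.can_write("bob"), space.can_read("carol"))
```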
  12. SCHEDULING
    • Executes a notebook automatically
    • Keeps contents of a notebook up to date
    • Runs ETL processing periodically
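Periodic notebook execution boils down to computing when a notebook is next due. The slides do not describe how OASIS schedules are defined, so the sketch below assumes a simple fixed interval; the function name and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def next_run(last_run, interval, now):
    """Return the first scheduled time strictly after `now`."""
    if now < last_run:
        return last_run
    # Whole intervals that have fully elapsed since the last run.
    elapsed_intervals = (now - last_run) // interval
    return last_run + (elapsed_intervals + 1) * interval


# Hypothetical daily ETL notebook that last ran at midnight on March 14.
last = datetime(2019, 3, 14, 0, 0)
due = next_run(last, timedelta(days=1), now=datetime(2019, 3, 16, 9, 30))
print(due)
```

Skipping missed intervals rather than replaying each one is a design choice for this sketch; an ETL job that must backfill every period would instead iterate over the missed run times.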
  13. HADOOP CLUSTER (DATA LAKE)
    • 500 DataNodes / NodeManagers
    • HDFS usage : 30PB
    • 150+ Hive databases
    • 1,500+ Hive tables
  14. FUTURE WORK
    • Multiple Hadoop clusters
    • Visualization (Charts : Area, Scatter, Pie, …)
    • OSS
    • Authentication
    • Refactoring
    • Internal review