OASIS : LINE’s Data Analysis Tool Using Apache Spark

Slide 1

Slide 1 text

OASIS : LINE’S DATA ANALYSIS TOOL USING APACHE SPARK Keiji Yoshida - LINE Corporation

Slide 2

Slide 2 text

OASIS • A web-based data analysis platform • Enables employees to analyze their service’s data

Slide 3

Slide 3 text

Agenda 1. Motivation 2. Features 3. Use Cases

Slide 5

Slide 5 text

DATA PLATFORM • Source services: LINE Ads Platform, LINE Creators Market, LINE NEWS, LINE Pay, LINE LIVE, LINE MOBILE • Hadoop Cluster (Data Lake): stores each service’s data • Uses: ETL, Analysis, BI / Reporting

Slide 6

Slide 6 text

DATA OPEN • Opens the Hadoop cluster to employees across LINE • Enables employees to analyze their service’s data as they like • Speeds up their data analysis and decision making (Diagram: a multi-tenant Hadoop cluster shared by LINE Ads Platform and LINE Creators Market)

Slide 7

Slide 7 text

REQUIREMENTS 1. Security 2. Stability 3. Features

Slide 8

Slide 8 text

1. SECURITY • Strict access control • Allows employees to access only their service’s data (Diagram: a multi-tenant Hadoop cluster holding LINE Ads Platform and LINE Creators Market data, accessed by LINE Ads Platform and LINE Creators Market users)

Slide 9

Slide 9 text

1. SECURITY • Kerberos authentication • Apache Ranger for authorization
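The access rule stated on the slides (an employee may reach only their own service's data) can be illustrated with a toy policy check. This is only a sketch of the idea and is not Apache Ranger's API; the `Policy` class, the group names, and the `/data/...` paths are hypothetical and do not come from the talk.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical rule: members of `group` may read paths under `path_prefix`."""
    group: str
    path_prefix: str


# Assumed layout: one directory per service (not taken from the slides).
POLICIES = [
    Policy(group="ads_platform", path_prefix="/data/ads_platform/"),
    Policy(group="creators_market", path_prefix="/data/creators_market/"),
]


def can_read(user_groups: set[str], path: str) -> bool:
    """Allow the read only if some policy matches both a group and the path."""
    return any(
        p.group in user_groups and path.startswith(p.path_prefix)
        for p in POLICIES
    )


if __name__ == "__main__":
    print(can_read({"ads_platform"}, "/data/ads_platform/clicks"))
    print(can_read({"ads_platform"}, "/data/creators_market/sales"))
```

In a real deployment the user's identity would come from Kerberos and the decision from Ranger policies; the sketch only shows the default-deny shape of the check.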

Slide 10

Slide 10 text

2. STABILITY • Isolation of applications • Resource control (Diagram: App 1 to App 6 running on the multi-tenant Hadoop cluster)

Slide 11

Slide 11 text

2. STABILITY • Apache Spark on YARN • Utilizes Hadoop YARN’s resource control mechanism (Diagram: multi-tenant Hadoop cluster)
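One way to picture "Spark on YARN with YARN's resource control" is a `spark-submit` invocation that runs the driver inside the cluster and targets a specific YARN queue. The helper below is a minimal sketch, not how OASIS actually launches applications; the function name, the queue name, and the default resource sizes are assumptions for illustration. The command-line flags themselves are standard `spark-submit` options.

```python
from typing import List


def build_spark_submit(app_path: str, queue: str,
                       num_executors: int = 2,
                       executor_memory: str = "2g") -> List[str]:
    """Build a spark-submit command line for YARN cluster mode.

    The queue lets YARN cap the resources of each tenant; the executor
    count and memory bound what a single application may request.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",   # driver runs on the cluster, not the web server
        "--queue", queue,             # hypothetical per-service YARN queue
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        app_path,
    ]


if __name__ == "__main__":
    print(" ".join(build_spark_submit("notebook_job.py", queue="ads_platform")))
```

Running the driver on the cluster rather than on the notebook server is what keeps one heavy application from starving the others.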

Slide 12

Slide 12 text

3. FEATURES (O = has the skill, X = does not)

Skill: SQL | Skill: Programming | Skill: Data Science | Role           | Required Features
X          | X                  | X                   | Manager        | Result Sharing
O          | X                  | X                   | Planner        | Query Result Visualization
O          | O                  | X                   | Engineer       | ETL
O          | O                  | O                   | Data Scientist | Ad Hoc Data Analysis

Slide 13

Slide 13 text

APACHE ZEPPELIN 0.7.3 : SECURITY • Configurable execution user

Slide 14

Slide 14 text

APACHE ZEPPELIN 0.7.3 : SECURITY • A user can launch a Spark application under another user’s account • This circumvents Apache Ranger’s access control (Diagram: User A uses Apache Zeppelin to start a Spark application that runs as User B against HDFS / Apache Ranger)

Slide 15

Slide 15 text

APACHE ZEPPELIN 0.7.3 : STABILITY • Runs only on a single server • Does not support the “yarn-cluster” mode • Freezes easily (Diagram: one Apache Zeppelin server hosting Apache Zeppelin together with Driver Programs 1 to 5)

Slide 16

Slide 16 text

OASIS

Slide 17

Slide 17 text

Agenda 1. Motivation 2. Features 3. Use Cases

Slide 18

Slide 18 text

SYSTEM ARCHITECTURE • End Users • OASIS: Frontend / API, Job Scheduler, Spark Interpreter, MySQL, Redis • Hadoop YARN Cluster • HDFS / Apache Ranger

Slide 19

Slide 19 text

NOTEBOOK CREATION

Slide 20

Slide 20 text

SPARK APPLICATION • Launched per notebook session • Uses the notebook author’s account to access HDFS • Supports Spark, Spark SQL, PySpark, and SparkR
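The two rules on this slide (one Spark application per notebook session, always running as the notebook's author) can be sketched as a small registry. This is an illustrative model only; `SessionManager`, `SparkApp`, and the ID format are hypothetical names, and no real Spark application is started.

```python
import itertools
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SparkApp:
    app_id: str
    run_as: str  # account used for HDFS access: the notebook's author


class SessionManager:
    """Keeps one (simulated) Spark application per notebook session."""

    def __init__(self) -> None:
        self._apps: Dict[str, SparkApp] = {}
        self._ids = itertools.count(1)

    def get_or_launch(self, session_id: str, author: str) -> SparkApp:
        # Reuse the application if the session already has one; otherwise
        # "launch" a new one that runs as the author, never as the viewer.
        if session_id not in self._apps:
            self._apps[session_id] = SparkApp(f"app-{next(self._ids)}", author)
        return self._apps[session_id]

    def close(self, session_id: str) -> None:
        self._apps.pop(session_id, None)


if __name__ == "__main__":
    mgr = SessionManager()
    print(mgr.get_or_launch("session-1", author="alice"))
    print(mgr.get_or_launch("session-1", author="alice"))  # same application
```

Tying the execution account to the author rather than to a configurable user is what keeps authorization checks meaningful downstream.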

Slide 21

Slide 21 text

SPARK APPLICATION SHARING

Slide 22

Slide 22 text

NOTEBOOK SHARING • Notebooks can be shared within a “space” • “space”: root directory of notebooks for each LINE service • Access rights: “read write”, “read only” (Diagram: Space 1 and Space 2, each with its own read-write users, read-only users, and notebooks)
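The space model described here (each space has read-write users, read-only users, and notebooks) maps naturally onto a small data structure. The sketch below is an assumption about how such rights could be represented, not OASIS's actual schema; `Space`, `Access`, and the helper names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set


class Access(Enum):
    READ_ONLY = "read only"
    READ_WRITE = "read write"


@dataclass
class Space:
    """Root directory of notebooks for one service, with its two user lists."""
    name: str
    read_write_users: Set[str] = field(default_factory=set)
    read_only_users: Set[str] = field(default_factory=set)

    def access_for(self, user: str) -> Optional[Access]:
        # Read-write wins if a user appears in both lists; unknown users get None.
        if user in self.read_write_users:
            return Access.READ_WRITE
        if user in self.read_only_users:
            return Access.READ_ONLY
        return None


def can_view(space: Space, user: str) -> bool:
    return space.access_for(user) is not None


def can_edit(space: Space, user: str) -> bool:
    return space.access_for(user) is Access.READ_WRITE


if __name__ == "__main__":
    space = Space("Space 1", read_write_users={"alice"}, read_only_users={"bob"})
    print(can_edit(space, "alice"), can_edit(space, "bob"), can_view(space, "carol"))
```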

Slide 23

Slide 23 text

SCHEDULING • Executes a notebook automatically • Keeps contents of a notebook up to date • Runs ETL processing periodically
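Periodic notebook execution can be reduced to one question: which notebooks are due right now? The sketch below assumes a fixed-interval schedule; the talk does not say how OASIS's Job Scheduler expresses schedules, so `Schedule`, its fields, and `due_notebooks` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Schedule:
    notebook_id: str
    interval: timedelta   # assumed fixed-interval schedule
    last_run: datetime


def due_notebooks(schedules: List[Schedule], now: datetime) -> List[str]:
    """Return the notebooks whose interval has elapsed since their last run."""
    return [s.notebook_id for s in schedules if now - s.last_run >= s.interval]


if __name__ == "__main__":
    now = datetime(2018, 1, 1, 12, 0)
    schedules = [
        Schedule("daily-report", timedelta(days=1), datetime(2017, 12, 31, 12, 0)),
        Schedule("hourly-etl", timedelta(hours=1), datetime(2018, 1, 1, 11, 30)),
    ]
    print(due_notebooks(schedules, now))
```

A scheduler loop would call this periodically, run each due notebook, and then update `last_run`.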

Slide 24

Slide 24 text

MULTIPLE SERVERS • Scalable • Highly available (Diagram: multiple OASIS servers, each with Frontend / API, Job Scheduler, and Spark Interpreter, sharing MySQL and Redis)

Slide 25

Slide 25 text

Agenda 1. Motivation 2. Features 3. Use Cases

Slide 26

Slide 26 text

Users 60+ Spaces 3,500+ Notebooks 550+ STATS

Slide 27

Slide 27 text

HADOOP CLUSTER (DATA LAKE) • 500 DataNodes / NodeManagers • HDFS usage: 30 PB • 150+ Hive databases • 1,500+ Hive tables

Slide 28

Slide 28 text

USE CASES 1. Report 2. Interactive dashboard 3. ETL 4. Ad hoc analysis

Slide 33

Slide 33 text

FUTURE WORK • Multiple Hadoop clusters • Visualization (Charts : Area, Scatter, Pie, …) • OSS • Authentication • Refactoring • Internal review
