OASIS : LINE’S DATA ANALYSIS TOOL
USING APACHE SPARK
Keiji Yoshida - LINE Corporation
Slide 2
Slide 2 text
OASIS
• A web-based data analysis platform
• Enables employees to analyze their service’s data
Slide 3
Slide 3 text
Agenda 1. Motivation
2. Features
3. Use Cases
Slide 4
Slide 4 text
Agenda 1. Motivation
2. Features
3. Use Cases
Slide 5
Slide 5 text
DATA PLATFORM
[Diagram: data from LINE services (LINE Ads Platform, LINE Creators Market, LINE NEWS, LINE Pay, LINE LIVE, LINE MOBILE) is collected in a Hadoop Cluster (Data Lake), which is used for ETL, Analysis, and BI / Reporting]
Slide 6
Slide 6 text
DATA OPEN
• Opens the Hadoop cluster to employees across LINE
• Enables employees to analyze their service’s data freely
• Speeds up their data analysis process and decision making
[Diagram: a Multi-tenant Hadoop Cluster holding the data of LINE Ads Platform and LINE Creators Market, accessed by each of those services]
Slide 7
Slide 7 text
REQUIREMENTS
1. Security
2. Stability
3. Features
Slide 8
Slide 8 text
1. SECURITY
• Strict access control
• Allows employees to access only their service’s data
[Diagram: a Multi-tenant Hadoop Cluster holding the data of LINE Ads Platform and LINE Creators Market, accessed by each of those services]
Slide 9
Slide 9 text
1. SECURITY
• Kerberos authentication
• Apache Ranger for authorization
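The rule "employees may access only their service's data" can be illustrated with a toy, deny-by-default path check. This is only a sketch of the idea in plain Python: it does not use Apache Ranger's API, and the `Policy` class, group names, and `/data/...` paths are hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Iterable, Set


@dataclass(frozen=True)
class Policy:
    """Hypothetical allow rule: members of `group` may access paths under `path_prefix`."""
    group: str
    path_prefix: str


def is_allowed(user_groups: Set[str], path: str, policies: Iterable[Policy]) -> bool:
    """Return True only if some policy grants one of the user's groups access to `path`."""
    target = PurePosixPath(path)
    for policy in policies:
        prefix = PurePosixPath(policy.path_prefix)
        # Compare whole path components so /data/a does not match /data/a_private.
        under_prefix = target == prefix or prefix in target.parents
        if policy.group in user_groups and under_prefix:
            return True
    return False  # deny by default


# Hypothetical policies: one directory tree per LINE service.
POLICIES = [
    Policy(group="line-ads-platform", path_prefix="/data/line_ads_platform"),
    Policy(group="line-creators-market", path_prefix="/data/line_creators_market"),
]
```

In this sketch, a user in the `line-ads-platform` group is allowed under `/data/line_ads_platform` and denied everywhere else.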
2. STABILITY
• Apache Spark on YARN
• Uses Hadoop YARN’s resource control mechanism
Multi-tenant Hadoop Cluster
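One way to lean on YARN's resource control is to submit each Spark application to a per-service YARN queue with capped executors. The sketch below only builds a `spark-submit` argument list and does not run it. The queue name, resource sizes, and the choice of cluster deploy mode are assumptions; the slides do not show the actual settings.

```python
from typing import List


def build_spark_submit(app: str, queue: str, max_executors: int,
                       executor_memory: str = "4g",
                       executor_cores: int = 2) -> List[str]:
    """Build a spark-submit command that runs `app` on YARN in the given queue."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",   # assumed: the driver also runs inside YARN
        "--queue", queue,             # YARN queue limits what this tenant can use
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        # Dynamic allocation on YARN needs the external shuffle service.
        "--conf", "spark.shuffle.service.enabled=true",
        "--conf", "spark.dynamicAllocation.enabled=true",
        "--conf", f"spark.dynamicAllocation.maxExecutors={max_executors}",
        app,
    ]
```

For example, `build_spark_submit("job.py", "line_ads", 10)` yields a command whose executors are bounded by both the application-level cap and whatever capacity the cluster administrators give the hypothetical `line_ads` queue.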
Slide 12
Slide 12 text
3. FEATURES
Skills and required features by role (O = yes, X = no):

Role           | SQL | Programming | Data Science | Required Features
Manager        | X   | X           | X            | Result Sharing
Planner        | O   | X           | X            | Query Result Visualization
Engineer       | O   | O           | X            | ETL
Data Scientist | O   | O           | O            | Ad Hoc Data Analysis
Slide 13
Slide 13 text
APACHE ZEPPELIN 0.7.3 : SECURITY
• Configurable execution user
Slide 14
Slide 14 text
APACHE ZEPPELIN 0.7.3 : SECURITY
• Launches a Spark application with another user account
• Circumvents Apache Ranger’s access control
[Diagram: User A uses Apache Zeppelin, which launches a Spark application as User B to access HDFS / Apache Ranger]
Slide 15
Slide 15 text
APACHE ZEPPELIN 0.7.3 : STABILITY
• Runs only on a single server
• Does not support the “yarn-cluster” mode
• Prone to freezing
[Diagram: a single Apache Zeppelin Server hosting Apache Zeppelin and Driver Programs 1 to 5]
Slide 16
Slide 16 text
OASIS
Slide 17
Slide 17 text
Agenda 1. Motivation
2. Features
3. Use Cases
Slide 18
Slide 18 text
SYSTEM ARCHITECTURE
[Diagram: End Users; OASIS (Frontend / API, Job Scheduler, Spark Interpreter); MySQL; Redis; Hadoop YARN Cluster; HDFS / Apache Ranger]
Slide 19
Slide 19 text
NOTEBOOK CREATION
Slide 20
Slide 20 text
SPARK APPLICATION
• Launched once per notebook session
• Uses the notebook author’s account to access HDFS
• Supports Spark, Spark SQL, PySpark, and SparkR
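The "one Spark application per notebook session, running as the notebook's author" behavior can be sketched as a small registry. This is a hypothetical model, not OASIS's implementation: the class names and `interpreter.jar` are placeholders, and using `spark-submit --proxy-user` for impersonation is an assumption about one possible mechanism. The command is only built, never executed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SparkApp:
    session_id: str
    run_as: str          # account used for HDFS access
    command: List[str]   # command that would launch the application


class SessionRegistry:
    """Keeps at most one Spark application per notebook session."""

    def __init__(self) -> None:
        self._apps: Dict[str, SparkApp] = {}

    def get_or_launch(self, session_id: str, notebook_author: str) -> SparkApp:
        app = self._apps.get(session_id)
        if app is None:
            # Assumed mechanism: impersonate the notebook's author so that
            # HDFS authorization is evaluated against the author's account.
            command = [
                "spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
                "--proxy-user", notebook_author,
                "interpreter.jar",  # hypothetical application jar
            ]
            app = SparkApp(session_id, notebook_author, command)
            self._apps[session_id] = app
        return app
```

Repeated requests within the same session reuse the existing application, while a new session gets its own.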
Slide 21
Slide 21 text
SPARK APPLICATION SHARING
Slide 22
Slide 22 text
NOTEBOOK SHARING
• Notebooks can be shared within a “space”
• A “space” is the root directory of notebooks for each LINE service
• Access rights: “read write”, “read only”
[Diagram: Space 1 and Space 2, each with its own Read Write Users, Read Only Users, and Notebooks]
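The space model above can be sketched as a small data structure. The `Space` and `Access` names and the user names are hypothetical; only the two access rights and the per-space grouping come from the slide.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Access(Enum):
    READ_ONLY = "read only"
    READ_WRITE = "read write"


@dataclass
class Space:
    """Root directory of notebooks for one service, with per-user access rights."""
    name: str
    members: Dict[str, Access] = field(default_factory=dict)

    def access_of(self, user: str) -> Optional[Access]:
        # Users who are not members of the space have no access at all.
        return self.members.get(user)

    def can_read(self, user: str) -> bool:
        return self.access_of(user) is not None

    def can_write(self, user: str) -> bool:
        return self.access_of(user) is Access.READ_WRITE


space1 = Space("Space 1", {"alice": Access.READ_WRITE, "bob": Access.READ_ONLY})
```

Here `alice` can read and edit notebooks in `space1`, `bob` can only read them, and anyone else is denied.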
Slide 23
Slide 23 text
SCHEDULING
• Executes a notebook automatically
• Keeps contents of a notebook up to date
• Runs ETL processing periodically
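Periodic notebook execution can be sketched as a fixed-interval schedule that a scheduler loop polls. This is a minimal illustration, not OASIS's job scheduler: the `NotebookSchedule` class, the fixed-interval model, and the policy of skipping missed slots are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Callable


class NotebookSchedule:
    """Runs one notebook at a fixed interval, starting at `first_run`."""

    def __init__(self, notebook_id: str, interval: timedelta, first_run: datetime) -> None:
        self.notebook_id = notebook_id
        self.interval = interval
        self.next_run = first_run

    def run_if_due(self, now: datetime, execute: Callable[[str], None]) -> bool:
        """Execute the notebook if its next run time has arrived; return whether it ran."""
        if now < self.next_run:
            return False
        execute(self.notebook_id)
        # Skip slots that were missed so an outage does not cause a burst of runs.
        while self.next_run <= now:
            self.next_run += self.interval
        return True
```

A scheduler process would call `run_if_due` for each registered notebook on every tick, passing a callback that actually executes the notebook.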
Slide 24
Slide 24 text
MULTIPLE SERVERS
• Scalable
• Highly available
[Diagram: OASIS components (Frontend / API, Job Scheduler, Spark Interpreter) with MySQL and Redis]