OASIS : LINE’s Data Analysis Tool Using Apache Spark

OASIS : LINE’S DATA ANALYSIS TOOL USING APACHE SPARK
Keiji Yoshida - LINE Corporation

OASIS
• A web-based data analysis platform
• Enables employees to analyze their service’s data

Agenda
1. Motivation
2. Features
3. Use Cases

DATA PLATFORM
[Diagram: LINE Ads Platform, LINE Creators Market, LINE NEWS, LINE Pay, LINE LIVE, LINE MOBILE; Hadoop Cluster (Data Lake); ETL, Analysis, BI / Reporting]

DATA OPEN
• Makes the Hadoop cluster public within LINE
• Enables employees to analyze their service’s data as they like
• Speeds up their data analysis process and decision making
[Diagram: Multi-tenant Hadoop Cluster; LINE Ads Platform, LINE Creators Market]

REQUIREMENTS
1. Security
2. Stability
3. Features

1. SECURITY
• Strict access control
• Allows employees to access only their service’s data
[Diagram: Multi-tenant Hadoop Cluster; LINE Ads Platform, LINE Creators Market]

1. SECURITY
• Kerberos authentication
• Apache Ranger for authorization

2. STABILITY
• Isolation of applications
• Resource control
[Diagram: Multi-tenant Hadoop Cluster; App 1 to App 6]

2. STABILITY
• Apache Spark on YARN
• Utilizes Apache YARN’s resource control mechanism
[Diagram: Multi-tenant Hadoop Cluster]
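One way to picture the "Spark on YARN" point is as the command line used to hand an application to YARN, where the queue and the executor settings bound what a single tenant can consume. The sketch below only assembles standard `spark-submit` options into a list. `SparkJobSpec`, `build_spark_submit`, the queue name, and the default resource values are hypothetical illustrations; the slides do not describe how OASIS actually submits its applications.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SparkJobSpec:
    """Hypothetical description of one Spark application (not an OASIS type)."""
    app_file: str
    queue: str                    # YARN queue that caps this tenant's resources
    num_executors: int = 2        # illustrative defaults, not values from the talk
    executor_memory: str = "2g"
    executor_cores: int = 1
    driver_memory: str = "1g"


def build_spark_submit(spec: SparkJobSpec) -> List[str]:
    """Return a spark-submit argument list that targets YARN."""
    return [
        "spark-submit",
        "--master", "yarn",
        # Assumption: cluster mode, so the driver runs inside a YARN container.
        "--deploy-mode", "cluster",
        "--queue", spec.queue,
        "--num-executors", str(spec.num_executors),
        "--executor-memory", spec.executor_memory,
        "--executor-cores", str(spec.executor_cores),
        "--driver-memory", spec.driver_memory,
        spec.app_file,
    ]


if __name__ == "__main__":
    print(" ".join(build_spark_submit(SparkJobSpec("job.py", queue="line_news"))))
```

Keeping the resource limits in one small structure makes it easy to apply a different queue or executor budget per service.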

3. FEATURES
(O = has the skill, X = does not)

Skill                              Role            Required Features
SQL  Programming  Data Science
X    X            X                Manager         Result Sharing
O    X            X                Planner         Query Result Visualization
O    O            X                Engineer        ETL
O    O            O                Data Scientist  Ad Hoc Data Analysis

APACHE ZEPPELIN 0.7.3 : SECURITY
• Configurable execution user

APACHE ZEPPELIN 0.7.3 : SECURITY
• Launches a Spark application with another user account
• Circumvents Apache Ranger authorization
[Diagram: User A; Apache Zeppelin; Spark Application : User B; HDFS / Apache Ranger]

APACHE ZEPPELIN 0.7.3 : STABILITY
• Runs only on a single server
• Does not support the “yarn-cluster” mode
• Freezes easily
[Diagram: Apache Zeppelin Server; Apache Zeppelin; Driver Program 1 to Driver Program 5]

SYSTEM ARCHITECTURE
[Diagram: End Users; OASIS; Frontend / API; Job Scheduler; Spark Interpreter; MySQL; Redis; Hadoop YARN Cluster; HDFS / Apache Ranger]

NOTEBOOK CREATION

SPARK APPLICATION
• Launched per notebook session
• Uses the notebook author’s account for accessing HDFS
• Supports Spark, Spark SQL, PySpark, and SparkR
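The "one application per notebook session, run as the author" rule can be sketched as a small registry keyed by notebook. This is an illustration only: `NotebookSession`, `SessionRegistry`, and the `app-N` identifiers are hypothetical names, and nothing here launches a real Spark application. It just records which account a session would use.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict


@dataclass(frozen=True)
class NotebookSession:
    notebook_id: str
    run_as: str   # account used for HDFS access: the notebook's author
    app_id: str   # placeholder for the Spark application's identifier


class SessionRegistry:
    """Keeps at most one live session per notebook (hypothetical sketch)."""

    def __init__(self) -> None:
        self._sessions: Dict[str, NotebookSession] = {}
        self._ids = count(1)

    def get_or_launch(self, notebook_id: str, author: str) -> NotebookSession:
        # Reuse the notebook's existing session; otherwise record a new one
        # that runs as the author, not as whoever happens to open the notebook.
        session = self._sessions.get(notebook_id)
        if session is None:
            session = NotebookSession(notebook_id, author, f"app-{next(self._ids)}")
            self._sessions[notebook_id] = session
        return session

    def close(self, notebook_id: str) -> None:
        # Ending the session drops the record; the next request starts fresh.
        self._sessions.pop(notebook_id, None)
```

Tying the execution account to the author keeps HDFS access checks aligned with a single, known identity for each notebook.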

SPARK APPLICATION SHARING

NOTEBOOK SHARING
• Notebooks can be shared within a “space”
• “space” : root directory of notebooks for each LINE service
• Access rights: “read write”, “read only”
[Diagram: Space 1 and Space 2, each with Read Write Users, Read Only Users, and Notebooks]
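The space model above can be sketched as a small data structure: each space maps users to one of the two access rights, and a user with no entry has no access. `Access`, `Space`, and the method names are hypothetical and chosen for illustration; the slides do not show how OASIS stores or checks these rights.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Access(Enum):
    READ_ONLY = "read only"
    READ_WRITE = "read write"


@dataclass
class Space:
    """Root directory of notebooks for one service (hypothetical sketch)."""
    name: str
    members: Dict[str, Access] = field(default_factory=dict)

    def access_of(self, user: str) -> Optional[Access]:
        # None means the user is not a member of this space.
        return self.members.get(user)

    def can_read(self, user: str) -> bool:
        # Both access rights allow reading notebooks in the space.
        return self.access_of(user) is not None

    def can_write(self, user: str) -> bool:
        # Only "read write" members may modify notebooks.
        return self.access_of(user) is Access.READ_WRITE
```

For example, a space with `{"alice": Access.READ_WRITE, "bob": Access.READ_ONLY}` lets both users read, lets only `alice` write, and rejects everyone else.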

SCHEDULING
• Executes a notebook automatically
• Keeps contents of a notebook up to date
• Runs ETL processing periodically
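Periodic execution boils down to deciding when a notebook should run next. The sketch below assumes a fixed-interval schedule anchored at a start time; `next_run` and its parameters are hypothetical, and the slides do not say which scheduling model the OASIS job scheduler uses.

```python
from datetime import datetime, timedelta


def next_run(start: datetime, interval: timedelta, now: datetime) -> datetime:
    """Return the first scheduled time strictly after `now`.

    Scheduled times are start, start + interval, start + 2 * interval, ...
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if now < start:
        return start
    # Number of whole intervals already elapsed, plus one for the next slot.
    elapsed_slots = (now - start) // interval
    return start + (elapsed_slots + 1) * interval
```

With an hourly interval starting at midnight, a call made at 02:30 yields 03:00, so a notebook that refreshes a report or runs an ETL step is picked up at the next whole hour.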

MULTIPLE SERVERS
• Scalable
• Highly available
[Diagram: OASIS; Frontend / API; Job Scheduler; Spark Interpreter; MySQL; Redis]

Users 60+ Spaces 3,500+ Notebooks 550+ STATS

HADOOP CLUSTER (DATA LAKE)
• 500 DataNodes / NodeManagers
• HDFS usage : 30PB
• 150+ Hive databases
• 1,500+ Hive tables

USE CASES
1. Report
2. Interactive dashboard
3. ETL
4. Ad hoc analysis

FUTURE WORK
• Multiple Hadoop clusters
• Visualization (Charts : Area, Scatter, Pie, …)
• OSS
• Authentication
• Refactoring
• Internal review

DATA ENGINEERING MEETUP #1 https://dem.connpass.com/event/120994/

THANK YOU
