Slide 1

Slide 1 text

The future challenges of LINE Data Platform
Data Platform dept., LINE Corporation

Slide 2

Slide 2 text

Tasuku OKUDA (奥田輔)
Engineering Manager, Data Engineering1 team, Data Platform dept., Data Engineering Center, LINE
Joined LINE as a new graduate in 2013
LINE Game DBA (MySQL, MongoDB) → ETL engineer for LINE app → Ingestion Pipeline developer for server log → Hadoop migration project leader

Slide 3

Slide 3 text

Agenda
1. Mission
2. Data Platform in LINE
   i. Architecture
   ii. KPI for Scale
3. Challenges
4. Public Activity
5. Conclusion

Slide 4

Slide 4 text

Mission

Slide 5

Slide 5 text

Mission - LINE wide

Slide 6

Slide 6 text

CLOSING THE DISTANCE
https://linecorp.com/ja/company/mission

Slide 7

Slide 7 text

LINE STYLE

Slide 8

Slide 8 text

LINE STYLE 04: Always Data-driven
Trust data (that is, facts), not gut feeling.

Slide 9

Slide 9 text

Mission - Data Platform

Slide 10

Slide 10 text

Make Data-driven easy

Slide 11

Slide 11 text

Make Data-driven easy!

Slide 12

Slide 12 text

Governed, Integrated, Self-service data platform

Slide 13

Slide 13 text

~2020

Slide 14

Slide 14 text

2021~!

Slide 15

Slide 15 text

Data Democracy

Slide 16

Slide 16 text

As ML infrastructure

Slide 17

Slide 17 text

Data Platform in LINE

Slide 18

Slide 18 text


Slide 19

Slide 19 text


Slide 20

Slide 20 text

Architecture

Slide 21

Slide 21 text

[Architecture diagram]
Layers: Tool/API, Compute, Storage, Data Governance
Components shown: HDFS, HBase, Elasticsearch, Kafka, YARN, Kubernetes, Hive, Spark, Trino, Flink, Ranger, Yanagishima, OASIS, LINE Analytics, Portal, Tableau, Jupyter, RStudio, Datahub, Central Dogma, Kibana, Grafana, Prometheus

Slide 22

Slide 22 text

Data Collecting
Components shown: Kafka, Flink, HDFS, Elasticsearch, External System, Kubernetes

Slide 23

Slide 23 text

Data Analyzing
Components shown: HDFS, YARN / Kubernetes, Hive, Spark, Trino, Yanagishima, OASIS, LINE Analytics, Tableau, Jupyter, RStudio, Datahub

Slide 24

Slide 24 text

KPI for Scale

Slide 25

Slide 25 text

HDFS Capacity: 270 PB

Slide 26

Slide 26 text

HDFS Daily Increase: 410 TB/day

Slide 27

Slide 27 text

Managed servers (PM/VM): 5,436

Slide 28

Slide 28 text

Hive tables: 56,000

Slide 29

Slide 29 text

YARN/Presto jobs: 300,000 jobs/day

Slide 30

Slide 30 text

Pipeline incoming records: 13,000,000 records/sec

Slide 31

Slide 31 text

Engineers in Data Platform (JP/KR): 75

Slide 32

Slide 32 text

Challenges

Slide 33

Slide 33 text

Data Democracy

Slide 34

Slide 34 text

Data Observability
Data Democracy

Slide 35

Slide 35 text

Data Discovery
• What data do we have?
• What kind of data?
• How much does it cost?
• Who is the data owner?
Universal Catalog (diagram labels): Hive, Kafka, HBase, MySQL, MongoDB, ObjStorage, Delta Lake, Iceberg, Hudi, Streaming, Snapshot, CDC, Core, ML, DS, Service, Client, External, Storage, Computing, Users, Daily, Monthly, Budget

Slide 36

Slide 36 text

Capacity planning, Archival Storage, IDC design, Network planning, Resource optimization, Kubernetes, ObjStorage, Erasure Coding
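To make the capacity planning topic concrete, here is a minimal back-of-the-envelope sketch using the two HDFS figures from the KPI slides (270 PB capacity, 410 TB/day increase). Everything else is an assumption for illustration: the helper name `days_until_full`, decimal units (1 PB = 1000 TB), constant linear growth, and the `used_pb` inputs, since the slides do not state how much of the capacity is already used. It is not LINE's actual planning method.

```python
# Hypothetical helper for illustration; not taken from the talk.
TB_PER_PB = 1000  # assumption: decimal units


def days_until_full(capacity_pb: float, used_pb: float, daily_increase_tb: float) -> float:
    """Days until used space reaches capacity, assuming constant linear growth."""
    if daily_increase_tb <= 0:
        raise ValueError("daily_increase_tb must be positive")
    free_tb = max(capacity_pb - used_pb, 0) * TB_PER_PB
    return free_tb / daily_increase_tb


# 270 PB and 410 TB/day come from the KPI slides; the used_pb values are made up.
from_empty = days_until_full(270, 0, 410)
from_half = days_until_full(270, 135, 410)
print(f"{from_empty:.1f} days from empty, {from_half:.1f} days from half full")
```

Real planning would also have to account for replication or erasure coding overhead, data expiry, and hardware lead time, none of which this sketch models.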

Slide 37

Slide 37 text

As ML infrastructure

Slide 38

Slide 38 text

Data Reactivity
As ML infrastructure

Slide 39

Slide 39 text

Online Storage, Offline Storage, E2E pipeline latency
Components shown: HDFS, TiDB, HBase, Elasticsearch, CockroachDB, Kafka, Flink
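One way to reason about E2E pipeline latency is to compare, per record, the time the event happened with the time it became readable in storage, then look at percentiles. The sketch below assumes each record is a hypothetical `(event_time, available_time)` pair in seconds and uses the nearest-rank percentile definition; the function names and sample data are illustrative and not from the talk.

```python
import math
from typing import List, Sequence, Tuple


def e2e_latencies(records: Sequence[Tuple[float, float]]) -> List[float]:
    """Latency per record: time it became readable minus time the event occurred."""
    return [available - event for event, available in records]


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of a non-empty sequence, for 0 < p <= 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Made-up sample: (event_time, available_time) in seconds.
sample = [(0, 2), (10, 11), (20, 25), (30, 40)]
latencies = e2e_latencies(sample)
print("p50:", percentile(latencies, 50), "p99:", percentile(latencies, 99))
```

Tracking a high percentile alongside the median helps show whether a few slow records dominate the experience of latency-sensitive consumers.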

Slide 40

Slide 40 text

Data mutation/versioning
Table formats: Iceberg, Delta Lake, Hudi
Features: Schema Evolution, ACID, Time Travel, Partition Evolution
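The idea behind time travel can be illustrated with a toy snapshot log: every commit produces a new immutable version, and readers may ask for any earlier version. The class below is a hypothetical in-memory model written for this note. It is not the API of Iceberg, Delta Lake, or Hudi, which track snapshots through metadata files rather than full copies.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


class VersionedTable:
    """Toy append-only snapshot log; each commit stores a full copy of the rows."""

    def __init__(self) -> None:
        self._snapshots: List[List[Row]] = []

    def commit(self, rows: List[Row]) -> int:
        """Store a new snapshot and return its version number (0-based)."""
        self._snapshots.append([dict(r) for r in rows])
        return len(self._snapshots) - 1

    def read(self, version: Optional[int] = None) -> List[Row]:
        """Read the latest snapshot, or an older one when a version is given."""
        if not self._snapshots:
            return []
        if version is None:
            version = len(self._snapshots) - 1
        if not 0 <= version < len(self._snapshots):
            raise IndexError(f"unknown version {version}")
        return [dict(r) for r in self._snapshots[version]]


table = VersionedTable()
v0 = table.commit([{"user_id": 1, "country": "JP"}])
v1 = table.commit([{"user_id": 1, "country": "KR"}])  # a mutation creates a new version
print(table.read(v0), table.read())
```

Because old snapshots are never modified, a training job can pin a version and reproduce its input even after the table has been updated.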

Slide 41

Slide 41 text

Public Activity

Slide 42

Slide 42 text

Public activity – LINE DEVDAY
https://linedevday.linecorp.com/
• Access analysis of Data Platform users
  https://linedevday.linecorp.com/2020/en/sessions/0974
• 100+PB scale Unified Hadoop cluster Federation with 2k+ nodes
  https://linedevday.linecorp.com/jp/2019/sessions/D1-5

Slide 43

Slide 43 text

Public activity – LINE Engineering Blog
https://engineering.linecorp.com/blog/
• Introduce Data Platform Department
  https://engineering.linecorp.com/ja/blog/introduce-data-platform-department/
• Introducing the team that develops and operates the middleware and data ingestion pipeline of LINE's company-wide data platform
  https://engineering.linecorp.com/ja/blog/data-infrastructure-ingestion-pipeline/
• Introducing Frey: LINE’s new self-service batch ingestion system
  https://engineering.linecorp.com/en/blog/introducing-frey-lines-new-self-service-batch-ingestion-system/
• How we migrated a Hadoop cluster without downtime
  https://engineering.linecorp.com/ja/blog/migrating-a-hadoop-cluster-without-downtime/

Slide 44

Slide 44 text

Public activity – OSS contribution
• HDFS
• Hive
• Spark
• Ranger
• Flink
• Trino (formerly PrestoSQL)

Slide 45

Slide 45 text

Conclusion

Slide 46

Slide 46 text

[Summary diagram] CLOSING THE DISTANCE, Data Reactivity, Data Democracy, Data Observability, Always Data-driven, As ML infrastructure, LINE CODE 04

Slide 47

Slide 47 text

One more thing…

Slide 48

Slide 48 text

We are hiring!

Slide 49

Slide 49 text

LINE Careers
https://linecorp.com/ja/career/

Slide 50

Slide 50 text

LINE New Graduate Recruitment 2022
https://linecorp.com/ja/career/new grads/

Slide 51

Slide 51 text

Data Platform Open Position
https://linecorp.com/ja/career/ja/all?text=data%20platform
• Data Platform Engineer
• Software Engineer
• Site Reliability Engineer
• Distributed System Administrator
• Elasticsearch Engineer
• Solution Engineer
• Product Manager

Slide 52

Slide 52 text

Thank you!