The future of LINE data platform we are aiming for

Tasuku Okuda (奥田輔), Manager, Data Engineering1 team, LINE Corporation

LINE careers
https://linecorp.com/ja/career/

Data Platform open positions at LINE Corporation
https://linecorp.com/ja/career/ja/all?text=data%20platform

LINE new graduate recruitment 2022
https://linecorp.com/ja/career/newgrads/

LINE Developers

May 19, 2021

Transcript

  1. The future challenges of LINE Data Platform. Data Platform dept., LINE Corporation

  2. 奥田輔 Tasuku OKUDA, Engineering Manager, Data Engineering1 team, Data Platform dept., Data Engineering Center. Joined LINE as a new graduate in 2013. LINE Game DBA (MySQL, MongoDB) → ETL engineer for LINE app → Ingestion Pipeline developer for server log → Hadoop migration project leader

  3. Agenda: 1. Mission 2. Data Platform in LINE (i. Architecture, ii. KPI for Scale) 3. Challenges 4. Public Activity 5. Conclusion
  4. Mission

  5. Mission - LINE wide

  6. CLOSING THE DISTANCE https://linecorp.com/ja/company/mission

  7. LINE STYLE

  8. LINE STYLE 04: Always Data-driven. Trust data (that is, facts), not intuition

  9. Mission - Data Platform

  10. Make Data-driven easy

  11. Make Data-driven easy!

  12. Governed, Integrated, Self-service data platform

  13. ~2020

  14. 2021~!

  15. Data Democracy

  16. As ML infrastructure

  17. Data Platform in LINE

  18. 18

  19. 19

  20. Architecture

  21. Tool/API Compute Storage Data Governance HDFS HBase Elasticsearch Kafka YARN Kubernetes Hive Spark Trino Flink Ranger Yanagishima OASIS LINE Analytics Portal Tableau Jupyter RStudio Datahub Central Dogma Kibana Grafana Prometheus

  22. Data Collecting: Kafka Flink HDFS Elasticsearch External System Kubernetes

  23. Data Analyzing: HDFS YARN / Kubernetes Hive Spark Trino Yanagishima OASIS LINE Analytics Tableau Jupyter RStudio Datahub
  24. KPI for Scale

  25. HDFS Capacity: 270 PB

  26. HDFS Daily Increase: 410 TB/day

  27. Managing servers (PM/VM): 5,436 servers

  28. Hive tables: 56,000 tables

  29. YARN/Presto jobs: 300,000 jobs/day

  30. Pipeline incoming records: 13,000,000 records/sec

  31. Engineers in Data Platform (JP/KR): 75

  32. Challenges

  33. Data Democracy

  34. Data Observability. Data Democracy

  35. Data Discovery: What data do we have? What kind of data? How much does it cost? Who is the data owner? Universal Catalog Hive Kafka HBase MySQL MongoDB ObjStorage Deltalake Iceberg Hudi Streaming Snapshot CDC Core ML DS Service Client External Storage Computing Users Daily Monthly Budget

  36. Capacity planning Archival Storage IDC design Network planning Resource optimization Kubernetes ObjStorage Erasure Coding

  37. As ML infrastructure

  38. Data Reactivity. As ML infrastructure

  39. Online Storage Offline Storage E2E pipeline latency HDFS TiDB HBase Elasticsearch CockroachDB Kafka Flink

  40. Data mutation/versioning Iceberg Deltalake Hudi Schema Evolution ACID Time Travel Partition Evolution
  41. Public Activity

  42. Public activity – LINE DEVDAY https://linedevday.linecorp.com/ • Access analysis of Data Platform users https://linedevday.linecorp.com/2020/en/sessions/0974 • 100+PB scale Unified Hadoop cluster Federation with 2k+ nodes https://linedevday.linecorp.com/jp/2019/sessions/D1-5

  43. Public activity – LINE Engineering Blog https://engineering.linecorp.com/blog/ • Introduce Data Platform Department https://engineering.linecorp.com/ja/blog/introduce-data-platform-department/ • Introducing the team that develops and operates the middleware and data ingestion pipeline for LINE's company-wide data platform https://engineering.linecorp.com/ja/blog/data-infrastructure-ingestion-pipeline/ • Introducing Frey: LINE’s new self-service batch ingestion system https://engineering.linecorp.com/en/blog/introducing-frey-lines-new-self-service-batch-ingestion-system/ • How we migrated a Hadoop cluster without downtime https://engineering.linecorp.com/ja/blog/migrating-a-hadoop-cluster-without-downtime/
  44. Public activity – OSS contribution • HDFS • Hive • Spark • Ranger • Flink • Trino (formerly PrestoSQL)
  45. Conclusion

  46. CLOSING THE DISTANCE Data Reactivity Data Democracy Data Observability Always Data-driven As ML infrastructure LINE CODE 04

  47. One more thing…

  48. We are hiring!

  49. LINE careers https://linecorp.com/ja/career/

  50. LINE new graduate recruitment 2022 https://linecorp.com/ja/career/newgrads/

  51. Data Platform Open Position https://linecorp.com/ja/career/ja/all?text=data%20platform • Data Platform Engineer • Software Engineer • Site Reliability Engineer • Distributed System Administrator • Elasticsearch Engineer • Solution Engineer • Product Manager

  52. Thank you!