Creating Stream DataPipeline on GCP Using Apache Beam

Shu Suzuki
September 12, 2019


We built a scalable and flexible stream data pipeline for our microservices
on Google Cloud Platform (GCP), using Cloud Pub/Sub, Google Cloud Storage, BigQuery, and Cloud Dataflow with Apache Beam. The pipeline runs in production at Mercari, one of the biggest C2C e-commerce services in Japan. It currently accepts logs from 5+ microservices, and that number will keep growing.


Transcript

  1. 1 Creating a Streaming Data Pipeline on Google Cloud Platform

    using Apache Beam {“id”: “@shoe116”, “team”: “Data Platform”}
  2. 2 Shuichi Suzuki

    - Mercari Data Platform Team - Data Engineer at Mercari since 2018 - Beam, Kafka, Storm, Hive, Hadoop... - Twitter/GitHub @shoe116
  3. 3 Agenda

    01 About Mercari 02 Monolith to Microservices, On-premises to Cloud 03 Updating Data Pipeline for Microservice Architecture 04 New Stream Data Pipeline on GCP
  4. 4 What is Mercari? C2C marketplace app that allows users

    to enjoy buying and selling
  5. 5 By the Numbers (JP/Full Year)

    GMV¹ (billion JPY): 232 (FY 06/2017), 346.8 (FY 06/2018), 490.2 (FY 06/2019)
    Net Sales (billion JPY): 21.2 (FY 06/2017), 33.4 (FY 06/2018), 46.2 (FY 06/2019)
    MAU² (million people): 8.45 (FY 06/2017), 10.75 (FY 06/2018), 13.57 (FY 06/2019)
    1. GMV after cancellation
    2. Monthly Active Users in June: number of registered users that used our app in the month
    Source: internal documents, from the FY2018.6 presentation material
  6. 6 Company Profile

    Established: February 1st, 2013. Offices: Tokyo, Sendai, Fukuoka, Palo Alto, Portland. Headcount: approx. 1,800 including subsidiaries. Japan's first unicorn: listed on the Tokyo Stock Exchange's Mothers market (a board for high-growth companies) in June 2018.
  7. 7 Monolith to Microservices, On-premises to Cloud

  8. 8 Why Microservices on Cloud?

    To scale the organization: 1. To give all developers ownership 2. To develop & improve more rapidly 3. To collaborate with diverse talent
  9. 9 System Architecture: Before Microservices MySQL Mercari API On Premise

  10. 10 System Architecture: Introduce API gateway API gateway MySQL Kubernetes

    Mercari API On Premise
  11. 11 System Architecture: Add a service API gateway Authority Service

    A MySQL Kubernetes Mercari API Cloud Spanner On Premise
  12. 12 System Architecture: Current API gateway Authority Service A MySQL

    Mercari API Service B Service D Service C Cloud Spanner On Premise
  13. 13 System Architecture: Future API gateway Authority Service A MySQL

    Kubernetes Mercari API Service B Service D Service C Cloud Spanner On Premise Logic migration Datastore migration
  14. 14 Updating Data Pipeline for Microservice Architecture

  15. 15 What is a Data Pipeline?

    A data pipeline is: 1. Moving data from sources to sinks 2. High-throughput, low-latency 3. Highly available and scalable. In particular, I am speaking about a stream data pipeline that sends logs from production to the data warehouse for analytics.
  16. 16 Before Microservices

    The pipeline was very simple because the source was the monolith. Monolith Mercari API
  17. 17 After Microservices

    The pipeline now has multiple data sources. We have to adapt our pipeline to the new microservice architecture.
  18. 18 Our Technical Challenges

    • Handling ever more data in an ever more efficient way ◦ 300K+ requests/sec come from the API gateway alone. ◦ More and more services generate much more data traffic. • Processing schemaful data with more flexibility ◦ Each microservice has its own schema to express its behavior. ◦ Each schema evolves independently, because it depends on that service's own business logic. • Don't build a pipeline per microservice ◦ The number of microservices also fluctuates. ◦ The Data Platform should not control their life cycles.
  19. 19 New Stream Data Pipeline on Google Cloud Platform

  20. 20 Design guidelines for the new stream data pipeline

    1. Split the log collection and data processing phases 2. Support structured output with schema evolution 3. Keep high throughput and scalability for multiple inputs 4. Use GCP, because our data sources and sinks run on GCP
  21. 21 Using GCP Managed Services

    Cloud Pub/Sub: scalable message queue, like Kafka. Cloud Dataflow: distributed processing engine, like Flink. Cloud Storage: object storage to store unstructured data. BigQuery: scalable, managed DWH for analytics.
  22. 22 Overview of new stream pipeline on GCP

  23. 23 3 Types of Message Queue (Cloud Pub/Sub)

  24. 24 3 Types of Message Queue (Cloud Pub/Sub)

    Ramp: each service has its own ‘Ramp’ topic for the pipeline. Services post serialized protobuf messages (byte[]) with Map[String, String] attributes to the topic (a publishing sketch follows below). DataHub: all data are aggregated into the DataHub as records in a common Avro format; we have a Raw DataHub and a Structured DataHub. Dead-Letter Hub: aggregates ‘dead-letter’ messages that can't be processed successfully.
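
    As a rough illustration of the Ramp contract above, here is a minimal sketch of how a producing service might publish a serialized protobuf payload with string attributes, using the Google Cloud Pub/Sub Java client. The project name, topic name, attribute keys, and the buildItemEventProto() helper are illustrative assumptions, not Mercari's actual names.

    import com.google.cloud.pubsub.v1.Publisher;
    import com.google.protobuf.ByteString;
    import com.google.pubsub.v1.PubsubMessage;
    import com.google.pubsub.v1.TopicName;
    import java.util.Map;

    public class RampPublisherExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical project and Ramp topic names, for illustration only.
        TopicName topic = TopicName.of("my-gcp-project", "ramp-item-service");
        Publisher publisher = Publisher.newBuilder(topic).build();
        try {
          // The service serializes its own protobuf message to bytes...
          byte[] protobufPayload = buildItemEventProto(); // hypothetical helper
          // ...and attaches String attributes describing the log.
          PubsubMessage message =
              PubsubMessage.newBuilder()
                  .setData(ByteString.copyFrom(protobufPayload))
                  .putAllAttributes(
                      Map.of(
                          "service_name", "item-service",   // illustrative attribute keys
                          "log_name", "item_created",
                          "content_type", "application/protobuf"))
                  .build();
          // Publish asynchronously; get() blocks only to keep the example simple.
          String messageId = publisher.publish(message).get();
          System.out.println("Published message " + messageId);
        } finally {
          publisher.shutdown();
        }
      }

      private static byte[] buildItemEventProto() {
        return new byte[0]; // placeholder for a real serialized protobuf message
      }
    }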
  25. 25 3 Types of Stores (GCS and BigQuery)

  26. 26 3 Types of Stores (GCS and BigQuery)

    DataLake: all data are written to the DataLake (GCS) as Avro files; we have two types of DataLake, raw and structured. Dead-Letter Lake: all dead-letter messages are stored in the Dead-Letter Lake (GCS) as Avro files. BigQuery: all data in the Structured DataLake is uploaded to BigQuery; Avro files are very well suited to this (see the load sketch below).
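
    To illustrate why Avro on GCS is convenient here: BigQuery can load Avro files natively, deriving the table schema from the schema embedded in the files. A minimal sketch using the google-cloud-bigquery Java client; the project, dataset, table, and bucket names are illustrative assumptions.

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FormatOptions;
    import com.google.cloud.bigquery.Job;
    import com.google.cloud.bigquery.JobInfo;
    import com.google.cloud.bigquery.LoadJobConfiguration;
    import com.google.cloud.bigquery.TableId;

    public class LoadAvroToBigQueryExample {
      public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Illustrative dataset/table and DataLake path; the Avro schema travels with the files.
        TableId table = TableId.of("datahub", "item_created");
        LoadJobConfiguration config =
            LoadJobConfiguration.newBuilder(
                    table, "gs://my-structured-datalake/item-service/item_created/*.avro")
                .setFormatOptions(FormatOptions.avro())
                .build();

        // Run the load job and wait for completion (blocking only for the example).
        Job job = bigquery.create(JobInfo.of(config)).waitFor();
        System.out.println("Load job finished: " + job.getStatus());
      }
    }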
  27. 27 ETL processes using Apache Beam and Avro

  28. 28 Apache Beam and Spotify Scio

    Apache Beam: • A unified programming model and SDK for batch and stream data processing • Supports multiple engines: Apache Apex, Flink, Spark, Samza... and Google Cloud Dataflow • Currently the only SDK for writing Dataflow jobs (the Cloud Dataflow SDK will be decommissioned). Spotify Scio: • A Scala API for Apache Beam and Cloud Dataflow • Makes handling collections much easier • Our team members prefer Scala over Java. (A minimal Beam pipeline sketch follows below.)
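
    For readers new to Beam, a minimal sketch of the unified model: the same pipeline code runs locally on the DirectRunner or on Cloud Dataflow, chosen purely through pipeline options. This is plain Beam Java (the team writes its jobs in Scala with Scio, but the underlying model is identical), and the transform itself is a toy example.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class BeamHelloPipeline {
      public static void main(String[] args) {
        // The execution engine (DirectRunner locally, DataflowRunner in production)
        // is chosen via options such as --runner=DataflowRunner;
        // the transform code below stays the same either way.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            .apply("CreateInput", Create.of("pub/sub", "dataflow", "bigquery"))
            .apply("ToUpperCase",
                MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()));

        pipeline.run().waitUntilFinish();
      }
    }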
  29. 29 DataHub Avro Protocol

    Our pipeline's internal message format. Why Avro? - We can write Avro to GCS easily. - We can load Avro into BQ easily. - We can query Avro files on GCS as BQ external tables. A DataHub Avro record contains: - metadata of the record (UUID, timestamp, etc.) - output destinations and schema information - content type (Avro, Protobuf, etc.) - the data payload
  30. 30 Schema of ‘DataHubAvro’

    {"type": "record", "name": "DataHubAvro", "namespace": "com.mercari.data.model.v3", "fields": [ {"name": "uuid", "type": "string"}, {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-micros"}}, {"name": "topic_name", "type": "string"}, {"name": "service_name", "type": "string"}, {"name": "log_name", "type": "string"}, {"name": "content_type", "type": ["null", "string"], "default": null}, {"name": "user_agent", "type": ["null", "string"], "default": null}, {"name": "payload", "type": "bytes"} ]} (highlighting the metadata fields, e.g. uuid and timestamp)
  31. 31 Schema of ‘DataHubAvro’

    The same schema as above, highlighting the destination / schema info fields such as topic_name, service_name, and log_name.
  32. 32 Schema of ‘DataHubAvro’

    The same schema as above, highlighting the payload field, which carries the protobuf or Avro bytes. (A sketch of building such a record follows below.)
  33. 33 4 kinds of ETL processes

  34. 34 1. Aggregating Data from Ramps to RawDataHub

  35. 35 1. Aggregating Data from Ramps to RawDataHub

    E: Reading records from the Ramp topics, like the Mapper in the MapReduce model. T: Making a ‘DataHub Avro’ record from each raw protobuf message; the Avro record's payload is exactly the same byte array as the input. L: Writing the Avro records to the Raw DataHub or the Dead-Letter Hub, like the Reducer. (A minimal sketch follows below.)
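
    A minimal sketch of this first ETL job in Beam's Java SDK (the actual jobs are written with Scio, and dead-letter routing is omitted here). The subscription and topic names are illustrative, DataHubAvroRecordExample.wrap() is the hypothetical helper from the earlier sketch, and single-record Avro binary encoding is an assumption about the wire format.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class RampToRawDataHubJob {
      public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // "Mapper" side: read raw protobuf payloads plus attributes from a Ramp subscription.
        PCollection<PubsubMessage> wrapped =
            pipeline
                .apply("ReadRamp",
                    PubsubIO.readMessagesWithAttributes()
                        .fromSubscription("projects/my-project/subscriptions/ramp-item-service"))
                // Wrap each payload into a DataHubAvro record and re-encode it as Avro bytes.
                .apply("WrapAsDataHubAvro",
                    MapElements.into(TypeDescriptor.of(PubsubMessage.class))
                        .via((PubsubMessage msg) -> {
                          GenericRecord record = DataHubAvroRecordExample.wrap(
                              msg.getPayload(),
                              "ramp-item-service",
                              msg.getAttribute("service_name"),
                              msg.getAttribute("log_name"));
                          return new PubsubMessage(encodeAvro(record), msg.getAttributeMap());
                        }))
                .setCoder(PubsubMessageWithAttributesCoder.of());

        // "Reducer" side: publish the wrapped records to the Raw DataHub topic.
        // (The real job also routes failures to the Dead-Letter Hub, omitted here.)
        wrapped.apply("WriteRawDataHub",
            PubsubIO.writeMessages().to("projects/my-project/topics/raw-datahub"));

        pipeline.run();
      }

      // Single-record Avro binary encoding; the DataHubAvro schema is fixed and known to readers.
      private static byte[] encodeAvro(GenericRecord record) {
        try {
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
          new GenericDatumWriter<GenericRecord>(record.getSchema()).write(record, encoder);
          encoder.flush();
          return out.toByteArray();
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
    }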
  36. 36 2. Converting raw protobuf to Avro Generic Records

  37. 37 2. Converting raw records to structured Avro records

    E: Reading Avro records from the Raw DataHub and getting the schema info and the payload. T: Converting the raw protobuf messages into Avro records in the ‘Object Container Files’ format; the converted records carry their own schemas, so downstream consumers can use a ‘schema on read’ strategy. L: Writing the converted Avro records, with their schemas, to the Structured DataHub; messages that can't be converted go to the Dead-Letter Hub. (An encoding sketch follows below.)
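
    A minimal sketch of the ‘Object Container Files’ encoding step: serializing a structured GenericRecord so that the bytes embed the writer schema, which is what enables the schema-on-read strategy mentioned above. One record per message is an assumption made for illustration.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class OcfEncoderExample {
      /**
       * Serializes a structured record in Avro "Object Container Files" format.
       * The resulting bytes embed the writer schema, so downstream consumers can
       * read the data without a separate schema lookup.
       */
      public static byte[] toOcfBytes(Schema schema, GenericRecord record) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
          writer.create(schema, out); // writes the OCF header, including the schema
          writer.append(record);      // one record per message in this sketch
        }
        return out.toByteArray();
      }

      /** Reads OCF bytes back, recovering the embedded writer schema (schema on read). */
      public static GenericRecord fromOcfBytes(byte[] bytes) throws IOException {
        try (DataFileStream<GenericRecord> reader =
            new DataFileStream<>(
                new ByteArrayInputStream(bytes), new GenericDatumReader<GenericRecord>())) {
          return reader.next();
        }
      }
    }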
  38. 38 3. Writing Avro records to GCS as Avro files

  39. 39 3. Writing Avro records to GCS as Avro files

    E: Reading Avro records from the DataHubs or the Dead-Letter Hub and getting their destinations. T: Partitioning the records by a ‘Group-By’ shuffle on the destination and schema information, with processing-time windows. L: Writing the Avro records to GCS, into the DataLakes or the Dead-Letter Lake; Beam's AvroIO and FileIO APIs are very useful in this case. (A minimal sketch follows below.)
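
    A minimal sketch of the GCS write using Beam's FileIO.writeDynamic with AvroIO.sink, grouping records by a destination derived from each record over fixed processing-time windows. The destination key, bucket name, window size, and shard count are illustrative assumptions.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class DataHubToGcsExample {
      /**
       * Writes DataHubAvro records to GCS, grouped by a per-record destination
       * (here "<service_name>/<log_name>") over processing-time windows.
       */
      public static void writeToDataLake(PCollection<GenericRecord> records, Schema schema) {
        records
            // Cut files on fixed processing-time style windows (interval is illustrative).
            .apply("Window", Window.into(FixedWindows.of(Duration.standardMinutes(5))))
            .apply("WriteAvroToGcs",
                FileIO.<String, GenericRecord>writeDynamic()
                    // Group-by shuffle key: the destination derived from the record.
                    .by(r -> r.get("service_name") + "/" + r.get("log_name"))
                    .withDestinationCoder(StringUtf8Coder.of())
                    .via(AvroIO.sink(schema))
                    .to("gs://my-structured-datalake") // bucket name is illustrative
                    .withNaming(dest -> FileIO.Write.defaultNaming(dest + "/part", ".avro"))
                    .withNumShards(8)); // fixed sharding is required for unbounded input
      }
    }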
  40. 40 4. Insert avro records to BigQuery as Stream

  41. 41 4. Insert avro records to BigQuery as Stream

    E: Reading Avro records (with their schemas) from the Structured DataHub and getting their destinations. T: Converting the Avro records to BigQuery TableRow objects and identifying the table names from those destinations. L: Inserting the TableRow objects into BigQuery as a stream, using Beam's BigQueryIO with the DynamicDestinations API. (A minimal sketch follows below.)
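
    A minimal sketch of the streaming insert using Beam's BigQueryIO with the DynamicDestinations API. The table naming scheme, the lookupTableSchema() helper, and the TableRow field mapping are illustrative assumptions; the real job derives the table schema from the record's Avro schema.

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.ValueInSingleWindow;

    public class DataHubToBigQueryExample {
      /** Streams structured records into per-log BigQuery tables chosen at runtime. */
      public static void writeToBigQuery(PCollection<GenericRecord> records) {
        records.apply("StreamToBigQuery",
            BigQueryIO.<GenericRecord>write()
                .to(new DynamicDestinations<GenericRecord, String>() {
                  @Override
                  public String getDestination(ValueInSingleWindow<GenericRecord> element) {
                    GenericRecord r = element.getValue();
                    // Table name derived from the record's destination info (illustrative).
                    return r.get("service_name") + "_" + r.get("log_name");
                  }

                  @Override
                  public TableDestination getTable(String destination) {
                    return new TableDestination(
                        "my-project:datahub." + destination, "DataHub table " + destination);
                  }

                  @Override
                  public TableSchema getSchema(String destination) {
                    return lookupTableSchema(destination); // hypothetical schema lookup
                  }
                })
                .withFormatFunction(DataHubToBigQueryExample::toTableRow)
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
      }

      // Converts an Avro record to a BigQuery TableRow; mapping shown for two fields only.
      private static TableRow toTableRow(GenericRecord r) {
        return new TableRow()
            .set("uuid", r.get("uuid").toString())
            .set("log_name", r.get("log_name").toString());
      }

      private static TableSchema lookupTableSchema(String destination) {
        return new TableSchema(); // placeholder: the real job derives this from the Avro schema
      }
    }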
  42. 42 Conclusion

    01 Mercari is changing its system architecture: monolith to microservices, on-premises to cloud. 02 We are creating a new pipeline for the microservice architecture on GCP managed services (Pub/Sub, GCS, Dataflow, and BigQuery). 03 We use Apache Beam with Spotify Scio to write ETL jobs that run on Cloud Dataflow. 04 The Avro format is very useful because Apache Beam has APIs for writing it to both GCS and BigQuery.