Slide 1

Slide 1 text

1 Creating a Streaming Data Pipeline on Google Cloud Platform using Apache Beam {"id": "@shoe116", "team": "Data Platform"}

Slide 2

Slide 2 text

2 Shuichi Suzuki - Mercari Data Platform Team - Data Engineer at Mercari since 2018 - Beam, Kafka, Storm, Hive, Hadoop... - Twitter/GitHub: @shoe116

Slide 3

Slide 3 text

3 Agenda 01 About Mercari 02 Monolith to Microservices, On-premises to Cloud 03 Updating Data Pipeline for Microservice Architecture 04 New Stream Data Pipeline on GCP

Slide 4

Slide 4 text

4 What is Mercari? C2C marketplace app that allows users to enjoy buying and selling

Slide 5

Slide 5 text

5 By the Numbers (JP/Full Year), in FY 06/2017 / FY 06/2018 / FY 06/2019
 GMV¹: 232 / 346.8 / 490.2 billion JPY
 Net Sales: 21.2 / 33.4 / 46.2 billion JPY
 MAU²: 8.45 / 10.75 / 13.57 million people
 Source: internal documents, from the FY2018.6 presentation material
 1. GMV after cancellation
 2. Monthly Active Users in June: the number of registered users that used our app in that month

Slide 6

Slide 6 text

6 Company Profile Established: February 1st, 2013. Offices: Tokyo, Sendai, Fukuoka, Palo Alto, Portland. Headcount: approx. 1,800 including subsidiaries. Japan's first unicorn: listed on the Tokyo Stock Exchange's Mothers market, a board for high-growth companies, in June 2018.

Slide 7

Slide 7 text

7 Monolith to Microservices, On-premises to Cloud

Slide 8

Slide 8 text

8 Why Microservices on Cloud? To scale the organization: 1. To give all developers ownership 2. To develop & improve more rapidly 3. To co-work with diverse talents

Slide 9

Slide 9 text

9 System Architecture: Before Microservices MySQL Mercari API On Premise

Slide 10

Slide 10 text

10 System Architecture: Introduce API gateway API gateway MySQL Kubernetes Mercari API On Premise

Slide 11

Slide 11 text

11 System Architecture: Add a service API gateway Authority Service A MySQL Kubernetes Mercari API Cloud Spanner On Premise

Slide 12

Slide 12 text

12 System Architecture: Current API gateway Authority Service A MySQL Mercari API Service B Service D Service C Cloud Spanner On Premise

Slide 13

Slide 13 text

13 System Architecture: Future API gateway Authority Service A MySQL Kubernetes Mercari API Service B Service D Service C Cloud Spanner On Premise Logic migration Datastore migration

Slide 14

Slide 14 text

14 Updating Data Pipeline for Microservice Architecture

Slide 15

Slide 15 text

15 What is a Data Pipeline? A data pipeline: 1. Moves data between sources and sinks 2. Has high throughput and low latency 3. Has high availability and scalability. In particular, I am speaking about a streaming data pipeline that sends logs from production to the data warehouse for analytics.

Slide 16

Slide 16 text

16 Before Microservices The pipeline was very simple because the only data source was the monolith. Monolith Mercari API

Slide 17

Slide 17 text

17 After Microservices The pipeline now has multiple data sources. We have to adapt our pipeline to the new microservice architecture. ??

Slide 18

Slide 18 text

18 Our Technical Challenges ● Handling ever more data in an ever more efficient way ○ 300K+ requests/sec come from the API gateway alone. ○ More and more services generate much more data traffic. ● Processing schemaful data with more flexibility ○ Each microservice has its own schema to express its behavior. ○ Schema evolution happens independently for each service, because it depends on updates to that service's own business logic. ● Not building a pipeline for every microservice ○ The number of microservices also fluctuates. ○ The Data Platform team should not control their life cycles.

Slide 19

Slide 19 text

19 New Stream Data Pipeline on Google Cloud Platform

Slide 20

Slide 20 text

20 Design guidelines for the new stream data pipeline 1. Split the log collection and data processing phases 2. Support structured output with schema evolution 3. Keep high throughput and scalability for multiple inputs 4. Use GCP managed services, because our data sources and sinks run on GCP

Slide 21

Slide 21 text

21 Using GCP Managed Services Cloud Pub/Sub: a scalable message queue, like Kafka. Cloud Dataflow: a distributed processing engine, like Flink. Cloud Storage: object storage for unstructured data. BigQuery: a scalable, managed DWH for analytics.

Slide 22

Slide 22 text

22 Overview of new stream pipeline on GCP

Slide 23

Slide 23 text

23 3 Types of Message Queue (Cloud Pub/Sub)

Slide 24

Slide 24 text

24 3 Types of Message Queue (Cloud Pub/Sub) Ramp: each service has its own 'Ramp' topic for the pipeline. Services post serialized protobuf messages (byte[]) with Map[String, String] attributes to the topic. DataHub: all data are aggregated into DataHub as records in a common Avro format. We have a Raw DataHub and a Structured DataHub. Dead-Letter Hub: aggregates 'dead-letter' messages that can't be processed successfully.
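As an illustration of how a service could post to its Ramp topic, here is a minimal sketch using the Cloud Pub/Sub Java client from Scala; the project, topic, attribute keys and service names are hypothetical placeholders, not Mercari's actual ones.

import com.google.cloud.pubsub.v1.Publisher
import com.google.protobuf.ByteString
import com.google.pubsub.v1.{ProjectTopicName, PubsubMessage}

object RampPublisher {
  // Hypothetical project/topic names; each service would point at its own Ramp topic.
  private val topic = ProjectTopicName.of("example-project", "ramp-item-service")

  def publish(protoBytes: Array[Byte]): Unit = {
    val publisher = Publisher.newBuilder(topic).build()
    try {
      val message = PubsubMessage.newBuilder()
        .setData(ByteString.copyFrom(protoBytes))      // the serialized protobuf payload (byte[])
        .putAttributes("service_name", "item-service") // attributes travel as Map[String, String]
        .putAttributes("log_name", "item_created")
        .build()
      publisher.publish(message)                       // returns an ApiFuture[String] with the message id
    } finally {
      publisher.shutdown()
    }
  }
}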

Slide 25

Slide 25 text

25 3 Types of Stores (GCS and BigQuery)

Slide 26

Slide 26 text

26 3 Types of Stores (GCS and BigQuery) DataLake: all data are written to the DataLake (GCS) as Avro files. We have two types of DataLake, raw and structured. Dead-Letter Lake: all dead-letter messages are stored in the Dead-Letter Lake (GCS) as Avro files. BigQuery: all data in the Structured DataLake are uploaded to BigQuery; Avro files are very well suited to this.
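For the Structured DataLake to BigQuery path, a batch load job is one straightforward option because Avro files carry their own schema. A minimal sketch with the google-cloud-bigquery Java client follows; the dataset, table and GCS path are placeholders.

import com.google.cloud.bigquery.{BigQueryOptions, FormatOptions, JobInfo, LoadJobConfiguration, TableId}

object StructuredLakeToBigQuery {
  def main(args: Array[String]): Unit = {
    val bigquery = BigQueryOptions.getDefaultInstance.getService
    val config = LoadJobConfiguration
      .newBuilder(TableId.of("analytics", "item_created"),                // hypothetical destination table
                  "gs://example-structured-datalake/item_created/*.avro") // hypothetical DataLake path
      .setFormatOptions(FormatOptions.avro())                             // Avro embeds its schema, so none is declared here
      .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
      .build()
    val job = bigquery.create(JobInfo.of(config))
    job.waitFor()                                                         // block until the load job finishes
  }
}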

Slide 27

Slide 27 text

27 ETL processes using Apache Beam and Avro

Slide 28

Slide 28 text

28 Apache Beam and Spotify Scio Apache Beam: ● A unified programming model and SDK for batch and stream data processing ● Supports multiple engines ○ Apache Apex, Flink, Spark, Samza... ○ Google Cloud Dataflow ● Currently the only SDK for writing Dataflow jobs ○ The Cloud Dataflow SDK will be decommissioned. Spotify Scio: ● A Scala API for Apache Beam and Cloud Dataflow ● Makes handling collections much easier ● Our team members prefer Scala to Java
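To give a feel for the Scio API, here is a minimal word-count style job; it is a generic sketch, not one of the pipeline jobs described later, and the input/output arguments are illustrative.

import com.spotify.scio._

object MinimalScioJob {
  def main(cmdlineArgs: Array[String]): Unit = {
    // Parses flags such as --project=... and --runner=DataflowRunner, plus custom args.
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.textFile(args("input"))                         // SCollection[String]
      .flatMap(_.split("""\s+""").filter(_.nonEmpty))
      .countByValue                                    // SCollection[(String, Long)]
      .map { case (word, count) => s"$word\t$count" }
      .saveAsTextFile(args("output"))

    sc.run()                                           // sc.close() on older Scio releases
  }
}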

Slide 29

Slide 29 text

29 DataHub Avro Protocol Our pipeline's internal message format. Why Avro? - We can write Avro to GCS easily. - We can load Avro into BigQuery easily. - We can query Avro files on GCS as BigQuery external tables. A DataHub Avro record contains: - metadata of the record (UUID, timestamp, etc.) - output destinations and schema information - the content type (Avro, Protobuf, etc.) - the data payload

Slide 30

Slide 30 text

30 Schema of 'DataHubAvro' {"type": "record", "name": "DataHubAvro", "namespace": "com.mercari.data.model.v3", "fields": [ {"name": "uuid", "type": "string"}, {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-micros"}}, {"name": "topic_name", "type": "string"}, {"name": "service_name", "type": "string"}, {"name": "log_name", "type": "string"}, {"name": "content_type", "type": ["null", "string"], "default": null}, {"name": "user_agent", "type": ["null", "string"], "default": null}, {"name": "payload", "type": "bytes"} ]} (highlighted: metadata)

Slide 31

Slide 31 text

31 Schema of 'DataHubAvro' {"type": "record", "name": "DataHubAvro", "namespace": "com.mercari.data.model.v3", "fields": [ {"name": "uuid", "type": "string"}, {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-micros"}}, {"name": "topic_name", "type": "string"}, {"name": "service_name", "type": "string"}, {"name": "log_name", "type": "string"}, {"name": "content_type", "type": ["null", "string"], "default": null}, {"name": "user_agent", "type": ["null", "string"], "default": null}, {"name": "payload", "type": "bytes"} ]} (highlighted: destination / schema info)

Slide 32

Slide 32 text

32 Schema of 'DataHubAvro' {"type": "record", "name": "DataHubAvro", "namespace": "com.mercari.data.model.v3", "fields": [ {"name": "uuid", "type": "string"}, {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-micros"}}, {"name": "topic_name", "type": "string"}, {"name": "service_name", "type": "string"}, {"name": "log_name", "type": "string"}, {"name": "content_type", "type": ["null", "string"], "default": null}, {"name": "user_agent", "type": ["null", "string"], "default": null}, {"name": "payload", "type": "bytes"} ]} (highlighted: payload, protobuf or avro)
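To show how such a record could be assembled, here is a sketch using Avro's GenericRecord API with the schema above; the attribute keys and the timestamp handling are illustrative assumptions, not the production code.

import java.nio.ByteBuffer
import java.util.UUID
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

object DataHubAvroRecord {
  // The DataHubAvro schema from the slide above, parsed once and reused.
  val schema: Schema = new Schema.Parser().parse(
    """{"type": "record", "name": "DataHubAvro", "namespace": "com.mercari.data.model.v3", "fields": [
      |  {"name": "uuid", "type": "string"},
      |  {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-micros"}},
      |  {"name": "topic_name", "type": "string"},
      |  {"name": "service_name", "type": "string"},
      |  {"name": "log_name", "type": "string"},
      |  {"name": "content_type", "type": ["null", "string"], "default": null},
      |  {"name": "user_agent", "type": ["null", "string"], "default": null},
      |  {"name": "payload", "type": "bytes"}
      |]}""".stripMargin)

  def build(payload: Array[Byte], attributes: Map[String, String]): GenericRecord = {
    val record = new GenericData.Record(schema)
    record.put("uuid", UUID.randomUUID().toString)
    record.put("timestamp", System.currentTimeMillis() * 1000L)       // millis converted to timestamp-micros
    record.put("topic_name", attributes.getOrElse("topic_name", ""))
    record.put("service_name", attributes.getOrElse("service_name", ""))
    record.put("log_name", attributes.getOrElse("log_name", ""))
    record.put("content_type", attributes.get("content_type").orNull)
    record.put("user_agent", attributes.get("user_agent").orNull)
    record.put("payload", ByteBuffer.wrap(payload))                   // the untouched protobuf bytes
    record
  }
}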

Slide 33

Slide 33 text

33 4 kinds of ETL processes

Slide 34

Slide 34 text

34 1. Aggregating Data from Ramps to RawDataHub

Slide 35

Slide 35 text

35 1. Aggregating Data from Ramps to RawDataHub E: Read records from the Ramp topics, like the Mapper in the MapReduce model. T: Make 'DataHub Avro' records from each raw protobuf message; an Avro record's payload is exactly the same byte array as the input. L: Write the Avro records to the Raw DataHub or the Dead-Letter Hub, like the Reducer.
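A minimal sketch of the transform and routing step, assuming the Ramp messages have already been read as (payload, attributes) pairs and reusing the DataHubAvroRecord helper sketched earlier; the real job reads from and writes back to Cloud Pub/Sub, which is omitted here.

import com.spotify.scio.coders.Coder
import com.spotify.scio.values.SCollection
import org.apache.avro.generic.GenericRecord
import scala.util.Try

object RampToRawDataHub {
  implicit val recordCoder: Coder[GenericRecord] =
    Coder.avroGenericRecordCoder(DataHubAvroRecord.schema)

  // Returns (records for the Raw DataHub, originals for the Dead-Letter Hub).
  def transform(ramp: SCollection[(Array[Byte], Map[String, String])])
      : (SCollection[GenericRecord], SCollection[(Array[Byte], Map[String, String])]) = {

    // Building the record twice keeps the sketch short; a real job would avoid the duplicate work.
    val ok = ramp.flatMap { case (payload, attributes) =>
      Try(DataHubAvroRecord.build(payload, attributes)).toOption
    }
    val deadLetter = ramp.filter { case (payload, attributes) =>
      Try(DataHubAvroRecord.build(payload, attributes)).isFailure
    }
    (ok, deadLetter)
  }
}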

Slide 36

Slide 36 text

36 2. Converting raw protobuf to Avro Generic Records

Slide 37

Slide 37 text

37 2. Converting raw records to structured Avro records E: Read Avro records from the Raw DataHub and get the schema info and payload. T: Convert the raw protobuf messages to Avro records using the 'Object Container Files' format; converted records contain their own schemas, so downstream consumers can use a 'Schema on Read' strategy. L: Write the converted Avro records, with their schemas, to the Structured DataHub. Messages that can't be converted go to the Dead-Letter Hub.
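The 'Object Container Files' idea can be illustrated with Avro's file API: the writer embeds the schema in a header, so a reader can decode the record without knowing the schema in advance. A self-contained sketch, using in-memory byte arrays instead of Pub/Sub messages:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.avro.Schema
import org.apache.avro.file.{DataFileStream, DataFileWriter}
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}

object ContainerFileCodec {
  def encode(record: GenericRecord): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](record.getSchema))
    writer.create(record.getSchema, out)   // writes a header that embeds the writer schema
    writer.append(record)
    writer.close()
    out.toByteArray
  }

  def decode(bytes: Array[Byte]): (Schema, GenericRecord) = {
    val in = new DataFileStream[GenericRecord](
      new ByteArrayInputStream(bytes), new GenericDatumReader[GenericRecord]())
    val record = in.next()                 // the schema is discovered from the header (schema on read)
    val schema = in.getSchema
    in.close()
    (schema, record)
  }
}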

Slide 38

Slide 38 text

38 3. Writing Avro records to GCS as Avro files

Slide 39

Slide 39 text

39 3. Writing Avro records to GCS as Avro files E: Read Avro records from the DataHubs or the Dead-Letter Hub and get their destinations. T: Partition the records with a 'Group-By' shuffle on the destination and schema information, using processing-time windows. L: Write the Avro records to GCS, to the DataLakes or the Dead-Letter Lake. Beam's AvroIO and FileIO APIs are very useful for this use case.
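A simplified sketch of the load step for a single destination, assuming the destination and schema have already been resolved; the real job fans records out per destination with Beam's FileIO.writeDynamic, which is omitted here. The bucket path, window size and shard count are illustrative.

import com.spotify.scio.values.SCollection
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.joda.time.Duration

object DataHubToDataLake {
  def write(records: SCollection[GenericRecord], schema: Schema): Unit =
    records
      .withFixedWindows(Duration.standardMinutes(5))  // window the stream so files are cut per window
      .saveAsAvroFile("gs://example-datalake/item_created", schema = schema, numShards = 8)
}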

Slide 40

Slide 40 text

40 4. Inserting Avro records into BigQuery as a stream

Slide 41

Slide 41 text

41 4. Inserting Avro records into BigQuery as a stream E: Read Avro records (with their schemas) from the Structured DataHub and get their destinations. T: Convert the Avro records to BigQuery TableRow objects and identify the table names from those destinations. L: Insert the TableRow objects into BigQuery as a stream, using Beam's BigQueryIO with the DynamicDestinations API.
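A sketch of the conversion step, flattening a structured Avro record into a BigQuery TableRow; it only handles top-level primitive fields, and nested records, arrays and logical types would need additional handling. The table name would be derived from the record's destination info before handing the rows to BigQueryIO.

import com.google.api.services.bigquery.model.TableRow
import org.apache.avro.generic.GenericRecord
import org.apache.avro.util.Utf8
import scala.jdk.CollectionConverters._

object AvroToTableRow {
  def convert(record: GenericRecord): TableRow = {
    val row = new TableRow()
    record.getSchema.getFields.asScala.foreach { field =>
      val value = record.get(field.name()) match {
        case null    => null
        case s: Utf8 => s.toString                           // Avro strings arrive as Utf8
        case b: java.nio.ByteBuffer =>
          val copy = new Array[Byte](b.remaining())
          b.duplicate().get(copy)
          java.util.Base64.getEncoder.encodeToString(copy)   // BYTES columns expect base64 strings
        case other   => other
      }
      row.set(field.name(), value)
    }
    row
  }
}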

Slide 42

Slide 42 text

42 Conclusion 01 Mercari is changing its system architecture from monolith to microservices and from on-premises to cloud. 02 We are creating a new pipeline for the microservice architecture on GCP services (Pub/Sub, GCS, Dataflow and BigQuery). 03 We use Apache Beam with Spotify Scio to write ETL jobs that run on Cloud Dataflow. 04 The Avro format is very useful because Apache Beam has APIs to write it to both GCS and BigQuery.