Creating Stream DataPipeline on GCP Using Apache Beam

Shu Suzuki
September 12, 2019


We built a scalable and flexible stream data pipeline for our microservices
on Google Cloud Platform (GCP), using Cloud Pub/Sub, Google Cloud Storage, BigQuery, and Cloud Dataflow with Apache Beam. The pipeline runs in production at Mercari, one of the biggest C2C e-commerce services in Japan. It currently accepts logs from 5+ microservices, and that number will keep growing.


Transcript

  1. 1 Creating a Streaming Data Pipeline on Google Cloud Platform

    using Apache Beam {“id”: “@shoe116”, “team”: “Data Platform”}
  2. 2 Shuichi Suzuki

    - Mercari Data Platform Team - Data Engineer at Mercari since 2018 - Beam, Kafka, Storm, Hive, Hadoop... - Twitter/GitHub @shoe116
  3. 3 Agenda

    01 About Mercari 02 Monolith to Microservices, On-premises to Cloud 03 Updating Data Pipeline for Microservice Architecture 04 New Stream Data Pipeline on GCP
  4. 4 What is Mercari? C2C marketplace app that allows users

    to enjoy buying and selling
  5. 5 By the Numbers (JP/Full Year)

    GMV¹ (billion JPY): 232 (FY 06/2017), 346.8 (FY 06/2018), 490.2 (FY 06/2019)
    Net Sales (billion JPY): 21.2 (FY 06/2017), 33.4 (FY 06/2018), 46.2 (FY 06/2019)
    MAU² (million people): 8.45 (FY 06/2017), 10.75 (FY 06/2018), 13.57 (FY 06/2019)
    1. GMV after cancellation
    2. Monthly Active Users in June: number of registered users that used our app in the month
    Source: internal documents, from the FY2018.6 presentation material
  6. 6 Company Profile

    Established: February 1st, 2013. Offices: Tokyo, Sendai, Fukuoka, Palo Alto, Portland. Headcount: approx. 1,800 including subsidiaries. Japan's first unicorn: listed on the Tokyo Stock Exchange's Mothers market (a board for high-growth companies) in June 2018.
  7. 7 Monolith to Microservices, On-premises to Cloud

  8. 8 Why Microservices on Cloud?

    To scale the organization: 1. To give all developers ownership 2. To develop & improve more rapidly 3. To collaborate with diverse talent
  9. 9 System Architecture: Before Microservices MySQL Mercari API On Premise

  10. 10 System Architecture: Introduce API gateway API gateway MySQL Kubernetes

    Mercari API On Premise
  11. 11 System Architecture: Add a service API gateway Authority Service

    A MySQL Kubernetes Mercari API Cloud Spanner On Premise
  12. 12 System Architecture: Current API gateway Authority Service A MySQL

    Mercari API Service B Service D Service C Cloud Spanner On Premise
  13. 13 System Architecture: Future API gateway Authority Service A MySQL

    Kubernetes Mercari API Service B Service D Service C Cloud Spanner On Premise Logic migration Datastore migration
  14. 14 Updating Data Pipeline for Microservice Architecture

  15. 15 What is a Data Pipeline?

    A data pipeline is: 1. Moving data from sources to sinks 2. High-throughput, low-latency 3. Highly available and scalable. In particular, I am speaking about a stream data pipeline that sends logs from production to the data warehouse for analytics.
  16. 16 Before Microservices

    The pipeline was very simple because the source was the monolith. Monolith Mercari API
  17. 17 After Microservices

    The pipeline now has multiple data sources. We have to adapt our pipeline to the new microservice architecture.
  18. 18 Our Technical Challenges

    • Handling ever more data in an ever more efficient way ◦ 300K+ requests/sec come from the API gateway alone. ◦ More and more services generate much more data traffic. • Processing schemaful data with more flexibility ◦ Each microservice has its own schema to express its behavior. ◦ Each schema evolves independently, because it depends on that service's own business logic. • Don't build a pipeline per microservice ◦ The number of microservices also fluctuates. ◦ The Data Platform should not control their life cycles.
  19. 19 New Stream Data Pipeline on Google Cloud Platform

  20. 20 Design guidelines for the new stream data pipeline

    1. Split the log collection and data processing phases 2. Support structured output with schema evolution 3. Keep high throughput and scalability for multiple inputs 4. Use GCP, because our data sources and sinks run on GCP
  21. 21 Using GCP Managed Services

    Cloud Pub/Sub: scalable message queue, like Kafka. Cloud Dataflow: distributed processing engine, like Flink. Cloud Storage: object storage to store unstructured data. BigQuery: scalable, managed DWH for analytics.
  22. 22 Overview of new stream pipeline on GCP

  23. 23 3 Types of Message Queue (Cloud Pub/Sub)

  24. 24 3 Types of Message Queue (Cloud Pub/Sub)

    Ramp: each service has its own ‘Ramp’ topic for the pipeline. Services post serialized protobuf messages (byte[]) with Map[String, String] attributes to the topic (a publishing sketch follows below). DataHub: all data are aggregated into the DataHub as records in a common Avro format; we have a Raw DataHub and a Structured DataHub. Dead-Letter Hub: aggregates ‘dead-letter’ messages that can't be processed successfully.
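
    As a rough illustration of the Ramp contract above, here is a minimal sketch of how a producing service might publish a serialized protobuf payload with string attributes, using the Google Cloud Pub/Sub Java client. The project name, topic name, attribute keys, and the buildItemEventProto() helper are illustrative assumptions, not Mercari's actual names.

    import com.google.cloud.pubsub.v1.Publisher;
    import com.google.protobuf.ByteString;
    import com.google.pubsub.v1.PubsubMessage;
    import com.google.pubsub.v1.TopicName;
    import java.util.Map;

    public class RampPublisherExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical project and Ramp topic names, for illustration only.
        TopicName topic = TopicName.of("my-gcp-project", "ramp-item-service");
        Publisher publisher = Publisher.newBuilder(topic).build();
        try {
          // The service serializes its own protobuf message to bytes...
          byte[] protobufPayload = buildItemEventProto(); // hypothetical helper
          // ...and attaches String attributes describing the log.
          PubsubMessage message =
              PubsubMessage.newBuilder()
                  .setData(ByteString.copyFrom(protobufPayload))
                  .putAllAttributes(
                      Map.of(
                          "service_name", "item-service",   // illustrative attribute keys
                          "log_name", "item_created",
                          "content_type", "application/protobuf"))
                  .build();
          // Publish asynchronously; get() blocks only to keep the example simple.
          String messageId = publisher.publish(message).get();
          System.out.println("Published message " + messageId);
        } finally {
          publisher.shutdown();
        }
      }

      private static byte[] buildItemEventProto() {
        return new byte[0]; // placeholder for a real serialized protobuf message
      }
    }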
  25. 25 3 Types of Stores (GCS and BigQuery)

  26. 26 3 Types of Stores (GCS and BigQuery)

    DataLake: all data are written to the DataLake (GCS) as Avro files; we have two types of DataLake, raw and structured. Dead-Letter Lake: all dead-letter messages are stored in the Dead-Letter Lake (GCS) as Avro files. BigQuery: all data in the Structured DataLake is uploaded to BigQuery; Avro files are very well suited to this (see the load sketch below).
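
    To illustrate why Avro on GCS is convenient here: BigQuery can load Avro files natively, deriving the table schema from the schema embedded in the files. A minimal sketch using the google-cloud-bigquery Java client; the project, dataset, table, and bucket names are illustrative assumptions.

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FormatOptions;
    import com.google.cloud.bigquery.Job;
    import com.google.cloud.bigquery.JobInfo;
    import com.google.cloud.bigquery.LoadJobConfiguration;
    import com.google.cloud.bigquery.TableId;

    public class LoadAvroToBigQueryExample {
      public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Illustrative dataset/table and DataLake path; the Avro schema travels with the files.
        TableId table = TableId.of("datahub", "item_created");
        LoadJobConfiguration config =
            LoadJobConfiguration.newBuilder(
                    table, "gs://my-structured-datalake/item-service/item_created/*.avro")
                .setFormatOptions(FormatOptions.avro())
                .build();

        // Run the load job and wait for completion (blocking only for the example).
        Job job = bigquery.create(JobInfo.of(config)).waitFor();
        System.out.println("Load job finished: " + job.getStatus());
      }
    }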
  27. 27 ETL processes using Apache Beam and Avro

  28. 28 Apache Beam and Spotify Scio

    Apache Beam: • A unified programming model and SDK for batch and stream data processing • Supports multiple engines: Apache Apex, Flink, Spark, Samza... and Google Cloud Dataflow • Currently the only SDK for writing Dataflow jobs (the Cloud Dataflow SDK will be decommissioned). Spotify Scio: • A Scala API for Apache Beam and Cloud Dataflow • Makes handling collections much easier • Our team members prefer Scala over Java. (A minimal Beam pipeline sketch follows below.)
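
    For readers new to Beam, a minimal sketch of the unified model: the same pipeline code runs locally on the DirectRunner or on Cloud Dataflow, chosen purely through pipeline options. This is plain Beam Java (the team writes its jobs in Scala with Scio, but the underlying model is identical), and the transform itself is a toy example.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class BeamHelloPipeline {
      public static void main(String[] args) {
        // The execution engine (DirectRunner locally, DataflowRunner in production)
        // is chosen via options such as --runner=DataflowRunner;
        // the transform code below stays the same either way.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            .apply("CreateInput", Create.of("pub/sub", "dataflow", "bigquery"))
            .apply("ToUpperCase",
                MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()));

        pipeline.run().waitUntilFinish();
      }
    }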
  29. 29 DataHub Avro Protocol

    Our pipeline's internal message format. Why Avro? - We can write Avro to GCS easily. - We can load Avro into BQ easily. - We can query Avro files on GCS as BQ external tables. A DataHub Avro record contains: - metadata of the record (UUID, timestamp, etc.) - output destinations and schema information - content type (Avro, Protobuf, etc.) - the data payload
  30. 30 Schema of ‘DataHubAvro’

    {"type": "record", "name": "DataHubAvro", "namespace": "com.mercari.data.model.v3", "fields": [ {"name": "uuid", "type": "string"}, {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-micros"}}, {"name": "topic_name", "type": "string"}, {"name": "service_name", "type": "string"}, {"name": "log_name", "type": "string"}, {"name": "content_type", "type": ["null", "string"], "default": null}, {"name": "user_agent", "type": ["null", "string"], "default": null}, {"name": "payload", "type": "bytes"} ]} (highlighting the metadata fields, e.g. uuid and timestamp)
  31. 31 Schema of ‘DataHubAvro’

    The same schema as above, highlighting the destination / schema info fields such as topic_name, service_name, and log_name.
  32. 32 Schema of ‘DataHubAvro’

    The same schema as above, highlighting the payload field, which carries the protobuf or Avro bytes. (A sketch of building such a record follows below.)
  33. 33 4 kinds of ETL processes

  34. 34 1. Aggregating Data from Ramps to RawDataHub

  35. 35 1. Aggregating Data from Ramps to RawDataHub

    E: Reading records from the Ramp topics, like the Mapper in the MapReduce model. T: Making a ‘DataHub Avro’ record from each raw protobuf message; the Avro record's payload is exactly the same byte array as the input. L: Writing the Avro records to the Raw DataHub or the Dead-Letter Hub, like the Reducer. (A minimal sketch follows below.)
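
    A minimal sketch of this first ETL job in Beam's Java SDK (the actual jobs are written with Scio, and dead-letter routing is omitted here). The subscription and topic names are illustrative, DataHubAvroRecordExample.wrap() is the hypothetical helper from the earlier sketch, and single-record Avro binary encoding is an assumption about the wire format.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class RampToRawDataHubJob {
      public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // "Mapper" side: read raw protobuf payloads plus attributes from a Ramp subscription.
        PCollection<PubsubMessage> wrapped =
            pipeline
                .apply("ReadRamp",
                    PubsubIO.readMessagesWithAttributes()
                        .fromSubscription("projects/my-project/subscriptions/ramp-item-service"))
                // Wrap each payload into a DataHubAvro record and re-encode it as Avro bytes.
                .apply("WrapAsDataHubAvro",
                    MapElements.into(TypeDescriptor.of(PubsubMessage.class))
                        .via((PubsubMessage msg) -> {
                          GenericRecord record = DataHubAvroRecordExample.wrap(
                              msg.getPayload(),
                              "ramp-item-service",
                              msg.getAttribute("service_name"),
                              msg.getAttribute("log_name"));
                          return new PubsubMessage(encodeAvro(record), msg.getAttributeMap());
                        }))
                .setCoder(PubsubMessageWithAttributesCoder.of());

        // "Reducer" side: publish the wrapped records to the Raw DataHub topic.
        // (The real job also routes failures to the Dead-Letter Hub, omitted here.)
        wrapped.apply("WriteRawDataHub",
            PubsubIO.writeMessages().to("projects/my-project/topics/raw-datahub"));

        pipeline.run();
      }

      // Single-record Avro binary encoding; the DataHubAvro schema is fixed and known to readers.
      private static byte[] encodeAvro(GenericRecord record) {
        try {
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
          new GenericDatumWriter<GenericRecord>(record.getSchema()).write(record, encoder);
          encoder.flush();
          return out.toByteArray();
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
    }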
  36. 36 2. Converting raw protobuf to Avro Generic Records

  37. 37 2. Converting raw records to structured Avro records

    E: Reading Avro records from the Raw DataHub and getting the schema info and the payload. T: Converting the raw protobuf messages into Avro records in the ‘Object Container Files’ format; the converted records carry their own schemas, so downstream consumers can use a ‘schema on read’ strategy. L: Writing the converted Avro records, with their schemas, to the Structured DataHub; messages that can't be converted go to the Dead-Letter Hub. (An encoding sketch follows below.)
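
    A minimal sketch of the ‘Object Container Files’ encoding step: serializing a structured GenericRecord so that the bytes embed the writer schema, which is what enables the schema-on-read strategy mentioned above. One record per message is an assumption made for illustration.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class OcfEncoderExample {
      /**
       * Serializes a structured record in Avro "Object Container Files" format.
       * The resulting bytes embed the writer schema, so downstream consumers can
       * read the data without a separate schema lookup.
       */
      public static byte[] toOcfBytes(Schema schema, GenericRecord record) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
          writer.create(schema, out); // writes the OCF header, including the schema
          writer.append(record);      // one record per message in this sketch
        }
        return out.toByteArray();
      }

      /** Reads OCF bytes back, recovering the embedded writer schema (schema on read). */
      public static GenericRecord fromOcfBytes(byte[] bytes) throws IOException {
        try (DataFileStream<GenericRecord> reader =
            new DataFileStream<>(
                new ByteArrayInputStream(bytes), new GenericDatumReader<GenericRecord>())) {
          return reader.next();
        }
      }
    }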
  38. 38 3. Writing Avro records to GCS as Avro files

  39. 39 3. Writing Avro records to GCS as Avro files

    E: Reading Avro records from the DataHubs or the Dead-Letter Hub and getting their destinations. T: Partitioning the records by a ‘Group-By’ shuffle on the destination and schema information, with processing-time windows. L: Writing the Avro records to GCS, into the DataLakes or the Dead-Letter Lake; Beam's AvroIO and FileIO APIs are very useful in this case. (A minimal sketch follows below.)
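
    A minimal sketch of the GCS write using Beam's FileIO.writeDynamic with AvroIO.sink, grouping records by a destination derived from each record over fixed processing-time windows. The destination key, bucket name, window size, and shard count are illustrative assumptions.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class DataHubToGcsExample {
      /**
       * Writes DataHubAvro records to GCS, grouped by a per-record destination
       * (here "<service_name>/<log_name>") over processing-time windows.
       */
      public static void writeToDataLake(PCollection<GenericRecord> records, Schema schema) {
        records
            // Cut files on fixed processing-time style windows (interval is illustrative).
            .apply("Window", Window.into(FixedWindows.of(Duration.standardMinutes(5))))
            .apply("WriteAvroToGcs",
                FileIO.<String, GenericRecord>writeDynamic()
                    // Group-by shuffle key: the destination derived from the record.
                    .by(r -> r.get("service_name") + "/" + r.get("log_name"))
                    .withDestinationCoder(StringUtf8Coder.of())
                    .via(AvroIO.sink(schema))
                    .to("gs://my-structured-datalake") // bucket name is illustrative
                    .withNaming(dest -> FileIO.Write.defaultNaming(dest + "/part", ".avro"))
                    .withNumShards(8)); // fixed sharding is required for unbounded input
      }
    }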
  40. 40 4. Insert avro records to BigQuery as Stream

  41. 41 4. Insert avro records to BigQuery as Stream

    E: Reading Avro records (with their schemas) from the Structured DataHub and getting their destinations. T: Converting the Avro records to BigQuery TableRow objects and identifying the table names from those destinations. L: Inserting the TableRow objects into BigQuery as a stream, using Beam's BigQueryIO with the DynamicDestinations API. (A minimal sketch follows below.)
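
    A minimal sketch of the streaming insert using Beam's BigQueryIO with the DynamicDestinations API. The table naming scheme, the lookupTableSchema() helper, and the TableRow field mapping are illustrative assumptions; the real job derives the table schema from the record's Avro schema.

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.ValueInSingleWindow;

    public class DataHubToBigQueryExample {
      /** Streams structured records into per-log BigQuery tables chosen at runtime. */
      public static void writeToBigQuery(PCollection<GenericRecord> records) {
        records.apply("StreamToBigQuery",
            BigQueryIO.<GenericRecord>write()
                .to(new DynamicDestinations<GenericRecord, String>() {
                  @Override
                  public String getDestination(ValueInSingleWindow<GenericRecord> element) {
                    GenericRecord r = element.getValue();
                    // Table name derived from the record's destination info (illustrative).
                    return r.get("service_name") + "_" + r.get("log_name");
                  }

                  @Override
                  public TableDestination getTable(String destination) {
                    return new TableDestination(
                        "my-project:datahub." + destination, "DataHub table " + destination);
                  }

                  @Override
                  public TableSchema getSchema(String destination) {
                    return lookupTableSchema(destination); // hypothetical schema lookup
                  }
                })
                .withFormatFunction(DataHubToBigQueryExample::toTableRow)
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
      }

      // Converts an Avro record to a BigQuery TableRow; mapping shown for two fields only.
      private static TableRow toTableRow(GenericRecord r) {
        return new TableRow()
            .set("uuid", r.get("uuid").toString())
            .set("log_name", r.get("log_name").toString());
      }

      private static TableSchema lookupTableSchema(String destination) {
        return new TableSchema(); // placeholder: the real job derives this from the Avro schema
      }
    }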
  42. 42 Conclusion

    01 Mercari is changing its system architecture: monolith to microservices, on-premises to cloud. 02 We are creating a new pipeline for the microservice architecture on GCP managed services (Pub/Sub, GCS, Dataflow, and BigQuery). 03 We use Apache Beam with Spotify Scio to write ETL jobs that run on Cloud Dataflow. 04 The Avro format is very useful because Apache Beam has APIs for writing it to both GCS and BigQuery.