
Creating a Stream Data Pipeline on GCP Using Apache Beam

Shu Suzuki
September 12, 2019


We built a scalable and flexible stream data pipeline for our microservices
on Google Cloud Platform (GCP), using Cloud Pub/Sub, Google Cloud Storage, BigQuery, and Cloud Dataflow with Apache Beam. The stream data pipeline runs in production at Mercari, one of the biggest C2C e-commerce services in Japan. The pipeline currently accepts logs from 5+ microservices, and that number will increase soon.

Transcript

  1. 1 Creating a Streaming Data Pipeline on Google Cloud Platform using Apache Beam {"id": "@shoe116", "team": "Data Platform"}
  2. 2 Shuichi Suzuki - Mercari Data Platform Team - Data Engineer at Mercari since 2018 - Beam, Kafka, Storm, Hive, Hadoop... - Twitter/GitHub: @shoe116
  3. 3 Agenda: 01 About Mercari, 02 Monolith to Microservices, On-premises to Cloud, 03 Updating the Data Pipeline for the Microservice Architecture, 04 New Stream Data Pipeline on GCP
  4. 5 By the Numbers (JP/Full Year)
     - GMV¹ (billion JPY): 232 (FY 06/2017), 346.8 (FY 06/2018), 490.2 (FY 06/2019)
     - Net Sales (billion JPY): 21.2 (FY 06/2017), 33.4 (FY 06/2018), 46.2 (FY 06/2019)
     - MAU² (million people): 8.45 (FY 06/2017), 10.75 (FY 06/2018), 13.57 (FY 06/2019)
     Source: internal documents, from the FY2018.6 Presentation Material
     1. GMV after cancellation
     2. Monthly Active Users in June: the number of registered users that used our app in the month
  5. 6 Company Profile
     - Established: February 1st, 2013
     - Offices: Tokyo, Sendai, Fukuoka, Palo Alto, Portland
     - Headcount: approx. 1,800 including subsidiaries
     - Japan's first unicorn: listed on the Tokyo Stock Exchange's Mothers market (a board for high-growth companies) in June 2018
  6. 8 Why Microservices on Cloud? To scale the organization: 1. to give all developers ownership, 2. to develop and improve more rapidly, 3. to work together with diverse talent.
  7. 11 System Architecture: Add a service (architecture diagram; labels: API gateway, Authority, Service A, Kubernetes, Cloud Spanner, Mercari API, MySQL, On-premise)
  8. 12 System Architecture: Current (architecture diagram; labels: API gateway, Authority, Service A, Service B, Service C, Service D, Cloud Spanner, Mercari API, MySQL, On-premise)
  9. 13 System Architecture: Future (architecture diagram; labels: API gateway, Authority, Service A, Service B, Service C, Service D, Kubernetes, Cloud Spanner, Mercari API, MySQL, On-premise; arrows for logic migration and datastore migration)
  10. 15 What is a Data Pipeline? A data pipeline: 1. moves data from sources to sinks, 2. with high throughput and low latency, 3. with high availability and scalability. In particular, I am speaking about a stream data pipeline that sends logs from production to a data warehouse for analytics.
  11. 16 Before Microservices: the pipeline was very simple because the only source was the monolithic Mercari API.
  12. 17 After Microservices: the pipeline has multiple data sources. We have to adapt our pipeline to the new microservice architecture.
  13. 18 Our Technical Challenges
      • Handling ever more data in an ever more efficient way
        ◦ 300K+ requests/sec come from the API gateway alone.
        ◦ More and more services generate much more data traffic.
      • Processing schemaful data with more flexibility
        ◦ Each microservice has its own schema to express its behavior.
        ◦ Each schema evolves independently, because it depends on that service's own business logic.
      • Not building a pipeline for every microservice
        ◦ The number of microservices also fluctuates.
        ◦ The Data Platform team should not control their life cycles.
  14. 20 Design guidelines for the new stream data pipeline: 1. Split the log collection and data processing phases, 2. Support structured output with schema evolution, 3. Keep high throughput and scalability for multiple inputs, 4. Use GCS because our data sources and sinks run on GCP.
  15. 21 Using GCP Managed Services
      - Cloud Pub/Sub: a scalable message queue, like Kafka
      - Cloud Dataflow: a distributed processing engine, like Flink
      - Cloud Storage: object storage for unstructured data
      - BigQuery: a scalable, managed DWH for analytics
  16. 24 3 Types of Message Queues (Cloud Pub/Sub)
      - Ramp: each service has its own 'Ramp' topic for the pipeline. Services post serialized protobuf messages (byte[]) with Map[String, String] attributes to the topic (see the publishing sketch after this list).
      - DataHub: all data are aggregated into DataHub as records in a common Avro format. We have a Raw DataHub and a Structured DataHub.
      - Dead-Letter Hub: aggregates 'dead-letter' messages that can't be processed successfully.
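      A minimal sketch of how a service might publish to its Ramp topic, using the Google Cloud Pub/Sub Java client from Scala. The project name, topic name, attribute keys, and the placeholder payload bytes are assumptions for illustration; the real payload would come from the service's own generated protobuf class.

        import java.util.concurrent.TimeUnit

        import com.google.cloud.pubsub.v1.Publisher
        import com.google.protobuf.ByteString
        import com.google.pubsub.v1.{PubsubMessage, TopicName}

        object RampPublisherSketch {
          def main(args: Array[String]): Unit = {
            // Assumed project/topic names; each service owns one Ramp topic.
            val topic     = TopicName.of("my-project", "service-a-ramp")
            val publisher = Publisher.newBuilder(topic).build()

            // In a real service this is the serialized protobuf message, e.g.
            // ServiceALog.newBuilder()...build().toByteArray (hypothetical message type).
            val payload: Array[Byte] = Array[Byte](10, 4, 117, 49, 50, 51)

            val message = PubsubMessage.newBuilder()
              .setData(ByteString.copyFrom(payload))        // protobuf bytes (byte[])
              .putAttributes("service_name", "service-a")   // Map[String, String] attributes
              .putAttributes("log_name", "access_log")
              .build()

            publisher.publish(message)                      // returns an ApiFuture with the message id
            publisher.shutdown()
            publisher.awaitTermination(1, TimeUnit.MINUTES)
          }
        }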
  17. 26 3 Types of Stores (GCS and BigQuery)
      - DataLake: all data are written to the DataLake (GCS) as Avro files. We have two types of DataLakes, raw and structured.
      - Dead-Letter Lake: all dead-letter messages are stored in the Dead-Letter Lake (GCS) as Avro files.
      - BigQuery: all data in the Structured DataLake are uploaded to BigQuery. Avro files are very well suited to this.
  18. 28 Apache Beam and Spotify Scio
      Apache Beam
      • A unified programming model and SDK for batch and stream data processing
      • Supports multiple engines: Apache Apex, Flink, Spark, Samza... and Google Cloud Dataflow
      • Currently the only SDK for writing Dataflow jobs (the Cloud Dataflow SDK will be decommissioned)
      Spotify Scio
      • A Scala API for Apache Beam and Cloud Dataflow
      • Makes handling collections much easier
      • Our team members prefer Scala to Java
      (A minimal Scio example follows below.)
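      A minimal Scio sketch to show the programming model (not one of the Mercari jobs): read text files from GCS, count words, and write the counts back to GCS. The bucket paths are placeholders.

        import com.spotify.scio._

        object ScioWordCountSketch {
          def main(cmdlineArgs: Array[String]): Unit = {
            val (sc, _) = ContextAndArgs(cmdlineArgs)  // parses --project, --runner, etc.

            sc.textFile("gs://example-bucket/input/*.txt")        // SCollection[String]
              .flatMap(_.split("""\s+""").filter(_.nonEmpty))
              .countByValue                                       // SCollection[(String, Long)]
              .map { case (word, count) => s"$word\t$count" }
              .saveAsTextFile("gs://example-bucket/output/")

            sc.run()  // submits to Dataflow or runs locally (sc.close() on older Scio versions)
          }
        }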
  19. 29 DataHub Avro Protocol: our pipeline's internal message format.
      Why Avro?
      - We can write Avro to GCS easily.
      - We can load Avro into BigQuery easily.
      - We can query Avro files on GCS as BigQuery external tables.
      A DataHub Avro record contains:
      - Metadata of the record (UUID, timestamp, etc.)
      - Output destinations and schema information
      - Content type (Avro, Protobuf, etc.)
      - Data payload
  20. 30 Schema of 'DataHubAvro' (highlighted on this slide: the metadata fields):
      {"type": "record", "name": "DataHubAvro", "namespace": "com.mercari.data.model.v3",
       "fields": [
         {"name": "uuid", "type": "string"},
         {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-micros"}},
         {"name": "topic_name", "type": "string"},
         {"name": "service_name", "type": "string"},
         {"name": "log_name", "type": "string"},
         {"name": "content_type", "type": ["null", "string"], "default": null},
         {"name": "user_agent", "type": ["null", "string"], "default": null},
         {"name": "payload", "type": "bytes"}
       ]}
  21. 31 Schema of 'DataHubAvro' (same schema as above; highlighted on this slide: the destination / schema info fields)
  22. 32 Schema of 'DataHubAvro' (same schema as above; highlighted on this slide: the payload field, which holds protobuf or Avro bytes)
  23. 35 1. Aggregating data from Ramps to the Raw DataHub
      - E: Read records from the Ramp topics, like the Mapper in the MapReduce model.
      - T: Make 'DataHub Avro' records from each raw protobuf message. An Avro record's payload is exactly the same byte array as the input.
      - L: Write the Avro records to the Raw DataHub or the Dead-Letter Hub, like the Reducer.
      (A sketch of this job follows below.)
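      A minimal sketch of the idea behind step 1, written with Scio and Beam's PubsubIO (not the production job): read raw protobuf messages with their attributes from a Ramp subscription, wrap each payload untouched into a DataHub-style Avro record, and publish the serialized record to the Raw DataHub topic. The subscription/topic names, attribute keys, and the abbreviated schema are assumptions; dead-letter handling is omitted.

        import java.io.ByteArrayOutputStream
        import java.nio.ByteBuffer
        import java.util.UUID

        import com.spotify.scio._
        import com.spotify.scio.coders.Coder
        import org.apache.avro.Schema
        import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, GenericRecordBuilder}
        import org.apache.avro.io.EncoderFactory
        import org.apache.beam.sdk.io.gcp.pubsub.{PubsubIO, PubsubMessage, PubsubMessageWithAttributesCoder}

        object RampToRawDataHubSketch {
          // Abbreviated DataHubAvro schema (see the schema slide for the full version).
          val dataHubSchema: Schema = new Schema.Parser().parse(
            """{"type": "record", "name": "DataHubAvro", "namespace": "com.mercari.data.model.v3",
              | "fields": [
              |   {"name": "uuid", "type": "string"},
              |   {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-micros"}},
              |   {"name": "topic_name", "type": "string"},
              |   {"name": "service_name", "type": "string"},
              |   {"name": "log_name", "type": "string"},
              |   {"name": "payload", "type": "bytes"}
              | ]}""".stripMargin)

          // Serializes one record with the Avro binary encoding.
          def toAvroBytes(record: GenericRecord): Array[Byte] = {
            val out     = new ByteArrayOutputStream()
            val encoder = EncoderFactory.get().binaryEncoder(out, null)
            new GenericDatumWriter[GenericRecord](dataHubSchema).write(record, encoder)
            encoder.flush()
            out.toByteArray
          }

          def main(cmdlineArgs: Array[String]): Unit = {
            val (sc, _) = ContextAndArgs(cmdlineArgs)

            // Tell Scio how to encode Beam's PubsubMessage (avoids the Kryo fallback coder).
            implicit val pubsubMessageCoder: Coder[PubsubMessage] =
              Coder.beam(PubsubMessageWithAttributesCoder.of())

            sc.customInput("ReadRamp",
                PubsubIO.readMessagesWithAttributes()
                  .fromSubscription("projects/my-project/subscriptions/service-a-ramp-sub"))
              .map { msg =>
                val attrs = Option(msg.getAttributeMap)
                  .getOrElse(java.util.Collections.emptyMap[String, String]())
                val record = new GenericRecordBuilder(dataHubSchema)
                  .set("uuid", UUID.randomUUID().toString)
                  .set("timestamp", System.currentTimeMillis() * 1000L)  // epoch micros (approximate)
                  .set("topic_name", "service-a-ramp")
                  .set("service_name", attrs.getOrDefault("service_name", "unknown"))
                  .set("log_name", attrs.getOrDefault("log_name", "unknown"))
                  .set("payload", ByteBuffer.wrap(msg.getPayload))       // the input bytes, untouched
                  .build()
                new PubsubMessage(toAvroBytes(record), new java.util.HashMap[String, String]())
              }
              .saveAsCustomOutput("WriteRawDataHub",
                PubsubIO.writeMessages().to("projects/my-project/topics/raw-datahub"))

            sc.run()
          }
        }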
  24. 37 2. Converting raw records to structured Avro records
      - E: Read Avro records from the Raw DataHub, and get the schema info and the payload.
      - T: Convert the raw protobuf messages to Avro records using the 'Object Container Files' format. Converted records contain their own schemas, so downstream consumers can use 'Schema on Read' strategies.
      - L: Write the converted Avro records, with their schemas, to the Structured DataHub. Messages that can't be converted go to the Dead-Letter Hub.
      (A sketch of the conversion follows below.)
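      A minimal sketch of the transform in step 2 (not the production code): map fields extracted from a raw protobuf payload to a structured Avro record, then serialize records in the Avro 'Object Container Files' format so the bytes carry their own schema, which is what enables the 'Schema on Read' strategy downstream. The target schema and field names are hypothetical; a real job would parse the payload with the service's own generated protobuf class (e.g. ServiceALog.parseFrom(payload)).

        import java.io.ByteArrayOutputStream

        import org.apache.avro.Schema
        import org.apache.avro.file.DataFileWriter
        import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, GenericRecordBuilder}

        object RawToStructuredSketch {
          // Hypothetical structured schema for one service's log.
          val structuredSchema: Schema = new Schema.Parser().parse(
            """{"type": "record", "name": "ServiceALog", "namespace": "com.example",
              | "fields": [
              |   {"name": "user_id", "type": "string"},
              |   {"name": "action", "type": "string"}
              | ]}""".stripMargin)

          // Builds a structured record from fields that, in the real job, come from the parsed protobuf.
          def toStructuredRecord(userId: String, action: String): GenericRecord =
            new GenericRecordBuilder(structuredSchema)
              .set("user_id", userId)
              .set("action", action)
              .build()

          // Wraps records in an Object Container File, so the schema travels with the data.
          def toContainerBytes(records: Seq[GenericRecord]): Array[Byte] = {
            val out    = new ByteArrayOutputStream()
            val writer = new DataFileWriter[GenericRecord](
              new GenericDatumWriter[GenericRecord](structuredSchema))
            writer.create(structuredSchema, out)
            records.foreach(writer.append)
            writer.close()
            out.toByteArray
          }

          def main(args: Array[String]): Unit = {
            val bytes = toContainerBytes(Seq(toStructuredRecord("u123", "view_item")))
            println(s"container bytes with embedded schema: ${bytes.length}")
          }
        }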
  25. 39 3. Writing Avro records to GCS as Avro files
      - E: Read Avro records from the DataHubs or the Dead-Letter Hub, and get their destinations.
      - T: Partition records with a 'Group-By' shuffle on the destination and schema information, using processing-time windows.
      - L: Write the Avro records to GCS, i.e. the DataLakes or the Dead-Letter Lake. Beam's AvroIO and FileIO APIs are very useful in this use case.
      (A sketch using FileIO.writeDynamic follows below.)
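      A minimal sketch of step 3 with Beam's FileIO.writeDynamic and AvroIO.sink, called from Scio (not the production job): window the stream, route each record to a destination derived from its service and log name, and write each group to GCS as Avro files. The bucket path, window size, shard count, and destination key are assumptions, and dead-letter routing is omitted.

        import com.spotify.scio.values.SCollection
        import org.apache.avro.Schema
        import org.apache.avro.generic.GenericRecord
        import org.apache.beam.sdk.coders.StringUtf8Coder
        import org.apache.beam.sdk.io.{AvroIO, FileIO}
        import org.apache.beam.sdk.transforms.SerializableFunction
        import org.joda.time.Duration

        object DataHubToDataLakeSketch {
          def writeToDataLake(records: SCollection[GenericRecord], schema: Schema): Unit = {
            // Destination key per record, e.g. "service-a/access_log".
            val destinationFn: SerializableFunction[GenericRecord, String] =
              (r: GenericRecord) => s"${r.get("service_name")}/${r.get("log_name")}"

            // One file-naming scheme per destination directory.
            val namingFn: SerializableFunction[String, FileIO.Write.FileNaming] =
              (dest: String) => FileIO.Write.defaultNaming(s"$dest/part", ".avro")

            records
              .withFixedWindows(Duration.standardMinutes(5))  // windowed so files can be finalized
              .saveAsCustomOutput("WriteDataLake",
                FileIO.writeDynamic[String, GenericRecord]()
                  .by(destinationFn)                          // group records by destination
                  .withDestinationCoder(StringUtf8Coder.of())
                  .via(AvroIO.sink(schema))                   // write each group as Avro files
                  .to("gs://example-datalake/raw")
                  .withNaming(namingFn)
                  .withNumShards(10))                         // required for unbounded input
            ()
          }
        }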
  26. 41 4. Inserting Avro records into BigQuery as a stream
      - E: Read Avro records (with their schemas) from the Structured DataHub, and get their destinations.
      - T: Convert the Avro records to BigQuery TableRow objects and identify the table names from those destinations.
      - L: Insert the TableRow objects into BigQuery as a stream, using Beam's BigQueryIO with the DynamicDestinations API.
      (A sketch of this job follows below.)
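      A minimal sketch of step 4 using BigQueryIO with the DynamicDestinations API from Scio (not the production job): route each record to a BigQuery table named after its log_name and stream-insert it as a TableRow. The project, dataset, table naming, schema, and format function are toy assumptions.

        import com.google.api.services.bigquery.model.{TableFieldSchema, TableRow, TableSchema}
        import com.spotify.scio.values.SCollection
        import org.apache.avro.generic.GenericRecord
        import org.apache.beam.sdk.coders.{Coder, StringUtf8Coder}
        import org.apache.beam.sdk.io.gcp.bigquery.{BigQueryIO, DynamicDestinations, TableDestination}
        import org.apache.beam.sdk.transforms.SerializableFunction
        import org.apache.beam.sdk.values.ValueInSingleWindow

        import scala.collection.JavaConverters._

        // Routes each record to "my-project:structured_logs.<log_name>" and supplies that table's schema.
        class LogDestinations extends DynamicDestinations[GenericRecord, String] {
          override def getDestination(element: ValueInSingleWindow[GenericRecord]): String =
            element.getValue.get("log_name").toString

          override def getTable(logName: String): TableDestination =
            new TableDestination(s"my-project:structured_logs.$logName", s"Streamed logs for $logName")

          override def getSchema(logName: String): TableSchema =
            new TableSchema().setFields(List(
              new TableFieldSchema().setName("uuid").setType("STRING"),
              new TableFieldSchema().setName("service_name").setType("STRING")
            ).asJava)

          override def getDestinationCoder(): Coder[String] = StringUtf8Coder.of()
        }

        object DataHubToBigQuerySketch {
          def streamToBigQuery(records: SCollection[GenericRecord]): Unit = {
            // Toy conversion from a structured Avro record to a BigQuery TableRow.
            val formatFn: SerializableFunction[GenericRecord, TableRow] =
              (r: GenericRecord) => new TableRow()
                .set("uuid", r.get("uuid").toString)
                .set("service_name", r.get("service_name").toString)

            records.saveAsCustomOutput("StreamToBigQuery",
              BigQueryIO.write[GenericRecord]()
                .to(new LogDestinations)
                .withFormatFunction(formatFn)
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND))
            ()
          }
        }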
  27. 42 Conclusion
      01 Mercari is changing its system architecture: monolith to microservices, on-premises to cloud.
      02 We are creating a new pipeline for the microservice architecture on GCP services (Pub/Sub, GCS, Dataflow, and BigQuery).
      03 We use Apache Beam with Spotify Scio to write ETL jobs that run on Cloud Dataflow.
      04 The Avro format is very useful because Apache Beam has APIs to write it to both GCS and BigQuery.