Slide 1

Warehousing MongoDB Data Using Apache Beam and BigQuery
Sandeep Parikh, Head of Solutions Architecture, Americas East
@crcsmnky

Slide 2

Agenda
● MongoDB on Google Cloud Platform
● What is Data Warehousing
● Tools & Technologies
● Example Use Case

Slide 3

MongoDB on Google Cloud Platform

Slide 4

MongoDB on Google Cloud Platform

Slide 5

Manually Deploying MongoDB

Slide 6

Google Cloud Launcher

Slide 7

MongoDB Cloud Manager

Slide 8

MongoDB Cloud Manager: how do you automate this?

Slide 9

Bootstrapping MongoDB Cloud Manager: a Deployment Manager template

Slide 10

Cloud Deployment Manager
● Provision and configure your deployment
● Configuration as code
● Declarative approach to configuration
● Template-driven: supports YAML, Jinja, and Python
● Use schemas to constrain parameters
● References control order and dependencies
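To make the shape of this concrete, here is a minimal sketch of a Deployment Manager configuration; the template, resource, and property names are hypothetical, not the deck's actual files:

  # mongodb.yaml (hypothetical names throughout)
  imports:
  - path: mongodb-instance.jinja    # Jinja template defining the instance resources

  resources:
  - name: mongodb-cluster
    type: mongodb-instance.jinja    # instantiate the imported template
    properties:                     # parameters a schema could constrain
      zone: us-east1-b
      machineType: n1-standard-4
      diskSizeGb: 500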

Slide 11

Bootstrapping Cloud Manager
Schema, configuration, and template posted on GitHub: https://github.com/GoogleCloudPlatform/mongodb-cloud-manager
Three Compute Engine instances, each with 500 GB PD-SSD, with the MongoDB Cloud Manager automation agent pre-installed and configured.

  $ gcloud deployment-manager deployments create mongodb-cloud-manager \
      --config mongodb-cloud-manager.jinja \
      --properties mmsGroupId=MMSGROUPID,mmsApiKey=MMSAPIKEY

Slide 12

What’s a Data Warehouse

Slide 13

Data warehouses are central repositories of integrated data from one or more disparate sources.
https://en.wikipedia.org/wiki/Data_warehouse

Slide 14

Money + Data + Data + Data → Data Warehouse → Insights → Profit!

Slide 15

Tools and Technologies

Slide 16

Where: BigQuery

Slide 17

BigQuery
● Complex, petabyte-scale data warehousing made simple
● Scales automatically; no setup or admin
● Foundation for analytics and machine learning

Slide 18

[Screenshot: the BigQuery web UI and its RUN QUERY button]

Slide 19


Slide 20

How: Apache Beam (incubating)

Slide 21

Apache Beam
● Modern data processing
● Pipeline-centric approach
● Batch and streaming, from the same codebase
● Portable across runtime environments
● Build pipelines using Java (GA) or Python (alpha)

Slide 22

Apache Beam Lineage: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, and Millwheel → Google Cloud Dataflow → Apache Beam

Slide 23

Beam Modes of Operation
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation

Slide 24

Pipelines in Beam

  Pipeline p = Pipeline.create();
  p.begin()
   .apply(TextIO.Read.from("gs://…"))
   .apply(ParDo.of(new ExtractTags()))
   .apply(Count.perElement())
   .apply(ParDo.of(new ExpandPrefixes()))
   .apply(Top.largestPerKey(3))
   .apply(TextIO.Write.to("gs://…"));
  p.run();

Batch to streaming: swap the bounded source and sink for unbounded ones and window the input; the rest of the pipeline is unchanged.

  .apply(PubsubIO.Read.from("input_topic"))
  .apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))))
  .apply(PubsubIO.Write.to("output_topic"));

Slide 25

Apache Beam Vision
● Beam Model: Pipeline Construction (Beam Java, Beam Python, other languages)
● Beam Model: Fn Runners (Apache Flink, Apache Spark, Cloud Dataflow)
● Execution on each runner's environment

Slide 26

Running Apache Beam
● Cloud Dataflow
● Local Runner
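As a sketch, the runner is chosen through standard pipeline options; the flag and class names below follow the Beam Java SDK conventions of the time, and the project and bucket names are hypothetical:

  // Build pipeline options from command-line flags, e.g.:
  //   local:   --runner=DirectRunner
  //   managed: --runner=DataflowRunner --project=my-project \
  //            --tempLocation=gs://my-bucket/tmp
  PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
  Pipeline p = Pipeline.create(options);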

Slide 27

Cloud Dataflow Service
A great place to execute Beam pipelines, providing:
● A fully managed, no-ops execution environment
● Integration with Google Cloud Platform
● Java support in GA, Python in alpha

Slide 28

Fully Managed: Worker Lifecycle Management (deploy and tear down workers automatically)

Slide 29

Fully Managed: Dynamic Worker Scaling

Slide 30

Fully Managed: Dynamic Work Rebalancing (100 mins. without vs. 65 mins. with)

Slide 31

Integrated: Monitoring UI

Slide 32

Integrated: Distributed Logging

Slide 33

Integrated: Google Cloud Platform
● Capture: Cloud Logs, Google App Engine, Google Analytics Premium, Cloud Pub/Sub
● Process (batch and stream): Cloud Dataflow, Cloud Dataproc
● Store: Cloud Storage (files), Cloud Datastore, Cloud Bigtable (NoSQL), BigQuery Storage (tables)
● Analyze: BigQuery Analytics (SQL), Cloud Dataflow, Cloud Dataproc, plus Cloud Bigtable and Cloud Monitoring for real-time analytics and alerts

Slide 34

Example Use Case

Slide 35

Sensor Data
● 1,000 devices: ID, Type, Name (Cloud Storage)
● 27M log entries: Device ID, Value, Timestamp (MongoDB)

Slide 36

What’s the average reading per sensor type?

Slide 37

Beam + MongoDB: two options today (a sketch of the export approach follows)
● Export to JSON or CSV, then TextIO.Read
● Query directly with MongoClient find()
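A minimal sketch of the export approach, assuming a mongoexport dump with one JSON document per line sitting in Cloud Storage (the bucket path is hypothetical):

  // Read the exported logs line by line and parse each line into a BSON Document.
  PCollection<Document> sensorLogs = p
      .apply(TextIO.Read.from("gs://my-bucket/sensor-logs.json"))
      .apply("ParseJson", ParDo.of(new DoFn<String, Document>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          c.output(Document.parse(c.element()));
        }
      }));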

Slide 38

Beam + MongoDB (coming soon!): MongoDbIO.Read

Slide 39

Pipeline Execution

Slide 40

Using MongoDbIO.Read

  // Read sensor logs from MongoDB and create a PCollection of Documents
  PCollection<Document> sensorLogs = p.apply(MongoDbIO.read()
      .withUri("mongodb://" + options.getMongoDBHost() + ":27017")
      .withDatabase(options.getMongoDBDatabase())
      .withCollection(options.getMongoDBCollection()));

  // Extract "Device ID -> Value" pairs into a PCollection
  PCollection<KV<String, Double>> sensorIdValue = sensorLogs
      .apply("ExtractValues", ParDo.of(new DoFn<Document, KV<String, Double>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          String deviceId = c.element().getObjectId("_id").toString();
          Double value = c.element().getDouble("v");
          c.output(KV.of(deviceId, value));
        }
      }));

Slide 41

Transforming Data
1. Document → Device ID, Value
2. CSV → Device ID, Type
3. Join → Type, Value
4. Combine → Type, Mean Value
5. Output to BigQuery
A sketch of these transforms follows.
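A minimal sketch of steps 2 through 5, assuming the sensorIdValue collection from the previous slide and a hypothetical deviceIdType collection (Device ID → Type pairs parsed from the CSV in Cloud Storage); the output table and field names are also hypothetical. Imports from the Beam SDK and the BigQuery model classes are omitted, matching the deck's style.

  // Join "Device ID -> Value" with "Device ID -> Type" using CoGroupByKey.
  final TupleTag<Double> valueTag = new TupleTag<Double>();
  final TupleTag<String> typeTag = new TupleTag<String>();
  PCollection<KV<String, CoGbkResult>> joined = KeyedPCollectionTuple
      .of(valueTag, sensorIdValue)
      .and(typeTag, deviceIdType)
      .apply(CoGroupByKey.<String>create());

  // Re-key each reading by its sensor type.
  PCollection<KV<String, Double>> typeValue = joined
      .apply("TypeToValue", ParDo.of(new DoFn<KV<String, CoGbkResult>, KV<String, Double>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          String type = c.element().getValue().getOnly(typeTag);
          for (Double value : c.element().getValue().getAll(valueTag)) {
            c.output(KV.of(type, value));
          }
        }
      }));

  // Average per type, convert to TableRows, and write to BigQuery.
  typeValue
      .apply(Mean.<String, Double>perKey())
      .apply("ToTableRow", ParDo.of(new DoFn<KV<String, Double>, TableRow>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          c.output(new TableRow()
              .set("type", c.element().getKey())
              .set("mean_value", c.element().getValue()));
        }
      }))
      .apply(BigQueryIO.Write
          .to("my-project:sensors.mean_by_type")
          .withSchema(new TableSchema().setFields(Arrays.asList(
              new TableFieldSchema().setName("type").setType("STRING"),
              new TableFieldSchema().setName("mean_value").setType("FLOAT")))));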

Slide 42

Questions?
Apache Beam: http://beam.incubator.apache.org
Cloud Dataflow: http://cloud.google.com/dataflow
BigQuery: http://cloud.google.com/bigquery