
MongoDB Europe 2016: Using Beam and BigQuery with MongoDB


Sandeep Parikh

November 15, 2016


  1. Warehousing MongoDB Data Using Apache Beam and BigQuery

    Sandeep Parikh, Head of Solutions Architecture, Americas East, @crcsmnky
  2. Agenda

    • MongoDB on Google Cloud Platform
    • What is Data Warehousing
    • Tools & Technologies
    • Example Use Case
  3. Confidential & Proprietary Google Cloud Platform 3 MongoDB on Google

    Cloud Platform
  4. MongoDB on Google Cloud Platform

  5. Manually Deploying MongoDB

  6. Google Cloud Launcher

  7. MongoDB Cloud Manager

  8. MongoDB Cloud Manager: How do you automate this?

  9. Bootstrapping MongoDB Cloud Manager Deployment Manager

  10. Cloud Deployment Manager

    • Provision, configure your deployment
    • Configuration as code
    • Declarative approach to configuration
    • Template-driven
    • Supports YAML, Jinja, and Python
    • Use schemas to constrain parameters
    • References control order and dependencies
  11. Bootstrapping Cloud Manager

    • Schema, Configuration & Template posted on GitHub: https://github.com/GoogleCloudPlatform/mongodb-cloud-manager
    • Three Compute Engine instances, each with 500 GB PD-SSD
    • MongoDB Cloud Manager automation agent pre-installed and configured

        $ gcloud deployment-manager deployments create mongodb-cloud-manager \
            --config mongodb-cloud-manager.jinja \
            --properties mmsGroupId=MMSGROUPID,mmsApiKey=MMSAPIKEY
  12. What’s a Data Warehouse?

  13. Data Warehouses are central repositories of integrated data from one or more disparate sources

    https://en.wikipedia.org/wiki/Data_warehouse
  14. Diagram labels: Data, Data, Data; Data Warehouse; Insights; Money; Profit!
  15. Tools and Technologies

  16. Where: BigQuery

  17. BigQuery

    • Complex, petabyte-scale data warehousing made simple
    • Scales automatically; no setup or admin
    • Foundation for analytics and machine learning
  18. RUN QUERY

  19. Google Cloud Platform 19

  20. How: Apache Beam (incubating)

  21. Apache Beam

    • Modern data processing
    • Pipeline-centric approach
    • Batch and streaming, from the same codebase
    • Portable across runtime environments
    • Build pipelines using Java (GA), Python (alpha)
  22. Apache Beam Lineage

    Diagram labels: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Apache Beam, Google Cloud Dataflow
  23. Beam, Modes of Operation

    1. Classic Batch
    2. Windowed Batch
    3. Streaming
    4. Streaming + Accumulation
  24. Pipelines in Beam

    Batch:

        Pipeline p = Pipeline.create();
        p.begin()
            .apply(TextIO.Read.from("gs://…"))
            .apply(ParDo.of(new ExtractTags()))
            .apply(Count.create())
            .apply(ParDo.of(new ExpandPrefixes()))
            .apply(Top.largestPerKey(3))
            .apply(TextIO.Write.to("gs://…"));
        p.run();

    Batch to Streaming (lines that change):

            .apply(PubsubIO.Read.from("input_topic"))
            .apply(Window.<Integer>by(FixedWindows.of(5, MINUTES)))
            .apply(PubsubIO.Write.to("output_topic"));
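
The `FixedWindows.of(5, MINUTES)` step on slide 24 groups elements by the time interval their timestamp falls into. The sketch below illustrates that bucketing idea in plain Java with no Beam dependency. The class `FixedWindowSketch`, its method names, and the sample timestamps are hypothetical and chosen only for illustration; this is not how Beam implements windowing, and it ignores watermarks, triggers, and late data.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FixedWindowSketch {

    /** Returns the start of the fixed window (aligned to the epoch) that contains ts. */
    public static Instant windowStart(Instant ts, Duration size) {
        long sizeMs = size.toMillis();
        // floorDiv keeps the alignment correct for timestamps before the epoch too.
        long startMs = Math.floorDiv(ts.toEpochMilli(), sizeMs) * sizeMs;
        return Instant.ofEpochMilli(startMs);
    }

    /** Counts how many event timestamps fall into each fixed window. */
    public static Map<Instant, Long> countPerWindow(List<Instant> events, Duration size) {
        Map<Instant, Long> counts = new TreeMap<>();
        for (Instant ts : events) {
            counts.merge(windowStart(ts, size), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample timestamps; the first two share a 5-minute window.
        List<Instant> events = Arrays.asList(
            Instant.parse("2016-11-15T10:01:00Z"),
            Instant.parse("2016-11-15T10:04:59Z"),
            Instant.parse("2016-11-15T10:05:00Z"));

        countPerWindow(events, Duration.ofMinutes(5))
            .forEach((start, n) -> System.out.println(start + " -> " + n));
    }
}
```

In a batch run all timestamps are known up front, so the grouping can be computed in one pass as above. In a streaming run the same assignment rule applies, but the runner has to decide when a window is complete enough to emit.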
  25. Apache Beam Vision

    Diagram labels: Beam Java, Beam Python, Other Languages; Beam Model: Pipeline Construction; Apache Flink, Cloud Dataflow, Apache Spark; Beam Model: Fn Runners; Execution, Execution, Execution
  26. Running Apache Beam

    Diagram labels: Cloud Dataflow, Local
  27. Cloud Dataflow Service

    A great place for executing Beam pipelines which provides:

    • Fully managed, no-ops execution environment
    • Integration with Google Cloud Platform
    • Java support in GA, Python in alpha
  28. Fully Managed: Worker Lifecycle Management

    Deploy, Tear Down
  29. Fully Managed: Dynamic Worker Scaling

  30. Fully Managed: Dynamic Work Rebalancing

    100 mins. vs. 65 mins.

  31. Integrated: Monitoring UI

  32. Integrated: Distributed Logging

  33. Integrated: Google Cloud Platform

    Diagram stages: Capture, Store, Process (Batch, Stream), Analyze

    Diagram components: Cloud Logs, Google App Engine, Google Analytics Premium, Cloud Pub/Sub, BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files), Cloud DataStore, Cloud Dataflow, Cloud Dataproc, BigQuery Analytics (SQL), Cloud Monitoring, Real time analytics and Alerts
  34. Example Use Case

  35. Sensor Data

    • Cloud Storage: 1000 Devices (ID, Type, Name)
    • MongoDB: 27M Log Entries (Device ID, Value, Timestamp)
  36. What’s the average reading per sensor type?

  37. Beam + MongoDB

    • Export (JSON, CSV) → TextIO.Read
    • MongoClient → Find
  38. Beam + MongoDB (coming soon!)

    MongoDbIO.Read
  39. Pipeline Execution

  40. Using MongoDbIO.Read

        // Read sensor logs from MongoDB and create PCollection of Documents
        PCollection<Document> sensorLogs = p.apply(MongoDbIO.read()
            .withUri("mongodb://" + options.getMongoDBHost() + ":27017")
            .withDatabase(options.getMongoDBDatabase())
            .withCollection(options.getMongoDBCollection()));

        // Extract "Device ID -> Value" PCollection
        PCollection<KV<String, Double>> sensorIdValue = sensorLogs
            .apply("ExtractValues", ParDo.of(new DoFn<Document, KV<String, Double>>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                String deviceId = c.element().getObjectId("_id").toString();
                Double value = c.element().getDouble("v");
                c.output(KV.of(deviceId, value));
              }
            }));
  41. Transforming Data

    • Document → Device ID, Value
    • CSV → Device ID, Type
    • Type, Value
    • Type, Mean Value
    • Output to BigQuery
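
The transformation chain on slide 41 (device readings joined to device types, then averaged per type) can be sketched in plain Java collections to show the intended data flow. This is a minimal illustration only and not the talk's Beam pipeline: the class `SensorMeanSketch`, the `Reading` type, the method `meanByType`, and all sample values are hypothetical, and everything runs in memory rather than as a distributed job or a BigQuery write.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class SensorMeanSketch {

    /** One log entry reduced to (Device ID, Value), as in the first step on the slide. */
    public static final class Reading {
        final String deviceId;
        final double value;

        public Reading(String deviceId, double value) {
            this.deviceId = deviceId;
            this.value = value;
        }
    }

    /**
     * Joins readings to their device type and averages the values per type.
     * deviceType plays the role of the (Device ID, Type) pairs parsed from the CSV.
     * Readings whose device ID has no known type are skipped.
     */
    public static Map<String, Double> meanByType(List<Reading> readings,
                                                 Map<String, String> deviceType) {
        return readings.stream()
            .filter(r -> deviceType.containsKey(r.deviceId))
            .collect(Collectors.groupingBy(
                r -> deviceType.get(r.deviceId),          // (Type, Value)
                TreeMap::new,
                Collectors.averagingDouble(r -> r.value))); // (Type, Mean Value)
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration.
        Map<String, String> deviceType = new HashMap<>();
        deviceType.put("d1", "temperature");
        deviceType.put("d2", "temperature");
        deviceType.put("d3", "humidity");

        List<Reading> readings = Arrays.asList(
            new Reading("d1", 20.0),
            new Reading("d2", 22.0),
            new Reading("d3", 40.0),
            new Reading("d3", 50.0));

        meanByType(readings, deviceType)
            .forEach((type, mean) -> System.out.println(type + " -> " + mean));
    }
}
```

The in-memory map lookup stands in for the join between the two inputs. In a real pipeline over 27M log entries, the small device table would more likely be supplied to the workers as a lookup (for example a side input), with the per-type mean computed by a distributed combine.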
  42. Questions?

    • Apache Beam: http://beam.incubator.apache.org
    • Cloud Dataflow: http://cloud.google.com/dataflow
    • BigQuery: http://cloud.google.com/bigquery