
MongoDB Europe 2016: Using Beam and BigQuery with MongoDB


Sandeep Parikh

November 15, 2016

Transcript

  1. Warehousing MongoDB Data Using Apache Beam and BigQuery
     Sandeep Parikh, Head of Solutions Architecture, Americas East (@crcsmnky)
  2. Cloud Deployment Manager: provision and configure your deployment.
     Configuration as code; a declarative, template-driven approach to configuration.
     Supports YAML, Jinja, and Python. Use schemas to constrain parameters; references control order and dependencies.
  3. Bootstrapping Cloud Manager
     Schema, configuration, and template posted on GitHub:
     https://github.com/GoogleCloudPlatform/mongodb-cloud-manager
     Deploys three Compute Engine instances, each with 500 GB PD-SSD and the MongoDB Cloud Manager automation agent pre-installed and configured.

     $ gcloud deployment-manager deployments create mongodb-cloud-manager \
         --config mongodb-cloud-manager.jinja \
         --properties mmsGroupId=MMSGROUPID,mmsApiKey=MMSAPIKEY
  4. Data warehouses are central repositories of integrated data from one or more disparate sources.
     https://en.wikipedia.org/wiki/Data_warehouse
  5. BigQuery: complex, petabyte-scale data warehousing made simple.
     Scales automatically; no setup or administration. A foundation for analytics and machine learning.
  6. Apache Beam: modern data processing.
     Pipeline-centric approach. Batch and streaming from the same codebase. Portable across runtime environments. Build pipelines using Java (GA) or Python (alpha). A minimal pipeline skeleton is sketched below.
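     To make the pipeline-centric model concrete, here is a minimal sketch of a Beam Java pipeline, assuming the 2016-era (pre-2.0) Beam Java API used elsewhere in the deck; the bucket paths and class name are placeholders, not from the talk.

     import org.apache.beam.sdk.Pipeline;
     import org.apache.beam.sdk.io.TextIO;
     import org.apache.beam.sdk.options.PipelineOptions;
     import org.apache.beam.sdk.options.PipelineOptionsFactory;
     import org.apache.beam.sdk.transforms.Count;
     import org.apache.beam.sdk.transforms.MapElements;
     import org.apache.beam.sdk.transforms.SimpleFunction;
     import org.apache.beam.sdk.values.KV;

     public class MinimalPipeline {
       public static void main(String[] args) {
         // The runner is chosen via options, not in pipeline code; that is
         // what makes the same codebase portable across environments.
         PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
         Pipeline p = Pipeline.create(options);

         p.apply(TextIO.Read.from("gs://my-bucket/input-*.txt"))  // bounded source
          .apply(Count.perElement())                              // KV<line, count>
          .apply(MapElements.via(new SimpleFunction<KV<String, Long>, String>() {
            @Override
            public String apply(KV<String, Long> kv) {
              return kv.getKey() + "," + kv.getValue();           // format for output
            }
          }))
          .apply(TextIO.Write.to("gs://my-bucket/output"));       // bounded sink

         p.run();
       }
     }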
  7. Apache Beam lineage (diagram): Google-internal systems (MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel) led to Google Cloud Dataflow and, from there, to Apache Beam.
  8. Beam modes of operation: (1) classic batch, (2) windowed batch, (3) streaming, (4) streaming + accumulation. A windowing sketch follows.
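     As a rough illustration of modes 2 through 4, here is a hedged sketch using the 2016-era Beam Java windowing API; input stands for any PCollection<String> and the durations are arbitrary choices, not from the deck.

     import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
     import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
     import org.apache.beam.sdk.transforms.windowing.FixedWindows;
     import org.apache.beam.sdk.transforms.windowing.Window;
     import org.apache.beam.sdk.values.PCollection;
     import org.joda.time.Duration;

     // (1) Classic batch: no Window transform; a bounded input sits in one global window.

     // (2) Windowed batch and (3) streaming: assign elements to fixed five-minute windows.
     PCollection<String> windowed =
         input.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))));

     // (4) Streaming + accumulation: fire early speculative panes, then accumulate
     // refinements as more data arrives for the same window.
     PCollection<String> accumulated =
         input.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
             .triggering(AfterWatermark.pastEndOfWindow()
                 .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                     .plusDelayOf(Duration.standardMinutes(1))))
             .withAllowedLateness(Duration.standardMinutes(10))
             .accumulatingFiredPanes());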
  9. Pipelines in Beam

     Pipeline p = Pipeline.create();
     p.begin()
      .apply(TextIO.Read.from("gs://…"))
      .apply(ParDo.of(new ExtractTags()))
      .apply(Count.perElement())
      .apply(ParDo.of(new ExpandPrefixes()))
      .apply(Top.largestPerKey(3))
      .apply(TextIO.Write.to("gs://…"));
     p.run();

     Batch to streaming: swap the bounded source and sink for unbounded ones and window the data:

     .apply(PubsubIO.Read.from("input_topic"))
     .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(5))))
     .apply(PubsubIO.Write.to("output_topic"));
  10. Apache Beam vision (diagram): SDKs (Beam Java, Beam Python, other languages) build pipelines against the Beam Model; the Beam Model's Fn Runners map those pipelines onto execution engines such as Apache Flink, Apache Spark, and Cloud Dataflow.
  11. Cloud Dataflow Service: a great place for executing Beam pipelines, providing:
      • Fully managed, no-ops execution environment
      • Integration with Google Cloud Platform
      • Java support in GA, Python in alpha
      A sketch of targeting the Dataflow runner follows.
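      For example, a hedged sketch of pointing a pipeline at the Cloud Dataflow service with the Beam Dataflow runner of that era; the project ID and bucket are placeholders.

      import org.apache.beam.runners.dataflow.DataflowRunner;
      import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
      import org.apache.beam.sdk.Pipeline;
      import org.apache.beam.sdk.options.PipelineOptionsFactory;

      public class RunOnDataflow {
        public static void main(String[] args) {
          // Only the options change; the pipeline's transforms stay the same.
          DataflowPipelineOptions options =
              PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
          options.setRunner(DataflowRunner.class);       // execute on the managed service
          options.setProject("my-gcp-project");          // placeholder project ID
          options.setTempLocation("gs://my-bucket/tmp"); // staging area for the job

          Pipeline p = Pipeline.create(options);
          // ... apply the same transforms as in the batch example ...
          p.run();
        }
      }

      The same effect is commonly achieved from the command line with flags such as --runner=DataflowRunner.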
  12. Integrated with Google Cloud Platform (diagram): capture (Cloud Logs, Google App Engine, Google Analytics Premium, Cloud Pub/Sub), store (Cloud Storage files, Cloud Datastore, Cloud Bigtable NoSQL, BigQuery storage tables), process (Cloud Dataflow, Cloud Dataproc; batch and stream), analyze (BigQuery Analytics SQL, Cloud Bigtable real-time analytics and alerts, Cloud Monitoring).
  13. Example: sensor data. 1,000 devices cataloged in Cloud Storage (ID, Type, Name); 27M log entries in MongoDB (Device ID, Value, Timestamp). A sketch for parsing the device catalog follows.
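      The deck doesn't show the device-catalog side; here is a hedged sketch of parsing the Cloud Storage CSV into a (Device ID -> Type) PCollection, using the same pre-2.0 Beam API as the MongoDbIO code on the next slide. The file path, and the assumption that fields are comma-separated in (ID, Type, Name) order, come from the slide's column list rather than shown code.

      // Read "ID,Type,Name" device records and key them by device ID.
      PCollection<KV<String, String>> deviceIdType = p
          .apply(TextIO.Read.from("gs://my-bucket/devices.csv"))  // placeholder path
          .apply("ParseDevices", ParDo.of(new DoFn<String, KV<String, String>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              String[] fields = c.element().split(",");
              // fields[0] = ID, fields[1] = Type, fields[2] = Name (per the slide)
              c.output(KV.of(fields[0], fields[1]));
            }
          }));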
  14. Using MongoDbIO.Read

      // Read sensor logs from MongoDB and create a PCollection of Documents
      PCollection<Document> sensorLogs = p.apply(MongoDbIO.read()
          .withUri("mongodb://" + options.getMongoDBHost() + ":27017")
          .withDatabase(options.getMongoDBDatabase())
          .withCollection(options.getMongoDBCollection()));

      // Extract a "Device ID -> Value" PCollection
      PCollection<KV<String, Double>> sensorIdValue = sensorLogs
          .apply("ExtractValues", ParDo.of(new DoFn<Document, KV<String, Double>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              String deviceId = c.element().getObjectId("_id").toString();
              Double value = c.element().getDouble("v");
              c.output(KV.of(deviceId, value));
            }
          }));
  15. Transforming data: Document -> (Device ID, Value); CSV -> (Device ID, Type); join to (Type, Value); aggregate to (Type, Mean Value); output to BigQuery. A sketch of the join, mean, and write follows.
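      A hedged sketch of those remaining steps with the pre-2.0 Beam Java API: the CSV mapping becomes a side input for the join, Mean.perKey computes the per-type average, and BigQueryIO writes the result. The table and field names are placeholders, not from the deck.

      // Make the (Device ID -> Type) mapping available as a side input.
      final PCollectionView<Map<String, String>> deviceTypes =
          deviceIdType.apply(View.<String, String>asMap());

      // Join: (Device ID, Value) with (Device ID -> Type) => (Type, Value).
      PCollection<KV<String, Double>> typeValue = sensorIdValue
          .apply("JoinOnDeviceId", ParDo.of(new DoFn<KV<String, Double>, KV<String, Double>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              String type = c.sideInput(deviceTypes).get(c.element().getKey());
              c.output(KV.of(type, c.element().getValue()));
            }
          }).withSideInputs(deviceTypes));

      // Aggregate: (Type, Value) => (Type, Mean Value).
      PCollection<KV<String, Double>> typeMean =
          typeValue.apply(Mean.<String, Double>perKey());

      // Convert to TableRows and write one row per device type to BigQuery.
      typeMean
          .apply("ToTableRows", ParDo.of(new DoFn<KV<String, Double>, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              c.output(new TableRow()
                  .set("type", c.element().getKey())
                  .set("mean_value", c.element().getValue()));
            }
          }))
          .apply(BigQueryIO.Write.to("my-project:sensors.type_means")  // placeholder table
              .withSchema(new TableSchema().setFields(Arrays.asList(
                  new TableFieldSchema().setName("type").setType("STRING"),
                  new TableFieldSchema().setName("mean_value").setType("FLOAT")))));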