MongoDB Europe 2016: Using Beam and BigQuery with MongoDB
Sandeep Parikh
November 15, 2016
Transcript
Warehousing MongoDB Data Using Apache Beam and BigQuery
Sandeep Parikh, Head of Solutions Architecture, Americas East (@crcsmnky)
Agenda
- MongoDB on Google Cloud Platform
- What is Data Warehousing
- Tools & Technologies
- Example Use Case
Confidential & Proprietary, Google Cloud Platform

MongoDB on Google Cloud Platform
Manually Deploying MongoDB

Google Cloud Launcher

MongoDB Cloud Manager
MongoDB Cloud Manager: how do you automate this?
Bootstrapping MongoDB Cloud Manager: Deployment Manager Template
Cloud Deployment Manager
- Provision and configure your deployment
- Configuration as code: a declarative approach to configuration
- Template-driven; supports YAML, Jinja, and Python
- Use schemas to constrain parameters
- References control order and dependencies
Bootstrapping Cloud Manager
Schema, configuration, and template posted on GitHub: https://github.com/GoogleCloudPlatform/mongodb-cloud-manager
Creates three Compute Engine instances, each with 500 GB PD-SSD, with the MongoDB Cloud Manager automation agent pre-installed and configured.

$ gcloud deployment-manager deployments create mongodb-cloud-manager \
    --config mongodb-cloud-manager.jinja \
    --properties mmsGroupId=MMSGROUPID,mmsApiKey=MMSAPIKEY
What's a Data Warehouse?
Data warehouses are central repositories of integrated data from one or more disparate sources. (https://en.wikipedia.org/wiki/Data_warehouse)
[Diagram: data flows from many sources into the data warehouse; insights (and profit) come out.]
Tools and Technologies

Where: BigQuery
BigQuery
- Complex, petabyte-scale data warehousing made simple
- Scales automatically; no setup or admin
- Foundation for analytics and machine learning
[Screenshot: running a query ("RUN QUERY") in the BigQuery web UI.]
How: Apache Beam (incubating)
Apache Beam
- Modern data processing
- Pipeline-centric approach
- Batch and streaming, from the same codebase
- Portable across runtime environments
- Build pipelines using Java (GA) or Python (alpha)
Apache Beam Lineage (diagram): MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, and Millwheel lead to Apache Beam and Google Cloud Dataflow.
Beam, Modes of Operation
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
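To build intuition for windowed batch, fixed-window assignment can be sketched in plain Java. This mirrors the idea behind Beam's FixedWindows; the helper class below is illustrative, not Beam API:

```java
// Sketch: assign an epoch-millisecond timestamp to a fixed window,
// mirroring the idea behind Beam's FixedWindows (illustrative, not Beam API).
public class FixedWindowSketch {
    // Returns the start of the fixed window containing the timestamp.
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - (timestampMillis % windowSizeMillis);
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L;
        long t = 7 * 60 * 1000L + 30 * 1000L; // 00:07:30 after the epoch
        // 00:07:30 falls into the 00:05:00-00:10:00 window.
        System.out.println(windowStart(t, fiveMinutes)); // prints 300000 (00:05:00)
    }
}
```

Every element is bucketed by its timestamp alone, which is why the same code serves batch and streaming: only the source of elements and timestamps changes.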
Pipelines in Beam

Pipeline p = Pipeline.create();
p.begin()
 .apply(TextIO.Read.from("gs://…"))
 .apply(ParDo.of(new ExtractTags()))
 .apply(Count.create())
 .apply(ParDo.of(new ExpandPrefixes()))
 .apply(Top.largestPerKey(3))
 .apply(TextIO.Write.to("gs://…"));
p.run();

Batch to Streaming: swap the I/O and add windowing:

 .apply(PubsubIO.Read.from("input_topic"))
 .apply(Window.<Integer>by(FixedWindows.of(5, MINUTES)))
 .apply(PubsubIO.Write.to("output_topic"));
Apache Beam Vision (diagram): pipelines constructed against the Beam Model (Beam Java, Beam Python, other languages) execute via Fn Runners on engines such as Apache Flink, Apache Spark, and Cloud Dataflow.
Running Apache Beam: Cloud Dataflow or a local runner.
Cloud Dataflow Service: a great place for executing Beam pipelines, which provides:
- Fully managed, no-ops execution environment
- Integration with Google Cloud Platform
- Java support in GA, Python in alpha
Fully Managed: Worker Lifecycle Management (deploy, tear down)
Fully Managed: Dynamic Worker Scaling
Fully Managed: Dynamic Work Rebalancing (100 mins vs. 65 mins)
Integrated: Monitoring UI
Integrated: Distributed Logging
Integrated: Google Cloud Platform (diagram): a capture/store/process/analyze flow. Capture with Cloud Logs, Google App Engine, Google Analytics Premium, and Cloud Pub/Sub; store in BigQuery tables, Cloud Bigtable, Cloud Storage, and Cloud Datastore; process batch and stream with Cloud Dataflow and Cloud Dataproc; analyze with BigQuery, with Cloud Monitoring and Cloud Bigtable for real-time analytics and alerts.
Example Use Case
Sensor Data
- 1,000 devices
- Cloud Storage: device metadata (ID, Type, Name)
- MongoDB: 27M log entries (Device ID, Value, Timestamp)
What’s the average reading per sensor type?
Beam + MongoDB
- Export to JSON/CSV, then TextIO.Read
- Or query directly via MongoClient find()
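On the export path, each line read by TextIO.Read must be parsed into a (device ID, value) pair. A minimal plain-Java sketch, assuming a hypothetical CSV field order of deviceId, value, timestamp:

```java
// Sketch: parse one line of a hypothetical CSV export of sensor logs
// ("deviceId,value,timestamp") into a device-ID/value pair.
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class SensorLogParser {
    static Map.Entry<String, Double> parseLine(String csvLine) {
        String[] fields = csvLine.split(",");
        // fields[0] = device ID, fields[1] = reading value
        return new SimpleEntry<>(fields[0], Double.parseDouble(fields[1]));
    }

    public static void main(String[] args) {
        Map.Entry<String, Double> e = parseLine("sensor-42,21.5,1479168000");
        System.out.println(e.getKey() + " -> " + e.getValue()); // sensor-42 -> 21.5
    }
}
```

In the pipeline this parsing would live inside a DoFn applied right after TextIO.Read.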
Beam + MongoDB (coming soon!): MongoDbIO.Read
Pipeline Execution
Using MongoDbIO.Read

// Read sensor logs from MongoDB and create a PCollection of Documents
PCollection<Document> sensorLogs = p.apply(MongoDbIO.read()
    .withUri("mongodb://" + options.getMongoDBHost() + ":27017")
    .withDatabase(options.getMongoDBDatabase())
    .withCollection(options.getMongoDBCollection()));

// Extract "Device ID -> Value" PCollection
PCollection<KV<String, Double>> sensorIdValue = sensorLogs
    .apply("ExtractValues", ParDo.of(new DoFn<Document, KV<String, Double>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            String deviceId = c.element().getObjectId("_id").toString();
            Double value = c.element().getDouble("v");
            c.output(KV.of(deviceId, value));
        }
    }));
Transforming Data
- Document → (Device ID, Value)
- CSV → (Device ID, Type)
- Join → (Type, Value)
- Mean per key → (Type, Mean Value)
- Output to BigQuery
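Outside Beam, the join-and-average these steps describe boils down to the following plain-Java sketch (hypothetical data; in the pipeline itself this would be a join on device ID, e.g. via CoGroupByKey, followed by a per-key mean):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MeanPerType {
    // Join (deviceId, value) readings with (deviceId -> type) metadata,
    // then average the readings per sensor type.
    static Map<String, Double> meanPerType(
            List<? extends Map.Entry<String, Double>> readings,
            Map<String, String> idToType) {
        Map<String, Double> sums = new HashMap<>();
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Double> r : readings) {
            String type = idToType.get(r.getKey());
            if (type == null) continue; // skip devices with no metadata
            sums.merge(type, r.getValue(), Double::sum);
            counts.merge(type, 1, Integer::sum);
        }
        Map<String, Double> means = new HashMap<>();
        for (Map.Entry<String, Double> s : sums.entrySet()) {
            means.put(s.getKey(), s.getValue() / counts.get(s.getKey()));
        }
        return means;
    }
}
```

The resulting (Type, Mean Value) pairs are what the final step writes to BigQuery, one row per sensor type.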
Questions?
Apache Beam: http://beam.incubator.apache.org
Cloud Dataflow: http://cloud.google.com/dataflow
BigQuery: http://cloud.google.com/bigquery