MongoDB Europe 2016: Using Beam and BigQuery with MongoDB
Sandeep Parikh
November 15, 2016
Transcript
Warehousing MongoDB Data Using Apache Beam and BigQuery
Sandeep Parikh, Head of Solutions Architecture, Americas East (@crcsmnky)
Agenda
• MongoDB on Google Cloud Platform
• What is Data Warehousing
• Tools & Technologies
• Example Use Case
Google Cloud Platform (Confidential & Proprietary)
MongoDB on Google Cloud Platform
Manually Deploying MongoDB
Google Cloud Launcher
MongoDB Cloud Manager
MongoDB Cloud Manager: How do you automate this?
Bootstrapping MongoDB Cloud Manager: Deployment Manager Template
Cloud Deployment Manager
• Provision, configure your deployment
• Configuration as code
• Declarative approach to configuration
• Template-driven
• Supports YAML, Jinja, and Python
• Use schemas to constrain parameters
• References control order and dependencies
Bootstrapping Cloud Manager
• Schema, Configuration & Template posted on GitHub: https://github.com/GoogleCloudPlatform/mongodb-cloud-manager
• Three Compute Engine instances, each with 500 GB PD-SSD
• MongoDB Cloud Manager automation agent pre-installed and configured

    $ gcloud deployment-manager deployments create mongodb-cloud-manager \
        --config mongodb-cloud-manager.jinja \
        --properties mmsGroupId=MMSGROUPID,mmsApiKey=MMSAPIKEY

What’s a Data Warehouse
"Data Warehouses are central repositories of integrated data from one or more disparate sources" (https://en.wikipedia.org/wiki/Data_warehouse)
Data Warehouse [diagram labels: Money; Data, Data, Data; Insights; Profit!]
Tools and Technologies
Where: BigQuery
BigQuery
• Complex, petabyte-scale data warehousing made simple
• Scales automatically; no setup or admin
• Foundation for analytics and machine learning
[Image slides; only the label RUN QUERY survives]
How: Apache Beam (incubating)
Apache Beam
• Modern data processing
• Pipeline-centric approach
• Batch and streaming, from the same codebase
• Portable across runtime environments
• Build pipelines using Java (GA), Python (alpha)
Apache Beam Lineage [diagram labels: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Apache Beam, Google Cloud Dataflow]
Beam, Modes of Operation
1 Classic Batch
2 Windowed Batch
3 Streaming
4 Streaming + Accumulation
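The windowed modes listed above group elements by time. As a rough illustration only, the following plain-Java sketch (no Beam dependency) shows one way a timestamp can be mapped to an epoch-aligned fixed window; the class and method names are hypothetical and are not taken from the deck or from the Beam SDK.

```java
import java.time.Duration;
import java.time.Instant;

/** Plain-Java sketch of fixed-window assignment; names are illustrative, not Beam API. */
final class FixedWindowSketch {

    /** Returns the start of the epoch-aligned window of the given size that contains the timestamp. */
    static Instant windowStart(Instant timestamp, Duration size) {
        long sizeMillis = size.toMillis();
        long ts = timestamp.toEpochMilli();
        // floorMod keeps the result correct for timestamps before the epoch.
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMillis));
    }

    public static void main(String[] args) {
        Duration fiveMinutes = Duration.ofMinutes(5);
        Instant reading = Instant.parse("2016-11-15T10:07:30Z");
        // All readings whose window start is equal would be grouped together.
        System.out.println(windowStart(reading, fiveMinutes));
    }
}
```

Every element that maps to the same window start falls into the same group, which is what lets a batch-style aggregation run once per window.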
Pipelines in Beam: Batch to Streaming

    Pipeline p = Pipeline.create();
    p.begin()
      .apply(TextIO.Read.from("gs://…"))
      .apply(ParDo.of(new ExtractTags()))
      .apply(Count.create())
      .apply(ParDo.of(new ExpandPrefixes()))
      .apply(Top.largestPerKey(3))
      .apply(TextIO.Write.to("gs://…"));
    p.run();

The slide then repeats the same pipeline and overlays these steps for streaming:

      .apply(PubsubIO.Read.from("input_topic"))
      .apply(Window.<Integer>by(FixedWindows.of(5, MINUTES)))
      .apply(PubsubIO.Write.to("output_topic"));

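To make the data flow of the batch pipeline above concrete, here is a plain-Java sketch of the same sequence of steps without any Beam dependency. The deck does not show ExtractTags or ExpandPrefixes, so their behavior here (pulling hashtags out of text lines, and emitting every prefix of each tag) is an assumption inferred from their names; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java sketch of the tag pipeline; step semantics are assumed, not taken from the deck. */
final class TagPipelineSketch {
    private static final Pattern TAG = Pattern.compile("#(\\w+)");

    /** Assumed ExtractTags + Count: count each hashtag across all lines. */
    static Map<String, Long> countTags(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = TAG.matcher(line);
            while (m.find()) {
                counts.merge(m.group(1).toLowerCase(), 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Assumed ExpandPrefixes + top-n per key: the n most frequent tags for every prefix. */
    static Map<String, List<String>> topPerPrefix(Map<String, Long> counts, int n) {
        // Expand each tag into all of its prefixes.
        Map<String, List<String>> candidates = new TreeMap<>();
        for (String tag : counts.keySet()) {
            for (int end = 1; end <= tag.length(); end++) {
                candidates.computeIfAbsent(tag.substring(0, end), k -> new ArrayList<>()).add(tag);
            }
        }
        // Highest count first; ties broken alphabetically so the result is deterministic.
        Comparator<String> byCountDesc = (a, b) -> {
            int byCount = Long.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        };
        Map<String, List<String>> top = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : candidates.entrySet()) {
            List<String> tags = new ArrayList<>(e.getValue());
            tags.sort(byCountDesc);
            top.put(e.getKey(), new ArrayList<>(tags.subList(0, Math.min(n, tags.size()))));
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("#beam and #bigquery", "#beam #mongodb", "#big #beam #bigquery");
        System.out.println(topPerPrefix(countTags(lines), 3).get("b"));
    }
}
```

In a real pipeline each of these steps would run in parallel over a distributed collection; the sketch only shows what each stage is assumed to produce.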
Apache Beam Vision [diagram labels: Beam Java, Beam Python, Other Languages; Beam Model: Pipeline Construction; Apache Flink, Apache Spark, Cloud Dataflow; Beam Model: Fn Runners; Execution (three times)]
Running Apache Beam: Cloud Dataflow, Local Runner
Cloud Dataflow Service
A great place for executing Beam pipelines, which provides:
• Fully managed, no-ops execution environment
• Integration with Google Cloud Platform
• Java support in GA, Python in Alpha
Fully Managed: Worker Lifecycle Management (Deploy, Tear Down)
Fully Managed: Dynamic Worker Scaling
Fully Managed: Dynamic Work Rebalancing (100 mins. vs. 65 mins.)
Integrated: Monitoring UI
Integrated: Distributed Logging
Integrated: Google Cloud Platform [diagram; stage labels: Capture, Store, Process, Analyze; other labels: Cloud Logs, Google App Engine, Google Analytics Premium, Cloud Pub/Sub, BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files), Cloud Dataflow, BigQuery Analytics (SQL), Batch, Cloud DataStore, Stream, Cloud Monitoring, Cloud Bigtable, Real time analytics and Alerts, Cloud Dataproc]
Example Use Case
Sensor Data
• 1000 devices in Cloud Storage: ID, Type, Name
• 27M log entries in MongoDB: Device ID, Value, Timestamp
What’s the average reading per sensor type?
Beam + MongoDB
• Export (JSON, CSV), TextIO.Read
• MongoClient, Find
Beam + MongoDB (coming soon!): MongoDbIO.Read
Pipeline Execution
Using MongoDbIO.Read

    // Read sensor logs from MongoDB and create PCollection of Documents
    PCollection<Document> sensorLogs = p.apply(MongoDbIO.read()
        .withUri("mongodb://" + options.getMongoDBHost() + ":27017")
        .withDatabase(options.getMongoDBDatabase())
        .withCollection(options.getMongoDBCollection()));

    // Extract "Device ID -> Value" PCollection
    PCollection<KV<String, Double>> sensorIdValue = sensorLogs
        .apply("ExtractValues", ParDo.of(new DoFn<Document, KV<String, Double>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String deviceId = c.element().getObjectId("_id").toString();
            Double value = c.element().getDouble("v");
            c.output(KV.of(deviceId, value));
          }
        }));

Transforming Data
• Document → Device ID, Value
• CSV → Device ID, Type
• Type, Value
• Type, Mean Value
• Output to BigQuery
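The transformation steps above can be sketched in plain Java, without Beam, to show the logic that answers "average reading per sensor type". This is only an illustration under assumptions: the Reading class, the meanByType method, and the choice to skip readings from unknown devices are hypothetical and do not come from the deck. In an actual Beam pipeline the join and the mean would be expressed with Beam transforms (for example CoGroupByKey or a side input, and Mean.perKey), and the deck does not show which were used.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the join-then-average logic; all names are illustrative. */
final class SensorAverageSketch {

    /** One log entry reduced to the fields used here: Device ID and Value. */
    static final class Reading {
        final String deviceId;
        final double value;

        Reading(String deviceId, double value) {
            this.deviceId = deviceId;
            this.value = value;
        }
    }

    /**
     * Joins readings (Device ID, Value) with the device table (Device ID, Type)
     * and returns the mean value per type.
     */
    static Map<String, Double> meanByType(Map<String, String> typeByDeviceId, List<Reading> readings) {
        // Per type: [0] holds the running sum, [1] holds the count.
        Map<String, double[]> sumAndCount = new TreeMap<>();
        for (Reading r : readings) {
            String type = typeByDeviceId.get(r.deviceId);
            if (type == null) {
                continue; // Assumption: readings from unknown devices are skipped.
            }
            double[] acc = sumAndCount.computeIfAbsent(type, k -> new double[2]);
            acc[0] += r.value;
            acc[1] += 1;
        }
        Map<String, Double> means = new TreeMap<>();
        for (Map.Entry<String, double[]> e : sumAndCount.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }

    public static void main(String[] args) {
        Map<String, String> types = new HashMap<>();
        types.put("d1", "temperature");
        types.put("d2", "humidity");
        List<Reading> readings = new ArrayList<>();
        readings.add(new Reading("d1", 20.0));
        readings.add(new Reading("d1", 22.0));
        readings.add(new Reading("d2", 40.0));
        System.out.println(meanByType(types, readings));
    }
}
```

The resulting (Type, Mean Value) pairs correspond to the rows that the final step would write to BigQuery.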
Questions?
• Apache Beam: http://beam.incubator.apache.org
• Cloud Dataflow: http://cloud.google.com/dataflow
• BigQuery: http://cloud.google.com/bigquery