MongoDB Europe 2016: Using Beam and BigQuery with MongoDB
Sandeep Parikh
November 15, 2016
Transcript
Warehousing MongoDB Data Using Apache Beam and BigQuery
Sandeep Parikh, Head of Solutions Architecture, Americas East (@crcsmnky)
Agenda
• MongoDB on Google Cloud Platform
• What is Data Warehousing
• Tools & Technologies
• Example Use Case
MongoDB on Google Cloud Platform
Manually Deploying MongoDB
Google Cloud Launcher
MongoDB Cloud Manager
MongoDB Cloud Manager: How do you automate this?
Bootstrapping MongoDB Cloud Manager: Deployment Manager Template
Cloud Deployment Manager
• Provision, configure your deployment
• Configuration as code
• Declarative approach to configuration
• Template-driven
• Supports YAML, Jinja, and Python
• Use schemas to constrain parameters
• References control order and dependencies
Bootstrapping Cloud Manager
• Schema, Configuration & Template
• Posted on GitHub: https://github.com/GoogleCloudPlatform/mongodb-cloud-manager
• Three Compute Engine instances, each with 500 GB PD-SSD
• MongoDB Cloud Manager automation agent pre-installed and configured

    $ gcloud deployment-manager deployments create mongodb-cloud-manager \
        --config mongodb-cloud-manager.jinja \
        --properties mmsGroupId=MMSGROUPID,mmsApiKey=MMSAPIKEY
What’s a Data Warehouse
Data warehouses are central repositories of integrated data from one or more disparate sources.
https://en.wikipedia.org/wiki/Data_warehouse
Data Warehouse
[Diagram labels: Data, Data, Data, Insights, Money, Profit!]
Tools and Technologies
Where: BigQuery
BigQuery
• Complex, petabyte-scale data warehousing made simple
• Scales automatically; no setup or admin
• Foundation for analytics and machine learning
[Screenshots; the only recoverable label is RUN QUERY]
How: Apache Beam (incubating)
Apache Beam
• Modern data processing
• Pipeline-centric approach
• Batch and streaming, from the same codebase
• Portable across runtime environments
• Build pipelines using Java (GA), Python (alpha)
Apache Beam Lineage
[Diagram labels: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Google Cloud Dataflow, Apache Beam]
Beam, Modes of Operation
1 Classic Batch
2 Windowed Batch
3 Streaming
4 Streaming + Accumulation
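The windowed modes above group elements by event time before aggregating them. As a rough illustration of that idea only, the plain-Java sketch below assigns timestamps to fixed windows and counts the events in each. It does not use the Beam API: the class name `FixedWindowSketch`, its methods, the five-minute window size, and the choice to align windows to the Unix epoch are all assumptions made for this example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of fixed windowing; not Beam code.
final class FixedWindowSketch {

    // Start of the fixed window that contains ts, with windows aligned to the epoch.
    static Instant windowStart(Instant ts, Duration size) {
        long sizeMs = size.toMillis();
        long tsMs = ts.toEpochMilli();
        return Instant.ofEpochMilli(tsMs - Math.floorMod(tsMs, sizeMs));
    }

    // Counts events per window, keyed by window start (sorted by time).
    static Map<Instant, Long> countPerWindow(Iterable<Instant> events, Duration size) {
        Map<Instant, Long> counts = new TreeMap<>();
        for (Instant ts : events) {
            counts.merge(windowStart(ts, size), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Instant> events = Arrays.asList(
                Instant.parse("2016-11-15T10:01:00Z"),
                Instant.parse("2016-11-15T10:04:59Z"),
                Instant.parse("2016-11-15T10:05:00Z"));
        // The first two events share a window; the third starts the next one.
        countPerWindow(events, Duration.ofMinutes(5))
                .forEach((start, n) -> System.out.println(start + " -> " + n));
    }
}
```

In a real pipeline the runner performs this assignment, and in streaming modes it also has to decide when a window's result is complete enough to emit, which this sketch ignores.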
Pipelines in Beam

    Pipeline p = Pipeline.create();
    p.begin()
     .apply(TextIO.Read.from("gs://…"))
     .apply(ParDo.of(new ExtractTags()))
     .apply(Count.create())
     .apply(ParDo.of(new ExpandPrefixes()))
     .apply(Top.largestPerKey(3))
     .apply(TextIO.Write.to("gs://…"));
    p.run();

Batch to Streaming

     .apply(PubsubIO.Read.from("input_topic"))
     .apply(Window.<Integer>by(FixedWindows.of(5, MINUTES)))
     .apply(PubsubIO.Write.to("output_topic"));
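The slide does not define `ExtractTags` or `ExpandPrefixes`, so the following plain-Java sketch is only one plausible reading of the batch pipeline: count hashtags, expand each tag into its prefixes, and keep the three most frequent tags per prefix. Everything here is an assumption for illustration, including the class name `TopTagsSketch`, the hashtag parsing rule, and the alphabetical tie-break. It runs in memory and does not use the Beam API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of the slide's batch pipeline; not Beam code.
final class TopTagsSketch {

    // Assumed stand-in for ExtractTags: words starting with '#', lowercased.
    static List<String> extractTags(String line) {
        List<String> tags = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (word.length() > 1 && word.startsWith("#")) {
                tags.add(word.substring(1).toLowerCase());
            }
        }
        return tags;
    }

    // Stand-in for the Count step: occurrences of each tag across all lines.
    static Map<String, Long> countTags(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String tag : extractTags(line)) {
                counts.merge(tag, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Assumed stand-in for ExpandPrefixes followed by Top.largestPerKey(n):
    // for every prefix of every tag, keep the n most frequent tags.
    static Map<String, List<String>> topPerPrefix(Map<String, Long> counts, int n) {
        Map<String, List<String>> byPrefix = new TreeMap<>();
        for (String tag : counts.keySet()) {
            for (int end = 1; end <= tag.length(); end++) {
                byPrefix.computeIfAbsent(tag.substring(0, end), k -> new ArrayList<>()).add(tag);
            }
        }
        for (List<String> tags : byPrefix.values()) {
            // Highest count first; ties broken alphabetically so output is deterministic.
            tags.sort((a, b) -> {
                int byCount = Long.compare(counts.get(b), counts.get(a));
                return byCount != 0 ? byCount : a.compareTo(b);
            });
            if (tags.size() > n) {
                tags.subList(n, tags.size()).clear();
            }
        }
        return byPrefix;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "#beam and #bigquery", "#beam with #mongodb", "#bigtable #beam");
        System.out.println(topPerPrefix(countTags(lines), 3).get("b"));
    }
}
```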
Apache Beam Vision
[Diagram labels: Beam Java, Beam Python, Other Languages; Beam Model: Pipeline Construction; Apache Flink, Cloud Dataflow, Apache Spark; Beam Model: Fn Runners; Execution]
Running Apache Beam
• Cloud Dataflow
• Local Runner
Cloud Dataflow Service
A great place for executing Beam pipelines, which provides:
• Fully managed, no-ops execution environment
• Integration with Google Cloud Platform
• Java support in GA; Python in alpha
Fully Managed: Worker Lifecycle Management (Deploy, Tear Down)
Fully Managed: Dynamic Worker Scaling
Fully Managed: Dynamic Work Rebalancing (100 mins. vs. 65 mins.)
Integrated: Monitoring UI
Integrated: Distributed Logging
Integrated: Google Cloud Platform
[Diagram stages: Capture, Store, Process (Batch, Stream), Analyze. Labels: Cloud Logs, Google App Engine, Google Analytics Premium, Cloud Pub/Sub, BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files), Cloud Datastore, Cloud Dataflow, Cloud Dataproc, BigQuery Analytics (SQL), Cloud Monitoring, Real-time analytics and alerts]
Example Use Case
Sensor Data
• Cloud Storage: 1000 devices (ID, Type, Name)
• MongoDB: 27M log entries (Device ID, Value, Timestamp)
What’s the average reading per sensor type?
Beam + MongoDB
• Export (JSON, CSV) → TextIO.Read
• MongoClient → Find
Beam + MongoDB (coming soon!): MongoDbIO.Read
Pipeline Execution
Using MongoDbIO.Read

    // Read sensor logs from MongoDB and create PCollection of Documents
    PCollection<Document> sensorLogs = p.apply(MongoDbIO.read()
        .withUri("mongodb://" + options.getMongoDBHost() + ":27017")
        .withDatabase(options.getMongoDBDatabase())
        .withCollection(options.getMongoDBCollection()));

    // Extract "Device ID -> Value" PCollection
    PCollection<KV<String, Double>> sensorIdValue = sensorLogs
        .apply("ExtractValues", ParDo.of(new DoFn<Document, KV<String, Double>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String deviceId = c.element().getObjectId("_id").toString();
            Double value = c.element().getDouble("v");
            c.output(KV.of(deviceId, value));
          }
        }));
Transforming Data
• Document → Device ID, Value
• CSV → Device ID, Type
• Type, Value
• Type, Mean Value
• Output to BigQuery
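To make the transformation steps concrete, here is a small plain-Java sketch of the same data flow, run in memory rather than as a Beam pipeline. It assumes the (Type, Value) step is produced by joining readings to device metadata on Device ID, which the slide implies but does not state. The class name `MeanPerTypeSketch`, the header-less `id,type,name` CSV layout, and the sample values are hypothetical; the final write to BigQuery is left out.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of the transform steps; not Beam code.
final class MeanPerTypeSketch {

    // CSV → (Device ID, Type). Assumes "id,type,name" lines without a header row.
    static Map<String, String> deviceTypes(List<String> csvLines) {
        Map<String, String> types = new HashMap<>();
        for (String line : csvLines) {
            String[] fields = line.split(",");
            types.put(fields[0].trim(), fields[1].trim());
        }
        return types;
    }

    // (Device ID, Value) joined with (Device ID, Type) → (Type, Mean Value).
    static Map<String, Double> meanPerType(Map<String, String> types,
                                           List<Map.Entry<String, Double>> readings) {
        Map<String, double[]> sumAndCount = new TreeMap<>();
        for (Map.Entry<String, Double> reading : readings) {
            String type = types.get(reading.getKey());
            if (type == null) {
                continue; // reading from a device with no metadata; dropped in this sketch
            }
            double[] acc = sumAndCount.computeIfAbsent(type, k -> new double[2]);
            acc[0] += reading.getValue(); // running sum
            acc[1] += 1;                  // running count
        }
        Map<String, Double> means = new TreeMap<>();
        for (Map.Entry<String, double[]> e : sumAndCount.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }

    public static void main(String[] args) {
        Map<String, String> types = deviceTypes(Arrays.asList(
                "d1,temperature,Sensor 1", "d2,temperature,Sensor 2", "d3,humidity,Sensor 3"));
        List<Map.Entry<String, Double>> readings = Arrays.asList(
                new AbstractMap.SimpleEntry<>("d1", 20.0),
                new AbstractMap.SimpleEntry<>("d2", 22.0),
                new AbstractMap.SimpleEntry<>("d3", 40.0),
                new AbstractMap.SimpleEntry<>("d3", 50.0));
        System.out.println(meanPerType(types, readings));
    }
}
```

With 1000 devices the metadata fits comfortably in memory, so a map lookup is a reasonable stand-in for the join here; the 27M log entries are the part that benefits from distributed execution.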
Questions?
• Apache Beam: http://beam.incubator.apache.org
• Cloud Dataflow: http://cloud.google.com/dataflow
• BigQuery: http://cloud.google.com/bigquery