Building a Lambda
Architecture in 10 minutes
with BigQuery, CEP and Docker
Slide 3
Slide 3 text
+Kazunori Sato
@kazunori_279
Solutions Architect,
Cloud Platform GBU, Google Inc
- GCP solutions design
- Professional services for GCP
- Docker/GCP meetups support
Slide 4
Slide 4 text
The Problem:
Analyze big data
in real-time
Slide 5
Slide 5 text
“I want a real-time dashboard
for my 200 web servers.”
- a customer with 200 Google Compute Engine instances
(Roadmap: BigQuery / BQ Stream + Fluentd / Norikra CEP / Lambda Arch)
How to analyze big data?
Slide 10
Slide 10 text
At Google, we have “big” big data everywhere
What if a Googler is asked:
“Can you give me the list of top 20 Android apps installed in 2012?”
Slide 11
Slide 11 text
At Google,
we run SQL
on Dremel
= Google BigQuery
SELECT
  top(appId, 20) AS app,
  count(*) AS count
FROM installlog.2012
ORDER BY
  count DESC;
It scans 68B rows in ~30 sec,
with no index used.
Slide 12
Slide 12 text
Column Oriented Storage
(Figure: Record Oriented Storage vs. Column Oriented Storage)
Less bandwidth, more compression
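The "less bandwidth" point can be illustrated with a toy model. This is a minimal sketch, not how Dremel stores data: the records and field names are hypothetical, and "cost" is simply the number of values a query has to read.

```python
# Hypothetical install-log records; the field names are illustrative only.
records = [
    {"app_id": "com.example.maps", "user": "u1", "ts": 1325376000},
    {"app_id": "com.example.mail", "user": "u2", "ts": 1325376060},
    {"app_id": "com.example.maps", "user": "u3", "ts": 1325376120},
]


def to_columns(rows):
    """Pivot record-oriented rows into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_read_record_store(rows):
    """A record store reads every field of every record it scans."""
    return sum(len(row) for row in rows)


def values_read_column_store(columns, wanted):
    """A column store reads only the columns the query names."""
    return sum(len(columns[name]) for name in wanted)


columns = to_columns(records)
record_cost = values_read_record_store(records)
column_cost = values_read_column_store(columns, ["app_id"])
print(record_cost, column_cost)
```

A query that touches only `app_id` reads a third of the values in this toy layout. Values within one column also tend to resemble each other, which is what makes them compress better.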
Slide 13
Slide 13 text
select top(title), count(*)
from publicdata:samples.wikipedia
Massively Parallel Processing
Scanning 1 TB in 1 sec
takes 5,000 disks
Each query runs on thousands of servers
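The disk count follows from simple arithmetic. The sketch below assumes the per-disk read throughput implied by the slide's own numbers (1 TB per second spread over 5,000 disks, or about 200 MB/s per disk); real throughput varies by hardware.

```python
import math

TB = 10**12
MB = 10**6


def disks_needed(bytes_to_scan, seconds, disk_bytes_per_sec):
    """Disks required to scan the data in the given time, reading in parallel."""
    return math.ceil(bytes_to_scan / (seconds * disk_bytes_per_sec))


def single_disk_seconds(bytes_to_scan, disk_bytes_per_sec):
    """Time one disk would need to scan the same data alone."""
    return bytes_to_scan / disk_bytes_per_sec


# Assumption: 200 MB/s per disk, the figure implied by the slide.
per_disk = 200 * MB
print(disks_needed(1 * TB, 1, per_disk))
print(single_disk_seconds(1 * TB, per_disk) / 60, "minutes on one disk")
```

Under that assumption a single disk needs well over an hour for the same scan, which is why the work has to be spread across thousands of servers.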
Slide 14
Slide 14 text
Fast aggregation by tree structure
(Figure: query execution tree with Mixer 0 at the root, two Mixer 1 nodes, four Leaf nodes, and Distributed Storage at the bottom)
Query fragments shown on the tree:
SELECT state, year
COUNT(*) GROUP BY state (shown at two levels)
WHERE year >= 1980 and year < 1990
ORDER BY count_babies DESC LIMIT 10
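The tree-shaped aggregation can be sketched in a few lines. This is an illustration only: the split of work between levels (leaves filter and count, mixers merge partial counts, the root sorts and limits) is an assumption, and the function names and sample rows are hypothetical.

```python
from collections import Counter


def leaf_scan(rows, year_from, year_to):
    """Leaf: filter its own shard and produce partial counts per state."""
    counts = Counter()
    for state, year in rows:
        if year_from <= year < year_to:
            counts[state] += 1
    return counts


def mixer_merge(partials):
    """Mixer: add up the partial counts received from its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def root_result(partials, limit):
    """Root mixer: merge, order by count descending, apply the limit."""
    merged = mixer_merge(partials)
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


# Four hypothetical shards, one per leaf.
shards = [
    [("CA", 1981), ("TX", 1985), ("CA", 1979)],
    [("NY", 1982), ("CA", 1989)],
    [("TX", 1990), ("TX", 1980)],
    [("NY", 1984), ("NY", 1983), ("CA", 1986)],
]
leaves = [leaf_scan(shard, 1980, 1990) for shard in shards]
mixers = [mixer_merge(leaves[:2]), mixer_merge(leaves[2:])]
top_states = root_result(mixers, limit=10)
print(top_states)
```

Because counts are additive, each level only passes small partial results upward instead of raw rows.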
Slide 15
Slide 15 text
How to collect
big data?
Slide 16
Slide 16 text
How to collect big data?
Slide 17
Slide 17 text
BigQuery Streaming
- Low cost: $0.01 per 100,000 rows
- Real-time availability of data
- Throughput: 100,000 rows per second per table (x number of tables)
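To get a feel for the streaming price, here is a rough estimate using the figure quoted on the slide. The workload (200 servers, 50 log rows per second each) is a hypothetical assumption, and current pricing may differ.

```python
# Price as quoted on the slide; check current pricing before relying on it.
PRICE_PER_100K_ROWS_USD = 0.01


def streaming_cost_usd(rows):
    """Cost of streaming the given number of rows at the quoted price."""
    return rows / 100_000 * PRICE_PER_100K_ROWS_USD


# Hypothetical workload: 200 servers, each emitting 50 rows per second.
servers = 200
rows_per_sec_per_server = 50
rows_per_day = servers * rows_per_sec_per_server * 86_400

daily_cost = streaming_cost_usd(rows_per_day)
print(f"{rows_per_day:,} rows/day -> ${daily_cost:.2f}/day")
```

At 10,000 rows per second in total, this hypothetical workload also stays well under the per-table throughput limit.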
Slide 18
Slide 18 text
SlideShare uses Fluentd for collecting logs from >500 servers.
"We take full advantage of its extendable plugin architecture and use it as a message bus that collects data from hundreds of servers into multiple backend systems."
- Sylvain Kalache, Operations Engineer
Slide 19
Slide 19 text
Why Fluentd? Because it's super easy to use,
and has an extensive set of plugins written by an active community.
Slide 20
Slide 20 text
Now Fluentd logs can be imported into
BigQuery really easily
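In the setup described here, a Fluentd plugin handles the import. As a rough picture of what gets sent, the sketch below builds a request body in the shape used by BigQuery's `tabledata.insertAll` streaming call (a `rows` list whose items carry an `insertId` and a `json` payload). It is not the plugin's code: the helper name, the hashing scheme for `insertId`, and the event fields are assumptions.

```python
import hashlib
import json


def to_insert_all_body(events):
    """Wrap log events in an insertAll-style request body.

    The insertId is derived from the event content so that re-sending
    the same event yields the same id (a hypothetical dedup scheme).
    """
    rows = []
    for event in events:
        canonical = json.dumps(event, sort_keys=True)
        insert_id = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
        rows.append({"insertId": insert_id, "json": event})
    return {"rows": rows}


# Hypothetical access-log events, as a log collector might emit them.
events = [
    {"time": 1400000000, "host": "web-001", "path": "/", "status": 200},
    {"time": 1400000001, "host": "web-002", "path": "/cart", "status": 500},
]
body = to_insert_all_body(events)
print(json.dumps(body, indent=2))
```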
Slide 21
Slide 21 text
How to analyze
in real-time?
Slide 22
Slide 22 text
How to analyze in real-time?
Slide 23
Slide 23 text
Norikra: an open source Complex Event Processing (CEP) engine
In production use at LINE, the largest Asian SNS with 400M users, for massive log analysis
Slide 24
Slide 24 text
Real-time analysis on streaming data
with in-memory continuous queries
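The idea of an in-memory continuous query can be sketched as a sliding time window that is updated on every incoming event. This is a simplified stand-in, not Norikra's implementation or query language; the class name and event keys are hypothetical.

```python
from collections import deque


class SlidingWindowCount:
    """Keep per-key counts over the last `window_sec` seconds, in memory."""

    def __init__(self, window_sec):
        self.window_sec = window_sec
        self.events = deque()  # (timestamp, key), oldest first
        self.counts = {}

    def add(self, ts, key):
        """Register one event and return the current window's counts."""
        self.events.append((ts, key))
        self.counts[key] = self.counts.get(key, 0) + 1
        self._expire(ts)
        return dict(self.counts)

    def _expire(self, now):
        """Drop events that have fallen out of the time window."""
        while self.events and self.events[0][0] <= now - self.window_sec:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]


window = SlidingWindowCount(window_sec=10)
r1 = window.add(0, "/a")
r2 = window.add(5, "/b")
r3 = window.add(10, "/a")  # the event at t=0 expires here
r4 = window.add(16, "/a")  # the event at t=5 expires here
print(r4)
```

The query result is always available and always fresh, but only for data that still fits in the window, which is the trade-off the real-time side of the architecture accepts.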
Slide 25
Slide 25 text
How to analyze
big data in real-time?
Slide 26
Slide 26 text
How to analyze big data in real-time?
Slide 27
Slide 27 text
Lambda Architecture is:
A complementary pair of:
- in-memory real-time processing: fast, but small and volatile.
- large HDD/SSD batch processing: slow, but large and persistent.
Proposed by Nathan Marz
e.g. Twitter Summingbird
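One way to picture how the two halves complement each other: a query merges a large batch view with a small real-time view. The sketch below assumes the batch view covers everything up to some cutoff and the real-time view covers only events after it; the function name and sample numbers are hypothetical.

```python
def merged_view(batch_view, realtime_view):
    """Combine per-key counts from the batch and real-time layers.

    Assumes the two views cover disjoint time ranges, so counts add up
    without double counting.
    """
    merged = dict(batch_view)
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


# Hypothetical page-view counts.
batch_view = {"/": 1_200_000, "/cart": 85_000}      # persistent, up to the cutoff
realtime_view = {"/": 340, "/checkout": 12}         # volatile, since the cutoff

dashboard = merged_view(batch_view, realtime_view)
print(dashboard)
```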
Slide 28
Slide 28 text
A Recipe for a Lambda Architecture in 10 minutes
1. Fluentd: event log collection from various event sources
2. Norikra: scalable real-time Complex Event Processing (CEP)
3. BigQuery: scalable query engine for large datasets
4. Google Spreadsheet: flexible dashboard with a variety of charts
5. Docker: repeatable deployment in 10 minutes
Slide 29
Slide 29 text
Lambda Arch by BQ+Norikra
Slide 30
Slide 30 text
Google Spreadsheet as a dashboard
for real-time and big data views
Slide 31
Slide 31 text
Everything Dockerized
Slide 33
Slide 33 text
Demo
Slide 34
Slide 34 text
Applications
Real-time KPI Dashboard
● Gaming: How many new users have purchased their first item in the last 10 minutes?
● Media: How many people hit the vote button during the live TV program?
● Retail: What is the current total revenue of all stores nationwide?
● Ads: What is the conversion rate of impressions/clicks to purchases?
Real-time Monitoring and Alerting
● Correlate system resource usage with access/application logs
● Real-time DoS or cheating detection
● Send e-mail notifications from Apps Script triggered by a CEP query
Slide 35
Slide 35 text
Summary
Slide 36
Slide 36 text
Solution Benefits
- Real-time analytics by Norikra CEP with 10 sec latency
- Big data collection and analytics by BigQuery + Fluentd at ~1M rows/s
- Real-time dashboard with Google Spreadsheet
- Deployable within 10 min with Docker
Available on GitHub:
GoogleCloudPlatform/lambda-dashboard