Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Building a Lambda Architecture in 10 minutes with BigQuery, CEP and Docker

Slide 3

Slide 3 text

+Kazunori Sato, @kazunori_279. Solutions Architect, Cloud Platform GBU, Google Inc. - GCP solutions design - Professional services for GCP - Docker/GCP meetups support

Slide 4

Slide 4 text

The Problem: Analyze big data in real-time

Slide 5

Slide 5 text

“I want a real-time dashboard for my 200 web servers.” - a customer with 200 Google Compute Engine instances

Slide 6

Slide 6 text

The Solution

Slide 7

Slide 7 text

The Solution: Lambda Architecture (diagram labels: BigQuery, BQ Stream + Fluentd, Norikra CEP, Lambda Arch)

Slide 8

Slide 8 text

How to analyze big data?

Slide 9

Slide 9 text

How to analyze big data? (diagram labels: BigQuery, BQ Stream + Fluentd, Norikra CEP, Lambda Arch)

Slide 10

Slide 10 text

At Google, we have “big” big data everywhere. What if a Googler is asked: “Can you give me the list of top 20 Android apps installed in 2012?”

Slide 11

Slide 11 text

At Google, we run SQL on Dremel (= Google BigQuery):

SELECT top(appId, 20) AS app, count(*) AS count FROM installlog.2012 ORDER BY count DESC;

It scans 68B rows in ~30 sec, with no index used.
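As a rough illustration of what the query asks for, here is a minimal Python sketch that counts installs per app and keeps the most frequent ones, scanning every row without an index. The function name `top_apps` and the sample rows are hypothetical; the real install log is not shown in the slides, and this is not how Dremel executes the query.

```python
from collections import Counter

def top_apps(install_log, k=20):
    """Scan every row, count installs per appId, and return the k largest."""
    counts = Counter(row["appId"] for row in install_log)
    return counts.most_common(k)

# Hypothetical install log rows (illustration only).
sample_log = [
    {"appId": "maps"}, {"appId": "mail"}, {"appId": "maps"},
    {"appId": "chat"}, {"appId": "maps"}, {"appId": "mail"},
]

top2 = top_apps(sample_log, k=2)
print(top2)
```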

Slide 12

Slide 12 text

Column Oriented Storage. Record Oriented Storage vs. Column Oriented Storage: less bandwidth, more compression.
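The "less bandwidth, more compression" point can be seen in a small Python sketch. It lays out the same records both ways, then compares how many bytes a query on a single column has to read. The records and field names are made up for the illustration, and `zlib` stands in for whatever compression the real storage uses.

```python
import zlib

# Hypothetical log records (not from the slides).
records = [
    {"ts": 1000 + i, "status": 200, "path": "/index.html"} for i in range(1000)
]

# Record-oriented layout: the fields of each record are stored together.
row_store = "\n".join(
    f'{r["ts"]},{r["status"]},{r["path"]}' for r in records
).encode()

# Column-oriented layout: all values of one column are stored together.
column_store = {
    name: "\n".join(str(r[name]) for r in records).encode()
    for name in ("ts", "status", "path")
}

# A query that only needs "status" reads one column, not whole records.
bytes_read_rows = len(row_store)
bytes_read_column = len(column_store["status"])

# Similar values sit next to each other in a column, so it compresses well.
compressed_column = len(zlib.compress(column_store["status"]))

print(bytes_read_rows, bytes_read_column, compressed_column)
```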

Slide 13

Slide 13 text

Massively Parallel Processing

select top(title), count(*) from publicdata:samples.wikipedia

Scanning 1 TB in 1 sec takes 5,000 disks. Each query runs on thousands of servers.
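The disk count follows from simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a scan spread evenly over all disks; both are assumptions for the illustration, not figures from the slides.

```python
# Assumption: decimal units, 1 TB = 10**12 bytes, 1 MB = 10**6 bytes.
TB = 10**12
MB = 10**6

scan_bytes = 1 * TB      # data to scan
scan_seconds = 1         # target scan time
disks = 5000             # disk count quoted on the slide

# Throughput each disk must sustain if the scan is spread evenly.
per_disk_mb_per_sec = scan_bytes / scan_seconds / disks / MB
print(per_disk_mb_per_sec)
```

Under these assumptions each disk has to sustain about 200 MB/s, which is why a single query has to fan out over thousands of machines.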

Slide 14

Slide 14 text

Fast aggregation by tree structure

Diagram: one Mixer 0, two Mixer 1 nodes, four Leaf nodes, Distributed Storage.

Query fragments shown alongside the tree: SELECT state, year; COUNT(*) GROUP BY state (shown twice); WHERE year >= 1980 and year < 1990; ORDER BY count_babies DESC LIMIT 10
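A minimal Python sketch of the tree idea: leaves filter and pre-aggregate their own shard, mixers merge partial counts, and the root sorts and applies the limit. Which level runs which part of the query is an assumption made for this illustration, and the shard data is hypothetical.

```python
from collections import Counter

def leaf_scan(rows):
    """Leaf: apply the WHERE filter and compute a partial count per state."""
    return Counter(r["state"] for r in rows if 1980 <= r["year"] < 1990)

def mixer_merge(partials):
    """Mixer: merge partial counts coming from child nodes."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def root_result(partials, limit=10):
    """Root: merge once more, order by count descending, apply the limit."""
    merged = mixer_merge(partials)
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]

# Hypothetical shards, one per leaf.
shards = [
    [{"state": "CA", "year": 1985}, {"state": "TX", "year": 1979}],
    [{"state": "CA", "year": 1981}, {"state": "NY", "year": 1989}],
    [{"state": "TX", "year": 1984}, {"state": "CA", "year": 1990}],
    [{"state": "NY", "year": 1980}, {"state": "CA", "year": 1988}],
]

leaves = [leaf_scan(s) for s in shards]
mixer_a = mixer_merge(leaves[:2])   # first intermediate mixer
mixer_b = mixer_merge(leaves[2:])   # second intermediate mixer
result = root_result([mixer_a, mixer_b])
print(result)
```

Because counts are additive, each level only passes small partial results upward instead of raw rows.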

Slide 15

Slide 15 text

How to collect big data?

Slide 16

Slide 16 text

How to collect big data? (diagram labels: BigQuery, BQ Stream + Fluentd, Norikra CEP, Lambda Arch)

Slide 17

Slide 17 text

BigQuery Streaming: - Low cost: $0.01 per 100,000 rows - Real-time availability of data - 100,000 rows per second x tables
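To get a feel for the quoted price, here is a small Python sketch that estimates a daily streaming bill. The price per 100,000 rows comes from the slide; the per-server log rate of 100 rows per second is a made-up assumption, and the 200 servers echo the customer example earlier in the deck.

```python
def streaming_cost_usd(rows, usd_per_100k_rows=0.01):
    """Cost of streaming `rows` rows at the price quoted on the slide."""
    return rows / 100_000 * usd_per_100k_rows

servers = 200                 # from the customer example
rows_per_server_per_sec = 100 # hypothetical log rate
seconds_per_day = 86_400

daily_rows = servers * rows_per_server_per_sec * seconds_per_day
daily_cost = streaming_cost_usd(daily_rows)
print(daily_rows, round(daily_cost, 2))
```

With these assumed rates the fleet streams about 1.7 billion rows a day for roughly $173.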

Slide 18

Slide 18 text

SlideShare uses Fluentd for collecting logs from >500 servers. "We take full advantage of its extendable plugin architecture and use it as a message bus that collects data from hundreds of servers into multiple backend systems." - Sylvain Kalache, Operations Engineer

Slide 19

Slide 19 text

Why Fluentd? Because it’s super easy to use, and has an extensive set of plugins written by an active community.

Slide 20

Slide 20 text

Now Fluentd logs can be imported to BigQuery really easily

Slide 21

Slide 21 text

How to analyze in real-time?

Slide 22

Slide 22 text

How to analyze in real-time? (diagram labels: BigQuery, BQ Stream + Fluentd, Norikra CEP, Lambda Arch)

Slide 23

Slide 23 text

Norikra: an open source Complex Event Processing (CEP) engine. In production use at LINE, the largest Asian SNS with 400M users, for massive log analysis.

Slide 24

Slide 24 text

Real-time analysis on streaming data with in-memory continuous query
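A continuous query keeps only a recent window of events in memory and re-evaluates as events arrive. The Python sketch below shows that idea with a time-based sliding window and a count; the class name `SlidingWindowCount` and the events are hypothetical, and this is not Norikra's API or query language.

```python
from collections import deque

class SlidingWindowCount:
    """Keep events from the last `window_sec` seconds in memory and count them."""

    def __init__(self, window_sec):
        self.window_sec = window_sec
        self.events = deque()  # (timestamp, event) pairs, oldest first

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_sec:
            self.events.popleft()

    def add(self, timestamp, event):
        self.events.append((timestamp, event))
        self._expire(timestamp)

    def count(self, now, predicate=lambda e: True):
        self._expire(now)
        return sum(1 for _, e in self.events if predicate(e))

def is_error(event):
    return event["status"] >= 500

window = SlidingWindowCount(window_sec=10)
window.add(0, {"status": 200})
window.add(4, {"status": 500})
window.add(9, {"status": 500})
errors_at_9 = window.count(9, is_error)

window.add(12, {"status": 200})      # the event at t=0 expires here
total_at_12 = window.count(12)
errors_at_15 = window.count(15, is_error)  # the event at t=4 has expired
print(errors_at_9, total_at_12, errors_at_15)
```

Memory use is bounded by the window size, which is what makes this fast but volatile: once an event expires, it is gone.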

Slide 25

Slide 25 text

How to analyze big data in real-time?

Slide 26

Slide 26 text

How to analyze big data in real-time? (diagram labels: BigQuery, BQ Stream + Fluentd, Norikra CEP, Lambda Arch)

Slide 27

Slide 27 text

Lambda Architecture is a complementary pair of: - in-memory real-time processing (fast, but small and volatile) - large HDD/SSD batch processing (slow, but large and persistent). Proposed by Nathan Marz. Example: Twitter Summingbird.
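The pairing can be sketched in a few lines of Python: a batch view computed over the persistent store up to a cutoff, a real-time view over events after the cutoff, and a serving step that merges the two. All function names, the cutoff, and the events are hypothetical; in this deck BigQuery plays the batch role and Norikra the real-time role.

```python
def count_by_page(events):
    """Count events per page."""
    counts = {}
    for e in events:
        counts[e["page"]] = counts.get(e["page"], 0) + 1
    return counts

def batch_view(stored_events, cutoff):
    """Batch layer: slow, but covers everything persisted before the cutoff."""
    return count_by_page(e for e in stored_events if e["ts"] < cutoff)

def realtime_view(recent_events, cutoff):
    """Speed layer: fast, but only holds events since the cutoff in memory."""
    return count_by_page(e for e in recent_events if e["ts"] >= cutoff)

def serve(batch, realtime):
    """Serving step: merge both views into one answer."""
    merged = dict(batch)
    for page, n in realtime.items():
        merged[page] = merged.get(page, 0) + n
    return merged

# Hypothetical access log events.
events = [
    {"ts": 1, "page": "/a"}, {"ts": 2, "page": "/b"}, {"ts": 3, "page": "/a"},
    {"ts": 11, "page": "/a"}, {"ts": 12, "page": "/c"},
]
cutoff = 10  # time of the last completed batch run (assumed)

batch = batch_view(events, cutoff)
realtime = realtime_view(events, cutoff)
dashboard = serve(batch, realtime)
print(dashboard)
```

Splitting on a single cutoff keeps the two views from double counting any event.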

Slide 28

Slide 28 text

A Recipe for a Lambda Architecture in 10 minutes
1. Fluentd: event log collection from various event sources
2. Norikra: scalable real time Complex Event Processing (CEP)
3. BigQuery: scalable query engine for large datasets
4. Google Spreadsheet: flexible dashboard with a variety of charts
5. Docker: repeatable deployment in 10 minutes

Slide 29

Slide 29 text

Lambda Arch by BQ+Norikra

Slide 30

Slide 30 text

Google Spreadsheet as a dashboard for real-time and big data views

Slide 31

Slide 31 text

Everything Dockerized

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

Demo

Slide 34

Slide 34 text

Applications

Real-time KPI Dashboard
● Gaming: How many new users have purchased their first item in the last 10 minutes?
● Media: How many people hit the vote button during the live TV program?
● Retail: What is the current total revenue of all stores nationwide?
● Ads: What is the conversion rate of impressions/clicks to purchase?

Real-time Monitoring and Alerting
● Correlate system resource usage with access/application logs
● Real-time DoS or cheating detection
● Send e-mail notification from Apps Script triggered by CEP query

Slide 35

Slide 35 text

Summary

Slide 36

Slide 36 text

Solution Benefits
- Real-time analytics by Norikra CEP with 10 sec latency
- Big data collection and analytics by BigQuery + Fluentd at ~1M rows/s
- Real-time dashboard with Google Spreadsheet
- Deployable within 10 min with Docker
- Available on GitHub: GoogleCloudPlatform/lambda-dashboard

Slide 37

Slide 37 text

Questions?

Slide 38

Slide 38 text

No content