BigQuery and the world after MapReduce

BigQuery and the world after MapReduce Handling Big Data in
the Googley way

+Kazunori Sato @kazunori_279 Solutions Architect, Cloud Platform GBU, Google Inc
- GCP solutions design - Professional services for GCP - Docker/GCP meetups support

“Big Data” at Google: How do we handle it

At Google, we have “big” big data everywhere What if
a Googler is asked: “Can you give me the list of top 20 Android apps installed in 2012?”

At Google, we don’t use MapReduce for this We use
Dremel = Google BigQuery SELECT top(appId, 20) AS app, count(*) AS count FROM installlog.2012 ORDER BY count DESC It scans 68B rows in ~20 sec, No index used.

BigQuery is low cost Storage: $0.026 per GB per month
Queries: $5 per TB

Why is BigQuery so fast?

Column Oriented Storage Record Oriented Storage Column Oriented Storage Less
bandwidth, More compression

Inside BQ: ColumnIO on Colossus Colossus File System - The
next generation Google File System Chunk Servers - ColumnIO on CFS - Replicated to multiple DCs asynchronously - Mitigates Tail Latency, provides 99.9% availability

select top(title), count(*) from publicdata:samples.wikipedia Massively Parallel Processing Scanning 1
TB in 1 sec takes 5,000 disks Each query runs on thousands of servers

Fast aggregation by tree structure Mixer 0 Mixer 1 Mixer
1 Shard Shard Shard Shard ColumnIO on Colossus SELECT state, year COUNT(*) GROUP BY state WHERE year >= 1980 and year < 1990 ORDER BY count_babies DESC LIMIT 10 COUNT(*) GROUP BY state

Inside BQ: Query Processing 1. Mixer receives a query -
Interprets the query - Maps tables to CFS files, requests to Shards 2. Shards process the partial queries - Open the ColumnIO files - Reads the table, applies WHERE, GROUP BY 3. Mixer aggregates the results - Applies ORDER BY, LIMIT

Inside BQ: Small JOIN SELECT wiki.title FROM [publicdata:samples.wikipedia] AS wiki
JOIN ( SELECT word FROM [publicdata:samples.shakespeare] GROUP BY word ) AS shakes ON wiki.title = shakes.word;

Small JOIN: executed with Broadcast JOIN - One table should
be < 8MB, sent to every shard From: Google BigQuery Analytics Inside BQ: Small JOIN

Inside BQ: Big JOIN SELECT wiki.title FROM [publicdata:samples.wikipedia] AS wiki
JOIN EACH ( SELECT word FROM [publicdata:samples.shakespeare] GROUP EACH BY word ) AS shakes ON wiki.title = shakes.word;

Inside BQ: Big JOIN Big JOIN: executed with shuffling -
Both tables can be > 8MB - BQ shuffler doesn’t sort; just hash partitioning From: Google BigQuery Analytics

Inside BQ: Big JOIN Big JOIN performance - JOIN 608M
records x 38M records → 60 - 80 sec SELECT COUNT(*) FROM [bigquery-samples:wikimedia_pageviews. 200801] as a JOIN EACH [bigquery-samples:wikimedia_pageviews.200802] as b ON a.title = b.title WHERE b.language = "ja" From: Google BigQuery Analytics

BigQuery + Hadoop

HDFS (optional) Work Nodes Work Nodes Master Node Work Nodes
HDFS (optional) Name Node (optional) Local SSD PD SSD PD standard GCS Connector BigQuery Connector Datastore Connector Connectors bdutil orchestration Connectors for Hadoop

BigQuery Connector for Hadoop // Set the job-level projectId. conf.set(BigQueryConfiguration.PROJECT_ID_KEY,
projectId); // Configure input parameters. BigQueryConfiguration.configureBigQueryInput(conf, ”publicdata:samples.shakespeare”); // Set InputFormat. job.setInputFormatClass(BigQueryInputFormat.class);

BigQuery Connector for Hadoop // Configure output parameters String outSchema
= "[{'name': 'Word','type': 'STRING'},{'name': 'Number','type': 'INTEGER'}]"; BigQueryConfiguration.configureBigQueryOutput( conf, ”gcp_samples.output”, outSchema); // Set OutputFormat. job.setOutputFormatClass(BigQueryOutputFormat.class);

BigQuery + Fluentd

BigQuery Streaming Low cost: $0.01 per 100,000 rows Real time
availability of data 100,000 rows per second x tables

Slideshare uses Fluentd for collecting logs from >500 servers. "We
take full advantage of its extendable plugin architecture and use it as a message bus that collects data from hundreds of servers into multiple backend systems." Sylvain Kalache, Operations Engineer

Why Fluentd? Because it’s super easy to use, and has
extensive plugins written by active community.

Now Fluentd logs can be imported to BigQuery really easy,
~1M rows/s

BigQuery + User Defined Function

User Defined Function with JavaScript BigQuery User Defined Function by
JavaScript: - takes input JSON as arguments - emits output JSON as results JS in SQL - can be written in place of tables/subqueries Coming Soon!

for (var i = 0; i < row.hits.length - 1;
i++) { if (row.hits[i].page.pagepath.indexOf('github_nested') != -1 && row.hits[i + 1].page.pagepath.indexOf('github_timeline') != -1) { result.push({ visitid: row.visitid, duration: row.hits[i + 1].time - row.hits[i].time }); } } User Defined Function with JavaScript

SELECT actor, repository_name, page_name FROM js( -- Base query. (SELECT
actor, payload.pages.page_name, repository.name, type FROM [publicdata:samples.github_nested] WHERE payload.pages.page_name IS NOT NULL), -- Input fields from base query. actor, payload.pages.page_name, repository.name, type, -- Output schema of JS function. "[{name: 'actor', type: 'string'}, {name: 'repository_name', type: 'string'}, {name: 'page_name', type: 'string'}]", -- JS function. "function(r, emit) { var foundHome = false; var foundNotHome = null; for (var i = 0; i < r.payload.pages.length; i++) { if (r.payload.pages[i].page_name == 'Home') { foundHome = true; } else if (r.payload.pages[i].page_name) { foundNotHome = r.payload.pages[i].page_name; } } if (foundHome && foundNotHome) { emit({actor: r.actor, repository_name: r.repository.name, page_name: foundNotHome}); } }") ORDER BY actor, repository_name, page_name User Defined Function with JavaScript Search GitHub repository for updates that change a 'Home' page

The World after MapReduce

From: Wired

BigQuery engine BigQuery Abuse detection User interactions Streaming Batch User
engagement analytics Cloud Pub/Sub ACL ACL Topic 2 Business dashboard Data science tools Users Devs Data scientists Business App events ACL ACL Topic 1 Storage Services Cloud Storage Cloud Datastore Cloud SQL Open Source bdutil orchestration Connectors Cloud Dataflow

Cloud Pub/Sub ACL ACL Topic 2 Consumer Consumer Consumer Publisher
Publisher Publisher Publisher ACL ACL Topic 1 Cloud Pub/Sub Coming Soon!

Abuse detection User interactions Cloud Dataflow Streaming Batch User engagement
analytics Google Cloud Dataflow Optimize Schedule Cloud Dataflow Coming Soon!

Cloud Dataflow = Managed Flume + MillWheel on GCE

Prefix Suggestions #ar #argentina, #arugularocks, #argylesocks #arg #argentina, #argylesocks, #argonauts
#arge #argentina, #argentum, #argentine Example: Auto completing hashtags

{a->[apple, art, argentina], ar->[art, argentina, armenia],...} Count ExpandPrefixes Top(3) Write
Read ExtractTags {a->(argentina, 5M), a->(armenia, 2M), …, ar-> (argentina, 5M), ar->(armenia, 2M), ...} {#argentina scores!, watching #armenia vs #argentina, my #art project, …} {argentina, armenia, argentina, art, ...} {argentina->5M, armenia->2M, art->90M, ...} Tweets Predictions

Count ExpandPrefixes Top(3) Write Read ExtractTags Tweets Predictions Pipeline p
= Pipeline.create(); p.begin() p.run(); .apply(ParDo.of(new ExtractTags())) .apply(Top.largestPerKey(3)) .apply(Count.create()) .apply(ParDo.of(new ExpandPrefixes()) .apply(TextIO.Write.to(“gs://…”)); .apply(TextIO.Read.from(“gs://…”))

Automatically optimizes the execution plan DAG

Let’s stream it! Google Cloud Pub/Sub for reads and writes.
time #ar* rank game begins armenia wins! #argyle #armeniarocks Age out old data #argentinagoal

Pipeline p = Pipeline.create(); p.begin() .apply(PubsubIO.Read.from(“input_topic”)) .apply(Bucket.by(SlidingWindows.of(60, MINUTES)) .apply(ParDo.of(new ExtractTags()))
.apply(Count.create()) .apply(ParDo.of(new ExpandPrefixes()) .apply(Top.largestPerKey(3)) .apply(PubsubIO.Write.to(“output_topic”)); p.run(); The same code can be used for stream processing

Demo at YouTube

Questions?

Backup

How to analyze big data in real-time?

Lambda Architecture is: A complementary pair of: - in-memory real-time
processing - large HDD/SSD batch processing Proposed by Nathan Marz ex. Twitter Summingbird Slow, but large and persistent. Fast, but small and volatile.

Norikra: an open source Complex Event Processing (CEP) Production use
at LINE, the largest asian SNS with 400M users, for massive log analysis

Real-time analysis on streaming data with in-memory continuous query

Lambda Arch by BQ+Norikra

BigQuery and the world after MapReduce

BigQuery and the world after MapReduce

More Decks by Kazunori Sato

Other Decks in Technology

Featured

Transcript