Dremel = Google BigQuery
SELECT top(appId, 20) AS app, count(*) AS count
FROM installlog.2012
ORDER BY count DESC
It scans 68B rows in ~20 sec. No index used.
next generation Google File System
Chunk Servers
- ColumnIO on CFS
- Replicated to multiple DCs asynchronously
- Mitigates tail latency, provides 99.9% availability
[Diagram: query execution tree: 1, four Shards, ColumnIO on Colossus]
Query fragments shown in the diagram:
SELECT state, year
COUNT(*) GROUP BY state
WHERE year >= 1980 and year < 1990
ORDER BY count_babies DESC LIMIT 10
COUNT(*) GROUP BY state
- Interprets the query
- Maps tables to CFS files, sends requests to Shards
2. Shards process the partial queries
- Open the ColumnIO files
- Read the table, apply WHERE, GROUP BY
3. Mixer aggregates the results
- Applies ORDER BY, LIMIT
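The steps above can be sketched in plain Python. This is only an illustration of the scatter-gather idea, not Dremel's implementation: the function names (`shard_partial`, `mixer_merge`, `run_query`) and the in-memory row format are assumptions made for the example, and the tie-break on state name is added only to make the output deterministic.

```python
from collections import Counter

def shard_partial(rows, year_from, year_to):
    """Shard step: scan local rows, apply WHERE, then GROUP BY state with COUNT(*)."""
    partial = Counter()
    for row in rows:
        if year_from <= row["year"] < year_to:  # WHERE year >= ... AND year < ...
            partial[row["state"]] += 1          # COUNT(*) GROUP BY state
    return partial

def mixer_merge(partials, limit):
    """Mixer step: sum the partial counts, then apply ORDER BY ... DESC and LIMIT."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    # Sort by count descending; ties are broken by state name (an assumption).
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]

def run_query(shards, year_from=1980, year_to=1990, limit=10):
    """Fan the query out to every shard, then merge the partial results."""
    partials = [shard_partial(rows, year_from, year_to) for rows in shards]
    return mixer_merge(partials, limit)
```

Each shard only returns one small counter per group, so the mixer merges a few rows per shard instead of the scanned table. In a real system the shard calls would run in parallel on separate machines.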
records x 38M records → 60 - 80 sec
SELECT COUNT(*)
FROM [bigquery-samples:wikimedia_pageviews.200801] as a
JOIN EACH [bigquery-samples:wikimedia_pageviews.200802] as b
ON a.title = b.title
WHERE b.language = "ja"
From: Google BigQuery Analytics
take full advantage of its extendable plugin architecture and use it as a message bus that collects data from hundreds of servers into multiple backend systems."
Sylvain Kalache, Operations Engineer
actor, payload.pages.page_name, repository.name, type
FROM [publicdata:samples.github_nested]
WHERE payload.pages.page_name IS NOT NULL),
-- Input fields from base query.
actor, payload.pages.page_name, repository.name, type,
-- Output schema of JS function.
"[{name: 'actor', type: 'string'},
  {name: 'repository_name', type: 'string'},
  {name: 'page_name', type: 'string'}]",
-- JS function.
"function(r, emit) {
  var foundHome = false;
  var foundNotHome = null;
  for (var i = 0; i < r.payload.pages.length; i++) {
    if (r.payload.pages[i].page_name == 'Home') {
      foundHome = true;
    } else if (r.payload.pages[i].page_name) {
      foundNotHome = r.payload.pages[i].page_name;
    }
  }
  if (foundHome && foundNotHome) {
    emit({actor: r.actor,
          repository_name: r.repository.name,
          page_name: foundNotHome});
  }
}")
ORDER BY actor, repository_name, page_name

User Defined Function with JavaScript
Search GitHub repository for updates that change a 'Home' page
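To trace what the JavaScript function does to one nested record, the same logic can be written as a small Python function. This is a sketch for reading purposes only: `find_home_edits` is a hypothetical name, the record is assumed to be a plain nested dict, and returning a list stands in for calling `emit`.

```python
def find_home_edits(record):
    """Return one output row if the record touches 'Home' and at least one other page."""
    pages = record.get("payload", {}).get("pages", [])
    found_home = False
    found_not_home = None
    for page in pages:
        name = page.get("page_name")
        if name == "Home":
            found_home = True
        elif name:
            # As in the JS version, a later non-Home page overwrites an earlier one.
            found_not_home = name
    if found_home and found_not_home:
        return [{
            "actor": record["actor"],
            "repository_name": record["repository"]["name"],
            "page_name": found_not_home,
        }]
    return []  # nothing emitted for this record
```

Note that only the last non-Home page name survives, so a record with several other pages still yields a single row.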
.apply(Count.create())
.apply(ParDo.of(new ExpandPrefixes()))
.apply(Top.largestPerKey(3))
.apply(PubsubIO.Write.to("output_topic"));
p.run();
The same code can be used for stream processing
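The three visible transforms (count, expand prefixes, top 3 per key) can be imitated on an in-memory list in plain Python. This does not use the Dataflow SDK. The start of the pipeline is cut off in the slide, so the input is assumed here to be a list of strings, and the behaviour of `ExpandPrefixes` (emitting every prefix of each counted string) is guessed from its name. All function names below are hypothetical.

```python
import heapq
from collections import Counter, defaultdict

def count_items(items):
    """Stand-in for Count: number of occurrences of each distinct item."""
    return Counter(items)

def expand_prefixes(counts):
    """Assumed ExpandPrefixes behaviour: emit (prefix, (count, item)) for every prefix."""
    for item, n in counts.items():
        for end in range(1, len(item) + 1):
            yield item[:end], (n, item)

def top_per_key(pairs, k=3):
    """Stand-in for Top.largestPerKey(k): keep the k largest values under each key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: heapq.nlargest(k, values) for key, values in grouped.items()}

def autocomplete_index(items, k=3):
    """Chain the three steps in the same order as the pipeline."""
    return top_per_key(expand_prefixes(count_items(items)), k)
```

For example, `autocomplete_index(["go", "go", "gcp"])["g"]` lists `gcp` and `go` with their counts, most frequent first. A streaming version would have to apply the same steps per window rather than to one finite list.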
processing
- large HDD/SSD batch processing
Proposed by Nathan Marz
e.g. Twitter Summingbird
Slow, but large and persistent.
Fast, but small and volatile.
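The split between a slow, persistent store and a fast, volatile one can be sketched as a tiny counter in Python. This is a single-process toy under stated assumptions, not Summingbird or any production design: the class name `LambdaCounter`, the list used as the persistent log, and the rule of clearing the in-memory view after each batch run are all choices made for the example.

```python
from collections import Counter

class LambdaCounter:
    """Toy event counter with a batch view and a real-time view."""

    def __init__(self):
        self.master_log = []            # append-only log (stand-in for HDD/SSD storage)
        self.batch_view = Counter()     # recomputed from the full log: slow, persistent
        self.realtime_view = Counter()  # events since the last batch run: fast, volatile

    def ingest(self, key):
        """Record an event in the log and update the in-memory view immediately."""
        self.master_log.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        """Recompute the batch view from the whole log, then drop the covered deltas."""
        self.batch_view = Counter(self.master_log)
        self.realtime_view.clear()

    def query(self, key):
        """Answer from both views: batch result plus recent real-time increments."""
        return self.batch_view[key] + self.realtime_view[key]
```

A query always merges the two views, so results stay current between batch runs, and losing the in-memory view costs only the events since the last batch, which can be replayed from the log.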