Upgrade to Pro — share decks privately, control downloads, hide ads and more …

BigQuery and Event Ticketing

BigQuery and Event Ticketing

Our presentation at BITKOM BigData Summit in Frankfurt in 2014, featuring some BigQuery trivia about how we use it at Koliseo

Nacho Coloma

March 26, 2014
Tweet

More Decks by Nacho Coloma

Other Decks in Technology

Transcript

  1. Google Cloud Platform Storage Services Compute Cloud Storage Cloud SQL

    Cloud Datastore Compute Engine App Engine BigQuery Cloud Endpoints
  2. Reasons to build Koliseo on App Engine Infinite-scaling NoSQL storage.

    Flexible, automated server scaling under user load. No DevOps team: we don’t wear pagers. Low cost.
  3. Photo: 0Four MapReduce and NoSQL when all you have is

    a hammer, everything looks like a nail
  4. The HTTP Archive Introduced in 1996 Registers the Alexa Top

    1,000,000 Sites About 400GB of raw CSV data That’s answers to a lot of questions
  5. Which sites are using Prototype and jQuery today? * as

    of June 1 2013 (not really today)
  6. How do we know that? SELECT pages.pageid, url, cnt, libs,

    pages.rank rank FROM [httparchive:runs.2013_06_01_pages] as pages JOIN ( SELECT pageid, count(distinct(type)) cnt, GROUP_CONCAT(type) libs FROM ( SELECT REGEXP_EXTRACT(url, r'(jquery|prototype).*\.js') type, pageid FROM [httparchive:runs.2013_06_01_requests] WHERE REGEXP_MATCH(url, r'jquery|prototype.*\.js') GROUP BY pageid, type ) GROUP BY pageid HAVING cnt >= 2 ) as lib ON lib.pageid = pages.pageid WHERE rank IS NOT NULL ORDER BY rank asc We have a query to prove it Source: http://www.igvita.com/2013/06/20/http-archive-bigquery-web-performance-answers/
  7. Storage Cloud Storage Cloud SQL Cloud Datastore Compute Compute Engine

    (IaaS) App Engine (PaaS) Services BigQuery Cloud Endpoints Google Cloud Platform Big Data analysis tool
  8. Google innovations in the last twelve years Spanner Dremel MapReduce

    Big Table Colossus 2012 2013 2002 2004 2006 2008 2010 GFS Compute Engine Awesomeness starts here
  9. Google BigQuery Analyze terabytes of data in seconds Data imported

    in bulk as CSV or JSON Supports streaming up to 100K updates/sec per table Use the browser tool, the command-line tool or REST API
  10. BigQuery is a prototyping tool Answers questions that you need

    to ask once in your life. Has a flexible interface to launch queries interactively, thinking on your feet. Processes terabytes of data in seconds. It’s much cheaper than the alternative.
  11. What are the top 100 most active Ruby repositories on

    GitHub? SELECT repository_name, count(repository_name) as pushes, repository_description, repository_url FROM [githubarchive:github.timeline] WHERE type="PushEvent" AND repository_language="Ruby" AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00') GROUP BY repository_name, repository_description, repository_url ORDER BY pushes DESC LIMIT 100 Source: http://bigqueri.es/t/what-are-the-top-100-most-active-ruby-repositories-on-github/9
  12. Much more flexible than SQL Multi-valued attributes lived_in: [ {

    city: ‘La Laguna’, since: ‘19752903’ }, { city: ‘Madrid’, since: ‘20010101’ }, { city: ‘Cologne’, since: ‘20130401’ } ] Correlation and nth percentile SELECT CORR(temperature, number_of_people) Data manipulation: dates, urls, regex, IP...
  13. What are the top cities contributing modifications to Wikipedia? SELECT

    COUNT(*) c, city, countryLabel, NTH(1, latitude) lat, NTH(1, longitude) lng FROM ( SELECT INTEGER(PARSE_IP(contributor_ip)) AS clientIpNum, INTEGER(PARSE_IP(contributor_ip)/(256*256)) AS classB FROM [publicdata:samples.wikipedia] WHERE contributor_ip IS NOT NULL ) AS a JOIN EACH [fh-bigquery:geocode.geolite_city_bq_b2b] AS b ON a.classB = b.classB WHERE a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum AND city != '' GROUP BY city, countryLabel ORDER BY 1 DESC Source: Geoip geolocation with Google BigQuery
  14. Cost of BigQuery Not for dashboards: If you need to

    launch your query frequently, it’s more cost effective to use MapReduce or SQL Loading data Free Exporting data Free Storage $80 per TB/month Interactive queries $35 per TB processed Batch queries $20 per TB processed 85% price drop
  15. Cost of BigQuery (updated) Not for dashboards: If you need

    to launch your query frequently, it’s more cost effective to use MapReduce or SQL Loading data Free Exporting data Free Storage $0.026 per GB/month Interactive queries $0.005 per GB processed Batch queries $0.005 per GB processed