
[Márton Kodok] Powering Interactive Data Analysis with Google BigQuery

Presentation from GDG DevFest Ukraine 2017 - the biggest community-driven Google tech conference in the CEE region.

Learn more at: https://devfest.gdg.org.ua

Google Developers Group Lviv

October 14, 2017

Transcript

  1. Every company, no matter how far from tech it is, is evolving into a software company, and by extension a data company. Turning everything into “data” drives innovation.
  2. For a small company it’s important to have access to modern BigData tools without running a dedicated team for it. Small companies should do BigData - but how?
  3. Making analytics accessible to more companies
     ❏ Need a backend/database to STORE, QUERY, EXTRACT data
     ❏ Deep analytics - large, multi-source, complex, unstructured
     ❏ Be real time
     ❏ Terabyte scale - cost effective
     ❏ Run ad-hoc reports - without a developer - interactive
     ❏ Minimal engineering effort - no dedicated BigData team
     ❏ Simple query language (preferred: SQL / JavaScript)
  4. Legacy Business Reporting System (architecture diagram): Web and Mobile clients, Web Server (CMS/Framework, Platform Services, Cached), SQL Database, Scheduled Tasks and Batch Processing on Compute Engine (multiple instances), Report & Share / Business Analysis.
  5. Behind the Scenes: Days To Insights (same architecture diagram as the previous slide: Web and Mobile clients, Web Server, SQL Database, Scheduled Tasks and Batch Processing on Compute Engine, Report & Share / Business Analysis).
  6. Legacy Business Reporting System, annotated (diagram): minutes for scheduled tasks to kick in, hours to run batch processing, hours to clean and aggregate - DAYS TO INSIGHTS.
  7. What is BigQuery?
     • Analytics-as-a-Service - Data Warehouse in the Cloud
     • Fully managed by Google (US or EU zone)
     • Scales into petabytes
     • Ridiculously fast
     • SQL 2011 Standard + JavaScript UDFs (User Defined Functions) - a query sketch follows below
     • Familiar DB structure (tables, views, records, nested fields, JSON)
     • Open interfaces (REST, ODBC, Web UI, bq command-line tool)
     • Integrates with Google Sheets + Google Cloud Storage + Pub/Sub connectors
     • Decent pricing (queries $5/TB; storage $20/TB, cold storage $10/TB) *Oct 2017
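     To make the SQL and open-interfaces bullets concrete, here is a minimal sketch (not from the talk) of running a Standard SQL query through the Python client library; the project, dataset, and table names are placeholders:

        # Minimal sketch: run a Standard SQL query from Python.
        # Assumes `pip install google-cloud-bigquery` and default application credentials;
        # the project/dataset/table names are hypothetical placeholders.
        from google.cloud import bigquery

        client = bigquery.Client(project="my-project")

        sql = """
            SELECT user_id, COUNT(*) AS events
            FROM `my-project.analytics.frontend_user`
            WHERE created >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
            GROUP BY user_id
            ORDER BY events DESC
            LIMIT 10
        """

        for row in client.query(sql).result():  # submits the query job and waits for it
            print(row.user_id, row.events)

     The same query could equally be issued through the Web UI, the bq command-line tool, or the REST API listed above.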
  8. Architecting for the Cloud (architecture diagram): Frontend Platform Services, On-Premises Servers, Metrics / Logs / Streaming, Pipelines with Event Sourcing and an ETL Engine, feeding BigQuery.
  9. Data Pipeline Integration at REEA.net - Analytics Backend (architecture diagram): standard devices connect over HTTPS to Frontend Platform Services on Application Servers backed by an SQL Database; On-Premises Servers emit Metrics / Logs / Streaming into FluentD pipelines (Event Sourcing) feeding BigQuery, with Cloud Storage as archive (Load / Export / Replay); the Development Team and Data Analysts Report & Share Business Analysis through tools such as Tableau, QlikView, Data Studio and an internal dashboard.
  10. The following slides present a sample Fluentd configuration to:
      1. Transform a record
      2. Copy the event to multiple outputs
      3. Store event data in a file (for backup/log purposes)
      4. Stream to BigQuery (for immediate analysis)
  11. <filter frontend.user.*>
        @type record_transformer
      </filter>
      <match frontend.user.*>
        @type copy
        <store>
          @type forest
          subtype file
        </store>
        <store>
          @type bigquery
        </store>
        …
      </match>

      1. The filter plugin mutates incoming data: add/modify/delete event attributes and transform data without a code deploy.
      2. The copy output plugin copies events to multiple outputs - file(s), multiple databases, different DB engines. Great for shipping the same event to multiple subsystems.
      3. The file store keeps event data on disk (for backup/log purposes).
      4. The BigQuery output plugin streams the event into the BigQuery warehouse on the fly - no integration code to write, and data is available for querying immediately. Whenever needed, other output plugins can be wired in: Kafka, the Google Cloud Storage output plugin.
  12. Step 1 - record_transformer:

      <filter frontend.user.*>
        @type record_transformer
        enable_ruby
        remove_keys host
        <record>
          bq  {"insert_id":"${uid}","host":"${host}","created":"${time.to_i}"}
          avg ${record["total"] / record["count"]}
        </record>
      </filter>

      Syntax: Ruby, easy to use. Great for:
      - date transformations,
      - quick normalizations,
      - calculating something on the fly and storing it in a clean log/analytics DB,
      - renaming fields without a code deploy.
  13. Steps 2-3 - copy the event and store it to file (forest plugin):

      <match frontend.user.*>
        @type copy
        <store>
          @type forest
          subtype file
          <template>
            path /tank/storage/${tag}.*.log
            time_slice_format %Y%m%d
          </template>
        </store>
      </match>
  14. Step 4 - stream to BigQuery:

      <match frontend.user.*>
        @type bigquery
        method insert
        auth_method json_key
        json_key /etc/td-agent/keys/key-31da042be48c.json
        project project_id
        dataset dataset_name
        time_field timestamp
        time_slice_format %Y%m%d
        table user$%{time_slice}
        ignore_unknown_values
        schema_path /etc/td-agent/schema/user_login.json
      </match>

      The connector uses a JSON key file for auth and a JSON table schema.
      Pro features: streaming into partitioned tables (table user$%{time_slice}) and ignoring values not reflected in the schema (ignore_unknown_values).
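      For comparison only (not part of the talk's setup), the same streaming-insert step can be sketched directly with the BigQuery Python client when Fluentd is not in the picture; the table and field names below are invented:

        # Illustrative alternative to the Fluentd bigquery output plugin:
        # stream rows into a (hypothetical) table with the Python client.
        import datetime
        from google.cloud import bigquery

        client = bigquery.Client(project="project_id")

        rows = [{
            "insert_id": "e2f1c8",   # made-up example event
            "host": "web-01",
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }]

        # row_ids serves the same best-effort de-duplication purpose as an insert id.
        errors = client.insert_rows_json(
            "project_id.dataset_name.user",
            rows,
            row_ids=[r["insert_id"] for r in rows],
        )
        if errors:
            print("streaming insert failed:", errors)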
  15. Where to use BigQuery?
      • On data that is difficult to process/analyze using traditional databases
      • On exploring unstructured data
      • Not a replacement for traditional DBs, but a complement to the system
      • Applying JavaScript UDFs on columnar storage to solve complex tasks, e.g. JavaScript for natural language processing (see the sketch after this list)
      • On streams (forms, Kafka, IoT streams)
      • Its major strength is handling large datasets
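      As a toy illustration of the JavaScript-UDF point above (not a query from the talk; the table and column names are made up), a JavaScript function can be declared inline and applied in Standard SQL:

        # Toy sketch: a JavaScript UDF in BigQuery Standard SQL, run via the Python client.
        # The table `my-project.analytics.articles` and its `title`/`body` columns are hypothetical.
        from google.cloud import bigquery

        client = bigquery.Client(project="my-project")

        sql = r"""
        CREATE TEMP FUNCTION word_count(text STRING)
        RETURNS FLOAT64                 -- JavaScript numbers map to FLOAT64
        LANGUAGE js AS r'''
          if (text === null || text.trim() === "") return 0;
          return text.trim().split(/\s+/).length;
        ''';

        SELECT title, word_count(body) AS words
        FROM `my-project.analytics.articles`
        ORDER BY words DESC
        LIMIT 10
        """

        for row in client.query(sql).result():
            print(row.title, row.words)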
  16. Achievements - goals reached by measuring everything
      ➢ Optimize product pages: find, store and analyse in BQ the time-consuming user actions, using 25x more custom events/hits than Google Analytics
      ➢ Email engagement: with every open/click stored as raw data, improve subject lines, layout, follow-up action emails and an assistant-like experience through heavy A/B split tests on email marketing campaigns (an interactive feedback loop)
      ➢ Funnel analysis: wrangle all the data to discover small improvements, an AI-driven personalised upsell experience, and pre-sell products configured on the go - not yet in the catalog, but easily tweaked/customized
  17. Attribute credit to the first article visited on a purchase. Example HITS chains:
      • article1 -> page2 -> page3 -> page4 -> orderpage1 -> thankyoupage1
      • page1 -> article2 -> page3 -> orderpage2 -> ...
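      A hedged sketch of how such first-touch article attribution could be written in Standard SQL (the talk does not show the actual query; the hits table and its session_id / ts / page / page_type columns are assumptions):

        # Sketch of first-article attribution over a hypothetical hits table.
        # Columns (session_id, ts, page, page_type) are assumed, not the talk's schema.
        from google.cloud import bigquery

        client = bigquery.Client(project="my-project")

        sql = """
        WITH purchases AS (            -- sessions that reached an order page
          SELECT DISTINCT session_id
          FROM `my-project.analytics.hits`
          WHERE page_type = 'order'
        )
        SELECT
          h.session_id,
          ARRAY_AGG(h.page ORDER BY h.ts LIMIT 1)[OFFSET(0)] AS first_article
        FROM `my-project.analytics.hits` AS h
        JOIN purchases AS p ON p.session_id = h.session_id
        WHERE h.page_type = 'article'  -- credit goes to the first article seen in the session
        GROUP BY h.session_id
        """

        for row in client.query(sql).result():
            print(row.session_id, row.first_article)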
  18. BigQuery: Serverless Data Warehouse
      • No manual sharding
      • No capacity guessing
      • No idle resources
      • No maintenance windows
      • No manual scaling
      • No file mgmt
  19. Our benefits
      • No provisioning/deploys
      • No running out of resources
      • No more focus on large-scale execution plans
      • No more throwing away, expiring, or aggregating old data
      • Run raw ad-hoc queries (whether by analysts/sales or devs)
      • Use JavaScript in SQL for an awesome BigData experience wrangling “unstructured” data like a nerd