The New Standard for Mobile KPI Analytics: Fluentd + Google BigQuery #gcpライブ #gcpja

GoogleCloudPlatformJapan

February 25, 2015

Transcript

  1. Confidential and proprietary
    The New Standard for Mobile KPI Analytics
    Fluentd + Google BigQuery
    Kazunori Sato, Developer Advocate, Cloud Platform team
    #gcpライブ

  2.
    +Kazunori Sato
    @kazunori_279
    Developer Advocate,
    Cloud Platform, Google Inc
    - GCP developer community support
    - GCP product launch support

  3.
    Agenda
    Big Data in Google and Google BigQuery
    Why is BigQuery so fast?
    Real-time Streaming Import by Fluentd + BigQuery
    Real-time KPI analytics by Lambda Architecture

  4.
    Big Data in Google and
    Google BigQuery

  5.
    Big Data in Google
    100 hours/min
    100 petabytes
    500+ million users
    900+ million devices

  6.
    Cloud Technology Innovations
    [Timeline figure; years shown: 2003, 2004, 2006, 2007, 2010, 2011, 2012, 2013]
    Technologies and products shown: GFS, MapReduce, Bigtable, Paxos impl., Dremel, Colossus, Spanner/F1, Omega, Cloud Storage, BigQuery, Cloud Datastore

  7.
    At Google, we have “big” big data everywhere
    What if a Googler is asked:
    “Can you give me the list of top 20 Android apps installed in 2012?”

  8.
    At Google, we don’t use MapReduce for this.
    We use Dremel (= Google BigQuery).
    SELECT
      top(appId, 20) AS app,
      count(*) AS count
    FROM installlog.2012
    ORDER BY
      count DESC
    It scans 100B rows in ~30 sec, with no index used.
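
For readers who want to see what the query on this slide computes, here is a minimal Python sketch of the same top-N count. It only illustrates the aggregation, not how Dremel or BigQuery executes it. The sample rows and the `top_apps` helper are hypothetical names introduced for this example.

```python
from collections import Counter


def top_apps(install_log, n=20):
    """Return the n most frequently installed app IDs with their counts."""
    counts = Counter(record["appId"] for record in install_log)
    return counts.most_common(n)


# Hypothetical sample rows standing in for the install log table.
sample_log = [
    {"appId": "maps"}, {"appId": "mail"}, {"appId": "maps"},
    {"appId": "chat"}, {"appId": "maps"}, {"appId": "mail"},
]

print(top_apps(sample_log, n=2))  # [('maps', 3), ('mail', 2)]
```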

  9.
    Google BigQuery: Massively Parallel Query Service

  10.
    Cost of BigQuery
    Storage: $0.020 per GB per month
    Queries: $5 per TB
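
The two prices on this slide can be turned into a rough monthly estimate. The sketch below uses only the slide's figures; the workload numbers (1,000 GB stored, 10 TB scanned) and the function name `monthly_cost_usd` are assumptions made up for illustration.

```python
STORAGE_USD_PER_GB_MONTH = 0.020  # price quoted on the slide
QUERY_USD_PER_TB = 5.0            # price quoted on the slide


def monthly_cost_usd(stored_gb, scanned_tb):
    """Estimate a monthly bill from stored volume and bytes scanned by queries."""
    storage = stored_gb * STORAGE_USD_PER_GB_MONTH
    queries = scanned_tb * QUERY_USD_PER_TB
    return storage + queries


# Hypothetical workload: 1,000 GB stored and 10 TB scanned in a month.
estimate = monthly_cost_usd(stored_gb=1000, scanned_tb=10)
print(f"${estimate:.2f} per month")  # $70.00 per month
```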

  11.
    Applications
    - Gaming, Social, Mobile
    - Ads, Digital Marketing, DMP, Media
    - Monitoring, Alerting and Security
    - Retail
    - Internet of Things (IoT)

  12.
    Import, Analyze and Export
    BigQuery Analytic Service in the Cloud
    [Diagram with BigQuery at the center and three groups labeled Import, Analyze, and Export]
    Items shown: AdWords, DoubleClick, Google Analytics, event logs, databases, IoT devices, BI tools, R and Pandas, Microsoft Excel, Google Spreadsheet, Hadoop/Hive, Spark

  13.
    BigQuery + BI
    Tableau Demo
    BIME Demo

  14.
    Why is BigQuery so fast?

  15.
    Column Oriented Storage
    [Diagram comparing Record Oriented Storage with Column Oriented Storage]
    Less bandwidth, more compression
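
The "less bandwidth" point can be sketched in a few lines of Python: a query that needs one column touches every field in a record-oriented layout, but only that column in a column-oriented one. The table contents and helper names below are hypothetical, and the sketch counts values read rather than modeling any real storage format.

```python
# Hypothetical table with three fields per record.
rows = [
    {"title": "A", "views": 10, "lang": "en"},
    {"title": "B", "views": 25, "lang": "ja"},
    {"title": "C", "views": 7, "lang": "en"},
]


def to_columns(records):
    """Pivot record-oriented rows into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_read_row_store(records):
    """A record-oriented scan reads every field of every record."""
    return sum(len(record) for record in records)


def values_read_column_store(columns, wanted):
    """A column-oriented scan reads only the requested columns."""
    return sum(len(columns[name]) for name in wanted)


columns = to_columns(rows)
print(sum(columns["views"]))                        # total views, using one column
print(values_read_row_store(rows))                  # values touched by a row scan
print(values_read_column_store(columns, ["views"]))  # values touched by a column scan
```

Values within one column also tend to resemble each other (for example, the repeated `lang` codes), which is why column layouts usually compress better.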

  16.
    Massively Parallel Processing
    select top(title), count(*)
    from publicdata:samples.wikipedia
    Scanning 1 TB in 1 sec takes 5,000 disks
    Each query runs on thousands of servers
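
The disk count on this slide implies a per-disk read rate, which a quick calculation makes explicit. The sketch assumes 1 TB means 10^12 bytes and that the scan is spread evenly across disks; both are assumptions, and `per_disk_throughput_mb_s` is a hypothetical helper name.

```python
def per_disk_throughput_mb_s(total_bytes, seconds, disks):
    """Read rate each disk must sustain, in MB/s, for an evenly spread scan."""
    return total_bytes / seconds / disks / 1e6


# Slide figures: 1 TB scanned in 1 second across 5,000 disks.
rate = per_disk_throughput_mb_s(total_bytes=1e12, seconds=1, disks=5000)
print(f"{rate:.0f} MB/s per disk")  # 200 MB/s per disk
```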

  17.
    Fast aggregation by tree structure
    [Diagram: Mixer 0 at the root, two Mixer 1 nodes below it, four Shards below those, reading ColumnIO on Colossus]
    Query fragments shown in the diagram: SELECT state, year; COUNT(*) GROUP BY state; WHERE year >= 1980 and year < 1990; ORDER BY count_babies DESC LIMIT 10; COUNT(*) GROUP BY state
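
The tree-structured aggregation on this slide can be approximated in Python: each shard filters its rows and produces a partial count, intermediate mixers merge partial counts, and the root applies the final ordering and limit. This is a simplified sketch under that reading of the diagram; the function names, the fan-out of 2, and the sample rows are assumptions, not the actual Dremel implementation.

```python
from collections import Counter


def shard_count(rows):
    """Leaf: apply the WHERE filter and compute a partial COUNT(*) GROUP BY state."""
    return Counter(r["state"] for r in rows if 1980 <= r["year"] < 1990)


def mix(partials):
    """Mixer: merge partial counts coming from child nodes."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_query(shards, fanout=2, limit=10):
    """Aggregate bottom-up: shards, then level-1 mixers, then the root mixer."""
    partials = [shard_count(shard) for shard in shards]
    level1 = [mix(partials[i:i + fanout]) for i in range(0, len(partials), fanout)]
    root = mix(level1)
    # The root applies ORDER BY count DESC and LIMIT.
    return root.most_common(limit)


# Hypothetical data split across four shards.
shards = [
    [{"state": "CA", "year": 1985}, {"state": "NY", "year": 1975}],
    [{"state": "CA", "year": 1981}],
    [{"state": "TX", "year": 1989}, {"state": "CA", "year": 1990}],
    [{"state": "TX", "year": 1980}, {"state": "CA", "year": 1982}],
]

print(run_query(shards))  # [('CA', 3), ('TX', 2)]
```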

  18.
    Inside BQ: Big JOIN
    Big JOIN: executed with shuffling
    - Both tables can be > 8MB
    - BQ shuffler doesn’t sort; it only does hash partitioning
    From: Google BigQuery Analytics
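
A join executed by shuffling, as described on this slide, can be sketched as follows: both inputs are hash-partitioned on the join key without sorting, so matching keys land in the same partition, and each partition is then joined independently. The helper names, the partition count, and the sample tables are hypothetical; this is an illustration of hash partitioning, not BigQuery's shuffler.

```python
from collections import defaultdict


def partition(rows, key, n):
    """Hash-partition rows on the join key; no sorting is involved."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def join_partition(left, right, key):
    """Join one pair of partitions with an in-memory hash table."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def shuffle_join(left, right, key, n=4):
    """Partition both sides the same way, then join partition by partition."""
    joined = []
    for lpart, rpart in zip(partition(left, key, n), partition(right, key, n)):
        joined.extend(join_partition(lpart, rpart, key))
    return joined


# Hypothetical tables.
users = [{"user_id": 1, "name": "a"}, {"user_id": 2, "name": "b"}]
events = [
    {"user_id": 1, "event": "x"},
    {"user_id": 1, "event": "y"},
    {"user_id": 3, "event": "z"},
]

result = shuffle_join(users, events, "user_id")
print(sorted((row["name"], row["event"]) for row in result))
```

Because each partition pair can be processed on a different worker, neither table has to fit on a single machine.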

  19.
    Real-time Streaming Import
    with Fluentd + BigQuery

  20.
    “I want a real-time dashboard for collecting the votes and system stats from 200 servers”

  21.
    BigQuery Streaming
    - Low cost: $0.01 per 100,000 rows
    - Real-time availability of data
    - 100,000 rows per second × number of tables
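
The streaming figures on this slide lend themselves to a quick back-of-the-envelope calculation. The sketch below uses only the slide's price and per-table rate; the helper names and the example volumes are assumptions for illustration.

```python
import math

USD_PER_100K_ROWS = 0.01          # price quoted on the slide
ROWS_PER_SEC_PER_TABLE = 100_000  # per-table rate quoted on the slide


def streaming_cost_usd(rows):
    """Cost of streaming the given number of rows."""
    return rows / 100_000 * USD_PER_100K_ROWS


def tables_needed(rows_per_sec):
    """Number of tables required to absorb a target ingest rate."""
    return math.ceil(rows_per_sec / ROWS_PER_SEC_PER_TABLE)


# Hypothetical volumes: one billion rows, and a 1M rows/sec ingest target.
print(f"${streaming_cost_usd(1_000_000_000):.2f}")  # $100.00
print(tables_needed(1_000_000))                     # 10
```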

  22.
    SlideShare uses Fluentd for collecting logs from >500 servers.
    "We take full advantage of its extendable plugin architecture and use it as a message bus that collects data from hundreds of servers into multiple backend systems." (Sylvain Kalache, Operations Engineer)

  23.
    Why Fluentd? Because it’s super easy to use,
    and has an extensive set of plugins written by an active community.

  24.
    Now Fluentd logs can be imported into
    BigQuery really easily, at ~1M rows/s
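
The deck does not show how log records reach BigQuery on the wire. As a rough sketch, the code below builds request bodies in the shape used by BigQuery's `tabledata.insertAll` streaming method (a `rows` list whose entries carry an `insertId` and a `json` payload). It makes no network calls. The helper names, the batch size of 500, and the sample records are assumptions, and this is not the Fluentd plugin's actual code.

```python
import json
import uuid


def batches(records, size=500):
    """Split records into fixed-size batches (the size here is an arbitrary choice)."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def build_insert_all_body(records):
    """Build a request body shaped like a tabledata.insertAll request."""
    return {
        "kind": "bigquery#tableDataInsertAllRequest",
        "rows": [
            # insertId lets the service de-duplicate retried rows on a best-effort basis.
            {"insertId": str(uuid.uuid4()), "json": record}
            for record in records
        ],
    }


# Hypothetical log records as a collector might emit them.
logs = [{"user": "u1", "event": "vote"}, {"user": "u2", "event": "purchase"}]
for batch in batches(logs, size=1):
    print(json.dumps(build_insert_all_body(batch)))
```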

  25.
    Search “fluentd bigquery” on GitHub

  26.
    IoT Example: RasPi > BigQuery > Spreadsheet
    Google Spreadsheet

  27.
    Real-time KPI Analytics with
    Lambda Architecture

  28.
    Lambda Architecture is:
    A complementary pair of:
    - in-memory real-time processing: fast, but small and volatile
    - large HDD/SSD batch processing: slow, but large and persistent
    Proposed by Nathan Marz
    ex. Twitter Summingbird
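
The pairing described on this slide can be illustrated with a toy counter: every event is appended to a persistent log and also counted in a small in-memory view; a periodic batch job recomputes the full view from the log, and queries combine both. The class and method names are hypothetical, and the sketch ignores events that arrive while a batch run is in progress.

```python
from collections import Counter


class LambdaCounter:
    """Toy event counter combining a batch view with a real-time view."""

    def __init__(self):
        self.master_log = []         # append-only raw events (large, persistent)
        self.batch_view = Counter()  # recomputed from the log, slow to refresh
        self.speed_view = Counter()  # in-memory counts for events since the last batch

    def ingest(self, event):
        """Record the event durably and update the real-time view."""
        self.master_log.append(event)
        self.speed_view[event["key"]] += 1

    def run_batch(self):
        """Recompute the batch view from all raw events, then reset the speed view."""
        self.batch_view = Counter(e["key"] for e in self.master_log)
        self.speed_view.clear()

    def query(self, key):
        """Answer from both layers so recent events are always included."""
        return self.batch_view[key] + self.speed_view[key]


counter = LambdaCounter()
for _ in range(3):
    counter.ingest({"key": "vote"})
print(counter.query("vote"))  # 3, served entirely by the speed layer
counter.run_batch()
counter.ingest({"key": "vote"})
print(counter.query("vote"))  # 4, batch view (3) plus speed view (1)
```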

  29.
    Norikra: an open source stream processing tool
    Production use at LINE, the largest Asian SNS with 500M users, for massive log analysis
    Super easy to use: requires no heavyweight cluster setup

  30.
    Real-time KPI analysis with SQL-based in-memory continuous query
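
To give a feel for what an in-memory continuous query does, here is a small Python stand-in that keeps per-key counts over a sliding time window and updates them as events arrive. It does not use Norikra or its SQL dialect; the class name, the 600-second window, and the sample events are assumptions, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Count events per key over the last `window_sec` seconds, in memory."""

    def __init__(self, window_sec):
        self.window_sec = window_sec
        self.events = deque()    # (timestamp, key) pairs, oldest first
        self.counts = Counter()  # current per-key counts inside the window

    def add(self, timestamp, key):
        """Register a new event and drop any events that fell out of the window."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Events at or before (now - window_sec) are outside the window.
        while self.events and self.events[0][0] <= now - self.window_sec:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def snapshot(self, now):
        """Return the current counts as of time `now`."""
        self._expire(now)
        return dict(self.counts)


# Hypothetical events with timestamps in seconds; the window is 10 minutes.
window = SlidingWindowCounter(window_sec=600)
window.add(0, "purchase")
window.add(300, "purchase")
window.add(500, "vote")
print(window.snapshot(now=599))  # {'purchase': 2, 'vote': 1}
print(window.snapshot(now=600))  # {'purchase': 1, 'vote': 1}
```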

  31.
    Proposed Solution: Lambda Architecture

  32.
    Proposed Solution: Lambda Architecture
    1. Fluentd: event log collection from various event sources
    2. Norikra: easy, scalable real-time stream processing
    3. BigQuery: scalable query engine for large datasets
    4. Google Spreadsheet: flexible dashboard with charts
    5. Docker: repeatable deployment in 10 minutes

  33.
    Demo

  34.
    Applications
    Real-time KPI Dashboard
    ● Gaming: How many new users have purchased their first item in the last 10 minutes?
    ● Media: How many people hit the vote button during the live TV program?
    ● Retail: What is the current total revenue of all stores nationwide?
    ● Ads: What is the conversion rate of impressions/clicks to purchase?
    Real-time Monitoring and Alerting
    ● Correlate system resource usage with access/application logs
    ● Real-time DoS or cheating detection
    ● Send e-mail notification from Apps Script triggered by Norikra

  35.
    Solution Benefits
    - Easy real-time SQL-based KPI analytics at 1M+ rows/sec by Norikra
    - Easy real-time streaming import at 1M+ rows/sec by BigQuery + Fluentd
    - Real-time dashboard with Google Spreadsheet
    - Deployable within 10 min with Docker
    Search “lambda dashboard” on GitHub