Using Elasticsearch for Analytics

This presentation summarizes how we use Elasticsearch for analytics at Wingify for our product Visual Website Optimizer (http://vwo.com). This presentation was prepared for my poster session at The Fifth Elephant (https://funnel.hasgeek.com/fifthel2014/1143-using-elasticsearch-for-analytics).

Vaidik Kapoor

August 01, 2014

Transcript

  1. Using Elasticsearch for Analytics
    How we use Elasticsearch for analytics at Wingify
    Vaidik Kapoor
    github.com/vaidik
    twitter.com/vaidikkapoor

  2. Problem Statement
    VWO collects the number of visitors and conversions per goal per variation for every
    campaign created. Our customers use these numbers to make optimization decisions. They are
    very useful but limiting: they are overall totals, and drilling down was not possible.
    We need an analytics engine that:
    ● can store millions of data points per day, essentially JSON docs.
    ● exposes a flexible and powerful query interface for segmenting visitor and
    conversion data, which is extremely useful for our customers to derive insights.
    ● answers queries reasonably fast - response times of 2-5 seconds are acceptable.
    ● is not too difficult to maintain in production - operations should be easy for a lean team.
    ● is easy to extend to provide new features.

  3. Elasticsearch to the rescue
    ● A distributed, near real-time search engine. It is also widely used as an
    analytics engine, which makes it a proven solution.
    ● Highly available, fault tolerant, distributed - built from the ground up to work
    in the cloud.
    ● Elasticsearch's cluster management takes care of node downtime, which
    makes operations rather easy instead of being a headache. Application
    development remains the same no matter how you deploy Elasticsearch,
    i.e. as a cluster or as a single node.
    ● Capable of performing all the major types of searches, matches and
    aggregations. Also supports limited regular expressions.
    ● Easy index and replica creation on a live cluster.
    ● Easy management of the cluster and indices through the REST API.

  4. How we use Elasticsearch
    1. Store a document for every unique visitor per campaign in Elasticsearch.
    The document contains:
    a. Visitor-related segment properties like geo data, platform information,
    referral, etc.
    b. Information related to conversion of goals.
    2. Use Nested Types to create a hierarchy between every unique visitor’s visit
    and conversions.
    3. Use the Aggregations/Facets framework to generate date-wise counts of
    visitors and conversions, and basic stats like average and total revenue, sum
    of squares of revenue, etc.
    4. Never use script facets/aggs to get counts of a combination of values from
    the same document. Scripts are slow. Instead, compute the value and index it
    with the document at index time.
    Visitor documents in Elasticsearch:
    {
      "account": 196,
      "experiment": 77,
      "combination": "5",
      "hit_time": "2014-07-09T23:21:15",
      "ip": "71.12.234.0",
      "os": "Android",
      "os_version": "4.1.2",
      "device": "Huawei Y301A2",
      "device_type": "Mobile",
      "touch_capable": true,
      "browser": "Android",
      "browser_version": "4.1.2",
      "document_encoding": "UTF-8",
      "user_language": "en-us",
      "city": "Mandeville",
      "country": "United States",
      "region": "Louisiana",
      "url": "https://vwo.com/free-trial",
      "query_params": [],
      "direct_traffic": true,
      "search_traffic": false,
      "email_traffic": false,
      "returning_visitor": false,
      "converted_goals": [...],
      ...
    }
    The "converted_goals" field, expanded:
    "converted_goals": [
      {
        "id": 2,
        "facet_term": "5_2",
        "conversion_time": "2014-07-09T23:32:41"
      },
      {
        "id": 6,
        "facet_term": "5_6",
        "conversion_time": "2014-07-09T23:37:04"
      }
    ]
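
A minimal Python sketch of point 4, precomputing a combined value at index time instead of scripting it at query time. It assumes that facet_term is the combination and the goal id joined by an underscore, which is what the sample values "5_2" and "5_6" suggest; the function name and the aggregation names are hypothetical.

```python
def build_converted_goal(combination, goal_id, conversion_time):
    """Build one nested converted_goals entry with a precomputed facet_term."""
    return {
        "id": goal_id,
        # Assumption: facet_term is "<combination>_<goal id>", as in "5_2".
        "facet_term": "{}_{}".format(combination, goal_id),
        "conversion_time": conversion_time,
    }


# Request body for counting conversions per (combination, goal) pair.
# A plain terms aggregation on the precomputed field replaces a script.
# The aggregation names "goals" and "per_combination_goal" are made up.
CONVERSIONS_AGG = {
    "size": 0,
    "aggs": {
        "goals": {
            "nested": {"path": "converted_goals"},
            "aggs": {
                "per_combination_goal": {
                    "terms": {"field": "converted_goals.facet_term"}
                }
            },
        }
    },
}

goal = build_converted_goal("5", 2, "2014-07-09T23:32:41")
print(goal["facet_term"])
```

The trade-off is a slightly larger document in exchange for an aggregation that only reads an indexed field.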

  5. Current Architecture
    Alongside Elasticsearch as our primary data store, we use a few other
    components:
    ● RabbitMQ - our central queue, which receives all the analytics data
    and pushes it to all the consumers, which write to different data stores
    including Elasticsearch and MySQL.
    ● MySQL - stores overall counters of visitors and conversions per
    goal per variation of every campaign. This serves as a cache in front
    of Elasticsearch: it saves us from calculating total counts by
    iterating over all the documents and makes reports load faster.
    ● Consumers - written in Python, responsible for sanitizing and storing
    data in Elasticsearch and MySQL. New visitors are inserted as a
    document in Elasticsearch. Conversions of existing visitors are
    recorded in the document previously inserted for the visitor that
    converted, using Elasticsearch’s Update API (Script Updates).
    ● Analytics API Server - written in Python using Flask, Gevent and
    Celery.
    ○ Exposes APIs for querying segmented data and for other
    tasks such as starting campaign tracking, flushing campaign
    data, flushing account data, etc.
    ○ Provides a custom JSON-based Query DSL which makes the
    Query API easy to consume. The API server translates this
    Query DSL to Elasticsearch’s DSL. Example:
    {
      "and": [
        { "or": [ { "city": "New Delhi" },
                  { "city": "Gurgaon" } ] },
        { "not": { "device_type": "Mobile" } }
      ]
    }
    [Architecture diagram. Labels: Data Acquisition Servers (USA West, USA East, Europe, Asia); Central Queue; Consumers / Workers (1, 2, 3, 4); Front-end Application; Analytics API Server; "Update counters"; "Sync visitors and conversions".]
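
The translation step described on this slide could look roughly like the sketch below. It is not the actual Wingify implementation: the function name translate is hypothetical, and mapping leaves to term queries inside bool clauses is an assumption (term queries match exact values, so the fields would need to be indexed without analysis).

```python
def translate(node):
    """Translate a custom and/or/not DSL node into an Elasticsearch query dict."""
    if "and" in node:
        return {"bool": {"must": [translate(n) for n in node["and"]]}}
    if "or" in node:
        return {
            "bool": {
                "should": [translate(n) for n in node["or"]],
                # At least one alternative has to match.
                "minimum_should_match": 1,
            }
        }
    if "not" in node:
        return {"bool": {"must_not": [translate(node["not"])]}}
    # Leaf: a single {field: value} pair becomes a term query.
    if len(node) != 1:
        raise ValueError("a leaf must contain exactly one field")
    field, value = next(iter(node.items()))
    return {"term": {field: value}}


custom_query = {
    "and": [
        {"or": [{"city": "New Delhi"}, {"city": "Gurgaon"}]},
        {"not": {"device_type": "Mobile"}},
    ]
}
es_query = {"query": translate(custom_query)}
```

Keeping the public DSL separate from Elasticsearch's own DSL means the storage layer can change without breaking API consumers.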

  6. Plan for Scaling
    Elasticsearch scales, but only when you plan for it. Consider the following:
    ● Make your data shardable - we cannot emphasize this enough. If you cannot shard your data, then
    scaling out will always be a problem, especially with time-series data, as it always grows. There are
    options like user-based and time-based indices. You may shard according to something else. Find what works
    for you.
    ● Use routing to scale reads. Without routing, a query hits every shard, and each shard has to find a small
    number of matching documents among all the documents it holds (a needle in a larger haystack). With
    a lot of shards this gets worse: ES does not return until the responses from all the shards have arrived
    and been aggregated at the node that received the request.
    ● Avoid hotspots caused by routing. Some shards can end up with a lot more data than
    the rest of the shards.
    ● Use the Bulk API for the right things - updating or deleting a large number of documents on an ad hoc basis,
    bulk indexing from another source, etc.
    ● Increase the number of shards per index for data distribution, but keep it sane if you are creating
    many indices (like one per day), as shards are resource hungry.
    ● Increase the replica count to get higher search throughput.
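
The sharding and routing advice can be illustrated with a small Python sketch. Everything named here is an assumption: the monthly "visitors-YYYY.MM" index naming, routing by account id, and the sample document counts. Elasticsearch picks a shard as hash(routing) % number_of_primary_shards; CRC32 below is only a deterministic stand-in for its internal hash, so real shard numbers will differ.

```python
import zlib
from collections import Counter
from datetime import datetime


def index_name(hit_time, prefix="visitors"):
    """Return a time-based (monthly) index name for a visitor's hit_time."""
    ts = datetime.strptime(hit_time, "%Y-%m-%dT%H:%M:%S")
    return "{}-{:%Y.%m}".format(prefix, ts)


def shard_for(routing, num_shards):
    """Stand-in for hash(routing) % number_of_primary_shards (not the real hash)."""
    return zlib.crc32(str(routing).encode("utf-8")) % num_shards


def docs_per_shard(doc_counts_by_account, num_shards):
    """Estimate how many documents land on each shard when routing by account."""
    counts = Counter()
    for account, n_docs in doc_counts_by_account.items():
        counts[shard_for(account, num_shards)] += n_docs
    return counts


# Hypothetical document counts per account; one large account dominates.
sample = {196: 5000000, 12: 40000, 77: 90000, 301: 15000}
distribution = docs_per_shard(sample, num_shards=5)
print(index_name("2014-07-09T23:21:15"), dict(distribution))
```

Whichever shard receives the largest account holds most of the data, which is the hotspot the slide warns about. Running a check like this on real per-account counts before choosing a routing key is cheaper than re-indexing later.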

  7. Ops - What We Learned
    ● Elasticsearch does not have ACL - important if you are dealing with user data.
    ○ There are existing 3rd party plugins for ACL.
    ○ In our opinion, it is better to run Elasticsearch behind Nginx (or Apache) and let Nginx take care of
    ACL. This can be easily achieved using Nginx + Lua. You may use something equivalent.
    ● Have dedicated Master nodes - these ensure that Elasticsearch’s cluster management
    does not stop (important for HA). Master-only nodes can run on relatively small machines
    compared to Data nodes.
    ● Disable deleting of indices using wildcards or _all to avoid the most obvious disaster.
    ● Spend some time with the JVM. Monitor resource consumption, especially memory, and see
    which Garbage Collector works best for you. For us, G1GC worked better than CMS
    because of our high indexing rate requirement.
    ● Consider using Doc Values - the major advantage is that it takes memory management out of
    the JVM and lets the kernel manage that memory through the disk cache.
    ● Use the Snapshot API and be prepared to use the Restore API, hoping you never really have to.
    ● Consider rolling restarts, optimizing indices before each restart.
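
As one concrete case, wildcard and _all deletes are controlled by the action.destructive_requires_name setting, which can be set in elasticsearch.yml or through the cluster update settings API. The Python sketch below only builds the request body for PUT /_cluster/settings; the helper name is hypothetical, and whether to use "persistent" or "transient" is a choice for your own cluster.

```python
import json


def destructive_guard_settings(persistent=True):
    """Build a cluster settings body that forbids wildcard/_all index deletes."""
    scope = "persistent" if persistent else "transient"
    return {scope: {"action.destructive_requires_name": True}}


# Send this body with PUT /_cluster/settings using any HTTP client.
body = json.dumps(destructive_guard_settings())
print(body)
```

With the setting enabled, a delete request has to name each index explicitly.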
