Sydney Elastic Meetup - Elastic{ON} 2016 Recap

A recap of the main announcements from our recent global conference! We touched on Elasticsearch, Logstash, Kibana and Beats and so much more.

Presented by Mark Walkom, Joshua Rich, Russ Cam and Christian Strzadala.

For more detailed information on each of the topics, see the videos and slides from the original presenters at the conference here - https://www.elastic.co/elasticon/conf/2016/sf

Elastic Co

March 23, 2016

Transcript

  5. Example Pipeline: extract mysql fields from the `message` field and then remove it from the document

    { "description": "mysql pipeline", "processors": [ { "grok": { "field": "message", "pattern": "..." } }, { "remove": { "field": "message" } } ] }
  6. Indexing [An Enriched] Document: specifying an ingest pipeline to execute before indexing

    PUT localhost:9200/logs/mysql/1?pipeline=my_mysql_pipeline { "message": "070917 16:29:01 21 Query select * from location" }
  7. Indexing [An Enriched] Document: the document as it is being indexed

    PUT localhost:9200/logs/mysql/1 { "timestamp": "2007-09-17T16:29:01-08:00", "thread_id": "21", "command_type": "Query", "command": "select * from location" }
  8. Ingest Node summary

    • Pure Java implementation of the grok, mutate, kv and date Logstash filters.
    • Runs on a new dedicated node type (master: false, data: false).
    • Intercepts bulk/index requests, performs enrichment, sends to data nodes.
    • Multiple pipelines will be supported.
    • Pipeline configuration is stored as a document in an index.
    • Eliminates the need for a server running Logstash for simple filtering.

    *beat -> (Ingest Node -> Elasticsearch)

    https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html
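
The pipeline slides above can be modelled as a chain of small functions applied to a document before it is stored. The following is a minimal in-memory sketch, not the Ingest Node implementation: the regular expression stands in for the grok pattern that the slide elides as "...", the `to_timestamp` step is an assumed extra (the slide's pipeline only lists grok and remove), and no timezone offset is attached to the result.

```python
import re
from datetime import datetime

# Hypothetical stand-in for the grok pattern elided ("...") on the slide.
MYSQL_LINE = re.compile(
    r"^(?P<date>\d{6}) (?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<thread_id>\d+) (?P<command_type>\w+) (?P<command>.*)$"
)


def grok(doc, field, pattern):
    """Copy the pattern's named groups from doc[field] into a new document."""
    match = pattern.match(doc[field])
    if match is None:
        raise ValueError(f"{field!r} did not match the pattern")
    return {**doc, **match.groupdict()}


def remove(doc, field):
    """Return a copy of the document without the given field."""
    return {k: v for k, v in doc.items() if k != field}


def to_timestamp(doc):
    """Assumed extra step: merge the date and time fields into an ISO timestamp."""
    parsed = datetime.strptime(doc["date"] + " " + doc["time"], "%y%m%d %H:%M:%S")
    rest = {k: v for k, v in doc.items() if k not in ("date", "time")}
    return {"timestamp": parsed.isoformat(), **rest}


def run_pipeline(doc, processors):
    """Apply each processor in order, as a pipeline would before indexing."""
    for processor in processors:
        doc = processor(doc)
    return doc


PIPELINE = [
    lambda d: grok(d, "message", MYSQL_LINE),
    lambda d: remove(d, "message"),
    to_timestamp,
]

enriched = run_pipeline(
    {"message": "070917 16:29:01 21 Query select * from location"}, PIPELINE
)
print(enriched)
```

The raw `message` goes in and only the extracted fields come out, which is the same shape as the enriched document shown on the slides.
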
  9. Current model - TF/IDF: a one-size-fits-most approach

    • TF - Term Frequency: number of times the term occurs in the document
    • IDF - Inverse Document Frequency: inverse of how often the term occurs across all documents
  10. TF Saturation Curve

    • Limit influence of Term Frequency
    • Tune influence by tweaking k

    [Plot: score against term frequency. Curves: TF/IDF (keeps growing!), BM25 with k = 10, BM25 with k = 1]
  11. Influence of Document Length

    • Tune influence of document length by tweaking b

    [Plot: score against document length. Curves: TF/IDF, BM25 with b = 0.75, BM25 with b = 0.1]
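
The effect of k and b on the two curves above can be reproduced with a few lines of arithmetic. This sketch compares only the term-frequency factors (IDF is left out): `classic_tf` is the square-root factor used by Lucene's classic TF/IDF similarity, and `bm25_tf` is the standard BM25 term-frequency component. The function names and the document lengths are illustrative assumptions.

```python
import math


def classic_tf(tf):
    """Term-frequency factor of Lucene's classic TF/IDF similarity: sqrt(tf)."""
    return math.sqrt(tf)


def bm25_tf(tf, k=1.2, b=0.75, doc_len=100.0, avg_doc_len=100.0):
    """Term-frequency factor of BM25; it approaches k + 1 as tf grows."""
    # b controls how strongly documents longer than average are penalised.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k + 1.0) / (tf + k * length_norm)


for tf in (1, 2, 5, 10, 100, 1000):
    print(
        f"tf={tf:>4}  classic={classic_tf(tf):6.2f}  "
        f"bm25(k=1)={bm25_tf(tf, k=1):5.2f}  bm25(k=10)={bm25_tf(tf, k=10):5.2f}"
    )
```

With k = 1 the BM25 factor can never exceed 2 however often the term repeats, while the classic factor keeps growing. Setting b = 0 removes the document-length penalty entirely.
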
  12. A short history of BM25

    • 1975: Poisson distribution for terms
    • 1976: Robertson/Sparck Jones weight
    • 1977: Probability ranking principle
    • 1993: TREC-2, leap of faith
    • 1994: TREC-3, BM25 final!
    • 1999: First Lucene release (TF/IDF)
    • 2011: Pluggable similarities + BM25 in Lucene (GSoC, David Nemeskey)
    • 2016: BM25 becomes default! Elasticsearch 5.0, Lucene 6.0 (we are here)
    • ?
  13. BM25 summary: is it better?

    • Literature suggests so
    • Challenges suggest so
    • Users say so
    • Lucene developers say so
    • Konrad Beiske says so - "BM25 vs Lucene Default Similarity" https://www.elastic.co/blog/found-bm-vs-lucene-default-similarity
    • BUT: it depends on the features of your corpus
  14. geo_point Indexing Improvements

    Postings based on GeoPointField, introduced in Lucene 5.4.

    [Bar chart: Throughput, Index Size, Time and Heap for versions 2.1, 2.2 and 2.3 on a 0% to 100% axis. Bar labels as extracted: 80%, 53%, 52%, 66%, 54%, 28%, 39%, 12%]
  15. geo_point Search Improvements

    Postings based on GeoPointField, introduced in Lucene 5.4.

    [Bar chart: BoundingBox, Distance, DistanceRange and Polygon queries for versions 2.1, 2.2 and 2.3 on a 0% to 100% axis. Bar labels as extracted: 70%, 45%, 21%, 82%, 51%, 36%, 26%, 11%]
  16. geo_shape Search

    • Supports the following geo_shapes
      ◦ Point, MultiPoint
      ◦ LineString, MultiLineString
      ◦ Polygon (with holes), MultiPolygon (with holes)
    • Shapes are parsed using OGC and ISO standards definitions
    • Supports relational queries: Intersects, Disjoint, Within, Contains
  17. geo_point vs. geo_shape?

    • Some queries work on geo_point
    • Some queries work on geo_shape
    • New experimental geo_field type coming
      ◦ Can be used to represent both
  18. geo_distance Aggregation

    { "aggs": { "sf_rings": { "geo_distance": { "field": "location", "origin": "32.95, -96.82", "ranges": [ { "to": 50 }, { "from": 50, "to": 100 }, { "from": 100, "to": 300 } ] } } } }
  19. geohash_grid Aggregation

    { "aggs": { "crime_cells": { "geohash_grid": { "field": "location", "precision": 8 } } } }
  20. geo_centroid Aggregation

    { "query": { "match": { "crime": "burglary" } }, "aggs": { "towns": { "terms": { "field": "town" }, "aggs": { "centroid": { "geo_centroid": { "field": "location" } } } } } }
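
To make the geo_distance aggregation above more concrete, here is a rough sketch of what it computes: the distance from an origin to each point, then a count per distance ring. This is an illustration only. It assumes kilometres (the slide does not give a unit), uses the haversine formula on a spherical Earth, treats `from` as inclusive and `to` as exclusive, and the bucket keys are a made-up format.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0

# Same rings as the slide: up to 50, 50 to 100, 100 to 300 (assumed to be km here).
RANGES = [(None, 50), (50, 100), (100, 300)]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bucket_key(distance):
    """Return the key of the ring containing the distance, or None if outside all rings."""
    for lower, upper in RANGES:
        if (lower is None or distance >= lower) and distance < upper:
            return f"{'*' if lower is None else lower}-{upper}"
    return None


def geo_distance_buckets(origin, points):
    """Count points per distance ring around origin; all points are (lat, lon)."""
    counts = Counter()
    for lat, lon in points:
        key = bucket_key(haversine_km(origin[0], origin[1], lat, lon))
        if key is not None:
            counts[key] += 1
    return counts


origin = (32.95, -96.82)
points = [
    (32.95, -96.82),  # at the origin
    (33.45, -96.82),  # about 56 km north
    (33.95, -96.82),  # about 111 km north
    (37.95, -96.82),  # about 556 km north, outside every ring
]
rings = geo_distance_buckets(origin, points)
print(dict(rings))
```

A geohash_grid aggregation follows the same pattern, with a grid cell rather than a distance ring as the bucket key.
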
  21. Reindex: copy documents from one index to another

    POST /_reindex?refresh { "source": { "index": "old-index" }, "dest": { "index": "new-index" } }
  22. Reindex: copy documents matching a query and modify them whilst reindexing

    POST /_reindex?refresh { "source": { "index": "old-index", "query": { "match": { "text": "foo" } } }, "dest": { "index": "new-index" }, "script": { "inline": "ctx._id = ctx._id + \"a\"" } }
  23. Update by Query: update documents in an existing index

    POST /test/_update_by_query?refresh { "query": { "match": { "text": "foo" } }, "script": { "inline": "ctx._source.likes++" } }
  24. Reindex API summary

    • Uses the scroll API under the covers
    • Takes a snapshot of documents when reindex starts
    • Touch documents to pick up mapping updates on the fly
    • Change the mappings as part of the Reindex API
    • Backported to 2.3

    https://github.com/elastic/elasticsearch/issues/15201
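
The reindex and update-by-query requests above can be pictured with a toy in-memory model, where each index is a plain dict of `_id` to document. This is only a sketch of the behaviour the slides describe (snapshot the matching documents, optionally run a script on each, write to the destination). The function names, the callables standing in for the query and the script, and the sample documents are all hypothetical.

```python
import copy


def reindex(source, dest, query=None, script=None):
    """Copy matching documents from source to dest, optionally modifying each one."""
    # Snapshot the matching documents up front, so later changes to the
    # source index do not affect this run.
    snapshot = [
        {"_id": doc_id, "_source": copy.deepcopy(body)}
        for doc_id, body in source.items()
        if query is None or query(body)
    ]
    for ctx in snapshot:
        if script is not None:
            script(ctx)  # may change ctx["_id"] or ctx["_source"]
        dest[ctx["_id"]] = ctx["_source"]
    return len(snapshot)


def update_by_query(index, query, script):
    """Run the script against every matching document in place."""
    updated = 0
    for body in index.values():
        if query(body):
            script({"_source": body})
            updated += 1
    return updated


def matches_foo(body):
    """Stand-in for the match query on the text field."""
    return "foo" in body["text"].split()


def append_a(ctx):
    """Stand-in for the script ctx._id = ctx._id + "a"."""
    ctx["_id"] = ctx["_id"] + "a"


def increment_likes(ctx):
    """Stand-in for the script ctx._source.likes++."""
    ctx["_source"]["likes"] += 1


old_index = {
    "1": {"text": "foo bar", "likes": 0},
    "2": {"text": "baz", "likes": 3},
    "3": {"text": "more foo", "likes": 7},
}
new_index = {}

copied = reindex(old_index, new_index, query=matches_foo, script=append_a)
updated = update_by_query(old_index, matches_foo, increment_likes)
print(copied, sorted(new_index), updated, old_index["1"]["likes"])
```

Because the copies in `new_index` were taken before the update ran, they keep the original `likes` values.
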
  25. What is Graph Technology Good for? Fraud Detection

    • Given credit card purchase histories…
      ◦ Where did people with fraudulent purchases shop most often?
      ◦ Which vendor is responsible for stolen credit card numbers?
    • Given car emissions data…
      ◦ Which car manufacturer fails emissions tests most often?
      ◦ At which shops
  26. What is Graph Technology Good for? Identifying Relationships

    • Given Wikipedia…
      ◦ What topics / entities / locations are meaningfully related?
    • Given network traffic data…
      ◦ What external IPs do machines on my network talk to?
  27. What is Graph Technology Good for? Recommendations

    • Given my purchase history…
      ◦ What am I most likely to buy next?
    • Given Last.FM music preferences…
      ◦ What music do people who like Mozart also like?
    • Given search and click data…
      ◦ What results do people who searched for "mixer" tend to click on?
  28. We have the technology! Information Retrieval Techniques FTW

    • When indexing we already count and calculate statistics
    • Using these statistics, we can bring relevance to relationships
    • Identify links / properties of entities / groups significantly different from global averages
    • Aggregations enable efficient traversal and scale
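
The idea of finding properties that are "significantly different from global averages" can be sketched with simple counting. The example below compares how often a term appears in a foreground set (for instance, purchases by cardholders who reported fraud) with how often it appears in the whole data set, and scores terms by the absolute change multiplied by the relative change. This is one common heuristic chosen for illustration; the slides do not say which scoring the Graph API uses, and the vendor data is invented.

```python
from collections import Counter


def significance(foreground, background):
    """Score terms that are more frequent in foreground than in background.

    Both arguments are lists of term sets, one set per document.
    The score is (fg% - bg%) * (fg% / bg%); popular-everywhere terms score low.
    """
    fg_counts = Counter(term for doc in foreground for term in set(doc))
    bg_counts = Counter(term for doc in background for term in set(doc))
    scores = {}
    for term, fg_count in fg_counts.items():
        fg_pct = fg_count / len(foreground)
        bg_pct = bg_counts[term] / len(background)
        if bg_pct == 0 or fg_pct <= bg_pct:
            continue  # not over-represented in the foreground
        scores[term] = (fg_pct - bg_pct) * (fg_pct / bg_pct)
    return scores


# Hypothetical purchase histories: one set of vendors per cardholder.
all_cardholders = [
    {"supermarket", "petrol"},
    {"supermarket", "cafe"},
    {"supermarket", "kiosk"},
    {"petrol", "cafe"},
    {"supermarket", "kiosk", "cafe"},
    {"supermarket"},
]
# The cardholders who reported fraud (a subset of the list above).
fraud_cardholders = [all_cardholders[2], all_cardholders[4]]

scores = significance(fraud_cardholders, all_cardholders)
suspect = max(scores, key=scores.get)
print(suspect, scores)
```

The supermarket appears in every fraud case but is popular everywhere, so it scores low. The kiosk is rare overall yet present in every fraud case, so it stands out.
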
  29. Graph API: finding artists related to Jack Johnson

    GET /wikipedia/_graph/explore { "query": { "query_string": { "query": "Jack Johnson" } }, "vertices": [ { "field": "artists.raw" } ], "connections": { "vertices": [ { "field": "artists.raw" } ] } }
  30. Visual Exploration in Kibana

    Searching at Best Buy for a "mixer" - query and the product clicked
  31. Graph API summary

    • Simple graph-walking API
    • Leverages full Elasticsearch query DSL - No SPARQL
    • Relevance or count-based
    • Distributed query execution
    • Near-realtime data availability
    • Easily explore and visualize with Kibana

    https://www.elastic.co/webinars/sneak-peek-of-graph-capabilities-with-elasticsearch
  32. Windows Azure Marketplace

    • Create / Add to existing Resource Group
    • Multiple Elasticsearch versions
    • 3 Dedicated Master nodes - configurable VM sizes
    • Configurable (Master Eligible) Data nodes
    • Monitoring, Security and Kibana (BYOL)
    • Internal/External Load Balancer

    https://www.elastic.co/blog/microsoft-azure-marketplace-elasticsearch-kibana-and-more-now-available
  33. And More!

    • Query Profiler
    • Task Manager
      ◦ Long running tasks, status, cancellation, throttling
    • Java HTTP Client
    • Safer scripting language
      ◦ Painless: dynamic / static support
    • JarHell check and Java Security Manager
    • Splitting string fields into text and keyword types

    https://www.elastic.co/elasticon/conf/2016/sf/whats-evolving-in-elasticsearch
  34. Logstash Lines 1: Monitoring API, coming in v5

    • Overall throughput measurements, as well as per-filter throughput measurements (e.g., which grok filter is causing the slowdown)
    • Retry/error counts
    • Internal buffer sizes (e.g., useful for multiline monitoring)
    • Similar to the stats APIs in Elasticsearch (i.e., a RESTful interface for interrogation)
    • Brace yourselves, a GUI is coming as well (probably not in v5)
  35. Logstash Lines 2: Other Goodies

    • Dynamic config reload (v2.3)
      ‒ --auto-reload and --reload-interval options
    • Resiliency improvements (v5)
      ‒ Logstash is already damn good at not losing events (unless you, like, kill -9 it…)
      ‒ Persistent, disk-based queuing
      ‒ Output batching/ack'ing of events
    • Clustering and all the goodies that come with it:
      ‒ Centralised configuration and management APIs
      ‒ Horizontally scale (i.e. spread out) event processing
      ‒ Will still support single instance, file-based configuration
  36. Some quick ones

    • Filtering (v5):
      ‒ Choose a set of fields to be exported
      ‒ Remove a set of fields that are not of interest to the user
      ‒ e.g., drop all 200 OK responses in Packetbeat
    • Broker outputs (v5):
      ‒ Redis output is back!
      ‒ Kafka coming
  37. Metricbeat (v5): modular framework for beats

    • Single beat binary with built-in modules for different metrics
    • Ability to write your own modules or use an existing beat as the basis for a module
  38. We ❤ our community Beats: the community is full of busy beavers...

    • Unifiedbeat - Index into Elasticsearch the alert records from network intrusion detection software.
      ‒ https://github.com/cleesmith/unifiedbeat
    • Nagioscheckbeat - Index Nagios checks into Elasticsearch
      ‒ https://github.com/PhaedrusTheGreek/nagioscheckbeat
    • Factbeat - Ship Facter facts to Elasticsearch
      ‒ https://github.com/jarpy/factbeat
    • Hsbeat - Index JVM stats/metrics to Elasticsearch
      ‒ https://github.com/YaSuenag/hsbeat
    • Flowbeat - Collect sFlow data
      ‒ https://github.com/FStelzer/flowbeat
    • Udpbeat - Receive structured logs over UDP and index to Elasticsearch
      ‒ https://github.com/gravitational/udpbeat
    • Batterybeat - Monitor your Mac's battery performance
      ‒ https://github.com/colings86/batterybeat
    • Twitterbeat - Polls tweets and indexes them into Elasticsearch
      ‒ https://github.com/buehler/go-elastic-twitterbeat

    And more here: https://www.elastic.co/guide/en/beats/libbeat/current/community-beats.html
  39. Never miss a beat

    Got an idea? Write your own beat easily with the beats generator!

    • Provides a standardised code framework for a Beat
    • Code structure, unit, integration and system tests (hooks into TravisCI)
    • Allows your beat to be built for all platforms on which Beats are supported

    https://github.com/elastic/beat-generator
  40. 1. Ability to customize colors, text, numbers, labels, layouts, skins, and visualizations.

    2. All-new visualization tools for Graph and Time Series data.

    3. Strong integration with Security, Monitoring, and the rest of the Elastic Stack.
  41. Where did Found go?

    • Found has been rebranded as Elastic Cloud.
    • Original team members from Found, and growing the team with a range of talented engineers.
    • Why Elastic Cloud is the best choice for your hosted Elasticsearch needs: https://www.elastic.co/blog/why-elastic-cloud-is-the-best-choice-for-your-hosted-elasticsearch-needs
  42. Elastic Cloud: never heard of it, what is it?

    • Goal of providing the best hosted Elastic Stack experience.
    • Currently provides hosted Elasticsearch and Kibana.
    • Utilises Shield for security and Marvel for monitoring.
    • Deploy one to many Elasticsearch clusters under a single account.
    • Latest releases of the Elastic Stack on the day of release.
  43. Elastic Cloud Enterprise: deploy the cloud in-house

    • Ability to provision clusters within an organisation's current cloud infrastructure.
    • Currently in closed beta.
    • Will initially support AWS and aim for a wide variety of cloud providers, including owned hardware.
  44. Elasticsearch-Hadoop: Overview

    • The library provides a connector between Hadoop and Elasticsearch.
    • Integration with many Hadoop libraries (Hive, Spark, Storm, etc).
    • Utilise HDFS.
  45. Elasticsearch-Hadoop: Hive

    • Allows you to get a 'view' into your Elasticsearch index from within the Hive client.
    • Elasticsearch connector handles the JSON parsing, connectivity, etc to Elasticsearch.
    • Handles all versions of Hive.
  46. Elasticsearch-Hadoop: Spark

    • Integration with the Java / Scala Spark client.
    • Supports all versions of Spark.
    • Allows Elasticsearch to act as a Spark collection.
    • Integrates with Spark SQL.
  47. Elasticsearch-Hadoop: Storage

    • HDFS as a repository for Elasticsearch snapshots.
    • Has hsync support to confirm data is written to disk.
  48. "Packs are extensions that apply to the whole stack …"

    Shay Banon, Founder and Chief Technical Officer, Elastic
  49. • Created and supported by Elastic.

    • Single package with unified features and versions.
    • Simple download and install.
    • Security, Monitoring, Alerting, and more.