Sydney Elastic Meetup - Elastic{ON} 2016 Recap

Catching Up With the Elastic Stack

The Elastic Stack

Ingest Node: Ingest all the things

Ingest Pipeline: like Logstash, but just the filters (e.g. grok, remove), in node form, without the inputs and outputs

Example Pipeline: extract MySQL fields from the `message` field and then remove it from the document
{
  "description": "mysql pipeline",
  "processors": [
    { "grok": { "field": "message", "pattern": "..." } },
    { "remove": { "field": "message" } }
  ]
}

Indexing [An Enriched] Document: specifying an ingest pipeline to execute before indexing
PUT localhost:9200/logs/mysql/1?pipeline=my_mysql_pipeline
{
  "message": "070917 16:29:01 21 Query select * from location"
}

Indexing [An Enriched] Document: the document as it is actually indexed, once the pipeline has run
PUT localhost:9200/logs/mysql/1
{
  "timestamp": "2007-09-17T16:29:01-08:00",
  "thread_id": "21",
  "command_type": "Query",
  "command": "select * from location"
}
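The two slides above can be mimicked outside Elasticsearch with a small sketch. This is not the ingest node's code: the regular expression is a hypothetical stand-in for the grok pattern the slide elides, the function names are made up for illustration, and the date conversion step (with its fixed -08:00 offset) is an assumption added only so the result matches the enriched document shown on the slide.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical stand-in for the grok pattern that the slide elides ("...").
MYSQL_LINE = re.compile(
    r"^(?P<timestamp>\d{6}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<thread_id>\d+)\s+(?P<command_type>\w+)\s+(?P<command>.*)$"
)


def grok(doc, field, pattern):
    """Extract named groups from doc[field] and add them as new fields."""
    match = pattern.match(doc[field])
    if match is None:
        raise ValueError(f"field {field!r} does not match the pattern")
    return {**doc, **match.groupdict()}


def remove(doc, field):
    """Return a copy of the document without the given field."""
    return {key: value for key, value in doc.items() if key != field}


def to_iso(doc, field, utc_offset_hours):
    """Assumed extra step: turn 'yymmdd HH:MM:SS' into an ISO 8601 timestamp."""
    parsed = datetime.strptime(doc[field], "%y%m%d %H:%M:%S")
    zone = timezone(timedelta(hours=utc_offset_hours))
    return {**doc, field: parsed.replace(tzinfo=zone).isoformat()}


def run_pipeline(doc):
    """Apply the processors in order, as a pipeline definition would."""
    doc = grok(doc, "message", MYSQL_LINE)
    doc = remove(doc, "message")
    return to_iso(doc, "timestamp", -8)


enriched = run_pipeline(
    {"message": "070917 16:29:01 21 Query select * from location"}
)
print(enriched)
```

Each processor takes a document and returns a new one, so a pipeline is just an ordered list of such steps.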

Pipeline Management: Create, Retrieve, Update, Delete
PUT _ingest/pipeline/pipeline-name
GET _ingest/pipeline/pipeline-name
GET _ingest/pipeline/*
DELETE _ingest/pipeline/pipeline-name

The Processors: Set, Grok, Attachment, Rename, Geoip, Convert, Date

Ingest Node summary
• Pure Java implementation of the grok, mutate, kv and date Logstash filters.
• Runs on a new dedicated node type (master: false, data: false).
• Intercepts bulk/index requests, performs enrichment, sends to data nodes.
• Multiple pipelines will be supported.
• Pipeline configuration is stored as a document in an index.
• Eliminates the need for a server running Logstash for simple filtering: *beat -> (Ingest Node -> Elasticsearch)
https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html

BM25: New default relevancy scoring

What is BM25? Probabilistic ranking approach to scoring

Or as Britta put it...

Current model - TF/IDF: a one size fits most approach
• TF - Term Frequency: number of times the term occurs in the document
• IDF - Inverse Document Frequency: inverse of how often the term occurs across all documents

IDF Comparison (plot: BM25 vs. TF/IDF)

TF Saturation Curve (plot: TF/IDF keeps growing! BM25 shown for k = 1 and k = 10)
• Limit influence of term frequency
• Tune influence by tweaking k

Influence of Document Length (plot: TF/IDF vs. BM25 with b = 0.75 and b = 0.1)
• Tune influence of document length by tweaking b
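The saturation and length-normalisation behaviour on these two slides can be sketched with the textbook BM25 term-frequency factor. The slides call the saturation parameter k; Elasticsearch names it k1. The defaults used here (k1 = 1.2, b = 0.75) and the square-root form of the classic TF factor are the commonly documented ones; treat the sketch as an illustration, not as Lucene's exact scoring code, which also multiplies in IDF and other factors.

```python
import math


def classic_tf(freq):
    """Classic TF/IDF term-frequency factor: sqrt(freq), which keeps growing."""
    return math.sqrt(freq)


def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency factor.

    k1 caps how much repeated terms can add (the factor approaches k1 + 1).
    b controls how strongly long documents are penalised (0 disables it).
    """
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return freq * (k1 + 1.0) / (freq + k1 * length_norm)


if __name__ == "__main__":
    # Compare growth as the term is repeated in an average-length document.
    for freq in (1, 2, 5, 10, 100):
        print(freq, round(classic_tf(freq), 3), round(bm25_tf(freq, 100, 100), 3))
```

With b = 0 the document length drops out entirely; with b = 1 the term frequency is fully scaled by the document's length relative to the average.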

A short history of BM25 (timeline, 1970 to 2016)
• 1975: Poisson distribution for terms
• 1976: Robertson/Sparck Jones weight
• 1977: Probability ranking principle
• 1993: TREC-2, leap of faith
• 1994: TREC-3, BM25 final!
• 1999: First Lucene release (TF/IDF)
• 2011: Pluggable similarities + BM25 in Lucene (GSoC, David Nemeskey)
• 2016: BM25 becomes default! Elasticsearch 5.0, Lucene 6.0 (we are here)

BM25 summary: Is it better?
• Literature suggests so
• Challenges suggest so
• Users say so
• Lucene developers say so
• Konrad Beiske says so - “BM25 vs Lucene Default Similarity” https://www.elastic.co/blog/found-bm-vs-lucene-default-similarity
• BUT: it depends on the features of your corpus

Geospatial Data Structure Improvements: Where are we now?

Geo capabilities ...they’re becoming more popular

Combined with free text search Piano tuners in New York

Monitoring Network traffic Identifying “bad actors”

geo_point Indexing Improvements (chart: 2.1, 2.2 and 2.3 compared on throughput, index size, time and heap, as a percentage from 0% to 100%)
Postings-based GeoPointField introduced in Lucene 5.4

geo_point Search Improvements (chart: 2.1, 2.2 and 2.3 compared on BoundingBox, Distance, DistanceRange and Polygon queries, as a percentage from 0% to 100%)
Postings-based GeoPointField introduced in Lucene 5.4

geo_shape Search
• Supports the following geo_shapes
◦ Point, MultiPoint
◦ LineString, MultiLineString
◦ Polygon (with holes), MultiPolygon (with holes)
• Shapes are parsed using OGC and ISO standards definitions
• Supports relational queries: Intersects, Disjoint, Within, Contains

geo_point vs. geo_shape?
• Some queries work on geo_point
• Some queries work on geo_shape
• New experimental geo_field type coming
◦ Can be used to represent both

geo_distance Aggregation
{
  "aggs": {
    "sf_rings": {
      "geo_distance": {
        "field": "location",
        "origin": [ 32.95, -96.82 ],
        "ranges": [
          { "to": 50 },
          { "from": 50, "to": 100 },
          { "from": 100, "to": 300 }
        ]
      }
    }
  }
}

geo_distance Aggregation
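What the geo_distance aggregation computes can be approximated in a few lines: measure each point's distance from the origin and count it into the matching ring. The sketch below assumes a spherical Earth (haversine formula), distances in metres, (lat, lon) tuples, and ranges that include `from` and exclude `to`; the function names are illustrative and this is not how Elasticsearch implements it internally.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; a spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def ring_counts(origin, points, ranges):
    """Count points per (from, to) ring; None means the bound is open."""
    counts = [0] * len(ranges)
    for lat, lon in points:
        distance = haversine_m(origin[0], origin[1], lat, lon)
        for i, (low, high) in enumerate(ranges):
            above = low is None or distance >= low
            below = high is None or distance < high
            if above and below:
                counts[i] += 1
    return counts


# Same three rings as on the slide: up to 50, 50 to 100, 100 to 300.
RINGS = [(None, 50), (50, 100), (100, 300)]
origin = (32.95, -96.82)
points = [
    (32.95, -96.82),    # at the origin
    (32.9506, -96.82),  # roughly 67 m north
    (32.951, -96.82),   # roughly 111 m north
    (33.95, -96.82),    # roughly 111 km north, outside every ring
]
print(ring_counts(origin, points, RINGS))
```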

geohash_grid Aggregation
{
  "aggs": {
    "crime_cells": {
      "geohash_grid": {
        "field": "location",
        "precision": 8
      }
    }
  }
}

geohash_grid Aggregation
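A geohash grid groups points by the geohash cell they fall into, with `precision` being the number of geohash characters. As a rough sketch of the idea (standard geohash encoding plus a counter, with illustrative function names, not Elasticsearch's implementation):

```python
from collections import Counter

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet


def geohash(lat, lon, precision):
    """Encode a point as a geohash string of `precision` characters."""
    lat_low, lat_high = -90.0, 90.0
    lon_low, lon_high = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_low + lon_high) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_low = mid
            else:
                bits <<= 1
                lon_high = mid
        else:
            mid = (lat_low + lat_high) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_low = mid
            else:
                bits <<= 1
                lat_high = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def grid_counts(points, precision):
    """Count (lat, lon) points per geohash cell, like one bucket per cell."""
    return Counter(geohash(lat, lon, precision) for lat, lon in points)


cells = grid_counts(
    [(57.64911, 10.40744), (57.64911, 10.40744), (-33.8688, 151.2093)],
    precision=5,
)
print(cells)
```

Lower precision means larger cells and fewer buckets; each extra character subdivides a cell into 32 smaller ones.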

geo_centroid Aggregation
{
  "query": {
    "match": { "crime": "burglary" }
  },
  "aggs": {
    "towns": {
      "terms": { "field": "town" },
      "aggs": {
        "centroid": {
          "geo_centroid": { "field": "location" }
        }
      }
    }
  }
}

geo_centroid Aggregation
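The request above nests a geo_centroid inside a terms aggregation, giving one centroid per town. A minimal sketch of that shape, assuming the centroid is the plain arithmetic mean of latitudes and longitudes (which ignores wrap-around at the antimeridian) and using made-up sample data and function names:

```python
from collections import defaultdict


def centroids_by_term(docs, term_field, location_field):
    """Group documents by a term and average their (lat, lon) locations."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc[term_field]].append(doc[location_field])
    result = {}
    for term, locations in grouped.items():
        mean_lat = sum(lat for lat, _ in locations) / len(locations)
        mean_lon = sum(lon for _, lon in locations) / len(locations)
        result[term] = (mean_lat, mean_lon)
    return result


# Illustrative documents only; not data from the talk.
burglaries = [
    {"town": "A", "location": (10.0, 20.0)},
    {"town": "A", "location": (20.0, 40.0)},
    {"town": "B", "location": (-5.0, 100.0)},
]
print(centroids_by_term(burglaries, "town", "location"))
```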

Geospatial Data Structure Improvements summary
• They’re heaps FASTER!
https://www.elastic.co/blog/supercharging-geopoint

Reindex API: Moving data more easily

Reindex: copy documents from one index to another
POST /_reindex?refresh
{
  "source": { "index": "old-index" },
  "dest": { "index": "new-index" }
}

Reindex: copy documents matching a query and modify them whilst reindexing
POST /_reindex?refresh
{
  "source": {
    "index": "old-index",
    "query": { "match": { "text": "foo" } }
  },
  "dest": { "index": "new-index" },
  "script": { "inline": "ctx._id = ctx._id + \"a\"" }
}

Update by Query: update documents in an existing index
POST /test/_update_by_query?refresh
{
  "query": { "match": { "text": "foo" } },
  "script": { "inline": "ctx._source.likes++" }
}

Reindex API summary
• Uses the scroll API under the covers
• Takes a snapshot of documents when the reindex starts
• Touch documents to pick up mapping updates on the fly
• Change the mappings as part of the Reindex API
• Backported to 2.3
https://github.com/elastic/elasticsearch/issues/15201
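The flow described in the summary (snapshot, scroll in batches, optional query and script, write to the destination) can be modelled with plain dictionaries. This is a toy in-memory sketch under stated assumptions, not the Elasticsearch implementation: indices are dicts keyed by document ID, the "query" and "script" are ordinary Python callables, and all names are illustrative.

```python
def scroll(index, batch_size):
    """Yield batches from a point-in-time copy of the index contents."""
    snapshot = list(index.items())  # later writes to `index` are not seen
    for start in range(0, len(snapshot), batch_size):
        yield snapshot[start:start + batch_size]


def reindex(source, dest, query=None, script=None, batch_size=2):
    """Copy matching documents from source to dest, optionally modified."""
    copied = 0
    for batch in scroll(source, batch_size):
        for doc_id, doc in batch:
            if query is not None and not query(doc):
                continue
            # Mirrors the `ctx` object a reindex script gets to modify.
            ctx = {"_id": doc_id, "_source": dict(doc)}
            if script is not None:
                script(ctx)
            dest[ctx["_id"]] = ctx["_source"]
            copied += 1
    return copied


def append_a(ctx):
    """Same effect as the slide's script: ctx._id = ctx._id + "a"."""
    ctx["_id"] = ctx["_id"] + "a"


old_index = {
    "1": {"text": "foo bar"},
    "2": {"text": "baz"},
    "3": {"text": "foo"},
}
new_index = {}
copied = reindex(
    old_index,
    new_index,
    query=lambda doc: "foo" in doc["text"].split(),
    script=append_a,
)
print(copied, new_index)
```

Because the snapshot is taken up front, documents added to the source while the copy runs are not picked up, which matches the second bullet above.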

Graph API: In a world of interconnected data

What is Graph Technology Good for? Fraud Detection
• Given credit card purchase histories…
◦ Where did people with fraudulent purchases shop most often?
◦ Which vendor is responsible for stolen credit card numbers?
• Given car emissions data…
◦ Which car manufacturer fails emissions tests most often?
◦ At which shops

What is Graph Technology Good for? Identifying Relationships
• Given Wikipedia…
◦ What topics / entities / locations are meaningfully related?
• Given network traffic data…
◦ What external IPs do machines on my network talk to?

What is Graph Technology Good for? Recommendations
• Given my purchase history…
◦ What am I most likely to buy next?
• Given Last.FM music preferences…
◦ What music do people who like Mozart also like?
• Given search and click data…
◦ What results do people who searched for “mixer” tend to click on?

Theoretical Challenges: Zipf’s Law and the problem of super-connected entities

Theoretical Challenges: Graph exploration is typically done through “most frequent” connections
http://imgs.xkcd.com/comics/heatmap.png

We have the technology! Information Retrieval Techniques FTW
• When indexing we already count and calculate statistics
• Using these statistics, we can bring relevance to relationships
• Identify links / properties of entities / groups significantly different from global averages
• Aggregations enable efficient traversal and scale
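One way to picture "significantly different from global averages" is to compare how often a term appears among the documents matching a query (the foreground) with how often it appears everywhere (the background). The sketch below uses a JLH-style score, (foreground share minus background share) times (foreground share divided by background share); that choice is an assumption for illustration, since the deck does not say which heuristic the Graph API uses. The sample data and function name are made up.

```python
from collections import Counter


def significance_scores(foreground, background):
    """Score terms that are over-represented in the foreground documents.

    Each document is a set of terms. Terms that are just as common globally
    (super-connected entities) get no score at all.
    """
    fg_counts = Counter(term for doc in foreground for term in doc)
    bg_counts = Counter(term for doc in background for term in doc)
    scores = {}
    for term, fg_count in fg_counts.items():
        fg_share = fg_count / len(foreground)
        bg_share = bg_counts[term] / len(background)
        if bg_share == 0 or fg_share <= bg_share:
            continue  # not more common than the global average
        scores[term] = (fg_share - bg_share) * (fg_share / bg_share)
    return scores


# "pop" is attached to every document, so raw counts would rank it highly.
all_docs = [
    {"pop", "jack johnson", "ben harper"},
    {"pop", "jack johnson", "ben harper"},
    {"pop"}, {"pop"}, {"pop"}, {"pop"}, {"pop"}, {"pop"},
]
matching_docs = all_docs[:2]  # documents mentioning "jack johnson"
scores = significance_scores(matching_docs, all_docs)
print(scores)
```

Counting alone would rank "pop" level with "ben harper" in the matching documents; comparing against the background drops it, which is the point the slide makes about popular entities.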

Graph API: Finding artists related to Jack Johnson
GET /wikipedia/_graph/explore
{
  "query": {
    "query_string": { "query": "Jack Johnson" }
  },
  "vertices": [
    { "field": "artists.raw" }
  ],
  "connections": {
    "vertices": [
      { "field": "artists.raw" }
    ]
  }
}

Visual Exploration in Kibana: searching at Best Buy for a “mixer” - the query and the product clicked

Graph API summary
• Simple graph-walking API
• Leverages the full Elasticsearch query DSL - no SPARQL
• Relevance or count-based
• Distributed query execution
• Near-realtime data availability
• Easily explore and visualize with Kibana
https://www.elastic.co/webinars/sneak-peek-of-graph-capabilities-with-elasticsearch

Windows

Windows Azure Marketplace: Up and running with ARM template

Windows Azure Marketplace
• Create / Add to existing Resource Group
• Multiple Elasticsearch versions
• 3 Dedicated Master nodes - configurable VM sizes
• Configurable (Master Eligible) Data nodes
• Monitoring, Security and Kibana (BYOL)
• Internal/External Load Balancer
https://www.elastic.co/blog/microsoft-azure-marketplace-elasticsearch-kibana-and-more-now-available

GUI and scriptable Windows MSI Installer

And More!

And More!
• Query Profiler
• Task Manager
◦ Long running tasks, status, cancellation, throttling
• Java HTTP Client
• Safer scripting language
◦ Painless: dynamic / static support
• JarHell check and Java Security Manager
• Splitting string fields into text and keyword types
https://www.elastic.co/elasticon/conf/2016/sf/whats-evolving-in-elasticsearch

Logstash is levelling up!

Logstash Lines 1: Monitoring API (coming in v5)
• Overall throughput measurements, as well as per-filter throughput measurements (e.g., which grok filter is causing the slowdown).
• Retry/error counts.
• Internal buffer sizes (e.g., useful for multiline monitoring).
• Similar to the stats APIs in Elasticsearch (i.e., a RESTful interface for interrogation).
• Brace yourselves, a GUI is coming as well (probably not in v5).

Logstash Lines 2: Other Goodies
• Dynamic config reload (v2.3)
‒ --auto-reload and --reload-interval options
• Resiliency improvements (v5)
‒ Logstash is already damn good at not losing events (unless you, like, kill -9 it…)
‒ Persistent, disk-based queuing
‒ Output batching/ack’ing of events
• Clustering and all the goodies that come with it:
‒ Centralised configuration and management APIs
‒ Horizontally scale (i.e. spread out) event processing
‒ Will still support single instance, file-based configuration

Boom! Beats!

Some quick ones
• Filtering (v5):
‒ Choose a set of fields to be exported
‒ Remove a set of fields that are not of interest to the user
‒ e.g., drop all 200 OK responses in Packetbeat
• Broker outputs (v5):
‒ Redis output is back!
‒ Kafka coming

Metricbeat (v5): Modular framework for beats
• Single beat binary with built-in modules for different metrics
• Ability to write your own modules or use an existing beat as the basis for a module

Beats: The community is full of busy beavers... We ❤ our community
• Unifiedbeat - Index the alert records from network intrusion detection software into Elasticsearch
‒ https://github.com/cleesmith/unifiedbeat
• Nagioscheckbeat - Index Nagios checks into Elasticsearch
‒ https://github.com/PhaedrusTheGreek/nagioscheckbeat
• Factbeat - Ship Facter facts to Elasticsearch
‒ https://github.com/jarpy/factbeat
• Hsbeat - Index JVM stats/metrics to Elasticsearch
‒ https://github.com/YaSuenag/hsbeat
• Flowbeat - Collect sFlow data
‒ https://github.com/FStelzer/flowbeat
• Udpbeat - Receive structured logs over UDP and index them into Elasticsearch
‒ https://github.com/gravitational/udpbeat
• Batterybeat - Monitor your Mac’s battery performance
‒ https://github.com/colings86/batterybeat
• Twitterbeat - Poll tweets and index them into Elasticsearch
‒ https://github.com/buehler/go-elastic-twitterbeat
And more here: https://www.elastic.co/guide/en/beats/libbeat/current/community-beats.html

Never miss a beat: Got an idea? Write your own beat easily with the beats generator!
• Provides a standardised code framework for a Beat
• Code structure, unit, integration and system tests (hooks into TravisCI)
• Allows your beat to be built for all platforms on which Beats are supported
https://github.com/elastic/beat-generator

1. Ability to customize colors, text, numbers, labels, layouts, skins, and visualizations.
2. All-new visualization tools for Graph and Time Series data.
3. Strong integration with Security, Monitoring, and the rest of the Elastic Stack.

Greater space efficiency

Increased dashboard density

Applications as first class citizens

Elastic Cloud: Formerly known as Found

Where did Found go?
• Found has been rebranded as Elastic Cloud.
• Original team members from Found, with the team growing through a range of talented engineers.
• Why Elastic Cloud is the best choice for your hosted Elasticsearch needs (https://www.elastic.co/blog/why-elastic-cloud-is-the-best-choice-for-your-hosted-elasticsearch-needs).

Elastic Cloud Deploy: Never heard of it, what is it?
• Goal of providing the best hosted Elastic Stack experience.
• Currently provides hosted Elasticsearch and Kibana.
• Utilises Shield for security and Marvel for monitoring.
• Deploy one to many Elasticsearch clusters under a single account.
• Latest releases of the Elastic Stack on the day of release.

Elastic Cloud Enterprise: Deploy the cloud in-house
• Ability to provision clusters within an organisation's current cloud infrastructure.
• Currently in closed beta.
• Will initially support AWS and aim for a wide variety of cloud providers, including owned hardware.

Elasticsearch Hadoop Integration

Elasticsearch-Hadoop: Overview
• The library provides a connector between Hadoop and Elasticsearch.
• Integration with many Hadoop libraries (Hive, Spark, Storm, etc).
• Utilise HDFS.

Elasticsearch-Hadoop Ecosystem

Elasticsearch-Hadoop: Hive
• Allows you to get a ‘view’ into your Elasticsearch index from within the Hive client.
• The Elasticsearch connector handles the JSON parsing, connectivity, etc. to Elasticsearch.
• Handles all versions of Hive.

Elasticsearch-Hadoop: Spark
• Integration with the Java / Scala Spark client.
• Supports all versions of Spark.
• Allows Elasticsearch to act as a Spark collection.
• Integrates with Spark SQL.

Elasticsearch-Hadoop: Storage
• HDFS as a repository for Elasticsearch snapshots.
• Has Hsync support to confirm data is written to disk.

Elasticsearch-Hadoop: Roadmap
• Aggregation support from within the Elasticsearch-Hadoop connector
• Monitoring

“Packs are extensions that apply to the whole stack …”
Shay Banon, Founder and Chief Technical Officer, Elastic

• Created and supported by Elastic.
• Single package with unified features and versions.
• Simple download and install.
• Security, Monitoring, Alerting, and more.
