To infinity and beyond

To infinity and beyond! A practical guide for Mooseherds (and
other carers of livestock) @clintongormley #elasticsearch OSCON 2013

I have an idea for a killer app!

Quick! Let's...

Design our objects

Flatten them into tables

Normalize data

Add indexes

Add tables for many-to-one

More indexes

Need full text search?

Copy data to search engine

Keep the two in sync

Get search results, pull objects from DB

Success! Success!

Need to scale

Buy a bigger box

Tune indexes

Add caching

Fix caching bugs

Master - Slave replication

Buy SSDs

Denormalize data

Buy bigger boxes

Shard your data (i.e. rewrite your application)

Do you really need a relational DB?

Do you really need a relational DB? faster horse?

NoSQL advantages

Document oriented

...just store your object

Fast reads and writes

Scale horizontally

Recover from failure

But...

Different from RDBMS

No transactions

No joins

Denormalized data

Still need to add: indexes

Still need to add: full text search

elasticsearch

Real time document store

Powerful full text search

Filters, geolocation...

Distributed by design

Fault tolerant

Easy sharding

Start small, scale massively

Why keep two datastores in sync?

Just use elasticsearch

with Elastic::Model

Store and query Moose objects

Exposes full power of elasticsearch

and takes care of the housekeeping

Plain Moose classes:

    package MyApp::Post;
    use Moose;
    has 'title'   => ( is => 'rw', isa => 'Str' );
    has 'content' => ( is => 'rw', isa => 'Str' );
    has 'created' => ( is => 'rw', isa => 'DateTime', default => sub { DateTime->now } );
    has 'user'    => ( is => 'ro', isa => 'MyApp::User' );

    package MyApp::User;
    use Moose;
    has 'name'  => ( is => 'rw', isa => 'Str' );
    has 'email' => ( is => 'rw', isa => 'Str', required => 1 );

The same classes, with Elastic::Doc instead of Moose:

    package MyApp::Post;
    use Elastic::Doc;
    has 'title'   => ( is => 'rw', isa => 'Str' );
    has 'content' => ( is => 'rw', isa => 'Str' );
    has 'created' => ( is => 'rw', isa => 'DateTime', default => sub { DateTime->now } );
    has 'user'    => ( is => 'ro', isa => 'MyApp::User' );

    package MyApp::User;
    use Elastic::Doc;
    has 'name'  => ( is => 'rw', isa => 'Str' );
    has 'email' => ( is => 'rw', isa => 'Str', required => 1 );

Some definitions...

elasticsearch
* index: like a database
* type: like a table
* doc: like a row in a table
* alias: like a symbolic link, points to one or more indices

Elastic::Model
* domain: an index or an alias, used for CRUD
* namespace: maps type <=> class for all associated domains
* model: connects your app to elasticsearch

We need a Model

    package MyApp;
    use Elastic::Model;
    has_namespace 'myapp' => {
        user => 'MyApp::User',    # like table <=> class
        post => 'MyApp::Post',
    };

Using our Model

To do anything useful, we need:

    use MyApp;
    my $model     = MyApp->new;
    my $namespace = $model->namespace('myapp');    # For index and alias management
    my $domain    = $model->domain('myapp');       # For document CRUD
    my $view      = $model->view;                  # For searching

Namespace: Create an index
    my $namespace = $model->namespace('myapp');
    $namespace->index->create;
* create index 'myapp'
* namespace:myapp => index:myapp

Namespace: Delete an index
    my $namespace = $model->namespace('myapp');
    $namespace->index->delete;

Namespace: Create an alias
    my $namespace = $model->namespace('myapp');
    $namespace->index('myapp_v1')->create;
    $namespace->alias->to('myapp_v1');
* alias:myapp => index:myapp_v1
* namespace:myapp => alias:myapp => index:myapp_v1

Domain: Create a user
    my $domain = $model->domain('myapp');
    my $user   = $domain->new_doc(
        user => { name => 'Clinton', email => '[email protected]' }
    );
    $user->save;

Or create and save in one step with create():
    my $user = $domain->create(
        user => { name => 'Clinton', email => '[email protected]', id => 1 }
    );
    say $user->id;      # 1
    say $user->type;    # user

Domain: Create a post
    my $domain = $model->domain('myapp');
    my $post   = $domain->create(
        post => {
            id      => 2,
            title   => 'To infinity and beyond',
            content => 'Elastic::Model persists Moose '
                     . 'objects in elasticsearch',
            user    => $user,
        }
    );

Domain: Retrieve a doc
    my $domain = $model->domain('myapp');
    my $post   = $domain->get( post => 2 );
    my $user   = $post->user;    # stub object
    say $user->name;             # full object   # Clinton
    say $user->id;               # still stub    # 1

Domain: Update a doc
    my $domain = $model->domain('myapp');
    $post->title('Awesome blog post');
    say $post->has_changed;             # 1
    say $post->has_changed('title');    # 1
    say $post->old_value('title');      # To infinity and beyond
    $post->save;

optimistic version control

$version++ on every change

    1: $post = $domain->get(post=>2);
    2: $post = $domain->get(post=>2);
    1: $post->title('Awesome blog post');
    2: $post->title('Brilliant blog post');
    1: $post->save;
    2: $post->save;    *** CONFLICT ERROR ***
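The conflict above can be reproduced without elasticsearch. What follows is only a minimal sketch in core Perl: the in-memory `%store` and the `get_doc` / `save_doc` helpers are hypothetical stand-ins, not part of Elastic::Model. It shows the idea of rejecting a save when the stored version has moved on since the doc was read.

```perl
use strict;
use warnings;

# Hypothetical in-memory store: id => { version => N, doc => {...} }
my %store = ( 2 => { version => 1, doc => { title => 'To infinity and beyond' } } );

# Fetch a private copy of the doc, remembering the version it was read at.
sub get_doc {
    my ($id) = @_;
    my $rec = $store{$id} or die "No doc with id $id\n";
    return { id => $id, version => $rec->{version}, doc => { %{ $rec->{doc} } } };
}

# Save only if nobody else has saved since we read; otherwise die with a conflict.
sub save_doc {
    my ($copy) = @_;
    my $rec = $store{ $copy->{id} };
    die "CONFLICT: expected version $copy->{version}, found $rec->{version}\n"
        if $rec->{version} != $copy->{version};
    $rec->{doc}     = { %{ $copy->{doc} } };
    $rec->{version} = ++$copy->{version};    # $version++ on every change
    return $copy->{version};
}

my $one = get_doc(2);
my $two = get_doc(2);
$one->{doc}{title} = 'Awesome blog post';
$two->{doc}{title} = 'Brilliant blog post';

save_doc($one);                          # first writer wins
my $ok = eval { save_doc($two); 1 };     # second writer read a stale version
print $ok ? "saved\n" : "second save rejected: $@";
```

The second writer has to re-read the doc and decide what to do with its change, which is what the conflict handling on the following slides is about.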

Dealing with conflicts

Ignore them
    $post->overwrite;

on_conflict handler

    $post->save(
        on_conflict => sub {
            my ( $old, $new ) = @_;
            # do something
            # to resolve conflict
        }
    );

    $post->save(
        on_conflict => sub {
            my ( $old, $new ) = @_;
            my %changed = $old->old_values;
            $new->$_( $changed{$_} ) for keys %changed;
            $new->save;
            $post = $new;
        }
    );

Query docs: View
    $results = $model->view->search;

Views are reusable
    $posts    = $model->view( type => 'post' );
    $featured = $posts->filterb( featured => 1 );

Single domain
    $view = $domain->view;

Multi domain
    $view = $model->view;
    $view = $model->view->domain('foo','bar');

Multi type
    $view = $model->view;
    $view = $model->view->type('user','post');

    my $view = $domain->view
        ->type('post')
        ->filterb( created => { gte => '2013-07-01' }, user => $user )
        ->queryb( title => 'awesome' )
        ->sort('timestamp')
        ->size(20)
        ->highlight('content')
        ->explain(1);
See "Terms of Endearment" on speakerdeck.com

First result
    $results = $view->first;

$size results
    $results = $view->search;

Unbounded results
    $results = $view->scroll;
    $results = $view->scan;

Results are iterators
    $result = $results->next;
    $result = $results->prev;
    $result = $results->first;
    $result = $results->last;
    $result = $results->shift;

Result is: metadata + object
    say $result->object->title;

    my $results = $view->search;
    say "Total hits: " . $results->total;
    say "Took: " . $results->took . "ms";
    while ( my $result = $results->next ) {
        say "Title:" . $result->object->title;
        say "Snippets:" . join "\n", $result->highlight('content');
        say "Score:" . $result->score;
        say "Debug:" . $result->explain;
    }

Just the object
    $object = $results->next_object;

Just objects
    $results->as_objects;
    $object = $results->next;

Enough dull API!

Not just a doc store

*** POWERFUL *** search engine

BUT...

You can only get out what you put in

Prepare your data

Tell elasticsearch:
* what fields you have
* what data they contain
* how to index them

"Mapping" (like a database schema)

Moose gives us introspection (takes the pain away)

Examples: analyzed full text
    has 'title' => ( is => 'rw', isa => 'Str' );
    title: { type: "string" }

Examples: analyze and stem text
    has 'title' => ( is => 'rw', isa => 'Str', analyzer => 'english' );
    title: { type: "string", analyzer: "english" }

Examples: analyze and stem text
    has 'title' => ( is => 'rw', isa => 'Str', analyzer => 'norwegian' );
    title: { type: "string", analyzer: "norwegian" }

Examples: store the exact value
    has 'tag' => ( is => 'rw', isa => 'Str', index => 'not_analyzed' );
    tag: { type: "string", index: "not_analyzed" }

Examples: complex data
    use MooseX::Types::Moose qw(Str);
    use MooseX::Types::Structured qw(Dict Optional);
    has 'name' => (
        is  => 'rw',
        isa => Dict[ first => Str, last => Str, middle => Optional[Str] ],
    );
    name: {
        type: "object",
        properties: {
            first:  { type: 'string' },
            last:   { type: 'string' },
            middle: { type: 'string' }
        }
    }

Examples: Elastic::Doc classes
    has 'user' => ( is => 'rw', isa => 'MyApp::User' );
    user: {
        type: "object",
        properties: {
            name:  { type: 'string' },
            email: { type: 'string' },
            uid:   { type: "object", properties: { index: {...}, type: {...}, id: {...}, routing: {...} } }
        }
    }
Denormalised data!

Choose which attributes of the nested doc are indexed:
    has 'user' => ( is => 'rw', isa => 'MyApp::User', exclude_attrs => ['email'] );    # name + uid
    has 'user' => ( is => 'rw', isa => 'MyApp::User', include_attrs => ['email'] );    # email + uid
    has 'user' => ( is => 'rw', isa => 'MyApp::User', include_attrs => [] );           # uid only

Same data. Different purpose
    has 'title' => ( is => 'rw', isa => 'Str' );
    title: { type: "string" }
    title => 'An AMAZING talk!'
    title: ['amazing','talk']
What do you sort on? 'amazing' or 'talk'?

Multi-fields index the same data in different ways
    has 'title' => (
        is    => 'rw',
        isa   => 'Str',
        multi => { untouched => { index => 'not_analyzed' } }
    );
    title => 'An AMAZING talk!'
    title: { title: ['amazing','talk'], untouched: "An AMAZING talk!" }

Let's TWEAK stuff!

How about AUTO-COMPLETE?

Don't use wildcards: slow & inefficient

Prepare your data: "Analysis"

With edge-ngrams

Analysis process
    "Édith Piaf"
    -> standard tokenizer          -> ["Édith", "Piaf"]
    -> lowercase token filter      -> ["édith", "piaf"]
    -> ascii-folding token filter  -> ["edith", "piaf"]
    -> edge-ngrams token filter    -> ["e", "ed", "edi", "edit", "edith", "p", "pi", "pia", "piaf"]
Perfect for partial matching!
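The analysis chain above can be imitated in a few lines of core Perl. This is a rough sketch, not how elasticsearch implements it: `tokenize`, `lowercase`, `ascii_fold` and `edge_ngrams` are hypothetical helper names, the tokenizer is a simple split on non-alphanumerics, and folding is approximated by stripping combining marks after Unicode decomposition.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

# Rough stand-in for the standard tokenizer: split on non-letters/digits.
sub tokenize { return grep { length } split /[^\p{L}\p{N}]+/, $_[0] }

# Lowercase with Unicode rules, whatever the string's internal encoding.
sub lowercase {
    my ($s) = @_;
    utf8::upgrade($s);
    return lc $s;
}

# Approximate ASCII folding: decompose accents, then drop the combining marks.
sub ascii_fold {
    my $s = NFD( $_[0] );
    $s =~ s/\p{Mn}//g;
    return $s;
}

# Emit every prefix of the token between $min and $max characters long.
sub edge_ngrams {
    my ( $token, $min, $max ) = @_;
    my $last = length($token) < $max ? length($token) : $max;
    return map { substr $token, 0, $_ } $min .. $last;
}

# tokenizer -> lowercase -> ascii folding -> edge n-grams
sub analyze {
    my ($text) = @_;
    return map { edge_ngrams( ascii_fold( lowercase($_) ), 1, 15 ) } tokenize($text);
}

print join( ' ', analyze('Édith Piaf') ), "\n";
```

Because every prefix is stored at index time, a partial word typed by the user can be matched as an ordinary term lookup instead of a wildcard scan.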

Add a custom analyzer to our Model
    package MyApp;
    use Elastic::Model;
    has_namespace 'myapp' => {
        user => 'MyApp::User',
        post => 'MyApp::Post',
    };
    has_filter 'my_edge_ngrams' => {
        type     => 'edge_ngram',
        min_gram => 1,
        max_gram => 15
    };
    has_analyzer 'autocomplete' => {
        tokenizer => 'standard',
        filter    => [ 'lowercase', 'asciifolding', 'my_edge_ngrams' ]
    };

Add analyzer to our Doc class
    has 'title' => (
        is    => 'rw',
        isa   => 'Str',
        multi => {
            untouched    => { index    => 'not_analyzed' },
            autocomplete => { analyzer => 'autocomplete' }
        }
    );
    title => 'An AMAZING talk!'
    title: {
        title: ['amazing','talk'],
        untouched: "An AMAZING talk!",
        autocomplete: [ 'a', 'am', 'ama', 'amaz', 'amazi', 'amazin', 'amazing', 't', 'ta', 'tal', 'talk' ]
    }

Apply your changes

Update the mapping AND the data

Reindex

    $new = $namespace->index('myapp_v2');
    $new->reindex('myapp');
    $namespace->alias->to('myapp_v2');
    $namespace->index('myapp_v1')->delete;

Autocomplete query

    $view = $domain->view->queryb(
        "title.autocomplete" => "amazing ta",
    );
Matches anything starting with 'a' or 't'. BOOH!

    $view = $domain->view->queryb(
        "title.autocomplete" => {
            -match => { query => "amazing ta", operator => "or" }
        }
    );
"a OR am OR ama OR amaz OR ... OR t OR ta"

    $view = $domain->view->queryb(
        "title.autocomplete" => {
            -match => { query => "amazing ta", operator => "and" }
        }
    );

Complete words should be more relevant
    $view = $domain->view->queryb(
        "title.autocomplete" => {
            -match => { query => "amazing ta", operator => "and" }
        },
        "title" => "amazing ta",
    );

    $view = $domain->view->queryb([
        "title.autocomplete" => {
            -match => { query => "amazing ta", operator => "and" }
        },
        "title" => "amazing ta",
    ]);

Scaling

To infinity and beyond!

Basic unit of scale: the shard

An index has 1-or-more primary shards

Each primary has 0-or-more replica shards

Primaries scale total data

Replicas are for failover and to scale queries

Default: 5 primary shards with 1 replica each

5 * (1 + 1) = 10 shards

10 shards = 1 .. 10 servers

Can change number of replicas

CANNOT change number of primaries

So how do we scale?

Kagillion shards!

Umm, No.

Be a grower not a shower

At query time:

1 index x 10 shards == 10 indices x 1 shard

Two patterns:
* Time based indices
* Index-per-user

Time based indices
* one index per month
* write to alias: logs_current
* query alias: logs

    $ns = $model->namespace('logs');
    $ns->index('logs_2013_08')->create;
    $ns->alias('logs_current')->to('logs_2013_08');
    $ns->alias->to('logs_2013_08');
    $model->domain('logs_current')->create( log => \%data );
    $model->domain('logs')->view->search;

New month, new index
    $ns->index('logs_2013_09')->create;
    $ns->alias('logs_current')->to('logs_2013_09');
    $ns->alias->add('logs_2013_09');

Add alias for 2013
    $ns->alias('logs_2013')->to( 'logs_2013_08', 'logs_2013_09', ... );
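The monthly rollover is mostly bookkeeping about which alias points where. As a minimal sketch in core Perl, with no elasticsearch involved: the `%alias` hash and the `index_name` / `roll_over` helpers are hypothetical, and only model the alias layout described above (one write alias on the newest index, read aliases covering every month).

```perl
use strict;
use warnings;

my %alias;    # alias name => list of index names it points to

# Build the monthly index name, e.g. logs_2013_08.
sub index_name {
    my ( $year, $month ) = @_;
    return sprintf 'logs_%04d_%02d', $year, $month;
}

# Start a new month: writes move to the new index, reads still see every month.
sub roll_over {
    my ( $year, $month ) = @_;
    my $index = index_name( $year, $month );
    $alias{logs_current} = [$index];           # write alias: newest index only
    push @{ $alias{logs} },         $index;    # query alias: all indices
    push @{ $alias{"logs_$year"} }, $index;    # per-year alias
    return $index;
}

roll_over( 2013, 8 );
roll_over( 2013, 9 );
print "$_ => @{ $alias{$_} }\n" for sort keys %alias;
```

The application only ever talks to `logs_current` and `logs`, so adding a month never requires a code change.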

Index-per-user

Users have their own data

Most searches are per-user

Ideal: Index-per-user

Expensive

Most users have little data

Some have LOTS!

Start with one index for all users

Use aliases to pretend

...aliases with... filters and routing

    $ns->alias('bloggs_plumbers')->to(
        myapp_v1 => {
            filterb => { client_id => 'bloggs_plumbers' },
            routing => 'bloggs_plumbers'
        }
    );

Routing determines: which shard stores your data

Routing == bloggs_plumbers
All user's data on same shard

CRUD -> hit one shard
Queries -> hit one shard
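Routing boils down to hashing the routing value and taking it modulo the number of primary shards. The sketch below illustrates that in core Perl; `toy_hash` is a stand-in chosen for the example (elasticsearch uses its own hash function internally) and `shard_for` is a hypothetical helper.

```perl
use strict;
use warnings;

# Toy string hash, an assumption for illustration only.
sub toy_hash {
    my ($str) = @_;
    my $h = 5381;
    $h = ( $h * 33 + ord($_) ) % 4294967296 for split //, $str;
    return $h;
}

# The same routing value always lands on the same primary shard.
sub shard_for {
    my ( $routing, $num_primaries ) = @_;
    return toy_hash($routing) % $num_primaries;
}

# Every doc and every query routed as 'bloggs_plumbers' hits one shard.
# The answer depends on the number of primaries.
printf "bloggs_plumbers -> shard %d of %d\n", shard_for( 'bloggs_plumbers', $_ ), $_
    for 5, 6;
```

Since the shard depends on the number of primaries, changing that number would send existing routing values to different shards, which is why the number of primaries cannot be changed after index creation.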

SUPER efficient!

New client joins...

...called "Twitter"

6 months later...

    $new = $ns->index('twitter_v1');
    $new->reindex('twitter');
    $ns->alias('twitter')->to('twitter_v1');
    $ns->alias->add('twitter_v1');

What more do you need?

Go forth and HERD!
