
Migrating from Hibernate Search to Elasticsearch

Experiences of migrating from Hibernate Search to Elasticsearch at Movideo

Christian Strzadala

May 19, 2015

Transcript

  1. How did we initially implement search at Movideo? We wanted an ‘out of the box’ solution, and Lucene is a proven search library. Movideo uses Hibernate ORM, and the Hibernate Search project made Lucene a simple add-on: there was no need to work with Lucene directly (confusing for a team with no prior search experience).
  2. What other benefits did Lucene provide Movideo? Hibernate Search still allowed the DB-backed model to be retrieved, but by ID rather than via a complex query. This shifted load from MySQL / SQL Server onto a local Lucene index and, in turn, provided a NoSQL-style solution.
  3. So why the move from Hibernate Search? It lasted about three years in production with no major issues. Problems only started once the index became too large to copy to each instance: at around 2 x ~7 GB of index, issues began to appear, along with limitations around consistency and performance. Index replication took, at its quickest, around 2-3 minutes.
  4. Why Elasticsearch? It removes the need for a full local Lucene index and provides a fully fledged search service, designed for cloud environments from the ground up. It represents documents as easily parseable JSON and can be queried over HTTP. No more ‘Luke’!
  5. Movideo Elasticsearch cluster. Initially 3 x A4 CentOS Linux instances; smaller instance sizes can suffer degraded performance from shared network IO. Moving to SSDs improved recovery time, which had suffered from the much slower shared IO.
  6. Azure, Elasticsearch and node discovery. Multicast is disabled within Azure, so we currently use unicast discovery, setting the internal Azure hostnames of the other nodes. An Azure Virtual Network is required for internal communication between applications. The Elasticsearch Azure plugin can also be used to discover other nodes (it uses the Azure AppFabric API for discovery). A sketch of the unicast settings follows.
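A minimal sketch of those unicast discovery settings, assuming the Elasticsearch 1.x Java API; in practice these keys normally live in each node's elasticsearch.yml, and the cluster name and hostnames below are placeholders.

```java
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

public class UnicastDiscoveryExample {
    public static void main(String[] args) {
        // Multicast is unavailable on Azure, so disable it and list the other
        // nodes' internal Azure hostnames explicitly (placeholders here).
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "movideo-search")                 // hypothetical cluster name
                .put("discovery.zen.ping.multicast.enabled", false)
                .putArray("discovery.zen.ping.unicast.hosts",
                        "es-node-1.internal.cloudapp.net",
                        "es-node-2.internal.cloudapp.net",
                        "es-node-3.internal.cloudapp.net")
                .build();

        // Embedded node purely for illustration; production nodes read the
        // same keys from elasticsearch.yml.
        Node node = NodeBuilder.nodeBuilder().settings(settings).node();
        node.close();
    }
}
```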
  7. Sharding strategy. One index for all types, with 3 shards and 2 replicas. This allows a copy of the whole index on each node (P = primary, R = replica); see the sketch below.
      Node 1: 0P  2R  1R
      Node 2: 1P  0R  2R
      Node 3: 2P  1R  0R
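A minimal sketch of creating such an index with the Elasticsearch 1.x Java client; the index name and the surrounding class are illustrative, not Movideo's actual code.

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class CreateIndexExample {
    // Creates a single index holding all types, with 3 primary shards and
    // 2 replicas, so every node can hold a full copy of the data.
    public static void createMediaIndex(Client client) {
        client.admin().indices()
                .prepareCreate("media")                    // hypothetical index name
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.number_of_shards", 3)
                        .put("index.number_of_replicas", 2)
                        .build())
                .execute()
                .actionGet();
    }
}
```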
  8. Releasing strategy. We wanted incremental releases, but because of the shared Lucene dependency there were version clashes when running the Hibernate Search and Elasticsearch implementations in a single deployment. So two separate builds of the web application were deployed within the same Tomcat instance, with Nginx redirecting requests to the appropriate deployment. This also allowed metrics analysis to see whether the change improved performance.
  9. Creating and re-creating the Elasticsearch index. We took the idea of a ‘Search Indexer’ app from the previous Hibernate Search indexer. The ‘Search Indexer’ reads batches of records from the database, serialises them into JSON and bulk-indexes them into Elasticsearch. An alias is then added to the current production index (sketched below).
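A rough sketch of that bulk-index-then-alias flow, again assuming the 1.x Java client; the index, type and alias names, and the shape of the JSON batches, are placeholders.

```java
import java.util.Map;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

public class SearchIndexerExample {
    // Bulk-indexes one batch of pre-serialised JSON documents keyed by ID.
    public static void indexBatch(Client client, String index, Map<String, String> jsonById) {
        BulkRequestBuilder bulk = client.prepareBulk();
        for (Map.Entry<String, String> entry : jsonById.entrySet()) {
            bulk.add(client.prepareIndex(index, "media", entry.getKey())  // "media" type is a placeholder
                    .setSource(entry.getValue()));
        }
        BulkResponse response = bulk.execute().actionGet();
        if (response.hasFailures()) {
            throw new IllegalStateException(response.buildFailureMessage());
        }
    }

    // Once the new index is fully built, point the production alias at it.
    public static void switchAlias(Client client, String newIndex) {
        client.admin().indices().prepareAliases()
                .addAlias(newIndex, "media-live")   // hypothetical alias name
                .execute()
                .actionGet();
    }
}
```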
  10. Show me the data! Initially an explicit mapping was used for full control over the data structures, with nested documents based on database relationships. The Movideo workflow: if a nested document doesn't change much as part of business logic, the full document is stored; if it does change as part of business logic, an ID reference is stored instead. A mapping sketch follows.
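A minimal sketch of registering such an explicit mapping with a nested field via the 1.x Java client; all field, index and type names here are invented for illustration.

```java
import org.elasticsearch.client.Client;

public class ExplicitMappingExample {
    // Registers an explicit mapping: "tags" is a nested document derived from a
    // database relationship, while "channelId" is just an ID reference to a
    // model that changes frequently. All names are illustrative.
    public static void putMediaMapping(Client client) {
        String mapping =
                "{\"media\": {" +
                "  \"properties\": {" +
                "    \"title\":     {\"type\": \"string\"}," +
                "    \"duration\":  {\"type\": \"long\"}," +
                "    \"channelId\": {\"type\": \"long\"}," +
                "    \"tags\": {" +
                "      \"type\": \"nested\"," +
                "      \"properties\": {" +
                "        \"predicate\": {\"type\": \"string\", \"index\": \"not_analyzed\"}," +
                "        \"value\":     {\"type\": \"string\"}" +
                "      }" +
                "    }" +
                "  }" +
                "}}";
        client.admin().indices().preparePutMapping("media")
                .setType("media")
                .setSource(mapping)
                .execute()
                .actionGet();
    }
}
```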
  11. Getting data into Elasticsearch at runtime. All Movideo applications use message queues to send index-update messages, and a ‘Search Processor’ application processes those messages and indexes the data into Elasticsearch. Using queues gives us durability around index updates (see the sketch below).
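A highly simplified sketch of that consumer loop, using a java.util.concurrent.BlockingQueue as a stand-in for the real message broker; the message shape and all names are assumptions.

```java
import java.util.concurrent.BlockingQueue;

import org.elasticsearch.client.Client;

public class SearchProcessorExample {

    // Stand-in for a real index-update message taken off the queue.
    public static class IndexUpdateMessage {
        public final String index;   // target index
        public final String type;    // document type
        public final String id;      // document ID
        public final String json;    // serialised document
        public IndexUpdateMessage(String index, String type, String id, String json) {
            this.index = index; this.type = type; this.id = id; this.json = json;
        }
    }

    // Drains messages and writes each one into Elasticsearch. With a durable
    // broker, a failed write can simply leave the message on the queue.
    public static void run(BlockingQueue<IndexUpdateMessage> queue, Client client)
            throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            IndexUpdateMessage message = queue.take();
            client.prepareIndex(message.index, message.type, message.id)
                    .setSource(message.json)
                    .execute()
                    .actionGet();
        }
    }
}
```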
  12. Elasticsearch query for the first REST service. Use ‘filtered queries’ as much as possible, because filters are cacheable. If there is no need for a relevance score, use a ‘match_all’ query with filters. We also use a simple aggregation to get the total duration of all media, which cleaned up application code as well (sketched below).
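A sketch of that style of query with the 1.x Java client: a match_all wrapped in a filtered query, plus a sum aggregation over a duration field; the index, type and field names are placeholders.

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.metrics.sum.Sum;

public class FilteredQueryExample {
    // No relevance scoring is needed, so wrap match_all in cacheable filters
    // and compute the total media duration in the same request.
    public static double totalDurationForChannel(Client client, long channelId) {
        SearchResponse response = client.prepareSearch("media")       // placeholder index
                .setTypes("media")                                    // placeholder type
                .setQuery(QueryBuilders.filteredQuery(
                        QueryBuilders.matchAllQuery(),
                        FilterBuilders.termFilter("channelId", channelId)))
                .addAggregation(AggregationBuilders.sum("total_duration").field("duration"))
                .setSize(20)
                .execute()
                .actionGet();

        Sum totalDuration = response.getAggregations().get("total_duration");
        return totalDuration.getValue();
    }
}
```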
  13. Only one query gives back everything the service needs? Get outta here! Hibernate Search would do a single query and then fetch the attached objects from the database by ID. We found that for single queries the bottleneck isn't the database, it's the lazy-loading object retrieval. The initial Elasticsearch query returns JSON that is deserialised into a model holding ID references to other related models, and we use ‘multi-get’ requests to fetch those related models from Elasticsearch (sketch below).
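A minimal multi-get sketch with the 1.x Java client for resolving the referenced models in one round trip; the index, type and field names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.action.get.MultiGetItemResponse;
import org.elasticsearch.action.get.MultiGetResponse;
import org.elasticsearch.client.Client;

public class MultiGetExample {
    // Resolves the ID references from the initial search result with a single
    // multi-get instead of lazy-loading each related object from the database.
    public static List<String> fetchRelatedJson(Client client, List<String> channelIds) {
        MultiGetResponse response = client.prepareMultiGet()
                .add("media", "channel",                              // placeholder index and type
                        channelIds.toArray(new String[channelIds.size()]))
                .execute()
                .actionGet();

        List<String> sources = new ArrayList<String>();
        for (MultiGetItemResponse item : response) {
            if (!item.isFailed() && item.getResponse().isExists()) {
                sources.add(item.getResponse().getSourceAsString());
            }
        }
        return sources;
    }
}
```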
  14. Tags. Tags are stored in a database table with a join column. Tag services want unique results, and storing one document per tag row would produce duplicates, so we can't simply use a row = document representation:
      ID  NS     Predicate  Value      Media ID
      1   music  artist     Pearl Jam  123123123
      2   music  genre      Rock       123123123
      3   music  artist     Pearl Jam  82399239
      4   music  genre      Grunge     82399239
  15. A different approach to indexing tags: bulk indexing. Get a range of tags from the database and check Elasticsearch for each tag by its ID (full:tag=value-MEDIA). If a document exists, fetch it and build the ‘elasticsearch’ model from it; otherwise create a new ‘elasticsearch’ model for the tag. Loop through the other tags and, where an ‘elasticsearch’ document already exists for a tag, add that model's details. Finally, index into Elasticsearch. A get-or-create sketch follows.
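A rough get-or-create sketch of that check, assuming a custom document ID derived from the tag; the exact ID scheme, index, type and JSON shape are guesses for illustration.

```java
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.Client;

public class TagGetOrCreateExample {
    // Looks up a tag document by its custom ID; if it already exists the media
    // reference would be merged into the existing source, otherwise a fresh
    // document is created. The ID scheme and JSON shape are illustrative only.
    public static void indexTag(Client client, String namespace, String predicate,
                                String value, long mediaId) {
        String tagId = namespace + ":" + predicate + "=" + value;   // hypothetical custom ID

        GetResponse existing = client.prepareGet("tags", "tag", tagId)
                .execute()
                .actionGet();

        String json;
        if (existing.isExists()) {
            // In the real indexer the existing source would be deserialised into
            // the 'elasticsearch' model and the media ID appended; elided here.
            json = existing.getSourceAsString();
        } else {
            json = "{\"ns\":\"" + namespace + "\",\"predicate\":\"" + predicate +
                    "\",\"value\":\"" + value + "\",\"mediaIds\":[" + mediaId + "]}";
        }

        client.prepareIndex("tags", "tag", tagId)
                .setSource(json)
                .execute()
                .actionGet();
    }
}
```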
  16. A different approach to indexing tags: real time. When a ‘taggable’ object is being updated by the ‘Search Processor’, an index-update message is sent for each tag on the object. A ‘tag update message’ is processed by checking Elasticsearch for the tag by its custom ID: if it exists, the existing document is fetched and the ‘elasticsearch’ model created from it; otherwise a new ‘elasticsearch’ model is created. When indexing, optimistic concurrency is upheld with a version check; if there is a version exception, the tag is retrieved again from Elasticsearch and the index attempt retried (see the sketch below).
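A sketch of that version-check retry loop with the 1.x Java client; the merge of the existing document with the update is elided, and the names are placeholders.

```java
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.engine.VersionConflictEngineException;

public class OptimisticTagUpdateExample {
    // Re-indexes an existing tag document under optimistic concurrency: the
    // write carries the version we read, and on a conflict we re-read and retry.
    public static void updateTag(Client client, String tagId, String updatedJson) {
        while (true) {
            GetResponse current = client.prepareGet("tags", "tag", tagId)  // placeholder index/type
                    .execute()
                    .actionGet();
            long version = current.getVersion();

            try {
                client.prepareIndex("tags", "tag", tagId)
                        .setSource(updatedJson)     // in reality, merged from 'current' plus the update
                        .setVersion(version)        // fails if someone else wrote in between
                        .execute()
                        .actionGet();
                return;
            } catch (VersionConflictEngineException conflict) {
                // Another writer got in first; loop, re-read and retry.
            }
        }
    }
}
```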
  17. Memory issues, what memory issues? After the final services' and internal applications' search APIs were migrated, we migrated to Elasticsearch 1.5.1. Everything looked great and we were feeling great. After three days live, all nodes began to have memory issues. OH NO!!!
  18. Nested sorting and the filter cache. Nested sorting requires the parent and child documents to be in memory; if there are lots of child documents, a lot more has to be cached. Elasticsearch can't currently get around this because of its dependency on Lucene 4.x; a fix is currently in master, which uses Lucene 5.x. Workarounds for now are to add more heap memory, redesign the index layout so nested sorting isn't required, or manually clear the filter cache (sketched below).
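A small sketch of the manual cache-clear workaround using the 1.x Java admin client; the index name is a placeholder.

```java
import org.elasticsearch.client.Client;

public class ClearFilterCacheExample {
    // Workaround for the nested-sorting memory pressure: periodically clear the
    // filter cache on the affected index until the Lucene 5.x based fix lands.
    public static void clearFilterCache(Client client) {
        client.admin().indices()
                .prepareClearCache("media")   // placeholder index name
                .setFilterCache(true)
                .execute()
                .actionGet();
    }
}
```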
  19. Monitoring and alerts. Marvel is just brilliant! Use Sense to query your data and test queries (use an isolated cluster for experimenting with queries). The _cat API should be hooked into an alerting system to alert about ‘all the things’.
  20. Disaster recovery. Elasticsearch 1.0.0 brought the snapshot and restore API, which lets us take a ‘point in time’ snapshot of an index and restore it later. We use the Azure snapshot and restore plugin to take snapshots directly to blob storage (sketch below).
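A rough sketch of registering an Azure-backed repository and taking a snapshot with the 1.x Java client; the repository type and setting names assume the elasticsearch-cloud-azure plugin and should be checked against its documentation.

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class SnapshotExample {
    // Registers a snapshot repository backed by Azure blob storage (assumes the
    // Azure cloud plugin is installed; type and settings are assumptions).
    public static void registerRepository(Client client) {
        client.admin().cluster()
                .preparePutRepository("azure-backups")
                .setType("azure")
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("container", "elasticsearch-snapshots")   // hypothetical container name
                        .build())
                .execute()
                .actionGet();
    }

    // Takes a point-in-time snapshot of the media index.
    public static void snapshotMediaIndex(Client client, String snapshotName) {
        client.admin().cluster()
                .prepareCreateSnapshot("azure-backups", snapshotName)
                .setIndices("media")                                   // placeholder index name
                .setWaitForCompletion(true)
                .execute()
                .actionGet();
    }
}
```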
  21. Improvements to go. Where indexed data needs to be ‘real time’, allow direct indexing into Elasticsearch with an explicit ‘refresh’ (sketched below). Use the ‘Scrutineer’ application by Aconex to compare data between the database and Elasticsearch.
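A minimal sketch of indexing with an explicit refresh via the 1.x Java client so the document is searchable immediately; names are placeholders, and per-request refreshes are best used sparingly.

```java
import org.elasticsearch.client.Client;

public class RealTimeIndexExample {
    // Indexes a document and forces a refresh so it is searchable immediately,
    // trading some indexing throughput for 'real time' visibility.
    public static void indexNow(Client client, String id, String json) {
        client.prepareIndex("media", "media", id)   // placeholder index and type
                .setSource(json)
                .setRefresh(true)
                .execute()
                .actionGet();
    }
}
```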
  22. Thank you! A big thanks goes to the consultants at Elastic, particularly David Pilato and Mark Walkom, and a massive thank you to all the team at Movideo who helped throughout the migration.