Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés de la Peña at Big Data Spain 2015

1 Andrés de la Peña [email protected] GEOSPATIAL AND BITEMPORAL SEARCH
IN C* WITH PLUGGABLE LUCENE INDEX @a_de_la_pena

Pluggable Lucene based 2i Geospatial Search Bitemporal Indexes 1 2
3 CONTENTS

PLUGGABLE LUCENE 2i

primary key secondary indexes token ranges Throughput Expressiveness Cassandra query
methods #BDS15 4

primary key secondary indexes token ranges Cassandra query methods by
use case #BDS15 5 primary key secondary indexes token ranges Operational Analytics

Cassandra query methods trade offs #BDS15 6 • Pure-range queries
limited to partition • No Boolean logic • No Full text search • Sorting limited to partition • Full-table scan • High load • High latency • Low concurrency primary key secondary indexes token ranges primary key secondary indexes token ranges Operational Analytics

A third use case #BDS15 7 Analytics Operational Search •
Not as fast as primary key queries • Not as expressive as map reduce • Search can be used for both cases

#BDS15 8 CQL + Lucene A Lucene based secondary index
implementation

A Lucene based secondary index implementation • Proven stable and
fast indexing solution • Expressive queries - Multivariable, ranges, full text, sorting, top-k, etc. • Mature distributed search solutions built on top of it - Solr, ElasticSearch • Just a small embeddable library • Easily extensible • Published under the Apache License #BDS15 9

Cassandra query methods #BDS15 10 primary key token ranges primary
key secondary indexes token ranges primary key secondary indexes token ranges • Mid expressiveness • Mid latency • Mid load • Low expressiveness • Low latency • Low load • High expressiveness • High latency • High load Operational Analytics Search

A Lucene based secondary index implementation CLIENT C* node C*
node C* node Lucene index Lucene index Lucene index #BDS15 11 • Each node indexes its own data • Keep P2P architecture • Distribution and replication managed by C* • Just a single pluggable JAR file - CASSANDRA-8717 JVM JVM JVM

CREATE TABLE tweets ( id bigint, created timestamp, message text,
userid bigint, username text, PRIMARY KEY (userid, created, id) ); Create index • Built in the background in any moment • Real time updates • Mapping eases ETL • Language aware #BDS15 12 ALTER TABLE tweets ADD lucene TEXT; CREATE CUSTOM INDEX tweets_idx ON tweets (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '10', 'schema' : ' fields : { created : {type : "date", pattern : "yyyy-MM-dd"}, message : {type : "text", analyzer : "english"}, userid : {type : "string"}, username : {type : "string"} } '};

SELECT * FROM tweets WHERE lucene = '{ filter :
{ type : "boolean", must : [ {type : "range", field : "created_at", lower : "2015/01/01"}, {type : "wildcard", field : "user", value : "a*"} ], not : [ {type : "match", field : "user", value : "andres"} ] }, sort : { fields: [ {field : "time", reverse : true}, {field : "user", reverse : false} ] } }' LIMIT 10000; Searching for rows #BDS15 13

Integrating Lucene & Spark CLIENT Spark master C* node C*
node C* node Lucene Lucene Lucene • Compute large amounts of data • Filtering push-down • Avoid systematic full scan • Reduces the amount of data to be processed #BDS15 14

Index performance in Spark #BDS15 15 0 500 1000 1500
2000 2500 0 20 40 60 80 100 seconds millions of collected rows index full scan

SPATIAL SEARCH

Lucene spatial module • Spatial4J shapes - Points, rectangles, circles,
etc. • Spatial search strategies - BBox, RecursivePrefixTree, PointVector, etc. • Not only geographical data - Numbers, dates • It can be combined with other searches #BDS15 17

Indexing geographical locations #BDS15 18 CREATE CUSTOM INDEX restaurants_idx ON
restaurants (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '1', 'schema' : '{ fields : { location : { type : "geo_point", latitude : "lat", longitude : "lon" }, stars: {type : "integer" } } } '}; CREATE TABLE restaurants( name text PRIMARY KEY, stars bigint, lat double, lon double); • No native shape data types in CQL • Many-to-one column mapping • Just points. For now.

Bounding box search #BDS15 19 SELECT * FROM restaurants WHERE
lucene = '{ filter : { type : "geo_bbox", field : "location", min_latitude : 40.425978, max_latitude : 40.445886, min_longitude : -3.808252, max_longitude : -3.770999 } }';

Distance search #BDS15 20 SELECT * FROM restaurants WHERE lucene
= '{ filter : { type : "geo_distance", field : "location", latitude : 40.443270, longitude : -3.800498, min_distance : "100m", max_distance : "2km" } }';

Combining geospatial searches #BDS15 21 SELECT * FROM restaurants WHERE
lucene = '{ filter : { type : "boolean", must : [ { type : "geo_distance", field : "location", latitude : 40.443270, longitude : -3.800498, max_distance : "10km" }, { type : "range", field : "stars", lower : 2, upper : 4 } ] } }';

Lucene spatial is not only geospatial… #BDS15 22 • General
geometry • Numeric ranges - NumberRangePrefixTree • Date ranges/durations - DateRangePrefixTree

Temporal/Date durations #BDS15 23 • A pair composed by a
start-date and a stop-date - Can be indexed as points in a 2D space • David Smiley's DateRangePrefixTree - Levels for common date-ranges: years, months, days… - Spatial operations: intersects, is_within, contains 27 Nov 2015 29 Dec 2015 intersects is - within contains

Indexing date ranges #BDS15 24 CREATE CUSTOM INDEX breakdowns_idx ON
breakdowns (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '1', 'schema' : '{ fields : { duration: { type : "date_range", from : "start_date", to : "stop_date", pattern : "yyyy-MM-dd" }, cause: {type : "string" } } } '}; CREATE TABLE breakdowns ( system text PRIMARY KEY, cause text, start_date timestamp, stop_date timestamp); • No native date range type in CQL • Many-to-one column mapping • Spatial operations

Searching for date ranges #BDS15 25 SELECT * FROM breakdowns
WHERE lucene = '{ filter : { type : "date_range", field : "duration", from : "2015-01-01", to : "2015-01-05", operation : "intersects" } }'; SELECT * FROM users WHERE lucene = '{ filter : { type : "boolean", must : [ { type : "date_range", field : "duration", from : "2015-01-01", to : "2015-01-05", operation : "is_within" }, { type : "match", field : "cause", value : "human error" } ] } }';

INDEXING BITEMPORAL DATA

The bitemporal data model #BDS15 27 • Stores WHAT and
WHEN • Support for corrections. • Reproducible business perspective history at a point of time. • Trace why a decision was made.

The bitemporal data model #BDS15 28 • Valid Time -
The application period - WHAT happened, the real life fact period • Transaction Time - The system period - WHEN the system consider it true

The bitemporal data model: example #BDS15 29 person city vt_from
vt_to tt_from tt_to John Smallville 3-Apr-1975 ∞ 4-Apr-1975 26-Dec-1994 John Smallville 3-Apr-1975 25-Aug-1994 27-Dec-1994 ∞ John Bigtown 26-Aug-1994 ∞ 27-Dec-1994 1-Feb-2001 John Bigtown 26-Aug-1994 30-May-1995 2-Feb-2001 ∞ John Beachy 1-Jun-1995 3-Sep-2000 2-Feb-2001 ∞ John Bigtown 3-Sep-2000 ∞ 2-Feb-2001 31-Mar-2001 John Mediumtown 1-Apr-2001 ∞ 1-Apr-2001 ∞ Modified example from Wikipedia https://en.wikipedia.org/wiki/Temporal_database

A naïve approach #BDS15 30 CREATE CUSTOM INDEX census_idx ON
census (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '1', 'schema' : '{ fields : { vt_from : { type : "date", pattern : "yyyyMMdd" }, vt_to : { type : "date", pattern : "yyyyMMdd" }, tt_from : { type : "date", pattern : "yyyyMMdd" }, tt_to : { type : "date", pattern : "yyyyMMdd" } }} '}; Using 4 dates

A naive approach #BDS15 31 SELECT * FROM census WHERE
lucene = '{ filter : { type : "boolean", must : [ should : [ { type : "range", field : "vt_from", lower : "", upper : "", include_lower=true, include_upper=true }, { type : "range", field : "vt_to", lower : "", upper : "", include_lower=true, include_upper=true }, must : [ { type : "range", field : "vt_from", upper : "", include_upper=true}, { type : "range", field : "vt_to", lower : "", include_lower=true}] ], should : [ { type : "range", field : “tt_from", lower : "", upper : "", include_lower=true, include_upper=true }, { type : "range", field : “tt_to", lower : "", upper : "", include_lower=true, include_upper=true }, must : [ { type : "range", field : “tt_from", upper : "", include_upper=true}, { type : "range", field : “tt_to", lower : "", include_lower=true} ] ] ] } }' AND person = 'John Doe';

A naive approach: Issues #BDS15 32 • Very difficult to
understand/build the query. • Now value (∞) using Long.MAX_VALUE is costly.

A spatial approach #BDS15 33 CREATE CUSTOM INDEX census_idx ON
census (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'schema' : '{ fields : { vt: { type : "date_range", pattern : "yyyyMMdd", from : "vt_from", to : "vt_to" }, tt: { type : "date_range", pattern : "yyyyMMdd", from : "tt_from", to : "tt_to" }, } } '}; Using 2 date ranges SELECT * FROM users WHERE lucene = '{ filter : { type : "boolean", must : [ { type : "date_range", field : "vt", from : "20150501", to : "99999999", operation : "intersects" }, { type : "date_range", field : "tt", from : "20150501", to : "9999999999", operation : "intersects" } ] } }';

A spatial approach: performance issues #BDS15 34 • Very difficult
to understand/build the query. • Now value (∞) using Long.MAX_VALUE is costly.

4R-Tree to the rescue #BDS15 35 • Based on Bliujute,
R., Jensen, C. S., & Slivinskas, G. (2000). Light-weight indexing of general bitemporal data • The Now Value is never stored. • The data is stored in 4 R-Trees. • Queries are transformed and distributed among the trees.

Point(vt_from, tt_from) Line(vt_from,vt_to,tt_to) Rectangle(vt_from,vt_to, tt_from,tt_to) Line(vt_from,vt_to,tt_to) 4R-Tree to the rescue:
storing data #BDS15 36 TT_TO==NOW && VT_TO==NOW TT_TO==NOW && VT_TO!=NOW TT_TO!=NOW && VT_TO==NOW TT_TO!=NOW && VT_TO!=NOW • R1 R2 R3 R4

4R-Tree to the rescue: searching data #BDS15 37 IF (TT_FROM!=NOW)
&& (TT_TO >= VT_FROM): searchR1(0, TT_TO, 0,VT_TO) U searchR2(0, TT_TO, VT_FROM,VT_TO) U searchR3(max(TT_FROM,VT_FROM),TT_TO,0,VT_TO)U searchR4(TT_FROM,TT_TO, VT_FROM, VT_TO) IF (TT_FROM!=NOW) && (TT_TO < VT_FROM): searchR2(0, TT_TO, VT_FROM,VT_TO) U searchR4(TT_FROM,TT_TO, VT_FROM, VT_TO) IF (TT_FROM==NOW) && ([VT_FROM,VT_TO]≠[0,MAX]) && (TT_TO >= VT_FROM): searchR1(0, TT_TO, 0,VT_TO) U searchR2(0, TT_TO, VT_FROM,VT_TO) IF (TT_FROM==NOW) && ([VT_FROM,VT_TO]≠[0,MAX]) && (TT_TO < VT_FROM): searchR2(0, TT_TO, VT_FROM,VT_TO) IF (TT_FROM==NOW) && ([VT_FROM,VT_TO]=[0,MAX]): R1 U R2

4R-Tree to the rescue: #BDS15 38 • Problem!!! Lucene does
not have support for R-Tree • Our Solution: - Use 2 DateRangePrefixTrees for each R-Tree • Future Work: Experiment with other Lucene spatial trees and strategies.

The bitemporal data model: example #BDS15 39 Modified example from
Wikipedia https://en.wikipedia.org/wiki/Temporal_database person city vt_from vt_to tt_from tt_to John Smallville 3-Apr-1975 ∞ 4-Apr-1975 26-Dec-1994 John Smallville 3-Apr-1975 25-Aug-1994 27-Dec-1994 ∞ John Bigtown 26-Aug-1994 ∞ 27-Dec-1994 1-Feb-2001 John Bigtown 26-Aug-1994 30-May-1995 2-Feb-2001 ∞ John Beachy 1-Jun-1995 3-Sep-2000 2-Feb-2001 ∞ John Bigtown 3-Sep-2000 ∞ 2-Feb-2001 31-Mar-2001 John Mediumtown 1-Apr-2001 ∞ 1-Apr-2001 ∞

Indexing bitemporal data #BDS15 40 CREATE CUSTOM INDEX census_idx ON
census (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'schema' : '{ fields : { bitemporal : { type : "bitemporal", vt_from : "vt_from", vt_to : "vt_to", tt_from : "vt_from", tt_to : "tt_to", pattern : "yyyyMMdd" now_value : "99999999" }, city : { type : "string" } } } '}; CREATE TABLE census ( person text, city text, vt_from text, vt_to text, tt_from text, tt_to text, lucene text, PRIMARY KEY((person),vt_from,tt_from) );

Searching for bitemporal data, several queries #BDS15 41 SELECT *
FROM users WHERE lucene = '{ filter : { type : "bitemporal", field : "bitemporal", vt_from : "99999999", vt_to : "99999999", tt_from : "99999999", tt_to : "99999999" } }' AND person = 'John Doe'; Where does the system currently think that John lives right now? person city vt_from vt_to tt_from tt_to John Mediumtown 1-Apr-2001 ∞ 1-Apr-2001 ∞

Searching for bitemporal data #BDS15 42 person city vt_from vt_to
tt_from tt_to John Beachy 1-Jun-1995 3-Sep-2000 2-Feb-2001 ∞ Where does the system currently think that John lived in 1999? SELECT * FROM users WHERE lucene = '{ filter : { type : "bitemporal", field : "bitemporal", vt_from : "19990101", vt_to : "19991231", tt_from : "99999999", tt_to : "99999999" } }' AND person = 'John Doe';

#BDS15 43 On 01-Jan-2000, where did the system think John
was living back in 1999? SELECT * FROM users WHERE lucene = '{ filter : { type : "bitemporal", field : "bitemporal", vt_from : "19990101", vt_to : "19991231", tt_from : “20000101", tt_to : “20000101" } }' AND person = 'John Doe'; person city vt_from vt_to tt_from tt_to John Bigtown 26-Aug-1994 ∞ 27-Dec-1994 1-Feb-2001 Searching for bitemporal data

#BDS15 44 SELECT * FROM users WHERE lucene = '{
filter : { type : "boolean", must : [ { type : "bitemporal", field : "bitemporal", vt_from : "99999999", vt_to : "99999999", tt_from : "99999999", tt_to : "99999999" }, { type : "match", field : "city", value : "smallville"} ]} }}'; Who currently lives at Smallville? Searching for bitemporal data

CONCLUSIONS

Conclusions • Pluggable Lucene features in Cassandra • Basic geospatial
features • Date/Time durations • Bitemporal data model indexing • Compatible with MapReduce frameworks • Preserves Cassandra's functionality #BDS15 46

github.com/stratio/cassandra-lucene-index • Published as plugin for Apache Cassandra • Apache
License Version 2.0 Its open source #BDS15 47

BIG DATA CHILD`S PLAY Andrés de la Peña [email protected] @a_de_la_pena

Geospatial and bitemporal search in C* with plu...

Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés de la Peña at Big Data Spain 2015

More Decks by Big Data Spain

Other Decks in Technology

Featured

Transcript