Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés de la Peña at Big Data Spain 2015

Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés de la Peña at Big Data Spain 2015

Stratio presented its open source Lucene-based implementation of Cassandra’s secondary indexes at Cassandra Summit London 2014, which provided several search engine features. It used to be distributed as a fork of Apache Cassandra, which was a huge problem both for users and maintainers. Nowadays, due to some changes introduced at C* 2.1.6, we are proud to announce that it has become a plugin that can be attached to the official Apache Cassandra.

With the plugin we have been able to provide C* with geospatial capabilities, making it possible to index geographical positions and perform bounding box and radial distance queries. This is achieved through Lucene’s geospatial module.

Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-9.html

Cb6e6da05b5b943d2691ceefa3381cad?s=128

Big Data Spain

October 21, 2015
Tweet

Transcript

  1. None
  2. 1 Andrés de la Peña andres@stratio.com GEOSPATIAL AND BITEMPORAL SEARCH

    IN C* WITH PLUGGABLE LUCENE INDEX @a_de_la_pena
  3. Pluggable Lucene based 2i Geospatial Search Bitemporal Indexes 1 2

    3 CONTENTS
  4. PLUGGABLE LUCENE 2i

  5. primary key secondary indexes token ranges Throughput Expressiveness Cassandra query

    methods #BDS15 4
  6. primary key secondary indexes token ranges Cassandra query methods by

    use case #BDS15 5 primary key secondary indexes token ranges Operational Analytics
  7. Cassandra query methods trade offs #BDS15 6 • Pure-range queries

    limited to partition • No Boolean logic • No Full text search • Sorting limited to partition • Full-table scan • High load • High latency • Low concurrency primary key secondary indexes token ranges primary key secondary indexes token ranges Operational Analytics
  8. A third use case #BDS15 7 Analytics Operational Search •

    Not as fast as primary key queries • Not as expressive as map reduce • Search can be used for both cases
  9. #BDS15 8 CQL + Lucene A Lucene based secondary index

    implementation
  10. A Lucene based secondary index implementation • Proven stable and

    fast indexing solution • Expressive queries - Multivariable, ranges, full text, sorting, top-k, etc. • Mature distributed search solutions built on top of it - Solr, ElasticSearch • Just a small embeddable library • Easily extensible • Published under the Apache License #BDS15 9
  11. Cassandra query methods #BDS15 10 primary key token ranges primary

    key secondary indexes token ranges primary key secondary indexes token ranges • Mid expressiveness • Mid latency • Mid load • Low expressiveness • Low latency • Low load • High expressiveness • High latency • High load Operational Analytics Search
  12. A Lucene based secondary index implementation CLIENT C* node C*

    node C* node Lucene index Lucene index Lucene index #BDS15 11 • Each node indexes its own data • Keep P2P architecture • Distribution and replication managed by C* • Just a single pluggable JAR file - CASSANDRA-8717 JVM JVM JVM
  13. CREATE TABLE tweets ( id bigint, created timestamp, message text,

    userid bigint, username text, PRIMARY KEY (userid, created, id) ); Create index • Built in the background in any moment • Real time updates • Mapping eases ETL • Language aware #BDS15 12 ALTER TABLE tweets ADD lucene TEXT; CREATE CUSTOM INDEX tweets_idx ON tweets (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '10', 'schema' : ' fields : { created : {type : "date", pattern : "yyyy-MM-dd"}, message : {type : "text", analyzer : "english"}, userid : {type : "string"}, username : {type : "string"} } '};
  14. SELECT * FROM tweets WHERE lucene = '{ filter :

    { type : "boolean", must : [ {type : "range", field : "created_at", lower : "2015/01/01"}, {type : "wildcard", field : "user", value : "a*"} ], not : [ {type : "match", field : "user", value : "andres"} ] }, sort : { fields: [ {field : "time", reverse : true}, {field : "user", reverse : false} ] } }' LIMIT 10000; Searching for rows #BDS15 13
  15. Integrating Lucene & Spark CLIENT Spark master C* node C*

    node C* node Lucene Lucene Lucene • Compute large amounts of data • Filtering push-down • Avoid systematic full scan • Reduces the amount of data to be processed #BDS15 14
  16. Index performance in Spark #BDS15 15 0 500 1000 1500

    2000 2500 0 20 40 60 80 100 seconds millions of collected rows index full scan
  17. SPATIAL SEARCH

  18. Lucene spatial module • Spatial4J shapes - Points, rectangles, circles,

    etc. • Spatial search strategies - BBox, RecursivePrefixTree, PointVector, etc. • Not only geographical data - Numbers, dates • It can be combined with other searches #BDS15 17
  19. Indexing geographical locations #BDS15 18 CREATE CUSTOM INDEX restaurants_idx ON

    restaurants (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '1', 'schema' : '{ fields : { location : { type : "geo_point", latitude : "lat", longitude : "lon" }, stars: {type : "integer" } } } '}; CREATE TABLE restaurants( name text PRIMARY KEY, stars bigint, lat double, lon double); • No native shape data types in CQL • Many-to-one column mapping • Just points. For now.
  20. Bounding box search #BDS15 19 SELECT * FROM restaurants WHERE

    lucene = '{ filter : { type : "geo_bbox", field : "location", min_latitude : 40.425978, max_latitude : 40.445886, min_longitude : -3.808252, max_longitude : -3.770999 } }';
  21. Distance search #BDS15 20 SELECT * FROM restaurants WHERE lucene

    = '{ filter : { type : "geo_distance", field : "location", latitude : 40.443270, longitude : -3.800498, min_distance : "100m", max_distance : "2km" } }';
  22. Combining geospatial searches #BDS15 21 SELECT * FROM restaurants WHERE

    lucene = '{ filter : { type : "boolean", must : [ { type : "geo_distance", field : "location", latitude : 40.443270, longitude : -3.800498, max_distance : "10km" }, { type : "range", field : "stars", lower : 2, upper : 4 } ] } }';
  23. Lucene spatial is not only geospatial… #BDS15 22 • General

    geometry • Numeric ranges - NumberRangePrefixTree • Date ranges/durations - DateRangePrefixTree
  24. Temporal/Date durations #BDS15 23 • A pair composed by a

    start-date and a stop-date - Can be indexed as points in a 2D space • David Smiley's DateRangePrefixTree - Levels for common date-ranges: years, months, days… - Spatial operations: intersects, is_within, contains 27 Nov 2015 29 Dec 2015 intersects is - within contains
  25. Indexing date ranges #BDS15 24 CREATE CUSTOM INDEX breakdowns_idx ON

    breakdowns (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '1', 'schema' : '{ fields : { duration: { type : "date_range", from : "start_date", to : "stop_date", pattern : "yyyy-MM-dd" }, cause: {type : "string" } } } '}; CREATE TABLE breakdowns ( system text PRIMARY KEY, cause text, start_date timestamp, stop_date timestamp); • No native date range type in CQL • Many-to-one column mapping • Spatial operations
  26. Searching for date ranges #BDS15 25 SELECT * FROM breakdowns

    WHERE lucene = '{ filter : { type : "date_range", field : "duration", from : "2015-01-01", to : "2015-01-05", operation : "intersects" } }'; SELECT * FROM users WHERE lucene = '{ filter : { type : "boolean", must : [ { type : "date_range", field : "duration", from : "2015-01-01", to : "2015-01-05", operation : "is_within" }, { type : "match", field : "cause", value : "human error" } ] } }';
  27. INDEXING BITEMPORAL DATA

  28. The bitemporal data model #BDS15 27 • Stores WHAT and

    WHEN • Support for corrections. • Reproducible business perspective history at a point of time. • Trace why a decision was made.
  29. The bitemporal data model #BDS15 28 • Valid Time -

    The application period - WHAT happened, the real life fact period • Transaction Time - The system period - WHEN the system consider it true
  30. The bitemporal data model: example #BDS15 29 person city vt_from

    vt_to tt_from tt_to John Smallville 3-Apr-1975 ∞ 4-Apr-1975 26-Dec-1994 John Smallville 3-Apr-1975 25-Aug-1994 27-Dec-1994 ∞ John Bigtown 26-Aug-1994 ∞ 27-Dec-1994 1-Feb-2001 John Bigtown 26-Aug-1994 30-May-1995 2-Feb-2001 ∞ John Beachy 1-Jun-1995 3-Sep-2000 2-Feb-2001 ∞ John Bigtown 3-Sep-2000 ∞ 2-Feb-2001 31-Mar-2001 John Mediumtown 1-Apr-2001 ∞ 1-Apr-2001 ∞ Modified example from Wikipedia https://en.wikipedia.org/wiki/Temporal_database
  31. A naïve approach #BDS15 30 CREATE CUSTOM INDEX census_idx ON

    census (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '1', 'schema' : '{ fields : { vt_from : { type : "date", pattern : "yyyyMMdd" }, vt_to : { type : "date", pattern : "yyyyMMdd" }, tt_from : { type : "date", pattern : "yyyyMMdd" }, tt_to : { type : "date", pattern : "yyyyMMdd" } }} '}; Using 4 dates
  32. A naive approach #BDS15 31 SELECT * FROM census WHERE

    lucene = '{ filter : { type : "boolean", must : [ should : [ { type : "range", field : "vt_from", lower : "", upper : "", include_lower=true, include_upper=true }, { type : "range", field : "vt_to", lower : "", upper : "", include_lower=true, include_upper=true }, must : [ { type : "range", field : "vt_from", upper : "", include_upper=true}, { type : "range", field : "vt_to", lower : "", include_lower=true}] ], should : [ { type : "range", field : “tt_from", lower : "", upper : "", include_lower=true, include_upper=true }, { type : "range", field : “tt_to", lower : "", upper : "", include_lower=true, include_upper=true }, must : [ { type : "range", field : “tt_from", upper : "", include_upper=true}, { type : "range", field : “tt_to", lower : "", include_lower=true} ] ] ] } }' AND person = 'John Doe';
  33. A naive approach: Issues #BDS15 32 • Very difficult to

    understand/build the query. • Now value (∞) using Long.MAX_VALUE is costly.
  34. A spatial approach #BDS15 33 CREATE CUSTOM INDEX census_idx ON

    census (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'schema' : '{ fields : { vt: { type : "date_range", pattern : "yyyyMMdd", from : "vt_from", to : "vt_to" }, tt: { type : "date_range", pattern : "yyyyMMdd", from : "tt_from", to : "tt_to" }, } } '}; Using 2 date ranges SELECT * FROM users WHERE lucene = '{ filter : { type : "boolean", must : [ { type : "date_range", field : "vt", from : "20150501", to : "99999999", operation : "intersects" }, { type : "date_range", field : "tt", from : "20150501", to : "9999999999", operation : "intersects" } ] } }';
  35. A spatial approach: performance issues #BDS15 34 • Very difficult

    to understand/build the query. • Now value (∞) using Long.MAX_VALUE is costly.
  36. 4R-Tree to the rescue #BDS15 35 • Based on Bliujute,

    R., Jensen, C. S., & Slivinskas, G. (2000). Light-weight indexing of general bitemporal data • The Now Value is never stored. • The data is stored in 4 R-Trees. • Queries are transformed and distributed among the trees.
  37. Point(vt_from, tt_from) Line(vt_from,vt_to,tt_to) Rectangle(vt_from,vt_to, tt_from,tt_to) Line(vt_from,vt_to,tt_to) 4R-Tree to the rescue:

    storing data #BDS15 36 TT_TO==NOW && VT_TO==NOW TT_TO==NOW && VT_TO!=NOW TT_TO!=NOW && VT_TO==NOW TT_TO!=NOW && VT_TO!=NOW • R1 R2 R3 R4
  38. 4R-Tree to the rescue: searching data #BDS15 37 IF (TT_FROM!=NOW)

    && (TT_TO >= VT_FROM): searchR1(0, TT_TO, 0,VT_TO) U searchR2(0, TT_TO, VT_FROM,VT_TO) U searchR3(max(TT_FROM,VT_FROM),TT_TO,0,VT_TO)U searchR4(TT_FROM,TT_TO, VT_FROM, VT_TO) IF (TT_FROM!=NOW) && (TT_TO < VT_FROM): searchR2(0, TT_TO, VT_FROM,VT_TO) U searchR4(TT_FROM,TT_TO, VT_FROM, VT_TO) IF (TT_FROM==NOW) && ([VT_FROM,VT_TO]≠[0,MAX]) && (TT_TO >= VT_FROM): searchR1(0, TT_TO, 0,VT_TO) U searchR2(0, TT_TO, VT_FROM,VT_TO) IF (TT_FROM==NOW) && ([VT_FROM,VT_TO]≠[0,MAX]) && (TT_TO < VT_FROM): searchR2(0, TT_TO, VT_FROM,VT_TO) IF (TT_FROM==NOW) && ([VT_FROM,VT_TO]=[0,MAX]): R1 U R2
  39. 4R-Tree to the rescue: #BDS15 38 • Problem!!! Lucene does

    not have support for R-Tree • Our Solution: - Use 2 DateRangePrefixTrees for each R-Tree • Future Work: Experiment with other Lucene spatial trees and strategies.
  40. The bitemporal data model: example #BDS15 39 Modified example from

    Wikipedia https://en.wikipedia.org/wiki/Temporal_database person city vt_from vt_to tt_from tt_to John Smallville 3-Apr-1975 ∞ 4-Apr-1975 26-Dec-1994 John Smallville 3-Apr-1975 25-Aug-1994 27-Dec-1994 ∞ John Bigtown 26-Aug-1994 ∞ 27-Dec-1994 1-Feb-2001 John Bigtown 26-Aug-1994 30-May-1995 2-Feb-2001 ∞ John Beachy 1-Jun-1995 3-Sep-2000 2-Feb-2001 ∞ John Bigtown 3-Sep-2000 ∞ 2-Feb-2001 31-Mar-2001 John Mediumtown 1-Apr-2001 ∞ 1-Apr-2001 ∞
  41. Indexing bitemporal data #BDS15 40 CREATE CUSTOM INDEX census_idx ON

    census (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'schema' : '{ fields : { bitemporal : { type : "bitemporal", vt_from : "vt_from", vt_to : "vt_to", tt_from : "vt_from", tt_to : "tt_to", pattern : "yyyyMMdd" now_value : "99999999" }, city : { type : "string" } } } '}; CREATE TABLE census ( person text, city text, vt_from text, vt_to text, tt_from text, tt_to text, lucene text, PRIMARY KEY((person),vt_from,tt_from) );
  42. Searching for bitemporal data, several queries #BDS15 41 SELECT *

    FROM users WHERE lucene = '{ filter : { type : "bitemporal", field : "bitemporal", vt_from : "99999999", vt_to : "99999999", tt_from : "99999999", tt_to : "99999999" } }' AND person = 'John Doe'; Where does the system currently think that John lives right now? person city vt_from vt_to tt_from tt_to John Mediumtown 1-Apr-2001 ∞ 1-Apr-2001 ∞
  43. Searching for bitemporal data #BDS15 42 person city vt_from vt_to

    tt_from tt_to John Beachy 1-Jun-1995 3-Sep-2000 2-Feb-2001 ∞ Where does the system currently think that John lived in 1999? SELECT * FROM users WHERE lucene = '{ filter : { type : "bitemporal", field : "bitemporal", vt_from : "19990101", vt_to : "19991231", tt_from : "99999999", tt_to : "99999999" } }' AND person = 'John Doe';
  44. #BDS15 43 On 01-Jan-2000, where did the system think John

    was living back in 1999? SELECT * FROM users WHERE lucene = '{ filter : { type : "bitemporal", field : "bitemporal", vt_from : "19990101", vt_to : "19991231", tt_from : “20000101", tt_to : “20000101" } }' AND person = 'John Doe'; person city vt_from vt_to tt_from tt_to John Bigtown 26-Aug-1994 ∞ 27-Dec-1994 1-Feb-2001 Searching for bitemporal data
  45. #BDS15 44 SELECT * FROM users WHERE lucene = '{

    filter : { type : "boolean", must : [ { type : "bitemporal", field : "bitemporal", vt_from : "99999999", vt_to : "99999999", tt_from : "99999999", tt_to : "99999999" }, { type : "match", field : "city", value : "smallville"} ]} }}'; Who currently lives at Smallville? Searching for bitemporal data
  46. CONCLUSIONS

  47. Conclusions • Pluggable Lucene features in Cassandra • Basic geospatial

    features • Date/Time durations • Bitemporal data model indexing • Compatible with MapReduce frameworks • Preserves Cassandra's functionality #BDS15 46
  48. github.com/stratio/cassandra-lucene-index • Published as plugin for Apache Cassandra • Apache

    License Version 2.0 Its open source #BDS15 47
  49. BIG DATA CHILD`S PLAY Andrés de la Peña andres@stratio.com @a_de_la_pena