Workshop: Learning ElasticSearch

Anurag
July 11, 2013

Slides from ElasticSearch workshop conducted at The Fifth Elephant 2013, Bangalore.

Transcript

  1. Learning ElasticSearch, Fifth Elephant 2013, Bangalore.

    Anurag Patel, Red Hat
  2. Also available at http://xinh.org/5el

  3. ElasticWho? ElasticSearch is a flexible and powerful open source, distributed

    real-time search and analytics engine.
  4. Features Real time analytics, Distributed, High availability, Multi tenant architecture,

    Full text, Document oriented, Schema free, RESTful API, Per-operation persistence
  5. Distributed Start small and scale horizontally out of the box.

    For more capacity, just add more nodes and let the cluster reorganize itself.
  6. High Availability ElasticSearch clusters detect and remove failed nodes, and

    reorganize themselves.
  7. Multi Tenancy A cluster can host multiple indices which can

    be queried independently, or as a group.
    $ curl -XPUT http://localhost:9200/people
    $ curl -XPUT http://localhost:9200/gems
    $ curl -XPUT http://localhost:9200/gems/document/pry-0.5.9
    $ curl -XGET http://localhost:9200/gems/document/pry-0.5.9
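Querying indices independently or as a group comes down to how the request URL is built. The following is a minimal sketch, assuming the comma-separated index syntax of the search endpoint; `BASE` and `search_url` are hypothetical names, and nothing is sent over the network.

```python
BASE = "http://localhost:9200"  # assumed local node, as in the slides


def search_url(indices):
    """Build a _search URL for one index or for a group of indices."""
    if isinstance(indices, str):
        indices = [indices]
    return "{}/{}/_search".format(BASE, ",".join(indices))


print(search_url("gems"))
print(search_url(["gems", "people"]))
```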
  8. Document Oriented Store complex real world entities in Elasticsearch as

    structured JSON documents.
    {
      "_id": "pry-0.5.9",
      "_index": "gems",
      "_source": {
        "authors": ["John Mair (banisterfiend)"],
        "autorequire": null,
        "bindir": "bin",
        "cert_chain": [],
        "date": "Sun Feb 20 11:00:00 UTC 2011",
        "default_executable": null,
        "description": "attach an irb-like session to any object at runtime",
        "email": "jrmair@gmail.com"
      }
    }
  9. RESTful API Almost any operation can be performed using a

    simple RESTful interface using JSON over HTTP.
    curl -X GET
    curl -X PUT
    curl -X POST
    curl -X DELETE
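The same JSON-over-HTTP calls can be prepared from a script. This is a rough sketch using only the Python standard library; `BASE` and `build_request` are hypothetical names, the path and body are illustrative, and the request is only constructed, not sent.

```python
import json
from urllib.request import Request

BASE = "http://localhost:9200"  # assumed local node


def build_request(method, path, body=None):
    """Prepare (but do not send) a JSON-over-HTTP request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return Request(BASE + path, data=data, method=method)


req = build_request("PUT", "/gems/test/grit-2.5.1", {"name": "grit"})
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```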
  10. Apache Lucene ElasticSearch is built on top of Apache Lucene.

    Lucene is a high performance, full-featured Information Retrieval library, written in Java.
  11. ElasticSearch Terminology

  12. Document In ElasticSearch, everything is stored as a Document.

    Documents can be addressed and retrieved by querying their attributes.
    $ curl -XGET http://localhost:9200/gems/document/pry-0.5.9
    {
      "_id": "pry-0.5.9",
      "_index": "gems",
      "_source": {
        "authors": ["John Mair (banisterfiend)"],
        "autorequire": null,
        "bindir": "bin",
        "cert_chain": [],
        "date": "Sun Feb 20 11:00:00 UTC 2011",
        "default_executable": null,
        "description": "attach an irb-like session to any object at runtime",
        "email": "jrmair@gmail.com",
        "executables": ["pry"],
        "extensions": [],
        "extra_rdoc_files": [],
        "files": [
          "lib/pry/commands.rb",
          "lib/pry/command_base.rb",
          "lib/pry/completion.rb",
          "lib/pry/core_extensions.rb",
          "lib/pry/hooks.rb",
          "lib/pry/print.rb",
          "lib/pry/prompts.rb",
          "lib/pry/pry_class.rb",
          "lib/pry/pry_instance.rb",
          "lib/pry/version.rb",
          "lib/pry.rb",
          "examples/example_basic.rb",
          ...
  13. Document Types Lets us specify document properties, so we can

    differentiate the objects.
  14. Shard Each Shard is a separate native Lucene Index. Lets

    us overcome the RAM and hard disk capacity limits of a single node.
  15. Replica An exact copy of primary Shard. Helps in setting

    up HA, increases query throughput.
  16. Index ElasticSearch stores its data in logical Indices. Think of

    a table, collection or a database. An Index has at least 1 primary Shard, and 0 or more Replicas.
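The relationship between primary Shards and Replicas is simple arithmetic: every primary is copied once per configured replica. A small sketch, where `shard_copies` is a hypothetical helper and the second call uses illustrative values that do not come from the slides:

```python
def shard_copies(primaries, replicas):
    """Total shard copies for an index: each primary plus its replicas."""
    if primaries < 1 or replicas < 0:
        raise ValueError("need at least 1 primary and 0 or more replicas")
    return primaries * (1 + replicas)


print(shard_copies(6, 0))  # 6 primaries, no replicas
print(shard_copies(5, 1))  # illustrative values: 5 primaries, 1 replica each
```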
  17. Cluster A collection of cooperating ElasticSearch nodes. Gives better availability

    and performance via Index Sharding and Replicas.
  18. ElasticSearch Workshop

  19. Download and start Download ElasticSearch from http://www.elasticsearch.org/download

    # service elasticsearch start
    # /etc/init.d/elasticsearch start
    # ./bin/elasticsearch -f
  20. ElasticSearch Plugins A site plugin to view contents of ElasticSearch

    cluster. Restart ElasticSearch. Plugins are detected and loaded on service startup.
    # cd /usr/share/elasticsearch
    # ./bin/plugin -install mobz/elasticsearch-head
    # cd /opt/elasticsearch-0.90.2
    # ./bin/plugin -install mobz/elasticsearch-head
  21. elasticsearch-head

  22. RESTful interface

    $ curl -XGET 'http://localhost:9200/'
    {
      "ok": true,
      "status": 200,
      "name": "Drake, Frank",
      "version": {
        "number": "0.90.2",
        "snapshot_build": false,
        "lucene_version": "4.3.1"
      },
      "tagline": "You Know, for Search"
    }
  23. Create Index

    $ curl -XPUT 'http://localhost:9200/gems'
    {"ok": true, "acknowledged": true}
  24. Cluster status

    $ curl -XGET 'localhost:9200/_status'
    {"ok": true, "_shards": {"total": 20, "successful": 10, "failed": 0},
     "indices": {"gems": {
       "index": {"primary_size": "495b", "primary_size_in_bytes": 495, "size": "495b", "size_in_bytes": 495},
       "translog": {"operations": 0},
       "docs": {"num_docs": 0, "max_doc": 0, "deleted_docs": 0},
       "merges": {"current": 0, "current_docs": 0, "current_size": "0b", "current_size_in_bytes": 0,
                  "total": 0, "total_time": "0s", "total_time_in_millis": 0, "total_docs": 0,
                  "total_size": "0b", "total_size_in_bytes": 0},
       ...
  25. Pretty Output

    $ curl -XGET 'localhost:9200/_status?pretty'
    $ curl -XGET 'localhost:9200/_status' | python -mjson.tool
    $ curl -XGET 'localhost:9200/_status' | json_reformat
    {
      "ok": true,
      "_shards": {
        "total": 20,
        "successful": 10,
        "failed": 0
      },
      "indices": {
        "gems": {
          "index": {
            "primary_size": "495b",
            "primary_size_in_bytes": 495,
            "size": "495b",
            "size_in_bytes": 495
          },
          ...
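When neither `?pretty` nor a command-line formatter is at hand, the response can be re-indented from a script. A minimal sketch using the standard `json` module; `pretty` is a hypothetical helper, and the indent width and key sorting are choices made here rather than anything the slides prescribe.

```python
import json


def pretty(raw):
    """Re-indent a JSON string for reading."""
    return json.dumps(json.loads(raw), indent=4, sort_keys=True)


raw = '{"ok":true,"_shards":{"total":20,"successful":10,"failed":0}}'
print(pretty(raw))
```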
  26. Delete Index

    $ curl -XDELETE 'http://localhost:9200/gems'
    {"ok": true, "acknowledged": true}
  27. Create custom Index

    {"settings": {"index": {"number_of_shards": 6, "number_of_replicas": 0}}}
    $ curl -XPUT 'http://localhost:9200/gems' -d @body.json
    {"ok": true, "acknowledged": true}
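The `-d @body.json` argument tells curl to read the request body from a file. One way to produce that file is sketched below, assuming the settings shown on the slide; the sketch writes into a temporary directory (an assumption made here so that no real `body.json` is overwritten).

```python
import json
import os
import tempfile

settings = {"settings": {"index": {"number_of_shards": 6, "number_of_replicas": 0}}}

# Write to a temporary directory so the sketch does not clobber a real body.json.
path = os.path.join(tempfile.mkdtemp(), "body.json")
with open(path, "w") as fh:
    json.dump(settings, fh)

with open(path) as fh:
    print(fh.read())
```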
  28. Index a document

    {
      "name": "pry",
      "platform": "ruby",
      "rubygems_version": "1.5.2",
      "description": "attach an irb-like session to any object at runtime",
      "email": "anurag@example.com",
      "has_rdoc": true,
      "homepage": "http://banisterfiend.wordpress.com"
    }
    $ curl -XPOST 'http://localhost:9200/gems/test/' -d @body.json
    {"ok": true, "_index": "gems", "_type": "test", "_id": "lsJgxiwET6eg", "_version": 1}
  29. Get document

    $ curl -XGET 'http://localhost:9200/gems/test/lsJgxiwET6eg' | python -mjson.tool
    {
      "_id": "lsJgxiwET6eg",
      "_index": "gems",
      "_source": {
        "description": "attach an irb-like session to any object at runtime",
        "email": "anurag@example.com",
        "has_rdoc": true,
        "homepage": "http://banisterfiend.wordpress.com",
        "name": "pry",
        "platform": "ruby",
        "rubygems_version": "1.5.2"
      },
      "_type": "test",
      "_version": 1,
      "exists": true
    }
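The response wraps the stored document in metadata: the original JSON sits under `_source`, and `exists` reports whether the lookup hit. A short sketch of unpacking it, using a trimmed copy of the slide's response; `source_of` is a hypothetical helper, and the `exists` field name is taken from this 0.90-era output.

```python
import json

# Trimmed copy of the response shown on the slide.
response = json.loads("""
{"_id": "lsJgxiwET6eg", "_index": "gems", "_type": "test", "_version": 1,
 "exists": true,
 "_source": {"name": "pry", "platform": "ruby", "rubygems_version": "1.5.2",
             "has_rdoc": true}}
""")


def source_of(resp):
    """Return the stored document, or None when the lookup missed."""
    return resp["_source"] if resp.get("exists") else None


doc = source_of(response)
print(doc["name"], doc["platform"])
```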
  30. Index another document

    {
      "name": "grit",
      "platform": "jruby",
      "rubygems_version": "2.5.0",
      "description": "Ruby library for extracting information from a git repository.",
      "email": "mojombo@github.com",
      "has_rdoc": false,
      "homepage": "http://github.com/mojombo/grit"
    }
    $ curl -XPOST 'http://localhost:9200/gems/test/' -d @body.json
    {"ok": true, "_index": "gems", "_type": "test", "_id": "ijUOHi2cQc2", "_version": 1}
  31. Custom Document IDs IDs are unique across an Index. Each is composed of

    the DocumentType and ID.
    {
      "name": "grit",
      "platform": "jruby",
      "rubygems_version": "2.5.1",
      "description": "Ruby library for extracting information from a git repository.",
      "email": "mojombo@github.com",
      "has_rdoc": false,
      "homepage": "http://github.com/mojombo/grit"
    }
    $ curl -XPUT 'http://localhost:9200/gems/test/grit-2.5.1' -d @body.json
    {"ok": true, "_index": "gems", "_type": "test", "_id": "grit-2.5.1", "_version": 1}
  32. Document Versions

    $ curl -XPUT 'http://localhost:9200/gems/test/grit-2.5.1' -d @body.json
    {"ok": true, "_index": "gems", "_type": "test", "_id": "grit-2.5.1", "_version": 2}
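The version bump can be pictured with a toy in-memory model: writing to the same DocumentType and ID again increments `_version`, while a different type starts its own count. This is only an illustration of the observable behaviour; `ToyIndex` is hypothetical and says nothing about how ElasticSearch stores data internally.

```python
class ToyIndex:
    """In-memory stand-in for one index; NOT how ElasticSearch stores data."""

    def __init__(self):
        self.docs = {}  # (doc_type, doc_id) -> (version, source)

    def put(self, doc_type, doc_id, source):
        """Store a document and return a response-like dict with its version."""
        version = self.docs.get((doc_type, doc_id), (0, None))[0] + 1
        self.docs[(doc_type, doc_id)] = (version, source)
        return {"ok": True, "_type": doc_type, "_id": doc_id, "_version": version}


index = ToyIndex()
first = index.put("test", "grit-2.5.1", {"name": "grit"})
second = index.put("test", "grit-2.5.1", {"name": "grit"})
print(first["_version"], second["_version"])
```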
  33. Searching Documents

    {"query": {"term": {"name": "pry"}}}
    $ curl -XPOST http://localhost:9200/gems/_search -d @body.json | python -mjson.tool
    {
      "_shards": {"failed": 0, "successful": 6, "total": 6},
      "hits": {
        "hits": [
          {
            "_id": "MWkKgzsMRgK",
            "_index": "gems",
            "_score": 1.4054651,
            "_source": {
              "description": "attach an irb-like session to any object at runtime",
              "email": "anurag@example.com",
              "has_rdoc": true,
              "homepage": "http://banisterfiend.wordpress.com",
              "name": "pry",
              "platform": "ruby",
              "rubygems_version": "1.5.2"
            },
            "_type": "test"
          }
        ],
        "max_score": 1.4054651,
        "total": 1
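A search response nests the matching documents under `hits.hits`, next to the total count and the best score. The sketch below walks a trimmed copy of the slide's response; the closing brackets of the sample are an assumption because the slide output is cut off, and `summarize` is a hypothetical helper.

```python
import json

# Trimmed copy of the slide's response. The closing brackets are an
# assumption: the output on the slide is cut off before they appear.
response = json.loads("""
{"_shards": {"failed": 0, "successful": 6, "total": 6},
 "hits": {"total": 1, "max_score": 1.4054651,
          "hits": [{"_id": "MWkKgzsMRgK", "_index": "gems", "_type": "test",
                    "_score": 1.4054651,
                    "_source": {"name": "pry", "platform": "ruby"}}]}}
""")


def summarize(resp):
    """Return (total, [(id, score, name), ...]) from a search response."""
    hits = resp["hits"]
    rows = [(h["_id"], h["_score"], h["_source"]["name"]) for h in hits["hits"]]
    return hits["total"], rows


total, rows = summarize(response)
print(total, rows)
```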
  34. Counting Documents

    {"term": {"name": "pry"}}
    $ curl -XGET http://localhost:9200/gems/test/_count -d @body.json
    {"_shards": {"failed": 0, "successful": 6, "total": 6}, "count": 1}
  35. Update a Document The partial document is merged using a simple

    recursive merge.
    {"doc": {"platform": "macruby"}}
    $ curl -XPOST http://localhost:9200/gems/test/grit-2.5.1/_update -d @body.json
    {"ok": true, "_index": "gems", "_type": "test", "_id": "grit-2.5.1", "_version": 4}
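A recursive merge means nested objects are merged key by key, while every other value in the partial document replaces the stored one. The following sketch shows the idea under that reading; `recursive_merge` is a hypothetical function, and the `meta` and `stars` fields are invented for illustration.

```python
def recursive_merge(target, partial):
    """Merge `partial` into a copy of `target`, descending into nested dicts."""
    merged = dict(target)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = recursive_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# 'meta' and 'stars' are made-up fields, used only to show the nested case.
stored = {"name": "grit", "platform": "jruby", "meta": {"has_rdoc": False, "stars": 1}}
update = {"doc": {"platform": "macruby", "meta": {"stars": 2}}}
print(recursive_merge(stored, update["doc"]))
```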
  36. Update via Script

    {"script": "ctx._source.platform = vm_name", "params": {"vm_name": "rubinius"}}
    $ curl -XPOST http://localhost:9200/gems/test/grit-2.5.1/_update -d @body.json
    {"ok": true, "_index": "gems", "_type": "test", "_id": "grit-2.5.1", "_version": 5}
  37. Delete Document

    $ curl -XDELETE 'http://localhost:9200/gems/test/grit-2.5.1'
    {"ok": true, "found": true, "_index": "gems", "_type": "test", "_id": "grit-2.5.1", "_version": 6}
  38. Put Mapping

    {
      "gem": {
        "properties": {
          "name": {"type": "string", "index": "not_analyzed"},
          "platform": {"type": "string", "index": "not_analyzed"},
          "rubygems_version": {"type": "string", "index": "not_analyzed"},
          "description": {"type": "string", "store": "yes"},
          "has_rdoc": {"type": "boolean"}
        }
      }
    }
    $ curl -XPUT 'http://localhost:9200/gems/gem/_mapping' -d @body.json
    $ curl -XGET 'http://localhost:9200/gems/_mapping' | python -mjson.tool
  39. Index Document with Mapping

    {
      "name": "grit",
      "platform": "ruby",
      "rubygems_version": "2.5.1",
      "description": "Ruby library for extracting information from a git repository.",
      "email": "mojombo@github.com",
      "has_rdoc": false,
      "homepage": "http://github.com/mojombo/grit"
    }
    $ curl -XPUT 'http://localhost:9200/gems/gem/grit-2.5.1' -d @body.json
    {"ok": true, "_index": "gems", "_type": "gem", "_id": "grit-2.5.1", "_version": 1}
  40. Matching documents

    {"query": {"match": {"description": "git repository"}}}
    $ curl -XPOST http://localhost:9200/gems/gem/_search -d @body.json
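A `match` query analyzes the query text into terms before comparing them with the analyzed field. The toy below imitates that with a lowercase-and-split tokenizer, a very rough stand-in for a real analyzer; `analyze` and `match` are hypothetical helpers, and the any-term-matches behaviour shown is the assumption being illustrated.

```python
import re


def analyze(text):
    """Rough stand-in for an analyzer: lowercase, split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def match(field_value, query_text):
    """True when any analyzed query term appears among the analyzed field terms."""
    field_terms = set(analyze(field_value))
    return any(term in field_terms for term in analyze(query_text))


description = "Ruby library for extracting information from a git repository."
other = "attach an irb-like session to any object at runtime"
print(analyze("git repository"))
print(match(description, "git repository"))
print(match(other, "git repository"))
```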
  41. Highlighting

    {
      "query": {"match": {"description": "git repository"}},
      "highlight": {"fields": {"description": {}}}
    }
    $ curl -XPOST http://localhost:9200/gems/gem/_search -d @body.json
    "highlight": {
      "description": [
        "Ruby library for extracting information from a <em>git</em> <em>repository</em>."
      ]
    }
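The highlighted fragment wraps every matched term in `<em>` tags. A toy version of that wrapping, using a regular expression instead of the real highlighter, might look like this; `highlight` and its `pre` and `post` parameters are hypothetical names.

```python
import re


def highlight(text, query_text, pre="<em>", post="</em>"):
    """Wrap each whole-word occurrence of a query term in highlight tags."""
    terms = [re.escape(t) for t in query_text.lower().split()]
    if not terms:
        return text
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: pre + m.group(0) + post, text)


description = "Ruby library for extracting information from a git repository."
print(highlight(description, "git repository"))
```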
  42. Search Facets

    {
      "query": {"match_all": {}},
      "facets": {"gem_names": {"terms": {"field": "name"}}}
    }
    $ curl -XPOST http://localhost:9200/gems/_search -d @body.json
    ...
    "facets": {
      "gem_names": {
        "_type": "terms",
        "missing": 0,
        "other": 0,
        "terms": [
          {"count": 2, "term": "pry"},
          {"count": 2, "term": "grit"},
          {"count": 1, "term": "abc"}
        ],
        "total": 5
      }
    },
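A terms facet is, in effect, a count of documents per distinct field value. The sketch below reproduces the counts from the slide with `collections.Counter`; `terms_facet` is a hypothetical helper, the five documents are reconstructed from the facet counts rather than taken from the index, and the field is treated as a single unanalyzed value.

```python
from collections import Counter


def terms_facet(docs, field):
    """Count documents per distinct value of `field`, most frequent first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    terms = [{"term": term, "count": count} for term, count in counts.most_common()]
    return {
        "_type": "terms",
        "missing": sum(1 for doc in docs if field not in doc),
        "total": sum(counts.values()),
        "terms": terms,
    }


# Documents reconstructed from the facet counts on the slide.
docs = [{"name": "pry"}, {"name": "pry"}, {"name": "grit"},
        {"name": "grit"}, {"name": "abc"}]
facet = terms_facet(docs, "name")
print(facet)
```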
  43. (Lab) Analyzing Aadhaar's Datasets

  44. Download Public Dataset Download from Aadhaar Public Data Portal at

    https://data.uidai.gov.in
  45. Download Tools $ git clone https://github.com/gnurag/aadhaar

  46. Prepare Data & Configure

    # gem install yajl-ruby tire activesupport
    $ git clone https://github.com/gnurag/aadhaar
    $ cd aadhaar/data
    $ unzip UIDAI-ENR-DETAIL-20121001.zip
    $ cd ../bin
    $ vi aadhaar.rb
  47. Configuration

    AADHAAR_DATA_DIR = "/path/to/aadhaar/data"
    ES_URL = "http://localhost:9200"
    ES_INDEX = 'aadhaar'
    ES_TYPE = "UID"
    BATCH_SIZE = 1000
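A setting like `BATCH_SIZE` suggests that records are sent in groups rather than one request per row. The sketch below shows one way such batching could work, assuming the newline-delimited bulk request format; whether `aadhaar.rb` does exactly this is not stated in the slides, and `batches`, `bulk_body` and the sample rows are hypothetical.

```python
import json

ES_INDEX = "aadhaar"
ES_TYPE = "UID"
BATCH_SIZE = 1000


def batches(records, size=BATCH_SIZE):
    """Yield consecutive lists of at most `size` records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_body(batch):
    """Render one batch as a newline-delimited bulk request body."""
    lines = []
    for record in batch:
        lines.append(json.dumps({"index": {"_index": ES_INDEX, "_type": ES_TYPE}}))
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


records = ({"row": i} for i in range(2500))  # made-up rows, not real Aadhaar data
sizes = [len(b) for b in batches(records)]
print(sizes)
```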
  48. Index $ ruby aadhaar.rb

  49. Running Examples

    $ curl -XPOST http://localhost:9200/aadhaar/UID/_search -d @template.json | python -mjson.tool
  50. Additional Notes

  51. Index Aliases Group multiple Indexes, and query them together.

    curl -XPOST 'http://localhost:9200/_aliases' -d '
    {"actions": [
      {"add": {"index": "index1", "alias": "master-alias"}},
      {"add": {"index": "index2", "alias": "master-alias"}}
    ]}'
    curl -XPOST 'http://localhost:9200/_aliases' -d '
    {"actions": [
      {"remove": {"index": "index2", "alias": "master-alias"}}
    ]}'
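Hand-written action lists are easy to get wrong, since each entry in the `actions` array must be separated by a comma. Generating the body keeps it valid JSON. A small sketch, where `alias_actions` is a hypothetical helper:

```python
import json


def alias_actions(action, indices, alias):
    """Build an _aliases request body applying `action` to each index."""
    if action not in ("add", "remove"):
        raise ValueError("action must be 'add' or 'remove'")
    return {"actions": [{action: {"index": i, "alias": alias}} for i in indices]}


body = json.dumps(alias_actions("add", ["index1", "index2"], "master-alias"))
print(body)
```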
  52. Document Routing Control which Shard the document will be placed

    on and queried from.
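Routing boils down to hashing a routing value and taking it modulo the number of primary Shards, so equal routing values always land on the same Shard. The sketch below uses `zlib.crc32` purely as a stand-in hash; it is not the hash function ElasticSearch uses, and `shard_for` is a hypothetical helper.

```python
import zlib


def shard_for(routing, number_of_shards):
    """Pick a shard from a routing value: hash it, then take it modulo the shard count."""
    # crc32 is a stand-in; the real hash function differs.
    digest = zlib.crc32(routing.encode("utf-8")) & 0xFFFFFFFF
    return digest % number_of_shards


for key in ("rexml", "roxml", "rexml"):
    print(key, shard_for(key, 6))
```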
  53. Parents & Children

    $ curl -XPUT 'http://localhost:9200/gems/gem/roxml?parent=rexml' -d '{"tag": "something"}'
  54. Custom Analyzers

  55. Boosting Search Results

  56. ElasticSearch Ecosystem A wide range of site plugins, analyzers and river

    plugins are available from the community.
  57. THE END / @gnurag github