Querying rich text with Lux - XQuery for Lucene

talk given at Lucene/Solr Revolution Dublin 2013

Michael Sokolov

November 06, 2013

Transcript

  1. The plan for this talk
     - Overview of Lux
     - Why we need / want a rich(er) query language
     - Implementation stories
     - Indexing tagged text
     - Storing documents in Lucene
     - Lazy searching
     - Demo
  2. What is Lux?
     - XQuery in Solr
     - Query optimizer
     - Efficient XML document format
     - XQuery function library
     - as a Java library (Lucene only)
     - as Solr plugins
     - as a standalone App Server
  3. XML is not cool
     - maybe it was once, 10 years ago?
     - Legacy stuff: DTDs, namespaces, etc.
     - arcane Java programming interfaces
     - Don’t we use JSON now?
     - so why do we care about it?
  4. But it still matters
     - There’s a huge amount of it out there
     - HTML is XML, or can be
     - Lux is about making it easy (and free) to deal with XML
  5. Why did we make it?
     - We make content-rich sites:
     - our own site: safaribooksonline.com
     - our clients’ sites: oed.com, degruyter.com, oxfordreference.com, …
     - Publishers provide us with content
     - we debug content problems
     - we add new features nimbly
     - Piles of random data (XML, mostly)
  6. How can XQuery help?
     - Complex queries over semi-structured data, typically documents
     - You don’t need it for edismax-style “quick” search
     - or highly-structured data
     - XQuery comes with a rich function library:
     - rich string, numeric and date functions
     - extensions for HTTP, filesystem, zip
  7. How does Lux work?
     [Architecture diagram. Component labels: XML Indexer, Evaluator, Tinybin storage, Optimizer, Lazy Searcher, Compiler, Tagged Highlighter, Saxon XQuery/XSLT Processor, XQuery Function Library, UpdateProcessor, QueryComponent, ResponseWriter, QParserPlugin, DispatchFilter, External Field Codec, Tagged TokenStream, XML text fields, XPath fields]
  8. XML is text with context
     - “hamlet”
     - “hamlet” in //title
     - “hamlet” in //scene/title, //speaker, etc…
     - XQuery, but we need an index
     - DIH XPathEntityProcessor
     - But are XPath indexes enough?
  9. How do we index context?
     - In which speeches does Hamlet talk about poison?
     - +speaker:Hamlet +line:poison
     - Works great if we indexed speaker and line for each speech
     - What if we only indexed at the scene level?
     - What if we just indexed speech text as a field?
     - XPath indexes are precise and fine-grained
     - Great when you know exactly what you need
  10. Contextual Indexes
      <play>
        <title>Hamlet</title>
        <act act="1">
          <scene act="1" scene="1">
            <title>SCENE I. Elsinore ... </title>

      Index        Values
      Tags         title, act, @act
      Tag Paths    /play, /play/title, /play/act, /play/act/@act
      Text         hamlet, scene, elsinore
      Tagged Text  play:hamlet, title:hamlet, @act:1
      XPath        user-defined XPath 2.0 expression; e.g.:
                   count(//line),
                   replace(//title, 'SCENE|ACT \S+', '')
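The index types listed on the Contextual Indexes slide can be illustrated with a small, self-contained sketch. This is not Lux code: it uses Python's standard `xml.etree.ElementTree`, the function names `tokenize` and `contextual_indexes` are hypothetical, the sample document is the slide's fragment with closing tags added, and the simple lowercase word tokenizer is an assumption standing in for a real Lucene Analyzer.

```python
import re
import xml.etree.ElementTree as ET

# The slide's fragment, completed with closing tags so that it parses.
SAMPLE = """<play>
  <title>Hamlet</title>
  <act act="1">
    <scene act="1" scene="1">
      <title>SCENE I. Elsinore</title>
    </scene>
  </act>
</play>"""


def tokenize(text):
    """Assumed analyzer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", (text or "").lower())


def contextual_indexes(xml_text):
    """Collect tags, tag paths, text tokens and tagged text for one document."""
    tags, paths, text, tagged = set(), set(), set(), set()

    def add_text(raw, stack):
        # Every token is recorded once as plain text and once per enclosing tag.
        for tok in tokenize(raw):
            text.add(tok)
            for tag in stack:
                tagged.add(tag + ":" + tok)

    def visit(elem, ancestors):
        stack = ancestors + [elem.tag]
        path = "/" + "/".join(stack)
        tags.add(elem.tag)
        paths.add(path)
        for name, value in elem.attrib.items():
            tags.add("@" + name)
            paths.add(path + "/@" + name)
            for tok in tokenize(value):
                tagged.add("@" + name + ":" + tok)
        add_text(elem.text, stack)
        for child in elem:
            visit(child, stack)
            # Text following a child still belongs to the current element.
            add_text(child.tail, stack)

    visit(ET.fromstring(xml_text), [])
    return {"tags": tags, "paths": paths, "text": text, "tagged_text": tagged}


if __name__ == "__main__":
    for name, values in contextual_indexes(SAMPLE).items():
        print(name, sorted(values))
```

For the sample, the result contains the slide's example values, such as the path `/play/act/@act` and the tagged terms `play:hamlet`, `title:hamlet` and `@act:1`.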
  11. General purpose indexes
      - Tagged Text, Path index
      - Imprecise, generic indexes, but more context than just full text
      - XQuery post-processing to patch over the gaps
      - Query optimizer applies indexes
      - For when you don’t want to sweat the details: ad hoc queries, content analysis and debugging
  12. TaggedTokenStream
      Input:
        <scene><speech>
          <speaker>Hamlet</speaker>
          <line>To be or not to be, … </line>

      Tag stack for each token:
        … scene speech speaker   Hamlet
        … scene speech line      To
        … scene speech line      be

      Tokens emitted:
        ttext:scene\:hamlet     pos=1
        ttext:speech\:hamlet    pos=1
        ttext:speaker\:hamlet   pos=1
        ttext:scene\:to         pos=2
        ttext:speech\:to        pos=2
        …

      - Wraps an existing Analyzer (for the text)
      - Responds to XML events (start element, etc)
      - Maintains a tag name stack
      - Emits each token prefixed by enclosing tags
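The four bullets on the TaggedTokenStream slide describe an event-driven design that can be sketched outside Lucene. The following is an illustration only, not the Lux implementation: it uses Python's standard `xml.sax` for the XML events, the class name `TaggedTokenHandler` and function `tagged_tokens` are hypothetical, and a lowercase word splitter is assumed in place of the wrapped Analyzer. Terms are written as `tag:word` without the backslash escaping shown on the slide.

```python
import re
import xml.sax


class TaggedTokenHandler(xml.sax.ContentHandler):
    """Emit each word once per enclosing tag, all at the same position."""

    def __init__(self):
        super().__init__()
        self.stack = []      # names of the currently open elements
        self.buffer = []     # character data seen since the last tag event
        self.position = 0    # word position, shared by all tagged variants
        self.tokens = []     # (term, position) pairs in emission order

    def _flush(self):
        # SAX may deliver text in several chunks, so words are only
        # extracted once a tag boundary is reached.
        text = "".join(self.buffer)
        self.buffer = []
        for word in re.findall(r"\w+", text.lower()):
            self.position += 1
            for tag in self.stack:
                self.tokens.append((tag + ":" + word, self.position))

    def startElement(self, name, attrs):
        self._flush()
        self.stack.append(name)

    def endElement(self, name):
        self._flush()
        self.stack.pop()

    def characters(self, content):
        self.buffer.append(content)


def tagged_tokens(xml_text):
    handler = TaggedTokenHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens


if __name__ == "__main__":
    doc = ("<scene><speech><speaker>Hamlet</speaker>"
           "<line>To be or not to be</line></speech></scene>")
    for term, pos in tagged_tokens(doc):
        print(term, "pos=%d" % pos)
```

Because all tagged variants of a word share one position, a phrase query over tagged terms can still match adjacent words, which is the property the slide's position column points at.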
  13. TagQueryParser
      - XPath:
          //speech[speaker="Hamlet"][contains(.,"poison")]
      - “optimized” XQuery:
          lux:search("+<speaker:Hamlet +<speech:poison")
            //speech[speaker="Hamlet"][contains(.,"poison")]
      - Lucene Query:
          tagged_text:(+speaker\:Hamlet +speech\:poison)
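The last step on the TagQueryParser slide, turning `<tag:term` clauses into escaped terms in a single Lucene field, can be mimicked with a few lines of string rewriting. This is a toy sketch under stated assumptions, not Lux's parser: the function name `to_lucene` and the regular expression are hypothetical, only simple one-word terms are handled, and the field name `tagged_text` is taken from the slide.

```python
import re

# Matches "<tag:term" where tag may start with "@" for attributes (assumed syntax).
TAGGED_TERM = re.compile(r"<(@?[\w.-]+):(\w+)")


def to_lucene(tag_query, field="tagged_text"):
    """Rewrite '<tag:term' clauses as 'tag\\:term' inside one Lucene field."""
    # A function is used for the replacement so the backslash is inserted literally.
    rewritten = TAGGED_TERM.sub(
        lambda m: m.group(1) + "\\:" + m.group(2), tag_query
    )
    return field + ":(" + rewritten + ")"


if __name__ == "__main__":
    print(to_lucene("+<speaker:Hamlet +<speech:poison"))
```

Escaping the colon keeps the tag and the word together as one term, so the Lucene query parser does not read `speaker` as a field name.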
  14. Tagged token examples
      - Generic JSON index
      - Overlapping tags (part-of-speech, phrase-labeling, NLP)
      - citation classification w/probabilistic labeling
      - One stored field for all the text makes highlighting easier
      - One Lucene field means you can use PhraseQuery, e.g.:
          PhraseQuery(<speaker:hamlet <speech:to)
        finds all speeches by hamlet starting with “to”.
  15. What’s the cost?
      - stored document = 100%
      - qnames = +1.3%
      - paths = +2.4%
      - text tokens = 18%
      - tagged text (opaque) = 18%
      - tagged text (all transparent) = 71%
  16. Query optimization
      subsequence(
        for $doc in collection()[.//SPEAKER="Hamlet"]
        order by $doc/lux:key("title")
        return $doc, 1000, 20)

      subsequence(
        lux:search("<SPEAKER:Hamlet", "title", 1000)
          [.//SPEAKER="Hamlet"],
        1, 20)
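The general idea behind this kind of rewrite is to let an imprecise index pick candidates in sorted order, re-check the exact predicate on each candidate, and stop as soon as the requested page is full. The sketch below shows that idea in plain Python. It is not how Lux's optimizer is implemented, and it does not reproduce the exact argument semantics of `lux:search`; the data, the word-level index and the function names `build_index`, `naive` and `optimized` are all hypothetical.

```python
from itertools import islice

DOCS = [
    {"title": "d", "speakers": ["Hamlet", "Horatio"]},
    {"title": "a", "speakers": ["Ghost of Hamlet"]},   # index hit, but not an exact match
    {"title": "c", "speakers": ["Ophelia"]},
    {"title": "b", "speakers": ["Hamlet"]},
    {"title": "e", "speakers": ["Hamlet", "Gertrude"]},
]


def build_index(docs):
    """Word-level inverted index over speaker names (imprecise on purpose)."""
    index = {}
    for doc_id, doc in enumerate(docs):
        for speaker in doc["speakers"]:
            for word in speaker.lower().split():
                index.setdefault(word, set()).add(doc_id)
    return index


def naive(docs, speaker, start, count):
    """Scan every document, sort all matches, then take one page (1-based start)."""
    hits = [d for d in docs if speaker in d["speakers"]]
    hits.sort(key=lambda d: d["title"])
    return hits[start - 1:start - 1 + count]


def optimized(docs, index, speaker, start, count):
    """Use the index for candidates, verify lazily, stop when the page is full."""
    candidates = sorted(index.get(speaker.lower(), ()),
                        key=lambda i: docs[i]["title"])
    # Generator: the exact predicate runs only for as many candidates as needed.
    verified = (docs[i] for i in candidates if speaker in docs[i]["speakers"])
    return list(islice(verified, start - 1, start - 1 + count))


if __name__ == "__main__":
    index = build_index(DOCS)
    print(optimized(DOCS, index, "Hamlet", 1, 2))
```

The document titled "a" is returned by the index lookup but rejected by the exact check, which mirrors the role of the retained `[.//SPEAKER="Hamlet"]` predicate in the rewritten query.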
  17. Document storage
      - Lux uses Lucene as its primary document store
      - Lux tinybin (based on Saxon TinyTree) storage format avoids XML parsing overhead
      - Experimental new codec stores fields as files
  18. “Big” binary stored fields
      - Problem: “big” stored fields
      - Text documents get stored for highlighting
      - Take time to copy when merging
      - Can we do better by storing as files, but managing w/Lucene?
  19. Deleting is complicated
      - Real-time deletes
      - Track deletions when merging
      - Keep commits with IndexDeletionPolicy
      - Delete unmerged (empty) segments
      - Off-line deletes
      - Cleanup tool traverses entire index
  20. Codec Performance (preliminary)
      - 2-3x write speedup for unindexed stored fields
      - a bit slower in the worst case
      - But, text analysis can take most of the time
      - Net: useful if you are storing large binaries
  21. App Server
      - custom DispatchFilter provides:
      - HTTP request/response handling in XQuery
      - file uploads, redirects
      - Ability to roll your own: cookies, authentication
      - Rapid prototyping, testing query performance, relevance, in an application setting
  22. Isn’t Solr enough?
      - Yes, but did you remember to index all the fields you need in advance?
      - Yes, but did you want to format the result into a nice report *using your query language*?
      - Yes, but did you want access to a complete XPath 2.0 implementation in your indexer?
  23. Example uses
      - Find some sample content with a new tag we need to support
      - Perform complex updates to patch broken content
      - Troubleshoot content
      - Explore unfamiliar content
      - Write prototypes and admin tools entirely in HTML, JS and XQuery
      - Demo: http://localhost:8080
  24. Thank You!
      - Downloads and Documentation at http://luxdb.org
      - Source code at http://github.com/msokolov/lux
      - Freely available under OSS license (MPL 2)
      - Contributions welcome
      - Thank you, Safari Books!