Querying rich text with Lux - XQuery for Lucene

Querying Rich Text with Lucene XQuery
Michael Sokolov
Senior Architect, Safari Books Online

The plan for this talk
- Overview of Lux
- Why we want a rich(er) query language
- Implementation stories
  - Indexing tagged text
  - Storing documents in Lucene
  - Lazy searching
- Demo

What is Lux?
- XQuery in Solr
- Query optimizer
- Efficient XML document format
- XQuery function library
- Available:
  - as a Java library (Lucene only)
  - as Solr plugins
  - as a standalone app server

Search to find something

Query to get an answer

XML is not cool
- Maybe it was once, 10 years ago?
- Legacy stuff: DTDs, namespaces, etc.
- Arcane Java programming interfaces
- Don't we use JSON now?
- So why do we care about it?

But it still matters
- There's a huge amount of it out there
- HTML is XML, or can be
- Lux is about making it easy (and free) to deal with XML

Why did we make it?
- We make content-rich sites:
  - our own site: safaribooksonline.com
  - our clients' sites: oed.com, degruyter.com, oxfordreference.com, …
- Publishers provide us with content
  - we debug content problems
  - we add new features nimbly
- Piles of random data (XML, mostly)

How can XQuery help?
- Complex queries over semi-structured data, typically documents
- You don't need it for edismax-style "quick" search
  - or highly-structured data
- XQuery comes with a rich function library:
  - rich string, numeric and date functions
  - extensions for HTTP, filesystem, zip

How does Lux work?
(Architecture diagram. Components shown: XML Indexer, Evaluator, Tinybin storage, Optimizer, Lazy Searcher, Compiler, Tagged Highlighter, Saxon XQuery/XSLT Processor, XQuery Function Library, UpdateProcessor, QueryComponent, ResponseWriter, QParserPlugin, DispatchFilter, External Field Codec, Tagged TokenStream, XML text fields, XPath fields.)

XML is text with context
- "hamlet"
- "hamlet" in //title
- "hamlet" in //scene/title, //speaker, etc…
- XQuery, but we need an index
- DIH XPathEntityProcessor
- But are XPath indexes enough?

How do we index context?
- In which speeches does Hamlet talk about poison?
  - +speaker:Hamlet +line:poison
- Works great if we indexed speaker and line for each speech
- What if we only indexed at the scene level?
- What if we just indexed speech text as a field?
- XPath indexes are precise and fine-grained
- Great when you know exactly what you need
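To make the granularity question concrete, here is a minimal Python sketch. It is not Lux code: the toy scene and the `matches` helper are invented for illustration. It emulates the conjunctive query `+speaker:Hamlet +line:poison` against documents indexed per speech and against one document indexed per scene, and shows that the scene-level index reports a hit even though Hamlet never says "poison".

```python
# Toy data (invented for illustration): one scene with two speeches.
scene = [
    {"speaker": "Hamlet", "line": "To be or not to be"},
    {"speaker": "Laertes", "line": "The poison is in the cup"},
]

def matches(doc, speaker, word):
    """Emulate +speaker:<speaker> +line:<word> against one indexed document."""
    speakers = {s.lower() for s in doc["speaker"]}
    words = {w.lower() for line in doc["line"] for w in line.split()}
    return speaker.lower() in speakers and word.lower() in words

# Speech-level indexing: one document per speech.
speech_docs = [{"speaker": [s["speaker"]], "line": [s["line"]]} for s in scene]

# Scene-level indexing: all speeches flattened into a single document.
scene_doc = {
    "speaker": [s["speaker"] for s in scene],
    "line": [s["line"] for s in scene],
}

speech_hits = [d for d in speech_docs if matches(d, "Hamlet", "poison")]
scene_hit = matches(scene_doc, "Hamlet", "poison")
```

With speech-level documents the query finds nothing, which is the right answer for this toy scene; with the scene-level document it matches, so a post-filter (for example an XPath predicate) would be needed to remove the false positive.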

Contextual Indexes

    <play>
      <title>Hamlet</title>
      <act act="1">
        <scene act="1" scene="1">
          <title>SCENE I. Elsinore ... </title>

Index        Values
Tags         title, act, @act
Tag Paths    /play, /play/title, /play/act, /play/act/@act
Text         hamlet, scene, elsinore
Tagged Text  play:hamlet, title:hamlet, @act:1
XPath        user-defined XPath 2.0 expression; e.g. count(//line), replace(//title, 'SCENE|ACT \S+','')
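The index table above can be reproduced with a short Python sketch. This is an illustration only, not the Lux indexer: the `index_values` function, the lowercase word tokenizer, and the completed closing tags in the sample document are assumptions made so the example runs on its own.

```python
import re
import xml.etree.ElementTree as ET

# The slide's fragment, closed off so that it parses (closing tags are assumed).
XML = """<play>
  <title>Hamlet</title>
  <act act="1">
    <scene act="1" scene="1">
      <title>SCENE I. Elsinore</title>
    </scene>
  </act>
</play>"""

def words(s):
    """Very simple analyzer: lowercase word tokens."""
    return re.findall(r"\w+", (s or "").lower())

def index_values(xml_text):
    tags, paths, text, tagged = set(), set(), set(), set()

    def add_text(s, ancestors):
        for w in words(s):
            text.add(w)
            # Tagged text: every enclosing element name prefixes the word.
            tagged.update(f"{tag}:{w}" for tag in ancestors)

    def visit(elem, path, ancestors):
        path = f"{path}/{elem.tag}"
        ancestors = ancestors + [elem.tag]
        tags.add(elem.tag)
        paths.add(path)
        for name, value in elem.attrib.items():
            tags.add("@" + name)
            paths.add(f"{path}/@{name}")
            tagged.update(f"@{name}:{w}" for w in words(value))
        add_text(elem.text, ancestors)
        for child in elem:
            visit(child, path, ancestors)
            add_text(child.tail, ancestors)  # text following a child element

    visit(ET.fromstring(xml_text), "", [])
    return {"tags": tags, "paths": paths, "text": text, "tagged": tagged}

idx = index_values(XML)
```

Here `idx["paths"]` contains `/play/act/@act` and `idx["tagged"]` contains `play:hamlet`, `title:hamlet` and `@act:1`, matching the rows of the table.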

General purpose indexes
- Tagged Text, Path index
- Imprecise, generic indexes, but more context than just full text
- XQuery post-processing to patch over the gaps
- Query optimizer applies indexes
- For when you don't want to sweat the details: ad hoc queries, content analysis and debugging

TaggedTokenStream
- Wraps an existing Analyzer (for the text)
- Responds to XML events (start element, etc.)
- Maintains a tag name stack
- Emits each token prefixed by enclosing tags

Input:

    <scene><speech>
      <speaker>Hamlet</speaker>
      <line>To be or not to be, … </line>

Tag stack for each token:

    scene speech speaker   Hamlet
    scene speech line      To
    scene speech line      be

Tokens emitted:

    ttext:scene\:hamlet    pos=1
    ttext:speech\:hamlet   pos=1
    ttext:speaker\:hamlet  pos=1
    ttext:scene\:to        pos=2
    ttext:speech\:to       pos=2
    …
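The tag-stack idea can be sketched in a few lines of Python using the standard SAX parser. This is only a model of the behaviour described on the slide, not the Lux `TaggedTokenStream` class: the handler name, the regex "analyzer", and the `tag:term` output format are assumptions.

```python
import re
import xml.sax

def simple_analyzer(text):
    """Stand-in for a wrapped text analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

class TaggedTokenHandler(xml.sax.ContentHandler):
    def __init__(self, analyze):
        super().__init__()
        self.analyze = analyze
        self.stack = []      # names of the currently open elements
        self.buffer = []     # text chunks seen since the last XML event
        self.position = 0    # token position, shared by all tagged variants
        self.tokens = []

    def _flush(self):
        text = "".join(self.buffer)
        self.buffer.clear()
        for term in self.analyze(text):
            self.position += 1
            # Emit the term once per enclosing tag, all at the same position.
            for tag in self.stack:
                self.tokens.append((f"{tag}:{term}", self.position))

    def startElement(self, name, attrs):
        self._flush()
        self.stack.append(name)

    def endElement(self, name):
        self._flush()
        self.stack.pop()

    def characters(self, content):
        self.buffer.append(content)

def tagged_tokens(xml_text):
    handler = TaggedTokenHandler(simple_analyzer)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens

tokens = tagged_tokens(
    "<scene><speech><speaker>Hamlet</speaker>"
    "<line>To be or not to be</line></speech></scene>"
)
```

The first three tokens are `scene:hamlet`, `speech:hamlet` and `speaker:hamlet`, all at position 1, followed by `scene:to`, `speech:to` and `line:to` at position 2. Keeping the variants at the same position is what lets phrase queries work across differently tagged terms.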

TagQueryParser
- XPath:

      //speech[speaker="Hamlet"][contains(.,"poison")]

- "optimized" XQuery:

      lux:search("+<speaker:Hamlet +<speech:poison")
        //speech[speaker="Hamlet"][contains(.,"poison")]

- Lucene Query:

      tagged_text:(+speaker\:Hamlet +speech\:poison)
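As a rough illustration of the translation from the tag query syntax to a Lucene query string, consider the Python sketch below. The clause grammar (optional `+` or `-`, then `<tag:term`) is inferred from the single example on the slide, and `to_lucene` is a hypothetical helper, not the actual TagQueryParser.

```python
import re

# Assumed clause syntax: optional +/- operator, "<", tag name, ":", term.
CLAUSE = re.compile(r"([+-]?)<([\w@.-]+):(\S+)")

def to_lucene(tag_query, field="tagged_text"):
    """Rewrite '<tag:term' clauses as escaped terms in a single Lucene field."""
    clauses = []
    for token in tag_query.split():
        m = CLAUSE.fullmatch(token)
        if m is None:
            raise ValueError(f"unsupported clause: {token!r}")
        occur, tag, term = m.groups()
        # The colon inside the term is escaped so Lucene does not read it
        # as a field separator.
        clauses.append(f"{occur}{tag}\\:{term}")
    return f"{field}:({' '.join(clauses)})"

lucene_query = to_lucene("+<speaker:Hamlet +<speech:poison")
```

For the slide's example this yields `tagged_text:(+speaker\:Hamlet +speech\:poison)`.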

Tagged token examples
- Generic JSON index
- Overlapping tags (part-of-speech, phrase-labeling, NLP)
- Citation classification with probabilistic labeling
- One stored field for all the text makes highlighting easier
- One Lucene field means you can use PhraseQuery, e.g. PhraseQuery(<speaker:hamlet <speech:to) finds all speeches by Hamlet starting with "to".
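Why a single field with shared positions makes the phrase query work can be shown with a small positional-matching sketch in Python. The token list and the `phrase_match` helper are invented for illustration; real Lucene phrase matching works on postings lists rather than on an in-memory list like this.

```python
from collections import defaultdict

# (term, position) pairs for one speech, in the style of a tagged token stream.
tokens = [
    ("speech:hamlet", 1), ("speaker:hamlet", 1),
    ("speech:to", 2), ("line:to", 2),
    ("speech:be", 3), ("line:be", 3),
]

def phrase_match(tokens, phrase):
    """True if the phrase terms occur at consecutive positions."""
    positions = defaultdict(set)
    for term, pos in tokens:
        positions[term].add(pos)
    return any(
        all(start + i in positions[term] for i, term in enumerate(phrase))
        for start in positions[phrase[0]]
    )

starts_with_to = phrase_match(tokens, ["speaker:hamlet", "speech:to"])
starts_with_be = phrase_match(tokens, ["speaker:hamlet", "speech:be"])
```

The first phrase matches because `speaker:hamlet` sits at position 1 and `speech:to` at position 2; the second does not, because `speech:be` is at position 3.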

What's the cost?
- stored document = 100%
- qnames = +1.3%
- paths = +2.4%
- text tokens = 18%
- tagged text (opaque) = 18%
- tagged text (all transparent) = 71%

Query optimization

Original query:

    subsequence(
      for $doc in collection()[.//SPEAKER="Hamlet"]
      order by $doc/lux:key("title")
      return $doc,
      1000, 20)

Optimized query:

    subsequence(
      lux:search("<SPEAKER:Hamlet", "title", 1000)
        [.//SPEAKER="Hamlet"],
      1, 20)
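The payoff of evaluating such a query lazily can be illustrated with Python generators. This is a sketch of the general idea only: `search`, the fake documents, and the `loaded` counter are stand-ins invented for this example and say nothing about how the Lux lazy searcher is implemented.

```python
from itertools import islice

loaded = []  # records every document that is actually fetched

def search(doc_ids):
    """Stand-in for an index search: yields documents lazily, in index order."""
    for doc_id in doc_ids:
        loaded.append(doc_id)
        speaker = "Hamlet" if doc_id % 2 == 0 else "Ophelia"
        yield {"id": doc_id, "speaker": speaker}

# 10,000 candidate documents, but nothing is fetched until results are pulled.
candidates = search(range(10_000))

# Post-filter, like the [.//SPEAKER="Hamlet"] predicate in the query.
hamlet = (doc for doc in candidates if doc["speaker"] == "Hamlet")

# Take only the first page of 20 results.
page = list(islice(hamlet, 20))
```

Only 39 of the 10,000 candidates are fetched to fill the page of 20, because iteration stops as soon as enough results have been produced.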

Document storage
- Lux uses Lucene as its primary document store
- Lux tinybin (based on Saxon TinyTree) storage format avoids XML parsing overhead
- Experimental new codec stores fields as files

"Big" binary stored fields
- Problem: "big" stored fields
  - Text documents get stored for highlighting
  - Take time to copy when merging
- Can we do better by storing as files, but managing with Lucene?

ExternalFieldCodec
(Diagram: large stored fields, small stored fields.)

Deleting is complicated
- Real-time deletes
  - Track deletions when merging
  - Keep commits with IndexDeletionPolicy
  - Delete unmerged (empty) segments
- Off-line deletes
  - Cleanup tool traverses entire index
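The off-line approach amounts to a mark-and-sweep over externally stored files. The Python sketch below shows that general pattern under heavy assumptions: the one-file-per-field layout, the file names, and the `cleanup_orphans` helper are all hypothetical and are not taken from the ExternalFieldCodec.

```python
import tempfile
from pathlib import Path

def cleanup_orphans(field_dir, live_references):
    """Delete files in field_dir that no live document references."""
    removed = []
    for path in sorted(Path(field_dir).iterdir()):
        if path.is_file() and path.name not in live_references:
            path.unlink()
            removed.append(path.name)
    return removed

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: one file per large stored field value.
    for name in ("doc1.bin", "doc2.bin", "doc3.bin"):
        (Path(tmp) / name).write_bytes(b"payload")

    # Pretend a full index traversal found only these references;
    # doc2 was deleted from the index at some earlier point.
    live = {"doc1.bin", "doc3.bin"}

    removed = cleanup_orphans(tmp, live)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```

The sweep removes `doc2.bin` and leaves the two referenced files in place. The cost is the full index traversal needed to build the live set, which is why this is an off-line operation.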

Codec Performance (preliminary)
- 2-3x write speedup for unindexed stored fields
  - a bit slower in the worst case
- But text analysis can take most of the time
- Net: useful if you are storing large binaries

App Server
- Custom DispatchFilter provides:
  - HTTP request/response handling in XQuery
  - file uploads, redirects
  - ability to roll your own: cookies, authentication
- Rapid prototyping, testing query performance, relevance, in an application setting

Isn't Solr enough?
- Yes, but did you remember to index all the fields you need in advance?
- Yes, but did you want to format the result into a nice report *using your query language*?
- Yes, but did you want access to a complete XPath 2.0 implementation in your indexer?

Example uses
- Find some sample content with a new tag we need to support
- Perform complex updates to patch broken content
- Troubleshoot content
- Explore unfamiliar content
- Write prototypes and admin tools entirely in HTML, JS and XQuery
- Demo: http://localhost:8080

Thank You!
- Downloads and documentation at http://luxdb.org
- Source code at http://github.com/msokolov/lux
- Freely available under OSS license (MPL 2)
- Contributions welcome
- Thank you, Safari Books!
