Slide 1

Slide 1 text

Challenges in maintaining a high-performance Search-Engine written in Java
Simon Willnauer
Apache Lucene Core Committer & PMC Chair

Slide 2

Slide 2 text

Who am I?
•Lucene Core Committer
•Project Management Committee Chair (PMC)
•Apache Member
•Co-Founder BerlinBuzzwords
•Co-Founder on

Slide 3

Slide 3 text

•Community Portal targeting OpenSource Search

Slide 4

Slide 4 text

Agenda
•It’s all about performance ...eerrr community
•It’s Java so it’s fast?
•Challenges we faced and solved in the last years
•Testing, Performance, Concurrency and Resource Utilization
•Questions

Slide 5

Slide 5 text

Let’s talk about Lucene
•Apache TLP since 2001
•Grandfather of projects like Mahout, Hadoop, Nutch, Tika
•Used by thousands of applications worldwide
•Apache 2.0 licensed
•Core has Zero-Dependency
•Developed and Maintained by Volunteers

Slide 6

Slide 6 text

Just a search engine - so what’s the big deal?
•True - Just software!
•Massive community - with big expectations (YOU?)
•Mission critical for lots of companies
•End-user expects instant results independent of the request complexity
•New features sometimes require major changes
•Our contract is trust - we need to maintain trust!

Slide 7

Slide 7 text

Trust & Passion
•~ 30 committers (~ 10 active, some are paid to work on Lucene)
•All technical communication is public (JIRA, mailing list, IRC)
•Consensus is king! - usually :)
•No lead developer or architect
•No stand-ups, meetings or roadmap
•Up to 10k mails per month
•No passion, no progress!
•The Apache way: Community over Code

Slide 8

Slide 8 text

Community? / 0-Dependencies?
[Timeline: yesterday, today, tomorrow]

Slide 9

Slide 9 text

What are folks looking for?
[Diagram labels: Performance, API - Stability, Innovation; Confident Users, New Users, Happy Users; Lucene 3x, Lucene 4]

Slide 10

Slide 10 text

Maintaining a Library - it’s tricky!
•A library hides a lot of technical details
•synchronization, data-structures, algorithms
•Most of the users don’t look behind the scenes
•If you don’t look behind the scenes, it’s on us to make sure that the library is as efficient as possible...
•A good reputation needs to be maintained....
•Lucene 4 promises a lot but it has to prove itself

Slide 11

Slide 11 text

But let’s talk about fun challenges...
•Written in Java... ;)
•A ton of different use-cases
•Nothing is impossible - or why LowercaseFilter supports supplementary characters
•There are Bugs, there are always Bugs!
•Mission Critical Software - shipped for free

Slide 12

Slide 12 text

We are working in Java so....
•No need to know the machine & your environment
•Use JDK Collections, they are fast
•Short Lived Objects are Free
•Concurrency is straightforward
•IO is trivial
•Method Calls are fast - there is a JIT, no?
•Unicode is there and it works
•No memory problems - there is a GC, right?

Slide 13

Slide 13 text

Really? No!

Slide 14

Slide 14 text

Mechanical Sympathy (Jackie Stewart / Martin Thompson)
“The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.” Henry Petroski

Slide 15

Slide 15 text

Where to focus on?
Focus areas: Impact on GC, Space/CPU Utilization, Concurrency, Compression, Environment
•Do we need 2 bytes per Character?
•Any exploitable data properties?
•Amount of Objects (Long & Short Living)
•JVM memory allocation
•Cost & Need of a Multiple Writers Model
•CPU Cache Utilization
•Cost of a Monitor / CAS
•Need of mutability
•Can we specialize data structures?
•Can we allow stack allocation?
•Can I reuse this object in the short term?
•Compute up-front?
•JIT - Friendliness?
•Concrete or Virtual?

Slide 16

Slide 16 text

What we focus on...
Focus areas: Impact on GC, Space/CPU Utilization, Concurrency, Compression, Environment
•Materialize strings to bytes
•Strings can share prefix & suffix
•Data Structures with Constant number of objects
•Guarantee continuous memory allocation
•Single Writer - Multiple Readers
•Materialized Data structures for Java HEAP
•Write, Commit, Merge
•Write Once & Read-Only
•Finite State Transducers / Machines
•No Java Collections where scale is an issue
•UTF-8 by default or custom encoding
•MemoryMap | NIO
•Exploit FS / OS Caches
•Prevent False Sharing
•Prevent invokeVirtual where possible
•Carefully test what JIT likes
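
Two of the points above, materializing strings to bytes and letting sorted strings share a prefix, can be illustrated with a small sketch. This is not Lucene's on-disk format; `PrefixCodingSketch`, `Entry`, `encode` and `decode` are hypothetical names, and the sketch only shares prefixes between adjacent terms of an already sorted list.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PrefixCodingSketch {
    /** One encoded term: number of leading bytes shared with the previous term, plus the remaining bytes. */
    static final class Entry {
        final int sharedPrefix;
        final byte[] suffix;

        Entry(int sharedPrefix, byte[] suffix) {
            this.sharedPrefix = sharedPrefix;
            this.suffix = suffix;
        }
    }

    /** Encodes sorted terms as UTF-8 bytes, storing only the bytes that differ from the previous term. */
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        byte[] prev = new byte[0];
        for (String term : sortedTerms) {
            byte[] cur = term.getBytes(StandardCharsets.UTF_8);
            int shared = 0;
            int max = Math.min(prev.length, cur.length);
            while (shared < max && prev[shared] == cur[shared]) {
                shared++;
            }
            out.add(new Entry(shared, Arrays.copyOfRange(cur, shared, cur.length)));
            prev = cur;
        }
        return out;
    }

    /** Rebuilds the original terms by re-attaching each suffix to the prefix of the previous term. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        byte[] prev = new byte[0];
        for (Entry e : entries) {
            byte[] cur = new byte[e.sharedPrefix + e.suffix.length];
            System.arraycopy(prev, 0, cur, 0, e.sharedPrefix);
            System.arraycopy(e.suffix, 0, cur, e.sharedPrefix, e.suffix.length);
            out.add(new String(cur, StandardCharsets.UTF_8));
            prev = cur;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("search", "searcher", "searching", "seat");
        int stored = 0;
        for (Entry e : encode(terms)) {
            stored += e.suffix.length;
        }
        // 27 UTF-8 bytes in total, of which only 12 suffix bytes need to be stored
        System.out.println("suffix bytes stored: " + stored);
    }
}
```

For ASCII-heavy terms the UTF-8 form already takes one byte per character instead of the two bytes of a Java `char`, and the shared prefixes shrink it further.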

Slide 17

Slide 17 text

Making things fast and stable is...
an engineering effort! & takes time!
Go tell it to your boss!

Slide 18

Slide 18 text

But is it necessary?
•Yes & No - it all depends on finding the hotspots
•Measure & Optimize for your use-case.
•Data-structures are not general purpose (e.g. they don’t support deletes)
•Follow the 80 / 20 rule
•Enforce Efficiency by design
•Java Iterators are a good example of how not to do it!
•Remember your OS is highly optimized, make use of it!
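
One way to read the remark about Java Iterators: `java.util.Iterator` needs two calls per element (`hasNext()` and `next()`) and boxes primitive values. A minimal sketch of an alternative follows, with a single advancing call and a sentinel value that signals exhaustion. `IntCursor` and `NO_MORE` are hypothetical names for illustration, not a Lucene API.

```java
class IntCursorSketch {
    /** Sentinel returned once the cursor is exhausted (assumes it never occurs as a real value). */
    static final int NO_MORE = Integer.MAX_VALUE;

    /** Forward-only cursor over primitive ints: one call per element, no boxing. */
    static final class IntCursor {
        private final int[] values;
        private int pos = 0;

        IntCursor(int[] values) {
            this.values = values;
        }

        /** Returns the next value, or NO_MORE when there is nothing left. */
        int next() {
            return pos < values.length ? values[pos++] : NO_MORE;
        }
    }

    /** Consumes a cursor with a single call per step. */
    static long sum(IntCursor cursor) {
        long total = 0;
        for (int v = cursor.next(); v != NO_MORE; v = cursor.next()) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(new IntCursor(new int[] {3, 7, 42}))); // prints 52
    }
}
```

The design trades generality for efficiency: the cursor cannot hold the sentinel as a value and cannot be reset, which is acceptable when the data structure is built for one access pattern.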

Slide 19

Slide 19 text

Enough high level - concrete problems please!
•Challenge: Idle is no-good!
•Challenge: One Data-Structure to rule them all?
•Challenge: How to test a library
•Challenge: What’s needed for a 20000% performance improvement

Slide 20

Slide 20 text

Challenge: Idle is no-good
•Building an index is a CPU & IO intensive task
•Lucene is full of indexes (that’s basically all it does)
•Ultimate Goal is to scale up with CPUs and saturate IO at the same time
•Keep your code complexity in mind
•Other people might need to maintain / extend this
Don’t go crazy!

Slide 21

Slide 21 text

Here is the problem
WTF?

Slide 22

Slide 22 text

A closer look...
[Diagram labels: documents (doc) per thread; Thread State (×5); DocumentsWriter; IndexWriter; merge segments in memory; Merge on flush; Flush to Disk; Multi-Threaded; Single-Threaded; Directory]
Answer: it gives your threads a break and it’s having a drink with your slow-as-s**t IO System

Slide 23

Slide 23 text

Our Solution
[Diagram labels: documents (doc) per thread; DWPT (×5); DocumentsWriter; IndexWriter; Flush to Disk; Multi-Threaded; Directory]
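
The idea in the diagram, that each indexing thread owns its writer and flushes its own segment without waiting for the others, can be sketched roughly as follows. This is a simplified illustration and not Lucene's DocumentsWriterPerThread implementation; `PerThreadWriter`, `index` and the in-memory "segments" are hypothetical stand-ins, and a "flush" here only hands a buffer to a shared queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class PerThreadWriterSketch {
    /** Buffers documents privately for one thread; no lock is needed while adding. */
    static final class PerThreadWriter {
        private final int maxBufferedDocs;
        private final Queue<List<String>> flushedSegments;
        private List<String> buffer = new ArrayList<>();

        PerThreadWriter(int maxBufferedDocs, Queue<List<String>> flushedSegments) {
            this.maxBufferedDocs = maxBufferedDocs;
            this.flushedSegments = flushedSegments;
        }

        void addDocument(String doc) {
            buffer.add(doc);
            if (buffer.size() >= maxBufferedDocs) {
                flush();
            }
        }

        /** Publishes the private buffer as its own segment; other threads keep indexing meanwhile. */
        void flush() {
            if (!buffer.isEmpty()) {
                flushedSegments.add(buffer);
                buffer = new ArrayList<>();
            }
        }
    }

    /** Runs several indexing threads, each with its own writer, and returns all flushed segments. */
    static Queue<List<String>> index(int threads, int docsPerThread, int maxBufferedDocs) {
        Queue<List<String>> segments = new ConcurrentLinkedQueue<>();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                PerThreadWriter writer = new PerThreadWriter(maxBufferedDocs, segments);
                for (int d = 0; d < docsPerThread; d++) {
                    writer.addDocument("doc-" + id + "-" + d);
                }
                writer.flush(); // flush the remainder
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println("segments: " + index(4, 10, 3).size()); // prints segments: 16
    }
}
```

The only shared state is the queue of finished segments, so threads never block each other while buffering documents.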

Slide 24

Slide 24 text

The Result
[Chart: Indexing Ingest Rate over time with Lucene 4.0 & DWPT]
Indexing 7 Million 4 KB Wikipedia documents vs. 620 sec on 3.x

Slide 25

Slide 25 text

Challenge: One Data-Structure to Rule them all?
•Like most other systems that write data structures to disk, Lucene didn’t expose them for extension
•Major problem for researchers and engineers who know what they are doing
•Special use-cases need special solutions
•Unique ID Field usually is a 1 to 1 key to document mapping
•Holding a posting list pointer is wasteful
•Term lookup + disk seek vs. Term lookup + read
•Research is active in this area (integer encoding for instance)

Slide 26

Slide 26 text

10000 ft view
[Diagram labels: IndexWriter, IndexReader, Directory, FileSystem]

Slide 27

Slide 27 text

Introducing an extra layer
[Diagram labels: IndexWriter, IndexReader, Flex API, Codec, Directory, FileSystem]

Slide 28

Slide 28 text

For Backwards Compatibility you know?
[Diagram labels: Available Codecs (Lucene 3, Lucene 4, Lucene 5); segments with fields “id” and “title”, one each for the Lucene 3, Lucene 4 and Lucene 5 codecs; Index Writer << merge >> Index; Index Reader << read >> Index]

Slide 29

Slide 29 text

Using the right tool for the job..
Switching to Memory PostingsFormat

Slide 30

Slide 30 text

Using the right tool for the job..
Switching to BlockTreeTermIndex

Slide 31

Slide 31 text

Challenge: How to test a library
•A library typically:
•has lots of interfaces & abstract classes
•has tons of parameters
•needs to handle user input gracefully
•Ideally we test all combinations of interfaces, parameters and user inputs?
•Yeah - right!

Slide 32

Slide 32 text

What’s wrong with Unit Tests?
•Short answer: Nothing!
•But...
•1 Run == 1000 Runs? (only cover regression?)
•Boundaries are rarely reached
•Waste of CPU cycles
•Tests usually run against a single implementation
•How to test against the full Unicode-Range?

Slide 33

Slide 33 text

An Example
The method to test:
The test:
The result:

Slide 34

Slide 34 text

Can it fail?
It can! ...after 53139 Runs
•Boundaries are everywhere
•There is no positive value for Integer.MIN_VALUE
•But how to repeat / debug?
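
The code shown on the slide is not preserved in this transcript. As a hypothetical illustration of the boundary named above (not the original example), the sketch below shows how `Math.abs(Integer.MIN_VALUE)` stays negative and how that can break a bucket computation. `bucketNaive` and `bucketSafe` are made-up names.

```java
class AbsBoundarySketch {
    /** Hypothetical method under test: looks fine, but breaks for exactly one input. */
    static int bucketNaive(int hash, int numBuckets) {
        return Math.abs(hash) % numBuckets;
    }

    /** One possible fix: floorMod never returns a negative result for a positive divisor. */
    static int bucketSafe(int hash, int numBuckets) {
        return Math.floorMod(hash, numBuckets);
    }

    public static void main(String[] args) {
        // Two's complement has no positive counterpart for Integer.MIN_VALUE
        System.out.println(Math.abs(Integer.MIN_VALUE));        // prints -2147483648
        System.out.println(bucketNaive(Integer.MIN_VALUE, 10)); // prints -8
        System.out.println(bucketSafe(Integer.MIN_VALUE, 10));  // prints 2
    }
}
```

A test that feeds uniformly random hashes hits this input only about once in 2^32 runs, which is why such a bug can survive many green builds.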

Slide 35

Slide 35 text

Solution: A Randomized UnitTest Framework
•Disclaimer: this stuff has been around for ages - not our invention!
•Random selection of:
•Interface Implementations
•Input Parameters like # iterations, # threads, # cache sizes, intervals, ...
•Random Valid Unicode Strings (Breaking JVM for fun and profit)
•Throttling IO
•Random Low Level Data-Structures
•And many more...
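
The question of how to repeat and debug a random failure is usually answered with seeds: every random choice in a run derives from one seed, and a failing run reports that seed so it can be replayed. The sketch below shows only this principle with plain `java.util.Random`; it is not the framework used by Lucene, and `midpoint`, `checkOnce` and `firstFailingSeed` are hypothetical names.

```java
import java.util.OptionalLong;
import java.util.Random;

class RandomizedTestSketch {
    /** Hypothetical method under test: overflows when lo + hi exceeds Integer.MAX_VALUE. */
    static int midpoint(int lo, int hi) {
        return (lo + hi) / 2;
    }

    /** One randomized check. All randomness derives from the seed, so a failure can be replayed. */
    static boolean checkOnce(long seed) {
        Random random = new Random(seed);
        int a = random.nextInt(Integer.MAX_VALUE);
        int b = random.nextInt(Integer.MAX_VALUE);
        int lo = Math.min(a, b);
        int hi = Math.max(a, b);
        int mid = midpoint(lo, hi);
        return lo <= mid && mid <= hi;
    }

    /** Runs up to maxRuns checks with seeds drawn from a master seed; returns the first failing seed. */
    static OptionalLong firstFailingSeed(long masterSeed, int maxRuns) {
        Random seeds = new Random(masterSeed);
        for (int run = 0; run < maxRuns; run++) {
            long seed = seeds.nextLong();
            if (!checkOnce(seed)) {
                return OptionalLong.of(seed);
            }
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        OptionalLong failing = firstFailingSeed(System.nanoTime(), 1000);
        if (failing.isPresent()) {
            // Re-running checkOnce with this seed reproduces the exact same failing input
            System.out.println("FAILED, reproduce with seed " + failing.getAsLong());
        } else {
            System.out.println("all runs passed");
        }
    }
}
```

A fixed-input unit test with small values would never trip this bug; the randomized version finds it quickly because roughly half of all random pairs overflow.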

Slide 36

Slide 36 text

Make sure your unit tests fail - eventually!
•Framework is built for Lucene
•Currently factored out into a general purpose framework
•Check it out on:
•Wanna help the Lucene Project?
•Run our tests and report the failure!

Slide 37

Slide 37 text

Challenge: What’s needed for a 20k% Performance improvement.
BEER! FUN! COFFEE!

Slide 38

Slide 38 text

The Problem: Fuzzy Search
•Retrieve all documents containing a given term within a Levenshtein Distance of <= 2
•Given: a sorted dictionary of terms
•Trivial Solution: Brute Force - filter(terms, LD(2, queryTerm))
•Problem: it’s damn slow!
•O(t) terms examined, t = number of terms in all docs for that field. Exhaustively compares each term. We would prefer O(log₂ t) instead.
•O(n²) comparison function, n = length of term. Levenshtein dynamic programming. We would prefer O(n) instead.
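
The brute-force approach described above can be sketched as follows: a textbook dynamic-programming Levenshtein distance applied to every term in the dictionary. This is an illustration of the cost model, not Lucene's code; `levenshtein` and `fuzzyFilter` are hypothetical names, and the distance is computed over UTF-16 `char` values rather than code points.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BruteForceFuzzySketch {
    /** Classic dynamic-programming edit distance: O(|a| * |b|) time, two rows of memory. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /** Examines every term: O(t) distance computations for t terms. */
    static List<String> fuzzyFilter(List<String> terms, String query, int maxDistance) {
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            if (levenshtein(term, query) <= maxDistance) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("cat", "dig", "dog", "dogs", "dugs", "zebra");
        System.out.println(fuzzyFilter(terms, "dogs", 1)); // prints [dog, dogs, dugs]
    }
}
```

Every query pays for the whole dictionary, even though most terms (such as “zebra” here) could be ruled out after their first character.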

Slide 39

Slide 39 text

Solution: Turn Queries into Automatons
•Read a crazy Paper about building Levenshtein Automata and implement it. (sounds easy - right?)
•Only explore subtrees that can lead to an accept state of some finite state machine.
•AutomatonQuery traverses the term dictionary and the state machine in parallel
•Imagine the index as a state machine that recognizes Terms and transduces matching Documents.
•AutomatonQuery represents a user’s search needs as an FSM.
•The intersection of the two emits search results

Slide 40

Slide 40 text

Solution: Turn Queries into Finite State Machines
Finite-State Queries in Lucene (Robert Muir)
[Diagram: Example DFA for “dogs”, Levenshtein Distance 1; transition labels include d, o, g and the ranges \u0000-f, g, h-n, o, p-\uffff; accepts: “dugs”]
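
The principle of walking the term dictionary and a state machine in parallel, and skipping subtrees that cannot reach an accept state, can be sketched with a simplified stand-in. The sketch does not build the precompiled Levenshtein DFA from the paper; it uses a row of the edit-distance table as the automaton state and a plain trie as the term dictionary. All names (`Node`, `step`, `canStillMatch`, `fuzzySearch`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FuzzyTrieSketch {
    /** Trie node standing in for the sorted term dictionary. */
    static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean isTerm;
    }

    static Node buildTrie(List<String> terms) {
        Node root = new Node();
        for (String term : terms) {
            Node node = root;
            for (char c : term.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.isTerm = true;
        }
        return root;
    }

    /** State transition: consumes one character and returns the next edit-distance row. */
    static int[] step(int[] row, String query, char c) {
        int[] next = new int[row.length];
        next[0] = row[0] + 1;
        for (int j = 1; j < row.length; j++) {
            int cost = query.charAt(j - 1) == c ? 0 : 1;
            next[j] = Math.min(Math.min(next[j - 1] + 1, row[j] + 1), row[j - 1] + cost);
        }
        return next;
    }

    /** A state is dead once every entry exceeds maxDistance: no continuation can be accepted. */
    static boolean canStillMatch(int[] row, int maxDistance) {
        for (int v : row) {
            if (v <= maxDistance) {
                return true;
            }
        }
        return false;
    }

    /** Walks trie and state machine together, pruning subtrees whose state is dead. */
    static void search(Node node, String prefix, int[] row, String query, int maxDistance, List<String> out) {
        if (node.isTerm && row[row.length - 1] <= maxDistance) {
            out.add(prefix); // accept state reached at the end of a real term
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            int[] next = step(row, query, e.getKey());
            if (canStillMatch(next, maxDistance)) {
                search(e.getValue(), prefix + e.getKey(), next, query, maxDistance, out);
            }
        }
    }

    static List<String> fuzzySearch(List<String> terms, String query, int maxDistance) {
        int[] start = new int[query.length() + 1];
        for (int j = 0; j < start.length; j++) {
            start[j] = j;
        }
        List<String> out = new ArrayList<>();
        search(buildTrie(terms), "", start, query, maxDistance, out);
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("cat", "dig", "dog", "dogs", "dugs", "zebra");
        System.out.println(fuzzySearch(terms, "dogs", 1)); // prints [dog, dogs, dugs]
    }
}
```

With the query “dogs” and distance 1, the walk abandons the “ca...” and “z...” subtrees after at most two characters. A precompiled DFA goes further by replacing the per-character row computation with a table lookup.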

Slide 41

Slide 41 text

Turns out to be a massive improvement!
In Lucene 3 this is about 0.1 - 0.2 QPS

Slide 42

Slide 42 text

Berlin Buzzwords 2012
•Conference on High-Scalability, NoSQL and Search
•600+ Attendees, 50 Sessions, Trainings etc.

Slide 43

Slide 43 text

Who are we? Who builds Lucene?
•There is no Who!
•There is no such thing as “the creator of Lucene”
•It’s a true team effort
•That and only that makes Lucene what it is today!

Slide 44

Slide 44 text

Questions anybody?