way is impossible without sacrifices at the full moon. • MARC isn’t a sensible way, but it’s the best we’ve got. • marcout.exe spits out 99% good stuff, 1% nonsense.
which takes care of it all. • Requires some TLC on the output to deal with character encoding (MARC-8 sucks). • Build a huge array of stuff for each record, using MARC tags as index names.
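A minimal Python sketch of that tag-keyed structure (the tags, subfield layout and values here are illustrative, not the exact array the PHP code builds):

```python
# Sketch: one bibliographic record as a dict keyed by MARC tag.
# Repeatable tags map to lists of fields; subfields are keyed by code.
record = {
    "020": [{"a": "9780123456789"}],
    "100": [{"a": "Brock, B."}],
    "245": [{"a": "Problems with badgers?"}],
}

# Everything from the MARC record is preserved, even the fields
# most users will never ask for.
titles = [field["a"] for field in record.get("245", [])]
print(titles)  # ['Problems with badgers?']
```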
but: • The vast majority of users don’t care about the MARC record. • They just want the information needed to find and cite a work. • Extract this ‘simple’ information into its own array.
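Deriving that ‘simple’ view might look like this in Python (the helper name and field choices are assumptions for illustration):

```python
# Sketch: flatten the tag-keyed record into a 'simple' citation array.
record = {
    "020": [{"a": "9780123456789"}],
    "100": [{"a": "Brock, B."}],
    "245": [{"a": "Problems with badgers?"}],
}

def first_subfield(rec, tag, code):
    """Return the first matching subfield value, or None."""
    for field in rec.get(tag, []):
        if code in field:
            return field[code]
    return None

simple = {
    "title": first_subfield(record, "245", "a"),
    "author": first_subfield(record, "100", "a"),
    "isbn": first_subfield(record, "020", "a"),
}
print(simple["title"])  # Problems with badgers?
```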
Accepts pure JSON as the input method. • PHP library accepts nested arrays and JSONifies them for you. • Makes getting data into the database dead easy.
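In Python terms (the PHP driver does the JSON step for you; this is just a sketch of the idea, with made-up field values):

```python
import json

# Sketch: a nested record serialised to JSON, ready for insertion.
doc = {
    "bib": 21084,
    "simple": {"title": "Problems with badgers?"},
}
payload = json.dumps(doc)

# Nested structures round-trip intact, so storing a whole record
# is one call with no schema ceremony.
assert json.loads(payload)["simple"]["title"] == "Problems with badgers?"
```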
JSON. • {"bib":21084} • {"simple":{"title":"Problems with badgers?"}} • Users can use preformed query fields, or potentially write their own. • No need to add complex query builders to APIs.
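A preformed query field can be sketched as a one-line helper (the function name is hypothetical):

```python
import json

# Sketch: the client fills in a whitelisted field; the query itself
# is just JSON, so the API needs no query-builder layer.
def title_query(title):
    return json.dumps({"simple": {"title": title}})

print(title_query("Problems with badgers?"))
# {"simple": {"title": "Problems with badgers?"}}
```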
up to 15MB per second per core. • Searches a 1,000,000-record, 1.2GB test index at 500 searches a second. • Largest known index is 5 billion records. • Powers Craigslist.
Mongo (about 4 seconds). • Sphinx will also happily index SQL databases. • Tell the Sphinx indexer to reindex it (0.5 seconds). • We have around 64,585 records in our test set, searchable on title, author, ISBN and Dewey number.
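The indexer side can be sketched in sphinx.conf; the source name, export script and paths below are invented for illustration:

```
# Records exported from Mongo are fed to Sphinx via xmlpipe2;
# a 'type = mysql' source with an sql_query works the same way
# for SQL databases.
source catalogue
{
    type            = xmlpipe2
    xmlpipe_command = php export_records.php
}

index catalogue
{
    source = catalogue
    path   = /var/data/sphinx/catalogue
}
```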
exact form, field-specific, strict order, proximity, quorum, phrase... • Supports custom field weightings. • Even does SQL queries. • Average search completes in under 0.0005 secs.
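A few examples of Sphinx’s extended query syntax (the field and search terms are invented; the operators are standard):

```
@title badgers            field-specific match
"problems with badgers"   phrase
"badgers setts"~5         proximity: terms within 5 words of each other
"badgers foxes otters"/2  quorum: any 2 of the 3 terms
=badgers                  exact form (no stemming)
problems << badgers       strict order
```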
of distributed indexes. • Can also be used to provide ‘universal search’, since indexes can be non-homogeneous. • ePrints, blog posts, journals and more are indexed individually but can be searched collectively.
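In sphinx.conf terms, ‘universal search’ is a distributed index spanning heterogeneous local or remote indexes; the index names and host below are invented:

```
index universal
{
    type  = distributed
    local = catalogue
    local = eprints
    local = blogposts
    agent = journals.example.org:9312:journals
}
```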
Copac, MOSAIC, LibraryThing...) • E-Journals integration (including full TOC searching) • EPrints integration (including full-summary, and possibly full-text, searching)