Overview of Jerome

The slides from my quick presentation at Chips and Mash on the University of Lincoln's unproject Jerome.

Nick Jackson

July 30, 2010

Transcript

  1. A quick overview
    Nick Jackson, Alex Bilbie, Paul Stainthorp, Chris Leach
    @jacksonj04, @alexbilbie, @pstainthorp, @chrisl1953
  2. "If I had asked people what they wanted, they would have said faster horses."
    Henry Ford
  3. Horizon
    • Getting data out of Horizon in a sensible way is impossible without sacrifices at the full moon.
    • MARC isn't a sensible way, but it's the best we've got.
    • marcout.exe splurges 99% good stuff, 1% nonsense.
  4. Parsing It All
    • File_MARC is a PEAR (PHP) library which takes care of it all.
    • Requires some TLC on the output to deal with character encoding (MARC-8 sucks).
    • Build a huge array of stuff for each record, using MARC tags as index names.
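The "huge array keyed by MARC tag" idea can be sketched roughly as follows. This is an illustrative Python sketch, not the project's PHP code: the `index_by_tag` helper, the `(tag, subfields)` input shape, and the sample data are all assumptions made for the example, and File_MARC's real output structure may differ.

```python
from collections import defaultdict


def index_by_tag(fields):
    """Group (tag, subfields) pairs into a dict keyed by MARC tag.

    Tags such as 650 can repeat, so each tag maps to a list of fields.
    Simplification: a subfield code repeated inside one field keeps only
    its last value.
    """
    record = defaultdict(list)
    for tag, subfields in fields:
        record[tag].append(dict(subfields))
    return dict(record)


# Hypothetical parsed fields for one record (illustrative data only).
fields = [
    ("020", [("a", "0000000000")]),
    ("100", [("a", "Example, Author")]),
    ("245", [("a", "Problems with badgers?")]),
    ("650", [("a", "Badgers")]),
    ("650", [("a", "Wildlife pests")]),
]

record = index_by_tag(fields)
print(sorted(record))
```

Keeping a list per tag is a deliberate choice here, since collapsing repeated tags into a single value would silently drop subject headings and similar repeatable fields.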
  5. Sensible Output
    • Storing the whole parsed MARC is useful, but:
    • The vast majority of users don't care about the MARC record.
    • They just want the information needed to find and cite a work.
    • Extract this 'simple' information in its own array.
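A minimal sketch of pulling the 'simple' information out of a tag-indexed record might look like this. The slides do not say which fields Jerome actually reads, so the mapping below is an assumption based on common MARC 21 usage (245$a title, 100$a main author, 020$a ISBN, 082$a Dewey number); the helper names and sample record are hypothetical.

```python
def first_subfield(record, tag, code):
    """Return the first value of subfield `code` in any `tag` field, or None."""
    for field in record.get(tag, []):
        if code in field:
            return field[code]
    return None


def simplify(record):
    """Build the small 'find and cite' view of a parsed MARC record."""
    return {
        "title": first_subfield(record, "245", "a"),
        "author": first_subfield(record, "100", "a"),
        "isbn": first_subfield(record, "020", "a"),
        "dewey": first_subfield(record, "082", "a"),
    }


# Hypothetical parsed record: MARC tag -> list of {subfield code: value}.
parsed = {
    "100": [{"a": "Example, Author"}],
    "245": [{"a": "Problems with badgers?"}],
    "082": [{"a": "599.76"}],
}

simple = simplify(parsed)
print(simple)
```

Missing fields come back as `None` rather than raising, which suits catalogue data where not every record carries an ISBN or a classification number.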
  6. MongoDB 101
    • Document database: No fixed data structure.
    • Accepts pure JSON as the input method.
    • PHP library accepts nested arrays and JSONifies them for you.
    • Makes getting data into the database dead easy.
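To make the "nested arrays become JSON" point concrete, here is a small Python sketch of what one stored document could look like. The field names `bib` and `simple` come from the query examples in the deck; the `marc` key and all values are assumptions for illustration, and the sketch only serialises to JSON with the standard library rather than talking to a real database.

```python
import json

# Hypothetical document for one work: the full parsed MARC sits next to
# the 'simple' view, with no schema declared up front.
document = {
    "bib": 21084,
    "simple": {
        "title": "Problems with badgers?",
        "author": "Example, Author",
    },
    "marc": {
        "245": [{"a": "Problems with badgers?"}],
        "100": [{"a": "Example, Author"}],
    },
}

# Nested dicts and lists serialise directly; a driver would do the
# equivalent conversion when inserting the document.
as_json = json.dumps(document, indent=2)
print(as_json)
```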
  7. Mongo Makes APIs Easy
    • Mongo also accepts queries as JSON.
    • {"bib":21084}
    • {"simple":{"title":"Problems with badgers?"}}
    • Users can use preformed query fields, or potentially write their own.
    • No need to add complex query builders to APIs.
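Passing user-written JSON straight through to the database is what keeps the API simple, but an API would probably want to check it first. The sketch below is one possible guard, not Jerome's implementation: `ALLOWED_FIELDS`, `parse_query`, and the rule of rejecting `$`-prefixed keys are all assumptions for the example.

```python
import json

# Hypothetical whitelist of top-level fields the API lets users query.
ALLOWED_FIELDS = {"bib", "simple"}


def _reject_operators(value):
    """Refuse keys starting with '$' (for example $where) at any depth."""
    if isinstance(value, dict):
        for key, inner in value.items():
            if key.startswith("$"):
                raise ValueError(f"operators are not allowed: {key}")
            _reject_operators(inner)
    elif isinstance(value, list):
        for inner in value:
            _reject_operators(inner)


def parse_query(raw):
    """Parse a user-supplied JSON query and check it against the whitelist."""
    query = json.loads(raw)
    if not isinstance(query, dict):
        raise ValueError("query must be a JSON object")
    _reject_operators(query)
    for key in query:
        # Accept dotted paths such as "simple.title" by checking the root.
        if key.split(".", 1)[0] not in ALLOWED_FIELDS:
            raise ValueError(f"unsupported query field: {key}")
    return query


print(parse_query('{"bib": 21084}'))
print(parse_query('{"simple.title": "Problems with badgers?"}'))
```

One detail worth knowing when writing queries by hand: in MongoDB a nested form like `{"simple": {"title": ...}}` matches only when the embedded document is exactly that object, while the dotted form `"simple.title"` matches on the single field regardless of its siblings.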
  8. Sphinx 101
    • Lightning fast search server.
    • Indexes at up to 15MB per second per core.
    • Searches a 1,000,000-record, 1.2GB test index at 500 searches a second.
    • Largest known index is 5 billion records.
    • Powers Craigslist.
  9. XML in, Index Out
    • Export an XML file from Mongo (about 4 seconds).
    • Sphinx will also happily index SQL databases.
    • Tell the Sphinx indexer to reindex it (0.5 seconds).
    • We have around 64585 records in our test set, searchable on title, author, ISBN and Dewey number.
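The export step could be sketched as below, assuming the XML is Sphinx's xmlpipe2 format (the slides only say "an XML file", so that is a guess). The four field names follow the searchable fields listed on the slide; `to_xmlpipe2` and the sample record are hypothetical, and the real export would read records from the database rather than from a list.

```python
from xml.sax.saxutils import escape

# Full-text fields, matching the four searchable fields on the slide.
FIELDS = ("title", "author", "isbn", "dewey")


def to_xmlpipe2(records):
    """Render (doc_id, simple_dict) pairs as an xmlpipe2 document stream."""
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
    ]
    lines += [f'<sphinx:field name="{name}"/>' for name in FIELDS]
    lines.append("</sphinx:schema>")
    for doc_id, simple in records:
        # Sphinx expects a unique positive integer ID per document.
        lines.append(f'<sphinx:document id="{int(doc_id)}">')
        for name in FIELDS:
            value = simple.get(name) or ""
            lines.append(f"<{name}>{escape(str(value))}</{name}>")
        lines.append("</sphinx:document>")
    lines.append("</sphinx:docset>")
    return "\n".join(lines)


# Illustrative record only.
xml = to_xmlpipe2([(21084, {"title": "Problems with badgers?", "author": "A & B"})])
print(xml)
```

Escaping each value matters here, because catalogue titles and author names routinely contain ampersands and angle brackets that would otherwise break the indexer's XML parser.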
  10. Searching...
    • Supports quite complex query forms.
    • OR, NOT, exact form, field-specific, strict order, proximity, quorum, phrase...
    • Supports custom field weightings.
    • Even does SQL queries.
    • Average search completes in under 0.0005 secs.
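For a feel of what those query forms look like, here is a small Python sketch that builds query strings in Sphinx's extended query syntax as I understand it. The helper functions are hypothetical conveniences, not part of any Sphinx client library, and they do no escaping of special characters in user input.

```python
def field(name, expr):
    """Field-specific search: @title badgers"""
    return f"@{name} {expr}"


def phrase(words):
    """Phrase search: "problems with badgers" """
    return f'"{words}"'


def proximity(words, distance):
    """Proximity search: words within `distance` of each other."""
    return f'"{words}"~{distance}'


def quorum(words, threshold):
    """Quorum search: at least `threshold` of the words must match."""
    return f'"{words}"/{threshold}'


def any_of(*terms):
    """OR: (badgers | foxes)"""
    return "(" + " | ".join(terms) + ")"


def exclude(term):
    """NOT: -hedgehogs"""
    return f"-{term}"


def strict_order(*terms):
    """Strict order: terms must appear in this order."""
    return " << ".join(terms)


def exact_form(word):
    """Exact form: match the word as written, without stemming."""
    return f"={word}"


query = " ".join([field("title", phrase("problems with badgers")), exclude("hedgehogs")])
print(query)
```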
  11. Distributed Goodness
    • Sphinx adds horizontal scaling in the form of distributed indexes.
    • Can also be used to provide 'universal search', since indexes can be non-homogeneous.
    • ePrints, blog posts, journals and more are indexed individually but can be searched collectively.
  12. "Everything you're about to see on screen was generated on the computer inside this bag..."
    Steve Jobs
  13. To The Web!
    • Portal: Specific work
    • Portal: Live search
    • API: Specific work
    • API: Works matching a Dewey number
    • API: Searching
  14. What's Next For Jerome?
    • 3rd party integration (Amazon, Google, Copac, MOSAIC, LibraryThing...)
    • E-Journals integration (including full TOC searching)
    • EPrints integration (including full summary, and possibly full text searching)
  15. Want More?
    • Follow our progress on Twitter: #Jerome
    • Read our blog: http://jerome.blogs.lincoln.ac.uk