www.bl.uk
Context
• Three collections:
  – Selective, since 2004
  – Legal Deposit, since 2013
  – Historical, 1996-2013, from the Internet Archive (IA)
• Iterative development:
  – Work directly with researchers
  – Today’s historical research tools provide tomorrow’s reading rooms
• Using Solr to support:
  – Discovery
  – Preservation
  – Analytics
Discovery
• Web archives tend to be messy
  – Lots of poor-quality content, e.g. from crawler traps
  – Spam, e.g. link spam from link farms
  – Utility of PageRank over time is unclear
• Faceted search
  – Invest in developing facets that allow filtering, rather than relying on PageRank or boosts to rank results
  – e.g. basic facets from embedded metadata:
    • Last-Modified, Author, etc.
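As a rough illustration of the faceted approach, the sketch below builds the query string for a Solr `/select` request that asks for facet counts and applies a filter. The parameters `q`, `rows`, `fq`, `facet`, `facet.field` and `facet.mincount` are standard Solr parameters; the field names `author` and `last_modified_year` are assumptions made for this example and are not taken from the actual index schema.

```python
from urllib.parse import urlencode


def build_facet_query(text, facet_fields, filters=None, rows=10):
    """Build the query string for a faceted Solr /select request.

    facet_fields: fields to return facet counts for (hypothetical names).
    filters: optional {field: value} pairs, each sent as a filter query (fq).
    """
    params = [
        ("q", text),
        ("rows", rows),
        ("facet", "true"),
        ("facet.mincount", 1),  # hide facet values with zero hits
    ]
    params += [("facet.field", field) for field in facet_fields]
    for field, value in (filters or {}).items():
        # Simplification: assumes the value contains no double quotes.
        params.append(("fq", f'{field}:"{value}"'))
    return urlencode(params)


# Hypothetical field names, for illustration only.
query = build_facet_query(
    "olympics",
    facet_fields=["author", "last_modified_year"],
    filters={"author": "BBC"},
)
print(query)
```

Filter queries narrow the result set without affecting relevance scoring, which fits the idea of letting researchers filter rather than depending on ranking.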
Discovery: Text features
• No stemming or lemmatization
  – Researchers hated it
• Natural language detection
  – e.g. gov.uk + fr
• Postcode-based geoindex
• Sentiment analysis
• Similarity hashing via ssdeep
  – To detect similar texts
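One way the first step of a postcode-based geoindex could look is sketched below: pull UK-style postcodes out of page text and keep the outward code (the district part) as an indexable value. The regular expression is a deliberately simplified pattern, not the full postcode specification, and the function names are invented for this example. Mapping a postcode to coordinates would need a gazetteer lookup, which is not shown.

```python
import re

# Simplified UK postcode pattern: outward code (e.g. "NW1", "LS23", "SW1A")
# followed by inward code (e.g. "2DB"). It will accept some invalid codes.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\b")


def extract_postcodes(text):
    """Return postcodes found in text, normalised to 'OUTWARD INWARD'."""
    return [
        f"{m.group(1)} {m.group(2)}"
        for m in POSTCODE_RE.finditer(text.upper())
    ]


def outward_codes(text):
    """Return the distinct outward codes (districts) in order of appearance."""
    districts = []
    for postcode in extract_postcodes(text):
        district = postcode.split()[0]
        if district not in districts:
            districts.append(district)
    return districts


sample = "96 Euston Road, London NW1 2DB, or Boston Spa, Wetherby LS23 7BQ."
print(extract_postcodes(sample))
print(outward_codes(sample))
```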
Analytics
• Researcher expectations
  – “How big is the UK Web?”
• From crawl to Web
  – Crawl schedule, parameters, logs
  – “Files over 10MB are not archived”
  – De-duplication handling is critical
  – Can't forget HTTP 30x, 40x, 50x responses
• Compensate via normalisation strategies
  – cf. Google Books Ngram
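A minimal sketch of one such normalisation, in the spirit of the Google Books Ngram approach: divide the number of matching documents in each year by the total number of documents archived that year, so that trends reflect the share of the crawl and not its size. The function name and all counts below are made up for illustration.

```python
def normalise_by_year(match_counts, total_counts):
    """Return, per year, the fraction of archived documents that match.

    match_counts: {year: documents matching the query}
    total_counts: {year: all documents archived for that year}
    Years with no archived documents are skipped to avoid dividing by zero.
    """
    shares = {}
    for year, total in sorted(total_counts.items()):
        if total > 0:
            shares[year] = match_counts.get(year, 0) / total
    return shares


# Made-up numbers: raw matches grow eightfold, but the crawl grew
# sixteenfold, so the normalised share actually halves.
matches = {2000: 50, 2005: 400}
totals = {2000: 1_000, 2005: 16_000}
shares = normalise_by_year(matches, totals)
print(shares)
```

The totals themselves depend on the crawl factors listed above (size limits, de-duplication, non-200 responses), so the choice of denominator matters as much as the division.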
Technical Architecture
• Core indexer can run from the CLI or on Hadoop
  – Makes development much easier
• Hadoop indexer has two modes:
  – SolrCloud:
    • Performance acceptable as long as shards map to cores and there's good I/O (1 billion, 1 server, 1 week)
    • Memory issues relating to query complexity
  – Direct to HDFS:
    • Really fast for moderate data volumes
    • Slows down as shards grow
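To put the SolrCloud figure in perspective, the back-of-envelope calculation below converts "1 billion, 1 server, 1 week" into a sustained rate. It assumes that "1 billion" refers to documents indexed and that the week is continuous wall-clock time; neither is stated on the slide.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds

# Assumption: "1 billion" means one billion documents indexed.
documents = 1_000_000_000

rate = documents / SECONDS_PER_WEEK
print(f"{rate:.0f} documents per second")  # about 1653
```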
Scale
• 1996-2010 tranche of the IA dataset:
  – 2.5 billion HTTP 200 URLs
• Performance issues:
  – Data quality
  – Robustness
  – Configuration errors
• Currently re-indexing:
  – with better duplicate handling
  – on three dedicated servers
Open Collaboration
• Fully open source stack:
  – webarchive-discovery indexer
  – Begun developing an analytics UI
• Keen to collaborate
  – This community faces a common problem:
    • But not a core SolrCloud/ElasticSearch use case
  – Danish SolrCloud on SSD discovered via Solr mailing list
    • http://sbdevel.wordpress.com/2013/12/06/danish-webscale/