Large-Scale Web Archive Discovery & Analytics Using Apache Solr

Andrew Jackson, UK Web Archive Technical Lead

www.bl.uk

Context
• Three collections:
  – Selective since 2004
  – Legal Deposit since 2013
  – Historical 1996-2013 from the Internet Archive (IA)
• Iterative Development:
  – Work directly with researchers
  – Today’s historical research tools provide tomorrow’s reading rooms
• Using Solr to support:
  – Discovery
  – Preservation
  – Analytics

Discovery
• Web archives tend to be messy
  – Lots of poor quality content, e.g. from crawler traps.
  – Spam, e.g. link spam from link farms.
  – Utility of PageRank over time is unclear
• Faceted search
  – Invest in developing facets to allow filtering rather than PageRank or boosts to rank results.
  – e.g. basic facets from embedded metadata:
    • Last-Modified, Author, etc.
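
To make the faceted-search idea concrete, here is a minimal sketch of how a facet request to Solr might be assembled. The `facet`, `facet.field`, `facet.mincount` and `fq` parameters are standard Solr query parameters; the core name `webarchive` and the field names `author` and `last_modified_year` are assumptions for illustration, not names taken from the talk.

```python
from urllib.parse import urlencode


def build_facet_query(text, facet_fields, filters=None, rows=10):
    """Build the query string for a Solr /select request with field facets."""
    params = [
        ("q", text),
        ("rows", str(rows)),
        ("facet", "true"),
        ("facet.mincount", "1"),  # hide facet values with no matching documents
    ]
    # One facet.field parameter per field we want counts for.
    params += [("facet.field", field) for field in facet_fields]
    # Each selected facet value becomes a filter query (fq), which narrows
    # the result set without affecting relevance ranking.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return urlencode(params)


# Hypothetical field names and core name, for illustration only.
query = build_facet_query(
    "olympics", ["author", "last_modified_year"], {"author": "BBC"}
)
print("/solr/webarchive/select?" + query)
```

Filtering through `fq` rather than boosting keeps the ranking untouched, which matches the preference above for letting users narrow results themselves. The sketch does not escape quotes inside filter values.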

Discovery: HTML Links (also)

Discovery: Embedded Licenses

Discovery: Text features
• No stemming or lemmatization
  – Researchers hated it
• Natural language detection
  – e.g. gov.uk + fr
• Postcode-based geoindex
• Sentiment analysis
• Similarity hashing via ssdeep
  – To detect similar texts
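
As an illustration of the first step of a postcode-based geoindex, the sketch below pulls UK-style postcodes out of page text. The regular expression is a deliberately simplified assumption (it accepts some strings that are not valid postcodes), and the lookup from postcode to coordinates that a geoindex would also need is not shown. This is not necessarily how the indexer described here does it.

```python
import re

# Simplified outward + inward code pattern. An assumption for illustration:
# it will accept some strings that are not real postcodes.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}[0-9][0-9A-Z]?)\s*([0-9][A-Z]{2})\b")


def extract_postcodes(text):
    """Return the distinct postcode-like strings in text, in order of appearance."""
    found = []
    for outward, inward in POSTCODE_RE.findall(text.upper()):
        code = f"{outward} {inward}"  # normalise to a single space
        if code not in found:
            found.append(code)
    return found


sample = "The British Library, 96 Euston Road, London NW1 2DB. Also at Boston Spa, LS23 7BQ."
print(extract_postcodes(sample))
```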

Discovery: Image features
• Basic properties:
  – width, height, pixel count
• Face detection
  – Number of faces & location
• Dominant colour extraction
  – ‘Characteristic’ colours
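
One simple way to approximate dominant colours is to quantise each pixel into a coarse colour bucket and count the buckets. The sketch below does this on a plain list of RGB tuples. It is a swapped-in histogram technique for illustration, not the extraction method used in the talk, and the `levels` and `top` parameters are assumptions.

```python
from collections import Counter


def dominant_colours(pixels, levels=4, top=3):
    """Return the most common colours after coarse per-channel quantisation.

    pixels: iterable of (r, g, b) tuples with channel values 0-255.
    levels: number of buckets per channel (assumed to divide 256).
    """
    step = 256 // levels

    def quantise(rgb):
        # Snap each channel to the centre of its bucket.
        return tuple((c // step) * step + step // 2 for c in rgb)

    counts = Counter(quantise(p) for p in pixels)
    return [colour for colour, _ in counts.most_common(top)]


# Toy image: mostly reddish pixels with a few blue ones.
pixels = [(250, 10, 10)] * 5 + [(245, 5, 20)] * 3 + [(10, 10, 240)] * 2
print(dominant_colours(pixels, top=2))
```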

Preservation
• Format analysis:
  – Using extended MIME types (inc. version + charset):
    • Served
    • Apache Tika
    • DROID
  – First-four-bytes
  – File extension
• Examples
  – Understanding Unidentified Resources
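
The two cheapest format signals listed above, the first four bytes and the file extension, can be sketched as follows, together with a helper that assembles an extended MIME type string. The function names and the exact `; version=...` layout are assumptions for illustration; the real identification work is done by tools such as Apache Tika and DROID.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def first_four_bytes(payload: bytes) -> str:
    """Hex-encode the first four bytes of a payload (fewer if it is shorter)."""
    return payload[:4].hex()


def url_extension(url: str) -> str:
    """Lower-cased file extension from the URL path, ignoring the query string."""
    path = urlsplit(url).path
    return PurePosixPath(path).suffix.lower().lstrip(".")


def extended_mime(base, version=None, charset=None):
    """Combine a base MIME type with optional version and charset parameters."""
    parts = [base]
    if version:
        parts.append(f"version={version}")
    if charset:
        parts.append(f"charset={charset}")
    return "; ".join(parts)


print(first_four_bytes(b"%PDF-1.4\n"))
print(url_extension("http://example.org/docs/Report.PDF?download=1"))
print(extended_mime("application/pdf", version="1.4"))
```

Faceting on the first-four-bytes value is one way to group resources that no identification tool recognises, since files of the same unknown format often share a magic number.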

HTML Versions Over Time

Preservation
• Deeper characterisation
  – Software identifiers
  – (X)HTML: Elements Used
  – XML: Root Namespace
  – PDF: Apache Preflight
  – Apache Tika's parse errors
  – Will consider adding:
    • DRMLint (SCAPE)
    • JHOVE
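
Two of the characterisation features above, the set of elements used in an (X)HTML page and the root namespace of an XML document, can be sketched with the Python standard library. This is only an illustration of what the features capture, not the parsing stack the indexer itself uses.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class ElementCollector(HTMLParser):
    """Collect the distinct element names seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.elements = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method.
        self.elements.add(tag)


def elements_used(html):
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    return sorted(collector.elements)


def root_namespace(xml_text):
    """Return the namespace URI of the root element, or '' if it has none."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as '{uri}localname'.
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return ""


print(elements_used("<html><body><p>Hi<br/><MARQUEE>x</MARQUEE></p></body></html>"))
print(root_namespace('<feed xmlns="http://www.w3.org/2005/Atom"><title>t</title></feed>'))
```

Indexing the element list as a multi-valued field is what would allow the use of individual elements to be charted over time.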

Elements Over Time

PDF/A Validation Errors

Parse Errors

Analytics
• Researcher Expectations
  – “How big is the UK Web?”
• From Crawl To Web
  – Crawl schedule, parameters, logs.
  – “Files over 10MB are not archived”
  – De-duplication handling critical
  – Can't forget HTTP 30x, 40x, 50x
• Compensate via normalisation strategies
  – cf. Google Books Ngram
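
A basic normalisation strategy in the spirit of the Google Books Ngram viewer is to divide the number of matching resources in each year by the total number of resources crawled that year, so that growth in crawl size does not masquerade as growth in a term. The sketch below assumes per-year counts are already available (for example from a Solr facet); the numbers are invented purely for illustration.

```python
def relative_frequency(term_counts, total_counts):
    """Per-year share of resources matching a term.

    term_counts:  {year: number of resources matching the term}
    total_counts: {year: total number of resources crawled that year}
    Years with no crawled resources are skipped to avoid division by zero.
    """
    return {
        year: term_counts.get(year, 0) / total
        for year, total in sorted(total_counts.items())
        if total > 0
    }


# Invented counts: raw hits quadruple, but the crawl grew eightfold.
term_counts = {2005: 50, 2010: 200}
total_counts = {2005: 1000, 2010: 8000}
print(relative_frequency(term_counts, total_counts))
```

In this toy case the raw count rises while the normalised share halves, which is the kind of distortion the crawl parameters above can introduce.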

Technical Architecture
• Core indexer can run from CLI or Hadoop
  – Makes development much easier
• Hadoop indexer has two modes:
  – SolrCloud:
    • Performance acceptable as long as shards map to cores and there's good I/O (1 billion, 1 server, 1 week)
    • Memory issues relating to query complexity
  – Direct to HDFS:
    • Really fast for moderate data volumes
    • Slows down as shards grow
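
To put the "1 billion, 1 server, 1 week" figure in perspective, the short calculation below converts it into an average rate. It assumes the figure means one billion documents indexed on a single server over seven days of continuous running, which the slide does not state explicitly.

```python
def docs_per_second(documents, days):
    """Average sustained indexing rate over a period of continuous running."""
    return documents / (days * 24 * 60 * 60)


# Assumption: 1 billion documents indexed in 7 days on one server.
rate = docs_per_second(1_000_000_000, 7)
print(f"{rate:.0f} documents per second")
```

That works out to an average of roughly 1,650 documents per second.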

Scale
• 1996-2010 tranche of the IA dataset:
  – 2.5 billion HTTP 200 URLs
• Performance issues:
  – Data quality
  – Robustness
  – Configuration errors
• Currently re-indexing:
  – with better duplicate handling
  – on three dedicated servers
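
One common approach to duplicate handling is to key each capture on its URL plus a digest of the payload, so that repeated crawls of unchanged content collapse into a single index document that carries a list of crawl dates. The sketch below illustrates that idea only; the key choice, the use of SHA-1 and the field names are assumptions, and the talk does not say this is how the re-indexing works.

```python
import hashlib


def merge_duplicates(captures):
    """Collapse captures with the same URL and payload into one document.

    captures: iterable of (url, crawl_date, payload_bytes) tuples.
    Returns a list of dicts, one per distinct (url, payload digest) pair.
    """
    merged = {}
    for url, crawl_date, payload in captures:
        digest = hashlib.sha1(payload).hexdigest()
        key = (url, digest)
        doc = merged.setdefault(
            key, {"url": url, "digest": digest, "crawl_dates": []}
        )
        # Keep every crawl date so the capture history is not lost.
        doc["crawl_dates"].append(crawl_date)
    return list(merged.values())


captures = [
    ("http://example.org/", "2004-01-01", b"hello"),
    ("http://example.org/", "2005-01-01", b"hello"),
    ("http://example.org/", "2006-01-01", b"changed"),
]
docs = merge_duplicates(captures)
print(len(docs), docs[0]["crawl_dates"])
```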

Open Collaboration
• Fully open source stack:
  – webarchive-discovery indexer
  – Begun developing an analytics UI
• Keen to collaborate
  – This community faces a common problem:
    • But not a core SolrCloud/ElasticSearch use case
  – Danish SolrCloud on SSD discovered via Solr mailing list
    • http://sbdevel.wordpress.com/2013/12/06/danish-webscale/

Thank you
