Large-Scale Web Archive Discovery & Analytics Using Apache Solr

Presentation given at the 2014 IIPC GA

Andy Jackson

May 20, 2014

Transcript

  1. Large-Scale Web Archive Discovery & Analytics Using Apache Solr
    Andrew Jackson
    UK Web Archive Technical Lead

  2. www.bl.uk 2
    Context
    •  Three collections:
    –  Selective since 2004
    –  Legal Deposit since 2013
    –  Historical 1996-2013 from IA
    •  Iterative Development:
    –  Work directly with researchers
    –  Today’s historical research tools provide tomorrow’s reading rooms
    •  Using Solr to support:
    –  Discovery
    –  Preservation
    –  Analytics

  3. www.bl.uk 3
    Discovery
    • Web archives tend to be messy
    – Lots of poor quality content, e.g. from crawler traps.
    – Spam, e.g. link spam from link farms.
    – Utility of PageRank over time is unclear
    • Faceted search
    – Invest in developing facets to allow filtering rather than PageRank or boosts to rank results.
    – e.g. basic facets from embedded metadata:
    • Last-Modified, Author, etc.
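
As a rough illustration of the facet-first approach on this slide, the sketch below builds a Solr select URL that asks for facet counts and narrows results with a filter query instead of relying on ranking. The `q`, `rows`, `fq`, `facet`, `facet.field` and `facet.mincount` parameters are standard Solr request parameters; the core name `webarchive`, the field names `author` and `last_modified_year`, and the host are assumptions made up for the example, not the schema used in the talk.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, facet_fields, filters=None, rows=10):
    """Build a Solr /select URL that returns facet counts with the results."""
    params = [
        ("q", text),
        ("rows", str(rows)),
        ("facet", "true"),
        ("facet.mincount", "1"),  # hide facet values with no matching documents
    ]
    # One facet.field parameter per field we want counts for.
    for field in facet_fields:
        params.append(("facet.field", field))
    # Each chosen facet value becomes a filter query (fq), which narrows
    # the result set without affecting relevance scoring.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names, for illustration only.
url = build_facet_query(
    "http://localhost:8983/solr/webarchive",
    "olympics",
    ["author", "last_modified_year"],
    filters={"author": "BBC"},
)
print(url)
```

Using `fq` for each selected facet value keeps filtering separate from scoring, which fits the slide's point that filtering matters more than ranking for messy archive content.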

  4. www.bl.uk 4
    Discovery: HTML Links
    (also)

  5. www.bl.uk 5
    Discovery: Embedded Licenses

  6. www.bl.uk 6
    Discovery: Text features
    • No stemming or lemmatization
    – Researchers hated it
    • Natural language detection
    – e.g. gov.uk + fr
    • Postcode-based geoindex
    • Sentiment analysis
    • Similarity hashing via ssdeep
    – To detect similar texts
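
The postcode-based geoindex could start from something like the sketch below, which pulls UK-style postcodes out of page text with a regular expression. The pattern is deliberately simplified and will miss or over-match some edge cases; it is an assumption for illustration, not the pattern the indexer uses. Mapping each postcode to coordinates would need a separate lookup table, which is not shown.

```python
import re

# Simplified UK postcode pattern: outward code (e.g. "NW1", "LS23")
# followed by inward code (e.g. "2DB"). Illustrative only.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}[0-9][0-9A-Z]?)\s*([0-9][A-Z]{2})\b")


def extract_postcodes(text):
    """Return postcodes found in text, normalised to 'OUTWARD INWARD' form."""
    return [f"{outward} {inward}" for outward, inward in POSTCODE_RE.findall(text)]


sample = "The British Library, 96 Euston Road, London NW1 2DB. Also at Boston Spa, LS23 7BQ."
postcodes = extract_postcodes(sample)
print(postcodes)
```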

  7. www.bl.uk 7
    Discovery: Image features
    • Basic properties:
    – width, height, pixel count
    • Face detection
    – Number of faces & location
    • Dominant colour extraction
    – ‘Characteristic’ colours

  8. www.bl.uk 8
    Preservation
    •  Format analysis:
    – Using extended MIME types (inc. version + charset):
    • Served
    • Apache Tika
    • DROID
    – First-four-bytes
    – File extension
    • Examples
    – Understanding Unidentified Resources
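
Two of the cheaper format signals listed above, the first four bytes and the file extension, can be sketched as follows. The function name and the output keys are hypothetical; the served, Tika and DROID identifications would come from the HTTP headers and those tools respectively, and are not reproduced here.

```python
import posixpath
from urllib.parse import urlparse


def format_signals(url, payload):
    """Derive simple format hints from a resource's URL and raw bytes."""
    # Hex string of the first four bytes, usable as a facet value even
    # when no identification tool recognises the format.
    first_four = payload[:4].hex()
    # Take the extension from the URL path only, ignoring any query string.
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower().lstrip(".")
    return {"first_four_bytes": first_four, "extension": extension}


signals = format_signals("http://example.org/docs/Report.PDF?x=1", b"%PDF-1.4\n")
print(signals)
```

Faceting on these values is one way to group otherwise unidentified resources so that the most common unknown signatures can be examined first.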

  9. www.bl.uk 9
    HTML Versions Over Time

  10. www.bl.uk 10
    Preservation
    •  Deeper characterisation
    – Software identifiers
    – (X)HTML: Elements Used
    – XML: Root Namespace
    – PDF: Apache Preflight
    – Apache Tika's parse errors
    – Will consider adding:
    • DRMLint (SCAPE)
    • JHOVE

  11. www.bl.uk 11
    Elements Over Time

  12. www.bl.uk 12
    PDF/A Validation Errors

  13. www.bl.uk 13
    Parse Errors

  14. www.bl.uk 14
    Analytics
    • Researcher Expectations
    – “How big is the UK Web?”
    • From Crawl To Web
    – Crawl schedule, parameters, logs.
    – “Files over 10MB are not archived”
    – De-duplication handling critical
    – Can't forget HTTP 30x, 40x, 50x
    • Compensate via normalisation strategies
    – cf. Google Books Ngram
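
One simple normalisation strategy, in the spirit of the Google Books Ngram approach mentioned above, is to divide the number of matching resources in each year by the total number of resources crawled that year, so that trends reflect relative frequency rather than crawl size. The sketch below assumes per-year counts are already available (for example from facet counts); the numbers are made up for illustration.

```python
def normalise_by_year(term_counts, total_counts):
    """Return the fraction of each year's crawl that matches a term."""
    return {
        year: term_counts.get(year, 0) / total
        for year, total in sorted(total_counts.items())
        if total > 0  # skip years with no crawled resources
    }


# Made-up counts: raw hits rise sixfold, but the crawl grew twelvefold.
term_counts = {2005: 50, 2010: 300}
total_counts = {2005: 1000, 2010: 12000, 2012: 0}
relative = normalise_by_year(term_counts, total_counts)
print(relative)
```

In this made-up case the raw hit count rises while the relative frequency halves, which is the kind of crawl-size effect the slide warns about.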

  15. www.bl.uk 15
    Technical Architecture
    • Core indexer can run from CLI or Hadoop
    – Makes development much easier
    • Hadoop indexer has two modes:
    – SolrCloud:
    • Performance acceptable as long as shards map to cores and there's good I/O (1 billion, 1 server, 1 week)
    • Memory issues relating to query complexity
    – Direct to HDFS:
    • Really fast for moderate data volumes
    • Slows down as shards grow

  16. www.bl.uk 16
    Scale
    • 1996-2010 Tranche of the IA dataset:
    – 2.5 Billion HTTP 200 URLs
    • Performance issues:
    – Data quality
    – Robustness
    – Configuration errors
    • Currently re-indexing:
    – with better duplicate handling
    – on three dedicated servers

  17. www.bl.uk 17
    Open Collaboration
    • Fully open source stack:
    – webarchive-discovery indexer
    – Begun developing an analytics UI
    • Keen to collaborate
    – This community faces a common problem:
    • But not a core SolrCloud/ElasticSearch use case
    – Danish SolrCloud on SSD discovered via Solr mailing list
    • http://sbdevel.wordpress.com/2013/12/06/danish-webscale/

  18. www.bl.uk 18
    Thank you
