
The HathiTrust Research Center (HTRC): Exploration of the World's First Massive Digital Library

This is our talk for the Digital HPS Workshop held in Bloomington, IN on Sept 6, 2013.

Robert H. McDonald

September 05, 2013

Transcript

  1. The HathiTrust Research Center (HTRC): Exploration of the World’s First Massive Digital Library
    Digital HPS Workshop | IMU | 09.06.13
    Beth Plale – @bplale – IU School of Informatics and Computing/IU D2I Center
    Robert H. McDonald – @mcdonald – IU Libraries/IU D2I Center
    Miao Chen – IU D2I Center
    Tweet us – @HathiTrust #HTRC
  2. 09.06.13 Digital HPS Workshop #HTRC @HathiTrust
    HathiTrust Digital Library
    • HathiTrust Digital Library is a partnership of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
      – http://www.hathitrust.org
      – IU is a founding member of the HathiTrust along with University of Michigan, University of California, and the University of Virginia.
  3. • HathiTrust is a large corpus providing opportunity for new forms of computational investigation.
    • The bigger the data, the less able we are to move it to a researcher’s desktop machine.
    • Future research on large collections will require that computation moves to the data, not vice versa.
  4. HTRC Mission
    • Public research arm of the HathiTrust Digital Library
    • Help researchers world-wide to accomplish tera-scale text data-mining and analysis
      – Develop cutting-edge software tools for processing and analyzing text
      – Develop cyberinfrastructure to enable HPC access to the HathiTrust Digital Library
    • Established: July 2011
    • Collaborative center: Indiana University & University of Illinois
  5. Non-Consumptive Research Paradigm
    • No action or set of actions on the part of users, either acting alone or in cooperation with other users over the duration of one or multiple sessions, can result in sufficient information being gathered from a collection of copyrighted works to reassemble pages from the collection.
    • The definition disallows collusion between users and accumulation of material over time. It differentiates a human researcher from a proxy, which is not a user. Users are human beings.
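To make the idea concrete, here is a minimal toy sketch, not HTRC's implementation: the analysis runs next to the page text, and only a checked, derived result is released. The names `ToyCapsule`, `token_counts`, and the release policy are all hypothetical. A simple output check like this does not by itself satisfy the definition above, since it does nothing about collusion or accumulation across sessions.

```python
from collections import Counter


class ToyCapsule:
    """Hypothetical stand-in for a secured environment that holds page text."""

    def __init__(self, pages):
        self._pages = list(pages)  # raw text is meant to stay inside this object

    def run(self, analysis):
        """Run `analysis` next to the data and release only a checked result."""
        return self._release(analysis(self._pages))

    @staticmethod
    def _release(result):
        # Toy policy (an assumption): only {alphabetic token: int count} may leave.
        if not isinstance(result, dict):
            raise ValueError("only token-count mappings may be released")
        for token, count in result.items():
            if not (isinstance(token, str) and token.isalpha()
                    and isinstance(count, int)):
                raise ValueError("result holds something other than token counts")
        return dict(result)


def token_counts(pages):
    """Example analysis: word counts over all pages (word order is discarded)."""
    words = (w for page in pages for w in page.lower().split() if w.isalpha())
    return dict(Counter(words))


# Made-up page text, used only for illustration.
capsule = ToyCapsule(["call me ishmael", "some years ago"])
released = capsule.run(token_counts)

try:
    capsule.run(lambda pages: pages)  # attempts to return the raw page text
    blocked = False
except ValueError:
    blocked = True
```

The researcher receives `released` (counts only), while the attempt to return the pages themselves is rejected.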
  6. HTRC Governance
    • Reports to the HathiTrust Board of Governors
    • HTRC Executive Committee
      – J. Stephen Downie (Co-director), Professor and Associate Dean for Research, University of Illinois GSLIS
      – Beth Plale (Co-director and Chair), Director, Data to Insight Center, and Professor in the School of Informatics and Computing at Indiana University
      – Robert H. McDonald, Associate Dean of Libraries/Deputy Director, Data to Insight Center at Indiana University
      – Beth Sandore Namachchivaya, Associate University Librarian for Information Technology Planning & Policy at the University of Illinois
      – John Unsworth, Vice Provost for Library & Technology Services and Chief Information Officer at Brandeis University
    • HTRC Advisory Board (see members on next slide)
    • Google Public Domain agreement – in place for IU and UIUC
  7. HTRC Advisory Board
    • Cathy Blake, University of Illinois, Urbana-Champaign
    • Beth Cate, Indiana University
    • Greg Crane, Tufts University
    • Laine Farley, California Digital Library
    • Brian Geiger, University of California at Riverside
    • David Greenbaum, University of California at Berkeley
    • Fotis Jannidis, University of Würzburg, Germany
    • Matthew Jockers, Stanford University
    • Jim Neal, Columbia University
    • Bill Newman, Indiana University
    • Bethany Nowviskie, University of Virginia
    • Andrey Rzhetsky, University of Chicago
    • Pat Steele, University of Maryland
    • Craig Stewart, Indiana University
    • David Theo Goldberg, University of California at Irvine
    • John Towns, National Center for Supercomputing Applications
    • Madelyn Wessel, University of Virginia
  8. HTRC Timeline
    • Phase I: 18-month development cycle
      – Began 01 July 2011
      – Demo of capability September 2012 (14-month mark) at HTRC UnCamp I
    • Phase II: broad availability of resource, begins 31 March 2013
      – New HTRC Asst. Director for Education and Outreach (Miao Chen)
      – New listserv to drive user input: htrc-usergroup-l @ list.indiana.edu
  9. HTRC Next Steps
    • Phase 2 availability of resource 31 March 2013
    • Thanks to: Photos from HTRC UnCamp 9.10.12 at Indiana University
  10. HTRC Phase 2: Current Thrusts
    • Grow HTRC User-base
      – Outreach and Engagement
        • Input from HTRC Advisory Board
        • Input from HT BOG
      – Town Hall Groups at DH, JCDL, JADH, DPLA, Educause
      – Online Town Hall Groups (forthcoming)
    • Develop New Specifications from User-Based Agile Development Methodology
    • Develop and Integrate Sloan Cloud Components into the HTRC Infrastructure
  11. HTRC Tech Stack Deployment Timeline
    Deliver: Mar 31, 2013
    • Sandbox stack (resides at UIUC): non-Google corpus (250,000 volumes), open access.
    • Production stack (resides at IU): v0.5 in place. Uses OAuth security. Public domain corpus. Shares Cassandra/Solr with dev stack. Minimal compute resources available.
    • Development stack (resides at IU): shares Cassandra/Solr with prod stack. Supports v0.1 of HTRC Sloan Cloud for non-consumptive support.
    Deliver: Jun 30, 2013
    • Sandbox stack (at UIUC): v1.0 stack but against non-Google corpus.
    • Production stack (at IU): v1.0 reflects extensive testing. OAuth for security. Public domain corpus. Shares Cassandra/Solr with dev stack. Support for parallel execution.
    • Development stack (at IU): shares Cassandra/Solr with prod stack. New services. v0.2 of Sloan non-consumptive support. Begin dev for InCommon and auditing.
    Deliver: Sep 30, 2013
    • Sandbox stack (at UIUC): v1.5; against non-Google corpus.
    • Production stack (at IU): v1.5. Supports InCommon in anticipation of copyright works. Public domain corpus. Separate Cassandra/Solr; public domain corpus.
    • Development stack (at IU): InCommon, auditing, and v1.0 of Sloan non-consumptive support. Security audit on development stack; verify ready for copyright materials.
    Deliver: Nov 30, 2013
    • Sandbox stack: retire (?)
    • Production stack (at UIUC or IU): v2.0. Supports InCommon in anticipation of copyright works. Public domain corpus. Separate Cassandra and Solr for public domain corpus.
    • Development stack (at IU or UIUC): dev stack ready for copyright materials.
  12. • Philosophy: computation moves to data
    • Web services architecture and protocols
    • Registry of services and algorithms
    • Solr full-text indexes
    • NoSQL store as volume store
    • OpenID authentication
    • Portal front-end, programmatic access
    • SEASR mining algorithms
  13. [Architecture diagram. Labels: Portal; Blacklight; programmatic access, e.g., WSO2 Identity Server; HTRC Data API v0.1; agent framework; agent instances; task deployment; WSO2 registry (services, collections, data capsule images); Solr index; volume store (Cassandra); page/volume tree (file system); SEASR analytics service; Meandre orchestration; non-consumptive data capsules; HathiTrust corpus; rsync; University of Michigan; NCSA local resources; Big Red II/IU Quarry; NSF XSEDE]
  14. [Architecture diagram. Labels: Analysis View; Data API access interface; Portal; Security (OAuth2 WSO2 IS); Algorithms and Worksets Registry (WSO2 GR); Application submission; Audit; Cassandra cluster volume store; Solr index; Entity Extraction; Topic Modeling; OpenNLP; Token count; Latent semantic analysis; High-level apps; Compute resources; Storage resources; Blacklight; User VM]
  15. HTRC Quick View
    [Screenshots. Labels: Faceted Search; Workset Builder; Algorithm Viewer; Workset Details]
  16. [Diagram: HTRC Non-Consumptive Research Access. Labels: Researcher; Request for VM; VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Virtual Cloud; SSH; Non-consumptive Output Storage]
  17. Metadata Enhancement
    • Current metadata fields are MARC-based
      – E.g. publication date, authors, title, subject
    • MARC fields are fundamental
    • More fields of interest to users are needed for granular analytics (metadata enhancement)
    • Solicit user requirements and prioritize them for implementation
      – Mainly digital humanities uses for now
  18. Top Metadata Enhancement Items
    • In the first round of user requirement collection, the top 3 items were metadata related:
      – Word frequency count and document length for a volume
      – Metadata de-duplication
      – Author gender analysis
    • We have added word count and gender fields to HTRC metadata, and more are being planned and investigated.
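As an illustration of the first item, the sketch below computes a word frequency count and a document length for one volume. It is a minimal example under stated assumptions, not HTRC code: the function names, the regex tokenizer, and the sample pages are all hypothetical, and a real pipeline would need language-aware tokenization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def volume_stats(pages):
    """Return (word_frequencies, document_length) for one volume.

    `pages` is a list of page strings; the document length is the
    total number of tokens across all pages.
    """
    counts = Counter()
    for page in pages:
        counts.update(tokenize(page))
    return counts, sum(counts.values())


# Made-up two-page volume, used only for illustration.
pages = ["The whale. The sea.", "The ship and the whale"]
freqs, length = volume_stats(pages)
print(freqs.most_common(2), length)
```

Storing both values as metadata fields lets a researcher filter or normalize by volume size without reading the page text.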
  19. Other Metadata Enhancement Items
    • Stats analysis: tf-idf
    • Readability score
    • Language
    • Topic modeling (e.g. LDA probability)
    • Genre
    • Era of compilation
    • Book length (e.g. short or long)
    • Concordance index (indexing with context)
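For readers unfamiliar with tf-idf, the following sketch shows one common variant: term frequency normalized by document length, multiplied by the log of the inverse document frequency. The slides do not say which weighting HTRC would use, so the formula choice, the function name `tf_idf`, and the tiny corpus are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up corpus of three tokenized documents.
docs = [
    ["whale", "sea", "whale"],
    ["sea", "ship"],
    ["sea", "garden"],
]
scores = tf_idf(docs)
```

In this toy corpus "sea" appears in every document, so its weight is zero everywhere, while "whale" is weighted highly in the first document because it is frequent there and absent elsewhere.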
  20. HTRC Upcoming Events
    • HathiTrust Research Center UnCamp – Sept. 8-9, 2013 – University of Illinois – SOLD OUT
    • JADH 2013 – Sept. 19-21, 2013 – Kyoto, Japan
    • DPLAfest 2013 – Oct. 25, 2013 – Boston, MA
    • Ohio State University – Library Symposium – Oct. 2013
    • Educause Annual Conference 2013 – Oct. 16, 2013 – Anaheim, CA
  21. Thank You
    • This presentation was made possible with content provided by many HTRC colleagues: John Unsworth, J. Stephen Downie, Robert H. McDonald, Beth Sandore, Yiming Sun, Miao Chen, Guangchen Ruan, Loretta Auvil, Kirk Hess, and many others…
    • The HTRC Non-Consumptive Research Grant is graciously funded by the Alfred P. Sloan Foundation
    • IU D2I-PTI is graciously funded by The Lilly Endowment, Inc.
    • HTRC - http://www.hathitrust.org/htrc
    • IU D2I Center - http://d2i.indiana.edu/
    • UIUC GSLIS - http://www.lis.illinois.edu/
  22. Contact Information
    • General Contact Info.
      – Beth Plale, Chair, HTRC Executive Committee
        • [email protected]
      – Robert H. McDonald, HTRC Executive Committee
        • [email protected]
    • Requests for capability, interest
      – Miao Chen, HTRC Asst. Director of Education and Outreach
        • [email protected]