The HathiTrust Research Center (HTRC): Exploration of the World's First Massive Digital Library

Overview slides on the HTRC for the Catapult Seminar on 11.11.13 in Bloomington, IN

Robert H. McDonald

November 11, 2013

Transcript

  1. The HathiTrust Research Center (HTRC): Exploration of the World’s First Massive Digital Library

    Catapult Workshop | 11/11/13 | Bloomington, IN
    Robert H. McDonald – @mcdonald – IU Libraries/IU D2I Center
    Beth Plale – @bplale – IU School of Informatics and Computing/IU D2I Center
    Miao Chen – IU D2I Center
    Tweet us: @HathiTrust #HTRC
  2. HathiTrust Digital Library

    • HathiTrust Digital Library is a partnership of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
      – http://www.hathitrust.org
      – IU is a founding member of the HathiTrust along with University of Michigan, University of California, and the University of Virginia.
  3. • HathiTrust is a large corpus providing opportunity for new forms of computational investigation.

    • The bigger the data, the less able we are to move it to a researcher’s desktop machine.
    • Future research on large collections will require that computation move to the data, not vice versa.
  4. HTRC Mission

    • Public research arm of the HathiTrust Digital Library
    • Help researchers world-wide to accomplish tera-scale text data-mining and analysis
      – Develop cutting-edge software tools for processing, analyzing text
      – Develop cyberinfrastructure to enable HPC access to the HathiTrust Digital Library
    • Established: July, 2011
    • Collaborative center: Indiana University & University of Illinois
  5. Non-Consumptive Research Paradigm

    • No action or set of actions on the part of users, either acting alone or in cooperation with other users over the duration of one or multiple sessions, can result in sufficient information being gathered from a collection of copyrighted works to reassemble pages from the collection.
    • The definition disallows collusion between users, or accumulation of material over time. It differentiates a human researcher from a proxy, which is not a user. Users are human beings.
  6. HTRC Governance

    • Reports to the HathiTrust Board of Governors
    • HTRC Executive Committee
      – J. Stephen Downie (Co-director), Professor and Associate Dean for Research, University of Illinois GSLIS
      – Beth Plale (Co-director and Chair), Director, Data to Insight Center, and Professor in the School of Informatics and Computing at Indiana University
      – Robert H. McDonald, Associate Dean of Libraries/Deputy Director, Data to Insight Center at Indiana University
      – Beth Sandore Namachchivaya, Associate University Librarian for Information Technology Planning & Policy at the University of Illinois
      – John Unsworth, Vice Provost for Library & Technology Services and Chief Information Officer at Brandeis University
    • HTRC Advisory Board (see members on next slide)
    • Google Public Domain agreement – in place for IU and UIUC
  7. HTRC Advisory Board

    • Cathy Blake, University of Illinois, Urbana-Champaign
    • Beth Cate, Indiana University
    • Greg Crane, Tufts University
    • Laine Farley, California Digital Library
    • Brian Geiger, University of California at Riverside
    • David Greenbaum, University of California at Berkeley
    • Fotis Jannidis, University of Würzburg, Germany
    • Matthew Jockers, Stanford University
    • Jim Neal, Columbia University
    • Bill Newman, Indiana University
    • Bethany Nowviskie, University of Virginia
    • Andrey Rzhetsky, University of Chicago
    • Pat Steele, University of Maryland
    • Craig Stewart, Indiana University
    • David Theo Goldberg, University of California at Irvine
    • John Towns, National Center for Supercomputing Applications
    • Madelyn Wessel, University of Virginia
  8. HTRC Timeline

    • Phase I: 18-month development cycle
      – Began 01 July 2011
      – Demo of capability September 2012 (14 mo mark) at HTRC UnCamp I
    • Phase II: broad availability of resource, begins 31 March 2013
      – New HTRC Asst. Director for Education and Outreach (Miao Chen)
      – New listserv to drive user input: htrc-usergroup-l @ list.indiana.edu
  9. HTRC Next Steps

    • Phase 2 availability of resource 31 March 2013
    • Thanks to: Photos from HTRC UnCamp 9.10.12 at Indiana University
  10. HTRC Phase 2: Current Thrusts

    • Grow HTRC User-base
      – Outreach and Engagement
        • Input from HTRC Advisory Board
        • Input from HT BOG
      – Town Hall Groups at DH, JCDL, JADH, DPLA, Educause
      – Online Town Hall Groups (forthcoming)
    • Develop New Specifications from User-Based Agile Development Methodology
    • Develop and Integrate Sloan Cloud Components into the HTRC Infrastructure
  11. HTRC Tech Stack Deployment Timeline

    Deliver: Mar 31, 2013
    • Sandbox stack (resides at UIUC): non-Google corpus (250,000 volumes), open access.
    • Production stack (resides at IU): v0.5 in place. Uses OAuth security. Public domain corpus. Shares Cassandra/Solr with dev stack. Minimal compute resources available.
    • Development stack (resides at IU): shares Cassandra/Solr with prod stack. Supports v0.1 of HTRC Sloan Cloud for non-consumptive support.

    Deliver: Jun 30, 2013
    • Sandbox stack (at UIUC): v1.0 stack but against non-Google corpus.
    • Production stack (at IU): v1.0 reflects extensive testing. OAuth for security. Public domain corpus. Shares Cassandra/Solr with dev stack. Support for parallel execution.
    • Development stack (at IU): shares Cassandra/Solr with prod stack. New services. v0.2 of Sloan non-consumptive support. Begin dev for InCommon and auditing.

    Deliver: Sep 30, 2013
    • Sandbox stack (at UIUC): v1.5; against non-Google corpus.
    • Production stack (at IU): v1.5. Supports InCommon in anticipation of copyright works. Public domain corpus. Separate Cassandra/Solr; public domain corpus.
    • Development stack (at IU): InCommon, auditing, and v1.0 of Sloan non-consumptive support. Security audit on development stack; verify ready for copyright materials.

    Deliver: Nov 30, 2013
    • Sandbox stack: retire (?)
    • Production stack (at UIUC or IU): v2.0. Supports InCommon in anticipation of copyright works. Public domain corpus. Separate Cassandra and Solr for public domain corpus.
    • Development stack (at IU or UIUC): dev stack ready for copyright materials.
  12. • Philosophy: computation moves to data

    • Web services architecture and protocols
    • Registry of services and algorithms
    • Solr full text indexes
    • NoSQL store as volume store
    • OpenID authentication
    • Portal front-end, programmatic access
    • SEASR mining algorithms
  13. [Diagram; labels recovered from the slide]

    Agent framework; Agent instance (×4); Page/volume tree (file system); Volume store (Cassandra) (×3); SEASR analytics service; Task deployment; WSO2 registry services, collections, data capsule images; Solr index; HathiTrust corpus; rsync; HTRC Data API v0.1; NCSA local resources; Programmatic access; WSO2 Identity Server; University of Michigan; Meandre Orchestration; Non-consumptive Data capsules; Big Red II/IU Quarry 15; Blacklight; NSF XSEDE; Portal
  14. [Diagram; labels recovered from the slide]

    Analysis View; Data API access interface; Portal; Security (OAuth2 WSO2 IS); Algorithms and Worksets Registry (WSO2 GR); Application submission; Audit; Cassandra cluster volume store; Solr index; Entity Extraction; Topic Modeling; OpenNLP; Token count; Latent semantic analysis; High level apps; Compute resources; Storage resources; Blacklight; User VM
  15. [Diagram; labels recovered from the slide]

    VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Virtual Cloud; SSH; Non-consumptive Output Storage; Researcher; HTRC Non-Consumptive Research Access; Request for VM
  16. Thank You

    • This presentation was made possible with content provided by many HTRC colleagues: Beth Plale, John Unsworth, J. Stephen Downie, Robert H. McDonald, Beth Sandore, Yiming Sun, Miao Chen, Guangchen Ruan, Loretta Auvil, Kirk Hess, and many others…
    • The HTRC Non-Consumptive Research Grant is graciously funded by the Alfred P. Sloan Foundation
    • IU D2I-PTI is graciously funded by The Lilly Endowment, Inc.
    • HTRC - http://www.hathitrust.org/htrc
    • IU D2I Center - http://d2i.indiana.edu/
    • UIUC GSLIS - http://www.lis.illinois.edu/
  17. Contact Information

    • General Contact Info.
      – Beth Plale, Chair-HTRC Executive Committee • [email protected]
      – Robert H. McDonald, HTRC Executive Committee • [email protected]
    • Requests for capability, interest
      – Miao Chen, HTRC Asst. Director of Education and Outreach • [email protected]
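
The "computation moves to the data" philosophy and the non-consumptive paradigm in the slides can be illustrated with a minimal sketch. It is not HTRC code: the in-memory `VOLUME_STORE`, the volume IDs, and the `token_counts` function are all hypothetical stand-ins, and the real volume store (Cassandra, per the slides) is not modeled. The point is only that the analysis runs next to the page text and hands back derived counts, never the pages themselves. Whether a given derived output actually satisfies the non-consumptive definition is a policy question this sketch does not settle.

```python
import re
from collections import Counter

# Hypothetical stand-in for a volume store: volume ID -> list of page texts.
# In this sketch the page text never leaves the "server side" function below.
VOLUME_STORE = {
    "vol-1": ["It was the best of times", "it was the worst of times"],
    "vol-2": ["Call me Ishmael"],
}


def token_counts(volume_ids, store=VOLUME_STORE):
    """Count lowercase alphabetic tokens across the requested volumes.

    Only the aggregate counts are returned to the caller; page order and
    page text are discarded.
    """
    counts = Counter()
    for volume_id in volume_ids:
        for page in store[volume_id]:
            counts.update(re.findall(r"[a-z]+", page.lower()))
    return dict(counts)


# The researcher receives a token -> count mapping, not the pages.
result = token_counts(["vol-1", "vol-2"])
```

A token count is one of the algorithms named on the analysis slide; the same shape (submit a job, receive derived data) would apply to the other listed analyses, under the same assumptions.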