Massive Digital Library Catapult Workshop | 11/11/13 Bloomington, IN
Robert H. McDonald - @mcdonald - IU Libraries/IU D2I Center
Beth Plale – @bplale - IU School of Informatics and Computing/IU D2I Center
Miao Chen – IU D2I Center
Tweet us - @HathiTrust #HTRC
Digital Library is a partnership of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
– http://www.hathitrust.org
– IU is a founding member of the HathiTrust along with University of Michigan, University of California, and the University of Virginia.
providing opportunity for new forms of computational investigation.
The bigger the data, the less able we are to move it to a researcher’s desktop machine.
Future research on large collections will require that computation move to the data, not vice versa.
arm of the HathiTrust Digital Library
• Help researchers world-wide to accomplish tera-scale text data-mining and analysis
– Develop cutting-edge software tools for processing and analyzing text
– Develop cyberinfrastructure to enable HPC access to the HathiTrust Digital Library
• Established: July, 2011
• Collaborative center: Indiana University & University of Illinois
action or set of actions on the part of users, either acting alone or in cooperation with other users over the duration of one or multiple sessions, can result in sufficient information gathered from the collection of copyrighted works to reassemble pages from the collection.
• Definition disallows collusion between users, or accumulation of material over time. Differentiates a human researcher from a proxy, which is not a user. Users are human beings.
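As a rough illustration of the kind of derived output this definition has in mind, the sketch below reduces a volume's pages to volume-level token counts and discards word order. It is only a sketch: the function name `token_counts` and the sample pages are hypothetical, the tokenizer is deliberately naive, and whether any particular output counts as non-consumptive is a policy decision, not something this code establishes.

```python
import re
from collections import Counter


def token_counts(pages):
    """Return volume-level token counts for a list of page strings.

    Hypothetical helper for illustration; not HTRC code.
    Word order and page boundaries are discarded, so the result
    is a bag of words, not readable text.
    """
    counts = Counter()
    for page in pages:
        # Naive tokenizer: lowercase runs of ASCII letters only.
        counts.update(re.findall(r"[a-z]+", page.lower()))
    return counts


# Made-up sample pages standing in for a volume's OCR text.
pages = ["The whale surfaced.", "The ship followed the whale."]
counts = token_counts(pages)
print(counts.most_common(2))
```

The researcher would receive only `counts`, never `pages`.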
the HathiTrust Board of Governors
• HTRC Executive Committee
– J. Stephen Downie (Co-director), Professor and Associate Dean for Research, University of Illinois GSLIS
– Beth Plale (Co-director and Chair), Director Data To Insight Center and professor in the School of Informatics and Computing at Indiana University
– Robert H. McDonald, Associate Dean of Libraries/Deputy Director Data to Insight Center at Indiana University
– Beth Sandore Namachchivaya, Associate University Librarian for Information Technology Planning & Policy at the University of Illinois
– John Unsworth, Vice Provost for Library & Technology Services and Chief Information Officer at Brandeis University
• HTRC Advisory Board (See members next slide)
• Google Public Domain agreement – in place for IU and UIUC
Blake, University of Illinois, Urbana-Champaign
• Beth Cate, Indiana University
• Greg Crane, Tufts University
• Laine Farley, California Digital Library
• Brian Geiger, University of California at Riverside
• David Greenbaum, University of California at Berkeley
• Fotis Jannidis, University of Würzburg, Germany
• Matthew Jockers, Stanford University
• Jim Neal, Columbia University
• Bill Newman, Indiana University
• Bethany Nowviskie, University of Virginia
• Andrey Rzhetsky, University of Chicago
• Pat Steele, University of Maryland
• Craig Stewart, Indiana University
• David Theo Goldberg, University of California at Irvine
• John Towns, National Center for Supercomputing Applications
• Madelyn Wessel, University of Virginia
18-month development cycle
– Began 01 July 2011
– Demo of capability September 2012 (14 mo mark) at HTRC UnCamp I
• Phase II: broad availability of resource, begins 31 March 2013
– New HTRC Asst. Director for Education and Outreach (Miao Chen)
– New listserv to drive user input: htrc-usergroup-l @ list.indiana.edu
• Grow HTRC User-base
– Outreach and Engagement
• Input from HTRC Advisory Board
• Input from HT BOG
– Town Hall Groups at DH, JCDL, JADH, DPLA, Educause
– Online Town Hall Groups (forthcoming)
• Develop New Specifications from User-Based Agile Development Methodology
• Develop and Integrate Sloan Cloud Components into the HTRC Infrastructure
HTRC Tech Stack Deployment Timeline

Deliver: Mar 31, 2013
UIUC): non-Google corpus (250,000 volumes), open access.
• Production stack (resides at IU): v0.5 in place. Uses OAuth security. Public domain corpus. Shares Cassandra/Solr with dev stack. Minimal compute resources available.
• Development stack (resides at IU): shares Cassandra/Solr with prod stack. Supports v0.1 of HTRC Sloan Cloud for non-consumptive support.

Deliver: Jun 30, 2013
• Sandbox stack (at UIUC): v1.0 stack but against non-Google corpus.
• Production stack (at IU): v1.0 reflects extensive testing. OAuth for security. Public domain corpus. Shares Cassandra/Solr with dev stack. Support for parallel execution.
• Development stack (at IU): shares Cassandra/Solr with prod stack. New services. v0.2 of Sloan non-consumptive support. Begin dev for InCommon and auditing.

Deliver: Sep 30, 2013
• Sandbox stack (at UIUC): v1.5; against non-Google corpus.
• Production stack (at IU): v1.5. Supports InCommon in anticipation of copyright works. Public domain corpus. Separate Cassandra/Solr for public domain corpus.
• Development stack (at IU): InCommon, auditing, and v1.0 of Sloan non-consumptive support. Security audit on development stack; verify ready for copyright materials.

Deliver: Nov 30, 2013
• Sandbox stack: retire (?)
• Production stack (at UIUC or IU): v2.0. Supports InCommon in anticipation of copyright works. Public domain corpus. Separate Cassandra and Solr for public domain corpus.
• Development stack (at IU or UIUC): dev stack ready for copyright materials.
data
• Web services architecture and protocols
• Registry of services and algorithms
• Solr full text indexes
• NoSQL store as volume store
• OpenID authentication
• Portal front-end, programmatic access
• SEASR mining algos
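To make the "Solr full text indexes" and "programmatic access" items concrete, here is a minimal sketch of building a full-text query URL for a Solr `select` handler. The host, the core name `volumes`, and the field names `ocr`, `id`, and `title` are assumptions made for illustration and do not describe HTRC's actual schema or endpoints; only the standard Solr query parameters (`q`, `fl`, `rows`, `wt`) are relied on. The sketch builds and inspects the URL without contacting any server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_solr_query(base_url, phrase, rows=10):
    """Build a Solr select URL for a phrase search.

    Field names ("ocr", "id", "title") are hypothetical.
    """
    params = {
        "q": f'ocr:"{phrase}"',  # phrase query against an assumed full-text field
        "fl": "id,title",        # fields to return (assumed names)
        "rows": rows,            # maximum number of hits
        "wt": "json",            # response format
    }
    return f"{base_url}/select?{urlencode(params)}"


# Placeholder host and core; not a real HTRC endpoint.
url = build_solr_query("http://solr.example.org/solr/volumes", "white whale", rows=5)
parsed = parse_qs(urlsplit(url).query)
print(url)
```

A client would typically send this URL with an HTTP GET and then use the returned volume identifiers to request data through whatever access layer sits in front of the volume store.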
[Figure: architecture diagram. Labels: HathiTrust corpus; University of Michigan; rsync; Volume store (Cassandra); Solr index; Blacklight; HTRC Data API v0.1; WSO2 registry (services, collections, data capsule images); WSO2 Identity Server; Portal; Programmatic access; Agent instances; Task deployment; Meandre Orchestration; SEASR analytics service; Non-consumptive Data capsules; NCSA local resources; Big Red II/IU Quarry; NSF XSEDE.]
[Figure: layered stack diagram. Labels: Portal; Blacklight; Security (OAuth2, WSO2 IS); Algorithms and Worksets Registry (WSO2 GR); Application submission; Audit; Cassandra cluster volume store; Solr index; High level apps: Entity Extraction, Topic Modeling, OpenNLP, Token count, Latent semantic analysis; Compute resources; Storage resources; User VM.]
[Figure: HTRC Non-Consumptive Research Access diagram. Labels: Researcher; Request for VM; SSH; Secure Virtual Cloud; VM Image Builder; VM Manager; VM instance; Store; Non-consumptive Output Storage.]
was made possible with content provided by many HTRC colleagues
Beth Plale, John Unsworth, J. Stephen Downie, Robert H. McDonald, Beth Sandore, Yiming Sun, Miao Chen, Guangchen Ruan, Loretta Auvil, Kirk Hess, and many others…
• The HTRC Non-Consumptive Research Grant is graciously funded by the Alfred P. Sloan Foundation
• IU D2I-PTI is graciously funded by The Lilly Endowment, Inc.
• HTRC - http://www.hathitrust.org/htrc
• IU D2I Center - http://d2i.indiana.edu/
• UIUC GSLIS - http://www.lis.illinois.edu/