The HathiTrust Research Center: Building Shared Computational Resources To Mine The Largest Academic Digital Library Corpus
This is the presentation that we gave at #EDU13 on Oct 16, 2013 with other members of the HTRC Executive Committee John Unsworth | @unsworth and Beth Sandore Namachchivaya.
to Mine the Largest Academic Digital Library Corpus Robert H. McDonald – Indiana University Beth Sandore Namachchivaya – University of Illinois John Unsworth – Brandeis University Educause Annual Meeting Anaheim, CA October 16, 2013 Tweet Us: #HTRC #SESS037 #EDU13
Allegheny College; Arizona State University; Baylor University; Boston College; Boston University; California Digital Library; Carnegie Mellon University; Columbia University; Cornell University; Dartmouth College; Duke University; Emory University; Florida State University; Getty Research Institute; Harvard University Library; Indiana University; Iowa State University; Johns Hopkins University; Lafayette College; Library of Congress; Massachusetts Institute of Technology; McGill University; Michigan State University; New York Public Library; New York University; North Carolina Central University; North Carolina State University; Northwestern University; The Ohio State University; The Pennsylvania State University; Princeton University; Purdue University; Stanford University; Syracuse University; Texas A&M University; Tufts University; Universidad Complutense de Madrid; University of Alabama; University of Arizona; University of Calgary; University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz); The University of Chicago; University of Connecticut; University of Delaware; University of Florida; University of Illinois at Chicago; The University of Iowa; University of Maryland; University of Miami; University of Michigan; University of Minnesota; University of Missouri; University of Nebraska-Lincoln; The University of North Carolina at Chapel Hill; University of Notre Dame; University of Oklahoma; University of Pennsylvania; University of Pittsburgh; University of Utah; University of Virginia; University of Washington; University of Wisconsin-Madison; Utah State University; Virginia Tech; Wake Forest University; Washington University; Yale University Library
Numbers • 10,819,596 total volumes • 5,672,046 book titles • 281,890 serial titles • 3,786,858,600 pages • 485 terabytes • 128 miles • 8,791 tons • 3,469,225 volumes (~32% of total) in the public domain
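The percentages on this slide can be re-derived from the raw counts. The short Python check below is only a sketch of that arithmetic; the variable names are illustrative and the figures are copied from the slide. The pages-per-volume ratio comes out to a round number, which suggests (but does not prove) that the page total is an estimate based on an average volume length.

```python
# Figures copied from the "Numbers" slide.
total_volumes = 10_819_596
public_domain_volumes = 3_469_225
total_pages = 3_786_858_600


def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


pd_share = share(public_domain_volumes, total_volumes)
pages_per_volume = total_pages / total_volumes

print(f"public domain share: {pd_share:.1f}%")
print(f"pages per volume:    {pages_per_volume:.0f}")
```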
Use • Search, collections, online access • APIs and data feeds – Data API – Bibliographic API – “Hathifiles” inventory files – OAI • Computational Research – Distribution of datasets – Protocol-based access – Research Center
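As an illustration of the Bibliographic API mentioned above, the sketch below only builds a request URL; it does not contact any server. The base URL and the `brief`/`full` path pattern are assumptions based on the publicly documented HathiTrust Bibliographic API and are not taken from the slides, and `bib_api_url` is a hypothetical helper name.

```python
from urllib.parse import quote

# Assumed base URL for the HathiTrust Bibliographic API (not stated in the slides).
BASE = "https://catalog.hathitrust.org/api/volumes"


def bib_api_url(id_type: str, identifier: str, detail: str = "brief") -> str:
    """Build a Bibliographic API URL for one identifier.

    id_type is an identifier scheme such as "oclc", "isbn" or "htid";
    detail selects the assumed "brief" or "full" record flavour.
    """
    if detail not in {"brief", "full"}:
        raise ValueError("detail must be 'brief' or 'full'")
    # Percent-encode the identifier so it is safe inside a URL path segment.
    return f"{BASE}/{detail}/{id_type}/{quote(identifier, safe='')}.json"


print(bib_api_url("oclc", "424023"))
```

The returned URL could then be fetched with any HTTP client; the response format should be checked against the current API documentation rather than assumed.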
HTRC • Provide a persistent and sustainable structure to enable scholars to ask and answer new questions. – Leverage data storage and computational infrastructure at Indiana & Illinois – Stimulate community development of new functionality and tools – Use tools to enable discoveries that would not be possible without the HTRC • Enable scholars to fully utilize content of HathiTrust Library while preventing intellectual property misuse within U.S. copyright law. – Provide a secure computational and data environment for scholars to perform research using HathiTrust Digital Library.
of Governors • Executive Committee • Executive Director HathiTrust University of Illinois Indiana University HathiTrust Research Center University of Michigan Data Copy #1 Data Copy #2
• Reports to the HathiTrust Board of Governors • HTRC Executive Committee – J. Stephen Downie (Co-director), Professor and Associate Dean for Research, University of Illinois GSLIS – Beth Plale (Co-director and Chair), Director, Data to Insight Center, and Professor in the School of Informatics and Computing at Indiana University – Robert H. McDonald, Associate Dean of Libraries/Deputy Director, Data to Insight Center at Indiana University – Beth Sandore Namachchivaya, Associate University Librarian for Information Technology Planning & Policy at the University of Illinois – John Unsworth, Vice Provost for Library & Technology Services and Chief Information Officer at Brandeis University • HTRC Advisory Board (see members on next slide) • Google Public Domain agreement – in place for IU and UIUC
Board • Cathy Blake, University of Illinois, Urbana-Champaign • Beth Cate, Indiana University • Greg Crane, Tufts University • Laine Farley, California Digital Library • Brian Geiger, University of California at Riverside • David Greenbaum, University of California at Berkeley • Fotis Jannidis, University of Würzburg, Germany • Matthew Jockers, Stanford University • Jim Neal, Columbia University • Bill Newman, Indiana University • Bethany Nowviskie, University of Virginia • Andrey Rzhetsky, University of Chicago • Pat Steele, University of Maryland • Craig Stewart, Indiana University • David Theo Goldberg, University of California at Irvine • John Towns, National Center for Supercomputing Applications • Madelyn Wessel, University of Virginia
In-copyright or undetermined 70% • Public Domain (worldwide) 15% • U.S. Federal Government Documents (worldwide) 4% • Public Domain (US) 10% • Open Access .1% • Creative Commons .01% • "Public Domain" 30%
Michigan 45% California 33% Wisconsin 5% Cornell 4% NYPL 3% Princeton 3% Indiana 2% Columbia 1% Harvard 1% LC 1% Madrid 1% Minnesota 1% Chicago 0% Duke 0% Illinois 0% NCSU 0% Northwestern 0% Penn State 0% Purdue 0% UNC-Chapel Hill 0% Utah State 0% Virginia 0% Yale 0%
48% German 9% French 7% Spanish 5% Chinese 4% Russian 4% Japanese 3% Italian 3% Arabic 2% Latin 1% Remaining Languages 14% Language Distribution The top 10 languages make up ~86% of all content
Bibliographic Data Content Package Indiana Michigan Bib Data Data Management Rights Data Storage Access Ingest Catalog Full-text Search PageTurner APIs Collections Holdings Data Datasets
• Strongly bound to US copyright issues with constant vigilance of the international scene • Status determinations via: – Bibliographic metadata – Automatic and manual rights determination
Determination • Conducted on all works at time of ingest and when records are modified – Public domain worldwide • US works published before 1923, US federal government publications, non-US works published prior to 1872 – Public domain in the United States • Non-US works published prior to 1923
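The date cutoffs above can be read as a small decision rule. The Python sketch below is an illustration of that reading only, not HathiTrust's actual determination code: the function name and parameters are hypothetical, the returned codes (`pd`, `pdus`, `ic`) follow the rights attribute names used elsewhere in the deck, and the `ic` fallback for everything else is an assumption, since the slide does not say how later works are classified.

```python
def bib_rights(pub_year: int, us_published: bool, us_federal_doc: bool = False) -> str:
    """Return a rights code from publication year and place (simplified sketch).

    "pd"   = public domain worldwide
    "pdus" = public domain only when viewed in the US
    "ic"   = assumed fallback; the slide does not specify this case
    """
    if us_federal_doc:
        # US federal government publications are listed as public domain worldwide.
        return "pd"
    if us_published:
        # US works published before 1923 are public domain worldwide.
        return "pd" if pub_year < 1923 else "ic"
    if pub_year < 1872:
        # Non-US works published prior to 1872 are public domain worldwide.
        return "pd"
    if pub_year < 1923:
        # Remaining non-US works prior to 1923 are public domain in the US only.
        return "pdus"
    return "ic"


for year, us in [(1900, True), (1860, False), (1900, False), (1950, False)]:
    print(year, "US" if us else "non-US", "->", bib_rights(year, us))
```

A real determination would also draw on the manual review and rights-holder permissions described on the following slides, so this rule covers only the automatic, metadata-driven step.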
Determination • IMLS-funded CRMS project – US-published works 1923-1963 – Conformance with formalities – Expanding to non-US works – Double-blind review with expert review for conflicts – Staff at 4 HathiTrust partner institutions (15 will take part in non-US) – As of February 2012 ~190,000 reviewed, more than 100,000 opened • Rights Holder Permissions
Rights Attributes
1   pd           copyright  public domain
2   ic           copyright  in-copyright
3   opb          copyright  out-of-print and brittle (implies in-copyright)
4   orph         copyright  copyright-orphaned (implies in-copyright)
5   und          copyright  undetermined copyright status
6   umall        access     available to UM affiliates and walk-in patrons (all campuses)
7   world        access     available to everyone in the world
8   nobody       access     available to nobody; blocked for all users
9   pdus         copyright  public domain only when viewed in the US
10  cc-by        copyright  Creative Commons Attribution
11  cc-by-nd     copyright  Creative Commons Attribution-NoDerivatives
12  cc-by-nc-nd  copyright  Creative Commons Attribution-NonCommercial-NoDerivatives
13  cc-by-nc     copyright  Creative Commons Attribution-NonCommercial
14  cc-by-nc-sa  copyright  Creative Commons Attribution-NonCommercial-ShareAlike
15  cc-by-sa     copyright  Creative Commons Attribution-ShareAlike
16  orphcand     copyright  orphan candidate - in 90-day holding period (implies in-copyright)
17  cc-zero      copyright  Creative Commons Zero license (implies pd)
18  und-world    copyright  Undetermined copyright status and permitted as world-viewable by the depositor
19  ic-us        copyright  In copyright in the US
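To show how attribute codes like these might drive an access decision, here is a deliberately simplified Python sketch. The grouping of codes into "viewable worldwide" and "viewable in the US only" is an assumption drawn from the descriptions in the list, not HathiTrust's actual access logic; codes that depend on institutional affiliation or other conditions (for example `umall`, `orph`, `ic-us`) are conservatively treated as not viewable here. The names `WORLD_VIEWABLE`, `US_ONLY` and `full_view_allowed` are hypothetical.

```python
# Codes whose descriptions suggest full view for everyone (assumed grouping).
WORLD_VIEWABLE = {
    "pd", "world", "cc-zero", "und-world",
    "cc-by", "cc-by-nd", "cc-by-nc-nd", "cc-by-nc", "cc-by-nc-sa", "cc-by-sa",
}

# Codes whose descriptions suggest full view only from within the US.
US_ONLY = {"pdus"}


def full_view_allowed(attr: str, in_us: bool) -> bool:
    """Return True if a volume with this rights attribute is fully viewable.

    Any code not listed above falls through to False in this simplified sketch.
    """
    if attr in WORLD_VIEWABLE:
        return True
    if attr in US_ONLY:
        return in_us
    return False


for code in ("pd", "pdus", "ic", "cc-by-nc"):
    print(code, full_view_allowed(code, in_us=True), full_view_allowed(code, in_us=False))
```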
1   bib   bibliographically-derived by automatic processes
2   ncn   no printed copyright notice
3   con   contractual agreement with copyright holder on file
4   ddd   due diligence documentation on file
5   man   manual access control override; see note for details
6   pvt   private personal information visible
7   ren   copyright renewal research was conducted
8   nfi   needs further investigation (copyright research partially complete; an ambiguous, unclear, or other time-consuming situation was encountered)
9   cdpp  title page or verso contain copyright date and/or place of publication information not in bib record
10  cip   condition review and in-print status research was conducted
11  unp   unpublished work
12  gfv   Google viewability set at VIEW_FULL
13  crms  derived from multiple reviews in the Copyright Review Management System (CRMS) via an internal resolution policy; consult CRMS records for details
14  add   author death date research was conducted or notification was received from authoritative source
15  exp   expiration of copyright term for non-US work with corporate author
16  del   Deleted from repository; see note for details
17  gatt  Non-US public domain work restored to in-copyright in the US by GATT
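work
• Public domain worldwide
– Searchable (bibliographic and full-text): Worldwide
– Viewable*: Worldwide
– Full-PDF download (Data API): Partners only if scanned by Google; if not, worldwide
– Print on Demand: Worldwide
– Print disabilities*: Partners worldwide
– Preservation uses (Section 108)*: N/A
• Public domain (US) – Non-US works published between 1872 and 1923
– Searchable (bibliographic and full-text): Worldwide
– Viewable*: When accessed from within the United States
– Full-PDF download (Data API): Partners in the US if scanned by Google; if not, anyone in the US
– Print on Demand: Available within the United States
– Print disabilities*: Partners in the US; partners worldwide where similar laws in effect
– Preservation uses (Section 108)*: N/A
• Works that rights holders have opened access to in HathiTrust
– Searchable (bibliographic and full-text): Worldwide
– Viewable*: Worldwide
– Full-PDF download (Data API): Worldwide (if digitized by Google, full-PDF only available if opened with CC license)
– Print on Demand: Worldwide with permission
– Print disabilities*: Partners worldwide
– Preservation uses (Section 108)*: N/A
• Works that are in-copyright or of undetermined status
– Searchable (bibliographic and full-text): Worldwide
– Viewable*: Not available
– Full-PDF download (Data API): Not available
– Print on Demand: Not available
– Print disabilities*: Partners in the US; partners worldwide where similar laws in effect
– Preservation uses (Section 108)*: Partners in the US; partners worldwide where similar laws in effect
• Orphan works
– Searchable (bibliographic and full-text): Worldwide
– Viewable*: To participating partners
– Full-PDF download (Data API): Not available
– Print on Demand: Not available
– Print disabilities*: Partners in the US
– Preservation uses (Section 108)*: Partners in the US; partners worldwide where similar laws in effect
* Note: Access to in-copyright works is subject to conditions on the Terms of Access slide.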
services and algorithms • Solr full-text indexes • NoSQL store as volume store • OpenID authentication • Portal front-end, programmatic access • Data mining algorithms
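To make the "Solr full-text indexes" item concrete, the sketch below builds a phrase query against Solr's standard `/select` handler without sending it. The host, core name (`htrc`) and field name (`ocr`) are hypothetical placeholders; the slides do not describe the actual HTRC index schema or endpoint.

```python
from urllib.parse import urlencode


def solr_select_url(base_url: str, phrase: str, rows: int = 10, field: str = "ocr") -> str:
    """Build a Solr /select URL that searches one field for an exact phrase.

    base_url is the URL of a Solr core; field is a hypothetical full-text field name.
    """
    # Escape embedded double quotes so the phrase stays a single quoted term.
    escaped = phrase.replace('"', '\\"')
    params = {"q": f'{field}:"{escaped}"', "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical local Solr core, used only for illustration.
print(solr_select_url("http://localhost:8983/solr/htrc", "comparative psychology", rows=5))
```

A query like this returns matching volume identifiers, which a worker could then use to pull page text from the volume store.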
framework Page/volume tree (file system) Volume store (Cassandra) SEASR analytics service Task deployment WSO2 registry services, collections, data capsule images Solr index HathiTrust corpus rsync HTRC Data API v0.1 NCSA local resources Programmatic access e.g., WSO2 Identity Server University of Michigan Meandre Orchestration Agent instance Agent instance Agent instance Agent instance Non-consumptive Data capsules Big Red II/IU Quarry 33 Blacklight Volume store (Cassandra) Volume store (Cassandra) NSF XSEDE Portal
VM Image Manager VM Image Store VM Image Builder VM Manager VM instance Secure Virtual Cloud SSH Non-consumptive Output Storage Researcher HTRC Research Access Request for VM
'anthropomorphism', and 'comparative psychology'. This set contains lots of books that are not of particular interest, e.g., books on theology, college course catalogs. Challenge: Find the philosophical arguments in a haystack of sentences. Colin Allen, Professor, Cognitive Science, Indiana University. Digging into Data 2011
at level of genes General study: understanding of how phenotypes, such as human healthy diversity and maladies, are implemented at the level of genes. Why HTRC: capture properties of language automatically, for text transformations and information extraction. Generalize grammatical and idiomatic patterns as related to systems biology. Andrey Rzhetsky, Professor, Department of Medicine, University of Chicago
and Proposals involving HTRC • Zdenek Zdrahal, “DiscoveryCORE, Discovering Hidden Relationships in Semantically Connected Resources”, NEH Digging Into Data Challenge. • Matthew Wilken, Notre Dame, “Literary Geography at Scale”, American Council of Learned Societies (ACLS). • Ichiro Fujinaga, “Single Interface for Music Score Searching and Analysis (SIMSSA)” to SSHRC, Canada. Pending. • Andrew Piper, Text Mining the Novel: Establishing the Foundations of a New Discipline, SSHRC, Canada. • Robert Liffe, University of Sussex, Textual Genomics Project (TTGP), United Kingdom Arts and Humanities Research Council. • Edie Rasmussen. From Indexer’s Legacy to Scholar’s Desktop. • Adam Farquhar, The British Library. IRIS, Arts and Humanities Research Council grant.
for Scholarly Analysis Funded at $493,000 by the Andrew W. Mellon Foundation; Co-PIs: J. Stephen Downie, Tim Cole, Beth Plale; 1 July 2013 - 30 June 2015. Goals: 1) enriching the metadata in the HathiTrust corpus, 2) augmenting string-based metadata with URIs to leverage discovery and sharing through external services, and 3) formalizing the notion of collections and worksets in the context of the HathiTrust Research Center. Includes an open, competitive Request for Proposals in November 2013, with the intent to fund four prototyping projects that will build tools for enriching and augmenting metadata for the HathiTrust corpus.
Cloud for Secure Text-Mining at Scale Funded at $606,000 by The Alfred P. Sloan Foundation; Beth Plale, Indiana University, PI; Atul Prakash, University of Michigan, Co-PI; Fall 2011 - Spring 2013. Goal: Prototype a system that enables secure text mining to be carried out at scale using public cloud resources, including: 1. a software cloud infrastructure based on OpenStack 2. mechanisms for managing a secure virtual machine. The Sloan Cloud will provide users with dedicated virtual machines that are pre-configured with appropriate tools and provide secure access to remote data that cannot be funneled through the VM to outside filesystems.
• This presentation was made possible with content provided by many HTRC colleagues: John Unsworth, J. Stephen Downie, Beth Plale, Robert H. McDonald, Beth Sandore, Yiming Sun, Miao Chen, Guangchen Ruan, Loretta Auvil, Kirk Hess, and many others… • The HTRC Non-Consumptive Research Grant is graciously funded by the Alfred P. Sloan Foundation • IU D2I-PTI is graciously funded by The Lilly Endowment, Inc. • HTRC - http://www.hathitrust.org/htrc • IU D2I Center - http://d2i.indiana.edu/ • UIUC GSLIS - http://www.lis.illinois.edu/
Speakers: • Robert H. McDonald, Indiana University | @mcdonald • Beth Sandore Namachchivaya, University of Illinois • John Unsworth, Brandeis University | @unsworth Requests for assistance: Miao Chen, HTRC Education and Outreach