
The HathiTrust Research Center: Building Shared Computational Resources To Mine The Largest Academic Digital Library Corpus

This is the presentation that we gave at #EDU13 on Oct 16, 2013, with other members of the HTRC Executive Committee: John Unsworth (@unsworth) and Beth Sandore Namachchivaya.

Robert H. McDonald

October 16, 2013

Transcript

  1. The HathiTrust Research Center: Building Shared Computational Resources to Mine the Largest Academic Digital Library Corpus
     Tweet Us: #HTRC #SESS037 #EDU13
  2. The HathiTrust Research Center: Building Shared Computational Resources to Mine the Largest Academic Digital Library Corpus
     Robert H. McDonald – Indiana University
     Beth Sandore Namachchivaya – University of Illinois
     John Unsworth – Brandeis University
     Educause Annual Meeting, Anaheim, CA, October 16, 2013
  3. HathiTrust Partnership
     http://www.hathitrust.org/htrc
     Allegheny College; Arizona State University; Baylor University; Boston College; Boston University; California Digital Library; Carnegie Mellon University; Columbia University; Cornell University; Dartmouth College; Duke University; Emory University; Florida State University; Getty Research Institute; Harvard University Library; Indiana University; Iowa State University; Johns Hopkins University; Lafayette College; Library of Congress; Massachusetts Institute of Technology; McGill University; Michigan State University; New York Public Library; New York University; North Carolina Central University; North Carolina State University; Northwestern University; The Ohio State University; The Pennsylvania State University; Princeton University; Purdue University; Stanford University; Syracuse University; Texas A&M University; Tufts University; Universidad Complutense de Madrid; University of Alabama; University of Arizona; University of Calgary; University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz); The University of Chicago; University of Connecticut; University of Delaware; University of Florida; University of Illinois at Chicago; The University of Iowa; University of Maryland; University of Miami; University of Michigan; University of Minnesota; University of Missouri; University of Nebraska-Lincoln; The University of North Carolina at Chapel Hill; University of Notre Dame; University of Oklahoma; University of Pennsylvania; University of Pittsburgh; University of Utah; University of Virginia; University of Washington; University of Wisconsin-Madison; Utah State University; Virginia Tech; Wake Forest University; Washington University; Yale University Library
  4. HathiTrust Mission
     To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge
  5. HathiTrust Services
     • Long-term preservation
       – Bit-level and migration
     • Bibliographic search
     • Full-text search
     • Reading and download capabilities
     • Print on demand
     • Collections
     • Datasets
     • HathiTrust Research Center
  6. HathiTrust “Wow” Numbers
     • 10,819,596 total volumes
     • 5,672,046 book titles
     • 281,890 serial titles
     • 3,786,858,600 pages
     • 485 terabytes
     • 128 miles
     • 8,791 tons
     • 3,469,225 volumes (~32% of total) in the public domain
  7. Discovery and Use
     • Search, collections, online access
     • APIs and data feeds
       – Data API
       – Bibliographic API
       – “Hathifiles” inventory files
       – OAI
     • Computational Research
       – Distribution of datasets
       – Protocol-based access
       – Research Center
  8. Goals for HTRC
     • Provide a persistent and sustainable structure to enable scholars to ask and answer new questions.
       – Leverage data storage and computational infrastructure at Indiana & Illinois
       – Stimulate community development of new functionality and tools
       – Use tools to enable discoveries that would not be possible without the HTRC
     • Enable scholars to fully utilize the content of the HathiTrust Library while preventing intellectual property misuse within U.S. copyright law.
       – Provide a secure computational and data environment for scholars to perform research using the HathiTrust Digital Library.
  9. [Organization diagram. Labels: HathiTrust (Board of Governors, Executive Committee, Executive Director); HathiTrust Research Center; University of Illinois; Indiana University; University of Michigan; Data Copy #1; Data Copy #2]
  10. HTRC Governance
     • Reports to the HathiTrust Board of Governors
     • HTRC Executive Committee
       – J. Stephen Downie (Co-director), Professor and Associate Dean for Research, University of Illinois GSLIS
       – Beth Plale (Co-director and Chair), Director of the Data to Insight Center and Professor in the School of Informatics and Computing at Indiana University
       – Robert H. McDonald, Associate Dean of Libraries/Deputy Director of the Data to Insight Center at Indiana University
       – Beth Sandore Namachchivaya, Associate University Librarian for Information Technology Planning & Policy at the University of Illinois
       – John Unsworth, Vice Provost for Library & Technology Services and Chief Information Officer at Brandeis University
     • HTRC Advisory Board (see members on the next slide)
     • Google Public Domain agreement: in place for IU and UIUC
  11. HTRC Advisory Board
     • Cathy Blake, University of Illinois, Urbana-Champaign
     • Beth Cate, Indiana University
     • Greg Crane, Tufts University
     • Laine Farley, California Digital Library
     • Brian Geiger, University of California at Riverside
     • David Greenbaum, University of California at Berkeley
     • Fotis Jannidis, University of Würzburg, Germany
     • Matthew Jockers, Stanford University
     • Jim Neal, Columbia University
     • Bill Newman, Indiana University
     • Bethany Nowviskie, University of Virginia
     • Andrey Rzhetsky, University of Chicago
     • Pat Steele, University of Maryland
     • Craig Stewart, Indiana University
     • David Theo Goldberg, University of California at Irvine
     • John Towns, National Center for Supercomputing Applications
     • Madelyn Wessel, University of Virginia
  12. Hathifiles
     • Tab-delimited inventory files
     • Aggregated monthly
     • Daily incremental files
     • Contain
       – Identifiers
       – Limited bibliographic information
       – Rights, language, gov docs status information
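As a rough illustration of how tab-delimited inventory files like these might be read, here is a minimal Python sketch. The three-column layout (identifier, rights, language), the sample rows, and the name `read_inventory` are assumptions made for this example only; the slide does not give the actual Hathifiles column order.

```python
import csv
import io

# Hypothetical column layout; the real Hathifiles schema is not shown on the slide.
FIELDS = ["identifier", "rights", "language"]


def read_inventory(stream):
    """Yield one dict per tab-delimited line, keyed by the assumed FIELDS."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield dict(zip(FIELDS, row))


# Made-up sample data standing in for a downloaded inventory file.
sample = io.StringIO("vol.001\tpd\teng\nvol.002\tic\tger\nvol.003\tpdus\teng\n")
records = list(read_inventory(sample))

# Keep only volumes whose rights value marks them as public domain.
public_domain = [r["identifier"] for r in records if r["rights"] in {"pd", "pdus"}]
print(public_domain)
```

For a real file, the `io.StringIO` object would be replaced by an opened file handle, and `FIELDS` by the published column list.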
  13. Content Distribution
     In-copyright or undetermined: 70%; Public Domain (worldwide): 15%; Public Domain (US): 10%; U.S. Federal Government Documents (worldwide): 4%; Open Access: .1%; Creative Commons: .01%; "Public Domain": 30%
  14. Content Sources
     Michigan 45%; California 33%; Wisconsin 5%; Cornell 4%; NYPL 3%; Princeton 3%; Indiana 2%; Columbia 1%; Harvard 1%; LC 1%; Madrid 1%; Minnesota 1%; Chicago 0%; Duke 0%; Illinois 0%; NCSU 0%; Northwestern 0%; Penn State 0%; Purdue 0%; UNC-Chapel Hill 0%; Utah State 0%; Virginia 0%; Yale 0%
  15. Dates
     2000-2009: 10%; 1990-1999: 14%; 1980-1989: 15%; 1970-1979: 13%; 1960-1969: 11%; 1950-1959: 6%; 1940-1949: 4%; 1930-1939: 4%; 1920-1929: 4%; 1910-1919: 4%; 1900-1909: 4%; 1850-1899: 8%; 1800-1849: 3%; 1700-1799: 1%; 1600-1699: 0%; 1500-1599: 0%; 0-1500: 0%
  16. Language Distribution
     English 48%; German 9%; French 7%; Spanish 5%; Chinese 4%; Russian 4%; Japanese 3%; Italian 3%; Arabic 2%; Latin 1%; Remaining Languages 14%
     The top 10 languages make up ~86% of all content
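The ~86% figure can be checked directly from the percentages on the slide. The short Python sketch below only re-adds the printed (rounded) values; the variable names are illustrative.

```python
# Percentages as printed on the slide, top ten languages only.
top_ten = {
    "English": 48, "German": 9, "French": 7, "Spanish": 5, "Chinese": 4,
    "Russian": 4, "Japanese": 3, "Italian": 3, "Arabic": 2, "Latin": 1,
}

top_ten_share = sum(top_ten.values())  # combined share of the top ten
remaining = 100 - top_ten_share        # share left for all other languages
print(top_ten_share, remaining)
```

The remainder agrees with the "Remaining Languages" slice of the chart.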
  17. [Architecture diagram. Labels: Source; Bibliographic Data; Content Package; Indiana; Michigan; Bib Data; Data Management; Rights Data; Storage; Access; Ingest; Catalog; Full-text Search; PageTurner; APIs; Collections; Holdings Data; Datasets]
  18. How is it available?
     • Web interfaces
     • APIs
       – Data API
       – Bib API
     • Data feeds and distribution
       – Hathifiles
       – OAI
       – Datasets
     • Soon: Virtual Machines
  19. Copyright
     • Strongly bound to US copyright issues with constant vigilance of the international scene
     • Status determinations via:
       – Bibliographic metadata
       – Automatic and manual rights determination
  20. Automatic Rights Determination
     • Conducted on all works at time of ingest and when records are modified
       – Public domain worldwide
         • US works published before 1923, US federal government publications, non-US works published prior to 1872
       – Public domain in the United States
         • Non-US works published prior to 1923
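The date rules on this slide can be written down as a small decision function. This is a sketch under stated assumptions, not HathiTrust's actual implementation: the function name, its parameters, and the choice to return the `pd` and `pdus` attribute names from the Rights Attributes slide are all illustrative, and a return value of `None` simply means the automatic rules on this slide do not open the work.

```python
from typing import Optional


def automatic_rights(published_year: int, us_work: bool,
                     us_federal_doc: bool = False) -> Optional[str]:
    """Apply the slide's date rules; return 'pd', 'pdus', or None (hypothetical helper)."""
    # Public domain worldwide: US federal government publications.
    if us_federal_doc:
        return "pd"
    # Public domain worldwide: US works published before 1923.
    if us_work and published_year < 1923:
        return "pd"
    # Public domain worldwide: non-US works published prior to 1872.
    if not us_work and published_year < 1872:
        return "pd"
    # Public domain in the United States only: non-US works prior to 1923.
    if not us_work and published_year < 1923:
        return "pdus"
    # Everything else is left to other processes (e.g., manual review).
    return None


print(automatic_rights(1900, us_work=False))
```

Ordering the checks from the broadest grant (worldwide) to the narrowest keeps a non-US work from 1860 from being labeled US-only.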
  21. Manual Rights Determination
     • IMLS-funded CRMS project
       – US-published works 1923-1963
       – Conformance with formalities
       – Expanding to non-US works
       – Double-blind review with expert review for conflicts
       – Staff at 4 HathiTrust partner institutions (15 will take part in non-US)
       – As of February 2012 ~190,000 reviewed, more than 100,000 opened
     • Rights Holder Permissions
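The "double-blind review with expert review for conflicts" step can be pictured as a simple resolution rule. The sketch below is an assumption about the general shape of such a rule, not the CRMS resolution policy itself; the function name and the use of `None` for "still awaiting an expert" are hypothetical.

```python
def resolve_reviews(review_a, review_b, expert_review=None):
    """Combine two independent reviews; fall back to an expert on conflict."""
    # Two reviewers who agree settle the determination.
    if review_a == review_b:
        return review_a
    # On disagreement the expert's determination decides; None means
    # the conflict is still waiting for expert review.
    return expert_review


print(resolve_reviews("pd", "ic", expert_review="ic"))
```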
  22. Rights Attributes
     id  name         type       dscr
     1   pd           copyright  public domain
     2   ic           copyright  in-copyright
     3   opb          copyright  out-of-print and brittle (implies in-copyright)
     4   orph         copyright  copyright-orphaned (implies in-copyright)
     5   und          copyright  undetermined copyright status
     6   umall        access     available to UM affiliates and walk-in patrons (all campuses)
     7   world        access     available to everyone in the world
     8   nobody       access     available to nobody; blocked for all users
     9   pdus         copyright  public domain only when viewed in the US
     10  cc-by        copyright  Creative Commons Attribution
     11  cc-by-nd     copyright  Creative Commons Attribution-NoDerivatives
     12  cc-by-nc-nd  copyright  Creative Commons Attribution-NonCommercial-NoDerivatives
     13  cc-by-nc     copyright  Creative Commons Attribution-NonCommercial
     14  cc-by-nc-sa  copyright  Creative Commons Attribution-NonCommercial-ShareAlike
     15  cc-by-sa     copyright  Creative Commons Attribution-ShareAlike
     16  orphcand     copyright  orphan candidate - in 90-day holding period (implies in-copyright)
     17  cc-zero      copyright  Creative Commons Zero license (implies pd)
     18  und-world    copyright  undetermined copyright status and permitted as world-viewable by the depositor
     19  ic-us        copyright  in copyright in the US
  23. Rights Determination Reason Codes
     id  name  dscr
     1   bib   bibliographically-derived by automatic processes
     2   ncn   no printed copyright notice
     3   con   contractual agreement with copyright holder on file
     4   ddd   due diligence documentation on file
     5   man   manual access control override; see note for details
     6   pvt   private personal information visible
     7   ren   copyright renewal research was conducted
     8   nfi   needs further investigation (copyright research partially complete; an ambiguous, unclear, or other time-consuming situation was encountered)
     9   cdpp  title page or verso contain copyright date and/or place of publication information not in bib record
     10  cip   condition review and in-print status research was conducted
     11  unp   unpublished work
     12  gfv   Google viewability set at VIEW_FULL
     13  crms  derived from multiple reviews in the Copyright Review Management System (CRMS) via an internal resolution policy; consult CRMS records for details
     14  add   author death date research was conducted or notification was received from authoritative source
     15  exp   expiration of copyright term for non-US work with corporate author
     16  del   deleted from repository; see note for details
     17  gatt  non-US public domain work restored to in-copyright in the US by GATT
  24. Access by type of work
     • Public domain worldwide
       – Searchable (bibliographic and full-text): Worldwide
       – Viewable*: Worldwide
       – Full-PDF download (Data API): Partners only if scanned by Google; if not, worldwide
       – Print on Demand: Worldwide
       – Print disabilities*: Partners worldwide
       – Preservation uses (Section 108)*: N/A
     • Public domain (US): non-US works published between 1872 and 1923
       – Searchable (bibliographic and full-text): Worldwide
       – Viewable*: When accessed from within the United States
       – Full-PDF download (Data API): Partners in the US if scanned by Google; if not, anyone in the US
       – Print on Demand: Available within the United States
       – Print disabilities*: Partners in the US; partners worldwide where similar laws in effect
       – Preservation uses (Section 108)*: N/A
     • Works that rights holders have opened access to in HathiTrust
       – Searchable (bibliographic and full-text): Worldwide
       – Viewable*: Worldwide
       – Full-PDF download (Data API): Worldwide (if digitized by Google, full-PDF only available if opened with CC license)
       – Print on Demand: Worldwide with permission
       – Print disabilities*: Partners worldwide
       – Preservation uses (Section 108)*: N/A
     • Works that are in-copyright or of undetermined status
       – Searchable (bibliographic and full-text): Worldwide
       – Viewable*: Not available
       – Full-PDF download (Data API): Not available
       – Print on Demand: Not available
       – Print disabilities*: Partners in the US; partners worldwide where similar laws in effect
       – Preservation uses (Section 108)*: Partners in the US; partners worldwide where similar laws in effect
     • Orphan works
       – Searchable (bibliographic and full-text): Worldwide
       – Viewable*: To participating partners
       – Full-PDF download (Data API): Not available
       – Print on Demand: Not available
       – Print disabilities*: Partners in the US
       – Preservation uses (Section 108)*: Partners in the US; partners worldwide where similar laws in effect
     * Note: Access to in-copyright works is subject to conditions on the Terms of Access slide.
  25. • Web services architecture and protocols
     • Registry of services and algorithms
     • Solr full text indexes
     • NoSQL store as volume store
     • OpenID authentication
     • Portal front-end, programmatic access
     • Data mining algorithms
  26. [HTRC architecture diagram. Labels: Agent framework; Page/volume tree (file system); Volume store (Cassandra); SEASR analytics service; Task deployment; WSO2 registry (services, collections, data capsule images); Solr index; HathiTrust corpus; rsync; HTRC Data API v0.1; NCSA local resources; Programmatic access e.g.,; WSO2 Identity Server; University of Michigan; Meandre Orchestration; Agent instance (four); Non-consumptive Data capsules; Big Red II/IU Quarry; Blacklight; NSF XSEDE; Portal]
  27. [Diagram. Labels: HTRC; Complexity hiding interface; All the complexity; Tabular info; Statistical plots; Spatial plots; Request]
  28. [Diagram. Labels: Complexity hiding interface; Other data (dictionaries, wiki data); Subsets of corpus; HTRC; Text mining algorithms]
  29. [Secure virtual machine diagram. Labels: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Virtual Cloud; SSH; Non-consumptive Output Storage; Researcher; HTRC Research Access; Request for VM]
  30. 1. Select volumes for analysis
     2. Select algorithm
     3. View/download results
     Named Entities; Word frequencies; Topic models
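The three steps above can be sketched with word frequencies as the selected algorithm. This is a toy, in-memory illustration, not the HTRC portal or its API: the `volumes` dictionary, its sample texts, and the function `word_frequencies` are all assumptions made for the example.

```python
import re
from collections import Counter


def word_frequencies(volumes, selected_ids):
    """Count lowercase word tokens across the selected volumes."""
    counts = Counter()
    for vol_id in selected_ids:
        tokens = re.findall(r"[a-z']+", volumes[vol_id].lower())
        counts.update(tokens)
    return counts


# Made-up stand-in for a corpus: volume id -> full text.
volumes = {
    "v1": "Darwin wrote on instinct. Romanes wrote on instinct too.",
    "v2": "A college course catalog.",
}

selected = ["v1"]                             # step 1: select volumes
freqs = word_frequencies(volumes, selected)   # step 2: run the algorithm
print(freqs.most_common(3))                   # step 3: view results
```

Because only `v1` is selected, words that occur solely in `v2` never enter the counts.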
  31. 1315 volumes selected using a keyword search for 'Darwin', 'Romanes', 'anthropomorphism', and 'comparative psychology'. This set contains lots of books that are not of particular interest (e.g., books on theology, college course catalogs).
     Challenge: Find the philosophical arguments in a haystack of sentences
     Colin Allen, Professor, Cognitive Science, Indiana University
     Digging into Data 2011
  32. Yearly values of ratio between two wordlists in three different genres. 4,275 volumes. 1700-1899
     Ted Underwood, Dept of English, UIUC
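A yearly ratio between two wordlists can be computed along the following lines. This is only a sketch of the general idea: the slide does not give the wordlists, the tokenization, or how genres were handled, so the sample words, the sample texts, and the function `yearly_ratio` are hypothetical, and the genre dimension is left out.

```python
import re
from collections import defaultdict


def yearly_ratio(volumes, list_a, list_b):
    """volumes: iterable of (year, text). Return {year: hits_a / hits_b}."""
    words_a, words_b = set(list_a), set(list_b)
    hits_a = defaultdict(int)
    hits_b = defaultdict(int)
    for year, text in volumes:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in words_a:
                hits_a[year] += 1
            elif token in words_b:
                hits_b[year] += 1
    # Years with no hits from list_b are skipped to avoid dividing by zero.
    return {year: hits_a[year] / hits_b[year]
            for year in sorted(hits_b) if hits_b[year]}


# Made-up sample data and wordlists.
sample = [(1800, "home heart house"), (1800, "empire"), (1850, "home empire nation")]
ratios = yearly_ratio(sample, ["home", "heart", "house"], ["empire", "nation"])
print(ratios)
```

Running the same function once per genre subset would give one series per genre.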
  33. Phenotypes implemented at level of genes
     General study: understanding of how phenotypes, such as human healthy diversity and maladies, are implemented at level of genes.
     Why HTRC: capture properties of language automatically, for text transformations and information extraction. Generalize grammatical and idiomatic patterns as related to systems biology.
     Andrey Rzhetsky, Professor, Department of Medicine, University of Chicago
  34. Other Grants and Proposals involving HTRC
     • Zdenek Zdrahal, “DiscoveryCORE, Discovering Hidden Relationships in Semantically Connected Resources”, NEH Digging Into Data Challenge.
     • Matthew Wilken, Notre Dame, “Literary Geography at Scale”, American Council of Learned Societies (ACLS).
     • Ichiro Fujinaga, “Single Interface for Music Score Searching and Analysis (SIMSSA)” to SSHRC, Canada. Pending.
     • Andrew Piper, Text Mining the Novel: Establishing the Foundations of a New Discipline, SSHRC, Canada.
     • Robert Liffe, University of Sussex, Textual Genomics Project (TTGP), United Kingdom Arts and Humanities Research Council.
     • Edie Rasmussen. From Indexer’s Legacy to Scholar’s Desktop.
     • Adam Farquhar, The British Library. IRIS, Arts and Humanities Research Council grant.
  35. Workset Creation for Scholarly Analysis
     Funded at $493,000 by the Andrew W. Mellon Foundation; Co-PIs: J. Stephen Downie, Tim Cole, Beth Plale; 1 July 2013 - 30 June 2015. Goals:
     1) enriching the metadata in the HathiTrust corpus
     2) augmenting string-based metadata with URIs to leverage discovery and sharing through external services, and
     3) formalizing the notion of collections and worksets in the context of the HathiTrust Research Center.
     Includes an open, competitive Request for Proposals in November 2013, with the intent to fund four prototyping projects that will build tools for enriching and augmenting metadata for the HathiTrust corpus.
  36. HTRC Sloan Cloud for Secure Text-Mining at Scale
     Funded at $606,000 by The Alfred P. Sloan Foundation; Beth Plale, Indiana University, PI; Atul Prakash, University of Michigan, Co-PI; Fall 2011 - Spring 2013.
     Goal: Prototype a system that enables secure text mining to be carried out at scale using public cloud resources, including:
     1. a software cloud infrastructure based on OpenStack
     2. mechanisms for managing a secure virtual machine We plan
     The Sloan Cloud will provide users with dedicated virtual machines that are pre-configured with appropriate tools and provide secure access to remote data that cannot be funneled through the VM to outside filesystems.
  37. Thank You
     • This presentation was made possible with content provided by many HTRC colleagues: John Unsworth, J. Stephen Downie, Beth Plale, Robert H. McDonald, Beth Sandore, Yiming Sun, Miao Chen, Guangchen Ruan, Loretta Auvil, Kirk Hess, and many others…
     • The HTRC Non-Consumptive Research Grant is graciously funded by the Alfred P. Sloan Foundation
     • IU D2I-PTI is graciously funded by The Lilly Endowment, Inc.
     • HTRC - http://www.hathitrust.org/htrc
     • IU D2I Center - http://d2i.indiana.edu/
     • UIUC GSLIS - http://www.lis.illinois.edu/
  38. Contact Information
     Speakers:
     Robert H. McDonald, Indiana University, [email protected] | @mcdonald
     Beth Sandore Namachchivaya, University of Illinois, [email protected]
     John Unsworth, Brandeis University, [email protected] | @unsworth
     Requests for assistance:
     Miao Chen, HTRC Education and Outreach, [email protected]
  39. The HathiTrust Research Center: Building Shared Computational Resources to Mine the Largest Academic Digital Library Corpus