Analysing locally gathered logfiles to determine users' accesses to subscribed e-resources

Talk given at ICOLC in Lisbon about the ezPAARSE/AnalogIST project.

ezPAARSE is software released under a GPL-compatible licence. It mines, analyses and enriches the logs generated by reverse proxies (ezProxy, Biblio PAM, Squid, Apache), which record accesses to academic and scientific publishers' platforms.

AnalogIST is a collaborative space allowing users to share analyses of academic and scientific publishers' platforms.

Stéphane Gully

October 21, 2014

Transcript

  1. 1.

    ANALYSING LOCALLY GATHERED LOGFILES TO DETERMINE USERS’ ACCESSES TO SUBSCRIBED E-RESOURCES

    http://ezpaarse.couperin.org
    http://analogist.couperin.org
    cecilia.fabry@inist.fr
    stephane.gully@inist.fr
    https://github.com/ezpaarse-project/ezpaarse
  2. 2.

    ICOLC 2014 - Lisbon - 2014/10/21

    Presentation outline
    1- The Context: A Need for Evaluation
    2- Gathering Local Data
    3- Parsers and Analyses
    4- AnalogIST and ezPAARSE
    5- Results and Visualization
    6- Project Organization
  4. 4.

    About some well-known facts

    The Scientific and Technical Information Market
    • $25 billion global revenue in 2012, growing 4-5% per year
    • The 4 biggest publishers make up half the market
    • For 10 years, the price of most journals has increased by 3% to 5% per year
    • 1.5 billion articles downloaded per year by 10M users

    We need to assess and evaluate the use of these e-resources

    1. The Context: A need for evaluation
  5. 5.

    What we’ve currently got

    Publisher provided statistics…
    • … are not available
    • … are available and COUNTER-compliant
    • … are available but not COUNTER-compliant

    1. The Context: A need for evaluation
  6. 6.

    Publisher provided statistics limitations

    • Vendors are the only source
    • These numbers offer mere quantification
    → We need to assess these numbers
    → We need to qualify them

    A possible solution: → locally-gathered usage quantification

    1. The Context: A need for evaluation
  8. 8.
  9. 9.

    Where ezPAARSE comes into play

    [Diagram with steps numbered 1, 2, 3]

    2. Gathering usage data locally
  11. 11.

    Example of a URL structure

    http://pdn.sciencedirect.com/science?_ob=MiamiImageURL&_cid=271664&_user=4046427&_pii=S0001457512000747&_check=y&_origin=browse&_zone=rslt_list_item&_coverDate=2012-07-31&wchp=dGLbVlt-zSkWb&md5=f5d8d157ccda6d597cb466af123dbff3/1-s2.0-S0001457512000747-main.pdf

    3. Parsers and analyses
  12. 12.

    Example of a URL structure

    http://pdn.sciencedirect.com/science?_ob=MiamiImageURL&_cid=271664&_user=4046427&_pii=S0001457512000747&_check=y&_origin=browse&_zone=rslt_list_item&_coverDate=2012-07-31&wchp=dGLbVlt-zSkWb&md5=f5d8d157ccda6d597cb466af123dbff3/1-s2.0-S0001457512000747-main.pdf

    The URL contains the ISSN and the type of the downloaded file

    3. Parsers and analyses
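
A minimal Python sketch of how such a URL could be analysed. It is not ezPAARSE's actual parser. The function name `analyse_url` and the output keys `issn` and `file_type` are hypothetical, and the code assumes that the eight characters following the leading "S" of the `_pii` parameter are the ISSN, as the example on the slides suggests.

```python
from urllib.parse import urlsplit, parse_qs

# URL from the slide, with the line-wrap spaces removed.
URL = (
    "http://pdn.sciencedirect.com/science?_ob=MiamiImageURL&_cid=271664"
    "&_user=4046427&_pii=S0001457512000747&_check=y&_origin=browse"
    "&_zone=rslt_list_item&_coverDate=2012-07-31&wchp=dGLbVlt-zSkWb"
    "&md5=f5d8d157ccda6d597cb466af123dbff3/1-s2.0-S0001457512000747-main.pdf"
)


def analyse_url(url):
    """Return the ISSN and file type that can be read from the URL (hypothetical helper)."""
    info = {}
    params = parse_qs(urlsplit(url).query)
    pii = params.get("_pii", [""])[0]

    # Assumption: the 8 characters after the leading "S" are the ISSN.
    candidate = pii[1:9]
    looks_like_issn = (
        pii.startswith("S")
        and len(candidate) == 8
        and candidate[:7].isdigit()
        and (candidate[7].isdigit() or candidate[7] == "X")
    )
    if looks_like_issn:
        info["issn"] = candidate[:4] + "-" + candidate[4:]

    # The file extension at the end of the URL gives the file type.
    if url.lower().endswith(".pdf"):
        info["file_type"] = "PDF"
    return info


print(analyse_url(URL))
```

For the URL above, the sketch reports the ISSN 0001-4575 and the file type PDF.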
  13. 13.

    Example of a URL structure

    http://www.sciencedirect.com/science/journal/00014575
    ISSN: 00014575

    By manually trying the URL, we find an HTML table of contents

    3. Parsers and analyses
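
The same idea can be sketched for journal-level URLs. This is only an illustration: the regular expression, the function name `analyse_journal_url` and the output keys are assumptions, and the "TOC" label is borrowed from the resource types listed elsewhere in the talk.

```python
import re
from urllib.parse import urlsplit

# Assumption: journal pages follow the pattern /science/journal/<8-character ISSN>.
JOURNAL_PATH = re.compile(r"^/science/journal/(\d{7}[\dX])$")


def analyse_journal_url(url):
    """Recognise a journal table-of-contents URL (hypothetical helper)."""
    match = JOURNAL_PATH.match(urlsplit(url).path)
    if match is None:
        return {}
    issn = match.group(1)
    return {
        "issn": issn[:4] + "-" + issn[4:],
        "resource_type": "TOC",  # table of contents
        "file_type": "HTML",
    }


print(analyse_journal_url("http://www.sciencedirect.com/science/journal/00014575"))
```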
  14. 14.

    Example of a URL structure

    http://www.cairn.info/load_pdf.php?ID_ARTICLE=RFG_218_0009

    We know it’s a PDF, but we only get a publisher-specific identifier.
    We need a correspondence table: the Publisher Knowledge Base (ideally a KBART-formatted file)

    Publisher id    ISSN
    RFG             0338-4551
    LMS             0027-2671
    ...

    3. Parsers and analyses
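
The lookup in a knowledge base can be sketched as follows. The two rows come from the slide. Everything else is an assumption made for illustration: the tab-separated layout with KBART-style column names `title_id` and `print_identifier`, the helper names, and the rule that the publisher id is the part of `ID_ARTICLE` before the first underscore.

```python
import csv
import io
from urllib.parse import urlsplit, parse_qs

# Tiny knowledge base in a KBART-like tab-separated layout (rows from the slide).
KB_TSV = "title_id\tprint_identifier\nRFG\t0338-4551\nLMS\t0027-2671\n"


def load_kb(text):
    """Map each publisher id to its ISSN."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["title_id"]: row["print_identifier"] for row in reader}


def analyse_cairn_url(url, kb):
    """Extract the publisher id from the URL and resolve it through the knowledge base."""
    parts = urlsplit(url)
    info = {}
    if parts.path.endswith("load_pdf.php"):
        info["file_type"] = "PDF"
    article_id = parse_qs(parts.query).get("ID_ARTICLE", [""])[0]
    # Assumption: the publisher id is the part before the first underscore.
    title_id = article_id.split("_")[0]
    if title_id:
        info["title_id"] = title_id
    if title_id in kb:
        info["issn"] = kb[title_id]
    return info


kb = load_kb(KB_TSV)
print(analyse_cairn_url("http://www.cairn.info/load_pdf.php?ID_ARTICLE=RFG_218_0009", kb))
```

An identifier that is missing from the knowledge base simply yields no ISSN, which is where manual editing of the knowledge base comes in.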
  15. 16.

    Platforms covered

    ...we need one parser for each platform
    Today ezPAARSE covers about 55 platforms

    3. Parsers and analyses
  16. 17.

    Some limitations apply

    • Opaque URLs: session ids, encryption… (e.g. old Springer platform)
      http://www.springerlink.com/content/j5q872410p510m63/fulltext.pdf
    • Publisher IDs that need to be linked to a knowledge base (e.g. Cairn)
      http://www.cairn.info/load_pdf.php?ID_ARTICLE=RFG_218_0009

    - Opaque URLs (session ids, encryption…)
    - Knowledge bases having to be manually edited

    3. Parsers and analyses
  18. 19.

    AnalogIST and ezPAARSE

    ezPAARSE, the software
    ez: easy / PAARSE: Progiciel d'Analyse des Accès aux RessourceS Electroniques = Software for Analysing the Accesses to Online Resources
    - as a local installation
    - as an online service (SaaS)
    Free (libre) software
    Cross-platform
    Available online here: http://ezpaarse.couperin.org

    AnalogIST, the wiki portal
    Analyse des Logs de l'IST = Analysing the logs of Scientific and Technical Information
    The place where:
    → we gather the platform analyses
    → we host our French national ezPAARSE installation
    http://analogist.couperin.org

    4. AnalogIST and ezPAARSE
  19. 20.

    AnalogIST and ezPAARSE

    • Through a web form
    • With the command line (cURL)

    Use the web form to create the command line suiting your needs.

    4. AnalogIST and ezPAARSE
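
For readers who prefer a script to cURL, here is a hedged Python sketch of the same kind of request. The address `http://127.0.0.1:59599`, the `Accept` header and the helper names are assumptions about a typical local installation. The command line produced by the web form remains the reference for the actual options.

```python
import urllib.request

# Assumption: a local ezPAARSE instance listening on this address.
EZPAARSE_URL = "http://127.0.0.1:59599"


def build_request(log_lines, url=EZPAARSE_URL, accept="text/csv"):
    """Prepare a POST request whose body is the raw log, one entry per line."""
    body = "\n".join(log_lines).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST", headers={"Accept": accept}
    )


def submit(log_lines, **kwargs):
    """Send the log and return the response body (needs a running instance)."""
    with urllib.request.urlopen(build_request(log_lines, **kwargs)) as response:
        return response.read().decode("utf-8")


# Building the request does not touch the network; only submit() does.
request = build_request(["first log line", "second log line"])
print(request.get_method(), request.full_url)
```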
  20. 21.

    Example of an ezPAARSE output

    • Text file (CSV or JSON format)
    • KBART fields
    • geoip fields
    • Deduplicated consultation events: COUNTER recommendation

    4. AnalogIST and ezPAARSE
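
Deduplication can be illustrated with a short Python sketch. It is a simplified reading of COUNTER's double-click rule, not ezPAARSE's implementation. The windows of 10 seconds for HTML and 30 seconds for PDF, the event field names and the choice of the latest request as the one to keep are assumptions to check against the COUNTER Code of Practice.

```python
from datetime import datetime, timedelta

# Assumed double-click windows per file type.
WINDOWS = {"HTML": timedelta(seconds=10), "PDF": timedelta(seconds=30)}


def deduplicate(events):
    """Collapse repeated requests by the same user for the same resource."""
    kept = []
    last_index = {}  # (user, resource, file type) -> position in kept
    for event in sorted(events, key=lambda e: e["time"]):
        key = (event["user"], event["resource"], event["file_type"])
        window = WINDOWS.get(event["file_type"])
        index = last_index.get(key)
        if (
            index is not None
            and window is not None
            and event["time"] - kept[index]["time"] <= window
        ):
            kept[index] = event  # double-click: keep only the latest request
        else:
            last_index[key] = len(kept)
            kept.append(event)
    return sorted(kept, key=lambda e: e["time"])


def at(clock):
    return datetime.strptime("2014-10-21 " + clock, "%Y-%m-%d %H:%M:%S")


# Invented sample events, for illustration only.
events = [
    {"user": "u1", "resource": "art1", "file_type": "PDF", "time": at("10:00:00")},
    {"user": "u1", "resource": "art1", "file_type": "PDF", "time": at("10:00:20")},
    {"user": "u1", "resource": "art1", "file_type": "PDF", "time": at("10:05:00")},
    {"user": "u2", "resource": "art1", "file_type": "HTML", "time": at("10:00:00")},
    {"user": "u2", "resource": "art1", "file_type": "HTML", "time": at("10:00:15")},
]
print(len(deduplicate(events)))
```

In this sample only the second PDF request by u1 is treated as a double-click, so four of the five events remain.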
  22. 24.
  23. 25.

    Enrichment of results

    Link external tables to ezPAARSE results to enrich the data

    Usage events from ezPAARSE
    + external tables: user data, prices, data on journal (SGB), scientific discipline, language
    → Indicators and dashboards

    5. ezPAARSE: using the results
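
Linking an external table to the usage events amounts to a join on a shared key. The sketch below assumes the key is the ISSN. The field names, the helper names and the sample values (including the discipline labels) are invented placeholders.

```python
from collections import Counter

# Invented sample data: usage events and an external journal table keyed by ISSN.
usage_events = [
    {"issn": "0001-4575", "user_type": "student"},
    {"issn": "0001-4575", "user_type": "researcher"},
    {"issn": "0338-4551", "user_type": "student"},
    {"issn": "9999-9999", "user_type": "staff"},  # not in the external table
]
journal_table = {
    "0001-4575": {"discipline": "discipline A", "language": "en"},
    "0338-4551": {"discipline": "discipline B", "language": "fr"},
}


def enrich(events, table, key="issn"):
    """Add the columns of the external table to each event with a matching key."""
    return [{**event, **table.get(event.get(key), {})} for event in events]


def count_by(events, field):
    """Count events per value of a field, a building block for indicators."""
    return Counter(event.get(field, "unknown") for event in events)


enriched = enrich(usage_events, journal_table)
print(count_by(enriched, "discipline"))
print(count_by(enriched, "user_type"))
```

Events without a match keep their original fields and are counted under "unknown", which makes gaps in the external table visible.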
  24. 26.

    Who (student, researcher, staff) consults what? (UL)

    Distribution of consultations of paid content (books, journals, law references…) by user type at the University of Lorraine

    5. ezPAARSE: using the results
  25. 27.

    Consultations by research unit in Astronomy

    Consultations of articles from January 2014 to October 2014 by research units in Astronomy at CNRS

    5. ezPAARSE: using the results
  26. 28.

    Difference in usage between two scientific domains

    Domain: Astronomy & Astrophysics
    Springer / Nature / ScienceDirect / IOP in the same proportions

    5. ezPAARSE: using the results
  27. 29.

    Difference in usage between two scientific domains

    Domain: Geosciences
    A lot of ScienceDirect and a little Springer / Nature / IOP

    5. ezPAARSE: using the results
  29. 33.

    The French SCRUM team

    Paris, Nancy

    6. Project organization
  30. 34.

    In conclusion

    • ezPAARSE is free and open source
    • Simple to install and to use
    • Innovative technologies (NodeJS, AngularJS ...)
    • Feel free to test
    • Send us log samples
    • Give us feedback!
  31. 35.

    Any Questions?

    http://ezpaarse.couperin.org
    http://analogist.couperin.org
    https://twitter.com/ezpaarse
    https://github.com/ezpaarse-project/ezpaarse

    [Tag cloud with appropriate terms]
  32. 37.

    Detection of an anomaly (CNRS)

    The consultation peak corresponds to abuse of an e-resource. Detecting it makes it possible to react promptly to the incident.

    5. ezPAARSE: using the results
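
The talk does not say how the peak was detected. As one possible approach, the sketch below flags days whose consultation count exceeds a multiple of the median. The threshold factor, the function name `find_peaks` and the daily counts are invented for illustration.

```python
from statistics import median


def find_peaks(daily_counts, factor=5.0):
    """Return the days whose count exceeds factor times the median daily count."""
    baseline = max(median(daily_counts.values()), 1)
    return [
        day
        for day, count in sorted(daily_counts.items())
        if count > factor * baseline
    ]


# Invented daily consultation counts for one e-resource.
daily_counts = {
    "2014-03-01": 40,
    "2014-03-02": 35,
    "2014-03-03": 42,
    "2014-03-04": 900,
    "2014-03-05": 38,
}
print(find_peaks(daily_counts))
```

The median is used as the baseline because, unlike the mean, it is barely affected by the peak itself.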
  33. 38.

    AnalogIST and ezPAARSE

    Univ 1, Univ 2, ...: local installations
    collaborative space + global installation

    4. AnalogIST and ezPAARSE
  34. 39.

    http://analogist.couperin.org/platforms/analyse-helper/start

    • The URL is the only information you need to enter
    • The rest is automatically processed
    • DokuWiki syntax is generated
  35. 41.

    What do we count?

    Serials
    • Articles (ARTICLE)
    • Abstract (ABS)
    • Table of contents (TOC)
    • Reference (REF)
    • Article preview, e.g. the “Look inside” function of SpringerLink (PREVIEW)
    • Article in basket/personal folder (BOOKMARK)

    E-books
    • Book by title (BOOK)
    • Chapter, section (BOOK_SECTION)
    • Book series (BOOKSERIE)
    • Manuals, handbooks (HANDBOOK)

    Law databases
    • Law encyclopedia (ENCYCLOPEDIES)
    • Law memento (FORMULES)
    • Law manual (BROCHES)
    • Law codes (CODES)

    Inst. repositories
    • PHD_THESIS
    • MD_THESIS
    • MASTER_THESIS

    - The availability of these items depends on the elements present in the URL
    - The Law databases currently covered are only French ones

    3. Parsers and analyses