Analysing locally gathered logfiles to determine users' accesses to subscribed e-resources

Talk given at ICOLC in Lisbon about the ezPAARSE/AnalogIST project.

ezPAARSE is software released under a GPL-compatible licence that mines, analyses and enriches the logs generated by reverse proxies (ezProxy, Biblio PAM, Squid, Apache), which record accesses to academic and scientific publishers' platforms.

AnalogIST is a collaborative space allowing users to share analyses of academic and scientific publishers' platforms.

Stéphane Gully

October 21, 2014

Transcript

  1. ANALYSING LOCALLY GATHERED LOGFILES TO DETERMINE USERS’ ACCESSES TO SUBSCRIBED E-RESOURCES
    http://ezpaarse.couperin.org
    http://analogist.couperin.org
    https://github.com/ezpaarse-project/ezpaarse

  2. ICOLC 2014 - Lisbon - 2014/10/21
    Presentation outline
    1- The Context: A Need for Evaluation
    2- Gathering Local Data
    3- Parsers and Analyses
    4- AnalogIST and ezPAARSE
    5- Results and Visualization
    6- Project Organization

  4. The Scientific and Technical Information Market
    1. The Context: A need for evaluation
    About some well-known facts
    ● $25 billion global revenue in 2012, growing 4-5% per year
    ● The 4 biggest publishers make up half the market
    ● For 10 years, the price of most journals has increased by 3% to 5% per year
    ● 1.5 billion articles downloaded per year by 10M users
    We need to assess and evaluate the use of these e-resources

  5. What we’ve currently got
    Publisher-provided statistics…
    … are not available
    … are available and COUNTER-compliant
    … are available but not COUNTER-compliant

  6. Limitations of publisher-provided statistics
    Vendors are the only source
    → We need to assess these numbers
    These numbers offer mere quantification
    → We need to qualify them
    A possible solution:
    → locally gathered usage quantification

  8. The reverse proxy
    2. Gathering usage data locally

  9. Where ezPAARSE comes into play
    [Diagram with three numbered steps; the step labels are not recoverable from the transcript]

  11. Example of a URL structure
    3. Parsers and analyses
    http://pdn.sciencedirect.com/science?_ob=MiamiImageURL&_cid=271664&_user=4046427&_pii=S0001457512000747&_check=y&_origin=browse&_zone=rslt_list_item&_coverDate=2012-07-31&wchp=dGLbVlt-zSkWb&md5=f5d8d157ccda6d597cb466af123dbff3/1-s2.0-S0001457512000747-main.pdf

  12. Example of a URL structure
    http://pdn.sciencedirect.com/science?_ob=MiamiImageURL&_cid=271664&_user=4046427&_pii=S0001457512000747&_check=y&_origin=browse&_zone=rslt_list_item&_coverDate=2012-07-31&wchp=dGLbVlt-zSkWb&md5=f5d8d157ccda6d597cb466af123dbff3/1-s2.0-S0001457512000747-main.pdf
    The URL contains the ISSN (00014575, at the start of the _pii parameter) and the type of the downloaded file (.pdf)

  13. Example of a URL structure
    http://www.sciencedirect.com/science/journal/00014575
    The last path segment (00014575) is the ISSN
    By manually trying the URL, we find an HTML table of contents

  14. Example of a URL structure
    http://www.cairn.info/load_pdf.php?ID_ARTICLE=RFG_218_0009
    We know it’s a PDF, but we only get a publisher-specific identifier.
    We need a correspondence table: the Publisher Knowledge Base (ideally a KBART-formatted file)
    Publisher id    ISSN
    RFG             0338-4551
    LMS             0027-2671
    ...
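
A minimal sketch of the lookup described on this slide. It assumes that the publisher-specific identifier is the part of ID_ARTICLE before the first underscore, which is inferred from the single example above. The function name parseCairnUrl, the returned field names and the two-entry in-memory table are hypothetical; a real knowledge base would be loaded from a KBART file, and this is not the actual ezPAARSE parser.

```javascript
// Hypothetical two-entry excerpt of a publisher knowledge base,
// copied from the correspondence table on the slide.
const cairnKnowledgeBase = new Map([
  ['RFG', '0338-4551'],
  ['LMS', '0027-2671'],
]);

// Illustrative helper (not the actual ezPAARSE parser).
function parseCairnUrl(url) {
  const parsed = new URL(url);
  const articleId = parsed.searchParams.get('ID_ARTICLE');
  if (!articleId) {
    return null; // nothing to identify in this URL
  }

  // Assumption: the publisher id is the prefix before the first underscore.
  const publisherId = articleId.split('_')[0];

  return {
    publisherId: publisherId,
    // The script name tells us that a PDF is being served.
    mime: parsed.pathname.endsWith('/load_pdf.php') ? 'PDF' : null,
    // The ISSN comes from the knowledge base, not from the URL itself.
    issn: cairnKnowledgeBase.get(publisherId) || null,
  };
}

console.log(parseCairnUrl('http://www.cairn.info/load_pdf.php?ID_ARTICLE=RFG_218_0009'));
```

An identifier that is missing from the table yields issn: null, which is the case where the knowledge base has to be completed by hand.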

  15. How to parse the URL?
    http://pdn.sciencedirect.com/science?_ob=MiamiImageURL&_cid=271664&_user=4046427&_pii=S0001457512000747&_check=y&_origin=browse&_zone=rslt_list_item&_coverDate=2012-07-31&wchp=dGLbVlt-zSkWb&md5=f5d8d157ccda6d597cb466af123dbff3/1-s2.0-S0001457512000747-main.pdf
    /_pii=S([0-9]{0,7}[0-9X])/i
    The regular expression captures the ISSN: up to 8 characters following "_pii=S"
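
A small sketch of how the regular expression above could be applied to the example URL. Only the regular expression comes from the slide; the function name parseScienceDirectUrl, the returned field names and the file-type check on the URL ending are illustrative assumptions, not the actual ezPAARSE parser.

```javascript
// Illustrative helper: extract an ISSN and a file type from a ScienceDirect URL.
function parseScienceDirectUrl(url) {
  const result = {};

  // Regular expression taken from the slide: the PII starts with "S",
  // followed by the 8 characters of the ISSN.
  const match = /_pii=S([0-9]{0,7}[0-9X])/i.exec(url);
  if (match) {
    const raw = match[1].toUpperCase();
    // Only format the value as an ISSN when all 8 characters were captured.
    if (raw.length === 8) {
      result.issn = raw.slice(0, 4) + '-' + raw.slice(4);
    }
  }

  // Assumption: the extension at the end of the URL gives the file type.
  if (/\.pdf$/i.test(url)) {
    result.mime = 'PDF';
  }
  return result;
}

const url = 'http://pdn.sciencedirect.com/science?' +
  '_ob=MiamiImageURL&_cid=271664&_user=4046427&_pii=S0001457512000747' +
  '&_check=y&_origin=browse&_zone=rslt_list_item&_coverDate=2012-07-31' +
  '&wchp=dGLbVlt-zSkWb&md5=f5d8d157ccda6d597cb466af123dbff3/1-s2.0-' +
  'S0001457512000747-main.pdf';

console.log(parseScienceDirectUrl(url)); // { issn: '0001-4575', mime: 'PDF' }
```

The extracted value agrees with the ISSN visible in http://www.sciencedirect.com/science/journal/00014575.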

  16. Platforms covered
    ...we need one parser for each platform
    Today ezPAARSE covers about 55 platforms

  17. Some limitations apply
    - Opaque URLs: session ids, encryption… (e.g. the old Springer platform)
    http://www.springerlink.com/content/j5q872410p510m63/fulltext.pdf
    - Publisher IDs that need to be linked to a knowledge base, which has to be edited manually (e.g. Cairn)
    http://www.cairn.info/load_pdf.php?ID_ARTICLE=RFG_218_0009

  19. AnalogIST and ezPAARSE
    4. AnalogIST and ezPAARSE
    ezPAARSE: the software
    ez: easy / PAARSE: Progiciel d'Analyse des Accès aux RessourceS Electroniques = Software for Analysing the Accesses to Online Resources
    - as a local installation
    - as an online service (SaaS)
    Free (libre) software
    Cross-platform
    Available online here: http://ezpaarse.couperin.org
    AnalogIST: the wiki portal
    Analyse des Logs de l'IST = Analysing the logs of Scientific and Technical Information
    The place where:
    → we gather the platform analyses
    → we host our French national ezPAARSE installation
    http://analogist.couperin.org

  20. AnalogIST and ezPAARSE
    ezPAARSE can be used through a web form or from the command line (cURL)
    Use the web form to create the command line that suits your needs.

  21. Example of an ezPAARSE output
    Text file (CSV or JSON format)
    KBART fields
    geoip fields
    Deduplicated consultation events, following the COUNTER recommendation
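
A rough sketch of what deduplicating consultation events can look like. It assumes the double-click windows commonly quoted from the COUNTER Code of Practice (10 seconds for HTML, 30 seconds for PDF); the event fields (user, url, mime, timestamp in seconds) and the function name deduplicate are hypothetical, and the way ezPAARSE itself implements deduplication is not described in the slides.

```javascript
// Assumed double-click windows, in seconds, per file type.
const DOUBLE_CLICK_WINDOW = { HTML: 10, PDF: 30 };

// Keep the first request of each burst of identical requests.
function deduplicate(events) {
  const lastSeen = new Map(); // key -> timestamp of the previous identical request
  const kept = [];
  const sorted = [...events].sort((a, b) => a.timestamp - b.timestamp);

  for (const event of sorted) {
    // Two requests are "identical" when user, URL and file type all match.
    const key = event.user + '|' + event.url + '|' + event.mime;
    const windowSeconds = DOUBLE_CLICK_WINDOW[event.mime] || 0;
    const previous = lastSeen.get(key);
    const isDoubleClick =
      previous !== undefined && event.timestamp - previous <= windowSeconds;

    if (!isDoubleClick) {
      kept.push(event);
    }
    // Always remember the latest request so that a chain of rapid clicks collapses.
    lastSeen.set(key, event.timestamp);
  }
  return kept;
}

// Made-up sample events.
const sampleEvents = [
  { user: 'u1', url: '/a.pdf', mime: 'PDF', timestamp: 0 },
  { user: 'u1', url: '/a.pdf', mime: 'PDF', timestamp: 20 }, // double click, dropped
  { user: 'u1', url: '/a.pdf', mime: 'PDF', timestamp: 60 },
  { user: 'u2', url: '/a.pdf', mime: 'PDF', timestamp: 5 },
  { user: 'u1', url: '/toc', mime: 'HTML', timestamp: 0 },
  { user: 'u1', url: '/toc', mime: 'HTML', timestamp: 15 },
];

console.log(deduplicate(sampleEvents).length); // 5
```

Keeping the first request of a burst is a design choice made for brevity here; a stricter reading of the recommendation may retain a different request of the pair.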

  23. Office rendering macros
    5. ezPAARSE: using the results

  24. Exploiting the Results with [tool shown as an image; its name is not in the transcript]

  25. Enrichment of results
    Link from external tables to ezPAARSE results to enrich data
    [Diagram: usage events from ezPAARSE are linked with external tables (user data, prices, data on journals (SGB), scientific discipline, language) to produce indicators and dashboards]
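
A minimal sketch of the enrichment idea: usage events are joined with an external table on a shared key, then aggregated into a simple indicator. Everything here is illustrative: the field names (issn, discipline, language), the function names enrichEvents and countBy, and the sample rows are assumptions, and the discipline labels are placeholders rather than real data.

```javascript
// Hypothetical external table keyed by ISSN (placeholder values).
const journalTable = [
  { issn: '0338-4551', discipline: 'Discipline A', language: 'fr' },
  { issn: '0027-2671', discipline: 'Discipline B', language: 'fr' },
];

// Join each usage event with the external table on the ISSN.
function enrichEvents(events, table) {
  const byIssn = new Map(table.map((row) => [row.issn, row]));
  return events.map((event) => {
    const journal = byIssn.get(event.issn);
    return {
      ...event,
      discipline: journal ? journal.discipline : 'unknown',
      language: journal ? journal.language : 'unknown',
    };
  });
}

// Count rows per value of a field, e.g. consultations per discipline.
function countBy(rows, field) {
  const counts = {};
  for (const row of rows) {
    counts[row[field]] = (counts[row[field]] || 0) + 1;
  }
  return counts;
}

// Made-up usage events.
const usageEvents = [
  { issn: '0338-4551', mime: 'PDF' },
  { issn: '0338-4551', mime: 'HTML' },
  { issn: '0027-2671', mime: 'PDF' },
  { issn: '0000-0000', mime: 'PDF' }, // not present in the external table
];

const enriched = enrichEvents(usageEvents, journalTable);
console.log(countBy(enriched, 'discipline'));
```

The same join works for any of the external tables named on the slide, provided they share a key with the usage events.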

  26. Who (student, researcher, staff) consults what? (UL)
    Distribution of consultations of paid content (books, journals, law references…) by user type at the University of Lorraine

  27. Consultations by research unit in Astronomy
    Consultations of articles from January 2014 to October 2014 by research units in Astronomy at CNRS

  28. Differences in usage between two scientific domains
    Domain: Astronomy & Astrophysics
    Springer / Nature / ScienceDirect / IOP in the same proportions

  29. Differences in usage between two scientific domains
    Domain: Geosciences
    A lot of ScienceDirect and a little of Springer / Nature / IOP

  30. Geolocation of consultations (CNRS)

  32. Agile development process
    6. Project organization
    SCRUM

  33. The French SCRUM team
    Paris
    Nancy

  34. In conclusion
    ● ezPAARSE is free and open source
    ● Simple to install and to use
    ● Innovative technologies (NodeJS, AngularJS ...)
    ● Feel free to test
    ● Send us log samples
    ● Give us feedback!

  35. Any Questions?
    http://ezpaarse.couperin.org
    http://analogist.couperin.org
    https://twitter.com/ezpaarse
    https://github.com/ezpaarse-project/ezpaarse
    [Tag cloud with appropriate terms]

  37. Detection of an anomaly (CNRS)
    The consultation peak corresponds to an abuse of an e-resource.
    Detecting it makes it possible to react promptly to the incident.
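
One simple way to spot such a peak in daily consultation counts is sketched below. This is not the method used at CNRS, which the slides do not describe; it is an illustrative heuristic that flags any day whose count exceeds a multiple of the median. The function names, the factor of 5 and the sample counts are all assumptions.

```javascript
// Median of a list of numbers.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const middle = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[middle - 1] + sorted[middle]) / 2
    : sorted[middle];
}

// Flag the days whose count is more than `factor` times the median count.
// The median is used because, unlike the mean, it is barely moved by the peak itself.
function findPeaks(dailyCounts, factor = 5) {
  const threshold = factor * median(dailyCounts.map((d) => d.count));
  return dailyCounts.filter((d) => d.count > threshold);
}

// Made-up daily consultation counts.
const dailyCounts = [
  { day: 'day 1', count: 100 },
  { day: 'day 2', count: 120 },
  { day: 'day 3', count: 90 },
  { day: 'day 4', count: 110 },
  { day: 'day 5', count: 5000 }, // suspicious peak
  { day: 'day 6', count: 105 },
];

console.log(findPeaks(dailyCounts)); // [ { day: 'day 5', count: 5000 } ]
```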

  38. AnalogIST and ezPAARSE
    Local installations: Univ 1, Univ 2, ...
    Collaborative space + global installation

  39. http://analogist.couperin.org/platforms/analyse-helper/start
    The URL is the only information you need to enter
    The rest is automatically processed
    DokuWiki syntax generated

  40. More features: exploiting the results with geolocation

  41. What do we count?
    Serials:
    - Articles (ARTICLE)
    - Abstract (ABS)
    - Table of contents (TOC)
    - Reference (REF)
    - Article preview, e.g. the “Look inside” function of SpringerLink (PREVIEW)
    - Article in basket/personal folder (BOOKMARK)
    E-books:
    - Book by title (BOOK)
    - Chapter, section (BOOK_SECTION)
    - Book series (BOOKSERIE)
    - Manuals, handbooks (HANDBOOK)
    Law databases:
    - Law encyclopedia (ENCYCLOPEDIES)
    - Law memento (FORMULES)
    - Law manual (BROCHES)
    - Law codes (CODES)
    Inst. repositories:
    - PHD_THESIS
    - MD_THESIS
    - MASTER_THESIS
    Notes:
    - The availability of these items depends on the elements present in the URL
    - The Law databases currently covered are only French ones