jstor: An R Package for Analysing Scientific Articles

Thomas Klebel
July 13, 2018

Slides for the talk "jstor: An R Package for Analysing Scientific Articles" at useR!2018 (https://user2018.r-project.org).

The package and instructions for installation can be found at https://ropensci.github.io/jstor/

Transcript

  1. jstor: An R Package for Analysing Scientific Articles
    Thomas Klebel
    13 July 2018
    University of Graz
    https://github.com/tklebel
    https://twitter.com/klebel_t
    Package: http://bit.ly/jstor2018
    Slides: http://bit.ly/jstor2018_slides

  2. Main message
    if (interested_in_research_on_sciences) {
      get_data_from_JSTOR() %>%
        use_package_jstor() %>%
        deal_with_data_limitations()
    } else {
      hopefully(
        get_interested_in_research_on_sciences()
      )
    }

  3. JSTOR and Data for Research (DfR)

  4. Analysing citation patterns
    Figure 1: (Bjork, Offer, and Söderberg 2014, 191)

  5. Analysing ngrams
    Figure 2: http://bit.ly/jstor_ngrams

  6. Why do we need a new package?
    Many existing packages can be used for analysis:
    • topic modeling:
      • topicmodels, lda, mallet, ...
    • ngram analysis:
      • tidytext, tm, quanteda, ...
    • citation analysis:
      • base regex, stringr, ...
    But: no existing solution imports DfR metadata in a convenient way.
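
As a small illustration of the citation-analysis bullet, publication years can be pulled out of reference strings with base regex. This is only a sketch: the reference strings are hypothetical placeholders, and the pattern (a four-digit year in parentheses) is an assumption about the citation style rather than anything the package prescribes.

```r
# Hypothetical reference strings (placeholders, not real citations)
refs <- c(
  "Author, A., & Author, B. (2010). A placeholder title.",
  "Author, C. (1998). Another placeholder title.",
  "Author, D. A reference without a year."
)

# Match a four-digit year inside parentheses, e.g. "(2010)"
m <- regexpr("\\(([0-9]{4})\\)", refs)

# Extract the matched text; entries without a match stay NA
years <- rep(NA_integer_, length(refs))
years[m != -1] <- as.integer(gsub("[()]", "", regmatches(refs, m)))

years
```

Real reference lists are far less regular than this, so a pattern like the one above would likely need adjusting per journal.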

  7. How to use the package - a fictional example
    Analysing articles about “facebook”
    Possible research questions:
    • In which fields are researchers interested in facebook and why?
    • In which ways do they write about facebook, what are the key topics?
    • ...

  8. Requesting a dataset at: https://www.jstor.org/dfr/

  9. [Image-only slide]

  10. Get overview of zip-archive
    jst_preview_zip("facebook.zip")
    type      meta_type        n
    metadata  book_chapter     7068
    metadata  journal_article  11494
    metadata  research_report  430
    ngram1    ngram1           19493
    ngram2    ngram2           19493
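
To give a feel for what such an overview involves, the sketch below counts files per top-level folder in base R. It is not how `jst_preview_zip` is implemented; the file listing is hypothetical (it stands in for something like `unzip("facebook.zip", list = TRUE)$Name`), and the folder and file naming scheme is an assumption.

```r
# Hypothetical file listing; the naming scheme is assumed for illustration
files <- c(
  "metadata/journal-article-a.xml",
  "metadata/journal-article-b.xml",
  "metadata/book-chapter-a.xml",
  "ngram1/journal-article-a-ngram1.txt",
  "ngram1/journal-article-b-ngram1.txt"
)

# The top-level folder gives the type of each file
type <- dirname(files)

# Count files per type
overview <- as.data.frame(table(type), responseName = "n",
                          stringsAsFactors = FALSE)
overview
```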

  11. Structure of meta-data files
    ...
    A title
    2011
    ...
    H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
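
The markup of the example on this slide did not survive in the transcript. As a rough sketch of how a record with a title, a year and a reference could be read with the xml2 package, consider the toy document below. The element names are assumptions (loosely JATS-like), not copied from the slide or from real DfR files.

```r
library(xml2)

# A toy document; element names are assumed for illustration
doc <- read_xml(
  "<article>
     <front>
       <article-title>A title</article-title>
       <year>2011</year>
     </front>
     <back>
       <ref-list>
         <ref><mixed-citation>H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.</mixed-citation></ref>
       </ref-list>
     </back>
   </article>"
)

# XPath queries pick out individual parts of the record
title <- xml_text(xml_find_first(doc, ".//article-title"))
year  <- as.integer(xml_text(xml_find_first(doc, ".//year")))
refs  <- xml_text(xml_find_all(doc, ".//ref/mixed-citation"))
```

The front matter and the reference list sit in different parts of the tree, so each needs its own query.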

  12. Extract parts of meta-data files with jstor
    • Articles:
      • jst_get_article
      • jst_get_references
      • jst_get_footnotes
    • Books:
      • jst_get_book
      • jst_get_chapters
    • Both:
      • jst_get_authors

  13. jst_get_article("example.xml") %>%
      tidyr::gather(columns, rows)
    columns        rows
    file_name      example
    journal_title  Journal of Transport and Land Use
    article_title  Photos, tweets, and trails
    pub_year       2017
    . . .          . . .

  14. jst_get_references("example.xml")
    file_name  references
    example    Backstrom, L., Sun, E., & Marlow, C. (2010). Fi. . .
    example    Cranshaw, J., Schwartz, R., Hong, J. I., & Sade. . .
    . . .      . . .

  15. Joining separate tables
    XML-file(s) → jst_get_article → filter(pub_year > 2000) → left_join → combined data.frame
    XML-file(s) → jst_get_references → left_join
    file_name  article_title           pub_year  references
    example    Photos, tweets, a. . .  2017      References
    example    Photos, tweets, a. . .  2017      Backstrom, L., Su. . .
    example    Photos, tweets, a. . .  2017      Cranshaw, J., Sch. . .
    example    Photos, tweets, a. . .  2017      El Esawey, M., Li. . .
    . . .      . . .                   . . .     . . .
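
The join on this slide can be sketched with dplyr. The two tibbles below are toy stand-ins for the outputs of `jst_get_article()` and `jst_get_references()`: the column names follow the tables shown in the slides, while the second article and the reference strings are made up for illustration.

```r
library(dplyr)

# Toy stand-ins for the two extracted tables (values partly hypothetical)
articles <- tibble(
  file_name     = c("example", "older_example"),
  article_title = c("Photos, tweets, and trails", "A hypothetical older article"),
  pub_year      = c(2017L, 1995L)
)
references <- tibble(
  file_name  = c("example", "example", "older_example"),
  references = c("Reference A", "Reference B", "Reference C")
)

# Keep articles published after 2000, then attach their references
combined <- articles %>%
  filter(pub_year > 2000) %>%
  left_join(references, by = "file_name")

combined
```

Because `file_name` identifies the source file in both tables, each article row is repeated once per reference, which matches the shape of the combined table on the slide.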

  16. Import from archive: jst_import_zip
    • Read data directly from .zip-file
    • Choose which parts you want to import
    • Write data to .csv-files

  17. Import from archive: jst_import_zip
    jst_import_zip(
      zip_archive = "facebook.zip",
      import_spec = jst_define_import(
        article = c(jst_get_article,
                    jst_get_footnotes,
                    jst_get_references),
        ngram2 = jst_get_ngram
      ),
      out_file = "out_file"
    )
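
The idea behind an import specification (a mapping from record type to the functions that should be applied, with one output file per combination) can be sketched in base R. This is a toy under stated assumptions, not the package's implementation: the extractor functions, the in-memory records and the output file names are all hypothetical.

```r
# Hypothetical extractors standing in for functions like jst_get_article()
get_title  <- function(x) data.frame(id = x$id, title = x$title)
get_length <- function(x) data.frame(id = x$id, n_chars = nchar(x$title))

# A "specification": which extractors to run for which type of record
spec <- list(article = list(title = get_title, length = get_length))

# Toy records in place of the files inside a zip archive
records <- list(
  list(type = "article", id = "a1", title = "First title"),
  list(type = "article", id = "a2", title = "Second")
)

out_dir <- tempfile("import_")
dir.create(out_dir)

for (type in names(spec)) {
  matching <- Filter(function(r) r$type == type, records)
  for (fun_name in names(spec[[type]])) {
    # Apply one extractor to all matching records and bind the rows
    result <- do.call(rbind, lapply(matching, spec[[type]][[fun_name]]))
    write.csv(result,
              file.path(out_dir, paste0(type, "_", fun_name, ".csv")),
              row.names = FALSE)
  }
}

list.files(out_dir)
```

Keeping the specification separate from the import step means the same archive can be processed again later with a different set of extractors.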

  18. Data Limitations – Citation Analysis

  19. [Figure: two panels, "Proportion of articles with data on references" and "Proportion of articles with data on footnotes"; y-axis 0% to 100%, x-axis 1890 to 2010]
    Articles (n = 192,986) come from 215 sociological journals.
    The red line is a running median over 11 years.
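
A running median over an 11-year window can be computed with `stats::runmed()`. The sketch below uses synthetic yearly proportions purely for illustration; it does not reproduce the data behind the figure, and whether the original plot was produced this way is an assumption.

```r
# Synthetic yearly proportions, for illustration only (not data from the talk)
set.seed(1)
years <- 1950:2010
prop  <- pmin(pmax(seq(0.1, 0.9, length.out = length(years)) +
                   rnorm(length(years), sd = 0.1), 0), 1)

# Running median over a window of 11 years (k must be odd)
smoothed <- stats::runmed(prop, k = 11)

# Raw yearly values as points, running median as a red line
plot(years, prop, type = "p", ylim = c(0, 1),
     xlab = "Year", ylab = "Proportion")
lines(years, smoothed, col = "red")
```

A median is less sensitive than a mean to single years with unusually few or many articles, which is one plausible reason to prefer it for this kind of plot.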

  20. [Figure: Social Research; proportion with footnotes and proportion with references; y-axis 0% to 100%, x-axis 1940 to 2010]

  21. [Figure: American Sociological Review; proportion with footnotes and proportion with references; y-axis 0% to 100%, x-axis 1940 to 2000]

  22. [Figure: American Journal of Sociology; proportion with footnotes and proportion with references; y-axis 0% to 100%, x-axis 1890 to 2010]

  23. Collecting all the quirks
    Collected known quirks about data from DfR:
    https://ropensci.github.io/jstor/articles/known-quirks.html
    Contributions are welcome!
    • Open an issue: https://github.com/ropensci/jstor/issues
    • Make a pull request

  24. Installation and Documentation
    Thanks to rOpenSci, Elin Waring and Jason Becker for the package review!
    Install the package with: install.packages("jstor") [1]
    Documentation: http://bit.ly/jstor2018
    Slides: http://bit.ly/jstor2018_slides
    [1] Edited after the talk to reflect acceptance to CRAN.

  25. References
    Bjork, Samuel, Avner Offer, and Gabriel Söderberg. 2014. “Time
    Series Citation Data: The Nobel Prize in Economics.” Scientometrics
    98 (1):185–96. https://doi.org/10.1007/s11192-013-0989-5.