jstor: An R Package for Analysing Scientific Articles

Thomas Klebel
July 13, 2018

Slides for the talk "jstor: An R Package for Analysing Scientific Articles" at useR!2018 (https://user2018.r-project.org).

The package and instructions for installation can be found at https://ropensci.github.io/jstor/

Transcript

  1. jstor: An R Package for Analysing Scientific Articles

    Thomas Klebel, University of Graz, 13 July 2018
    https://github.com/tklebel
    https://twitter.com/klebel_t
    Package: http://bit.ly/jstor2018
    Slides: http://bit.ly/jstor2018_slides
  2. Main message

        if (interested_in_research_on_sciences) {
          get_data_from_JSTOR() %>%
            use_package_jstor() %>%
            deal_with_data_limitations()
        } else {
          hopefully(
            get_interested_in_research_on_sciences()
          )
        }
  3. JSTOR and Data for Research (DfR)

  4. Analysing citation patterns

    Figure 1: (Bjork, Offer, and Söderberg 2014, 191)
  5. Analysing ngrams

    Figure 2: http://bit.ly/jstor_ngrams

  6. Why do we need a new package?

    Many existing packages can be used for analysis:

    • topic modeling: topicmodels, lda, mallet, ...
    • ngram analysis: tidytext, tm, quanteda, ...
    • citation analysis: base regex, stringr, ...

    But: No existing solution to import DfR-metadata in a convenient way.
  7. How to use the package - a fictional example

    Analysing articles about “facebook”

    Possible research questions:

    • In which fields are researchers interested in facebook and why?
    • In which ways do they write about facebook, what are the key topics?
    • ...
  8. Requesting a dataset at: https://www.jstor.org/dfr/

  9. (slide without text)

  10. Get overview of zip-archive

        jst_preview_zip("facebook.zip")

        type      meta_type            n
        metadata  book_chapter      7068
        metadata  journal_article  11494
        metadata  research_report    430
        ngram1    ngram1           19493
        ngram2    ngram2           19493
  11. Structure of meta-data files

        <article>
          <front>
            ...
            <article-title>A title</article-title>
            <pub-year>2011</pub-year>
          </front>
          <back>
            ...
            <ref>
              H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
              Springer-Verlag New York, 2016.
            </ref>
          </back>
        </article>
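To make the structure above concrete, here is a minimal sketch of how such a file could be read by hand. It assumes the xml2 package is installed and uses a toy document that only contains the elements shown on the slide; it is not the package's own implementation, and the object names (`snippet`, `article`, `references`) are illustrative.

```r
# Sketch only: parse a DfR-like XML snippet with xml2 (assumed to be installed).
library(xml2)

# Toy document following the structure on the slide; the "..." parts are left out.
snippet <- '<article>
  <front>
    <article-title>A title</article-title>
    <pub-year>2011</pub-year>
  </front>
  <back>
    <ref>H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.</ref>
  </back>
</article>'

doc <- read_xml(snippet)

# One row of article-level metadata from <front>.
article <- data.frame(
  article_title = xml_text(xml_find_first(doc, ".//article-title")),
  pub_year = as.integer(xml_text(xml_find_first(doc, ".//pub-year"))),
  stringsAsFactors = FALSE
)

# One character string per <ref> element in <back>.
references <- trimws(xml_text(xml_find_all(doc, ".//back/ref")))
```

Real DfR files contain many more fields and irregularities, which is the work the package's `jst_get_*` functions are meant to take over.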
  12. Extract parts of meta-data files with jstor

    • Articles:
        • jst_get_article
        • jst_get_references
        • jst_get_footnotes
    • Books:
        • jst_get_book
        • jst_get_chapters
    • Both:
        • jst_get_authors
  13. jst_get_article("example.xml") %>% tidyr::gather(columns, rows)

        columns        rows
        file_name      example
        journal_title  Journal of Transport and Land Use
        article_title  Photos, tweets, and trails
        pub_year       2017
        ...            ...
  14. jst_get_article("example.xml") %>% tidyr::gather(columns, rows)

        columns        rows
        file_name      example
        journal_title  Journal of Transport and Land Use
        article_title  Photos, tweets, and trails
        pub_year       2017
        ...            ...

        jst_get_references("example.xml")

        file_name  references
        example    Backstrom, L., Sun, E., & Marlow, C. (2010). Fi...
        example    Cranshaw, J., Schwartz, R., Hong, J. I., & Sade...
        ...        ...
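Since the references come back as plain strings, a first step towards citation analysis might be a regular expression, as the "base regex" bullet earlier suggests. The sketch below uses only base R and assumes, as a simplification, that the cited year appears as four digits in parentheses; the strings are the truncated ones from the slide, and `year_pattern` and `cited_year` are illustrative names.

```r
# Reference strings as shown on the slide (truncated there, so truncated here).
references <- c(
  "Backstrom, L., Sun, E., & Marlow, C. (2010). Fi...",
  "Cranshaw, J., Schwartz, R., Hong, J. I., & Sade..."
)

# Assumption: the publication year appears as four digits in parentheses.
year_pattern <- "\\(([0-9]{4})\\)"
matches <- regmatches(references, regexec(year_pattern, references))

# regexec() yields the full match plus the capture group; keep the group,
# or NA when the pattern did not match at all.
cited_year <- vapply(
  matches,
  function(m) if (length(m) == 2) as.integer(m[2]) else NA_integer_,
  integer(1)
)

data.frame(references, cited_year, stringsAsFactors = FALSE)
```

The second string yields `NA` because its year is cut off, which is a small reminder that real reference data rarely follows a single citation style.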
  15. Joining separate tables

    XML-file(s) → jst_get_article → filter(pub_year > 2000) → left_join → combined data.frame
    XML-file(s) → jst_get_references → left_join

        file_name  article_title         pub_year  references
        example    Photos, tweets, a...  2017      References
        example    Photos, tweets, a...  2017      Backstrom, L., Su...
        example    Photos, tweets, a...  2017      Cranshaw, J., Sch...
        example    Photos, tweets, a...  2017      El Esawey, M., Li...
        ...        ...                   ...       ...
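The workflow on this slide can be sketched without any XML files. The example below is a stand-in: it builds two toy data frames by hand instead of calling `jst_get_article()` and `jst_get_references()`, and it uses base R's `merge(..., all.x = TRUE)` in place of dplyr's `left_join()` so that it runs without extra packages. The second article (`older_example`) is hypothetical and only there to show the filter at work.

```r
# Toy stand-ins for the output of jst_get_article() and jst_get_references();
# the first article uses values from the slide, the second one is hypothetical.
articles <- data.frame(
  file_name = c("example", "older_example"),
  article_title = c("Photos, tweets, and trails", "A hypothetical older article"),
  pub_year = c(2017L, 1995L),
  stringsAsFactors = FALSE
)
references <- data.frame(
  file_name = c("example", "example", "example"),
  references = c("References", "Backstrom, L., Su...", "Cranshaw, J., Sch..."),
  stringsAsFactors = FALSE
)

# Equivalent of filter(pub_year > 2000) in base R.
recent <- articles[articles$pub_year > 2000, ]

# Left join on file_name: all.x = TRUE keeps articles that have no references.
combined <- merge(recent, references, by = "file_name", all.x = TRUE)
combined
```

Because each article can have many references, the article-level columns repeat once per reference in the combined data frame, as in the table on the slide.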
  16. Import from archive: jst_import_zip

    • Read data directly from .zip-file
    • Choose which parts you want to import
    • Write data to .csv-files
  17. Import from archive: jst_import_zip

    • Read data directly from .zip-file
    • Choose which parts you want to import
    • Write data to .csv-files

        jst_import_zip(
          zip_archive = "facebook.zip",
          import_spec = jst_define_import(
            article = c(jst_get_article, jst_get_footnotes, jst_get_references),
            ngram2 = jst_get_ngram
          ),
          out_file = "out_file"
        )
  18. Data Limitations – Citation Analysis

  19. Figure (two panels, x-axis 1890 to 2010, y-axis 0% to 100%): Proportion of articles with data on references; Proportion of articles with data on footnotes.

    Articles (n = 192,986) come from 215 sociological journals. The red line is a running median over 11 years.
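A yearly proportion with an 11-year running median can be computed in a few lines of base R. The sketch below is not the analysis from the talk: the data are synthetic, the column names (`pub_year`, `has_references`) are assumptions, and `stats::runmed()` is one possible way to obtain such a smoothed line.

```r
# Synthetic data, not the 192,986 articles from the talk: one row per article,
# with a flag for whether any references were extracted for it.
set.seed(1)
articles <- data.frame(
  pub_year = sample(1950:2010, size = 2000, replace = TRUE)
)
# Make coverage improve over time so that the trend is visible.
articles$has_references <- runif(nrow(articles)) < (articles$pub_year - 1950) / 60

# Share of articles per year that have data on references.
by_year <- aggregate(has_references ~ pub_year, data = articles, FUN = mean)
names(by_year)[2] <- "prop_with_references"

# Running median over a window of 11 rows (the window size must be odd).
# This equals 11 years only if every year is present in the data.
by_year$running_median <- runmed(by_year$prop_with_references, k = 11)

head(by_year)
```

The same steps applied to a footnote flag would give the second panel.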
  20. Figure: Social Research, proportion with footnotes and proportion with references (x-axis 1940 to 2010, y-axis 0% to 100%).
  21. Figure: American Sociological Review, proportion with footnotes and proportion with references (x-axis 1940 to 2000, y-axis 0% to 100%).
  22. Figure: American Journal of Sociology, proportion with footnotes and proportion with references (x-axis 1890 to 2010, y-axis 0% to 100%).
  23. Collecting all the quirks

    Collected known quirks about data from DfR: https://ropensci.github.io/jstor/articles/known-quirks.html

    Contributions are welcome!

    • Open an issue: https://github.com/ropensci/jstor/issues
    • Make a pull request
  24. Installation and Documentation

    Thanks to rOpenSci, Elin Waring and Jason Becker for the package review!

    Install the package with: install.packages("jstor") [1]

    Documentation: http://bit.ly/jstor2018
    Slides: http://bit.ly/jstor2018_slides

    [1] Edited after the talk to reflect acceptance to CRAN.
  25. References

    Bjork, Samuel, Avner Offer, and Gabriel Söderberg. 2014. “Time Series Citation Data: The Nobel Prize in Economics.” Scientometrics 98 (1): 185–96. https://doi.org/10.1007/s11192-013-0989-5.