jstor: An R Package for Analysing Scientific Articles

Thomas Klebel
University of Graz
13 July 2018

https://github.com/tklebel
https://twitter.com/klebel_t

Package: http://bit.ly/jstor2018
Slides: http://bit.ly/jstor2018_slides

Why do we need a new package?

Many existing packages can be used for analysis:
• topic modeling:
    • topicmodels, lda, mallet, ...
• ngram analysis:
    • tidytext, tm, quanteda, ...
• citation analysis:
    • base regex, stringr, ...

But: there is no existing solution for importing DfR (Data for Research) metadata in a convenient way.

How to use the package: a fictional example

Analysing articles about “facebook”

Possible research questions:
• In which fields are researchers interested in facebook, and why?
• In which ways do they write about facebook, and what are the key topics?
• ...

    jst_get_article("example.xml") %>% tidyr::gather(columns, rows)

    columns         rows
    file_name       example
    journal_title   Journal of Transport and Land Use
    article_title   Photos, tweets, and trails
    pub_year        2017
    ...             ...

    jst_get_references("example.xml")

    file_name   references
    example     Backstrom, L., Sun, E., & Marlow, C. (2010). Fi...
    example     Cranshaw, J., Schwartz, R., Hong, J. I., & Sade...
    ...         ...
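
Once references are imported, a simple citation analysis can start from plain regular expressions. The following is a minimal sketch in base R and is not part of the jstor package: the data frame `refs` is a hand-built stand-in that only mimics the two columns shown above (with the reference strings truncated exactly as on the slide), and `extract_year()` is a hypothetical helper that pulls the first parenthesised four-digit year out of each reference.

```r
# Hand-built stand-in for the output of jst_get_references():
# one row per reference, keyed by file_name (assumed shape, truncated strings).
refs <- data.frame(
  file_name = c("example", "example"),
  references = c(
    "Backstrom, L., Sun, E., & Marlow, C. (2010). Fi...",
    "Cranshaw, J., Schwartz, R., Hong, J. I., & Sade..."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the first "(YYYY)" found in each string
# as an integer, or NA when no such pattern is present.
extract_year <- function(x) {
  m <- regexpr("\\([0-9]{4}\\)", x)
  out <- rep(NA_integer_, length(x))
  out[m != -1] <- as.integer(gsub("[()]", "", regmatches(x, m)))
  out
}

refs$cited_year <- extract_year(refs$references)
print(refs[, c("file_name", "cited_year")])
```

The second reference is cut off before its year, so its `cited_year` is `NA`. Real reference strings are far less regular than this pattern assumes, so a sketch like this is only a starting point.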

Import from archive: jst_import_zip

• Read data directly from the .zip file
• Choose which parts you want to import
• Write data to .csv files

    jst_import_zip(
      zip_archive = "facebook.zip",
      import_spec = jst_define_import(
        article = c(jst_get_article, jst_get_footnotes, jst_get_references),
        ngram2 = jst_get_ngram
      ),
      out_file = "out_file"
    )

[Figure: two panels, "Proportion of articles with data on references" and "Proportion of articles with data on footnotes". x-axis: year, 1890 to 2010; y-axis: proportion, 0% to 100%.]

Articles (n = 192,986) come from 215 sociological journals. The red line is a running median over 11 years.
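
A running median like the red line in the figure can be computed with `stats::runmed()`, which ships with base R. The sketch below is an illustration only: the yearly proportions are synthetic values generated for this example, not the data from the talk, and the variable names are assumptions.

```r
# Synthetic yearly proportions (NOT the data behind the figure):
# a rising trend with some wiggle, clamped to the interval [0, 1].
years <- 1890:2010
prop <- pmin(1, pmax(0, (years - 1890) / 120 + 0.1 * sin(years)))

# Running median over a window of 11 years; k must be odd.
smoothed <- stats::runmed(prop, k = 11)

# Draw the raw values and the smoothed line when run interactively.
if (interactive()) {
  plot(years, prop, pch = 16, cex = 0.5,
       xlab = "Year", ylab = "Proportion of articles")
  lines(years, smoothed, col = "red")
}
```

A median is less sensitive to single outlying years than a moving average, which may be why it suits coverage data that jumps from year to year.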

Collecting all the quirks

Known quirks in the data from DfR are collected at:
https://ropensci.github.io/jstor/articles/known-quirks.html

Contributions are welcome!
• Open an issue: https://github.com/ropensci/jstor/issues
• Make a pull request

Installation and Documentation

Thanks to rOpenSci, Elin Waring and Jason Becker for the package review!

Install the package with:

    install.packages("jstor") [1]

Documentation: http://bit.ly/jstor2018
Slides: http://bit.ly/jstor2018_slides

[1] Edited after the talk to reflect acceptance to CRAN.