jstor: An R Package for Analysing Scientific Articles

Thomas Klebel
July 13, 2018

Slides for the talk "jstor: An R Package for Analysing Scientific Articles" at useR!2018 (https://user2018.r-project.org).

The package and instructions for installation can be found at https://ropensci.github.io/jstor/


Transcript

  1. jstor: An R Package for Analysing Scientific Articles

    Thomas Klebel, University of Graz, 13 July 2018
    https://github.com/tklebel
    https://twitter.com/klebel_t
    Package: http://bit.ly/jstor2018
    Slides: http://bit.ly/jstor2018_slides
  2. Why do we need a new package?

    Many existing packages can be used for analysis:
    • topic modeling:
      • topicmodels, lda, mallet, ...
    • ngram analysis:
      • tidytext, tm, quanteda, ...
    • citation analysis:
      • base regex, stringr, ...

    But: No existing solution to import DfR-metadata in a convenient way.
  3. How to use the package - a fictional example

    Analysing articles about “facebook”

    Possible research questions:
    • In which fields are researchers interested in facebook and why?
    • In which ways do they write about facebook, what are the key topics?
    • ...
  5. Get overview of zip-archive

    jst_preview_zip("facebook.zip")

    type      meta_type            n
    metadata  book_chapter      7068
    metadata  journal_article  11494
    metadata  research_report    430
    ngram1    ngram1           19493
    ngram2    ngram2           19493
  6. Structure of meta-data files

    <article>
      <front>
        ...
        <article-title>A title</article-title>
        <pub-year>2011</pub-year>
      </front>
      <back>
        ...
        <ref>
          H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
          Springer-Verlag New York, 2016.
        </ref>
      </back>
    </article>
  7. Extract parts of meta-data files with jstor

    • Articles:
      • jst_get_article
      • jst_get_references
      • jst_get_footnotes
    • Books:
      • jst_get_book
      • jst_get_chapters
    • Both:
      • jst_get_authors
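To give a rough idea of what extracting such fields involves, here is a minimal sketch that reads a cut-down version of the meta-data structure shown on the previous slide. It uses the xml2 package directly and is not the jstor package's own implementation. The inline XML string and the variable names are assumptions made for illustration.

```r
library(xml2)

# A cut-down document following the structure shown on the slide
# (assumption: real DfR files contain many more elements)
doc <- read_xml(
  "<article>
     <front>
       <article-title>A title</article-title>
       <pub-year>2011</pub-year>
     </front>
     <back>
       <ref>H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.</ref>
     </back>
   </article>"
)

# Single-valued fields: take the first matching node
article_title <- xml_text(xml_find_first(doc, ".//article-title"))
pub_year <- as.integer(xml_text(xml_find_first(doc, ".//pub-year")))

# Multi-valued field: one element per <ref> node
references <- xml_text(xml_find_all(doc, ".//ref"))
```

The jst_get_* functions listed above wrap this kind of work and return one table per type of content.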
  8. jst_get_article("example.xml") %>% tidyr::gather(columns, rows)

    columns        rows
    file_name      example
    journal_title  Journal of Transport and Land Use
    article_title  Photos, tweets, and trails
    pub_year       2017
    ...            ...
  9. jst_get_references("example.xml")

    file_name  references
    example    Backstrom, L., Sun, E., & Marlow, C. (2010). Fi...
    example    Cranshaw, J., Schwartz, R., Hong, J. I., & Sade...
    ...        ...
  10. Joining separate tables

    Diagram: XML-file(s) are read with jst_get_article (followed by filter(pub_year > 2000)) and with jst_get_references; left_join merges both results into a combined data.frame.

    file_name  article_title         pub_year  references
    example    Photos, tweets, a...  2017      References
    example    Photos, tweets, a...  2017      Backstrom, L., Su...
    example    Photos, tweets, a...  2017      Cranshaw, J., Sch...
    example    Photos, tweets, a...  2017      El Esawey, M., Li...
    ...        ...                   ...       ...
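The join described on this slide could be written roughly as follows. This is a sketch, not code from the talk. It assumes the sample file `article_with_references.xml` that ships with the package (located via `jst_example()`) and assumes that both tables share a `file_name` column, as the slide output suggests.

```r
library(jstor)
library(dplyr)

# Path to a sample file bundled with the package
# (assumption: any DfR article file works the same way)
file <- jst_example("article_with_references.xml")

articles <- jst_get_article(file)       # one row per article
references <- jst_get_references(file)  # one row per reference

# Keep recent articles, then attach their references via the shared key
combined <- articles %>%
  filter(pub_year > 2000) %>%
  left_join(references, by = "file_name")
```

Because an article has many references, the article columns repeat once per reference in the combined table, as in the output on the slide.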
  11. Import from archive: jst_import_zip

    • Read data directly from .zip-file
    • Choose which parts you want to import
    • Write data to .csv-files
  12. Import from archive: jst_import_zip

    jst_import_zip(
      zip_archive = "facebook.zip",
      import_spec = jst_define_import(
        article = c(jst_get_article, jst_get_footnotes, jst_get_references),
        ngram2 = jst_get_ngram
      ),
      out_file = "out_file"
    )
  13. [Figure, two panels: proportion of articles with data on references, and proportion of articles with data on footnotes, by year (1890 to 2010, 0% to 100%)]

    Articles (n = 192,986) come from 215 sociological journals. The red line is a running median over 11 years.
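A running median over 11 years, as used for the red line, can be computed in base R with `stats::runmed()`. The sketch below uses synthetic proportions invented purely for illustration. It does not reproduce the data behind the figure, and the variable names are assumptions.

```r
# Synthetic yearly proportions, invented purely for illustration
set.seed(2018)
years <- 1890:2010
trend <- seq(0.1, 0.9, length.out = length(years))
proportion <- pmin(pmax(trend + rnorm(length(years), sd = 0.1), 0), 1)

# Running median over a window of 11 consecutive years (k must be odd)
smoothed <- stats::runmed(proportion, k = 11)

# Raw series in grey, smoothed series in red
plot(years, proportion, type = "l", col = "grey50", ylim = c(0, 1),
     xlab = "Year", ylab = "Proportion")
lines(years, smoothed, col = "red", lwd = 2)
```

A median is less sensitive to single outlier years than a running mean, which suits noisy yearly proportions.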
  14. [Figure: Social Research, proportion with footnotes and proportion with references, 1940 to 2010]

  15. [Figure: American Sociological Review, proportion with footnotes and proportion with references, 1940 to 2000]

  16. [Figure: American Journal of Sociology, proportion with footnotes and proportion with references, 1890 to 2010]

  17. Collecting all the quirks

    Collected known quirks about data from DfR: https://ropensci.github.io/jstor/articles/known-quirks.html

    Contributions are welcome!
    • Open an issue: https://github.com/ropensci/jstor/issues
    • Make a pull request
  18. Installation and Documentation

    Thanks to rOpenSci, Elin Waring and Jason Becker for the package review!

    Install the package with: install.packages("jstor") [1]

    Documentation: http://bit.ly/jstor2018
    Slides: http://bit.ly/jstor2018_slides

    [1] Edited after the talk to reflect acceptance to CRAN.