Jo Hardin
July 10, 2018

Challenges and Opportunities: statistics and data science undergraduate programs

Recent curricular working groups (e.g., the ASA Undergraduate Guidelines for Statistics Programs and the Park City Math Institute Data Science Guidelines) have provided useful guidance for undergraduate programs in statistics and data science. We review these guidelines, compare and contrast them, and explore successful implementation strategies and problematic hurdles for the teaching of statistics and data science at the post-secondary level. What key emphases from data science (e.g., the increased role of computation and communication) need to be further infused in statistics programs? How can these topics be integrated to ensure that students emerge with the capacity to “think with data”? What are some of the other issues that we need to address to ensure that statistics is interwoven into our data science programs?

Transcript

  1. Challenges and Opportunities: statistics and data science undergraduate programs

    Jo Hardin & Nicholas J Horton, Pomona / Amherst, July 10, 2018. Image credit: xkcd.com
  2. Committee on Applied and Theoretical Statistics (1992)

    At its August 1992 meeting in Boston, the Committee on Applied and Theoretical Statistics (CATS) noted widespread sentiment in the statistical community that upper-level undergraduate and graduate curricula for statistics majors and postdoctoral training for statisticians are currently structured in ways that do not provide sufficient exposure to modern statistical analysis, computational and graphical tools, communication skills, and the ever growing interdisciplinary uses of statistics. Approaches and materials once considered standard are being rethought. The growth that statistics has undergone is often not reflected in the education that future statisticians receive. There is a need to incorporate more meaningfully into the curriculum the computational and graphical tools that are today so important to many professional statisticians. (CATS, 1994, p. vii)
  3. Recent guidelines / working groups

    • CATS: Modern Interdisciplinary University Statistics Education: Proceedings of a Symposium (1994)
    • ASA: Curriculum Guidelines for Undergraduate Statistics Programs (2015)
    • GAISE College Report (2016)
    • Park City Mathematics Institute: Curriculum guidelines for undergraduate programs in data science (2017)
    • National Academies of Sciences, Engineering, and Medicine: Data Science for Undergraduates: Opportunities and Options (2018)
    • National Academies of Sciences, Engineering, and Medicine: Envisioning the Data Science Discipline: The Undergraduate Perspective: Interim Report (2018)
  4. Greater Data Science (Donoho 2017)

    1. Data gathering, preparation, and exploration
    2. Data representation and transformation
    3. Computing with data
    4. Data visualization and presentation
    5. Data modeling
    6. Science of data science
  5. update

    • FULL DATA ANALYSIS
    • COMPUTATION IN STATISTICAL THEORY
    • ETHICS
    • NEW LEARNING OUTCOMES
    • FACULTY DEVELOPMENT
    • REPRODUCIBILITY
    • VISUALIZATION SKILLS
  6. update: Perform a Full Data Analysis

    Capstone: Impact on the Student Experience
    • Appreciation of the entire process of analysis
    • Appreciation of the broader statistical research endeavor
    • Creative and iterative model building and validation
    • Involvement with the course content
    • Excitement, enthusiasm, and diligence
    • Impact of statistics on an organization's work
  7. update: Perform a Full Data Analysis

    Capstone: Impact on the Student Experience
    • Appreciation of the entire process of analysis
    • Appreciation of the broader statistical research endeavor
    • Creative and iterative model building and validation
    • Involvement with the course content
    • Excitement, enthusiasm, and diligence
    • Impact of statistics on an organization's work
    • Student Performance Improvements
    • Job and Graduate School Training Benefits
  8. update: Perform a Full Data Analysis

    Madison Hobbs, David Xu, Alex Gui, and Vedant Vohra (Rob Gould). Image credit: Mine Çetinkaya-Rundel
  9. update: Computation in Statistical Theory

    • StatLabs: mathematical statistics through applications, Nolan & Speed (2001)
    • Teaching Statistics: a bag of tricks, Gelman & Nolan (2017)
    • I Hear, I Forget. I Do, I Understand. Horton (TAS 2013)
  10. update: Computation in Statistical Theory

    Prediction Intervals for Random Forests (Benji Lu)
    • MSE
    • Prediction vs. Confidence intervals
    • Linear model vs. Random Forest
  11. update: Ethics

    • Examples abound: O’Neil, Weapons of Math Destruction (2016)
    • Assignments / assessment exist: Baumer et al. (2017)
    • Google: “machine bias”, “ethics in AI”, “algorithmic bias”
  12. What data are being used / missing?

    Abraham Wald
    Table credit: How Not to Be Wrong, Ellenberg, 2014
    Image credit: Cameron Moll, https://cameronmoll.carrd.co/
    Image credit: public, https://www.militaryfactory.com/
  13. What data are being used / missing?

    Abraham Wald
    Table credit: How Not to Be Wrong, Ellenberg, 2014
    Image credit: Cameron Moll, https://cameronmoll.carrd.co/
    Image credit: public, https://www.militaryfactory.com/
  14. Palantir’s algorithm for network connections

    • Human decisions at different steps
        o What data are used? (which demographic variables are considered?)
        o What data are missing? (arrest data, not crime data)
        o What does a “connection” mean? How is a connection “scored”?
  15. Network in Palantir (Source: Palantir Technologies)

    Using crime arrest data (LAPD)
    • Guy Cross has a high point value
    • Connected to siblings, lovers, co-workers, co-arrestees
    • Once linked, individuals are auto-tracked
    Image credit: Brayne, “Big Data Surveillance: The Case of Policing”, American Sociological Review, 2017.
    Image credit: www.palantir.com
  16. Using crime arrest data (LAPD)

    • LAPD uses the Palantir database to encourage officers to police specific areas
    • A detective can see the number of times a name has been queried by other people. “if you aren’t doing anything wrong, [the cops are not going to be looking you up]. Just because you haven’t been arrested doesn’t mean you haven’t been caught.”
    • In other words, in auditable big data systems, queries can serve as quantified proxies for suspiciousness.
    Information taken from: Brayne, “Big Data Surveillance: The Case of Policing”, American Sociological Review, 2017.
  17. Confounding

    Confounding occurs when the levels of one factor are associated with the levels of another factor so that their effects cannot be separated.
    [Figure: two scatterplots of state average SAT against average teacher salary; in the second, points are grouped as low fraction, medium fraction, and high fraction. Credit: Nicholas J Horton]
  18. Cross Validation, Bias-Variance tradeoff (CART)

    Stephanie Yee & Tony Chu
    http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
    http://www.r2d3.us/visual-intro-to-machine-learning-part-2/
  19. update

    • FULL DATA ANALYSIS
    • COMPUTATION IN STATISTICAL THEORY
    • ETHICS
    • NEW LEARNING OUTCOMES
    • FACULTY DEVELOPMENT
    • REPRODUCIBILITY
    • VISUALIZATION SKILLS
  20. Conclusion

    • data science has a tremendous amount to offer to the statistics community
    • it's never been easier to extract meaning from data (improved tools)
    • tools are available, cheap, and accessible
    • faculty effort results in students who are more engaged and more ready for post-undergraduate opportunities
    • you have the ability to update one thing in each of your classes
  21. References

    • ASA Undergraduate Statistics Guidelines Group (2014). ASA Curriculum Guidelines for Undergraduate Statistics Programs, www.amstat.org/asa/education/Curriculum-Guidelines-for-Undergraduate-Programs-in-Statistical-Science.aspx.
    • ASA/MAA Guidelines (2014). ASA/MAA develop guidelines for teaching introductory statistics course, https://magazine.amstat.org/blog/2014/04/01/asamaaguidelines.
    • Baumer, B.S., Çetinkaya-Rundel, M., Bray, A., Loi, L., and Horton, N.J. (2014). R Markdown: integrating a reproducible analysis tool into introductory statistics, Technology Innovations in Statistics Education, 8(1).
    • Baumer, B.S., Kaplan, D.T., and Horton, N.J. (2017). Modern Data Science with R, Boca Raton, FL: CRC Press, http://mdsr-book.github.io.
    • Bryan, J. (2018). Excuse me, do you have a moment to talk about version control? The American Statistician, 72(1):20-27.
    • Çetinkaya-Rundel, M. and Rundel, C. (2018). Infrastructure and tools for teaching computing throughout the statistical curriculum, The American Statistician, 72(1):58-65.
    • Chambers, J.M. (1993). Greater or lesser statistics, Statistics and Computing, 3(4):182-184.
    • Committee on Applied and Theoretical Statistics, National Research Council (1994). Modern Interdisciplinary University Statistics Education: Proceedings of a Symposium, http://www.nap.edu/catalog/2355.html.
    • Committee on Envisioning the Data Science Discipline: The Undergraduate Perspective (2017). Envisioning the Data Science Discipline: The Undergraduate Perspective: Interim Report, Washington, DC: National Academies Press, https://www.nap.edu/catalog/24886/envisioning-the-data-science-discipline-the-undergraduate-perspective-interim-report.
    • De Veaux, R.D., et al. (2017). Curriculum guidelines for undergraduate programs in data science, Annual Review of Statistics and Its Application, DOI:10.1146/annurev-statistics-060116-053930.
    • Donoho, D. (2017). 50 years of data science (plus discussion), JCGS, 26(4):745-756.
    • Gould, R. (2014). DataFest: celebrating data in the data deluge, ICOTS 9 proceedings.
    • Horton, N.J., Baumer, B.S., and Wickham, H. (2015). Setting the stage for data science: integration of data management skills in introductory and second courses in statistics, CHANCE, 28(2):40-50.
    • Horton, N.J., Brown, E.R., and Qian, L. (2004). Use of R as a toolbox for mathematical statistics exploration, The American Statistician, 58(4):343-357.
    • Horton, N.J., and Hardin, J.S. (2017). Ensuring that mathematics is relevant in a world of data science, Notices of the American Mathematical Society, 64(9):986-990, https://www.ams.org/publications/journals/notices/201709/rnoti-p986.pdf.
    • James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. NY, NY: Springer.
    • Martonosi, S.E. and Williams, T.D. (2016). A survey of statistical capstones, Journal of Statistics Education, 24(3).
    • McNamara, A., Horton, N.J., and Baumer, B.S. (2017). Greater data science at baccalaureate institutions, JCGS, 26(4):781-783.
    • Nolan, D. and Perrett, J. (2016). Teaching and learning data visualization: ideas and assignments, The American Statistician, 70(3):260-269.
    • O'Neil, C. (2016). Weapons of Math Destruction: How Big Data Increases Inequality & Threatens Democracy. NY, NY: Crown Publishing Group.
    • Wickham, H., and Grolemund, G. (2017). R for Data Science. Sebastopol, CA: O'Reilly Media, http://r4ds.had.co.nz.
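
Slide 10 contrasts prediction intervals with confidence intervals. As a rough illustration of that distinction for the linear-model case only (the random forest side is not covered here), the sketch below computes both half-widths for a simple linear regression. The function name `interval_half_widths`, the data values, and the use of a normal quantile in place of a t quantile are all assumptions made for illustration; none of them come from the talk.

```python
import math
from statistics import NormalDist, mean


def interval_half_widths(x, y, x0, level=0.95):
    """Return (fit, ci_half, pi_half) for simple linear regression at x0."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual standard error with n - 2 degrees of freedom.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    # Assumption: a normal quantile stands in for the t quantile,
    # which is only a reasonable approximation for larger samples.
    z = NormalDist().inv_cdf(0.5 + level / 2)

    # Uncertainty in the estimated mean response at x0.
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    ci_half = z * s * math.sqrt(leverage)
    # A new observation adds its own noise, hence the extra "1 +".
    pi_half = z * s * math.sqrt(1 + leverage)

    fit = intercept + slope * x0
    return fit, ci_half, pi_half


# Made-up data, roughly linear.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

fit, ci_half, pi_half = interval_half_widths(x, y, x0=4.5)
print(f"fit={fit:.2f}  CI half-width={ci_half:.2f}  PI half-width={pi_half:.2f}")
```

The prediction interval is always the wider of the two, because it accounts for the noise in a single new observation on top of the uncertainty in the fitted mean.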
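
To make the definition of confounding on slide 17 concrete, here is a minimal sketch using invented numbers. It does not use the data behind the slide's figure: the group labels are borrowed from the figure legend, while the salary and score values, the `groups` table, and the `slope` helper are hypothetical. Within every group the score rises with salary, yet the pooled relationship points the other way because group membership is associated with both variables.

```python
def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical groups: (starting salary, baseline score).
# Groups with higher salaries are given lower baselines on purpose,
# so the grouping variable is tied to both salary and score.
groups = {
    "low fraction": (25, 1050),
    "medium fraction": (32, 980),
    "high fraction": (39, 900),
}

data = {}
for name, (x_start, y_base) in groups.items():
    xs = [x_start + i for i in range(10)]
    ys = [y_base + 3 * i for i in range(10)]  # within-group slope is +3
    data[name] = (xs, ys)

# Slope inside each group, then the slope when groups are pooled.
within = {name: slope(xs, ys) for name, (xs, ys) in data.items()}
all_x = [x for xs, _ in data.values() for x in xs]
all_y = [y for _, ys in data.values() for y in ys]
overall = slope(all_x, all_y)

for name, s in within.items():
    print(f"{name}: within-group slope = {s:+.2f}")
print(f"pooled slope = {overall:+.2f}")
```

Ignoring the grouping variable reverses the sign of the estimated relationship, which is why the two effects cannot be separated from the pooled data alone.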