R Stories from the Trenches - Budapest R Meetup - Aug 2015

szilard

August 26, 2015
Transcript

  1. R Stories from the Trenches. Szilárd Pafka, PhD, Chief Scientist, Epoch. Budapest R Meetup, August 2015
  2.–7. (image slides, no transcript text)
  8. T-25 ~ 1990

  9. T-20 ~ 1996

  10. T-15 ~ 2001

  11. T-10 ~ 2006: cost was not an issue! data.frame, 800 packages

  12. ~ 2009 aka data mining, aka (today) data science

  13. T ~ 2014 Data Science

  14. 1999: CRISP-DM (data mining)

  15.–22. (image slides, no transcript text)
  23. 2006

  24.–34. (image slides, no transcript text)
  35. 5 yrs

  36.–38. (image slides, no transcript text)
  39.–40. (2009)

  41.–45. (image slides, no transcript text)
  46. (382 RSVPs)

  47.–48. (image slides, no transcript text)
  49.–53. (progressive build) high-level API, fast, environment, reproducibility

  54.–55. Data frames: an “in-memory table” with (fast) bulk (“vectorized”) operations; thousands of packages providing a high-level API; R, Python (pandas), Spark; the best way to work with structured data
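
To make "bulk (vectorized) operations" concrete, here is a minimal base-R sketch; the data, sizes, and column names are made up for illustration:

    # synthetic data frame: 1M rows, a grouping column and a numeric column
    n  <- 1e6
    df <- data.frame(group = sample(letters, n, replace = TRUE),
                     value = runif(n))

    # vectorized column arithmetic: one expression, no explicit loop
    df$value2 <- df$value * 2

    # bulk aggregation: mean of value within each group
    head(aggregate(value ~ group, data = df, FUN = mean))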
  56. (image slide, no transcript text)
  57. R data.table (on one server!): aggregation of 100M rows into 1M groups: 1.3 sec; join of 100M rows x 1M rows: 1.7 sec

  58. aggregation of 100M rows into 1M groups; join of 100M rows x 1M rows
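
A rough sketch of the kind of data.table benchmark behind these numbers; the sizes match the slide, but the column names and exact code here are assumptions, not the original benchmark:

    library(data.table)

    n <- 1e8   # 100M rows
    m <- 1e6   # 1M groups
    d <- data.table(x = sample(m, n, replace = TRUE), y = runif(n))

    # aggregation: 100M rows into 1M groups
    system.time(d[, .(mean_y = mean(y)), by = x])

    # join: the 100M-row table against a 1M-row lookup table on x
    dm <- data.table(x = 1:m, v = runif(m))
    setkey(d, x); setkey(dm, x)
    system.time(d[dm])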
  59. (image slide, no transcript text)
  60.–62. (progressive build) A colleague from work asked me to investigate Spark and R. So the most obvious thing to do was to investigate SparkR. I came across a piece of code that reads lines from a file and counts how many lines contain an "a" and how many lines contain a "b". I prepared a file with 5 columns and 1 million records. Spark: 26.45734 seconds for a million records? Nice job :-) R: 48.31641 seconds? Looks like Spark was almost twice as fast this time... and this is a pretty simple example... I'm sure that when complexity arises... the gap is even bigger... HOLY CRAP UPDATE! Markus gave me this code in the comments... [R: 0.1791632 seconds]. I just added a couple of things to make it compliant... but... damn... I wish I could code like that in R
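
The slides don't reproduce the code Markus posted, but the standard way to take this task from ~48 seconds to a fraction of a second in R is to replace per-line looping with vectorized matching; a guess at the shape of such a fix (the file name is a placeholder):

    # read the file once, then count matching lines with vectorized grepl()
    lines <- readLines("data.txt")                 # placeholder file name
    numAs <- sum(grepl("a", lines, fixed = TRUE))  # lines containing "a"
    numBs <- sum(grepl("b", lines, fixed = TRUE))  # lines containing "b"
    cat("lines with a:", numAs, " lines with b:", numBs, "\n")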
  63.–68. (image slides, no transcript text)