R Stories from the Trenches - Budapest R Meetup - Aug 2015

szilard
August 26, 2015

Transcript

  1. T-10 ~ 2006 - cost was not an issue! - data.frame - 800 packages
  2. Data frames: “in-memory table” with (fast) bulk (“vectorized”) operations; thousands of packages providing a high-level API; R, Python (pandas), Spark; the best way to work with structured data (a minimal sketch follows the transcript)
  4. R data.table (on one server!): aggregation of 100M rows into 1M groups in 1.3 sec; join of 100M rows x 1M rows in 1.7 sec (a data.table sketch follows the transcript)
  7. A colleague from work asked me to investigate Spark and R, so the most obvious thing to do was to look into SparkR. I came across a piece of code that reads lines from a file and counts how many lines contain an "a" and how many contain a "b". I prepared a file with 5 columns and 1 million records. Spark: 26.45734 seconds for a million records? Nice job :-) R: 48.31641 seconds? Looks like Spark was almost twice as fast this time... and this is a pretty simple example... I'm sure that when complexity arises... the gap is even bigger... HOLY CRAP UPDATE! Markus gave me this code in the comments... [R: 0.1791632 seconds]. I just added a couple of things to make it compliant... but... damn... I wish I could code like that in R (a vectorized version of the task is sketched after the transcript)
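
Slide 2's point about bulk ("vectorized") operations and high-level APIs is easiest to see in code. The following is a minimal illustrative sketch in base R; the column names and data are made up for illustration and are not from the talk.

    # A data frame is an in-memory table; operations apply to whole columns
    # at once ("vectorized") rather than via row-by-row loops.
    df <- data.frame(
      user  = c("a", "b", "a", "c"),
      spend = c(10, 25, 5, 40)
    )

    df$spend_eur <- df$spend * 0.9                 # bulk operation on a whole column
    aggregate(spend ~ user, data = df, FUN = sum)  # grouped aggregation via a high-level API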
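
The exact code and hardware behind the slide 4 timings are not shown on the slide; the sketch below only shows how such an aggregation and keyed join are typically written with data.table. The sizes are reduced here so it runs quickly, and actual timings depend on hardware.

    library(data.table)

    n <- 1e7                                          # slide uses 100M rows; reduced here
    d <- data.table(x = sample(1e6L, n, replace = TRUE), v = runif(n))
    m <- data.table(x = 1:1e6, w = runif(1e6))

    system.time(d[, .(sum_v = sum(v)), by = x])       # aggregate n rows into ~1M groups

    setkey(d, x)
    setkey(m, x)
    system.time(m[d])                                 # join the n-row table with the 1M-row table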
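
The fast R code "Markus" posted is not reproduced on slide 7; a vectorized base-R version of the same task (count the lines containing "a" and the lines containing "b") might look roughly like the following, which is the kind of approach that typically explains sub-second timings on a million lines. The file name is a placeholder.

    # Count lines containing "a" and lines containing "b" with vectorized base R;
    # "data.csv" stands in for the 1M-record file mentioned on the slide.
    lines <- readLines("data.csv")
    numAs <- sum(grepl("a", lines, fixed = TRUE))
    numBs <- sum(grepl("b", lines, fixed = TRUE))
    cat("Lines with a:", numAs, " Lines with b:", numBs, "\n")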